Signal-to-Noise Ratio
9 min read
Filter "maybe" setups from "must take" ones by measuring and scaling the clarity of your trading signals.
Your edge doesn't live in every signal — it lives in the clarity. Learn to measure it, focus on it, and scale it.
Signal-to-Noise Ratio (SNR) in trading is the ratio of the mean return of a setup to the standard deviation of its returns. It's mathematically the same family as the t-statistic and the unannualized Sharpe ratio — a per-trade SNR multiplied by sqrt(n) is exactly the t-stat of your edge. Information theory gives the underlying form in decibels.
SNR = mean(R) / stdev(R) = mu_signal / sigma_noise
SNR_dB = 10 log10(P_s / P_n) = 20 log10(mu_signal / sigma_noise), taking P_s = mu_signal^2 and P_n = sigma_noise^2
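As a minimal sketch of these definitions, assuming trade results are kept as a plain list of R-multiples (the function names and sample values are illustrative, not from the lesson):

```python
import math
import statistics

def snr(returns):
    """Per-trade SNR: mean(R) divided by the sample stdev of R."""
    return statistics.mean(returns) / statistics.stdev(returns)

def t_stat(returns):
    """t-statistic of the mean against zero: SNR * sqrt(n)."""
    return snr(returns) * math.sqrt(len(returns))

def snr_db(returns):
    """Decibel form, treating mean^2 as signal power and variance as noise power.

    Undefined when the mean is exactly zero (log10 of 0).
    """
    return 20 * math.log10(abs(snr(returns)))

# Hypothetical R-multiples for a single setup tag.
r_multiples = [1.5, -1.0, 2.0, -1.0, 0.5, -1.0, 3.0, -0.5]
print(snr(r_multiples), t_stat(r_multiples), snr_db(r_multiples))
```

The ratio form and the decibel form carry the same information on different scales; the ratio is the one used for triage in the rest of the lesson.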
This lesson is the capstone of Advanced Statistical Thinking. SNR is structurally the t-stat that powers Sharpe; its denominator is corrupted by outliers; and high-SNR tags decay through edge degradation. We tie those threads together here.
Three operational forms exist, and they are not interchangeable:
| Metric | Formula | When to use | Min sample | Pitfall |
|---|---|---|---|---|
| Per-setup SNR | mean(R) / stdev(R) within a tag | Comparing setup tags within a single strategy | n ≥ 30 | Non-robust to outliers in σ |
| Sharpe (annualized) | (R_p − R_f) / σ_p × √(periods/yr) | Whole-strategy risk-adjusted return | n ≥ 100 periods | Hides skew/kurtosis |
| Information Coefficient (IC) | corr(forecast, realized R) | Validating a 0/1 or graded score as a signal | n ≥ 50 forecasts | Ruined by retroactive scoring |
This lesson uses per-setup SNR for tag triage and IC for validating scoring rubrics. We'll flag explicitly which one is the right tool at each step.
The colloquial framing — "clean vs messy setups" — is intuition, not measurement. Two traders looking at the same chart will disagree on "clarity." Two traders running the same R-vector through mean(R) / stdev(R) will get the same number. If you want to manage edge, you need the number.
The earlier "looks vague vs visually obvious" framing collapses into trader feeling. Replace it with measurable features, recorded before the trade closes:
| Feature | High-SNR signature | Low-SNR signature |
|---|---|---|
| HTF alignment | Trend agrees on 4H + 1H | Conflicting timeframes |
| Liquidity context | Sweep + reclaim | Mid-range entry |
| Volume confirmation | ≥ 1.5× 20-bar average | Below average |
| Spread vs ATR | ≤ 1.0 × ATR(14) | > 2.0 × ATR(14) |
| Confluence count | 3+ independent factors | Single indicator |
| t-stat over n ≥ 30 | ≥ 2.5 | < 1.5 |
| Inter-rater agreement | Cohen's κ ≥ 0.6 | Cohen's κ < 0.4 |
Each row is observable in advance and reproducible by a second trader. If your scoring system can't be reproduced, it isn't a signal — it's your mood.
Even if your system has 3 great setups and 2 average ones, taking all 5 lowers your overall EV. You're padding win rate with noise while hiding underperformance from the tags that actually carry signal. Most pros don't trade more setups — they trade fewer setups better, sized larger.
The math: if tag A has mean +0.6R with stdev 1.5R (SNR = 0.40) and tag B has mean +0.05R with stdev 1.2R (SNR = 0.04), blending them at equal frequency gives a weighted mean of +0.325R and a stdev of about 1.39R (the mixture variance is the average of the two variances plus the spread between the tag means). That pulls your aggregate SNR from 0.40 down to about 0.23. You lost roughly 40% of the signal-per-risk by adding the mediocre tag.
Adding a low-SNR tag cuts your aggregate signal-per-risk by roughly 40%.
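The blending arithmetic can be checked with a short sketch. It assumes the two tags are traded at fixed frequencies and treats the combined record as a mixture of the two return distributions; the `blend` helper is a hypothetical name introduced here.

```python
import math

def blend(tags):
    """Combine tags given as (weight, mean, stdev); weights must sum to 1.

    Mixture variance = weighted within-tag variance
                     + weighted squared distance of each tag mean from the blended mean.
    """
    mean = sum(w * m for w, m, _ in tags)
    var = sum(w * (s ** 2 + (m - mean) ** 2) for w, m, s in tags)
    return mean, math.sqrt(var)

# Tag A and tag B from the text, taken at equal frequency.
mean, stdev = blend([(0.5, 0.6, 1.5), (0.5, 0.05, 1.2)])
print(f"mean={mean:.3f}R stdev={stdev:.3f}R SNR={mean / stdev:.3f}")
```

Changing the weights shows how quickly the aggregate SNR recovers as the weaker tag is taken less often.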
In your journal, tag each trade by setup name (e.g., "liquidity sweep + FVG", "pullback to VWAP"). For each tag, log the trade count n, mean(R), stdev(R), SNR, and t-stat.
Tag "sweep + FVG", last 40 trades: n = 40, mean(R) = +0.42R, stdev(R) = 1.6R, SNR = 0.42 / 1.6 = 0.26, t-stat = 0.26 × √40 ≈ 1.65. A t-stat of 1.65 is below the 2.0 threshold, so this is not yet a confirmed edge; it is plausibly noise. Compare against Tag A and Tag B:
| Tag | n | mean(R) | stdev(R) | SNR | t-stat | Verdict |
|---|---|---|---|---|---|---|
| sweep + FVG | 40 | +0.42 | 1.6 | 0.26 | 1.65 | below threshold |
| Tag A | 120 | +0.9 | 2.4 | 0.375 | 4.10 | core tag |
| Tag B | 200 | +0.1 | 0.8 | 0.125 | 1.77 | likely noise |
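A table like the one above can be produced directly from a journal export. The sketch below assumes each journal row reduces to a `(tag, r_multiple)` pair; that layout and the `tag_stats` name are assumptions for illustration.

```python
import math
import statistics
from collections import defaultdict

def tag_stats(trades):
    """Group (tag, r_multiple) pairs by tag and compute n, mean, stdev, SNR, t-stat."""
    by_tag = defaultdict(list)
    for tag, r in trades:
        by_tag[tag].append(r)

    out = {}
    for tag, rs in by_tag.items():
        if len(rs) < 2:
            continue  # sample stdev needs at least two trades
        mean = statistics.mean(rs)
        sd = statistics.stdev(rs)
        snr = mean / sd if sd > 0 else float("nan")
        out[tag] = {
            "n": len(rs),
            "mean": mean,
            "stdev": sd,
            "snr": snr,
            "t": snr * math.sqrt(len(rs)),
        }
    return out

# Hypothetical journal rows.
journal = [("sweep + FVG", 1.0), ("sweep + FVG", -1.0), ("sweep + FVG", 3.0),
           ("pullback to VWAP", 0.5)]
for tag, row in tag_stats(journal).items():
    print(tag, row)
```

Tags with fewer than two trades are skipped rather than reported, since a stdev cannot be estimated from a single result.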
Win rate alone is misleading. The lower-win-rate setup carries more signal per unit risk:
Lower win-rate, higher signal-per-risk.
| Setup | Win rate | mean(R) | stdev(R) | SNR |
|---|---|---|---|---|
| Scalp | 90% | +0.1 | 0.5 | 0.20 |
| Breakout | 30% | +0.6 | 1.5 | 0.40 |
The old "5 = perfect confluence, no hesitation; 1 = FOMO" scale collapses signal magnitude into trader emotion. "No hesitation" is an after-the-fact feeling, not a pre-trade observable. Replace it with a sum of binary features recorded before entry:
Sum gives a 0–5 score. Validate the rubric with IC = corr(score, realized R) over n ≥ 50 trades. If IC ≈ 0, the rubric carries no information and you're scoring noise.
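One way to sketch the score and its IC check is shown below. The five feature names are placeholders borrowed loosely from the feature table earlier in the lesson, not a confirmed rubric, and IC is computed here as a plain Pearson correlation.

```python
import math

# Hypothetical binary features, each recorded as True/False before entry.
FEATURES = ("htf_aligned", "sweep_reclaim", "volume_confirmed",
            "spread_ok", "confluence_3plus")

def rubric_score(flags):
    """Sum of binary features present on the trade ticket (0 to 5)."""
    return sum(1 for name in FEATURES if flags.get(name))

def information_coefficient(scores, realized_r):
    """Pearson correlation between pre-trade scores and realized R-multiples."""
    n = len(scores)
    if n != len(realized_r) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    ms = sum(scores) / n
    mr = sum(realized_r) / n
    cov = sum((s - ms) * (r - mr) for s, r in zip(scores, realized_r))
    vs = sum((s - ms) ** 2 for s in scores)
    vr = sum((r - mr) ** 2 for r in realized_r)
    if vs == 0 or vr == 0:
        return float("nan")  # a constant series has no defined correlation
    return cov / math.sqrt(vs * vr)

ticket = {"htf_aligned": True, "volume_confirmed": True}
print(rubric_score(ticket))
```

A rank correlation (Spearman) is a common alternative when R-multiples have fat tails; the Pearson form is used here only because it matches corr(score, realized R) as written.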
Pitfall: retroactive scoring. Scores must be recorded BEFORE the trade closes (ideally before entry). If you re-score after seeing the outcome, the score absorbs the result, your IC is inflated toward 1.0 by construction, and it becomes meaningless. This is a textbook look-ahead bias (see biases in backtesting). Hindsight-scored "edge in 4–5 buckets" is selection bias dressed up as analysis.
Have a second trader score 30 of your setups blind. Compute Cohen's κ on the agreement:
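A minimal sketch of the κ calculation, assuming each rater's labels are stored as equal-length lists of categories (for example 1 for "high quality" and 0 for "not"); the function name is illustrative.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Observed agreement: share of setups both raters labelled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if the raters labelled independently
    # at their own base rates.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical blind scores from two traders on the same four setups.
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

Raw percent agreement overstates reliability when most setups fall in one category, which is why the chance correction matters here.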
If two competent traders can't agree on what a "high-quality setup" looks like, you don't have a rubric — you have a habit.
Don't ask "does this setup look clean?" Ask: "Does HTF trend alignment, encoded as a 0/1 input to my score, lift the IC of the rubric on out-of-sample data?" If yes, keep it. If no, drop it. Clarity that doesn't survive falsification isn't signal — it's confirmation bias.
Use the t-stat (SNR · √n) and a minimum sample size, not the SNR alone:
| t-stat band (n ≥ 30) | Action | Risk allocation |
|---|---|---|
| t < 1.5 | Prune | 0 — remove from rotation |
| 1.5 ≤ t < 2.0 | Probation | Half size until n ≥ 60 |
| 2.0 ≤ t < 3.0 | Standard | Full size, monitor quarterly |
| t ≥ 3.0 | Core tag | Full size, prioritize |
Action: prune any tag with t-stat < 1.5 after n ≥ 30. Reallocate the freed risk budget to tags with t-stat ≥ 2.5.
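The band table translates into a small lookup. This sketch follows the thresholds above; returning "insufficient sample" when n is below 30 is an assumption added here, since the table only covers n ≥ 30.

```python
def triage(t_stat, n, min_n=30):
    """Map a tag's t-stat and sample size to the action bands in the table."""
    if n < min_n:
        return "insufficient sample"  # assumption: no verdict below the minimum n
    if t_stat < 1.5:
        return "prune"
    if t_stat < 2.0:
        return "probation"
    if t_stat < 3.0:
        return "standard"
    return "core"

# The three tags from the earlier comparison table.
for name, t, n in [("sweep + FVG", 1.65, 40), ("Tag A", 4.10, 120), ("Tag B", 1.77, 200)]:
    print(name, triage(t, n))
```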
Caveat: false precision. With n < 50 per quality bucket, the gap between "4–5" and "2–3" buckets is dominated by sampling noise. Confirm pruning decisions with bootstrap confidence intervals (resample your R-vector 1000× with replacement, take the 5th–95th percentile of SNR) before you cut a tag. A tag with point-estimate t = 1.8 might have a CI of [0.4, 3.2] — the data hasn't decided yet.
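The bootstrap described above can be sketched with the standard library alone. The function name, the fixed seed, and the simple index-based percentile are choices made for this illustration.

```python
import random
import statistics

def bootstrap_snr_ci(returns, n_boot=1000, lo=5, hi=95, seed=0):
    """Percentile bootstrap interval for per-trade SNR.

    Resamples the R-vector with replacement n_boot times and returns the
    lo-th and hi-th percentiles of the resampled SNR values.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(returns)
    samples = []
    for _ in range(n_boot):
        draw = rng.choices(returns, k=n)  # sampling with replacement
        sd = statistics.stdev(draw)
        if sd > 0:
            samples.append(statistics.mean(draw) / sd)
    if not samples:
        raise ValueError("returns have no variation; SNR is undefined")
    samples.sort()

    def pct(p):
        return samples[min(len(samples) - 1, int(p / 100 * len(samples)))]

    return pct(lo), pct(hi)

# Hypothetical 40-trade sample.
r_multiples = [1.5, -1.0, 2.0, -1.0, 0.5, -1.0, 3.0, -0.5] * 5
print(bootstrap_snr_ci(r_multiples))
```

Multiplying both ends of the interval by √n gives the corresponding range for the t-stat, which is the form used in the band table.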
Trade fewer, clearer, repeatable setups with higher statistical confidence.
Even if your system has 3 great setups and 2 average ones, taking all 5 lowers your overall EV. You're padding win rate with noise while hiding underperformance, and outliers can corrupt the noise estimate in either direction, making the dilution invisible until a regime change exposes it.
Most pros don't trade more setups. They trade fewer setups better.
The standard deviation in SNR's denominator is non-robust: a single 8σ event in your sample can either crush or rescue your SNR depending on its sign. Use a winsorized stdev (clip top/bottom 5%) or report SNR alongside median absolute deviation (MAD) as a robustness check.
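Both robustness checks fit in a few lines. This is a sketch under simple assumptions: clipping is done by rank (values beyond the 5th and 95th percentile positions are pulled in to those values), and MAD is reported unscaled.

```python
import statistics

def winsorized_stdev(returns, pct=0.05):
    """Sample stdev after clipping the top and bottom pct of values."""
    xs = sorted(returns)
    n = len(xs)
    k = int(n * pct)  # number of values clipped in each tail
    if k > 0:
        low, high = xs[k], xs[n - k - 1]
        xs = [min(max(x, low), high) for x in xs]
    return statistics.stdev(xs)

def mad(returns):
    """Median absolute deviation from the median (unscaled)."""
    med = statistics.median(returns)
    return statistics.median([abs(x - med) for x in returns])

# Hypothetical sample with one extreme winner.
r_multiples = [-1.0, -0.5, 0.0, 0.5, 1.0] * 4
r_multiples[-1] = 12.0
print(statistics.stdev(r_multiples), winsorized_stdev(r_multiples), mad(r_multiples))
```

With fewer than 20 trades, a 5% clip removes nothing (k is 0), so the winsorized figure equals the plain stdev; MAD remains usable at small n.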
A high-SNR tag in 2023 can collapse in 2024 as the regime changes and other traders crowd the same setup. The lesson next door — edge degradation — is the right home for this. Re-test SNR on rolling 60-trade windows; if it trends down, you're watching an edge die.
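A rolling re-test might look like the sketch below, assuming the tag's R-multiples are stored in chronological order; the `rolling_snr` name is hypothetical.

```python
import statistics

def rolling_snr(returns, window=60):
    """SNR over each consecutive window of trades, oldest window first."""
    out = []
    for end in range(window, len(returns) + 1):
        chunk = returns[end - window:end]
        sd = statistics.stdev(chunk)
        out.append(statistics.mean(chunk) / sd if sd > 0 else float("nan"))
    return out

# Hypothetical short series with a small window, just to show the shape.
print(rolling_snr([1.0, -1.0, 2.0, 0.0, 1.0], window=3))
```

Adjacent windows share all but one trade, so the series is heavily autocorrelated; a sustained drift matters more than any single dip.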
The tags you kept are the ones that worked in your historical sample. Some of that performance is real edge; some is sampling luck. Forward SNR will mean-revert. Plan for at least 30% of your kept-tag historical SNR to evaporate on out-of-sample data; if it doesn't, you got lucky on the prune itself.
Same family — Sharpe is an annualized portfolio-level SNR with a risk-free-rate offset in the numerator. Per-setup SNR is the unannualized within-tag version: mean(R) / stdev(R) where R is in R-multiples. Multiply per-setup SNR by √n and you get the t-statistic of the edge. The metrics solve the same problem at different scopes.
Per-trade SNR above 0.30 over n ≥ 30 is the floor; ideally you want the t-stat (SNR · √n) ≥ 2.0 before you treat the tag as a confirmed edge, and ≥ 3.0 before you call it a core tag. Between t-stat 1.5 and 2.0, the data hasn't decided yet: keep the tag on probation at half size and don't prune yet. Below 1.5 after n ≥ 30, prune.
30 trades for direction, 100+ for confidence in the point estimate. The standard error on stdev shrinks like 1/√(2n), so doubling sample size cuts uncertainty by ~30%. Below n = 30, your SNR is mostly noise. Confirm with bootstrap confidence intervals before any pruning decision.
No, they're decoupled. A 90%-win-rate scalp with mean +0.1R and stdev 0.5R has SNR = 0.20. A 30%-win-rate breakout with mean +0.6R and stdev 1.5R has SNR = 0.40. The lower-win-rate setup carries more signal per unit of risk taken.
No. That's retroactive scoring, a textbook look-ahead bias. If you label a setup "5/5" only after it works, your scoring system's information coefficient is inflated toward 1.0 by construction and means nothing. Scores must be locked in before entry, ideally written into the trade ticket itself.
SNR measures how strong the signal is per trade (mean over std of realized returns). IC measures how well a forecast or score predicts realized returns (correlation between score and R). Use SNR to triage setup tags; use IC to validate that your scoring rubric carries any information at all.
You've now covered Sharpe and Sortino, outliers, edge degradation, backtest biases, and signal quality. Together these are the toolkit for separating real edge from sampling artefacts.
Pruning low-SNR tags should improve aggregate Sharpe over n ≥ 100 forward trades — but a single quarter of underperformance from a pruned tag may be noise, not death of edge. Re-test annually, and revisit the edge degradation lesson when a previously-strong tag's rolling t-stat starts trending down.