The shadow trader

Domenico Giordano · 2025-12-11 · 12 min read

Backtests on cryptocurrency strategies, in my limited experience and in the broader literature, tend to be optimistic by an amount that is large enough to flip the sign of the result. There is by now a substantial body of work on this: transaction-cost adjustment, slippage models, latency simulation, all bolted on after the fact in an attempt to reconcile a backtest that says profitable with a live run that says broke.

I want to write here about a different approach, which I do not think is novel (the idea must have been reinvented many times) but which I have not seen written down cleanly enough that I can point a colleague at it. I will call it the shadow trader. It is an architectural pattern: a simulator that runs in parallel to a live detection pipeline, shares the same in-memory state, and writes one record per detected opportunity describing what the live executor would have done with it.

The headline result of running a shadow trader against three days of triangular arbitrage detections on Binance Spot, across roughly 28,000 candidate opportunities, was a median bias of -65 bps relative to the naive backtest on the same events. That is large: roughly three times the size of the fees on a three-leg cycle. It is more than enough to convert a strategy from "looks profitable" to "is structurally a loser." And it does not come from latency, or from slippage, or from market impact. It comes from the part of the backtest that I had not thought to write.

What the backtest looked like

The standard backtest in this domain has three steps. Compute the gross profit ratio of a three-leg cycle from the top-of-book prices on each leg. Subtract the cumulative fee. If the remainder is above some threshold, count it as a profitable opportunity.
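The three steps above can be sketched in a few lines. This is a minimal illustration, not the author's code: the `Leg` type, the function names, the 10 bps per-leg fee and the prices are all assumptions made up for the example, and the fee is subtracted additively as an approximation to compounding it leg by leg.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Leg:
    price: float   # top-of-book price for this leg (ask if buying, bid if selling)
    is_buy: bool   # True: spend quote asset to get base; False: sell base for quote


def naive_net_bps(legs: Sequence[Leg], fee_bps_per_leg: float) -> float:
    """Gross cycle return at top-of-book prices minus cumulative fees, in bps."""
    ratio = 1.0
    for leg in legs:
        ratio *= (1.0 / leg.price) if leg.is_buy else leg.price
    gross_bps = (ratio - 1.0) * 10_000
    # Additive fee approximation; close to the compounded fee for small rates.
    return gross_bps - fee_bps_per_leg * len(legs)


def is_naive_opportunity(legs: Sequence[Leg],
                         fee_bps_per_leg: float = 10.0,   # assumed fee level
                         threshold_bps: float = 0.0) -> bool:
    return naive_net_bps(legs, fee_bps_per_leg) > threshold_bps


# Hypothetical cycle USDT -> BTC -> ETH -> USDT with invented prices.
cycle = [Leg(50_000.0, True), Leg(0.05, True), Leg(2_520.0, False)]
net = naive_net_bps(cycle, fee_bps_per_leg=10.0)  # 80 bps gross, 30 bps fees
```

Note what is absent: no quantity appears anywhere in the calculation, which is why none of the exchange's quantity filters can affect the result.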

This is what I had. I had run it for weeks. It produced numbers that, taken at face value, said the strategy was tradeable about 20% of the time on certain paths. The numbers were precise to four decimal places. They were reported in basis points. They had statistical tests around them.

The numbers were also, all of them, wrong by an amount that exceeded the strategy's edge.

The reason was specific. A live execution path on Binance applies several order-routing filters at the moment a trade is sent. The most important are stepSize, which rounds the quantity down to the asset's tradable increment; minNotional, which rejects orders whose notional value is below an asset-specific floor; minQty, which rejects orders below an asset-specific minimum quantity; and the top-of-book depth, which caps any order's effective fill at the visible liquidity. Each of these silently changes the quantity that would actually be filled, in a direction that costs the trader money.
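A minimal sketch of how such a filter chain changes one leg's quantity. The names `SymbolFilters` and `executable_qty` are hypothetical, the order in which the checks run is an assumption, and the numbers are invented; `Decimal` stands in for fixed-point arithmetic so that the truncation is exact.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_DOWN
from typing import Optional


@dataclass(frozen=True)
class SymbolFilters:
    step_size: Decimal      # tradable quantity increment
    min_qty: Decimal        # minimum order quantity
    min_notional: Decimal   # minimum order value (quantity * price)


def executable_qty(desired_qty: Decimal, price: Decimal,
                   top_of_book_qty: Decimal,
                   f: SymbolFilters) -> Optional[Decimal]:
    """Return the quantity that would actually be sent, or None if rejected."""
    # Depth cap: cannot fill more than the visible top-of-book liquidity.
    qty = min(desired_qty, top_of_book_qty)
    # stepSize: round DOWN to a whole number of increments.
    steps = (qty / f.step_size).to_integral_value(rounding=ROUND_DOWN)
    qty = steps * f.step_size
    # minQty and minNotional: reject rather than adjust.
    if qty < f.min_qty:
        return None
    if qty * price < f.min_notional:
        return None
    return qty


filters = SymbolFilters(Decimal("0.001"), Decimal("0.001"), Decimal("5"))
truncated = executable_qty(Decimal("0.123456"), Decimal("100"), Decimal("1"), filters)
capped = executable_qty(Decimal("0.123456"), Decimal("100"), Decimal("0.1"), filters)
rejected = executable_qty(Decimal("0.0123456"), Decimal("100"), Decimal("1"), filters)
```

In the three calls, the first loses the fractional step (0.123456 becomes 0.123), the second is capped by depth at 0.1, and the third is rejected because 0.012 units at a price of 100 falls under the assumed notional floor of 5.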

A backtest that takes the top-of-book quantity and divides by the budget gives an upper bound on what the cycle could yield under conditions where the filters do not bite. That upper bound is what I had been computing. It is not what the live executor would produce.

The shadow trader as a structural choice

The naive fix, which I tried first and rejected, was to post-process the backtest output and apply the filters to it. Take the JSONL of detections, walk through each one, recompute the effective quantity after stepSize truncation, check minNotional, check minQty, recompute the profit ratio on the truncated amount.

This works for static analysis. It is not what I needed. What I needed was a runtime artifact (produced by the same code path that would execute the live order, against the same in-memory state at the same instant) that I could compare against the detector's claimed opportunity. The fact that the two pieces of code produce different answers, in the same moment, on the same input, is the result of the experiment. Post-processing makes that comparison impossible because the post-processor reads a snapshot of state that has already drifted from what the live executor would have seen.

So the shadow trader is structural. It is a function that takes the same (triangle, orderbook_snapshot) pair the live executor would take, and returns the basis points it would have achieved. It uses the same Fixed8 arithmetic, the same fee model, the same filter chain. It writes one JSONL line per detected opportunity with detected_bps (what the naive top-of-book calculation said) and shadow_bps (what the filter-replicating execution said) on the same row.
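The shape of that record can be sketched as follows. Only the field names `detected_bps` and `shadow_bps` come from the description above; `record_opportunity`, the `ts_ns` and `triangle` fields, and the stub callables are hypothetical. The point of the sketch is the structure: both functions receive the same snapshot object in the same call, and their answers land on the same row.

```python
import io
import json
import time
from typing import Any, Callable, Optional, TextIO

# Both callables take (triangle, snapshot) and return bps, or None if rejected.
BpsFn = Callable[[Any, Any], Optional[float]]


def record_opportunity(triangle: Any, snapshot: Any,
                       detect_bps: BpsFn, shadow_bps: BpsFn,
                       sink: TextIO) -> dict:
    """Evaluate detector and shadow on the SAME snapshot; write one JSONL row."""
    row = {
        "ts_ns": time.time_ns(),
        "triangle": triangle,
        "detected_bps": detect_bps(triangle, snapshot),
        "shadow_bps": shadow_bps(triangle, snapshot),  # None means rejected
    }
    sink.write(json.dumps(row) + "\n")
    return row


# Stub functions with invented return values, standing in for the real
# detector and the real filter-replicating execution path.
sink = io.StringIO()
row = record_opportunity(
    "USDT-BTC-ETH", {"book": "frozen at detection"},
    detect_bps=lambda tri, snap: 12.5,
    shadow_bps=lambda tri, snap: -30.0,
    sink=sink,
)
```

In a real pipeline `shadow_bps` would be the production execution function itself rather than a reimplementation of it, since any reimplementation reintroduces the gap the pattern is meant to close.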

The cost of this, at runtime, is a few microseconds per detected opportunity. It is, by a wide margin, the cheapest architectural change I have made on this codebase, and the highest-information one.

What the comparison showed

On a continuous 72-hour window I logged 27,579 detected opportunities, every cycle the naive detector flagged as profitable at the top of book. Of these, 85% passed all three exchange filters; the remaining 15% were rejected before the shadow could compute a profit. So the headline comparison is on the 23,571 events that survived the filters.

Two distinct measurements come out of this comparison. The median of shadow_bps minus detected_bps on the survivors (the within-opportunity loss from replicating the filters) was -65 bps; the mean was -101. That is the bias: how much the naive backtest overstates the typical opportunity. The shadow distribution itself, taken in absolute terms, has a median of -42 bps and a mean of -71, against a naive detected median of +16 on the same survivors. The gap between +16 and -42 is 58 bps, close to the bias but not identical to it, because the median of per-opportunity differences is not the difference of the two medians.
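The distinction between the two measurements is easy to see on a toy sample. The five pairs below are invented for illustration and are not drawn from the logged data; the only claim is arithmetic: means of paired differences equal differences of means, but medians do not behave that way.

```python
from statistics import mean, median

# Invented (detected_bps, shadow_bps) pairs for five hypothetical survivors.
detected = [10, 20, 30, 16, 50]
shadow = [-60, -30, -42, -80, 5]

# Per-opportunity bias: shadow minus detected on the same row.
diffs = [s - d for s, d in zip(shadow, detected)]

median_of_diffs = median(diffs)                        # the "bias" measurement
diff_of_medians = median(shadow) - median(detected)    # gap between the medians

mean_of_diffs = mean(diffs)
diff_of_means = mean(shadow) - mean(detected)
```

On this sample the median of the differences is -70 while the difference of the medians is -62, whereas the two mean-based quantities coincide at -66.6.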

Three tests on the shadow distribution converge on the same direction. A sign test, with 80% of survivors having strictly negative shadow_bps, rejects the null of a non-negative median at a one-tailed p-value below 10⁻³⁰⁰. A non-parametric bootstrap on the median, 1,000 resamples, returns a 95% confidence interval of -42.85 to -41.90 bps around a point estimate of -42.38. A Wilcoxon signed-rank statistic of -112.96 corroborates the direction, with the caveat that the heavy left tail of the shadow distribution violates the Wilcoxon symmetry assumption, so the sign test and the bootstrap, neither of which relies on symmetry, are the primary statistics.
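For readers who want to reproduce the two primary statistics on their own data, here is a minimal standard-library sketch. The function names, the seed and the ten-value sample are all assumptions for illustration; the bootstrap uses the plain percentile method, which may or may not be the variant used for the numbers above.

```python
import math
import random
from statistics import median


def sign_test_p_negative(values):
    """One-tailed exact sign test. H0: median >= 0, H1: median < 0."""
    nonzero = [v for v in values if v != 0]   # ties at zero are dropped
    n = len(nonzero)
    k = sum(1 for v in nonzero if v < 0)      # count of negative observations
    # P(X >= k) for X ~ Binomial(n, 1/2), computed with exact integers.
    tail = sum(math.comb(n, i) for i in range(k, n + 1))
    return tail / 2 ** n


def bootstrap_median_ci(values, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    n = len(values)
    meds = sorted(median(rng.choices(values, k=n)) for _ in range(resamples))
    lo = meds[int((alpha / 2) * resamples)]
    hi = meds[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi


# Invented sample: 8 of 10 values are negative.
sample = [-80, -60, -45, -42, -40, -30, -12, -5, 3, 20]
p_value = sign_test_p_negative(sample)     # 56 / 1024
ci_low, ci_high = bootstrap_median_ci(sample)
```

With tens of thousands of observations the exact tail probability becomes smaller than a float can represent and is reported as zero, so a result like "below 10⁻³⁰⁰" has to be computed on a log scale or with arbitrary-precision arithmetic.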

In plain words: the strategy, executed under conditions identical to those used in the backtest except that the filters were respected, was losing roughly 65 bps more per cycle than the backtest claimed.

Of the opportunities that survived the filters, only 20% had a positive shadow_bps at all. Of those positive opportunities (the ones that could in principle be tradeable) almost all of them clustered on a small number of paths.

The clusters and the rule

The visual was striking. Among the 4,677 positive-shadow opportunities, the top ten paths by count accounted for the majority of the volume, and the same paths kept recurring across days. Most were anchored on a small set of micro-cap tokens (LUNC, BROCCOLI714, MANTA, ME) bridged through either a low-liquidity stablecoin or a jurisdictionally-segmented fiat pair such as IDR or TRY. The LUNC-anchored paths are the same ones that, in the previous essay in this series, the Rust parse-error bug had concentrated all 244 phantom hits on. Convergence between a measurement failure mode and a structural backtest gap is not an accident; the assets where the order book is hardest to read in real time are also the assets where the filter chain is most punishing.

Both of those signal patterns are recognizable, in a quiet way, to anyone who has watched a Binance order book for a while. Micro-caps have shallow books and stale quotes; the bookTicker stream refreshes them less often than the underlying bid-ask spread actually moves. Segmented fiats (Indonesian rupiah, Turkish lira) are not freely accessible to most participants. A retail trader cannot route capital through TRY without satisfying jurisdictional checks. Whatever the order book shows on those legs, the trade is not available to be taken.

So I wrote down a rule, before looking at which paths it would catch, that said: an opportunity is classified as a ghost if either (a) the minimum 24-hour quote volume across the three legs is below $1 million, or (b) any leg is denominated in a segmented fiat. The thresholds were chosen from the Binance public volume statistics and from a brief read of the exchange's jurisdictional documentation. The rule has two parts, each operationalizing a different constraint from the limits-to-arbitrage literature: the first a liquidity constraint, the second an access constraint.
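The rule is small enough to state as code. In this sketch the `LegInfo` type and the function name are hypothetical, the set of segmented fiats contains only the two currencies named in the text (a fuller list is not given), and treating a leg as "denominated in" a fiat when the fiat appears on either side of the pair is my reading rather than a stated definition. The volumes in the examples are invented.

```python
from typing import Iterable, NamedTuple

# Only the two currencies named in the essay; any fuller list is an assumption.
SEGMENTED_FIATS = frozenset({"IDR", "TRY"})
MIN_QUOTE_VOLUME_USD = 1_000_000.0   # the $1 million floor from the rule


class LegInfo(NamedTuple):
    base: str
    quote: str
    quote_volume_24h_usd: float


def is_ghost(legs: Iterable[LegInfo]) -> bool:
    """Ghost if (a) the thinnest leg is under the volume floor, or
    (b) any leg touches a segmented fiat."""
    legs = list(legs)
    illiquid = min(leg.quote_volume_24h_usd for leg in legs) < MIN_QUOTE_VOLUME_USD
    segmented = any(leg.base in SEGMENTED_FIATS or leg.quote in SEGMENTED_FIATS
                    for leg in legs)
    return illiquid or segmented


# Invented volumes, for illustration only.
liquid_path = [LegInfo("BTC", "USDT", 9e8), LegInfo("ETH", "BTC", 5e7),
               LegInfo("ETH", "USDT", 6e8)]
thin_path = [LegInfo("LUNC", "USDT", 4e5), LegInfo("LUNC", "BTC", 2e6),
             LegInfo("BTC", "USDT", 9e8)]
fiat_path = [LegInfo("BTC", "TRY", 3e7), LegInfo("ETH", "BTC", 5e7),
             LegInfo("ETH", "TRY", 2e7)]
```

Because the rule depends only on public volume figures and the asset symbols, it can be evaluated without looking at any shadow_bps value, which is what makes it usable as a pre-registered classification.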

Applied to the top ten positive paths in the dataset, the rule classified all ten as ghosts. Applied to the full positive subset, it caught 95% of records globally and 100% of records on the retail-accessible stablecoin-anchored subset. Applied out of sample to an earlier two-day window (one collected under an unfiltered version of the shadow that I had been using before) the rule still caught 99.13% of the positive records, including paths that were not in the in-sample top ten. The rule generalized.

The interpretation is: the apparent positive-PnL opportunities on this strategy were not market signals. They were artifacts of conditions that the order book accidentally records but that no retail participant can act on. Removing them by rule, before looking at the data, leaves essentially nothing.

The fee-recovery temptation

I will be honest about the next step, because it is the natural one and because I want to flag that it does not work. I spent an afternoon convinced this was the unlock. Looking at a shadow distribution whose median sits forty-two basis points underwater, an engineer's first instinct is to recoup the fees. There is a stack of standard tools for this on Binance. BNB pay reduces the spot fee by 25%. The referral programme returns a further kickback. VIP1 status, accessible at $100,000 of monthly volume, brings the maker fee close to zero on certain pairs. A maker-hybrid execution strategy, where two of the three legs sit as resting limit orders and only one crosses the spread, can in principle access the maker rebate.

A full filter sweep across the combinations of these tools gives a cumulative shadow PnL of -$13,811 on the 23,571 filled events, at $100 of notional per cycle, over the 72-hour window. The best individual configuration is full maker-hybrid on VIP1, with effective fees of about 10 bps cumulative; even at that floor the strategy bleeds money. No setting of the parameters that I could find produces positive cumulative PnL.
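As a back-of-envelope check on the scale of that figure, the cumulative PnL can be converted back into an average per-event return. The helper name is hypothetical and the calculation assumes every event trades exactly the stated notional; it says nothing about which fee configuration produced the total.

```python
def implied_mean_bps(cumulative_pnl_usd: float, n_events: int,
                     notional_usd: float) -> float:
    """Average per-event return, in bps, implied by a cumulative PnL."""
    return cumulative_pnl_usd / (n_events * notional_usd) * 10_000


# Figures quoted in the paragraph above.
mean_bps = implied_mean_bps(-13_811.0, 23_571, 100.0)   # about -58.6 bps

# A uniform fee saving shifts every event by the same number of bps, so
# (under that simplifying assumption) this is the saving needed to break even.
breakeven_saving_bps = -mean_bps
```

Set against cumulative fees described as roughly 10 bps at the best configuration, a required saving of this size is not something a fee tier can supply.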

The reason is the same as before, rephrased: the strategy's average gross edge, after replicating the filters the live executor would respect, is negative. Fees are not the binding constraint. The binding constraint is that the filtered shadow distribution has a structurally negative mean, and a fee improvement of a few basis points cannot move the average above zero when the median is at -42 and the tail is heavy.

What the shadow trader is, in slower words

The deepest thing I took from this experiment was not the result about triangular arbitrage on Binance, which the literature already knew. It was the recognition that the gap between a backtest and a live run is not a calibration problem to be papered over with a slippage model. It is, more often, a category mistake about what the backtest was measuring.

A naive backtest in this kind of market microstructure problem is computing the answer to a question that is almost (but not quite) the one the trader cares about. The question the naive backtest answers is: "If I could trade arbitrary quantities at the displayed top-of-book price, what is my edge?" The question the trader cares about is: "If I send an order through the live exchange's order-routing pipeline, what would I get?" These are different questions, and the difference is large enough to invert the answer.

The architectural pattern that closes this gap is not a slippage model. It is a piece of code that is the live execution path, hooked up to a snapshot of the order book at the moment of detection and emitting its answer side-by-side with the detector's answer. The shadow trader does not predict anything. It computes what the production code would do, right now, with this opportunity. The difference between that and what the detector said is the result of the experiment.

I think this generalizes beyond crypto arbitrage. Any quantitative strategy that runs against a live system with non-trivial execution semantics (order books, rate limits, retry policies, idempotency keys) is at risk of a backtest that measures the wrong thing. The defense is structural. Run the production code path against the historical inputs, in parallel to whatever you would otherwise be doing, and write its answer down. Once you have the column of what the production code would have done, the rest of the analysis is straightforward.

What this changed for me

I no longer trust an unfiltered backtest as evidence about a live system. The cost of that trust, in my own work, was several weeks of confidence in a result that was off by 65 bps, on a strategy whose total edge (before any of this work) was bounded above by a number smaller than the bias.

The thing I built instead is now a small piece of every quantitative analysis I do on this codebase. It is not a paper or a framework. It is two hundred lines of code that share state with the production pipeline, write JSONL, and answer the only question I have learned to ask first: if the production code were the backtest, what would the answer be?

Whatever else this experiment was (and it was, on balance, a confirmation of what the previous literature had already concluded) it left me with an architectural habit that I now use everywhere. The shadow trader, I think, is the smallest piece of production code that has ever changed the way I read a number.
