Three bugs before the market

Domenico Giordano · 2025-11-07 · 11 min read

In early November 2025, I ran a triangular arbitrage bot for ten hours on an AWS instance in Tokyo. The bot evaluated 6.88 million possible cycles. It found zero profitable opportunities. The first conclusion was obvious: the market is efficient, retail can't compete, lesson learned, move on.

That conclusion was wrong. Not in its destination (retail triangular arbitrage on Binance Spot really isn't profitable, as a separate body of literature confirms), but in the route I took to reach it. Before I could honestly say "the market said no", I had to discover that three of the most fundamental measurements in my own code were broken.

This essay is about that discovery. More precisely, it's about why I almost shipped the wrong conclusion, and what the near-miss says about how engineers reason when the data confirms what they already expected.

What I built, briefly

The setup was unremarkable. A Python bot that opens a WebSocket connection to Binance, maintains a live top-of-book snapshot for around 1,400 trading pairs, and looks for three-leg cycles where the round-trip yields more than the cumulative fees. A small process pool to parallelize the evaluation. A few hundred lines of glue code. The textbook strategy in its textbook form.
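As a rough illustration of what "the round-trip yields more than the cumulative fees" means, here is a minimal sketch of one cycle evaluation. The cycle (USDT to BTC to ETH to USDT), the function name, the prices, and the flat 0.075% fee per leg are all assumptions chosen for the example, not the bot's actual code or market data.

```python
from decimal import Decimal

def cycle_profit_ratio(start_usdt, btc_usdt_ask, eth_btc_ask, eth_usdt_bid, fee_per_leg):
    """Net profit ratio of a hypothetical USDT -> BTC -> ETH -> USDT cycle at top of book."""
    keep = Decimal(1) - fee_per_leg              # fraction left after one leg's fee
    btc = start_usdt / btc_usdt_ask * keep       # leg 1: buy BTC with USDT, cross the ask
    eth = btc / eth_btc_ask * keep               # leg 2: buy ETH with BTC, cross the ask
    end_usdt = eth * eth_usdt_bid * keep         # leg 3: sell ETH for USDT, hit the bid
    return (end_usdt - start_usdt) / start_usdt

# Invented top-of-book prices, for illustration only.
ratio = cycle_profit_ratio(
    start_usdt=Decimal("100"),
    btc_usdt_ask=Decimal("100000"),
    eth_btc_ask=Decimal("0.03"),
    eth_usdt_bid=Decimal("3000.45"),
    fee_per_leg=Decimal("0.00075"),              # assumed flat taker fee per leg
)
is_opportunity = ratio > 0                       # profitable only if the cycle beats all three fees
```

With these made-up prices the cycle has a small positive edge before fees and a negative ratio after them, so `is_opportunity` is false.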

I ran it from my apartment in Naples first: 273ms round-trip to Binance over residential fiber. After 47 minutes and roughly seven million evaluations, the bot reported zero opportunities. The honest interpretation was "you can't see what the market makers see; try closer." So I rented a t3.micro in the AWS Tokyo region for $0 under the free tier, and the round-trip dropped from 273ms to 21ms. Thirteen times faster on the network alone. If latency was the bottleneck, this should have moved the needle.

It didn't. Ten hours later the log told the same story: zero opportunities, millions of evaluations, no triangle anywhere in the universe ever crossed the profitability threshold.

I was about to call it.

The first conclusion, the wrong one

I wrote in my notes that evening: "Tokyo speedup doesn't unlock anything. HFT-saturated market. Retail completely out of the game."

It was a satisfying conclusion. It matched the priors I had walked in with: that crypto market microstructure has been picked clean by colocated market makers, that retail latency disadvantage is structural, that "free lunches" decay in milliseconds. It matched the broader literature I half-remembered. It felt like the kind of finding a careful, skeptical engineer arrives at: ambitious experiment, negative result, intellectual honesty.

So I started drafting the post-mortem. Tier 1 of the document was titled "what would a Tier-2 upgrade need to look like." I was already past the conclusion, planning the sequel.

I could have stopped there. I almost did. Three different times in the same evening I told myself the dataset was clean enough, the conclusion was honest enough, and that the next sensible step was to write it up and move on to a different experiment. The only thing that kept me from publishing was a small detail in the log I couldn't quite let go of.

The bot had seen 10.5 million triangle evaluations across the run. The profit ratio distribution was reported in buckets: below -1%, -1% to -0.5%, and so on. The "zero-to-threshold" bucket, the one immediately below profitability, was empty. Not "small". Not "occasionally non-zero". Empty across the entire dataset.

I had described that, in my notes, as "consistent with an HFT-saturated market." I now think that was the most generous interpretation I could give myself for not having looked harder. A market with measurable jitter, on ten million samples, should produce some triangles in the near-positive band even if none cross the threshold. Zero positives on ten million samples is not a market truth. It is a measurement that should make an engineer stop everything and ask why.
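That "stop and ask why" can be turned into an automatic check. The sketch below is only a heuristic under stated assumptions: the bucket edges, the labels, the threshold, and the minimum sample size are all hypothetical, and an empty band is a reason to audit the measurement, not proof of a bug.

```python
from bisect import bisect_right
from collections import Counter

THRESHOLD = 0.001  # hypothetical profitability threshold (0.1%)
EDGES = [-0.01, -0.005, 0.0, THRESHOLD]
LABELS = ["< -1%", "-1% to -0.5%", "-0.5% to 0", "0 to threshold", ">= threshold"]

def bucket_counts(ratios):
    """Count profit ratios per bucket; each bucket includes its lower edge."""
    counts = Counter()
    for r in ratios:
        counts[LABELS[bisect_right(EDGES, r)]] += 1
    return counts

def near_threshold_is_suspicious(counts, min_samples=1_000_000):
    """True when a large sample has nothing at all in the band just below the threshold."""
    total = sum(counts.values())
    return total >= min_samples and counts["0 to threshold"] == 0
```

A check like this could run at the end of every benchmark and refuse to print a headline number while the flag is set.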

The question that changed it

I have a habit of running my conclusions past a peer review model (usually Gemini Flash on its highest reasoning setting) before I write anything down as final. This time I had skipped it, on the reasoning that the conclusion was "obvious" and the code had been audited. The audit, in retrospect, was a checklist: are the bid and ask sides correct? Yes. Is the fee count three? Yes. Is the profit ratio dimensionless? Yes. End of audit.

What broke the routine was a single sentence from the person I was talking to about the experiment: "Are you sure, one hundred percent sure, that there's nothing wrong with the algorithm?"

I wasn't sure. I had a satisfying conclusion and a thin audit. And the empty bucket was still there in the log, refusing to be a coincidence. So I sat with it for an hour and finally did what I should have done at the start: I ran the peer review.

I gave the model the relevant files, the data anomaly, and the question I had quietly been suppressing: why is the zero-to-threshold bucket empty?

It took one prompt to find three bugs.

The three bugs

The first was the most consequential. The bot calculated a target trade size in the starting asset (say, 0.000366 BTC) and then the first leg of the cycle truncated it down to the exchange's step size, leaving 0.00033 BTC. The remaining $2.20 of nominal value was never spent. The profit ratio, however, was computed as (final - starting_budget) / starting_budget, using the full intended budget as the denominator, not the truncated amount that the trade had actually consumed.

The effect was a systematic negative bias of around 2.5% on every BTC-anchored triangle in the universe. Not 2.5% in some cases. 2.5% in every case where the step-size rounding bit, which was essentially all of them at retail budgets. The "missing" near-positive band wasn't missing. It had been pushed below zero by an arithmetic mistake.
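The mechanics of the denominator bug fit in a few lines. This is a sketch with hypothetical numbers (the budget, the step size, and a cycle that returns exactly what went in), not the bot's real values; it only shows how dividing by the intended budget instead of the truncated amount creates a negative bias out of nothing.

```python
from decimal import Decimal, ROUND_DOWN

def truncate_to_step(qty: Decimal, step: Decimal) -> Decimal:
    """Round a quantity down to a whole multiple of the exchange step size."""
    return (qty / step).to_integral_value(rounding=ROUND_DOWN) * step

intended = Decimal("0.0001025")            # hypothetical target size in BTC
step = Decimal("0.00001")                  # hypothetical step size
spent = truncate_to_step(intended, step)   # what the first leg can actually trade

# Pretend the cycle is exactly break-even: it returns what was actually spent.
final = spent * Decimal("1.0000")

biased_ratio = (final - intended) / intended   # buggy: denominator is the full budget
honest_ratio = (final - spent) / spent         # fixed: denominator is the amount consumed
```

Here the break-even cycle reports a loss of a little over 2.4% under the buggy formula and exactly zero under the fixed one. An alternative fix, equally defensible, would keep the intended budget as the denominator and add the unspent remainder back to the final balance.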

The second was simpler. The function that converted a USDT budget into an equivalent amount in the starting asset of the triangle used the wrong side of the order book. To buy BTC with USDT, you cross the ask. The code was reading the bid. The bid is always lower than the ask, so the function was reporting that I could buy more BTC than I actually could, inflating the denominator of the profit ratio and pushing the result further into the red.
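A minimal sketch of the side-of-book mistake, with invented quotes and hypothetical function names: converting a USDT budget into the base asset is a buy, so the divisor has to be the ask.

```python
from decimal import Decimal

def usdt_to_base(budget_usdt: Decimal, bid: Decimal, ask: Decimal) -> Decimal:
    """Base asset obtainable for a USDT budget: a buy crosses the ask."""
    return budget_usdt / ask

def usdt_to_base_buggy(budget_usdt: Decimal, bid: Decimal, ask: Decimal) -> Decimal:
    """The mistake: dividing by the bid overstates how much can be bought."""
    return budget_usdt / bid

bid, ask = Decimal("99990"), Decimal("100010")   # hypothetical BTC/USDT top of book
budget = Decimal("100")

correct_btc = usdt_to_base(budget, bid, ask)
inflated_btc = usdt_to_base_buggy(budget, bid, ask)
```

Because the bid sits below the ask, `inflated_btc` is always larger than `correct_btc`, and the gap grows with the spread.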

The third was the one I should have caught from the logs. The bot used Python's Decimal type for the profit calculations and Python's float type for the WebSocket book updates, on the assumption that the multiplication would be handled cleanly. It is not. Decimal × float raises a TypeError in Python, every time. The reason I had never seen it crash was that the result aggregation code was wrapped in try / except Exception: and the exception was being silently swallowed. Every time the bug fired, a candidate profitable triangle was discarded as "an error" and counted in a different bucket. The 51 spurious entries clustered in a 4-minute window of the log (the ones I had filed as "noise") were the bug firing.
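The failure is easy to reproduce in isolation. The sketch below uses hypothetical names and a two-element "feed" to show how a broad handler turns a TypeError into a quiet count in the wrong bucket, plus one possible fix: converting floats to Decimal at the boundary via their string form.

```python
from decimal import Decimal

def leg_value(amount: Decimal, price):
    """Multiply an amount by a book price; a float price raises TypeError."""
    return amount * price

counts = {"evaluated": 0, "errors": 0}
for price in [Decimal("1.0002"), 1.0002]:    # second price arrives as a float
    try:
        leg_value(Decimal("1"), price)
        counts["evaluated"] += 1
    except Exception:                        # broad handler: the TypeError vanishes
        counts["errors"] += 1                # sample silently lands in another bucket

def to_decimal(x) -> Decimal:
    """Normalize a feed value to Decimal; str() avoids binary float artifacts."""
    return x if isinstance(x, Decimal) else Decimal(str(x))

fixed = leg_value(Decimal("1"), to_decimal(1.0002))
```

Normalizing once, where the WebSocket message is parsed, keeps every later calculation in a single numeric type.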

The peer review's verdict, in its own words, was that the bot "is currently blind. It is accurately reporting that it cannot make money, but it doesn't realize it's because it's throwing away its own money, via truncation, before it even starts the race."

What changed after the fix

I had to read the verdict twice before it sank in. Until that moment I had been carrying around a quiet pride in the cleanliness of the dataset: ten million samples, no noise, a beautifully consistent negative result. The cleanness was the bug. The bot was so broken it had produced a smooth, confident, completely meaningless distribution.

I patched the three bugs and re-ran for five minutes with full distribution telemetry. The maximum observed profit ratio across every cycle settled at -0.21%, against a cumulative fee of 0.225% on a three-leg triangle at my tier. The bot was now measuring the market accurately, and the answer it was giving was: even the best triangle in the universe carries a gross edge far smaller than the fees, and nothing crosses break-even.

That is a real measurement. The earlier "zero opportunities" was not a real measurement: it was a coincidence between a broken bot and a market that happened to be, separately, hard to exploit. Two different things had arrived at the same headline, and I had been about to credit the wrong one.

I will not claim the corrected result is a triumph. It says, with much more rigour, the same thing the broken result said: retail triangular arbitrage on Binance is not profitable. But the rigorous version answers a different question. The broken version answered "is this strategy profitable?" with "no, because I see no opportunities." The corrected version answers it with "no, because the gross edge on the most attractive triangle in the entire universe, before fees, is about 1.5 bps (-0.21% net plus 0.225% in fees), and the fees are 22.5 bps."
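The arithmetic behind those figures, as a short sketch. It treats fees as simply additive, which is an approximation (per-leg fees compound), but at this scale the difference is well below a tenth of a basis point.

```python
from decimal import Decimal

BPS = Decimal("0.0001")             # one basis point as a ratio

fee_total = Decimal("0.00225")      # 0.225% cumulative fee on three legs
best_net = Decimal("-0.0021")       # -0.21% best observed net profit ratio

gross_edge_bps = (best_net + fee_total) / BPS    # edge before fees
fee_bps = fee_total / BPS                        # what the cycle must clear
shortfall_bps = fee_bps - gross_edge_bps         # distance from break-even
```

On these numbers the gross edge is 1.5 bps against 22.5 bps of fees, so the best triangle would need roughly fifteen times its observed edge to break even.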

The second answer is useful. It tells you, quantitatively, how far the market is from being something a retail participant could exploit, and what would have to change for the strategy to clear break-even. The first answer was a slogan dressed up as data.

What I think went wrong, in slower words

There is no single villain here. There is a sequence of small, individually-defensible decisions that, taken together, produced an audit that endorsed a wrong conclusion.

I had a prior I was comfortable with. "HFT has saturated this market" is true in many adjacent contexts. It is a respectable-sounding answer. When the data confirmed it, I didn't probe.

I ran a checklist audit instead of a measurement audit. "Bid and ask in the right places, fee count is three" is a perfectly fine smoke test. It is not the audit that catches "the denominator of the metric I'm reporting is the wrong number". The latter requires you to trace one specific number (a winning triangle, an actual order) end to end, and see what the bot would do with it. I never traced anything end to end. I traced patterns.

I treated a statistical impossibility as a stylized fact. Zero positives, anywhere in the distribution, on millions of samples, is not "consistent with HFT saturation." It's a measurement so clean that something has almost certainly removed the noise. The bug had removed the noise. I had described the absence of noise as evidence for the market hypothesis I already preferred.

I had a try / except Exception block in production code. This one is technical, but I think it generalizes. Broad exception handlers in code paths that produce metrics are equivalent to silently changing the unit. You think you are measuring a thing; in fact you are measuring "the subset of the thing that didn't trigger one of a hundred possible failures." The errors aren't loud. They're systematic.

I treated peer review as something I do after writing the conclusion, not before. That is the deepest mistake. Peer review is cheap. The model's time is free; my time, when I'm about to publish a wrong number, is not. The asymmetry is enormous, and I had inverted it.

What I changed in how I work

Three things, applied since:

I do not let a zero-or-near-zero observation in a high-sample experiment pass as a market truth. The default interpretation is "your measurement is broken." If the data fails to find what you were looking for, the failure mode you should investigate first is the measurement, not the territory.

I do not write a try / except Exception: over any code path that produces a logged metric. If the path can fail, it fails loudly, the failure is in the log, and the metric is missing, not silently zero, not silently capped, not silently rebucketed.

I ask the peer-review model the question I would ask a colleague, and I ask it before I write the conclusion: not "is this code correct" but "what is the most obvious thing this analysis could have wrong, even if the code is correct?" The model is much better at the second question than I would have guessed. The cost of asking is one prompt. The cost of not asking, in this case, was a ten-hour benchmark on the wrong dataset and a draft post-mortem that would have been published with the wrong headline.
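The second of these rules might look something like the sketch below. The function name, the logger name, and the choice of `None` as the "missing" marker are assumptions for illustration: expected data problems are caught narrowly, logged, and recorded as missing, while anything unexpected (a TypeError, for instance) propagates and stops the run.

```python
import logging
from decimal import Decimal, InvalidOperation

log = logging.getLogger("metrics")   # hypothetical logger name

def record_ratio(samples: list, raw_final, raw_start) -> None:
    """Append a profit ratio, or None when the inputs are known-bad; never a fake number."""
    try:
        final, start = Decimal(raw_final), Decimal(raw_start)
        ratio = (final - start) / start
    except (InvalidOperation, ZeroDivisionError) as exc:
        # Expected data problems: unparsable text or a zero starting amount.
        log.error("ratio not computed for %r / %r: %r", raw_final, raw_start, exc)
        samples.append(None)         # missing, not zero and not rebucketed
        return
    samples.append(ratio)

samples = []
record_ratio(samples, "101", "100")  # valid inputs: a ratio is recorded
record_ratio(samples, "bad", "100")  # unparsable input: logged and recorded as missing
```

Downstream code then has to decide explicitly what to do with the `None` entries, which is the point: the gap stays visible in the data instead of hiding inside a bucket.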

The market did say no. It said no for the reasons the literature already documents, and it said no with much less drama than my broken bot suggested. The interesting finding was never that the market is efficient. It was that I had spent ten hours measuring my own bugs and almost called it a result.
