Essay

Notes from a triangular arbitrage bot · Part 2 of 3

The most useful bug I shipped

Domenico Giordano · 2025-11-19 · 13 min read

I rewrote a Python crypto bot in Rust in a single late-night session, with AI peer review on every non-trivial fix. The original budget for the port (a full rewrite of a 3,000-line bot with a custom decimal type, a lock-free order book, and a deterministic event loop) was 100 to 140 hours of focused work. I finished it in roughly ten.

The port worked. The bot booted, connected to the exchange, indexed 5,000 triangles in 70ms. The hot path that took Python 150ms finished in 12ms. Throughput at steady state was 138× higher than the Python version on the same machine.

It was also, for the first sixteen minutes after I deployed it to production, silently generating 244 arbitrage opportunities that did not exist.

This essay is about that bug. It is also about something I think I had been quietly misunderstanding about what AI peer review is and is not good at.

The port, in fewer words than it deserves

The bot watches the live order book on a centralized exchange, evaluates roughly 5,000 three-leg cycles in parallel for every price update, and flags any cycle whose round-trip is profitable after fees. The Python version worked but was bottlenecked at 150ms of reaction time per update, dominated by the cost of arbitrary-precision arithmetic in Decimal and by inter-process communication between the worker pool and the main loop.
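The profitability check described above can be sketched in a few lines. This is an illustration only: the `Leg` struct, the function names, the single proportional fee per leg, and the use of `f64` are all assumptions made for brevity, not the bot's actual code.

```rust
/// One leg of a cycle: the rate at which the input asset converts to the
/// output asset at the current top of book (hypothetical representation).
#[derive(Clone, Copy, Debug)]
pub struct Leg {
    pub rate: f64,
}

/// Round-trip ratio of a three-leg cycle, assuming the same proportional
/// fee is charged on every leg. A result of 1.0 means break-even.
pub fn round_trip_ratio(legs: &[Leg; 3], fee: f64) -> f64 {
    legs.iter().map(|leg| leg.rate * (1.0 - fee)).product()
}

/// A cycle is flagged only if one unit in returns more than one unit out.
pub fn is_profitable(legs: &[Leg; 3], fee: f64) -> bool {
    round_trip_ratio(legs, fee) > 1.0
}
```

With a fee of 0.1% per leg, the three rates have to multiply to more than roughly 1.003 before the cycle clears, which is why a small error on a single leg is enough to flip the decision.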

The Rust rewrite swapped both. A custom Fixed8 type (a 64-bit integer with a fixed 8-decimal scale) replaced Decimal and ran roughly 30× faster on the hot path while preserving exact-decimal semantics within the value range we cared about. A lock-free seqlock-protected order book replaced the IPC bus. The 5,000 triangles ran in a single CPU-pinned thread that did nothing else.
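To make the Fixed8 idea concrete, here is a minimal sketch of a 64-bit fixed-point type with an 8-decimal scale. Only the name Fixed8 and the scale come from the essay; the `parse` signature, the `ParseError` enum, and the truncating behaviour are assumptions. Note the ceiling this representation implies: the largest value is `i64::MAX / 10^8`, a little over 92 billion.

```rust
/// Sketch of a fixed-point decimal: an i64 counting units of 10^-8.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Fixed8(i64);

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    Invalid,
    Overflow,
}

impl Fixed8 {
    pub const SCALE: i64 = 100_000_000;
    pub const MAX: Fixed8 = Fixed8(i64::MAX);

    /// Parses a non-negative decimal string such as "0.00012345".
    /// Digits beyond the eighth decimal place are truncated.
    pub fn parse(s: &str) -> Result<Fixed8, ParseError> {
        let (int_part, frac_part) = match s.split_once('.') {
            Some((i, f)) => (i, f),
            None => (s, ""),
        };
        if int_part.is_empty() && frac_part.is_empty() {
            return Err(ParseError::Invalid);
        }
        let all_digits = |p: &str| p.bytes().all(|b| b.is_ascii_digit());
        if !all_digits(int_part) || !all_digits(frac_part) {
            return Err(ParseError::Invalid);
        }

        // Accumulate the integer part with checked arithmetic.
        let mut raw: i64 = 0;
        for b in int_part.bytes() {
            raw = raw
                .checked_mul(10)
                .and_then(|v| v.checked_add((b - b'0') as i64))
                .ok_or(ParseError::Overflow)?;
        }
        // Scaling by 10^8 is where large quantities overflow.
        raw = raw.checked_mul(Self::SCALE).ok_or(ParseError::Overflow)?;

        // Take at most eight fractional digits and pad with zeroes.
        let mut frac: i64 = 0;
        let mut digits = 0;
        for b in frac_part.bytes().take(8) {
            frac = frac * 10 + (b - b'0') as i64;
            digits += 1;
        }
        for _ in digits..8 {
            frac *= 10;
        }
        raw.checked_add(frac).map(Fixed8).ok_or(ParseError::Overflow)
    }

    /// The underlying count of 10^-8 units.
    pub fn raw(self) -> i64 {
        self.0
    }
}
```

Under these assumptions, `Fixed8::parse("1.5")` yields a raw value of 150,000,000, while a quantity of one trillion returns `ParseError::Overflow`.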

None of this is novel; all of it is risky. The risk in this kind of port is not the algorithm (which is well-understood) but the long tail of small choices about types, lifetimes, memory ordering, and serialization, any one of which can quietly produce data that is wrong instead of code that is slow.

To manage that risk, I ran an AI peer-review pass on every non-trivial commit. Sixteen rounds in total before I deployed, all on the same model on its highest reasoning setting. The output of each round was a written critique with severity-tagged findings. I applied or rebutted each finding individually, and the code went out only after the model returned "no further critical or high issues found" on the final pass.

I had a recent reason to take peer review seriously: skipping it on the Python version of the same bot, described in the previous essay in this series, had cost me ten hours of misinterpreted benchmarks. Sixteen rounds was the overcorrection.

I thought, by the time I clicked deploy, that the bot was as carefully reviewed as a solo-author bot could be.

What boot looked like

The first three seconds were perfect. The bot logged 1392 symbols, 5196 triangles indexed, identical down to the digit to the Python version that had been running on the same instance for hours. The lock-free order book filled with quotes. The triangle evaluator was reporting 53,000 triangles evaluated per second on a free-tier t3.micro, against 400 per second in Python.

The first sixty seconds were also clean. No errors. No backpressure. Memory steady at 67 MB. The warning counter, however, was logging something: 48 "Parse error: overflow" lines, all on micro-priced shitcoins (LUNC, SHIB, PEPE, BONK) where the top-of-book quantity is a number with twelve zeroes in it.

I read the lines, satisfied myself that they were not affecting the main path, and made a note to add a wider integer or a saturating-cast fix when convenient. The bot was producing two orders of magnitude more throughput than the Python version. The parse warnings were noise on the margin. I left them.

This is the kind of decision that, in retrospect, I should write a postmortem about every time I make it.

The smoking gun

Sixteen minutes in, the telemetry tick reported opportunities_found: 244. The same telemetry tick on the Python version, running the same configuration on the same instance for the previous fifteen hours, had reported opportunities_found: 0.

A headline metric of 244 where a clean comparison baseline reports zero is the kind of result that has exactly two possible explanations. Either the Rust port has discovered something the Python version was structurally unable to see (possible in principle, since the throughput difference is real and faster reaction could in theory uncover shorter-lived spreads), or the Rust port is wrong.

I went to look at the opportunities. The bot writes one line per detected opportunity to a JSONL file, with the full leg-by-leg pricing and the resulting profit ratio. I opened the file expecting a varied distribution across the universe.

All 244 entries were on the same triangle. USDT → USDC → LUNC → USDT. The same three pairs, every time, for sixteen minutes.

I did not understand this immediately. I sat with the file for a few minutes, then went to look at what the order book actually contained on those three legs. The USDT-USDC and USDC-LUNC quotes had been updating normally throughout the run. The LUNC-USDT leg had not. The last successful update on that pair was sixteen minutes old: it dated from precisely the moment I had deployed the bot.

What had happened was this. The original LUNC-USDT quote frame had been valid when the bot booted; the order book had stored it correctly. The next frame on that pair carried a top-of-book quantity that exceeded the maximum value representable in the Fixed8 type. The parser, on encountering the overflow, had logged a warning and discarded the entire frame. The bot's local snapshot of the LUNC-USDT pair had been frozen on the original price from sixteen minutes earlier, while the rest of the market moved around it.
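The failure mode is easy to reproduce in miniature. The sketch below is a deliberately simplified stand-in, not the bot's parser: the `Frame` and `Book` types, the `HashMap` storage, and whole-number quantities are all assumptions. What it preserves is the behaviour described above: an overflow in one field drops the whole frame, so the previous quote silently survives.

```rust
use std::collections::HashMap;

const SCALE: i64 = 100_000_000;

/// Converts a whole-number quantity to 8-decimal fixed point.
/// Returns None when the scaled value does not fit in an i64.
fn to_fixed8(whole_units: i64) -> Option<i64> {
    whole_units.checked_mul(SCALE)
}

/// A simplified quote frame (hypothetical layout).
pub struct Frame {
    pub symbol: &'static str,
    pub price_raw: i64,
    pub qty_whole: i64,
}

#[derive(Default)]
pub struct Book {
    quotes: HashMap<&'static str, (i64, i64)>,
    pub parse_warnings: u64,
}

impl Book {
    /// The buggy behaviour: if any field overflows, the entire frame is
    /// discarded and the previously stored quote stays in place.
    pub fn apply_discarding(&mut self, frame: &Frame) {
        match to_fixed8(frame.qty_whole) {
            Some(qty) => {
                self.quotes.insert(frame.symbol, (frame.price_raw, qty));
            }
            // Counted as a warning, but nothing marks the quote as stale.
            None => self.parse_warnings += 1,
        }
    }

    /// The price the evaluator would read for this symbol.
    pub fn price(&self, symbol: &str) -> Option<i64> {
        self.quotes.get(symbol).map(|quote| quote.0)
    }
}
```

Feeding this book a valid LUNC-USDT frame followed by one whose quantity is a trillion leaves the first price in place and increments only the warning counter, which is the state the real bot sat in for sixteen minutes.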

A 244-times-repeated arbitrage opportunity is what you get when one leg of a three-leg triangle is a stale price and the other two are live. The opportunity is not a market event. It is the bot quoting itself.

Why this would have been dangerous

The bot was running in dry_run = true mode, so no orders were placed. If it had not been (if I had been a few weeks further along, with order signing wired up and live capital authorized) the bot would have spent sixteen minutes sending orders for 244 cycles against a price that did not exist. Each one would have filled at whatever the actual top-of-book happened to be, against the bot's estimate of what it would fill at. The losses would have been bounded by the budget per cycle, but the bot would have ground through them confidently and noisily, with the telemetry reporting 244 successful executions.

This is the kind of bug that does not crash. It does not fire alerts. The metric it produces (opportunities_found) is exactly the metric a casual operator monitors for "is the bot working." A casual operator would have concluded, glancing at the dashboard, that the bot was working better than the Python version.

It was the silence that scared me, more than the dollar count, when I worked out what would have happened. Parse warnings are easy to ignore; opportunity counts going up are easy to celebrate. The two together, on a triangular arbitrage bot, are catastrophic. And nothing in the system, neither the code, nor the logs, nor the telemetry, nor the sixteen rounds of AI peer review, had connected them.

Three layers of fix, because one is not enough

The instinctive fix was to widen the integer or to cap on overflow, and I started with the cap. parse_qty_capped now returns Fixed8::MAX on overflow rather than discarding the frame, and a counter qty_capped_count is exposed in the next telemetry tick. The cap value is 600× larger than the maximum quantity my budget could ever consume on the worst-case asset, so the cap never changes a decision at retail size but still shows up in telemetry.
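A saturating parser along these lines might look like the sketch below. The name parse_qty_capped comes from the essay; the signature, the `Qty` enum, and the restriction to whole-number quantities are assumptions made to keep the example short.

```rust
pub const SCALE: i64 = 100_000_000;
/// Raw value standing in for Fixed8::MAX in this sketch.
pub const FIXED8_MAX_RAW: i64 = i64::MAX;

/// Outcome of parsing a quantity: either exact, or saturated at the
/// maximum. Returning the variant lets the caller count caps.
#[derive(Debug, PartialEq, Eq)]
pub enum Qty {
    Exact(i64),
    Capped(i64),
}

/// Parses a whole-number quantity string into raw 10^-8 units,
/// saturating instead of failing on overflow.
/// Returns None only when the input is not a plain digit string.
pub fn parse_qty_capped(s: &str) -> Option<Qty> {
    if s.is_empty() || !s.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let mut raw: i64 = 0;
    for b in s.bytes() {
        let next = raw
            .checked_mul(10)
            .and_then(|v| v.checked_add((b - b'0') as i64));
        match next {
            Some(v) => raw = v,
            None => return Some(Qty::Capped(FIXED8_MAX_RAW)),
        }
    }
    match raw.checked_mul(SCALE) {
        Some(v) => Some(Qty::Exact(v)),
        None => Some(Qty::Capped(FIXED8_MAX_RAW)),
    }
}
```

The important property is that the price in the same frame is no longer thrown away because the quantity was too large to represent.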

That fix would have prevented this particular bug, but it would not have prevented the next one. The class of failure I had hit was "a parse error on one leg of a triangle freezes that leg's price while the others move." There are other ways to land in that class: a network frame dropped at the kernel level, a malformed message from the exchange, a numeric edge case that crops up only on listing or delisting events. Capping the quantity does not address the class. It addresses one instance.

The second fix added a counter on the failure path. The first fix had made the failure recoverable; the second made it observable. If qty_capped_count rises on a major-cap symbol, that is not a shitcoin parsing edge case: it is a precision bug in the Rust code, and the next telemetry tick should make it obvious. The cost of the counter is one atomic increment per frame, which on the hot path is essentially free.
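A counter of this kind can be sketched with plain atomics. The per-symbol layout and the names `CapCounters`, `record_cap`, `for_symbol`, and `total` are assumptions; the essay only names the aggregate qty_capped_count. Keeping one slot per symbol is what makes it possible to tell a cap on a micro-priced coin from a cap on a major.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One cap counter per symbol, indexed by a dense symbol id (assumed).
pub struct CapCounters {
    per_symbol: Vec<AtomicU64>,
}

impl CapCounters {
    pub fn new(symbol_count: usize) -> Self {
        Self {
            per_symbol: (0..symbol_count).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Called on the hot path whenever a quantity is saturated.
    /// Relaxed ordering suffices: the counter is a statistic and is not
    /// used to synchronize access to any other data.
    pub fn record_cap(&self, symbol_id: usize) {
        self.per_symbol[symbol_id].fetch_add(1, Ordering::Relaxed);
    }

    /// Caps seen so far for one symbol.
    pub fn for_symbol(&self, symbol_id: usize) -> u64 {
        self.per_symbol[symbol_id].load(Ordering::Relaxed)
    }

    /// Aggregate value to report on the telemetry tick.
    pub fn total(&self) -> u64 {
        self.per_symbol
            .iter()
            .map(|c| c.load(Ordering::Relaxed))
            .sum()
    }
}
```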

The third fix was the one I should have built first. The order book grew a parallel array of atomic booleans, one per symbol, marking whether the most recent frame for that symbol had failed to parse at any stage. The triangle evaluator now reads three of those flags before reading any prices, and skips the cycle if any leg's symbol is currently in an invalid state. The flag clears automatically on the next successful frame for the same symbol. There is no longer a code path on which a stale price contributes to an opportunity decision. The bug class is closed by construction, not by case-by-case patches.
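The flag array can be sketched as follows. The type and method names are hypothetical, and the real order book is seqlock-protected, which this example leaves out; it shows only the flag logic and the Release/Acquire pairing on the flag itself.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// One "last frame failed to parse" flag per symbol, indexed by a dense
/// symbol id (assumed).
pub struct ValidityFlags {
    invalid: Vec<AtomicBool>,
}

impl ValidityFlags {
    pub fn new(symbol_count: usize) -> Self {
        Self {
            invalid: (0..symbol_count).map(|_| AtomicBool::new(false)).collect(),
        }
    }

    /// Producer side: called after every frame for a symbol. A failed
    /// parse sets the flag; the next successful frame clears it.
    /// The Release store pairs with the Acquire load in `is_invalid`,
    /// so a reader that observes this store also observes the writes
    /// the producer made before it.
    pub fn mark(&self, symbol_id: usize, parse_ok: bool) {
        self.invalid[symbol_id].store(!parse_ok, Ordering::Release);
    }

    /// Consumer side: true if the latest frame for the symbol failed.
    pub fn is_invalid(&self, symbol_id: usize) -> bool {
        self.invalid[symbol_id].load(Ordering::Acquire)
    }

    /// The evaluator checks all three legs before reading any price.
    pub fn cycle_is_usable(&self, legs: [usize; 3]) -> bool {
        legs.iter().all(|&id| !self.is_invalid(id))
    }
}
```

In this sketch the evaluator would call `cycle_is_usable` first and skip the cycle on `false`, so a leg whose latest frame failed to parse never reaches the profit calculation.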

The implementation cost of the third fix was a thirty-line change, and the runtime cost was a measurable but tiny addition to per-evaluation latency. The architectural cost (designing it correctly, with Acquire/Release memory ordering between the producer that marks the flag and the consumer that reads it) was non-trivial, and the fix could not have been added during the original sixteen rounds of review, because at that time the failure class did not yet exist in my mental model of the system.

What sixteen rounds of AI peer review missed

I want to be careful here. The peer review process found a great deal. Over the sixteen rounds, the model flagged fifteen critical or high-severity issues that I applied: memory ordering bugs on the seqlock, a fence-post error in the triangle indexer, a misuse of mem::replace in the order book writer, a subtle precision loss in the Fixed8 round-down logic, and several others. Without it, I would have shipped code that was wrong in ways I would have discovered later and at higher cost.

But it missed the LUNC bug, and it missed it for a reason that is worth being explicit about.

The model reviewed the Rust code on its own terms. It checked correctness against the specification I gave it; it checked safety against the patterns it has internalized; it checked the handling of edge cases I asked it to think about. It did not, at any point in those sixteen rounds, compare the Rust code against the Python code that had been running on the same problem for the previous several months. It did not, because I did not ask it to.

The Python codebase had a constant called SYMBOLS_PER_CONNECTION set to 200. That constant existed because at some point during the Python development, somebody (me, six months earlier) had discovered empirically that asking the exchange to subscribe to more than that on a single WebSocket connection produced an HTTP 414 at the gateway. The constant was not documented. The exchange documentation said the cap was 1,024 streams per connection. I had used the exchange documentation, not the codebase, in the Rust rewrite. The Rust bot's first attempt to connect produced an HTTP 414 within the first second, exactly as the Python version had six months earlier, before the constant had been introduced.
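Carrying the learned limit over is a small amount of code. The sketch below assumes the Rust port adopts the same 200-symbol limit and splits its subscriptions accordingly; the function name `plan_connections` is hypothetical, and the essay does not show how the Rust version was actually fixed.

```rust
/// Empirical limit carried over from the Python codebase. The exchange
/// documentation states 1,024 streams per connection, but larger
/// subscriptions were rejected at the gateway with an HTTP 414.
pub const SYMBOLS_PER_CONNECTION: usize = 200;

/// Splits the symbol universe into per-connection subscription batches,
/// each holding at most SYMBOLS_PER_CONNECTION symbols.
pub fn plan_connections(symbols: &[String]) -> Vec<&[String]> {
    symbols.chunks(SYMBOLS_PER_CONNECTION).collect()
}
```

For the 1,392 symbols the bot indexes, this yields seven connections: six full batches of 200 and one of 192.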

That bug, like the LUNC bug, was a constraint the codebase had learned and the model had not been shown. The pattern repeats: AI peer review can rigorously verify the artifact you give it against the constraints you explicitly state. It cannot verify the artifact against the empirical history of the previous artifact, unless you put that history into the prompt.

This is not a criticism of the model. It is a criticism of how I was using it. I was treating the review process as if it were a complete verification of the new code, when in fact it was a verification of the new code against a specification I was authoring as I went. The Python implementation was, in effect, the better specification, and I was not feeding it in.

What I now do differently

Three things, starting with the Rust port and applying everywhere since.

When porting or rewriting a system that has an existing implementation, the existing implementation goes into the prompt. Not as context, as the specification. The new code is reviewed against the old code, with the explicit instruction to flag any constant, threshold, magic number or workaround in the old code that is not represented in the new one. The model is good at this kind of diff once asked, and astonishingly bad at it when not.
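A cheap mechanical pre-check can complement the review prompt. The sketch below is not the method described above, which relies on the model reading the old code; it is a crude heuristic, under the assumption that the old code keeps its learned limits in unindented ALL_CAPS assignments, and the function name `missing_constants` is hypothetical. It lists Python constants whose names never appear in the Rust source.

```rust
/// Returns the names of module-level ALL_CAPS constants assigned in the
/// Python source that do not appear anywhere in the Rust source.
/// Heuristic only: it misses annotated assignments and renamed constants.
pub fn missing_constants(python_src: &str, rust_src: &str) -> Vec<String> {
    let mut missing: Vec<String> = Vec::new();
    for line in python_src.lines() {
        // Consider only lines of the form `NAME = value`.
        if let Some((lhs, _rhs)) = line.split_once('=') {
            // Indented lines keep their leading whitespace here, so they
            // fail the uppercase check below and are skipped.
            let name = lhs.trim_end();
            let is_const = name.starts_with(|c: char| c.is_ascii_uppercase())
                && name
                    .chars()
                    .all(|c| c.is_ascii_uppercase() || c.is_ascii_digit() || c == '_');
            let already_listed = missing.iter().any(|m| m.as_str() == name);
            if is_const && !already_listed && !rust_src.contains(name) {
                missing.push(name.to_string());
            }
        }
    }
    missing
}
```

Run against the two codebases in this essay, a check like this would have surfaced SYMBOLS_PER_CONNECTION before the first connection attempt, though it says nothing about why the constant exists. That part still needs a reader.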

A log line that fires regularly is a metric, whether or not anyone has wired it up as one. Forty-eight parse warnings per minute is not noise; it is the bot telling me that the price for forty-eight quote frames per minute is now stale. The "shall I look at the warnings" question has a default answer (yes) and the cost of looking is trivial against the cost of a silent bug that compounds for sixteen minutes before becoming visible.

For any class of failure I find in production, the fix is not done until I have written the architectural layer that would have prevented other instances of the same class. The LUNC fix is two lines for the cap, ten lines for the counter, and thirty lines for the invalid-flag skip. The first two lines were the obvious fix. The thirty lines were the actual fix. They are not the same work.

The Rust port shipped. It runs in production. The headline number (138× faster than the Python it replaced) is real and the architecture is, by every measure I have looked at, more correct than the version it replaced. None of which would have mattered if I had not, an hour after deployment, taken a closer look at a metric that was suspiciously high and a warning that was suspiciously routine. The most useful bug I shipped was the one I almost didn't notice.
