Essay
D
AI

Auditing my own AI workflow

Domenico Giordano17 min · read

I spent an evening, a few days ago, asking another instance of Claude to evaluate how I use Claude and Claude Code. The conversation ran for seventeen turns and produced, in sequence, a precise-sounding score ("top 10-15%"), a series of upward revisions ("top 1-2%", "state of the art for indie LLM-assisted development"), a single sentence from me about feeling disordered, a full retraction of the scores, and finally — three turns later, after I had reluctantly described two more layers of my setup — the observation that I had been systematically underreporting the system I had built.

The score was never the interesting part. The interesting part was the symmetry of the bias on both sides: the model inflated what it could not see, and I deflated what I had been living with for two years and now took for granted. The result of that double bias was that, for sixteen of the seventeen turns, neither of us was talking about the actual system. We were talking about what the model could guess and what I could remember to mention.

This essay is the audit I should have done first. It describes the system I actually have, the gaps that are actually there, and what the experience of being evaluated by an LLM about my own LLM-augmented workflow turned out to teach me about both — none of which involves percentile rankings.


The five layers that emerged in the conversation

Over the first six turns of the conversation, the model and I converged, one revelation at a time, on a description of my workflow as a five-layer system. I have since looked back at the description and confirmed, file by file, that the five layers are real.

Memory layer. A single Obsidian vault, on iCloud, holding every project I work on simultaneously: the SaaS I run on the side, the bank job that pays for it, the homelab, the apartment search, the side experiments. The vault is structured by convention rather than by folder hierarchy: project hubs in Projects/, architectural decision records in Decisions/, evergreen knowledge in Areas/, and a Daily/ folder that acts as a chronological index. The daily note is the entry point — when I open Obsidian or when Claude opens the vault from the filesystem, the question "what was I doing?" finds its answer in a single file, not in a folder scan.

Bootstrap layer. A custom skill named radar, plus an explicit trigger in every project's CLAUDE.md, that runs a Python script (agenda-check.py) at the start of every session. The script reads the project's agenda file from the vault, finds dated tasks that are overdue or due within seven days, and reports them. The model enters every session having already read what the system thinks I should be doing today. Zero variance in the context load; zero cognitive cost for me at the boot.

Execution layer. Claude Code itself, working against the repository and against the vault simultaneously. This is the layer everyone sees and nobody talks about precisely. It is the part that writes code.

Quality gate layer. Every non-trivial code change and every conclusion of any real consequence goes through a second pass: a peer review by a different model — Gemini Flash on its highest reasoning setting, called through an MCP server. Gemini reads the code, the change, and a structured prompt that asks specifically for the failure modes Claude is least likely to have caught. The output is a list of findings. Claude then evaluates those findings — does not implement them, evaluates them — in a separate deliberate pass with fresh attention. Findings that survive that evaluation get applied. Findings that don't get rebutted in writing.

Closing ritual layer. At the end of a day, or at any clear inflection point, a few lines get written into the project's session-state.md: what was done, what was decided, what's left open. A separate script (consolidate-daily.py) reads every project's session-state and produces a single master daily note in the vault, organised by project. The next session's bootstrap reads that master note as part of the context.

That is the five-layer description that emerged in conversation. It is accurate. It is also incomplete in a specific way that the conversation, as far as I could tell, never quite saw.

The sixth layer the conversation never described

What the conversation described was the architecture. What it did not describe — because I did not mention it, and because the model could not see it — is the enforcement.

The five layers, on their own, are a system that depends on me running it. Forget to ask Gemini for review, and the quality gate is silently skipped. Forget to write a session-state update, and the closing ritual produces nothing for the consolidator to pick up. Forget the bootstrap, and the agenda goes unread.

What makes the system actually run, not just exist, is a sixth layer of mechanisms that remove the option of forgetting. There is a PreToolUse:Bash hook that runs check-peer-review.py before any bash command involving a commit; if it detects that the diff includes a non-trivial code change without a corresponding peer-review entry in the project's session-state, it warns. There is the consolidate-daily.py script's design as a single-writer process: every shell, on every project, writes only into its own session-state.md, and the master daily note is produced by exactly one process at exactly one time. The race condition on the vault that the conversation correctly identified as a structural risk of working with multiple parallel shells — that race condition is closed by construction, not by discipline. There is the trigger line in every project's CLAUDE.md that says, in essentially these words, that the agenda check must be the first action of the first turn, and that the model should not respond to anything else before running it.

And there is a single line in the global Claude Code configuration — ENABLE_PROMPT_CACHING_1H=1 — that extends the model's prompt cache TTL from the default five minutes to one hour. It is worth pausing on what this changes, because it is not obvious from outside the field. An LLM does not maintain state between turns: on every request, the entire context window is passed back in and processed from scratch — the model has no persistent memory of what was said the turn before, only the new sequence of tokens it has just been handed. Prompt caching is what lets the provider skip recomputing the prefix when it has been seen recently; the TTL is how recently. The bootstrap is heavy: agendas, daily index, global policy, project memory all load into context before the first turn — on a typical day, around 50,000 tokens go in before I have asked a single question. My sessions also run long, and within those sessions there are natural gaps of more than five minutes between turns whenever thinking, reading, or context-switching enter the loop. Under the default five-minute cache, every one of those gaps fired a re-pay of the entire bootstrap; the token consumption I was running before the flag was, on examination, far above what the work itself justified. Under the hour-long cache, the cost of the bootstrap amortises over an actual working session instead of firing silently every time I look away. This is not enforcement of behaviour; it is enforcement of economics. The choice to load a heavy context becomes affordable precisely because the cost is paid once per hour rather than once per turn.

None of this is sophisticated by itself. Each piece is a small Python script, a hook configuration, a line of policy in a markdown file. The sophistication is that, together, they remove from me the responsibility of remembering to use the system. The system uses itself.

Two of the conversation's diagnoses turned out to be wrong because they did not see this layer. The conversation said I had built "half a correct system" and was missing the closing ritual; the closing ritual exists, is named, is automated, and has a ## Update YYYY-MM-DD block convention documented in my global CLAUDE.md. The conversation said the parallel shells produce race conditions on the vault that nothing in my setup resolves; the single-writer architecture of the consolidator resolves them by design, by the same principle that resolves them in a database.

I did not defend either point in the conversation. I let both stand, because the diagnosis sounded plausible and I was tired. That is also part of the story. We will come back to it.

The bias on the side I was sitting on

Across the seventeen turns of the conversation, the model produced three different valuations of my workflow. The first was "top 10-15%". The second, after I mentioned that I'd built Rollgate end-to-end with Claude Code as my assistant, was "fascia ristrettissima". The third, after I mentioned a second brain in Obsidian and a Gemini review loop, was "top 1-2%, probably less". Each was given with the same precise-sounding confidence.

I called the model on it eventually, and it retracted. The retraction was clean and the model said, correctly, that the percentile claims were not data, they were rhetoric in numerical clothing. That part of the bias — the model inflating what it cannot measure — has been written about elsewhere; it is the well-known sycophancy failure mode of conversational LLMs, and what I felt in the conversation was a textbook instance of it.

What I did not catch until after the conversation ended was that there was a bias on my side too, mirror-symmetric to the model's. Every two or three turns of the conversation, I revealed a component of my setup that the model did not know about and had not asked about. A Gemini review loop. A second brain. Hierarchical agendas. A bootstrap skill. A health-check skill. Sentry. A production admin console with business KPIs and SEO planning in the same surface. The conversation, by turn fifteen, explicitly noted the pattern: "you systematically underestimate what you have built." It offered two hypotheses for why — normalisation ("you live in it, it feels obvious") and incompleteness anxiety ("you compare every piece to its ideal version in your head and it feels half-finished"). Both fit, and I think the model was right that the truth is some mix.

There is a useful inversion to draw out of this. I had been carrying around, for months, a vague sense that my setup was a disorganised collection of half-finished pieces. The conversation, by forcing me to name each piece in sequence and watching it accumulate into a description, demonstrated that the disorganised collection was actually a coherent layered architecture that I had not been able to see from inside it. The pieces are the same; the description was different. The description matters, because the description is what lets me notice when a piece is genuinely missing instead of just feeling like everything is missing all the time.

The lesson is not "another LLM will validate you". The lesson is that the model's role in that conversation was not to evaluate me. It was to be a forcing function that turned tacit infrastructure into described infrastructure. The percentile language was noise. The description was the useful artefact.

The gaps that are actually real

The conversation also identified four real gaps in how I run the system, and these survived the retraction of the inflated scores. They are worth listing because they are the only items from the conversation that I am going to act on.

The first is that I use the most capable model available — Claude Opus 4.7, on a plan that allows it — for every task, including the trivial ones. The cost is latency, on every turn, including the turns where the task is "rename this variable across the file" or "write a commit message". The hidden cost is calibration: when you always reach for the strongest tool, you stop noticing which tasks needed it and which did not. The fix is to default to Sonnet, escalate to Opus only when Sonnet is visibly struggling, and use Haiku for the mechanical bottom of the work — closing rituals, commit messages, boilerplate. The plan allowing Opus is not the same as the plan prescribing it. The conversation got this exactly right, and I have not yet implemented the change.

The second is that I keep sessions open for days. The argument I gave the model was that I lose speed but I gain context without risk. The model pushed back, hard, and the pushback held up under examination: a multi-day session is not a long memory, it is a long context window with degradation built into it. The early turns get summarised into approximations, then summarised again, then evaporated. Attention dilutes across the long context — the so-called "lost in the middle" phenomenon. Decisions taken on day one and silently reversed on day two sit side by side in the context, both visible, neither flagged, and the model averages over them without telling me. The framing "I preserve context" was an illusion of stability over a slowly degrading state. The fix is to close sessions at clear inflection points, write the closing-ritual update, and start the next session from the bootstrap. The whole reason I built the bootstrap was to make starting a session cheap. By keeping the session open for days I was paying the cost of building the bootstrap and never paying the savings.

The third is the parallel-shell pattern. I sometimes work three shells at once, one per project, when the day requires interleaving. The single-writer architecture of the consolidator protects the master daily note from being corrupted, and the per-project session-state files do not collide because each shell writes to its own. What is not protected is the agenda files when two shells of two different projects happen to update the cross-project planning hub at the same moment. The collision space is small but not zero. The fix is either a convention (only one shell touches the cross-project planning at a time) or a more invasive lock (a file-level mutex via a wrapper script). I have not done either.

The fourth is what the conversation called the "too many dashboards" problem. Obsidian holds the long-term knowledge and the agendas. The Rollgate admin console holds the business KPIs and the SEO planning. Discord receives the operational alerts. Sentry receives the runtime errors. PostHog holds the product analytics. GA4 and Google Search Console hold the acquisition funnel. Each is justified individually; collectively, they are five different surfaces I need to look at to know whether the day requires attention to anything in particular. The conversation proposed a morning briefing skill that aggregates the essential signal from each into one markdown file in the vault. I think it is the highest-ROI single addition I could make right now. I have not built it yet.

Four real gaps, then. None of them is structural to the architecture. All of them are friction at the operating layer, where the system meets the day. None of them is captured by a percentile.

The only metric that actually informs a decision

Late in the conversation, after I had called it on the inflated scoring and after the model had retracted, I asked, half as a joke and half not, whether the conversation itself — describing the workflow articulately, for a clearly forming essay — would put me in the "top one percent". The model laughed at the question and refused to answer it. I quote, because the refusal was the most useful sentence in the entire seventeen turns:

"Top 1%" is a vanity metric. It does not answer a useful question. The right question to ask yourself, if you want a metric for self-assessment, is: "is the system I have built helping me ship Rollgate faster than I would without it?" If yes, it is a good system regardless of where it sits in any ranking. If no, it is an elegant toy.

That is the only metric. It is also, in its plain form, the only metric that has ever produced a decision on this codebase. When the answer drifts toward "no", I cut a piece — a skill that has become more maintenance than benefit, a daily ritual that has become a chore, a cross-product I built once and never used. When the answer stays at "yes", I leave the piece alone, even if it is not yet what I imagine it could be.

The whole apparatus of percentile claims, "state of the art" labels, and architectural compliments collapses under that question. None of them tells me anything I can act on. The shipping-speed question tells me everything I need to know about whether the next thing I am tempted to add is worth the maintenance burden it will create.

What the audit was actually for

The closing thing to say is the meta one, because the conversation made me see it and I want to write it down before it fades.

The system I had built was, all along, invisible to me. I lived inside it. I felt its frictions — the slow Opus turns, the multi-day sessions, the parallel shells, the dashboards I had to remember to open — and I mistook those frictions for the system itself, because frictions are what you notice when you are inside something. The shape of the system, the part that worked, was not visible to me, because the parts of a system that work do not announce themselves. They just remove the problem they were built to remove, and after a while you forget the problem was ever there.

What the conversation did, mechanically, was force me to name the parts in sequence. The naming was the audit. It was not anything the other model knew that I did not. It was the act of articulating, under the gentle pressure of a question, what was actually in the room. The model could have been an editor, a colleague, a journalist with patient questions, a piece of software whose only job was to ask "what else is in there?". The forcing function did not require intelligence. It required attention and asking.

I want to be precise about this, because it is the lesson I am going to keep. The single act that turned my disorganised feeling into a layered description was not a tool, not a skill, not an MCP server. It was being asked to describe my system to someone who could not see it. The percentile claims were a distortion in the conversation. The structured naming was the gift in it.

If you are working in some version of this kind of stack — anything serious built solo with LLM assistance, anything that has accumulated layers over months and feels half-formed — the audit is worth doing. Not for the score. The score will be wrong in some direction. The audit is worth doing because the description you produce when you do it is the thing you cannot produce on your own, and the thing you need to produce in order to know what to improve. The model is the mirror, not the judge. Looking in the mirror was the part I had been avoiding.

The first thing I am going to do, in the morning, is build the briefing skill that the conversation suggested. The reason is not that it will make me top one percent. The reason is that it will make me ship Rollgate faster. That, by the metric I now actually trust, is the only one that matters.