Part 2 of 6 in My Software Factory
The High Cost of an LLM Taskmaster
Automating the Corrector
In the last article, I described the frustrating, manual reality of using an LLM to generate code. I had become a full-time human debugger for a machine, stuck in a loop of prompting, reviewing, and patching. My engineering instincts screamed that this was the wrong approach: if a task is repetitive, it should be automated.
My next idea felt like a breakthrough: what if I could automate the correction process itself?
I got to work building a suite of "LLM Checkers." My thinking was that if one LLM could write flawed code, perhaps another, more focused LLM could be tasked with finding and fixing those flaws. I set up a pipeline where, after the initial code was generated, it was passed to a series of specialized agents:
- The Linter Agent: Was fed the code and our team's style guide, with instructions to identify and fix any violations.
- The Test-Writer Agent: Analyzed the code and was prompted to write comprehensive unit and integration tests to ensure sufficient coverage.
- The Security-Scanner Agent: Inspected the code for common vulnerabilities like injection attacks or improper error handling.
- The Build-Master Agent: Attempted to compile the code and run the generated tests, reporting any failures.
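The pipeline above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the `call_llm` helper, the role names, and the prompts are all hypothetical stand-ins for what were, in reality, separate metered API calls to a hosted LLM.

```python
def call_llm(role: str, prompt: str) -> str:
    """Stand-in for a real (paid) LLM API call. Here it just echoes its
    input so the pipeline's control flow can run without a network."""
    return f"[{role}] {prompt}"

def run_checker_pipeline(task: str, style_guide: str) -> dict:
    """Generate code, then pass it through the four checker agents."""
    code = call_llm("coder", task)
    # Linter Agent: fed the code and the style guide, asked to fix violations.
    code = call_llm("linter", f"Fix violations of:\n{style_guide}\n---\n{code}")
    # Test-Writer Agent: asked to write unit and integration tests.
    tests = call_llm("test-writer", f"Write tests for:\n{code}")
    # Security-Scanner Agent: asked to flag common vulnerabilities.
    findings = call_llm("security-scanner", f"Scan for vulnerabilities:\n{code}")
    # Build-Master Agent: asked to compile the code, run the tests, report failures.
    build_report = call_llm("build-master", f"Build and test:\n{code}\n{tests}")
    return {"code": code, "tests": tests,
            "security": findings, "build": build_report}

result = run_checker_pipeline("write a slugify() function", "PEP 8")
```

Note that even this toy version makes five model calls to produce one function, a detail that matters shortly.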
On the surface, it worked. The pipeline would churn, and out would come code that was linted, tested, and seemingly ready to go. I had automated the quality checks. I had, in effect, built an AI taskmaster to whip the creative but sloppy coding AI into shape.
The New Problems: Cost and Chaos
My initial satisfaction quickly faded as I confronted two new, unexpected problems.
Problem 1: The Soaring Bill
My cloud billing alerts started to trigger. Then they triggered again. My token usage was going through the roof. I was no longer just paying for one LLM to generate code; I was paying for an entire committee of them to review, debate, and rewrite it. Each step in my pipeline—linting, testing, scanning, fixing—was another expensive API call. The cost of generating a single function was becoming alarmingly high. This wasn't the efficient, automated future I had imagined. It was just a different kind of expensive.
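The arithmetic behind the bill is simple and brutal. The numbers below are purely illustrative (I'm assuming a round ~2,000 tokens per call at $0.01 per 1,000 tokens, not quoting any real provider's pricing), but the multiplier is the point: the pipeline turns one model call per function into five.

```python
# Hypothetical pricing assumptions, for illustration only.
PRICE_PER_1K_TOKENS = 0.01   # dollars
TOKENS_PER_CALL = 2_000      # rough prompt + completion per stage

stages = ["generate", "lint", "write-tests", "security-scan", "build-check"]

cost_per_call = TOKENS_PER_CALL / 1_000 * PRICE_PER_1K_TOKENS
cost_per_function = len(stages) * cost_per_call

# One call per function: $0.02. The full committee: $0.10.
# Over a few thousand generated functions, that gap is the billing alert.
```

Whatever the real per-token price, the structural problem is the same: every checker multiplies the cost of every function, before counting any re-runs when a stage's fix needs re-checking.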
Problem 2: The Illusion of Consistency
The second problem was more insidious. Despite my army of checkers, the final code was still wildly inconsistent. One day, the testing agent would generate tests using a fluent assertion style; the next, it would use simple assert statements. The main coding agent would sometimes generate a clean, functional implementation, and other times it would produce a verbose, object-oriented one for the exact same prompt.
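To make the style drift concrete, here is an illustrative example of the kind of divergence I mean (invented for this article, not actual pipeline output): the same behavior tested two ways on two different runs, once with unittest-style assertion methods and once with bare assert statements. Both are correct; neither matches the other.

```python
import unittest

def slugify(text: str) -> str:
    """A hypothetical function under test."""
    return text.strip().lower().replace(" ", "-")

# Run A: the test-writer agent emits unittest-style assertions.
class TestSlugify(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

# Run B: the same prompt, a day later, yields a bare assert statement.
def test_slugify_basic():
    assert slugify("Hello World") == "hello-world"
```

Every individual output passed its checks, so no agent in the pipeline ever flagged this, and the codebase accumulated competing conventions.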
My checkers could enforce that the code worked, but they couldn't enforce a consistent philosophy or style. The system had no memory and no architectural taste. It was just a collection of probabilistic models doing their best on a case-by-case basis.
This led me to the most important question of my journey, a moment of genuine epiphany that changed my entire approach. I looked at the inconsistent, expensive boilerplate code my factory was producing—the database clients, the API handlers, the server setup—and asked myself:
"Why am I paying a non-deterministic, creative AI to be completely deterministic?"
The answer was, I shouldn't be. I was using the wrong tool for the job.
In the next article, I'll share the breakthrough that came from this question: the move away from an "all-LLM" approach to a hybrid model that combines the best of deterministic generators and creative AIs.


