GenAI-R-Us
AI · Automation · Software Engineering · Quality Gates · Token Economics
Part 2 of 10 in Fully Functional Factory

Why Machines Don't Get Bored

Scott

In the last post, we talked about the best practices we've been trying—and failing—to implement for decades. Not because we don't know better, but because the system was too costly and difficult for humans to maintain consistently.

We identified the gap: human capacity is the bottleneck, not human knowledge.

Now comes the uncomfortable part: admitting what kind of work we're actually bad at.

The Work We Hate

There's a reason Boston Dynamics built robots for the "Dirty, Dangerous, or Dull" jobs. Not because humans can't do those tasks—we did them for centuries—but because we're poorly suited for them. We get injured. We get tired. We get bored and make mistakes.

Software engineering has its own version of this problem. There's a category of work that's essential but soul-crushing:

Brittle, Boring, or Buggy.

Let's break that down.

Brittle: The Work That Demands Precision

Some software tasks are brittle—they're security-critical, demand precision, and break easily when done carelessly:

  • Security scanning — Checking every dependency for CVEs, every commit for leaked secrets, every API for injection vulnerabilities
  • Input validation — Ensuring every edge case is handled, every boundary condition is checked, every error path is tested
  • Compliance verification — Validating that GDPR requirements are met, that PII is handled correctly, and that audit trails are complete
  • Data migration — Moving data between schemas where a single mistake corrupts production

These tasks require exhaustive attention to detail. Miss one edge case, and you have a production incident. Skip one security check, and you have a breach.

Humans can do this work. But we do it badly when we're rushed, distracted, or context-switching. And we're almost always rushed, distracted, or context-switching.

Boring: The Work That Never Changes

Some software tasks are boring—tedious, repetitive, monotonous work that follows the same pattern every time:

  • Code formatting — Running Prettier, ESLint, organizing imports, fixing whitespace
  • Dependency updates — Bumping package versions, running tests, checking for breaking changes
  • Boilerplate generation — CRUD endpoints, ORM models, API client stubs, test scaffolding
  • Documentation updates — Keeping README files current, updating API specs, maintaining architecture diagrams

These tasks are necessary but mind-numbing. The first time you format code, it feels productive. The 500th time, you're questioning your career choices.

Humans can do this work. But we cut corners. We skip steps. We tell ourselves, "I'll do it later," and then never do. Because it's boring, and we have finite tolerance for boredom.

Buggy: The Work Prone to Human Error

Some software tasks are buggy—error-prone work that requires exhaustive checking, not best-effort:

  • Test coverage — Writing tests for every function, every branch, every error path, every integration point
  • Regression testing — Ensuring that every old feature still works after every new change
  • Performance benchmarking — Running load tests, profiling hotspots, measuring latency under realistic conditions
  • Integration validation — Verifying that Service A still works with Service B after either one changes

These tasks demand consistency. You can't test "most" of the critical paths and call it good. You can't validate "most" of the integrations and hope for the best.

Humans can do this work. But we get fatigued. We lose focus. We make assumptions. We trust our intuition over exhaustive verification. And sometimes our intuition is wrong.

The Human Problem

Here's the brutal truth: we're not good at Brittle, Boring, or Buggy work because we're optimized for something else entirely.

Humans are extraordinary at:

  • Creative problem-solving — designing architectures, inventing algorithms, finding elegant solutions
  • Judgment calls — weighing trade-offs, making decisions with incomplete information
  • Contextual understanding — reading between the lines, understanding user needs, navigating ambiguity
  • Adaptation — responding to changing requirements, learning new domains, improvising under pressure

These are the skills that matter for building software. They're what make us valuable.

But the work of maintaining discipline around building software—the comprehensive testing, the security scanning, the documentation updates, the systematic validation—requires different skills:

  • Infinite patience (we don't have it)
  • Perfect consistency (we can't maintain it)
  • No fatigue (we get tired)
  • No boredom (we get very bored)

We've been trying to force humans to do machine work. And then we blame ourselves when we can't keep up.

What Machines Are Good At

Now let's talk about what machines are good at.

Machines don't get bored. Run the same linter check 10,000 times? No problem. They'll execute it the same way every time, with the same thoroughness, with zero complaints.

Machines don't get tired. Run a comprehensive test suite at 3 AM after the 12th deployment of the day? Sure. They don't need sleep. They don't lose focus. They don't make sloppy mistakes because they're exhausted.

Machines don't take shortcuts under pressure. You can't convince a machine to skip the security scan "just this once" because the deploy is late. It doesn't care about your deadline. It runs the process you told it to run.

Machines don't forget steps. There's no "oops, I meant to run the integration tests, but I was in a hurry." The checklist is the checklist. Every time.

This is what machines were designed for: repetitive, exhaustive, never-ending process adherence.

For decades, we've automated everything except writing the code and reviewing it. Recent innovations are finally making it possible to automate even those tasks.

The Economics Shift

"Okay," you might be thinking, "but AI is expensive. Running comprehensive checks with LLMs on every commit will cost a fortune."

Let's talk about that.

The Token Cost Curve

In January 2023, GPT-3.5 cost $0.002 per 1K tokens.

By January 2024, GPT-4 Turbo cost $0.01 per 1K input tokens, more expensive per token but dramatically more capable.

By January 2025, Claude 3.5 Sonnet cost $0.003 per 1K tokens and could handle complex reasoning tasks that earlier models couldn't.

In January 2026, frontier models are 10x cheaper than they were three years ago, 100x more capable, and the curve isn't flattening—it's accelerating.

What cost $100 in 2023 now costs $10 and delivers better results.

And here's the kicker: most of the work in a well-designed AI factory doesn't use tokens at all.

The Zero-Token Advantage

Remember the "Brittle, Boring, or Buggy" framework? Here's the secret: most of that work is deterministic.

  • Code formatting? Zero tokens. Run Prettier. It's a deterministic algorithm.
  • Dependency vulnerability scanning? Zero tokens. Query a CVE database. It's a lookup.
  • Compilation and type checking? Zero tokens. Run the generator. It's a state machine.
  • Test execution? Zero tokens. Run the test suite. It's code execution.
  • Linting? Zero tokens. Run ESLint. It's pattern matching.
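
To make the zero-token point concrete, here's a minimal sketch of a deterministic check runner: each check is just a subprocess and an exit code, with no model calls anywhere. The specific commands are illustrative assumptions, not a prescribed toolchain.

```python
import subprocess

# Each check is a plain command; exit code 0 means pass.
# These commands are examples; substitute your project's actual tools.
DETERMINISTIC_CHECKS = {
    "format": ["prettier", "--check", "."],
    "lint": ["eslint", "."],
    "tests": ["npx", "jest", "--ci"],
}

def run_deterministic_checks(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run each check as a subprocess and record pass/fail. Zero tokens."""
    results = {}
    for name, cmd in checks.items():
        proc = subprocess.run(cmd, capture_output=True)
        results[name] = proc.returncode == 0
    return results
```

The runner never exercises judgment; it only reports what the tools report, which is exactly why it costs nothing to run on every commit.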

The only time you need an LLM is when you need judgment:

  • "Is this architectural decision consistent with the design document?"
  • "Does this business logic actually implement the requirement?"
  • "Is this error handling appropriate for this context?"

For a well-architected system, here's the reality: you can build a comprehensive quality pipeline where 60-70% of the checks are fully deterministic (zero tokens), 20-30% are hybrid (programmatic first, LLM only for what remains), and only 10% require full LLM analysis.

Let's say you run 20 checks on every feature:

  • 12 are programmatic (zero cost)
  • 4 are hybrid (minimal cost: maybe $0.50 per feature across all four)
  • 4 are full LLM (maybe $5 per feature across all four)

Total cost per feature: ~$5.50.
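
As a sanity check on the arithmetic, the cost model is just a sum over categories. The dollar figures below are the post's illustrative numbers, not measurements:

```python
# Illustrative per-category costs in USD per feature; each "category_cost"
# covers the whole category of checks, not each individual check.
CHECK_MIX = {
    "programmatic": {"count": 12, "category_cost": 0.00},
    "hybrid":       {"count": 4,  "category_cost": 0.50},
    "full_llm":     {"count": 4,  "category_cost": 5.00},
}

def cost_per_feature(mix: dict) -> float:
    """Total quality-pipeline cost for one feature."""
    return sum(tier["category_cost"] for tier in mix.values())

total = cost_per_feature(CHECK_MIX)  # 0.00 + 0.50 + 5.00 = 5.50
```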

Compare that to the cost of:

  • A production incident from a missed security vulnerability ($10,000–$1,000,000 depending on severity)
  • A deployment rollback because tests weren't comprehensive ($5,000 in engineer time + customer impact)
  • Onboarding delays because documentation is wrong ($2,000 per new engineer)
  • Technical debt from skipped refactoring ($50,000 over the lifetime of the codebase)

The economics have flipped. It's now cheaper to run comprehensive checks than to deal with the consequences of skipping them.

The Performance Shift

"But won't all these checks slow down development?"

This is the second objection. And it would have been valid five years ago.

In 2020, running a comprehensive security scan on a large codebase took 30–45 minutes. Running a full integration test suite took an hour. Getting an LLM to analyze architectural consistency? Not possible—the models weren't good enough.

Recently, the landscape has changed:

  • Static analysis tools are fast. Semgrep scans a million lines of code in under 10 seconds. Trivy scans container images in seconds.
  • Test runners are parallel. Modern test frameworks (Jest, pytest, go test) run in parallel across cores. A suite that took an hour now takes 5 minutes.
  • LLMs are fast. Claude Sonnet can analyze a 50KB code file and provide architectural feedback in under 2 seconds.
  • Caching is everywhere. Most checks only run on changed files. If you modified 3 files, you're not re-checking the entire codebase.

A comprehensive 20-check pipeline now runs in under 3 minutes on a modern CI system.

That's faster than most humans can context-switch, open a PR, and mentally review the changes.
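
The changed-files optimization is simple to implement: ask git what changed, then map paths to the checks that care about them. Here's a sketch, where the trigger patterns and check names are hypothetical:

```python
import fnmatch
import subprocess

def changed_files(base: str = "origin/main") -> list[str]:
    """Paths modified relative to a base ref, via `git diff --name-only`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]

def checks_to_run(files: list[str], triggers: dict[str, str]) -> set[str]:
    """Select only the checks whose trigger pattern matches a changed path."""
    return {
        check
        for check, pattern in triggers.items()
        for path in files
        if fnmatch.fnmatch(path, pattern)
    }

# Hypothetical trigger map: a three-file change re-runs two checks, not twenty.
TRIGGERS = {
    "frontend-lint": "web/*",
    "api-tests": "api/*",
    "docs-check": "docs/*",
}
```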

The Consistency Advantage

But speed and cost aren't the real win. The real win is consistency.

Humans have good days and bad days. Monday morning after a full weekend? Fresh, focused, thorough reviews. Friday afternoon before a long weekend? "LGTM, ship it."

Machines don't have good days and bad days. They have the same day, every day. The same thoroughness. The same standards. The same checklist.

This means:

  • No "just this once" exceptions — The security scan runs. Every time. No shortcuts.
  • No "I'll check it later" — Tests run now. Coverage is measured now. Not "when we have time."
  • No "oops, I forgot" — Every step in the process executes. Automatically. Without human memory.

Over time, this consistency compounds. After 100 features, you don't have 87 that were checked thoroughly and 13 that snuck through with shortcuts. You have 100 that were checked thoroughly.

Consistency isn't just about avoiding bugs. It's about building a foundation you can trust.

What This Enables

So here's where we are:

  1. Cost is no longer the barrier. Token prices are dropping 10x per year. Most checks are zero-token anyway.
  2. Speed is no longer the barrier. Modern tools + caching + parallelism = minutes, not hours.
  3. Consistency is no longer the barrier. Machines don't have off days.

Which means the best practices we've been abandoning for decades—comprehensive testing, continuous security scanning, systematic quality gates, always-current documentation, post-deployment validation—are suddenly economically and operationally practical.

Not in some distant future. Not after some mythical "AI breakthrough." Right now.

The question is no longer "can we afford to do this?" The question is "what becomes possible when we actually do it?"

The Checker System

In the next post, we'll dive into the mechanics: the checker system.

It's a structured pipeline of quality gates—some programmatic, some hybrid, some LLM-driven—that runs automatically on every feature, every commit, every deployment.
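
As a rough preview (the real registry is the subject of the next post; the names and tiers below are placeholders), the shape of such a pipeline is a registry of checks tagged by kind, run cheapest-first, with nothing skippable:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Kind(Enum):
    PROGRAMMATIC = "programmatic"  # deterministic, zero tokens
    HYBRID = "hybrid"              # programmatic first, LLM for what remains
    LLM = "llm"                    # full model analysis

@dataclass
class Check:
    name: str
    kind: Kind
    run: Callable[[], bool]  # returns True on pass

def run_pipeline(checks: list[Check]) -> dict[str, bool]:
    """Run every registered check, cheapest kind first; none are optional."""
    order = {Kind.PROGRAMMATIC: 0, Kind.HYBRID: 1, Kind.LLM: 2}
    return {c.name: c.run() for c in sorted(checks, key=lambda c: order[c.kind])}
```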

It's not magic. It's not revolutionary. It's just the discipline we always knew we should have, finally executed by something that never gets tired of executing it.

We'll break down:

  • The 4 tiers of checkers (Foundation → Quality Gates → Intelligence Layer → Continuous Improvement)
  • Why some checks are zero-token, and some aren't
  • How to architect a pipeline where machines do the tedious work and humans do the creative work
  • What a real implementation looks like (not theory—actual tooling and patterns)

Because here's the thing: the best practices didn't fail. Humans did. And that's okay—we were never built for this kind of work.

But machines were.


Next in the series: The Anatomy of a Self-Checking System — Introduction to the 20-checker registry and the philosophy of structured quality gates.