Genairus logoGenAI-R-Us
Standing on Giants: The Composable Stack
Tags: AI, Open Source, Tooling, Architecture, Composability
Part 8 of 10 in Fully Functional Factory

Scott

We've covered a lot of ground. Twenty checkers. Self-healing pipelines. Observability infrastructure. Root cause analysis. Chaos engineering.

You might be thinking: "This sounds like years of engineering work to build from scratch."

Here's the revelation: You don't build it from scratch. You compose it from existing pieces.

In one real-world implementation, combining open-source and commercial tools, approximately 70% of the checker system was built with off-the-shelf tooling. The remaining 30% was custom integration and domain-specific logic.

The wheels are already built. You just need to teach machines to turn them.

The Composability Principle

Here's the architectural insight that changes everything:

Separate concerns into independent layers. Make each layer swappable. Wire them together with standard interfaces.

This isn't new. It's how Unix tools work. It's how microservices work. It's how cloud infrastructure works.
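Here is a minimal Python sketch of that principle. The `Checker` interface, the `Finding` record, and both stub checkers are hypothetical names invented for illustration, not part of any real library. The point is only that the pipeline depends on the interface, so any layer can be swapped without touching the rest.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Finding:
    checker: str   # which layer reported it
    message: str   # human-readable description


class Checker(Protocol):
    """Hypothetical interface that every layer implements."""
    name: str

    def run(self, workspace: str) -> list[Finding]: ...


class LintChecker:
    name = "lint"

    def run(self, workspace: str) -> list[Finding]:
        # A real implementation would shell out to a linter here.
        return []


class SecretChecker:
    name = "secrets"

    def run(self, workspace: str) -> list[Finding]:
        # Stubbed result standing in for a real secret scanner's output.
        return [Finding(self.name, f"possible hard-coded token in {workspace}/config.py")]


def run_pipeline(checkers: list[Checker], workspace: str) -> list[Finding]:
    """Run each layer in order; the pipeline only knows the interface."""
    findings: list[Finding] = []
    for checker in checkers:
        findings.extend(checker.run(workspace))
    return findings
```

Replacing `SecretChecker` with a different scanner means writing one new class with the same `run` signature; `run_pipeline` stays unchanged.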

The same principle applies to your factory:

Layer 1: Schema Validation

Purpose: Validate that your definitions are correct before any code generation begins

Off-the-shelf:

  • Smithy CLI — AWS's open-source tool for validating, building, and diffing API definitions
  • Standard language compilers (tsc, javac, rustc) for generated code

Custom:

  • Validators for proprietary meta-languages (if you have them)
  • Cross-reference checks between different definition types

Why this layer exists: If your definitions are wrong, everything downstream is wrong. Catch it early.
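A cross-reference check can be very small. The sketch below assumes a made-up data model in which each operation lists the shape names it uses and the shapes live in a separate definition file. Real definition formats will look different, so treat the function name and inputs as placeholders.

```python
def find_dangling_references(operations: dict[str, list[str]], shapes: set[str]) -> list[str]:
    """Return 'operation -> shape' entries for references to undefined shapes."""
    errors = []
    for op_name, referenced in sorted(operations.items()):
        for shape in referenced:
            if shape not in shapes:
                # The operation points at a shape no definition file declares.
                errors.append(f"{op_name} -> {shape}")
    return errors
```

An empty list means every reference resolves. Anything else should stop the pipeline before code generation starts.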

Layer 2: Security Scanning

Purpose: Find vulnerabilities in code and dependencies

Off-the-shelf:

  • Semgrep — Fast pattern-matching SAST with custom rules (OSS)
  • Trivy — Container, filesystem, and dependency vulnerability scanner (OSS)
  • TruffleHog — Secret detection in code and commits (OSS)

Custom:

  • LLM-based logic-level security analysis for business logic

Why this layer exists: Security can't be retrofitted. Build it into the pipeline.

Layer 3: Code Quality

Purpose: Enforce style, conventions, and best practices

Off-the-shelf:

  • ESLint + Prettier — JavaScript/TypeScript linting and formatting (OSS)
  • Ruff — Python linting and formatting, extremely fast (OSS)
  • Biome — All-in-one JS/TS formatter/linter (OSS)

Custom:

  • None. Linting is a solved problem. Use existing tools.

Why this layer exists: Consistency reduces cognitive load and prevents style drift.

Layer 4: Testing & Coverage

Purpose: Verify code behavior and measure how much is tested

Off-the-shelf:

  • Istanbul/nyc — JavaScript/TypeScript coverage (OSS)
  • JaCoCo — Java coverage (OSS)
  • coverage.py — Python coverage (OSS)
  • Codecov — Coverage aggregation and enforcement (Free for OSS)

Custom:

  • LLM test generation for uncovered paths (only when coverage falls below threshold)

Why this layer exists: Untested code is untrustworthy code.
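The "only when coverage falls below threshold" rule can be sketched as a small gate. This assumes a Cobertura-style XML report (the format `coverage xml` produces for coverage.py), where the root element carries a `line-rate` attribute. JaCoCo's native XML is structured differently and would need its own reader. The 80% threshold and the function names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def line_coverage(cobertura_xml: str) -> float:
    """Read overall line coverage (0.0 to 1.0) from a Cobertura-style report."""
    root = ET.fromstring(cobertura_xml)
    return float(root.attrib["line-rate"])


def needs_generated_tests(cobertura_xml: str, threshold: float = 0.80) -> bool:
    """True when coverage is below the gate, i.e. the LLM test-generation step should run."""
    return line_coverage(cobertura_xml) < threshold
```

Keeping the LLM step behind this check means the expensive path runs only for builds that actually miss the gate.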

Layer 5: Post-Deployment Validation

Purpose: Verify the deployed service actually works

Off-the-shelf:

  • Schemathesis — Auto-generates API tests from OpenAPI/GraphQL specs (OSS)
  • k6 — Load testing and functional API checks (OSS)
  • Keploy — Records real traffic and replays as regression tests (OSS)
  • Pact — Contract testing framework (OSS)

Custom:

  • Integration with your deployment pipeline

Why this layer exists: "Tests pass locally" ≠ "works in production"

Layer 6: Observability & Metrics

Purpose: Instrument the pipeline and visualize performance

Off-the-shelf:

  • OpenTelemetry — Instrumentation SDK, vendor-neutral (OSS, CNCF)
  • Prometheus — Time-series metrics database (OSS, CNCF)
  • Grafana — Visualization and dashboarding (OSS + Commercial)
  • Langfuse — LLM observability and prompt tracking (OSS)
  • LiteLLM — LLM cost tracking and budget control (OSS)

Custom:

  • Dashboard templates specific to your factory's checkers

Why this layer exists: Can't improve what you don't measure.
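To make "instrument the pipeline" concrete, here is a dependency-free sketch that renders checker durations in the Prometheus text exposition format. The metric name `factory_checker_duration_seconds` is an invented example. In practice you would more likely use the OpenTelemetry SDK or a Prometheus client library than format lines by hand, and label values would need escaping.

```python
def render_metrics(durations: dict[str, float]) -> str:
    """Render per-checker durations as Prometheus text exposition lines."""
    lines = ["# TYPE factory_checker_duration_seconds gauge"]
    for checker, seconds in sorted(durations.items()):
        # One sample per checker, identified by a 'checker' label.
        lines.append(f'factory_checker_duration_seconds{{checker="{checker}"}} {seconds}')
    return "\n".join(lines) + "\n"
```

A dashboard template for your factory then amounts to queries over this one metric, grouped by the `checker` label.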

Layer 7: Resilience & Chaos

Purpose: Validate the system handles failures gracefully

Off-the-shelf:

  • LitmusChaos — Kubernetes chaos engineering (OSS, CNCF)
  • Chaos Mesh — Kubernetes chaos with diverse fault scenarios (OSS, CNCF)
  • ToxiProxy — Network condition simulation (OSS)

Custom:

  • LLM-based experiment design from architecture documents

Why this layer exists: Production will fail. Better to find out on your terms.

Layer 8: Compliance & Audit

Purpose: Generate evidence that your process was followed

Off-the-shelf:

  • in-toto — Software supply chain integrity (OSS, CNCF)
  • Sigstore/Cosign — Artifact signing and verification (OSS)

Custom:

  • Aggregation logic to link code → requirement → decision

Why this layer exists: Regulated industries need audit trails. Everyone else benefits from traceability.

The 70/30 Split

Here's the breakdown from one real implementation:

Off-the-Shelf (70%):

  • Schema validation: Smithy CLI, standard compilers
  • Security: Semgrep, Trivy, TruffleHog
  • Code quality: ESLint, Prettier, Ruff
  • Testing: Istanbul, JaCoCo, coverage.py, Codecov
  • Post-deployment: Schemathesis, k6, Keploy, Pact
  • Observability: OpenTelemetry, Prometheus, Grafana
  • LLM tooling: Langfuse, LiteLLM
  • Chaos: LitmusChaos
  • Compliance: in-toto, Sigstore

Custom (30%):

  • Proprietary meta-language validators (if applicable)
  • LLM-based checkers (#9, #10, #14, #16, #20):
    • Requirement traceability
    • Architectural consistency
    • Escaped defect analysis
    • Documentation freshness
    • Factory self-assessment
  • Pipeline orchestration (wiring tools together)
  • Integration with ticketing/project management systems

The majority of the work is integration, not invention.

A Concrete Tool Map

Let's map the 20 checkers from Post 3 to specific tools:

| Checker | Tool(s) | License | Custom? |
|---|---|---|---|
| #1 Schema Validation | Smithy CLI, tsc/javac | OSS | Partial |
| #2 Pipeline Metrics | OpenTelemetry, Prometheus | OSS | Config |
| #3 Contract Compatibility | Smithy Diff, Pact | OSS | Partial |
| #4 Generated Code Integrity | Standard compilers | OSS | Config |
| #5 Build & Compile | tsc, javac, rustc, etc. | OSS | Config |
| #6 Test Coverage Gate | Istanbul/JaCoCo + Codecov | OSS | Hybrid |
| #7 Style & Convention | ESLint, Prettier, Ruff | OSS | Config |
| #8 Security Scanner | Semgrep, Trivy, TruffleHog | OSS | Hybrid |
| #9 Requirement Traceability | LLM + Langfuse | OSS/Custom | Custom |
| #10 Architectural Consistency | ArchUnit, LLM | OSS/Custom | Custom |
| #11 Cost Guard | LiteLLM or Helicone | OSS | Config |
| #12 Post-Deployment Health | Schemathesis, k6 | OSS | Config |
| #13 Regression Detector | Keploy, Pact | OSS | Config |
| #14 Escaped Defect Analyzer | LLM + Langfuse | OSS/Custom | Custom |
| #15 Prompt Effectiveness | Langfuse or LangSmith | OSS/Commercial | Hybrid |
| #16 Documentation Freshness | LLM + Smithy CLI | OSS/Custom | Custom |
| #17 Performance Benchmark | k6, Locust | OSS | Config |
| #18 Chaos Resilience | LitmusChaos, Chaos Mesh | OSS | Hybrid |
| #19 Compliance & Audit | in-toto, Sigstore | OSS | Config |
| #20 Factory Self-Assessment | Grafana + LLM | OSS/Custom | Custom |

Summary:

  • Off-the-shelf with config only: 9 checkers (45%)
  • Partial or hybrid (OSS + light custom): 6 checkers (30%)
  • Custom LLM logic required: 5 checkers (25%)

The foundation is already built. The custom work is domain-specific judgment.

Why This Wasn't Possible 5 Years Ago

Here's the important historical context: This convergence is recent.

2019:

  • OpenTelemetry had only just been announced (formed 2019, matured 2021)
  • Smithy was internal to AWS (open-sourced 2020)
  • Semgrep was just launched (2019)
  • LangChain didn't exist (2022)
  • Claude and GPT-4-class models didn't exist (2023-2024)
  • Schemathesis was early-stage (2019)
  • LitmusChaos was nascent (2019, matured 2020-2021)

2021:

  • The CNCF observability stack matured (OTel, Prometheus, Grafana)
  • Chaos engineering tools moved from Netflix-internal to OSS-standard
  • SAST tools became fast enough for CI/CD (Semgrep <10s scans)

2023-2024:

  • Frontier LLMs became capable enough for code reasoning
  • Token costs dropped 10x (GPT-3.5 → GPT-4 Turbo → Claude 3.5)
  • LLM observability tools emerged (Langfuse, LangSmith)
  • Prompt engineering became a discipline

2026 (Now):

  • All the building blocks exist
  • All the tools are mature
  • All the patterns are documented
  • The ecosystem is ready

This wasn't possible to build 5 years ago because the ecosystem wasn't ready. It's possible now because everything converged at once.

The Composability Advantage

Here's why building on existing tools beats building from scratch:

1. Swap Without Rewrite

Don't like Semgrep? Swap in CodeQL. Don't like Prometheus? Use Datadog. Don't like k6? Use Locust.

Each layer has standard interfaces:

  • Most security scanners can emit SARIF
  • Coverage tools emit widely supported report formats (LCOV, Cobertura XML, JaCoCo XML)
  • Observability uses the OpenTelemetry Protocol (OTLP)
  • Chaos tools target Kubernetes APIs

Change one tool without touching the rest of the pipeline.
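SARIF is what makes the scanner swap cheap. The sketch below flattens a SARIF 2.1.0 log into tool-agnostic records using only fields defined by the spec (`runs`, `tool.driver.name`, `results`, `ruleId`, `level`, `message.text`). The output record shape and the function name are my own assumptions, not something any scanner prescribes.

```python
import json


def sarif_findings(sarif_text: str) -> list[dict[str, str]]:
    """Flatten a SARIF 2.1.0 log into simple, tool-agnostic finding records."""
    log = json.loads(sarif_text)
    findings = []
    for run in log.get("runs", []):
        tool = run["tool"]["driver"]["name"]
        for result in run.get("results", []):
            findings.append({
                "tool": tool,
                "rule": result.get("ruleId", "unknown"),
                # SARIF treats a missing level as "warning".
                "level": result.get("level", "warning"),
                "message": result.get("message", {}).get("text", ""),
            })
    return findings
```

Everything downstream (gating, ticket creation, dashboards) reads these records, so it never needs to know which scanner produced them.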

2. Scale Incrementally

Start with the foundation (Tier 1: checkers #1-4). Add quality gates when you need them (Tier 2: #5-8). Add intelligence when you're ready (Tier 3: #9-13). Add optimization when it matters (Tier 4: #14-20).

You don't need all 20 checkers on day one.

A minimal viable factory might run:

  • Schema Validation (#1)
  • Build & Compile (#5)
  • Style (#7)
  • Security Scanner (#8)
  • Pipeline Metrics (#2)

That's 5 checkers, all off-the-shelf, deployable in a weekend.
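Incremental rollout can be expressed as plain configuration. This sketch encodes the tier boundaries described above and lets you enable whole tiers, individual checkers, or both. The names `TIERS`, `MINIMAL_FACTORY`, and `select_checkers` are illustrative only.

```python
from collections.abc import Iterable

# Tier boundaries as described above: Tier 1 is #1-4, Tier 2 is #5-8, and so on.
TIERS = {1: range(1, 5), 2: range(5, 9), 3: range(9, 14), 4: range(14, 21)}

# The minimal viable factory cuts across tiers, so it is listed explicitly.
MINIMAL_FACTORY = {1, 2, 5, 7, 8}


def select_checkers(max_tier: int = 0, extra: Iterable[int] = ()) -> list[int]:
    """Return the sorted checker numbers enabled for this rollout stage."""
    selected = set(extra)
    for tier, numbers in TIERS.items():
        if tier <= max_tier:
            selected.update(numbers)
    return sorted(selected)
```

Starting with `select_checkers(extra=MINIMAL_FACTORY)` and later moving to `select_checkers(max_tier=2)` is a one-line change.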

3. Optimize Cost vs. Depth

Some checkers are free and fast (ESLint: <1 second). Some are expensive but thorough (CodeQL: minutes, deep data-flow analysis).

You decide the trade-off based on your needs:

  • Early-stage startup: Fast and free (Semgrep, Ruff, Istanbul)
  • Enterprise with compliance needs: Thorough and paid (CodeQL, Snyk, Drata)
  • Hybrid: Free for most code, paid for critical paths

Composability means you can mix and match.
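The hybrid option comes down to a routing rule. A rough sketch, assuming hypothetical path patterns for what counts as a critical path in your repository:

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; replace with the paths that matter in your codebase.
CRITICAL_PATTERNS = ["services/payments/*", "services/auth/*"]


def scanners_for(path: str) -> list[str]:
    """Pick scanners for a changed file: fast scan always, deep scan on critical paths."""
    scanners = ["semgrep"]
    if any(fnmatchcase(path, pattern) for pattern in CRITICAL_PATTERNS):
        scanners.append("codeql")
    return scanners
```

The scanner names here are just labels. The orchestration layer decides what command each label maps to.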

4. Leverage Community Improvements

When Semgrep adds a new security rule, you get it automatically. When Schemathesis improves fuzzing, you benefit immediately. When Grafana releases a new dashboard template, you can import it.

You're not maintaining the tools. The community is.

The Honest Assessment: What's Still Custom

Let's be clear about the 30% that isn't off-the-shelf:

1. Domain-Specific Validation

If you have proprietary definition languages (like Capacitor or Flux in the example), you need custom validators. Smithy CLI's architecture is a good reference, but you're writing the validators yourself.

Time investment: 2-3 weeks per meta-language for basic validation, ongoing maintenance for new features.

2. LLM Judgment Layers

Five checkers require LLM reasoning:

  • Requirement Traceability (#9): "Does the code implement the requirements?"
  • Architectural Consistency (#10): "Does the code follow architectural patterns?"
  • Escaped Defect Analyzer (#14): "Why did this bug escape? What should we fix?"
  • Documentation Freshness (#16): "Do the API documentation (OpenAPI) and README match the actual code?"
  • Factory Self-Assessment (#20): "How is the factory performing? What should improve?"

These are genuinely custom. You're writing prompts, handling LLM API calls, parsing responses, and feeding results back into the factory.
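Those four steps fit in one function. In this sketch, `call_llm` is a stand-in for whatever client you use (it takes a prompt string and returns the model's text), and the JSON verdict format is an assumption I made up for the example. The one design choice worth copying is that unparseable output is never counted as a pass.

```python
import json
from typing import Callable


def check_traceability(requirement: str, code: str, call_llm: Callable[[str], str]) -> dict:
    """Ask a model whether code implements a requirement and normalize the answer."""
    prompt = (
        "Does the code implement the requirement? "
        'Reply with JSON only: {"implemented": true or false, "reason": "..."}\n\n'
        f"Requirement:\n{requirement}\n\nCode:\n{code}"
    )
    raw = call_llm(prompt)
    try:
        verdict = json.loads(raw)
        implemented = bool(verdict["implemented"])
        reason = str(verdict.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Route anything we cannot parse to human review instead of passing it.
        return {"status": "inconclusive", "reason": "unparseable model output"}
    return {"status": "pass" if implemented else "fail", "reason": reason}
```

Because the client is injected, the checker can be exercised in tests with a fake function and no network access.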

Time investment: 1-2 weeks per checker for initial implementation, ongoing prompt tuning.

3. Orchestration Logic

The tools exist, but you need to wire them together:

  • When does each checker run?
  • What happens if a checker fails?
  • How do results get aggregated?
  • Where do bug tickets go?

This is pipeline-as-code. GitHub Actions, GitLab CI, Jenkins, or custom orchestration.
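Whatever CI system you pick, the core logic looks roughly like the sketch below: run each step, record the outcome, and stop at the first blocking failure. The `Step` class and `run_steps` are hypothetical, and the commands use the Python interpreter as a stand-in for real tools so the example runs anywhere.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    command: list[str]
    blocking: bool = True  # a failing blocking step halts the pipeline


def run_steps(steps: list[Step]) -> dict[str, str]:
    """Run steps in order and return a name -> passed/failed/skipped summary."""
    results: dict[str, str] = {}
    halted = False
    for step in steps:
        if halted:
            results[step.name] = "skipped"
            continue
        proc = subprocess.run(step.command, capture_output=True, text=True)
        if proc.returncode == 0:
            results[step.name] = "passed"
        else:
            results[step.name] = "failed"
            halted = step.blocking
    return results


# Stand-in commands: exit code 0 for success, 1 for failure.
OK = [sys.executable, "-c", "raise SystemExit(0)"]
BAD = [sys.executable, "-c", "raise SystemExit(1)"]
```

For example, `run_steps([Step("style", BAD, blocking=False), Step("build", BAD), Step("test", OK)])` records the style failure, stops at the build failure, and marks the test step as skipped. The summary dictionary is the natural place to hook in aggregation and ticket creation.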

Time investment: 1-2 weeks for basic pipeline, ongoing refinement.

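Total custom work: 6-10 weeks of engineering for a full 20-checker factory with custom meta-languages, assuming the three workstreams above overlap. Run strictly back to back, the per-item estimates (one meta-language, five LLM checkers, one pipeline) sum to 8-15 weeks. Less if you use off-the-shelf definition formats.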

Compare that to building everything from scratch: 6-12 months.

The Economic Reality

Let's talk about what this actually costs:

Option 1: Build Everything Custom

  • Time: 6-12 months (2-3 engineers)
  • Cost: $300K-$600K in engineering time
  • Maintenance: Ongoing (every tool you built needs updates)
  • Risk: High (building security scanners, observability platforms, chaos tools from scratch is hard)

Option 2: Use Commercial All-In-One

  • Time: 2-4 weeks (integration time)
  • Cost: $50K-$200K/year in subscriptions (Datadog + Snyk + PagerDuty + etc.)
  • Maintenance: Low (vendors handle it)
  • Risk: Low (proven tools)
  • Flexibility: Low (locked into vendor ecosystem)

Option 3: Compose OSS + Selective Commercial

  • Time: 4-10 weeks (integration + custom checkers)
  • Cost: $0-$50K/year (free tier or self-hosted for most, commercial for specialized needs)
  • Maintenance: Medium (community handles tools, you handle integration)
  • Risk: Medium (OSS tools are mature but require expertise)
  • Flexibility: High (swap any piece)

For most teams, Option 3 is the sweet spot.
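To see how the options diverge over time, here is a back-of-the-envelope calculation using only the dollar figures listed above. It deliberately ignores maintenance effort and integration labor, which the options above describe but do not price, so read the result as a floor on each option's cost and not as a full total-cost model.

```python
# (low, high) dollar ranges taken from the three options above.
OPTIONS = {
    "build_custom": {"upfront": (300_000, 600_000), "annual": (0, 0)},
    "commercial": {"upfront": (0, 0), "annual": (50_000, 200_000)},
    "composed": {"upfront": (0, 0), "annual": (0, 50_000)},
}


def cumulative_cost(option: str, years: int) -> tuple[int, int]:
    """Return the (low, high) listed spend for an option after a number of years."""
    up_lo, up_hi = OPTIONS[option]["upfront"]
    an_lo, an_hi = OPTIONS[option]["annual"]
    return (up_lo + an_lo * years, up_hi + an_hi * years)
```

Over three years, the commercial stack's listed subscriptions reach the same range as the custom build's upfront cost, while the composed stack tops out at a quarter of that.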

A Starter Stack

If you're starting from zero, here's a minimal stack that covers the foundation:

Core Pipeline:

  • Schema validation: Smithy CLI or OpenAPI tools
  • Build: Standard compilers
  • Tests: Standard test runners + coverage tools
  • Security: Semgrep (free)
  • Style: ESLint + Prettier or Ruff (free)

Observability:

  • Instrumentation: OpenTelemetry (free)
  • Storage: Prometheus (free, self-hosted)
  • Visualization: Grafana (free tier or self-hosted)

Deployment:

  • Health checks: Schemathesis (free)
  • Monitoring: Grafana synthetic checks (free tier)

Total cost: $0-$50/month (VM for Prometheus + Grafana)

Time to set up: 1-2 weeks

Checkers covered: 8 of 20 (the critical ones)

From there, you add:

  • LLM cost tracking (LiteLLM)
  • Chaos engineering (LitmusChaos)
  • Advanced security (Trivy, TruffleHog)
  • LLM judgment checkers (custom prompts)

Incremental growth as you need it.

The Meta-Lesson

Here's the broader insight:

The best practices have existed for a long time, and the tools now exist too. What was missing was the orchestration layer that runs them consistently.

Humans knew to:

  • Validate schemas before generating code
  • Run security scanners before deploying
  • Check test coverage before merging
  • Monitor production after deploying

We just couldn't sustain the discipline of running all these tools, every time, without shortcuts.

The factory isn't inventing new tools. It's orchestrating existing tools with machine-level consistency.

And because the tools are composable, you can:

  • Start small (5 checkers)
  • Grow incrementally (add 1-2 checkers per sprint)
  • Swap tools (replace Semgrep with CodeQL if needed)
  • Mix free and paid (optimize cost vs. capability)

You're not building a factory from scratch. You're assembling one from mature, battle-tested components.

What's Next

We've talked about 19 of the 20 checkers. There's one we've referenced but not yet fully explored:

Checker #10: Architectural Consistency

This checker compares generated code against architectural design decisions. It checks whether the code respects service boundaries, follows prescribed patterns, and matches the data flow described in the architecture.

The challenge: Most architecture documentation is prose. Prose is ambiguous. LLMs can interpret it, but slowly and inconsistently.

The solution: What if your architecture wasn't documentation—but a machine-readable, queryable, always-current definition?

In the next post, we'll dive into Architecture as Code—the fourth meta-language that makes Checker #10 (and several others) dramatically more powerful. It's the piece that shifts architectural consistency from "LLM interprets prose" to "diff structured definitions against actual code."

Because if your architecture is defined in code, the factory can validate against it. Automatically. Every time.


Next in the series: Architecture as Code: The Living Architecture — Machine-readable, queryable, multi-dimensional, temporal architecture that validates itself. The fourth definition that makes the other three more powerful.