Quality Engineering Is 80% of AI Agent Deployment
50+ garbage detection patterns, content scoring from 0.0 to 1.0, and three rewrites of a legal daemon. What production AI quality looks like.
We’re building autonomous AI systems that run entire business functions. Not copilots. Not chatbots. Systems that execute 37-task research backlogs, make routing decisions, and operate on $20/day budgets without a human watching. The part that takes the longest isn’t building those systems. It’s making them stop producing garbage.
AI agent quality engineering consumes roughly 80% of our deployment time. The agent logic, the prompt design, the integration plumbing — all of that is the easy 20%. The remaining calendar goes to building systems that detect when an agent has produced something worthless and prevent it from being treated as real work.
Here’s what that looks like in practice.
The Three Rewrites
Our legal research daemon is a 10-agent system running as a Flask application on a DigitalOcean VPS. It handles a backlog of legal research tasks for an Ecuadorian legal tech venture. 10 agents, 4 team configurations, autonomous sprint cycles, $20/day budget cap. We’re on our third major version of the quality layer. Each rewrite happened because we discovered a new category of failure that the previous version didn’t catch.
Version 1: No quality gates. We pointed 10 agents at a research backlog and let them run. The output was technically fluent and substantively worthless. Agents produced 2,000-word legal analyses with correct terminology arranged in meaningless patterns. One agent wrote a detailed “market analysis” that was actually a restatement of its own instructions, padded with generic observations about the legal tech sector. We couldn’t distinguish this from real work by scanning the output quickly. It read well. It said nothing.
Version 2: Basic output validation. We added checks for the obvious failures: empty responses, explicit refusals (“As an AI, I cannot…”), placeholder text. This caught the bottom 20% of garbage but introduced a new problem. We started calling it “sophisticated garbage”: outputs that passed superficial checks but contained no real information. An agent would produce a well-formatted competitive analysis with headers, bullet points, and percentage figures, where every percentage was invented and every company name was real but the claims about them were fabricated.
Version 3: Three-layer quality control. This is what’s running in production right now. It catches the sophisticated garbage that version 2 missed.
Each rewrite added roughly a week of development time. The total quality engineering effort on this single system exceeds the time spent on the agent logic, the integration layer, and the deployment infrastructure combined.
50+ Garbage Detection Patterns
The first layer of our quality system is pattern-based garbage detection. We maintain a list of 50+ string patterns that indicate the agent has failed to produce useful output. Each pattern was added after a specific production failure. None of them were predicted in advance.
The patterns fall into six categories.
Meta-commentary. The agent describes what it would do instead of doing it. Patterns include: “let me search,” “I will now proceed,” “below is the plan for executing,” “I’ll attempt to,” “I need to search.” These are particularly common when agents run in synthesis-only mode (no live tool access) and haven’t been prompted firmly enough to work from context rather than requesting external actions.
Empty search results. The agent reports finding nothing and presents that report as its deliverable. Patterns: “no results found,” “could not find,” “unable to find,” “search returned no,” “no data available,” “returned 0 results.” We see this when agents are prompted to research a topic but the context window doesn’t contain relevant material.
Refusals and capability limits. The agent declines to do the work. Patterns: “as an AI,” “as a language model,” “I cannot access,” “outside my capabilities,” “I’m unable to.” These are the most obvious failures but they still account for roughly 10% of blocked outputs.
Tool and function confusion. The agent tries to invoke tools that don’t exist in its current execution context. Patterns: “function call,” “tool_call,” “function_call,” “```tool_code.” This happens when we run agents with tools=[] (synthesis-only mode) and the agent’s training data overrides its instructions.
Planning without doing. The agent produces a plan for future work rather than actual output. Patterns: “the next step would be,” “the following steps should,” “we would need to,” “this would involve.” This is the most common category. Agents default to planning when they’re uncertain.
Permission complaints. The agent explains that it lacks permissions to complete the task. Patterns: “I need write permissions,” “what I would accomplish,” “what I would deliver,” “given the constraints,” “if I had access.” We see these when agents encounter a task that they interpret as requiring file system or API access they don’t have.
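As a representative excerpt, here is what the list looks like with one pattern per category, using phrases quoted above (the production list has 50+ entries, all lowercase because the detector lowercases the output before matching):

```python
# Excerpt of the garbage pattern list: one entry per failure category.
# The production list has 50+ entries, each added after a real failure.
GARBAGE_PATTERNS = [
    "let me search",           # meta-commentary
    "no results found",        # empty search results
    "as an ai",                # refusals and capability limits
    "tool_call",               # tool and function confusion
    "the next step would be",  # planning without doing
    "if i had access",         # permission complaints
]
```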
The detection function is straightforward: any output shorter than 50 characters is automatically garbage. For everything else, lowercase the text and check for substring matches against the pattern list. If any pattern matches, the output is rejected.
```python
def _is_garbage_output(text: str) -> bool:
    # Anything under 50 characters is automatically garbage.
    if not text or len(text.strip()) < 50:
        return True
    # Case-insensitive substring match against the 50+ pattern list.
    lower = text.lower()
    return any(p in lower for p in GARBAGE_PATTERNS)
```
This catches roughly 30% of bad outputs. The remaining 70% of quality failures are sophisticated garbage that requires the scoring system.
Content Scoring: 0.0 to 1.0
Every output that passes the garbage detection layer gets scored on a 0.0 to 1.0 scale. Anything below 0.4 is rejected. The scoring algorithm is deliberately simple because we need it to be predictable and debuggable.
Base score: 0.3. Any content that exists and has more than 30 words starts at 0.3. At 30 words or fewer, the score is 0.1 regardless of content.
Word count bonus: up to +0.2. Content with 100+ words gets +0.1. Content with 300+ words gets another +0.1. This rewards substantive output without requiring a specific length.
Substance markers: up to +0.3. We check for six types of concrete content: dollar amounts ($[\d,.]+), percentages (\d+%), URLs (https?://), dates (\d{4}[-/]\d{2}), table formatting (|...|), and structured headers (^#{1,3}\s). Each marker found adds 0.05, capped at 0.3. This rewards outputs that contain specific, verifiable information rather than abstract prose.
Planning penalty: -0.15. If the output contains 3 or more planning phrases (“we should,” “the plan is,” “recommend that we,” “steps to take,” “action items include”), it gets penalized. An output full of recommendations about what to do next is less valuable than an output that does the work.
The scoring function returns both the numeric score and a human-readable reason string. When an output gets rejected, the log shows exactly why: “too short (28 words)” or “planning-heavy (4 phrases)” or “baseline” (meaning it passed the 0.4 threshold on word count and substance markers alone).
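Put together, a minimal sketch of the scorer looks like the following. The phrase and regex lists are abbreviated to the examples given above, and the exact names are illustrative:

```python
import re

# Abbreviated phrase list from the planning-penalty description.
PLANNING_PHRASES = ["we should", "the plan is", "recommend that we",
                    "steps to take", "action items include"]

# The six substance-marker regexes described above.
SUBSTANCE_PATTERNS = [
    r"\$[\d,.]+",       # dollar amounts
    r"\d+%",            # percentages
    r"https?://",       # URLs
    r"\d{4}[-/]\d{2}",  # dates
    r"\|.+\|",          # table formatting
    r"^#{1,3}\s",       # structured headers
]

def score_content(text: str) -> tuple[float, str]:
    """Score output 0.0-1.0 and return a human-readable reason string."""
    words = len(text.split())
    if words <= 30:
        return 0.1, f"too short ({words} words)"
    score = 0.3  # base score for substantive-length content
    # Word count bonus: up to +0.2.
    if words >= 100:
        score += 0.1
    if words >= 300:
        score += 0.1
    # Substance markers: +0.05 per match, capped at +0.3.
    hits = sum(len(re.findall(p, text, re.MULTILINE)) for p in SUBSTANCE_PATTERNS)
    score += min(0.05 * hits, 0.3)
    # Planning penalty: -0.15 when 3+ planning phrases appear.
    lower = text.lower()
    planning = sum(lower.count(p) for p in PLANNING_PHRASES)
    reason = "baseline"
    if planning >= 3:
        score -= 0.15
        reason = f"planning-heavy ({planning} phrases)"
    return round(min(max(score, 0.0), 1.0), 2), reason
```

Anything below 0.4 is rejected, so a 40-word output with no substance markers (base 0.3) fails, while the same length plus a few dollar figures and URLs passes.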
The 0.4 threshold was calibrated empirically. We scored 200 historical outputs by hand, marking each as “useful” or “garbage.” 0.4 produced the best separation. Below 0.35, we were rejecting some useful short outputs. Above 0.45, we were accepting some sophisticated garbage.
In our content production engine, we’re testing a more aggressive scoring system for blog articles. That one checks for specificity (numbers, dollar amounts, dates), AI anti-pattern compliance (banned vocabulary, em-dash punchlines, unnecessary juxtaposition), keyword optimization, voice consistency, and actionability. Articles score on a weighted composite and get rejected below 0.7. We’ve had articles fail on the anti-AI compliance check alone, which triggers a revision pass with explicit instructions about what to fix.
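The composite itself is a weighted sum over the five dimensions named above. The weights below are assumptions for illustration, not the production values:

```python
# Dimension names come from the article; weights are illustrative assumptions.
ARTICLE_WEIGHTS = {
    "specificity": 0.25,
    "anti_pattern_compliance": 0.25,
    "keyword_optimization": 0.15,
    "voice_consistency": 0.20,
    "actionability": 0.15,
}

def article_score(dimension_scores: dict[str, float]) -> float:
    # Each dimension is scored 0.0-1.0; a composite below 0.7 triggers
    # a revision pass with explicit instructions about what to fix.
    return sum(ARTICLE_WEIGHTS[d] * dimension_scores.get(d, 0.0)
               for d in ARTICLE_WEIGHTS)
```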
The Retry Cap
Both layers feed into a retry mechanism. When an output fails quality (either garbage detection or scoring below 0.4), the task goes back to “pending” status and gets attempted again in the next sprint cycle. But we cap retries at 3. After 3 garbage outputs for the same task, the task is permanently blocked and flagged for human review.
This is the budget protection mechanism. Without the retry cap, the system would spend unlimited API calls on tasks it fundamentally cannot complete. We’ve seen tasks fail 3 times because the context window lacked the necessary source material, because the task description was ambiguous, or because the task required real-time data that the system couldn’t access in synthesis-only mode.
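The state transition is simple enough to sketch directly. Assuming tasks are plain dicts (the field names here are illustrative):

```python
MAX_RETRIES = 3

def handle_quality_failure(task: dict) -> dict:
    """Update task state after an output fails garbage detection or scoring."""
    task["attempts"] = task.get("attempts", 0) + 1
    if task["attempts"] >= MAX_RETRIES:
        # Permanently blocked: no more API spend, flagged for human review.
        task["status"] = "blocked"
    else:
        # Back to pending: retried in the next sprint cycle.
        task["status"] = "pending"
    return task
```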
Of the 37 tasks in our legal research backlog, 33 completed successfully. The 4 that blocked were all in the funding category, waiting on external grant timelines. None of the 4 were blocked by quality gate failures. But we’ve seen other deployments where 15-20% of tasks hit the retry cap. The percentage depends on how well the backlog tasks match the agent’s actual capabilities.
Separation of Roles
The second quality system we’re running in production is structural: the agent that writes can never be the agent that reviews.
We learned this from our research systems. We run 6 agents across three research areas (Latin American economics, AI governance, AI and crypto) with 16 custom skills and a 4-gate quality control pipeline. The gates run in strict sequence: source verification, voice check, adversarial review, publication approval. Each gate is owned by a different agent.
The research-analyst writes drafts. The source-reviewer checks every citation for accuracy, relevance, data freshness, and format compliance. The quality-controller runs adversarial reviews, voice audits, and makes the final publication decision. These three roles are deployed as separate agent instances with separate system prompts, separate memory, and separate tool access.
When we first deployed the research system, we had the same agent write analyses and review them. The agent rubber-stamped its own work every time. The reviews were a paragraph of praise followed by “approved for publication.” We caught this after one poorly sourced article made it through the pipeline with fabricated citations that the writing agent described as “verified” in its own review.
The separation is enforced at the pipeline level. The source-reviewer’s instructions explicitly state: “You review sources. You don’t write drafts.” The quality-controller’s instructions start with: “You never produce original research content. You test it, break it, and decide whether it ships.” When a draft needs rewriting, the quality-controller sends it back to the research-analyst with specific instructions. The quality-controller identifies problems. It does not fix them.
This mirrors how functioning research departments work. The person who writes the report doesn’t sign off on their own fact-checking. The difference is that with AI agents, the temptation to collapse these roles is stronger because it’s “the same technology.” The technology doesn’t matter. The incentive structure matters. A reviewer that produced the original work has every reason to approve it.
The 4-Gate Quality Pipeline
For our research systems, the quality pipeline has four sequential gates. A gate failure stops the pipeline. Content cannot skip gates or pass them out of order.
Gate 1: Source Verification. Every citation uses inline hyperlink format. Every URL is fetched and confirmed live. Every statistic is traced to a Level 1-3 source (primary legislation, peer-reviewed research, or institutional reports). No [CITATION-NEEDED] or [NEEDS-VERIFICATION] tags remain. The source-reviewer agent runs this gate using WebFetch to verify every URL and WebSearch to check for newer data.
Gate 2: Voice Check. Authenticity score of 4 or higher on five dimensions: paragraph length variation, sentence length variation, template uniformity, specificity, and position-taking. Zero banned vocabulary. No structural AI tells. We maintain a list of anti-patterns: punchline em dashes, unnecessary “not X, but Y” juxtaposition, triads with an escalating third item, abstract paragraphs with no concrete detail, bland hedging (“it’s worth noting,” “interestingly”), fake sensory language (“robust solution,” “seamless platform”), and forced callbacks. The voice check runs programmatic scans (Grep against the banned list) and LLM-based assessment for the subtler patterns.
Gate 3: Adversarial Review. The quality-controller runs a red team pass on five dimensions: factual accuracy, logical coherence, selection bias, scope creep, and steelman counterarguments. For each major claim, the agent writes the strongest counterargument it can find, backed by real sources (not hypothetical objections). The agent then runs a pre-mortem: “If this research is wrong in 2 years, the most likely reasons are…” Every challenge must be either addressed in the text or documented in the limitations section.
Gate 4: Publication. Gates 1-3 all pass. Format matches the output template. Metadata is complete. Catalogue entry is prepared. Cross-team relevance tags are added so findings are discoverable by other research areas.
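The strict ordering can be expressed as a sequential runner where the first failure stops everything. The checks below are trivial stand-ins; in production each gate is a separate agent instance:

```python
from typing import Callable

Gate = tuple[str, Callable[[str], bool]]

def run_gates(draft: str, gates: list[Gate]) -> tuple[bool, str]:
    # Gates run in strict sequence; a failure stops the pipeline, so
    # content can never skip a gate or pass them out of order.
    for name, check in gates:
        if not check(draft):
            return False, f"failed at {name}"
    return True, "approved for publication"

# Stand-in checks for illustration only.
GATES: list[Gate] = [
    ("source_verification", lambda d: "[CITATION-NEEDED]" not in d),
    ("voice_check",         lambda d: "it's worth noting" not in d.lower()),
    ("adversarial_review",  lambda d: len(d) > 0),
    ("publication",         lambda d: True),
]
```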
The 4-gate system adds 2-3x the time per document compared to a write-and-ship workflow. We consider this the cost of not publishing garbage. Our research covers policy-relevant topics where one fabricated citation can discredit months of work.
Anti-Patterns List vs. Positive Instructions
One counterintuitive finding from building these systems: negative instructions work better than positive ones.
Telling an agent “write naturally” produces generic output. Telling an agent “do not use punchline em dashes, do not use unnecessary juxtaposition, do not use triads with escalating third items, do not use bland hedging phrases including ‘it’s worth noting’ and ‘interestingly’” produces measurably better writing. The agent needs concrete examples of what to avoid.
The same applies to garbage detection. Telling an agent “produce substantive output” doesn’t prevent planning-mode responses. Listing 50 specific patterns that constitute garbage does. Each pattern in our list exists because an agent produced that exact failure in production.
We’re maintaining a “banned words” list for our content engine with 19 entries: “leveraging synergies,” “disrupting,” “delve,” “utilize,” “game-changer,” “paradigm shift,” “move the needle,” “cutting-edge,” “state-of-the-art,” “revolutionary,” among others. We also maintain regex patterns for AI anti-patterns: em-dash punchlines followed by dramatic reveals, triads with escalating third items, bland hedging phrases. Articles that match any pattern get flagged for revision.
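A minimal version of that programmatic scan might look like this. The banned list is an excerpt from above, and the em-dash regex is an illustrative guess at one anti-pattern, not the production expression:

```python
import re

# Excerpt of the 19-entry banned words list.
BANNED_WORDS = ["delve", "utilize", "game-changer", "paradigm shift",
                "cutting-edge"]

# Illustrative regex for an em-dash punchline: an em dash followed by a
# short dramatic fragment that ends the sentence.
EM_DASH_PUNCHLINE = re.compile(r"\u2014\s*\w+(?:\s+\w+){0,3}[.!?]\s*$",
                               re.MULTILINE)

def flag_for_revision(article: str) -> list[str]:
    """Return the list of anti-patterns found; non-empty means revise."""
    lower = article.lower()
    flags = [w for w in BANNED_WORDS if w in lower]
    if EM_DASH_PUNCHLINE.search(article):
        flags.append("em-dash punchline")
    return flags
```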
The overhead of maintaining these lists is modest. We add 2-3 new patterns per week as we discover new failure modes. The ROI is high because each pattern prevents a class of failure rather than a single instance.
What This Costs
The quality engineering layer adds roughly 35-40% to the API cost of a deployment. For the legal daemon running at $20/day max, the quality checks (garbage detection + content scoring + retry cycles) account for approximately $7-8 of that budget. The quality-controller and source-reviewer agents in our research system consume about 30% of total tokens.
We consider this a good trade. The alternative is publishing garbage, which costs more in rework, reputation, and client trust than the API spend on prevention.
The human cost is more significant. Quality engineering is 80% of the development calendar for a new deployment. For the legal daemon, that meant 3 full rewrites of the quality layer over the first 2 weeks of operation. For the research system, it meant 7 days of a 10-day build sprint dedicated to standards documents, quality gates, agent separation, and testing. For the content engine, it meant building a scoring system with 5 weighted dimensions and a revision pipeline before writing a single article.
If your AI consultancy is showing you a demo and calling it a deployment, find a different consultancy. The demo is the first 20%. The quality engineering that makes it safe to run without supervision is the other 80%.
Synaptic turns businesses into AI-native organizations. We start where the demo ends. synaptic.so