Implementation Reality

The $50 mistake in 20 minutes: why budget controls aren't optional for AI agents

A runaway loop burned 40x the expected budget. Cost controls, circuit breakers and retry limits for multi-agent systems.

In the third week of operating our legal research pipeline, an analyst agent entered a revision cycle with the reviewer agent. The analyst submitted a draft. The reviewer suggested changes. The analyst revised. The reviewer found new problems in the revision. The analyst revised again.

Twelve iterations later, the output was worse than the original draft. And the accumulated cost of API calls was 40x the expected budget for that task.

There was no bug. No agent failed. The system did exactly what it was instructed to do: revise until quality was satisfactory. The problem is that “satisfactory” never arrived, because each revision introduced new elements that the reviewer questioned.

This type of failure doesn’t show up in demos. It shows up in production, when agents operate without continuous supervision and the only signal that something went wrong is the API invoice at the end of the day.

The real cost of uncontrolled agents

We’re building multi-agent systems for enterprise operations. 10 specialized agents covering legal research, market analysis, compliance, data engineering and commercial strategy. The system runs as a daemon, executing autonomous sprints against a task backlog.

The target operational cost is $20/day. At that budget, the system completes between 4 and 8 tasks per sprint, using Claude Sonnet 4 ($3.00/$15.00 per million tokens) for heavy reasoning tasks and Claude Haiku 4 ($0.80/$4.00 per million tokens) for routine tasks.

Without controls, this cost can scale unpredictably for three reasons:

Unlimited revision loops. An agent generates output, another evaluates, the first revises. If there’s no iteration limit, the cycle continues indefinitely. Each iteration consumes input tokens (the accumulated context grows with each round) and output tokens (the complete revision is regenerated).

Cascades between agents. In systems where agents trigger other agents’ actions, an upstream failure can generate cascading retries. Agent A fails, agent B tries to compensate, agent C receives inconsistent data and retries, each consuming API calls.

Ambiguous tasks without a stop condition. When the instruction is “research until you find sufficient data” and the data doesn’t exist, the agent keeps researching. Each attempt consumes tokens. Without an explicit cap, the agent runs until the API returns a rate limit error or the account balance runs out.

Control 1: Daily cap with automatic cutoff

The first control we implemented is the simplest: a daily spending ceiling in dollars.

In our system, the BudgetGate tracks every API call in a SQLite database. Every call logs the agent that made it, the model used, input and output tokens, and the calculated cost. Before each sprint, the system queries the total spent that day.

Daily cap: $20.00
Spent so far: $14.32
Available: $5.68
Status: within budget
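
As a sketch of how such a gate can work, here is a minimal BudgetGate in Python backed by SQLite. The table schema, class interface, and model identifiers are illustrative assumptions; only the per-million-token pricing and the $20 cap come from the system described above.

```python
import sqlite3
from datetime import date

# Pricing per million tokens (input, output), from the models mentioned above.
# The identifier strings are assumptions, not official model names.
PRICING = {
    "claude-sonnet-4": (3.00, 15.00),
    "claude-haiku-4": (0.80, 4.00),
}

class BudgetGate:
    """Minimal sketch of a daily spending gate backed by SQLite."""

    def __init__(self, db_path=":memory:", daily_cap=20.00):
        self.daily_cap = daily_cap
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS api_calls ("
            "  day TEXT, agent TEXT, model TEXT,"
            "  input_tokens INTEGER, output_tokens INTEGER, cost REAL)"
        )

    def log_call(self, agent, model, input_tokens, output_tokens):
        """Record one API call with its calculated cost."""
        in_price, out_price = PRICING[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.db.execute(
            "INSERT INTO api_calls VALUES (?, ?, ?, ?, ?, ?)",
            (date.today().isoformat(), agent, model,
             input_tokens, output_tokens, cost),
        )
        return cost

    def spent_today(self):
        """Total dollars spent today across all agents."""
        row = self.db.execute(
            "SELECT COALESCE(SUM(cost), 0) FROM api_calls WHERE day = ?",
            (date.today().isoformat(),),
        ).fetchone()
        return row[0]

    def can_start_sprint(self):
        """Hard cutoff: no negotiation once the cap is reached."""
        return self.spent_today() < self.daily_cap
```

The important property is that the check happens before the sprint starts, not after the invoice arrives.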

When the day’s spending hits the cap, the next sprint is canceled. The message is direct: “Sprint skipped: daily budget exhausted ($20.00 cap reached).” The system doesn’t try to negotiate, doesn’t run “just one more task,” doesn’t make exceptions. It stops.

This hard cutoff has already prevented at least three incidents in the first 30 days of operation. In all cases, the pattern was the same: a sequence of tasks generated low-quality output, the system tried to reprocess, and reprocessing consumed more tokens than the original execution.

$20/day may seem conservative for a system with 10 agents. It’s intentional. In a system under validation, the goal isn’t to maximize throughput. It’s to understand cost per task and calibrate controls before increasing volume.

Control 2: Cost per task and per agent

The daily cap prevents total cost from spiraling. Per-task tracking shows where the money is going.

The system generates reports with per-agent breakdown:

Budget Report (2025-08-11)
Spent: $16.2340 / $20.00 ($3.7660 remaining)

Per-agent breakdown:
  legal-orchestrator: 12 calls, 45200 in / 8900 out, $7.4820
  corpus-engineer: 6 calls, 22100 in / 5400 out, $3.8730
  market-researcher: 4 calls, 18300 in / 3200 out, $2.9340
  compliance-specialist: 3 calls, 9800 in / 2100 out, $1.9450

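A report like this falls out of a single GROUP BY over the per-call log. A minimal sketch, assuming a hypothetical schema of one row per API call (the column names and sample rows are illustrative):

```python
import sqlite3

# Hypothetical per-call log matching the report above: one row per API call
# with the agent, model, token counts, and calculated cost.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE api_calls ("
    "  agent TEXT, model TEXT,"
    "  input_tokens INTEGER, output_tokens INTEGER, cost REAL)"
)
db.executemany(
    "INSERT INTO api_calls VALUES (?, ?, ?, ?, ?)",
    [
        ("legal-orchestrator", "claude-sonnet-4", 22000, 4500, 3.70),
        ("legal-orchestrator", "claude-sonnet-4", 23200, 4400, 3.78),
        ("corpus-engineer", "claude-haiku-4", 22100, 5400, 3.87),
    ],
)

def budget_report(db):
    """Aggregate calls, tokens, and cost per agent, most expensive first."""
    rows = db.execute(
        "SELECT agent, COUNT(*), SUM(input_tokens), SUM(output_tokens), SUM(cost)"
        " FROM api_calls GROUP BY agent ORDER BY SUM(cost) DESC"
    ).fetchall()
    return [
        f"{agent}: {calls} calls, {tin} in / {tout} out, ${cost:.4f}"
        for agent, calls, tin, tout, cost in rows
    ]

for line in budget_report(db):
    print(line)
```
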
This level of visibility reveals patterns that an aggregate cap hides. In the example above, the orchestrator consumes 46% of the daily budget. That’s expected: it plans sprints, coordinates teams and generates reports. But if the orchestrator starts consuming 70% of the budget, we know there’s a design problem in the planning system.

We’re testing per-department budget allocations. The idea is that each team (market validation, technical, funding, compliance) gets a slice of the daily budget. If the market team consumes its entire slice, the other teams keep operating. A localized problem doesn’t bring down the entire system.

Control 3: Retry limit with permanent blocking

The control that solved the revision loop problem was the simplest to implement: a limit of 3 attempts per task.

The flow works like this:

  1. The agent tries to execute the task.
  2. The output passes through a quality gate (a list of 50+ garbage detection patterns plus a content score from 0 to 1).
  3. If the output is rejected (score below 0.4 or garbage pattern detected), the task goes back to the queue.
  4. On the next execution, the system checks the retry counter.
  5. If the counter has reached 3, the task is marked as “blocked” and removed from the active queue.

Blocked tasks escalate to human review. The system doesn’t try to solve what it couldn’t solve in 3 attempts. This rule exists because we observed that if an agent fails 3 times on the same task, the fourth attempt almost never produces a different result. The problem is usually in the task (ambiguous, dependent on unavailable external data, or poorly defined), not in the execution.
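
The retry decision in steps 3 to 5 reduces to a few lines. A sketch, with a hypothetical task dict standing in for the real queue entry:

```python
MAX_RETRIES = 3

def next_status(task):
    """Decide what happens to a task whose output failed the quality gate.

    `task` is a hypothetical dict with a `retries` counter. After the third
    failed attempt the task is blocked and escalates to human review.
    """
    task["retries"] += 1
    if task["retries"] >= MAX_RETRIES:
        task["status"] = "blocked"   # removed from the active queue
    else:
        task["status"] = "pending"   # back to the queue for another attempt
    return task["status"]
```
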

In the first 37 tasks of the legal backlog, 33 were completed successfully. Of the 4 blocked, all failed due to external funding dependencies. The retry limit didn’t block any viable task. It only blocked the ones that genuinely couldn’t be completed with available resources.

Control 4: Garbage output detection

Before counting retries, the system needs to identify when output is garbage. We built two detection layers.

The first layer is a list of 50+ text patterns. Each pattern was added after a real failure. Examples:

  • “as an AI, I cannot” (model refusal)
  • “let me search for” (agent planning instead of executing)
  • “no results found” (empty search presented as a result)
  • “I don’t have access to” (permission complaint)
  • “the next step would be” (planning without execution)
  • “function_call” (agent trying to invoke tools when in synthesis-only mode)
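
The first layer is cheap to implement: case-insensitive substring matching against the pattern list. A sketch with the six example patterns above (the real list has 50+, and the actual matching logic may differ):

```python
# Subset of the garbage pattern list; each entry was added after a real failure.
GARBAGE_PATTERNS = [
    "as an ai, i cannot",      # model refusal
    "let me search for",       # planning instead of executing
    "no results found",        # empty search presented as a result
    "i don't have access to",  # permission complaint
    "the next step would be",  # planning without execution
    "function_call",           # tool invocation in synthesis-only mode
]

def is_garbage(output: str) -> bool:
    """Case-insensitive substring match against known failure patterns."""
    text = output.lower()
    return any(pattern in text for pattern in GARBAGE_PATTERNS)
```
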

The second layer is a quality score from 0 to 1. The function analyzes:

  • Length: output under 30 words gets 0.1. Over 100 words, +0.1. Over 300, +0.1.
  • Substance markers: presence of dollar values, percentages, URLs, dates, formatted tables and structured headers. Each marker adds 0.05, up to a maximum bonus of 0.3.
  • Planning penalty: if the output contains 3 or more phrases like “we should,” “the plan is,” “steps to take,” the score drops by 0.15.

Any output scoring below 0.4 is rejected. In practice, this threshold eliminates output that looks plausible on the surface but contains no concrete information. An agent that produces 500 words of market analysis without a single number, date or specific source gets a score of 0.3 and goes back for reprocessing.
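
The second layer can be sketched as a small scoring function. The length bonuses, marker bonus cap, planning penalty, and 0.4 threshold below come from the rules above; the base score of 0.3 and the exact marker regexes are assumptions, since the article doesn't specify them:

```python
import re

PLANNING_PHRASES = ["we should", "the plan is", "steps to take"]

def quality_score(output: str) -> float:
    """Score output from 0 to 1 using the heuristics described above."""
    words = output.split()
    if len(words) < 30:
        return 0.1                 # too short to contain substance
    score = 0.3                    # assumed base score (not in the article)
    if len(words) > 100:
        score += 0.1
    if len(words) > 300:
        score += 0.1
    # Substance markers: each adds 0.05, capped at a +0.3 bonus.
    markers = [
        r"\$\d",                   # dollar values
        r"\d+%",                   # percentages
        r"https?://",              # URLs
        r"\b\d{4}-\d{2}-\d{2}\b",  # dates (ISO form assumed)
        r"^\|.+\|$",               # formatted table rows
        r"^#{1,6} ",               # structured headers
    ]
    bonus = sum(
        0.05 for pattern in markers
        if re.search(pattern, output, re.MULTILINE)
    )
    score += min(bonus, 0.3)
    # Planning penalty: 3+ planning phrases cost 0.15.
    text = output.lower()
    if sum(text.count(p) for p in PLANNING_PHRASES) >= 3:
        score -= 0.15
    return min(max(score, 0.0), 1.0)
```
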

Control 5: Circuit breaker between sprints

Each sprint runs between 2 and 4 execution tracks. Between tracks, the system checks the remaining budget. If the cap was reached during the previous track, the next track is canceled with status “skipped_budget.”

This is the circuit breaker: the ability to stop mid-sprint when the numbers don’t work. Without it, a sprint with 4 tracks would execute all 4 even if the first consumed 80% of the daily budget.

The system also automatically resets tasks left stuck in “in_progress” by sprints that failed. If the daemon dies mid-sprint (process crash, API timeout, server restart), stuck tasks return to “pending” on the next cycle. Without this mechanism, ghost tasks would occupy backlog slots indefinitely.
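
Both mechanisms, the between-track budget check and the stuck-task reset, fit in a short sprint loop. A sketch, assuming a gate object like the BudgetGate described earlier and tracks as plain callables (the names are illustrative, not the actual implementation):

```python
def run_sprint(tracks, gate, tasks):
    """Run a sprint's tracks with a budget circuit breaker between them."""
    # Reset ghost tasks left "in_progress" by a crashed previous sprint.
    for task in tasks:
        if task["status"] == "in_progress":
            task["status"] = "pending"

    results = []
    for track in tracks:
        # Check remaining budget between tracks, not just at sprint start.
        if not gate.can_start_sprint():
            results.append("skipped_budget")
            continue
        results.append(track())
    return results
```
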

What the numbers show

Initial results from the first 30 days of operation with all controls active:

Metric                               Value
Tasks completed                      33 of 37
Tasks blocked (retry limit)          4
Tasks blocked by garbage output      0 (all rejections resolved within 3 attempts)
Average daily cost                   $14-18
Average cost per completed task      $6-9
Sprints canceled for budget          3
Uncontrolled cost incidents          0 (after controls implementation)

The $6-9 cost per completed task replaces work that would cost $50-200 if done by a junior analyst (considering salary, management, infrastructure). The ROI is clear. But it only exists because the controls keep cost predictable.

Without the controls, the 40x incident would have happened repeatedly. Not through malice from the system. Through absence of explicit limits.

Implementing controls in practice

For anyone building multi-agent systems in production, these are the minimum controls we’re validating:

Day 1: Daily cap in dollars. Any amount. The exact number matters less than the limit’s existence. Adjust after observing the actual consumption pattern.

Week 1: Per-agent and per-task tracking. Log every API call with the agent name, model, tokens and cost. Without this visibility, optimizing cost is impossible.

Week 2: Retry limit. 3 is a reasonable number. Tasks that fail 3 times need human intervention, not a fourth attempt.

Week 3: Garbage output detection. Start with 10-15 patterns based on your system’s real failures. The list will grow organically. Ours has 50+ and keeps increasing.

Week 4: Circuit breakers between execution stages. Check budget before each stage. Stop mid-process if necessary.

None of these controls is sophisticated. They’re simple checks with basic conditional logic. The sophistication is in having all of them operating simultaneously, with enough logging to diagnose problems after the fact.

What comes next

We’re working on three extensions of the current controls:

Spending anomaly detection. Today, the system only reacts when the cap is reached. We’re building alerts for when cost per task deviates more than 2x from the historical average. If a task that normally costs $5 is at $12 after the second attempt, the system should flag it before the third.
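
The 2x-deviation rule itself is simple to express. The function below is a hypothetical sketch of the check, not the actual implementation:

```python
def is_cost_anomaly(cost_so_far, task_history, factor=2.0):
    """Flag a task whose running cost exceeds `factor` times the
    historical average cost of similar tasks."""
    if not task_history:
        return False  # no baseline yet, nothing to compare against
    avg = sum(task_history) / len(task_history)
    return cost_so_far > factor * avg
```
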

Per-department budget. Instead of a single daily cap, each agent team gets a proportional allocation. The market team gets 30%, technical gets 35%, funding gets 20%, compliance gets 15%. Failure in one team doesn’t compromise the others.

Pre-execution cost forecasting. Before starting a sprint, estimate the likely cost based on the history of similar tasks. If the forecast exceeds the available budget, reduce the number of tracks or defer more expensive tasks.

The central point is that cost control for AI agents is not a “nice to have” feature you add when the system is mature. It’s basic infrastructure that needs to exist before the first autonomous execution. The cost of not having controls isn’t a high invoice at the end of the month. It’s the inability to trust the system to operate without constant supervision. And if you need constant supervision, the system isn’t really autonomous.


Synaptic transforms companies into AI-native organizations. We start where the demo ends. synaptic.so