Proof of Deployment

The $20/Day AI Legal Department: System Architecture

33 of 37 tasks done, zero human intervention, $20/day cap. Architecture of our legal research engine.

33 of 37 tasks completed. $20/day operating cost. Zero human intervention during autonomous execution. The 4 remaining tasks are blocked on an external funding dependency, not a system failure.

Those are the numbers from our autonomous legal research engine, a system we built in 3 days and ran in production for weeks. It replaced a research function that would have required 3-5 junior analysts at $2-4K/month each.

This is how it works, what broke along the way, and what it would look like deployed for a client.

The Problem

A legal technology venture needed to validate five kill criteria simultaneously: market demand for Ecuadorian legal search tools, corpus feasibility (could the data even be scraped and indexed), AI accuracy thresholds, sales cycle length, and product-market fit. The founding team had zero legal researchers and no budget for full-time analysts.

The traditional path: hire 3-5 junior researchers, manage them through weekly standups, wait 3-6 months for results. Total cost: $6-12K/month in salary alone, plus management overhead.

We took a different path.

The Architecture

The system is a Flask application running as a systemd service on a DigitalOcean VPS (2 vCPU, 4GB RAM, $24/month). It uses APScheduler for 6 scheduled jobs and a continuous backlog worker thread for sprint execution. The codebase is ~3,750 lines of Python across 26 files, plus 35 agent definition documents.

The sprint loop works like this:

  1. Load memory context from agent-memory markdown files and pipeline tracking documents
  2. The orchestrator agent plans 2 research tracks from the backlog
  3. Execute tracks sequentially (budget gate checked between each)
  4. Score output quality, reject anything below 0.4
  5. Write sprint history, update the kill criteria dashboard
  6. Post an hourly digest to Slack (suppressed if nothing happened)
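
The loop above can be sketched in a few lines of Python. This is a simplified stand-in, not the production code: the executor and scorer are injected as toy callables, and the orchestrator's planning step is reduced to taking the top of the backlog. The control flow (plan 2 tracks, execute sequentially behind a budget gate, reject below 0.4) matches the description above.

```python
# Simplified sketch of the sprint loop. The executor and scorer are injected
# so the sketch stays self-contained; production wires in real agents and memory.
QUALITY_FLOOR = 0.4   # outputs scoring below this are rejected
DAILY_CAP = 20.0      # dollars per day

def run_sprint(backlog, execute, score, spend_so_far):
    """Plan 2 tracks, run them sequentially with a budget gate, keep passing output."""
    tracks = backlog[:2]                  # stand-in for the orchestrator's planning step
    accepted, spend = [], spend_so_far
    for track in tracks:
        if spend >= DAILY_CAP:            # budget gate checked between tracks
            break
        output, cost = execute(track)     # one agent, one task at a time
        spend += cost
        if score(output) >= QUALITY_FLOOR:
            accepted.append(output)
    return accepted, spend
```

The sprint-history write and dashboard update would follow the loop; they are omitted here.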

Sequential execution was a deliberate choice. Our local API proxy cannot handle parallel requests, and sequential processing simplifies debugging. When something fails, the error trace points to exactly one agent doing exactly one thing. Parallel execution is a straightforward upgrade when infrastructure supports it.

10 Agents, 4 Teams

Each agent has a specific role and a specific model assignment. The heavier reasoning tasks run on Claude Sonnet 4 ($3.00/$15.00 per million tokens). The more routine tasks run on Claude Haiku 4 ($0.80/$4.00 per million tokens).

  - legal-orchestrator: sprint planning, coordination, and dashboard updates
  - corpus-engineer: the data pipeline (scraping, OCR, indexing)
  - product-architect: system design, API specs, and architecture decisions
  - sales-strategist: go-to-market analysis, pricing research, and design partner outreach
  - compliance-specialist: LOPDP (Ecuador’s data protection law), SOC 2 preparation, and regulatory requirements
  - market-researcher: competitive analysis and market sizing
  - grant-writer: funding applications and pitch materials
  - legal-domain-expert: validation of Ecuadorian law interpretations and corpus quality
  - ux-designer: research into lawyer workflows and interaction patterns
  - growth-hacker: user acquisition and retention strategies

These 10 agents combine into 4 team configurations depending on the sprint focus. Market validation teams pair the orchestrator with sales, market research, and legal domain agents. Technical teams combine the orchestrator with corpus engineering, product architecture, and compliance. Funding teams bring together the orchestrator, grant writer, market researcher, and sales strategist. Full-launch sprints use all 10.
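
As a sketch, those configurations reduce to a plain mapping from sprint focus to agent roster. The dictionary below is illustrative, not the production config format; the rosters follow the descriptions above.

```python
# The 4 team configurations as a focus -> roster mapping (illustrative format).
TEAMS = {
    "market_validation": ["legal-orchestrator", "sales-strategist",
                          "market-researcher", "legal-domain-expert"],
    "technical": ["legal-orchestrator", "corpus-engineer",
                  "product-architect", "compliance-specialist"],
    "funding": ["legal-orchestrator", "grant-writer",
                "market-researcher", "sales-strategist"],
    "full_launch": ["legal-orchestrator", "corpus-engineer", "product-architect",
                    "sales-strategist", "compliance-specialist", "market-researcher",
                    "grant-writer", "legal-domain-expert", "ux-designer",
                    "growth-hacker"],
}

def roster(sprint_focus: str) -> list[str]:
    """Every team includes the orchestrator; the rest varies by focus."""
    return TEAMS[sprint_focus]
```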

The Quality Engineering Story

This is the part that matters most and that most AI demos skip entirely.

The first version of the daemon had no quality gates. We pointed 10 agents at a research backlog and let them run. The output was technically fluent and substantively worthless. Agents produced 2,000-word analyses with correct legal terminology arranged in meaningless patterns. Confident-sounding nonsense. One agent wrote a detailed “market analysis” that was actually a restatement of its own instructions, padded with generic observations about the legal tech sector.

The second version added basic output validation. It caught the worst garbage (empty responses, explicit refusals, obvious placeholder text) but still let through a category we started calling “sophisticated garbage”: outputs that passed superficial checks but contained no real information. An agent might produce a well-formatted competitive analysis with headers, bullet points, and percentage figures, where every percentage was invented and every company name was real but the claims about them were fabricated.

The third version, the one running in production, has three layers of quality control.

Garbage detection: 50+ string patterns catch refusals (“as an AI, I cannot”), capability complaints (“I don’t have access to”), planning-without-doing (“let me search for”), empty search results presented as findings, and tool/function call confusion (the agent trying to invoke tools when running in synthesis-only mode). Each pattern was added after a specific failure. The list grew over the first week of operation as we identified new failure modes.
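
A minimal version of that layer is just case-insensitive substring matching over a curated list. The patterns below are a handful reconstructed from the examples above; the production list has 50+.

```python
# A few representative garbage patterns; the production list has 50+, each
# added after a specific observed failure. Matching is substring, lowercase.
GARBAGE_PATTERNS = [
    "as an ai, i cannot",       # refusal
    "i don't have access to",   # capability complaint
    "let me search for",        # planning-without-doing
    "no results found",         # empty search presented as findings
    "<function_call>",          # tool/function call confusion
]

def is_garbage(text: str) -> bool:
    lowered = text.lower()
    return any(pattern in lowered for pattern in GARBAGE_PATTERNS)
```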

Content scoring: Every output gets scored on a 0.0 to 1.0 scale. Baseline of 0.3 for any content that exists. Bonuses for word count (100+ words = +0.1, 300+ words = +0.1), substance markers (dollar amounts, percentages, URLs, dates, tables, structured headers = up to +0.3 combined). Penalty of -0.15 for planning-heavy language. Anything below 0.4 gets rejected.
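
A sketch of the scorer. The per-marker weights are an assumption: the spec above says "up to +0.3 combined" without giving a split, so this version assigns +0.05 per marker, and the marker regexes (e.g. ISO-format dates) are illustrative.

```python
import re

def score_output(text: str) -> float:
    """0.0-1.0 content score; anything below 0.4 is rejected."""
    if not text.strip():
        return 0.0
    score = 0.3                              # baseline for any content that exists
    words = len(text.split())
    if words >= 100:
        score += 0.1
    if words >= 300:
        score += 0.1
    substance = 0.0                          # substance markers, capped at +0.3
    for pattern in (r"\$\d",                 # dollar amounts
                    r"\d+%",                 # percentages
                    r"https?://",            # URLs
                    r"\b\d{4}-\d{2}-\d{2}\b",  # dates (ISO form, illustrative)
                    r"\|",                   # tables
                    r"(?m)^#+ "):            # structured headers
        if re.search(pattern, text):
            substance += 0.05                # assumed even split of the +0.3 cap
    score += min(substance, 0.3)
    if "let me" in text.lower() or "i will now" in text.lower():
        score -= 0.15                        # planning-heavy language penalty
    return max(0.0, min(1.0, score))
```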

Retry cap: 3 attempts per task. After 3 garbage outputs, the task is permanently blocked and flagged for human review. This prevents the system from burning budget on tasks it cannot complete. Of the 37 total tasks, 33 completed successfully. The 4 that blocked were all in the funding category, waiting on external grant timelines rather than failing quality gates.
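
The retry cap reduces to a small wrapper around the executor and scorer (function names here are hypothetical):

```python
MAX_ATTEMPTS = 3   # after 3 garbage outputs, block the task permanently

def attempt_task(task, run, score, floor=0.4):
    """Return ('done', output) on success, or ('blocked', None) for human review."""
    for _ in range(MAX_ATTEMPTS):
        output = run(task)
        if score(output) >= floor:
            return "done", output
    return "blocked", None   # flagged; no more budget is spent on this task
```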

One architectural decision worth explaining: agents run with tools=[] during sprint execution. This means they synthesize from the context loaded into their prompt rather than attempting live tool calls. Early versions tried to invoke web search and file operations mid-sprint, which caused permission errors and confused output. By restricting agents to synthesis-only mode during execution (real tool usage happens in the orchestrator’s planning phase), we eliminated an entire category of runtime failures.
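
That constraint is one conditional at the call site. The sketch below mirrors the shape of the Anthropic Messages API, but the agent object, `build_prompt` helper, and phase names are hypothetical stand-ins; the point is that sprint-phase calls pass tools=[].

```python
def build_prompt(agent, context):
    # Hypothetical helper: fold pre-loaded memory context into the prompt.
    return f"{agent.instructions}\n\nContext:\n{context}"

def run_agent(client, agent, context, phase):
    # Tools are only exposed during the orchestrator's planning phase;
    # sprint execution runs synthesis-only, eliminating runtime tool failures.
    tools = agent.tool_definitions if phase == "planning" else []
    return client.messages.create(
        model=agent.model,
        max_tokens=4096,
        tools=tools,
        messages=[{"role": "user", "content": build_prompt(agent, context)}],
    )
```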

Budget Controls

The system enforces a $20/day cap via a SQLite api_calls table. Before each track execution, the daemon checks cumulative daily spend. If the cap is reached, the sprint pauses until midnight UTC.

Every API call logs its cost: input tokens multiplied by the model’s input rate, plus output tokens multiplied by its output rate. End-of-day summaries are logged without deleting data (date-filtered queries handle the accounting). The health endpoint at /health reports current daily spend, remaining budget, backlog statistics, and worker state.
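
A self-contained sketch of the ledger, assuming an illustrative api_calls schema (the production table and column names may differ). Costs are computed from per-million-token rates, and daily spend is a date-filtered sum rather than a delete.

```python
import sqlite3
from datetime import datetime, timezone

SONNET_IN, SONNET_OUT = 3.00, 15.00   # $/M tokens, per the pricing above

def open_ledger(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS api_calls (
        ts TEXT, model TEXT, input_tokens INTEGER,
        output_tokens INTEGER, cost REAL)""")
    return db

def log_call(db, model, input_tokens, output_tokens, in_rate, out_rate):
    """Record one API call and return its dollar cost."""
    cost = input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate
    db.execute("INSERT INTO api_calls VALUES (?,?,?,?,?)",
               (datetime.now(timezone.utc).isoformat(), model,
                input_tokens, output_tokens, cost))
    return cost

def daily_spend(db):
    """Date-filtered sum; nothing is deleted at the daily reset."""
    today = datetime.now(timezone.utc).date().isoformat()
    return db.execute("SELECT COALESCE(SUM(cost), 0) FROM api_calls WHERE ts LIKE ?",
                      (today + "%",)).fetchone()[0]

def under_cap(db, cap=20.0):
    return daily_spend(db) < cap
```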

Monthly operating cost: $224-624 depending on sprint frequency. The VPS costs $24/month. The API costs range from $200-600/month. Slack and Notion are $0 incremental since the workspace already exists.

Compare that to the alternative: 3 junior researchers at $2-4K/month each, plus a manager spending 5-10 hours per week on coordination. $6-12K/month, minimum.

Self-Healing

After 3 consecutive sprint failures, the daemon runs a diagnostic sprint. It reads recent error logs, diagnoses the root cause, recommends fixes, and posts the diagnosis to Slack. If the diagnostic sprint also fails, the system pauses completely and waits for a manual POST /reset-failures call.
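
The escalation logic is a small state machine. This is a sketch with hypothetical names; the diagnostic sprint itself (log reading, root-cause analysis, Slack posting) is reduced to an injected callable.

```python
class FailureTracker:
    """Sketch of the escalation: 3 strikes, run a diagnostic, else pause."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive = 0
        self.paused = False

    def record(self, sprint_ok: bool, run_diagnostic):
        if sprint_ok:
            self.consecutive = 0
            return
        self.consecutive += 1
        if self.consecutive >= self.threshold:
            if run_diagnostic():      # diagnostic sprint found and posted a fix
                self.consecutive = 0
            else:                     # diagnostic also failed: full stop
                self.paused = True

    def reset(self):
        """Handler behind the manual POST /reset-failures call."""
        self.consecutive = 0
        self.paused = False
```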

This happened twice during the first week. Once when the API proxy hit a rate limit we hadn’t anticipated, and once when a memory consolidation job ran during a sprint and created a file lock conflict. Both times, the diagnostic sprint correctly identified the issue and the fix was applied in under 10 minutes.

Scheduled Operations

The daemon runs 6 scheduled jobs beyond the core sprint loop:

  - Hourly digests go to Slack with a clean status update, suppressed if nothing happened in the previous hour so the channel stays readable.
  - Daily reports at 13:00 UTC include an AI-generated summary with budget data and progress metrics.
  - Kill criteria reviews run every Monday at 14:00 UTC with a detailed GO/NO-GO assessment across all 5 criteria.
  - Memory consolidation runs every Sunday at 06:00 UTC, merging the dozens of small memory files generated during the week into 4 category documents (market research, technical findings, funding intelligence, compliance analysis).
  - An approval timeout checker runs every 15 minutes to handle stale Slack approval requests.
  - A budget reset job runs at midnight UTC to log the end-of-day summary.

What a Client Deployment Looks Like

The framework is designed to be redeployed for different domains. The sprint loop architecture, quality scoring algorithm, budget tracking, SQLite schema, Flask health endpoint, APScheduler integration, Slack interaction handler, continuous backlog worker pattern, self-healing diagnostic, and memory consolidation pipeline are all fixed infrastructure.

What changes per client: the number of agents, their instructions, team compositions, quality detection patterns (a law firm needs different garbage patterns than a logistics company), budget caps, sprint frequency, Slack channels, Notion databases, kill criteria, scheduled job timing, and domain-specific scoring bonuses.

Deployment timeline:

| Phase | Duration | Activities |
| --- | --- | --- |
| Discovery and scoping | 3-5 days | Define research questions, backlog, domain expertise, integrations |
| Agent customization | 3-5 days | Rewrite agent instructions, adjust team compositions, configure quality patterns |
| Integration setup | 2-3 days | Slack app, Notion databases, API keys, VPS provisioning |
| Testing and calibration | 3-5 days | Run test sprints, tune quality thresholds, validate output |
| Handoff and monitoring | 2-3 days | Documentation, monitoring setup, Slack training |
| **Total** | **13-21 days** | |

The Honest Assessment

This system is good at research synthesis: taking a defined question, gathering relevant context, producing structured analysis, and iterating on quality. It completed 89% of its assigned tasks autonomously.

It is not good at tasks requiring real-time external data access (the synthesis-only constraint means agents work from pre-loaded context, not live web searches during sprints). It is not good at tasks with external dependencies (the 4 blocked tasks were waiting on grant funding timelines that no amount of agent intelligence could accelerate). And the quality engineering layer went through 3 full versions before it stopped producing garbage.

Those 3 versions are the important part. Anyone selling AI agent deployments who claims their system worked on the first try is either lying or hasn’t tested it against real tasks. The quality engineering is the product. The agents are commodity infrastructure.

For a legal research function, a compliance analysis team, a market intelligence unit, or any department that runs on structured research and synthesis, this architecture works. $224-624/month instead of $6-12K/month, with output quality that improves over time as the memory system accumulates domain knowledge.


Synaptic builds autonomous AI systems that replace departments, not people. 13-21 day deployment. synaptic.so