Human-in-the-Loop Is Not a Safety Net. It Is an Architecture Decision.
Three authority levels, designed escalation points, and why treating human oversight as a checkbox produces worse outcomes than no oversight at all.
The agent prepares. The human approves. The agent executes.
That three-step loop sounds simple. In practice, most teams get it wrong in one of two ways. They either remove the human entirely and discover their agent sent a hallucinated contract to a client, or they insert the human into every step and discover that nobody reads the 47th Slack notification of the day. Both failure modes produce the same result: the human oversight isn’t functioning.
We’re building agent systems across Odisea’s six business units right now. 90+ agent roles, 10 distinct systems, real tasks with real consequences. Somewhere around week three of operating our legal research daemon, we realized that human-in-the-loop is not a feature you bolt on at the end. It is a structural decision that determines whether the entire system works or generates expensive noise.
This is the framework we’re using.
Three Authority Levels
Every agent action in our systems falls into one of three tiers. The tier assignment happens at design time, not at runtime. The agent doesn’t decide how much oversight it needs. The architect does.
T1: Autonomous. The agent executes and logs. No human sees the output before it takes effect. The agent records what it did, and someone can review the log later if they want to. Most agents spend most of their time here.
Examples from our running systems: a research agent pulling market data from public sources. A corpus engineer indexing legal documents. A memory consolidation job merging the week’s findings into category files. A metrics agent calculating daily spend against budget caps.
The criteria for T1 classification: the action is reversible, the blast radius is limited to internal state, and the cost of a mistake is measured in wasted compute rather than damaged relationships or lost money.
T2: Notify. The agent executes and informs. The action happens immediately, but a human gets a structured notification showing what happened and why. The human can intervene after the fact if something looks wrong.
Examples: an outreach agent sending a first-touch email to a prospect from a pre-approved template. A sales agent updating a CRM pipeline stage. A content agent publishing a draft to an internal review queue. A monitoring agent flagging a budget threshold breach.
The criteria for T2: the action has external visibility or affects a shared workflow, but the cost of a single bad execution is low and correctable within hours.
T3: Wait. The agent drafts and awaits explicit human approval before anything happens. This is the only tier where the agent is blocked on a human decision.
Examples: sending a proposal to a client. Signing up for a paid service. Publishing content to the company website. Modifying a production database schema. Replying to an email from a prospect who has asked a question the agent hasn’t encountered before.
The criteria for T3: the action is irreversible or high-stakes, the blast radius extends to external stakeholders, or the cost of a mistake is measured in trust rather than time.
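The three sets of criteria can be expressed as a small design-time helper. This is a minimal sketch, not our production code: the `ActionProfile` fields are illustrative stand-ins for judgments the architect makes when mapping an agent's action types.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    T1_AUTONOMOUS = 1  # execute and log
    T2_NOTIFY = 2      # execute and inform
    T3_WAIT = 3        # draft and await approval

class BlastRadius(Enum):
    INTERNAL = 1  # the agent's own state
    SHARED = 2    # shared workflows: CRM, internal queues, channels
    EXTERNAL = 3  # clients, prospects, production systems

@dataclass
class ActionProfile:
    reversible: bool
    blast_radius: BlastRadius
    high_stakes: bool  # cost measured in trust or money, not compute

def classify(action: ActionProfile) -> Tier:
    # T3: irreversible, external-facing, or high-stakes actions block on a human.
    if (not action.reversible
            or action.blast_radius is BlastRadius.EXTERNAL
            or action.high_stakes):
        return Tier.T3_WAIT
    # T2: visible in shared workflows, cheap to correct after the fact.
    if action.blast_radius is BlastRadius.SHARED:
        return Tier.T2_NOTIFY
    # T1: reversible, internal, a mistake costs only wasted compute.
    return Tier.T1_AUTONOMOUS

# A CRM pipeline-stage update: reversible, shared workflow, low stakes.
crm_update = ActionProfile(reversible=True,
                           blast_radius=BlastRadius.SHARED,
                           high_stakes=False)
```

The function runs at design time, when the architect fills in the authority mapping, not at runtime inside the agent.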
Why Three Tiers Instead of Two
The instinct most teams have is binary: either the agent acts alone, or a human approves every action. The problem with binary classification is that it forces you to choose between two bad options for a large category of actions that fall in between.
Take CRM updates. If you classify them as T1 (fully autonomous), you get an agent that silently moves a deal from “Qualified” to “Proposal Sent” without anyone knowing until the pipeline report looks wrong. If you classify them as T3 (human approval required), you get an approval queue that backs up with 30 pipeline-stage changes per day, each one trivial, each one training the human to click “Approve” without reading.
T2 solves this. The agent moves the deal stage immediately (because speed matters in sales operations) and posts a summary to the sales channel. If the move was wrong, someone catches it within hours. The action is visible without being blocked.
The three-tier model also makes the design conversation concrete. Instead of debating whether an agent “should have human oversight” (a question with no useful answer at that level of abstraction), the team asks: “Is this action reversible? What’s the blast radius? What’s the cost of a wrong execution?” Those questions have specific answers that map directly to T1, T2, or T3.
Penelope’s Approval Workflow
Penelope is our podcast production agent. She monitors email, researches potential guests, drafts replies, and manages calendar scheduling through Notion. The system handles dozens of interactions per day across multiple email threads.
Her email reply workflow is a clean T3 implementation. When Penelope drafts an email, she posts the draft to a Slack channel with three buttons: Approve, Reject, Edit. The human reads the draft, makes a decision, and the agent either sends, discards, or waits for an edited version.
This works for a specific reason: the approval request contains exactly enough context for a decision and nothing more. The Slack message shows the recipient, the subject line, and the full draft text. The human doesn’t need to understand how Penelope found this email, what search queries she ran, what alternative drafts she considered, or what her confidence score was. They need to answer one question: “Should this email go out as written?”
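A message like this maps naturally onto Slack's Block Kit format. Here is a sketch of how the payload might be built; the `action_id` values and request-ID scheme are illustrative, not Penelope's actual implementation.

```python
def build_approval_blocks(recipient: str, subject: str,
                          draft: str, request_id: str) -> list[dict]:
    """Build a Slack Block Kit message: exactly enough context for a decision."""
    return [
        # The decision context: recipient, subject, full draft. Nothing else.
        {"type": "section",
         "text": {"type": "mrkdwn",
                  "text": f"*To:* {recipient}\n*Subject:* {subject}\n\n{draft}"}},
        # The three decision buttons, each carrying the request ID.
        {"type": "actions",
         "elements": [
             {"type": "button", "text": {"type": "plain_text", "text": "Approve"},
              "style": "primary", "action_id": "approve", "value": request_id},
             {"type": "button", "text": {"type": "plain_text", "text": "Reject"},
              "style": "danger", "action_id": "reject", "value": request_id},
             {"type": "button", "text": {"type": "plain_text", "text": "Edit"},
              "action_id": "edit", "value": request_id},
         ]},
    ]
```

The payload would then go to Slack's `chat.postMessage` API, with an interaction handler routing button clicks back to the waiting agent by request ID.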
We’re testing a timeout mechanism too. If an approval request sits in Slack for more than 4 hours, the system marks it as stale and sends a reminder. After 24 hours, it moves to an “expired” state and the agent re-evaluates whether the reply is still relevant. The timeout exists because a forgotten approval request is worse than no approval request. It creates a false sense of safety: the human thinks the agent is waiting, the agent thinks the human is reviewing, and the email never gets sent.
The approval checker runs every 15 minutes. It is one of 6 scheduled jobs in the system, and it exists because we learned that approval workflows without timeout enforcement become dead letter queues.
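The staleness logic itself is small. A minimal sketch of the state transitions, using the 4-hour and 24-hour thresholds from above; a real checker would also post the Slack reminder and trigger the agent's re-evaluation:

```python
from datetime import datetime, timedelta
from enum import Enum

class ApprovalState(Enum):
    PENDING = "pending"
    STALE = "stale"      # reminder sent to the approver
    EXPIRED = "expired"  # agent re-evaluates whether the reply is still relevant

REMIND_AFTER = timedelta(hours=4)
EXPIRE_AFTER = timedelta(hours=24)

def check_staleness(created_at: datetime, now: datetime) -> ApprovalState:
    """Classify a pending approval request by age. Runs on the 15-minute schedule."""
    age = now - created_at
    if age >= EXPIRE_AFTER:
        return ApprovalState.EXPIRED
    if age >= REMIND_AFTER:
        return ApprovalState.STALE
    return ApprovalState.PENDING
```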
The 200-Message Failure
Our legal research daemon runs 10 specialized agents executing research sprints against a 37-task backlog. Early in development, we tried a version where the orchestrator posted each agent’s output to a Slack channel before passing it to the next agent in the pipeline.
The thinking was sound: humans should review intermediate outputs to catch quality problems early. In practice, the review channel accumulated 200+ messages in two days. Each message was a 500-2,000 word research output. Nobody read them. The channel became a scrolling wall of text that people muted on the first day.
What happened next was predictable. The system produced three outputs with fabricated statistics that made it into a summary document. The human “oversight” had been running for 48 hours and caught nothing, because the humans had stopped looking after hour 4.
We replaced the per-output review with a quality scoring system: 50+ garbage detection patterns, automated content scoring on a 0-1 scale, and a 3-retry cap that blocks tasks instead of endlessly regenerating. The human review moved from “read every output” to “review blocked tasks and weekly quality reports.” The review surface dropped from 200+ messages per day to 3-5 items per week. The humans actually read those 3-5 items.
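The shape of that gate is simple even if the real pattern library isn't. A toy sketch, with hypothetical patterns, a hypothetical 0.7 threshold, and a deliberately crude scoring heuristic standing in for the real scorer:

```python
import re

# Illustrative examples only; the real system uses 50+ patterns.
GARBAGE_PATTERNS = [
    re.compile(r"as an ai (language )?model", re.I),    # refusal/meta leakage
    re.compile(r"\[(citation|source) needed\]", re.I),  # unresolved placeholders
    re.compile(r"lorem ipsum", re.I),                   # template residue
]

MAX_RETRIES = 3
SCORE_THRESHOLD = 0.7  # hypothetical cutoff on the 0-1 scale

def score(output: str) -> float:
    """Toy content score: penalize garbage-pattern hits and very short outputs."""
    hits = sum(1 for p in GARBAGE_PATTERNS if p.search(output))
    length_ok = 1.0 if len(output.split()) >= 100 else 0.5
    return max(0.0, length_ok - 0.3 * hits)

def gate(output: str, attempt: int) -> str:
    """Return 'accept', 'retry', or 'block'. Blocked tasks go to human review."""
    if score(output) >= SCORE_THRESHOLD:
        return "accept"
    return "retry" if attempt < MAX_RETRIES else "block"
```

The key design choice is the retry cap: instead of regenerating forever, a task that fails three times becomes one of the 3-5 weekly items a human actually reads.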
The lesson: human oversight degrades in direct proportion to its volume. Every review request that doesn’t genuinely require human judgment makes it harder for the human to engage with the ones that do. Alert fatigue isn’t just a monitoring problem. It’s the central failure mode of poorly designed human-in-the-loop systems.
Designing Escalation Points
The question isn’t “should a human be involved?” The question is “at which specific point in the workflow does human judgment add information that the agent doesn’t have?”
For Penelope’s email replies, that point is clear: the agent can’t fully evaluate whether a particular tone or phrasing will land well with a specific person. Human social calibration is genuinely better here.
For the legal daemon’s research outputs, the answer is different. A human reviewer adds little value reading a market sizing analysis that the agent produced from public data. The agent is faster and more thorough at synthesis. Where the human adds value is in the quality criteria: deciding whether the garbage detection patterns are catching the right things, reviewing the scoring thresholds, and evaluating blocked tasks where the agent explicitly failed.
We’re developing a principle from this: escalation should happen at the point of maximum human leverage, not at the point of maximum human comfort. Most people feel more comfortable reviewing every output. But that comfort comes at the cost of actual oversight quality.
Here’s how this maps to our systems:
| System | T1 Actions | T2 Actions | T3 Actions |
|---|---|---|---|
| Legal Daemon | Research synthesis, memory consolidation, budget tracking | Sprint summaries, quality alerts, blocked task notifications | Kill criteria GO/NO-GO decisions |
| Penelope | Email monitoring, guest research, calendar parsing | Pipeline stage updates, scheduling confirmations | Email replies, meeting invitations |
| Sales Pipeline | Prospect data enrichment, competitive intel gathering | Pipeline updates, action item creation | Proposal drafts, pricing discussions |
| Research Systems | Source collection, citation indexing | Draft analyses, source verification reports | Publication approval |
Notice the pattern. T1 is everything that touches internal state. T2 is everything that touches shared workflows. T3 is everything that touches external relationships or irreversible commitments. The classification follows from the blast radius, and it stays consistent across systems even when the domain changes completely.
The Ulises/Penelope Split
We run two Slack-connected agents: Penelope (personal agent, external-facing) and Ulises (internal operations, pipeline management, cross-unit intelligence). They share the same Slack workspace but have completely separate authority frameworks.
Penelope operates mostly at T2 and T3. Her actions frequently involve external communication, so nearly everything she does either triggers a notification or waits for approval. Ulises operates mostly at T1 and T2. His actions are internal: updating pipeline trackers, posting research summaries, routing information between business units. He rarely needs human approval because his blast radius is contained within the organization.
This separation exists because authority levels should reflect the agent’s scope of impact, not its technical sophistication. Ulises is arguably the more complex system. He coordinates across 6 business units, manages a Notion CRM with 92+ prospects, and runs competitive intelligence workflows. But his actions don’t reach outside the organization, so he operates with more autonomy than Penelope, whose individual actions are simpler but land in someone else’s inbox.
Mixing the two into a single agent with a single authority framework would force a choice: either Ulises gets too much external-facing autonomy, or Penelope gets slowed down by approval requirements on internal operations that don’t warrant them.
Implementation Details
A few specifics on how the authority framework works in practice.
Classification happens in the agent definition, not in the code. Each agent has a markdown file specifying its role, tools, and authority level per action type. When we onboard a new client, the authority mapping is one of the first design decisions. It goes into the client’s ARCHITECTURE.md before any code gets written.
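For illustration, the authority section of such a definition might look like this. The field names and the Penelope action list here are hypothetical, not our actual file format:

```yaml
# agents/penelope.md front matter (field names illustrative)
role: podcast-production
tools: [email, slack, notion]
authority:
  monitor_email: T1
  research_guest: T1
  update_pipeline_stage: T2
  confirm_scheduling: T2
  send_email_reply: T3
  send_meeting_invite: T3
```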
T3 actions require structured approval payloads. The agent doesn’t just ask “can I do this?” It presents: the proposed action, the reasoning, the relevant context, and the specific alternatives it considered. This gives the human enough information to make a real decision rather than pattern-matching on “this looks fine.”
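A structured payload can be as simple as a dataclass with those four fields. A sketch, with illustrative names:

```python
from dataclasses import dataclass, field

@dataclass
class ApprovalRequest:
    """Everything a T3 approval request carries, and nothing more."""
    action: str                  # the proposed action
    reasoning: str               # why the agent chose it
    context: str                 # the relevant facts for the decision
    alternatives: list[str] = field(default_factory=list)  # options it rejected

    def render(self) -> str:
        """Format the payload for the approval channel."""
        alts = "\n".join(f"  - {a}" for a in self.alternatives) or "  (none)"
        return (f"Proposed action: {self.action}\n"
                f"Reasoning: {self.reasoning}\n"
                f"Context: {self.context}\n"
                f"Alternatives considered:\n{alts}")
```

Forcing the agent to fill in `alternatives` is the useful constraint: an agent that can't name what it rejected probably hasn't earned a yes.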
Every authority level has logging. T1 actions log silently. T2 actions log and notify. T3 actions log, notify, and block. But all three produce audit trails. If something goes wrong at T1, the log exists. If a T2 notification got ignored, the timeline is reconstructable. The logging is the safety net. The authority level is the operating constraint.
Tier assignments can change. When an agent has executed a specific T3 action successfully 20+ times with a 100% approval rate, that’s evidence the action should be reclassified to T2. When a T1 action produces an error that causes external impact, it gets reclassified to T2 or T3. The framework is meant to evolve as the system accumulates operating history.
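The reclassification rules above can be sketched as a review-time check. The function name and signature are illustrative; the thresholds are the ones from the text:

```python
def suggest_tier(current: str, runs: int, approval_rate: float,
                 caused_external_impact: bool) -> str:
    """Suggest a tier change based on an action's operating history."""
    if current == "T3" and runs >= 20 and approval_rate == 1.0:
        return "T2"  # consistently approved: after-the-fact notification is enough
    if current == "T1" and caused_external_impact:
        return "T2"  # an error escaped internal state: add visibility
    return current   # not enough evidence to change anything
```

A human still signs off on the change; the check surfaces candidates, it doesn't reclassify on its own.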
The Checkbox Problem
The reason most AI agent deployments treat human oversight as a checkbox is that the buyers demand it. Procurement teams, compliance departments, and risk committees want to hear “a human reviews every output.” That sentence appears in RFPs. It appears in vendor assessments. It appears in board presentations about AI governance.
The problem is that “a human reviews every output” describes a system that doesn’t work. At 5 outputs per day, maybe. At 50, unlikely. At 200, the human is rubber-stamping or ignoring. The oversight exists on paper and fails in practice.
Designed authority levels produce better outcomes. Fewer review points, each one positioned where human judgment genuinely changes the result. Agents that operate faster because they’re not waiting for approvals on actions that don’t need them. Humans who engage with approval requests because they receive 3 per day instead of 30.
We’re testing this across every system we’re building. Early results confirm the core thesis: concentrated oversight at high-stakes decision points outperforms distributed oversight across all decision points. The former produces genuine review. The latter produces a comfortable fiction.
The agent prepares. The human approves at exactly the right moment. The agent executes. Getting that middle step right is the architecture decision that determines whether the system actually works.
Synaptic turns businesses into AI-native organizations. We start where the demo ends. synaptic.so