Harness Engineering · Part 4

Guide for: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

Gated Execution: Why Your Agent Should Never Act Without Permission

Valid output is not safe output. The Gated Execution layer is the firewall between what the model proposes and what the system actually does - and it's the difference between an agent that assists and one that causes incidents.

#harness-engineering #gated-execution #human-in-the-loop #llm-agents #ai-safety #production-ai-systems

The Agent That Deleted the Database

A team deployed an agent with access to their production database. The task was routine: archive records older than 90 days.

The agent generated a DELETE query. The validation layer checked it - valid SQL, correct schema, proper syntax. The harness executed it.

The WHERE clause read WHERE created_at < NOW() - INTERVAL '90 days'. Correct logic.

What the agent missed: the column created_at was NULL for 40% of records due to a migration bug three months earlier. In SQL, comparing NULL to anything yields NULL, not TRUE or FALSE, and a WHERE clause only keeps rows whose condition evaluates to TRUE. So NULL < date evaluated to NULL, and those records were skipped by the date filter.

Except the agent had also generated a fallback condition: OR created_at IS NULL. To handle edge cases, it said.

Every record with a NULL created_at was deleted. 40% of the database. Gone.

The SQL was valid. The logic was coherent. The output passed validation. The execution gate did not exist.
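The failure mode is easy to reproduce. A minimal sqlite3 session (table name and dates invented for illustration) shows both halves: the date filter silently skipping NULL rows, and the fallback sweeping every one of them into the match set:

```python
import sqlite3

# Demonstrates SQL three-valued logic: a NULL created_at is neither
# older nor newer than the cutoff, so the comparison filters it out.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(1, "2020-01-01"), (2, "2099-01-01"), (3, None)],
)

# NULL < '2024-01-01' evaluates to NULL, which is not TRUE:
old = conn.execute(
    "SELECT id FROM records WHERE created_at < '2024-01-01'"
).fetchall()
print(old)  # [(1,)] - the NULL row is silently skipped

# Adding the fallback sweeps every NULL row into the match set:
with_fallback = conn.execute(
    "SELECT id FROM records WHERE created_at < '2024-01-01' OR created_at IS NULL"
).fetchall()
print(with_fallback)  # [(1,), (3,)]
```

The agent's query was this second shape: logically tidy, structurally valid, and wrong in exactly the way no syntax check can see.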

This is Part 4 of the Harness Engineering series. Part 1 introduced the seven-layer Harness Architecture. Part 2 covered Normalization. Part 3 covered Context Engineering. This article goes deep on Layer 4 - Gated Execution - the policy layer between model output and real-world action.


The Model Proposes. The Gate Decides.

That's the whole principle of Gated Execution, and it cannot be stated too many times.

Validation tells you the output is structurally correct. Gated Execution asks a different question: should this action execute at all, given what we know about its consequences?

Structural validity and semantic safety are completely orthogonal. The DELETE query above was structurally valid. It was semantically catastrophic. No validator would have caught it - because validators check format, not consequence.

The gate checks consequence. It sits between model intent and system action. Nothing with real-world side effects executes without passing through it.

Without the gate, your agent is a technically proficient actor with no judgment and no accountability. With the gate, it's a system you can trust.
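The principle can be made mechanical: route every side-effecting tool call through a single choke point, with no direct path from model output to execution. A minimal sketch - the `gated` decorator and `GateDenied` exception are illustrative names, and the toy lambda stands in for a real gate pipeline:

```python
from typing import Callable

class GateDenied(Exception):
    """Raised when the gate blocks a proposed action."""

def gated(gate: Callable[[dict], bool]):
    """Wrap a tool so it can only run if the gate passes its action."""
    def decorator(tool: Callable[[dict], object]):
        def wrapper(action: dict):
            if not gate(action):
                raise GateDenied(f"gate blocked action: {action.get('type')}")
            return tool(action)
        return wrapper
    return decorator

# Toy gate: allow only read actions.
@gated(lambda a: a.get("type") == "read")
def run_tool(action: dict) -> str:
    return f"executed {action['type']}"

print(run_tool({"type": "read"}))   # executed read
# run_tool({"type": "delete"}) raises GateDenied instead of executing
```

The important property is structural, not logical: the tool is unreachable except through the wrapper, so forgetting to check is impossible by construction.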


Why This Layer Is Non-Negotiable for Agentic Systems

For simple LLM applications - a chatbot, a summarizer, a classifier - gated execution is less critical. The worst outcome is a bad response, which is correctable.

For agentic systems, the calculus changes entirely. Agents act. They write files, call APIs, send emails, modify databases, trigger workflows, spend money. These actions have consequences that range from annoying to irreversible.

The three properties that make agentic actions dangerous:

Irreversibility. A deleted record, a sent email, a financial transaction, a deployed infrastructure change - these cannot be trivially undone. The cost of a false positive (executing an action that shouldn't have been executed) is asymmetrically higher than the cost of a false negative (blocking an action that was safe).

Cascading effects. One agent action often triggers downstream consequences. A write to a database propagates to dependent services. An API call triggers a webhook chain. An infrastructure change affects everything running on that infrastructure. The agent doesn't see the cascade; it only sees the immediate action.

Confidence calibration failure. LLMs are not well-calibrated on their own uncertainty. A model that is 60% confident in a destructive action produces output with the same tone and formatting as one that is 99% confident. There is no built-in hesitation, no "are you sure?" - unless you build one.


The Three Gate Types

A production gate is not a single check. It's a tiered system that applies different scrutiny to actions based on their risk profile.

Gate Type 1: Policy Validation

The first gate is deterministic. It evaluates proposed actions against a defined ruleset before any execution occurs.

code
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable

class RiskLevel(IntEnum):
    # Integer values give risk levels a meaningful ordering, so
    # comparisons like `rule.risk_level > max_risk` are correct.
    # (Comparing string values would order them alphabetically.)
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

@dataclass
class PolicyRule:
    name: str
    risk_level: RiskLevel
    check: Callable[[dict], bool]  # returns True if the action is allowed
    reason: str

class PolicyEngine:
    def __init__(self):
        self.rules: list[PolicyRule] = []

    def add_rule(self, rule: PolicyRule):
        self.rules.append(rule)

    def evaluate(self, action: dict) -> tuple[bool, RiskLevel, str]:
        violations = []
        max_risk = RiskLevel.LOW
        for rule in self.rules:
            if not rule.check(action):
                violations.append(rule.reason)
                if rule.risk_level > max_risk:
                    max_risk = rule.risk_level
        if violations:
            return False, max_risk, "; ".join(violations)
        return True, RiskLevel.LOW, "all checks passed"

# Example rules
engine = PolicyEngine()
engine.add_rule(PolicyRule(
    name="no_bulk_delete",
    risk_level=RiskLevel.CRITICAL,
    check=lambda a: not (a.get("type") == "sql"
                         and "DELETE" in a.get("query", "").upper()
                         and "WHERE" not in a.get("query", "").upper()),
    reason="DELETE without WHERE clause is not permitted"
))
engine.add_rule(PolicyRule(
    name="budget_guard",
    risk_level=RiskLevel.HIGH,
    check=lambda a: a.get("estimated_cost_usd", 0) <= 100,
    reason="Action estimated cost exceeds $100 limit"
))
engine.add_rule(PolicyRule(
    name="no_external_comms_without_approval",
    risk_level=RiskLevel.HIGH,
    check=lambda a: a.get("type") not in ("send_email", "send_sms", "post_webhook"),
    reason="External communications require explicit approval"
))

Policy validation is fast, deterministic, and auditable. Every rule is a documented business decision. Every violation is logged with a reason. This is your first line of defense.

Gate Type 2: Dry-Run Preview

For actions that pass policy validation but still carry non-trivial risk, execute a dry run before the real thing.

A dry run simulates the action and reports its projected consequences without committing them. SQL databases support this with transactions that are rolled back. Infrastructure tools support it with plan modes (Terraform plan, Ansible --check). File operations can be simulated by reporting what would change without writing.

code
class DryRunGate:
    def preview(self, action: dict) -> dict:
        action_type = action.get("type")
        if action_type == "sql":
            return self._preview_sql(action)
        elif action_type == "file_write":
            return self._preview_file_write(action)
        elif action_type == "api_call":
            return self._preview_api_call(action)
        else:
            return {"previewable": False,
                    "reason": f"No dry-run support for {action_type}"}

    def _preview_sql(self, action: dict) -> dict:
        query = action["query"]
        # Run inside a transaction, collect EXPLAIN output, roll back
        with db.transaction() as txn:
            try:
                explain = db.execute(f"EXPLAIN {query}")
                # Rewrite the DELETE as a SELECT to count the rows it
                # would touch, without touching them
                affected = db.execute(
                    f"SELECT COUNT(*) FROM ({query.replace('DELETE', 'SELECT *')}) sub"
                )
                txn.rollback()
                return {
                    "previewable": True,
                    "rows_affected": affected.scalar(),
                    "query_plan": explain.fetchall()
                }
            except Exception as e:
                txn.rollback()
                return {"previewable": True, "error": str(e)}

The dry-run output feeds back to the human approval step - or to an automated semantic check that evaluates whether the projected consequences match the intended task.

The database incident above would have been caught by a dry run. The preview would have shown 40% of records affected - a number wildly inconsistent with "archive records older than 90 days" in any healthy dataset. An automated check comparing expected vs. actual impact would have flagged it instantly.
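Such a check can be as simple as a ratio test against the dry-run's row count. A sketch - the 5% threshold is an invented example for illustration, not a recommendation:

```python
# Compare the dry run's affected-row count against what the task should
# plausibly touch, and block when the discrepancy is large. The default
# threshold is an illustrative assumption, not a universal constant.
def impact_check(rows_affected: int, total_rows: int,
                 max_affected_fraction: float = 0.05) -> tuple[bool, str]:
    """Return (ok, reason) for a destructive action's dry-run result."""
    if total_rows == 0:
        return False, "table is empty; nothing should be affected"
    fraction = rows_affected / total_rows
    if fraction > max_affected_fraction:
        return False, (
            f"dry run affects {fraction:.0%} of rows, "
            f"above the {max_affected_fraction:.0%} threshold"
        )
    return True, f"affects {fraction:.0%} of rows"

ok, reason = impact_check(rows_affected=400_000, total_rows=1_000_000)
print(ok, reason)  # False - 40% of rows is far outside the expected range
```

A crude heuristic like this would have stopped the incident above cold, because "archive records older than 90 days" should never touch 40% of a healthy table in one pass.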

Without this gate: structurally valid output executes with real-world consequences that nobody previewed. The model passed the policy check. The damage happened anyway.

Gate Type 3: Human-in-the-Loop

For irreversible or high-impact actions, no automated check is sufficient. You need a human.

The design challenge is calibration. Too many approval requests and the agent becomes slower than doing the task manually - defeating its purpose. Too few and you're one confident mistake away from a production incident.

code
import asyncio
import uuid
from datetime import datetime, timedelta, timezone

class HumanApprovalGate:
    def __init__(self, notifier, timeout_seconds: int = 300):
        self.notifier = notifier
        self.timeout = timeout_seconds
        self.pending: dict[str, asyncio.Future] = {}

    async def request_approval(self, action: dict, context: str) -> bool:
        # Unique id for correlating the human's response with this request
        approval_id = str(uuid.uuid4())
        # Send notification with action summary and dry-run preview
        await self.notifier.send(
            approval_id=approval_id,
            summary=self._summarize_action(action),
            preview=context,
            expires_at=(datetime.now(timezone.utc)
                        + timedelta(seconds=self.timeout)).isoformat()
        )
        # Wait for human response or timeout
        future = asyncio.get_running_loop().create_future()
        self.pending[approval_id] = future
        try:
            return await asyncio.wait_for(future, timeout=self.timeout)
        except asyncio.TimeoutError:
            return False  # Default deny on timeout
        finally:
            self.pending.pop(approval_id, None)

    def respond(self, approval_id: str, approved: bool):
        if approval_id in self.pending:
            self.pending[approval_id].set_result(approved)

    def _summarize_action(self, action: dict) -> str:
        return (
            f"Action: {action.get('type', 'unknown')}\n"
            f"Target: {action.get('target', 'unknown')}\n"
            f"Risk: {action.get('risk_level', 'unknown')}\n"
            f"Description: {action.get('description', 'No description provided')}"
        )

Default deny on timeout is a critical design decision. If no human responds within the timeout window, the action does not execute. The agent waits, reports the timeout, and the task is flagged for human review. Optimism is not the right default for irreversible actions.
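The pattern reduces to a future that resolves to a decision, with timeout mapped to denial. A self-contained sketch of just that core, with names invented for illustration:

```python
import asyncio

# Minimal demonstration of default deny on timeout: an approval future
# that nobody resolves is treated as a "no".
async def await_approval(future: asyncio.Future, timeout: float) -> bool:
    try:
        return await asyncio.wait_for(future, timeout=timeout)
    except asyncio.TimeoutError:
        return False  # silence is denial, not consent

async def main() -> None:
    loop = asyncio.get_running_loop()

    # Case 1: a reviewer approves in time.
    approved = loop.create_future()
    loop.call_later(0.01, approved.set_result, True)
    print(await await_approval(approved, timeout=1.0))   # True

    # Case 2: nobody responds - the gate denies by default.
    silent = loop.create_future()
    print(await await_approval(silent, timeout=0.05))    # False

asyncio.run(main())
```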


The Risk Classification Matrix

Not every action needs every gate. A practical classification system maps action types to required gates:

| Action Type | Policy Check | Dry Run | Human Approval |
| --- | --- | --- | --- |
| Read-only queries | ✓ | – | – |
| File reads | ✓ | – | – |
| Low-impact writes (logs, drafts) | ✓ | ✓ | – |
| Database writes (non-bulk) | ✓ | ✓ | – |
| External API calls (idempotent) | ✓ | ✓ | – |
| Bulk database operations | ✓ | ✓ | ✓ |
| Send email / external communication | ✓ | ✓ | ✓ |
| Financial transactions | ✓ | ✓ | ✓ |
| Infrastructure changes | ✓ | ✓ | ✓ |
| Irreversible deletes | ✓ | ✓ | ✓ |

Start conservative - require human approval for everything except read-only and low-impact actions. Relax approval requirements for specific action categories as confidence builds, based on observed error rates in those categories.

The goal is not zero human involvement. The goal is the right amount of human involvement - concentrated where consequences are highest.
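A matrix like this can be encoded as data so the harness enforces it mechanically rather than by convention. A sketch with illustrative action-type names and a fail-closed default for anything unrecognized:

```python
# The risk matrix as data. Action types and gate assignments mirror the
# table above; names are illustrative. Unknown action types get every
# gate: fail closed, not open.
REQUIRED_GATES: dict[str, set[str]] = {
    "read_only_query":         {"policy"},
    "file_read":               {"policy"},
    "low_impact_write":        {"policy", "dry_run"},
    "db_write_non_bulk":       {"policy", "dry_run"},
    "external_api_idempotent": {"policy", "dry_run"},
    "bulk_db_operation":       {"policy", "dry_run", "human_approval"},
    "send_email":              {"policy", "dry_run", "human_approval"},
    "financial_transaction":   {"policy", "dry_run", "human_approval"},
    "infrastructure_change":   {"policy", "dry_run", "human_approval"},
    "irreversible_delete":     {"policy", "dry_run", "human_approval"},
}

ALL_GATES = {"policy", "dry_run", "human_approval"}

def gates_for(action_type: str) -> set[str]:
    """Return the gates a proposed action must pass before execution."""
    return REQUIRED_GATES.get(action_type, ALL_GATES)

print(gates_for("read_only_query"))   # policy check only
print(gates_for("brand_new_tool"))    # all gates - unrecognized type
```

Keeping the matrix in data rather than scattered `if` statements makes "update it as you learn from production" a one-line change with an audit history.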


The Named Pattern: Consequence-Proportional Gating

I call this Consequence-Proportional Gating: the scrutiny applied to an action scales with the severity of its potential consequences, not with the model's confidence in the action.

This distinction matters. Model confidence is a poor proxy for safety. A model can be highly confident in a catastrophically wrong action - the database incident is exactly this case. Consequence-Proportional Gating ignores model confidence and focuses entirely on what happens if the action is wrong.

High consequence + low reversibility = maximum scrutiny. Low consequence + high reversibility = minimum scrutiny.

This is the same principle underlying surgical checklists, nuclear reactor procedures, and aviation pre-flight protocols. The rigor is proportional to the cost of failure, not the probability of success.


What Observability Looks Like for This Layer

Policy violation rate by rule - which policy rules trigger most often? A high rate on a specific rule means either the rule is miscalibrated or the model is systematically attempting actions outside its sanctioned scope.

Dry-run delta - the difference between what the model predicted an action would affect and what the dry run shows it would actually affect. A consistently high delta signals the model is poor at estimating the scope of its own actions for that action type.
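One way to compute this metric, as a sketch - the function name and the example numbers are invented for illustration:

```python
# Relative gap between the rows the model predicted it would touch and
# the rows the dry run says it will actually touch.
def dry_run_delta(predicted: int, actual: int) -> float:
    """Relative discrepancy between predicted and dry-run impact."""
    if actual == 0:
        return 0.0 if predicted == 0 else float("inf")
    return abs(predicted - actual) / actual

# Model predicted ~1,000 archived rows; the dry run shows 400,000.
print(dry_run_delta(1_000, 400_000))   # 0.9975 - flag this action type
```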

Approval rate and timeout rate - what fraction of human approval requests are approved vs. denied? What fraction time out? A high timeout rate means your notification system is too slow or your reviewers are overwhelmed. A high denial rate means either your gate threshold is too low or the model is systematically proposing unsafe actions.

Time-to-approval - how long do human approval requests take on average? This is your human-in-the-loop latency. If it exceeds the user's acceptable wait time, you need to either automate more approvals for low-risk actions or add more reviewers.

Log every gate decision - pass or block - with the full action, the gate that evaluated it, and the outcome. This is your safety audit trail.
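A minimal shape for that audit record, assuming a JSON-lines sink; the field names are illustrative:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# One way to record every gate decision: the full action, the gate that
# evaluated it, and the outcome, captured together in one record.
@dataclass
class GateDecision:
    timestamp: str
    gate: str      # "policy" | "dry_run" | "human_approval"
    action: dict   # the full proposed action, verbatim
    outcome: str   # "pass" | "block"
    reason: str

def log_gate_decision(gate: str, action: dict, outcome: str, reason: str) -> str:
    record = GateDecision(
        timestamp=datetime.now(timezone.utc).isoformat(),
        gate=gate,
        action=action,
        outcome=outcome,
        reason=reason,
    )
    line = json.dumps(asdict(record))
    # In production this would go to an append-only sink; here we return it.
    return line

entry = log_gate_decision(
    gate="policy",
    action={"type": "sql", "query": "DELETE FROM records"},
    outcome="block",
    reason="DELETE without WHERE clause is not permitted",
)
print(entry)
```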


What to Build First

First: Policy validation on all write operations. Before any action with side effects executes, run it through a policy engine. Start with three rules: no bulk deletes without explicit confirmation, no external communications without approval, no actions exceeding a cost threshold. Add rules as you observe the model's failure patterns.

Second: Dry-run preview for database and file operations. Add transaction-wrapped previews for SQL. Add what-would-change reporting for file writes. Surface the preview to whatever decision mechanism comes next (automated check or human).

Third: Human approval for irreversible actions. Wire up a notification channel (Slack, email, PagerDuty) for high-risk action approvals. Default deny on timeout.

Fourth: Risk classification matrix. Document which action types require which gates. Make it explicit. Update it as you learn from production.

Fifth: Automated semantic consequence check. Add a lightweight check that evaluates dry-run output against the stated task intent. Flags large discrepancies between expected and actual impact automatically, without requiring human review.


The Principle

The model doesn't know what it doesn't know. It doesn't know that your database has a migration bug. It doesn't know that the email list includes a journalist. It doesn't know that the infrastructure change will cascade to three dependent services.

The gate knows, or can find out, before it's too late.

Gated Execution is not about distrust of the model. It's about the fundamental asymmetry between the cost of blocking a safe action (temporary delay) and the cost of executing an unsafe one (potentially irreversible damage).

Consequence-Proportional Gating is the practice. The gate is the mechanism. Together they are what separate an agent that assists from one that causes incidents.

Build the gate before you give the agent write access. Because an agent without a gate is not an agent. It's a liability.



