Harness Engineering · Part 8

Guide · For: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

Deterministic Constraint Systems: Building Tool Registries That Keep Agents in Scope

The model will try to use tools it doesn't have. It will call APIs with parameters that don't exist. It will invent capabilities. The constraint system is how you make the gap between what the model thinks it can do and what it can actually do exactly zero.

#harness-engineering #tool-registry #action-manifest #constraint-systems #llm-agents #production-ai-systems

The Tool That Didn't Exist

An agent was tasked with triaging customer support tickets and routing them to the appropriate team. It had four tools: get_ticket, assign_ticket, add_comment, and close_ticket.

Three days in, a ticket arrived that the agent decided was a billing issue. The agent called escalate_to_billing. This tool did not exist. The function call returned a tool-not-found error.

The agent tried again. Same error. It tried a variation: route_to_billing_team. Same error. It tried billing_escalation. Same error.

After six failed attempts, the agent added a comment to the ticket saying "Escalated to billing team" and closed it. The ticket was never escalated. The comment was false. The customer waited three days for a response from a team that never received the ticket.

The agent invented a tool that didn't exist, failed to call it, declared success anyway, and produced a false audit trail.

This is not a hallucination problem. The model didn't make up facts - it made up capabilities. And it had no way to know that escalate_to_billing didn't exist, because nobody had told it precisely which tools did exist and what they could do.

This is Part 8 of the Harness Engineering series - the final installment on the seven-layer Harness Architecture. Part 1 introduced the architecture. This article covers Layer 3 - the Constraint Layer - building tool registries and action manifests that make hallucinated tool calls structurally impossible.


The Model's Imagination Is Not Your Action Space

This is the core problem the constraint layer solves.

The model has a training-time conception of what tools an agent might have. It knows about REST APIs. It knows about databases. It knows about email. It knows about file systems. All of this knowledge informs what tool calls it might attempt.

Your agent has a runtime action space: exactly the tools you've wired up, with exactly the parameters they accept, operating on exactly the resources they're permitted to access.

The gap between the model's imagination and your runtime action space is your injection surface for capability hallucination. The model will reach into that gap and call tools that don't exist, pass parameters that aren't valid, and attempt actions on resources it isn't permitted to touch.

The constraint system closes that gap to zero.

Not by making the model smarter. Not by better prompting. By making the gap structurally impossible - by building a tool registry where every callable action is explicitly defined, every parameter is typed and validated, and the model is shown only what exists.


What a Tool Registry Actually Is

Most agent frameworks expose tools as a list of function signatures passed to the model. That is a start, but it is a poor substitute for a constraint system.

A tool registry is a runtime data structure that:

  1. Defines every available action with complete, typed parameter specifications
  2. Validates every tool call before execution against those specifications
  3. Enforces scope rules that restrict which tools are available in which contexts
  4. Logs every tool call with full input/output for audit
code
from dataclasses import dataclass, field
from typing import Any, Callable


class ToolNotFound(Exception):
    def __init__(self, message: str, available_tools: list[str] | None = None):
        super().__init__(message)
        self.available_tools = available_tools or []


class ToolOutOfScope(Exception):
    def __init__(self, message: str, current_scope: list[str] | None = None):
        super().__init__(message)
        self.current_scope = current_scope or []


class MissingParameter(Exception):
    def __init__(self, message: str, tool: "ToolDefinition | None" = None):
        super().__init__(message)
        self.tool = tool


class ParameterTypeError(Exception):
    pass


class ParameterValueError(Exception):
    pass


@dataclass
class ToolParameter:
    name: str
    type: type
    required: bool
    description: str
    enum_values: list | None = None
    min_value: float | None = None
    max_value: float | None = None


@dataclass
class ToolDefinition:
    name: str
    description: str
    parameters: list[ToolParameter]
    handler: Callable
    scope_tags: list[str] = field(default_factory=list)  # e.g. ["read", "write", "billing"]
    requires_approval: bool = False

    def to_llm_schema(self) -> dict:
        """Convert to the tool schema format the LLM expects."""
        properties = {}
        required = []
        for param in self.parameters:
            prop = {"type": self._python_type_to_json(param.type), "description": param.description}
            if param.enum_values:
                prop["enum"] = param.enum_values
            properties[param.name] = prop
            if param.required:
                required.append(param.name)
        return {
            "name": self.name,
            "description": self.description,
            "input_schema": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        }

    def _python_type_to_json(self, t: type) -> str:
        return {str: "string", int: "integer", float: "number", bool: "boolean"}.get(t, "string")


class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, ToolDefinition] = {}

    def register(self, tool: ToolDefinition):
        self._tools[tool.name] = tool

    def get_tools_for_scope(self, scope_tags: list[str]) -> list[ToolDefinition]:
        """Return only tools whose scope_tags intersect with the requested scope."""
        if not scope_tags:
            return list(self._tools.values())
        return [
            t for t in self._tools.values()
            if any(tag in scope_tags for tag in t.scope_tags)
        ]

    def to_llm_schemas(self, scope_tags: list[str]) -> list[dict]:
        """Return the tool schemas to pass to the LLM - scoped to context."""
        return [t.to_llm_schema() for t in self.get_tools_for_scope(scope_tags)]

    def validate_and_execute(self, tool_name: str, tool_input: dict, scope_tags: list[str]) -> Any:
        # Tool must exist
        if tool_name not in self._tools:
            raise ToolNotFound(
                f"Tool '{tool_name}' does not exist. "
                f"Available tools: {list(self._tools.keys())}",
                available_tools=list(self._tools.keys()),
            )
        tool = self._tools[tool_name]
        # Tool must be in scope
        if scope_tags and not any(tag in scope_tags for tag in tool.scope_tags):
            raise ToolOutOfScope(
                f"Tool '{tool_name}' is not available in the current scope. "
                f"Tool scope: {tool.scope_tags}, Current scope: {scope_tags}",
                current_scope=scope_tags,
            )
        # Validate parameters
        self._validate_parameters(tool, tool_input)
        # Execute
        return tool.handler(**tool_input)

    def _validate_parameters(self, tool: ToolDefinition, inputs: dict):
        for param in tool.parameters:
            if param.required and param.name not in inputs:
                raise MissingParameter(
                    f"Required parameter '{param.name}' missing for tool '{tool.name}'",
                    tool=tool,
                )
            if param.name in inputs:
                value = inputs[param.name]
                # Type check, attempting coercion before rejecting
                if not isinstance(value, param.type):
                    try:
                        inputs[param.name] = param.type(value)
                    except (ValueError, TypeError):
                        raise ParameterTypeError(
                            f"Parameter '{param.name}' must be {param.type.__name__}, "
                            f"got {type(value).__name__}: {value!r}"
                        )
                # Enum check
                if param.enum_values and inputs[param.name] not in param.enum_values:
                    raise ParameterValueError(
                        f"Parameter '{param.name}' must be one of {param.enum_values}, "
                        f"got: {inputs[param.name]!r}"
                    )
                # Range check
                if param.min_value is not None and inputs[param.name] < param.min_value:
                    raise ParameterValueError(f"Parameter '{param.name}' must be >= {param.min_value}")
                if param.max_value is not None and inputs[param.name] > param.max_value:
                    raise ParameterValueError(f"Parameter '{param.name}' must be <= {param.max_value}")

The key behaviors:

ToolNotFound raises immediately with the list of actually available tools. This error feeds back to the model on the next turn, allowing it to correct its tool selection. No silent failure, no agent pretending it escalated something it didn't.

Scope enforcement is runtime, not prompt-based. The scope restriction is enforced in code, not in the system prompt. Telling the model "you can only use read tools in this context" is a prompt-level hint. The registry enforcing scope in validate_and_execute is a hard constraint that cannot be overridden by the model's reasoning.
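Stripped to its essentials, the fail-loud lookup is only a few lines. This is a minimal sketch, not the full registry above; the tool table and handlers are illustrative stand-ins.

```python
class ToolNotFound(Exception):
    pass

# The live tool table - the only source of truth for what the agent can call.
tools = {
    "get_ticket": lambda ticket_id: {"id": ticket_id, "status": "open"},
    "assign_ticket": lambda ticket_id, team: {"id": ticket_id, "team": team},
}

def call_tool(name: str, **kwargs):
    if name not in tools:
        # Fail loudly, and put the real tool list in the error message
        # so the model can self-correct on the next turn.
        raise ToolNotFound(f"Tool '{name}' does not exist. Available tools: {sorted(tools)}")
    return tools[name](**kwargs)
```

A hallucinated name like escalate_to_billing never reaches a handler; it produces an error that names the four tools that do exist.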


The Action Manifest

The action manifest is a higher-level concept that sits above the tool registry. Where the registry defines what tools exist and how to call them, the manifest defines what the agent is authorized to do in a specific task context.

code
@dataclass
class ActionManifest:
    task_id: str
    task_type: str
    allowed_scope_tags: list[str]
    resource_constraints: dict  # e.g. {"max_records": 100, "allowed_ticket_ids": [...]}
    max_tool_calls: int
    requires_human_approval_for: list[str]  # Tool names that need approval

    def allows_tool(self, tool_name: str, registry: ToolRegistry) -> bool:
        available = registry.get_tools_for_scope(self.allowed_scope_tags)
        return any(t.name == tool_name for t in available)

    def needs_approval(self, tool_name: str) -> bool:
        return tool_name in self.requires_human_approval_for

For the ticket triage agent, the manifest would look like:

code
triage_manifest = ActionManifest(
    task_id="triage-session-abc123",
    task_type="ticket_triage",
    allowed_scope_tags=["ticket_read", "ticket_write", "comment"],
    resource_constraints={
        "allowed_ticket_ids": ["T-1001", "T-1002", "T-1003"],
    },
    max_tool_calls=50,
    requires_human_approval_for=["close_ticket"],
)

The manifest would have shown the model exactly four tools: get_ticket, assign_ticket, add_comment, close_ticket. escalate_to_billing would not appear. The model cannot call what it cannot see.

This is the fundamental constraint design principle: the model's action space is what you show it, not what it imagines.
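The principle can be sketched in a few lines. The scope map below is an assumption for illustration (the refund_customer tool is invented here to represent an out-of-scope capability): the schema list shown to the model is computed from the manifest's scope tags, so anything out of scope simply never appears.

```python
# Registry-side scope tags per tool (illustrative).
TOOL_SCOPES = {
    "get_ticket": ["ticket_read"],
    "assign_ticket": ["ticket_write"],
    "add_comment": ["comment"],
    "close_ticket": ["ticket_write"],
    "refund_customer": ["billing_write"],  # registered, but not triage-scoped
}

def visible_tools(allowed_scope_tags: list[str]) -> list[str]:
    # Only tools whose tags intersect the manifest's scope are shown to the model.
    return sorted(
        name for name, tags in TOOL_SCOPES.items()
        if any(t in allowed_scope_tags for t in tags)
    )

print(visible_tools(["ticket_read", "ticket_write", "comment"]))
# ['add_comment', 'assign_ticket', 'close_ticket', 'get_ticket']
```

With the triage manifest's scope tags, exactly the four triage tools are visible; refund_customer exists in the registry but is invisible to this task.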


Dynamic Tool Scoping

Static tool registries give every agent the same tools in every context. Dynamic scoping restricts available tools based on the task phase, the agent's current position in a workflow, or the resource being operated on.

code
class DynamicToolScope:
    def __init__(self, registry: ToolRegistry):
        self.registry = registry

    def tools_for_phase(self, phase: str) -> list[dict]:
        phase_scopes = {
            "planning": ["read"],
            "execution": ["read", "write"],
            "verification": ["read"],
            "cleanup": ["read", "write", "delete"],
        }
        scope_tags = phase_scopes.get(phase, ["read"])
        return self.registry.to_llm_schemas(scope_tags)

    def tools_for_resource_type(self, resource_type: str) -> list[dict]:
        resource_scopes = {
            "ticket": ["ticket_read", "ticket_write", "comment"],
            "user": ["user_read"],
            "billing": ["billing_read"],  # No write - billing writes require separate approval
        }
        scope_tags = resource_scopes.get(resource_type, [])
        return self.registry.to_llm_schemas(scope_tags)

Vercel's finding - that removing 80% of available tools improved agent performance - is the empirical validation of this principle. More tools means more ways to get it wrong. Fewer, well-scoped tools means fewer failure modes and more reliable task completion.

Never give an agent more tools than it needs for the current phase of the current task.


Error Feedback to the Agent

When a tool call fails validation, the error message the model receives is a critical part of the constraint system. A good error message enables self-correction. A bad one produces retry loops.

code
def build_tool_error_feedback(error: Exception, tool_name: str, tool_input: dict) -> str:
    if isinstance(error, ToolNotFound):
        return (
            f"Tool '{tool_name}' does not exist.\n"
            f"Available tools: {error.available_tools}\n"
            f"Please use one of the available tools or adjust your approach."
        )
    elif isinstance(error, ToolOutOfScope):
        return (
            f"Tool '{tool_name}' is not available in the current task context.\n"
            f"Current context allows: {error.current_scope}\n"
            f"Please use only the tools available in this context."
        )
    elif isinstance(error, MissingParameter):
        return (
            f"Tool call failed: {error}\n"
            f"Required parameters for '{tool_name}': "
            f"{[p.name for p in error.tool.parameters if p.required]}\n"
            f"You provided: {list(tool_input.keys())}"
        )
    elif isinstance(error, ParameterValueError):
        return f"Invalid parameter value: {error}\nPlease correct and retry."
    else:
        return f"Tool call failed: {error}"

The ToolNotFound error lists available tools. This is what enables the model to self-correct: it knows escalate_to_billing doesn't exist and can choose assign_ticket instead. Without this feedback, the model either loops or, worse, pretends it succeeded.


The Named Pattern: Closed Action Vocabulary

I call the combination of registry + manifest + dynamic scoping the Closed Action Vocabulary: the complete, exhaustive, and non-extensible set of actions the agent is permitted to take in a given context.

"Closed" is the operative word. The vocabulary has no open-ended extension mechanism at runtime. The model cannot add new tools by describing them. It cannot call undeclared functions. It cannot reach outside the boundary of what the registry exposes.

Every production agentic system should operate within a Closed Action Vocabulary. The vocabulary may be large. It may vary by context. But it is always defined, always enforced, and never improvised by the model.

The model's imagination is unbounded. Its action space must not be.
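As a toy illustration of the "closed" property (tool names taken from the triage example), the vocabulary can be made literal with an enum: anything outside it fails to parse at all, so there is no code path through which the model extends the vocabulary at runtime.

```python
from enum import Enum

class TriageAction(Enum):
    GET_TICKET = "get_ticket"
    ASSIGN_TICKET = "assign_ticket"
    ADD_COMMENT = "add_comment"
    CLOSE_TICKET = "close_ticket"

def parse_action(name: str) -> TriageAction:
    # Enum value lookup raises ValueError for any name outside
    # the closed vocabulary - there is no "unknown action" branch.
    return TriageAction(name)
```

parse_action("escalate_to_billing") raises before any handler is consulted; the action space is closed at the type level, not by a prompt instruction.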


What Observability Looks Like for This Layer

ToolNotFound rate by tool name - which non-existent tools does the model attempt most often? This tells you where your manifest is incomplete. If the model frequently attempts escalate_to_billing, you either need to add that tool or make the routing instruction clearer.

Parameter validation failure rate by tool and parameter - which parameters does the model get wrong most often? A high failure rate on a specific parameter signals a prompt-level description issue - the model doesn't understand what format that parameter expects.

Scope violation attempts - how often does the model attempt to call an out-of-scope tool? A non-zero rate means either your scope boundaries are misconfigured or the model is attempting to exceed its authorization. Both require investigation.

Tool call distribution - what fraction of calls go to each tool? An uneven distribution (90% of calls to one tool) may indicate the manifest is missing tools the agent needs, forcing it to use suboptimal substitutes.
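The four signals above can be collected with a small in-process sketch. This is an assumed Counter-based collector for illustration; in production these counters would feed your actual metrics backend.

```python
from collections import Counter

class ConstraintMetrics:
    def __init__(self):
        self.tool_not_found = Counter()    # attempted nonexistent tool names
        self.param_failures = Counter()    # (tool, parameter) validation failures
        self.scope_violations = Counter()  # out-of-scope tool attempts
        self.calls = Counter()             # per-tool call distribution

    def record_not_found(self, tool: str):
        self.tool_not_found[tool] += 1

    def record_call(self, tool: str):
        self.calls[tool] += 1

    def top_hallucinated_tools(self, n: int = 5) -> list[tuple[str, int]]:
        # Which non-existent tools does the model attempt most often?
        return self.tool_not_found.most_common(n)

    def call_fraction(self, tool: str) -> float:
        # Uneven distribution hints the manifest is missing tools the agent needs.
        total = sum(self.calls.values())
        return self.calls[tool] / total if total else 0.0
```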


What to Build First

First: Explicit tool schema definitions. Stop passing raw function signatures to the model. Define every tool with name, description, typed parameters, and required vs. optional fields. Use your framework's tool definition format (Claude tools, OpenAI function calling) with complete parameter documentation.

Second: Registry with ToolNotFound handling. Wrap every tool call in a registry check. If the model calls a tool that doesn't exist, return a structured error with the list of available tools. Never silently fail or let the model declare success on a failed tool call.

Third: Scope restriction. Identify which tools should be available in which contexts. Start with read/write/delete as your basic scope tags. Apply scope restrictions to agent tasks that don't need write access.

Fourth: Action manifest per task type. Define a manifest for each task type your agent performs. What tools, what scopes, what resource constraints, what approval requirements.

Fifth: Parameter validation with coercion. Add type coercion where possible (string "123" to integer 123) and explicit error messages where not. The error feedback the model receives on validation failure is as important as the validation itself.

Sixth: Tool call audit logging. Log every tool call: tool name, inputs, outputs, validation outcome, execution time. This is your agent's audit trail. You will need it.


Closing the Series: The Complete Harness Architecture

This is the final layer. With all seven in place, here is what the complete Harness Architecture looks like:

mermaid
graph TD
    A[User Request] --> B[Layer 1: Normalization]
    B --> C[Layer 2: Context Orchestration]
    C --> D[LLM Reasoning]
    D --> E[Layer 3: Constraint System]
    E -- Invalid Tool Call --> F[Error Feedback]
    F --> D
    E -- Valid --> G[Layer 5: Validation]
    G -- Repair --> D
    G -- Valid --> H[Layer 4: Gated Execution]
    H -- Blocked --> I[Human Approval / Reject]
    H -- Approved --> J[Action Execution]
    J -- Long-running --> K[Layer 7: State Management]
    K --> D
    J --> L[Final Output]
    D -- Provider Fail --> M[Layer 6: Circuit Breaker / Fallback]
    M --> D

    style A fill:#4A90E2,color:#fff
    style B fill:#7B68EE,color:#fff
    style C fill:#9B59B6,color:#fff
    style D fill:#FFD93D,color:#000
    style E fill:#6BCF7F,color:#fff
    style F fill:#E74C3C,color:#fff
    style G fill:#6BCF7F,color:#fff
    style H fill:#98D8C8,color:#000
    style I fill:#E74C3C,color:#fff
    style J fill:#4A90E2,color:#fff
    style K fill:#FFA07A,color:#fff
    style L fill:#4A90E2,color:#fff
    style M fill:#E74C3C,color:#fff

Seven layers. Each one a specific failure mode absorbed before it propagates. Together, they are the infrastructure that separates a demo from a production system.

The model is the least reliable component in your stack. The harness is how you ship it anyway.




References

  1. Yao, S., Zhao, J., Yu, D., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. https://arxiv.org/abs/2210.03629

  2. Shinn, N., Cassano, F., Labash, B., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366. https://arxiv.org/abs/2303.11366

  3. Anthropic. (2024). Tool use with Claude. https://docs.anthropic.com/en/docs/build-with-claude/tool-use

  4. OpenAI. (2023). Function calling. https://platform.openai.com/docs/guides/function-calling

  5. Schick, T., Dwivedi-Yu, J., Dessì, R., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. https://arxiv.org/abs/2302.04761

  6. Greshake, K., Abdelnabi, S., Mishra, S., et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173. https://arxiv.org/abs/2302.12173

  7. Chase, H. (2022). LangChain: Building applications with LLMs through composability. https://github.com/langchain-ai/langchain

