Skills vs Hooks in Claude Code: Enforceability Is the Design Variable

This article uses Claude Code's configuration layers as the reference implementation - CLAUDE.md, SKILL.md, and the hook lifecycle are Claude Code-specific. The design variable it names is not. Cursor rules, AGENTS.md, LangGraph callbacks, and every agent harness that separates "text the model reads" from "code that runs on the tool path" faces the same advisory-versus-enforced decision. The syntax changes. The category error does not.

When a PreToolUse hook exits with code 2, the tool call dies before it executes. The file is not edited. The command does not run. Whatever the model wrote to justify the action never mattered, because the hook is a separate process running outside the model's context, and the model never saw it coming. All Claude receives is the hook's stderr, injected back into the conversation as the reason the action failed.

Compare that to a skill. A skill is a SKILL.md file whose description gets matched against the task; when it matches, the body is loaded into context and the model reads it. Reads it. The skill can be precise, battle-tested, and correct, and the model can still weigh it against eighty thousand tokens of accumulated context and decide, this one time, that the situation calls for something else.

That is the entire distinction, and everything in this article hangs off its two poles: advisory versus enforced. A skill is guidance the model interprets. A hook is a gate the model cannot see and cannot route around. Engineers already own this distinction everywhere else in the stack - a lint warning versus a compile error, a review comment versus a required Continuous Integration (CI) check. The sharper version: you don't // eslint-disable a segfault. Intent doesn't get a vote.

Advisory vs. Enforced: The Category Error Behind "Skills Are Unreliable"

Here is the claim this article exists to prove: the skill failures that end with a team concluding "skills are unreliable" are usually not skill failures at all. They are category errors - a skill deployed where the job needed a gate. Skills have real failure modes of their own - descriptions too vague to route, bodies too bloated to load - but those are visible and fixable in place. The category error is the failure that survives good authoring, because no amount of authoring changes what a skill is. The team writes a beautiful SKILL.md saying "never edit files outside the requested change," watches the agent violate it in week two, and concludes that skills are unreliable. Skills are not unreliable. Skills are advisory. The team asked a suggestion to do a gate's job, and the suggestion behaved exactly like a suggestion.

The design variable underneath - the one that decides which tool the job needs - is enforceability: for each rule you want your agent to follow, is it guidance or is it a gate? I call this axis the Enforceability Axis, and the point of naming it is that every best practice in the Claude Code ecosystem is downstream of where a rule sits on it, yet almost nobody designs with it explicitly.

To be precise about what is and is not new here: the advisory/deterministic split itself is established. Anthropic's own steering guide states that a prompted rule can fail under pressure while "all hooks are deterministically triggered," and that when something absolutely must not happen, an instruction is the wrong tool. Part 4 of this series (Hooks: The Enforcement Layer That Turns Agent Policy Into Agent Fact) built the enforcement side in depth and named the shift policy-as-code for agents. What the discourse has not done is treat enforceability as the generating variable - the single question that produces the layer-selection frameworks, the skill-authoring best practices, and the failure taxonomies as corollaries. That is the job of this piece: not "here are three tools," but "here is the axis, and here is the decision procedure that eliminates a whole class of failure."

The accurate picture of Claude Code's guidance layers is a triad - CLAUDE.md (always-on), skills (retrieved-on-relevance), hooks (enforced-on-path). The decision that matters collapses to two poles - advisory or enforced. Hold both: the triad is the truth, the axis is the move.

The Three Coding-Agent Failure Modes Behind the Viral Karpathy CLAUDE.md

The clearest evidence that the industry feels this problem - without naming it - is the most-starred configuration file in recent memory.

On January 26, 2026, Andrej Karpathy posted a thread on X describing his shift to agent-heavy coding and cataloguing what still goes wrong: agents that don't manage their confusion, don't seek clarification, don't surface inconsistencies, over-engineer simple requests into bloated abstractions, and touch adjacent code they don't understand. The next day, developer Forrest Chang published a single CLAUDE.md distilling those observations into four behavioral rules. The repository, andrej-karpathy-skills (since transferred to the multica-ai organization; Karpathy himself neither authored nor endorsed it), sits at roughly 187,000 GitHub stars as of July 2026, five months after its first commit.

It did not go viral because it was a good prompt. It went viral because it correctly diagnosed three failure modes every practitioner recognizes:

Silent wrong assumptions. The agent charges ahead on an unverified guess instead of asking. Chang's counter: "Think Before Coding" - state assumptions, surface confusion, ask when unclear.
Over-engineering. The 50-line fix becomes 500 lines with a speculative abstraction layer. Chang's counter: "Simplicity First" - minimum code that solves the problem, nothing speculative.
Orthogonal edits. The agent touches code nobody asked it to touch. Chang's counter: "Surgical Changes" - touch only what you must, clean up only your own mess.

Chang's fourth rule, Goal-Driven Execution, is not a failure mode but a verification loop. Hold that thought: it returns in the section on composing guidance and gates, where it turns out to be one enforcement step from being real.

This is not anecdote-scale. The largest published analysis of developer-agent misalignment - Tang et al., covering 20,574 real coding-agent sessions across 1,639 repositories - found that the single most common failure category, at 38.33 percent of misalignment cases, was Developer Constraint Violation: the agent doing something the developer had explicitly constrained it not to do. The dominant failure mode of coding agents is not incapability. It is ungoverned capability.

Now the move the listicles never make. Put each of the three failure modes on the Enforceability Axis and they land in three different places:

Silent wrong assumptions are advisory by nature. No deterministic check can detect that the model is proceeding on a wrong interpretation of an ambiguous request - detecting it would require already knowing the right interpretation. "Stop and ask" is genuinely the ceiling, and a CLAUDE.md rule is the correct tool. Not a compromise. Correct.
Orthogonal edits are checkable. "This edit targets a file outside the declared scope of the task" is a boolean a twenty-line script can compute from the tool call's arguments. Which means it can be a hook. And if scope violations cost you real money or review time - if they matter - it should be.
Over-engineering sits in between. Diff size and file count are measurable; "this abstraction is speculative" is judgment. You can gate the measurable proxy (flag any diff over N lines for a one-function request) while the substance stays advisory.
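
The measurable half of that split is small enough to sketch. What follows is a minimal illustration, not a recommended threshold: it assumes the Edit tool's input carries `old_string` and `new_string` fields, and `MAX_CHANGED_LINES`, `changed_lines`, and `over_budget` are hypothetical names introduced here. It counts the lines one edit adds or removes and compares the total to a budget; whether the abstraction behind those lines is speculative stays a judgment call the check cannot make.

```python
import difflib

MAX_CHANGED_LINES = 40  # hypothetical budget for a one-function request


def changed_lines(old: str, new: str) -> int:
    """Count lines added or removed between two versions of a text span."""
    diff = list(
        difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    )
    # The first two entries are the "---"/"+++" file headers; hunk headers
    # start with "@@", so only "+" and "-" content lines are counted.
    return sum(1 for line in diff[2:] if line.startswith(("+", "-")))


def over_budget(tool_input: dict, limit: int = MAX_CHANGED_LINES) -> bool:
    """True when a single edit changes more lines than the declared budget."""
    old = tool_input.get("old_string", "")  # assumed Edit payload field
    new = tool_input.get("new_string", "")  # assumed Edit payload field
    return changed_lines(old, new) > limit
```

A hook built on this would more plausibly flag than block, since a large diff is a proxy for over-engineering rather than proof of it.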

Three famous failure modes, three different positions on one axis. That is the whole article in miniature, and it explains both the repo's success and its ceiling: 187,000 stars went to four advisory rules, and the one rule among them that could have been a gate - Surgical Changes - is still being violated somewhere right now, politely, by an agent that read it.

CLAUDE.md vs. Skills vs. Hooks: What Each Layer Actually Guarantees

Precision about the triad is the reader's payoff, so here are the primitives without the marketing gloss. (Part 5 of this series, Which Claude Code Layer Solves Your Problem?, maps all five extensibility layers to problem types and names the cost of misdiagnosis the Wrong-Layer Tax; this section is the enforceability cross-section of that map.)

CLAUDE.md is always-on context. Loaded every session, which means it competes for attention every turn against everything else in the window. Best for a small number of high-value invariants - Chang's rules are the canonical example. Its failure mode is bloat: every line you add dilutes every line already there. The mechanism predicts instruction-following decay as the file grows - a constraint stated once at the top of the context carries less attention weight under eighty thousand tokens of tool output than it did at turn one - and the recurring practitioner complaint matches the prediction: agents forgetting rules that are literally in the file.

Skills are retrieved-on-relevance. A SKILL.md with frontmatter whose description is matched against the current task; only the description is always in context (Claude Code caps each listing at 1,536 characters and budgets roughly one percent of the context window for all of them), and the body loads on match - progressive disclosure. Best for procedural knowledge that is only sometimes relevant: deployment runbooks, review checklists, migration procedures. Two failure modes: descriptions too vague to route ("a helpful skill for code quality" matches nothing reliably), and detail stuffed into the SKILL.md body instead of companion files loaded on demand.

Hooks are enforced-on-path. Deterministic code registered on the tool-call lifecycle - PreToolUse, PostToolUse, SessionStart, Stop, and others - configured in settings.json, invisible to the model. The mechanism in full, because it is the thing readers should remember: a PreToolUse hook fires after the model has committed to a tool call but before it executes; the hook receives the full tool input as JSON on stdin; exit code 0 lets the call proceed; exit code 2 blocks it and feeds the hook's stderr back to the model as the reason. The model never sees the hook - only the consequence - and it adapts to the consequence. You are not persuading the model. You are editing its reality.
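
The protocol in that paragraph fits in a few lines. The sketch below is an illustration under stated assumptions, not Claude Code's own code: the rule it enforces (refusing edits to `.env` files) and the names `decide` and `main` are hypothetical, and the payload is assumed to expose `tool_name` and `tool_input.file_path` as described above.

```python
import json
import sys

ALLOW, BLOCK = 0, 2  # exit 0 lets the tool call proceed; exit 2 blocks it


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, stderr_message) for one PreToolUse event.

    Hypothetical rule, for illustration only: never modify a .env file.
    """
    tool = event.get("tool_name", "")
    path = event.get("tool_input", {}).get("file_path", "")
    filename = path.rsplit("/", 1)[-1]
    if tool in ("Edit", "Write") and filename.startswith(".env"):
        return BLOCK, f"BLOCKED: {path} holds secrets; ask the user to change it."
    return ALLOW, ""


def main() -> int:
    """Hook entry point: JSON event on stdin, verdict as exit code plus stderr."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)  # this text is what the model is shown
    return code
```

Saved as a script that ends with `sys.exit(main())` and registered under `PreToolUse`, the decision lives entirely in `decide`, which never sees the model's reasoning, only the call it committed to.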

The mapping onto primitives engineers already own:

| Engineering primitive | Claude Code primitive | Enforcement property |
| --- | --- | --- |
| Style guide / doc comment | `CLAUDE.md` | Always visible, never enforced |
| Lint warning | Skill | Loaded when relevant, ignorable |
| Compile error / required CI check | Hook | Non-negotiable, on the path |

One honest caveat before the worked example, because it buys the credibility everything after depends on: enforcement is not free, and over-enforcement is worse than under-enforcement. Under-enforcement fails loudly - you see the bad diff and tighten the setup. Over-enforcement fails invisibly: the agent silently stops taking a class of correct actions, and you spend an afternoon debugging why it "won't do the obvious thing" before you remember the gate you installed three months ago. Every hook is latency on every matching tool call and rigidity in every edge case you didn't anticipate. Anthropic's Code Review feature is instructive as a deliberate design choice here: it ships advisory by default - findings as comments, severity-tagged, not blocking - and lets teams opt individual checks up to blocking. The default posture is guidance; enforcement is a decision you make per-invariant. That is the discipline, stated as product design.

Worked Example: Blocking Out-of-Scope Edits with a PreToolUse Scope Gate

Orthogonal edits are the right centerpiece because they cross the advisory/enforced line cleanly - you can watch both tools work the same problem and see exactly where one stops and the other starts.

The failure, concretely

You ask for a one-function change: add retry logic to fetchInvoices() in lib/billing.ts. The diff comes back with the retry logic - plus a reformatted lib/utils.ts (the agent's formatter disagreed with yours), a renamed variable in lib/reports.ts ("for consistency"), and a bumped dependency in package.json ("the old version had a known issue"). Each individual edit is defensible. Nobody asked for any of them. Your reviewer now owns a four-file diff for a one-function request, and the real change is the part they'll skim.

This is the Tang et al. number wearing a face: constraint violation, 38.33 percent, the top of the failure distribution.

The advisory fix, and its ceiling

You add the rule - to CLAUDE.md, or to a code-editing skill:

```markdown
## Scope discipline
Modify only files directly required by the requested change.
Do not reformat, rename, refactor, or upgrade anything you were not asked to touch.
If an out-of-scope change seems necessary, stop and ask before making it.
```

And it works - most of the time. Frequency drops noticeably; the agent cites the rule back to you; sessions feel more surgical. Then comes the long session, the context-heavy refactor, the task framing where "clean up the billing module" has already licensed six legitimate edits and the seventh - the one nobody asked for - pattern-matches as consistent with the rest. The model is not ignoring your rule. It is exercising judgment about what the rule means in context, which is precisely what advisory text asks it to do. Advisory reduces the frequency of the class. It cannot eliminate the class, because elimination is not a property text can have.

The residue that advisory leaves behind is exactly what the misalignment data shows: agents violating constraints the developer stated explicitly and the model demonstrably read.

The enforced fix

Orthogonal edits are checkable, so build the check. Declare the task's scope in a file, and put a gate on the edit path. The scope declaration:

```text
# .claude/task-scope.txt - one in-scope path prefix per line
lib/billing.ts
tests/billing.test.ts
```

The gate - and if this piece shows you one code block worth keeping, it is this one:

```python
#!/usr/bin/env python3
"""PreToolUse scope gate: block file modifications outside the declared task scope."""
import json
import os
import re
import sys
from pathlib import Path

SCOPE = Path(".claude/task-scope.txt")  # one in-scope path prefix per line


def blocked(msg):
    print(msg, file=sys.stderr)
    sys.exit(2)  # exit 2 blocks the tool call; stderr is fed back to Claude


try:
    event = json.load(sys.stdin)
    tool = event["tool_name"]
except (ValueError, KeyError):
    blocked("BLOCKED: scope gate could not parse the tool call. Failing closed.")

tool_input = event.get("tool_input", {})

if not SCOPE.exists():
    sys.exit(0)  # no scope declared for this task: the gate stays open

prefixes = [p.strip() for p in SCOPE.read_text().splitlines()
            if p.strip() and not p.startswith("#")]


def in_scope(path):
    if not path:
        return False
    resolved = Path(path).resolve()
    for prefix in prefixes:
        base = (Path.cwd() / prefix).resolve()
        if resolved == base or str(resolved).startswith(str(base) + os.sep):
            return True
    return False


if tool in ("Edit", "MultiEdit", "Write", "NotebookEdit"):
    target = tool_input.get("file_path", tool_input.get("notebook_path", ""))
    if not in_scope(target):
        blocked(
            f"BLOCKED: {target} is outside the declared scope for this task.\n"
            f"In-scope paths: {', '.join(prefixes)}.\n"
            "If this file genuinely must change, explain why and ask the user to "
            "widen .claude/task-scope.txt. Do not work around this gate."
        )

# Writes routed through the shell would bypass the Edit/Write gate entirely.
WRITE_SHAPED = re.compile(r"(>>?|\btee\b|\bsed\s+-i\b|\bmv\b|\brm\b|\bcp\b|\bgit\s+apply\b)")

if tool == "Bash" and WRITE_SHAPED.search(tool_input.get("command", "")):
    blocked(
        "BLOCKED: write-shaped shell command during a scoped task.\n"
        "File changes must go through the Edit or Write tool so they can be "
        "checked against the declared task scope. If this command does not "
        "modify any file, ask the user to approve running it as-is."
    )

sys.exit(0)
```

Registered in .claude/settings.json:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|MultiEdit|Write|NotebookEdit|Bash",
        "hooks": [
          { "type": "command", "command": "python .claude/hooks/scope_gate.py" }
        ]
      }
    ]
  }
}
```

Two details in that script are load-bearing, and both are the difference between a demo and a gate.

First, the matcher covers every mutation path, not just the obvious one. A scope gate that only intercepts Edit and Write is a gate with holes in it: the model can modify files through MultiEdit and NotebookEdit, and - the big one - it can still write through the shell with sed -i, output redirection, or tee. That last bypass is not hypothetical; it is a filed, acknowledged Claude Code issue (agents circumventing PreToolUse edit hooks via the Bash tool). The fix is a pattern I call the Write Funnel: make the Bash gate deliberately coarse - deny anything write-shaped and tell the model to use Edit instead - so that every file modification is funneled through the one tool where the fine-grained scope check lives. You do not need to parse shell commands well. You need to make the gated path the only path.

And hold the demo above to the article's own standard, because it does not fully meet it yet. WRITE_SHAPED is a denylist, and denylists do not close: python -c, perl -i, git restore, and patch are all write-shaped and unmatched - and one interpreter one-liner can delete .claude/task-scope.txt itself, after which the gate stands open. It also over-blocks: npm test 2>/dev/null modifies nothing that matters, and that friction is a real cost you accepted at the blast-radius question, not a free lunch. A production Write Funnel inverts the logic - during a scoped task, deny any shell command that is not on a read-only allowlist - and pairs the hook with the enforcement primitives the harness already ships: permission deny rules and sandboxing gate the same paths deterministically, without hand-rolled parsing. Hooks are the most programmable enforcement layer, not the only one. A gate with a known bypass is not a weak gate. It is advisory with extra steps. Enforcement is an engineering task, not a config line, and the difference shows up exactly here.

Second, the stderr message. When the gate fires, Claude does not see a permission dialog or a stack trace - it sees your stderr text as the stated reason the action failed, and the quality of that text determines whether the agent recovers gracefully or thrashes. "BLOCKED" alone produces retry loops and creative workarounds. The message above tells the model what was out of bounds, what is in bounds, and what the legitimate escalation path is: explain the need, ask the human to widen the scope file. The same over-eager turn that produced the four-file diff now produces a one-file diff and a sentence: "I also wanted to reformat lib/utils.ts, but it's outside the declared scope - widen it if you want that change." Run the funnel properly - allowlist, not denylist, backed by permission rules - and the class of silent scope violations is not reduced. It is eliminated. What survives is the conversation about scope, which is the part that actually needed human judgment.
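
The allowlist inversion described above can be sketched as follows. This is a minimal illustration, not a hardened parser: the `READ_ONLY` set, the `SHELL_META` character list, and the name `is_read_only` are all hypothetical choices made here, and a real deployment would still pair the check with permission deny rules and sandboxing. The idea is to stop recognizing dangerous commands and instead refuse anything that is not a single call to a program known to be read-only.

```python
import shlex

# Hypothetical allowlist: programs treated as read-only during a scoped task.
READ_ONLY = {"ls", "cat", "head", "tail", "grep", "wc", "pwd"}
# Any of these characters can chain, redirect, or substitute another command.
SHELL_META = set("><|;&`$(){}\n")


def is_read_only(command: str) -> bool:
    """Allow a shell command only if it is one allowlisted program call."""
    if not command.strip() or SHELL_META & set(command):
        return False  # empty, or contains chaining/redirection: fail closed
    try:
        argv = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: fail closed
    return bool(argv) and argv[0] in READ_ONLY
```

The sketch over-blocks on purpose: `python -c`, `perl -i`, `git restore`, and `patch` are all refused without being named, and so is `npm test 2>/dev/null`. That friction is the price of a check that closes by default instead of by enumeration.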

The judgment: when is the gate worth it?

Not always. A solo exploratory session on a throwaway branch does not want per-edit scope friction, and installing it there is the over-enforcement tax from the previous section. The usual rules of thumb for when a workflow deserves durable investment are stated in recurrence - Anthropic's trigger for skills is that you keep pasting the same procedure - or, informally, in time spent per task. Both are proxies for the real variable, which is blast radius: enforce when the cost of one failure exceeds the accumulated cost of the friction. A shared repository, a CI context, a junior-heavy team, a semi-autonomous agent running while you are in a meeting - the scope gate pays for itself the first time it converts a silent four-file diff into a one-line request for permission. Part 4 drew the same line for destructive commands: if you cannot recover from the violation on the next turn, it does not belong in a prompt. Blast radius is that rule generalized from "unrecoverable" to "more expensive than the gate."

Composing Guidance and Gates Across the Software Development Life Cycle (SDLC)

At the scale of a full SDLC, the advisory/enforced distinction stops being a per-rule choice and becomes a composition problem: which stages of your pipeline are gates, and which are guidance. Three patterns from production setups make the shape visible.

The chained security audit. One invocation runs secrets scan, then dependency and supply-chain analysis, then CI/CD configuration review, then a STRIDE threat model (spoofing, tampering, repudiation, information disclosure, denial of service, elevation of privilege). Look at the chain through the enforceability lens and it is not four steps of one kind: the scans are deterministic pass/fail - hooks in spirit, whatever file they live in - while the threat model is irreducibly judgment. A good chain treats them differently: a secrets hit halts the pipeline; a threat-model finding gets written up for a human. A chain that pretends all four steps are the same kind of step either blocks on judgment calls (over-enforcement) or advises on secrets (malpractice).

Spec decomposition into parallel worktrees. A skill that splits a settled spec into orthogonal mini-specs for subagents in separate git worktrees (Part 3 covers the isolation architecture). Split the pattern on the axis: the decomposition - deciding where the seams are - is judgment, and belongs in the skill. The orthogonality - the guarantee that parallel agents cannot write over each other - is exactly what worktree isolation makes deterministic. The skill proposes; the filesystem enforces. The pattern works because the two halves are assigned to the correct sides of the axis.

The composable SDLC chain. Brainstorm, plan, implement with Test-Driven Development (TDD), review - as linked single-purpose skills rather than one mega-skill. Composition only works if each link has a clean contract: the plan skill's output is the implement skill's input, checkable at the seam. A mega-skill has no contract, just vibes, and there is nowhere to put a gate because there is no seam for the gate to guard.
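
Taking the first of those patterns, the chained security audit, the difference between a step that halts and a step that reports can be made explicit in the chain's data model. The sketch below is one hypothetical way to do it, not a description of any shipped tool: `Step`, `run_chain`, and the `blocking` flag are names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[], list[str]]  # returns findings; an empty list means clean
    blocking: bool                # True: deterministic gate; False: advisory


def run_chain(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order; halt on a blocking finding, collect advisory ones."""
    report: list[str] = []
    for step in steps:
        findings = step.run()
        label = "BLOCK" if step.blocking else "REVIEW"
        report.extend(f"[{label}] {step.name}: {finding}" for finding in findings)
        if findings and step.blocking:
            return False, report  # a gate failed: later steps never run
    return True, report
```

Marking the secrets scan `blocking=True` and the threat model `blocking=False` encodes the distinction in one field, so a chain that blocks on judgment or merely advises on secrets becomes a visible misconfiguration rather than an accident of wording.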

The established "what makes a skill work" checklist - routing-precise descriptions, lean SKILL.md bodies, companion files over inline detail, deterministic scripts for deterministic sub-tasks, one job per skill, worked examples (Part 2 covers it as production knowledge infrastructure) - is good advice, and every item on it is downstream of the same unnamed question. Routing-precise descriptions matter because skills are advisory - routing is the only enforcement a skill will ever get, so the description carries all of it. "Deterministic code for deterministic sub-tasks" is the Enforceability Axis, applied inside the skill boundary. One-job-per-skill is what keeps the guidance/gate seams visible. The framework is correct; it just never names the variable that generates it.

And TDD deserves its own beat, because it is the place where the SDLC naturally manufactures an enforcement primitive. The test is the hook. "Write the test first" means "generate the gate before the code that has to pass it" - and in Claude Code this is literal, not metaphorical: a Stop hook that runs the suite and exits 2 on failure forces the agent back to work until the gate opens. (Whether your gates themselves hold is a testable question too - Part 8 covers testing the setup beyond the skill level.) Chang's fourth rule - Goal-Driven Execution, "define success criteria, loop until verified" - is this exact move in advisory form, one enforcement step from being real. Test-first development is the human-legible version of the advisory-to-enforced shift this whole article is about, which is why TDD-with-agents keeps outperforming its reputation: it is the one methodology where practitioners were already building the gate first.
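
A Stop hook of the kind described above might look like the following. It is a sketch under assumptions: the suite is assumed to run with `pytest -q`, and `TEST_COMMAND`, `run_suite`, `stop_gate`, and `main` are hypothetical names. The exit-code convention is the one the article has already stated: 2 blocks, and stderr goes back to the model.

```python
import subprocess
import sys
from typing import Callable

# Assumption: the project's suite runs with pytest; substitute your own command.
TEST_COMMAND = ["pytest", "-q"]


def run_suite() -> tuple[bool, str]:
    """Run the test suite and return (passed, tail of the combined output)."""
    proc = subprocess.run(TEST_COMMAND, capture_output=True, text=True)
    return proc.returncode == 0, (proc.stdout + proc.stderr)[-2000:]


def stop_gate(runner: Callable[[], tuple[bool, str]] = run_suite) -> tuple[int, str]:
    """Return (exit_code, stderr_message) deciding whether the agent may stop."""
    passed, output = runner()
    if passed:
        return 0, ""  # gate open: the agent is allowed to finish
    return 2, "BLOCKED: the test suite is failing. Fix these failures first.\n" + output


def main() -> int:
    """Hook entry point; call as sys.exit(main()) at the end of the script."""
    code, message = stop_gate()
    if message:
        print(message, file=sys.stderr)
    return code
```

A production version would also need a guard against looping forever on a suite that cannot pass; that guard is deliberately left out of this sketch.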

The Enforceability Test: Deciding Between Skill, Hook, or Neither

Here is the decision procedure. The explicit design goal is that it fits on an index card: three questions and an escape hatch. If you cannot reconstruct it at your next code review, it is too complicated.

```mermaid
flowchart TD
    Q0{"Will this rule matter<br/>beyond the current task?"}
    Q0 -->|No| N["Neither.<br/>Write a better prompt<br/>and move on."]
    Q0 -->|Yes| Q1{"Is the invariant<br/>checkable by code?"}
    Q1 -->|No| A["Advisory is the ceiling:<br/>CLAUDE.md rule or skill.<br/>Do not fake a gate."]
    Q1 -->|Yes| Q2{"Does one failure cost more<br/>than the ongoing friction?<br/>(blast radius)"}
    Q2 -->|No| A2["Leave it advisory:<br/>over-enforcement<br/>is an invisible tax"]
    Q2 -->|Yes| Q3{"Must it hold across sessions,<br/>people, and model states?"}
    Q3 -->|No| S["Skill<br/>(retrieved on relevance)"]
    Q3 -->|Yes| H["Hook<br/>(enforced on path)"]

    classDef decision fill:#7B68EE,stroke:#5A4FCF,color:#FFFFFF
    classDef advisory fill:#98D8C8,stroke:#6FB5A3,color:#2C2C2A
    classDef hookNode fill:#E74C3C,stroke:#C0392B,color:#FFFFFF
    classDef skillNode fill:#4A90E2,stroke:#2E6DA4,color:#FFFFFF
    classDef neither fill:#95A5A6,stroke:#7F8C8D,color:#2C2C2A
    class Q0,Q1,Q2,Q3 decision
    class A,A2 advisory
    class H hookNode
    class S skillNode
    class N neither
```

In order, for any rule you are about to encode:

1. Is the invariant checkable by code? If no deterministic function of the observable inputs - tool arguments, file paths, diff stats, command strings - can decide whether the rule was violated, the rule is advisory by nature. "Don't proceed on wrong assumptions" is not checkable; nothing computes "wrong" without already knowing "right." A CLAUDE.md rule or a skill is the ceiling, and that is fine. What is not fine is faking enforcement you cannot compute - an LLM-judge hook that "blocks" on a vibe is a probabilistic gate, which is to say a suggestion with latency.

2. Does one failure cost more than the ongoing friction? Blast radius, not task length. A checkable invariant with a cheap failure mode - inconsistent import ordering, a suboptimal but working pattern - stays advisory, because you can fix it on the next turn and the per-call friction of a gate never earns itself back. Remember which direction fails silently: under-enforcement shows you a bad diff; over-enforcement shows you nothing and costs you an afternoon three months later. Part 10 works this question for one concrete rule - "commit before a destructive Bash command" - and shows why Claude Code's own session checkpointing can't substitute for the hook this axis says the rule needs.

3. Must it hold regardless of model state, across sessions, for other people? If the rule only needs to hold for you, in this session, while you are watching - a skill is probably enough; you are the fallback enforcement. If it must hold at hour three of a context-heavy session, for every engineer on the team, for the semi-autonomous run nobody is watching - that is what "the hook runs regardless" is for. This is Part 4's Probabilistic-to-Deterministic Boundary, located by questionnaire instead of by incident.

And the answer the listicles never mention - the flowchart runs it first, as question zero: neither. A rule that will not outlive the current task does not earn persistence. A one-off task does not earn a skill; a skill you will invoke twice does not earn a hook. The machine you do not build is maintenance you do not owe, context you do not spend, and a gate nobody has to debug around next quarter. The discipline includes knowing when a better prompt is the whole answer.

Index-card version: Doesn't recur? Prompt. Can't check it? Advise it. Cheap to fail? Advise it. Must hold for everyone, always? Hook.
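
The same procedure, written as a function, is short enough to show that it really is four questions asked in a fixed order. The function name, parameter names, and return labels are hypothetical; the branching mirrors the flowchart.

```python
def enforceability_test(
    recurs: bool,
    checkable: bool,
    failure_costs_more_than_friction: bool,
    must_hold_everywhere: bool,
) -> str:
    """Walk the four questions in flowchart order and name the layer."""
    if not recurs:
        return "prompt"    # question zero: the rule will not outlive the task
    if not checkable:
        return "advisory"  # no code can decide it: CLAUDE.md rule or skill
    if not failure_costs_more_than_friction:
        return "advisory"  # checkable but cheap to fail: skip the gate
    return "hook" if must_hold_everywhere else "skill"


# The Karpathy failure modes, run through the test:
print(enforceability_test(True, False, True, True))  # silent wrong assumptions
print(enforceability_test(True, True, True, True))   # orthogonal edits, shared repo
```

The order of the checks matters: a rule that fails an earlier question never reaches a later one, which is why an uncheckable rule cannot become a hook no matter how much its failure costs.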

The Discipline Outlasts the Tooling: Advisory vs. Enforced Beyond Claude Code

Everything syntactic in this article will rot. Hook event names will change, the skills schema will version, CLAUDE.md may be a different file with a different name next quarter - the steering surface of every agent harness is churning. The judgment does not churn: advisory where you need adaptability, enforced where you need guarantees, and the discipline to know which failure you are preventing before you pick the tool. The same goes for the engineering underneath the judgment - a scope gate without a Write Funnel is advisory with extra steps, whatever harness you build it on. That was the thesis at the top - the skill failures that end in "skills are unreliable" are mostly category errors, and enforceability is the variable that sorts them - and it will still be the thesis when the config format is unrecognizable.

Read the Karpathy episode back through that lens and it lands differently than it did in January. The complaint was never that AI is bad at coding - his thread says the opposite. The complaint was that agents make ungated decisions in places that needed gates: assumptions nobody verified, scope nobody granted, abstractions nobody requested. And the ecosystem's answer - 187,000 stars of beautifully written advisory text - treats every one of those as a persuasion problem. Some of them are. One of them was checkable all along.

Put the gates where they belong, and stop asking the model to remember what you could have made impossible to forget. The shift that matters is from instructing the model to engineering the environment it operates in. Juniors write better prompts. Seniors change what's possible.

References

Karpathy, A. (2026). Thread on agent-heavy coding workflow and failure modes [X post, January 26, 2026]. https://x.com/karpathy/status/2015883857489522876
Chang, F. (2026). andrej-karpathy-skills [GitHub repository; originally forrestchang/andrej-karpathy-skills, since transferred to multica-ai]. https://github.com/multica-ai/andrej-karpathy-skills
Tang, N., Chen, C., Xu, G., Shi, Y., Huang, Y., McMillan, C., Dong, T., & Li, T. J.-J. (2026). How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions. arXiv:2605.29442. https://arxiv.org/html/2605.29442v1
Anthropic (2026). Steering Claude Code: skills, hooks, rules, subagents and more. Claude Blog. https://claude.com/blog/steering-claude-code-skills-hooks-rules-subagents-and-more
Anthropic (2026). Hooks reference. Claude Code Documentation. https://code.claude.com/docs/en/hooks
Anthropic (2026). Extend Claude with skills. Claude Code Documentation. https://code.claude.com/docs/en/skills
Anthropic (2026). Equipping agents for the real world with Agent Skills. Anthropic Engineering. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills
Anthropic (2026). Code Review. Claude Code Documentation. https://code.claude.com/docs/en/code-review
Anthropic (2026). Claude Code circumvents PreToolUse:Edit hook via Bash tool [GitHub issue #29709]. https://github.com/anthropics/claude-code/issues/29709
MindStudio (2026). Claude Code Skills vs Hooks: What's the Difference and When to Use Each. https://www.mindstudio.ai/blog/claude-code-skills-vs-hooks-difference
Build This Now (2026). CLAUDE.md, Skills, Subagents, Hooks: When to Use Which. https://www.buildthisnow.com/blog/tools/claude-code-skills-vs-subagents-vs-hooks
systemprompt.io (2026). Claude Code Hooks and Event-Driven Workflows. https://systemprompt.io/guides/claude-code-hooks-workflows
