For: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

Four Habits from the Creator of Claude Code That Will Change How You Ship

Boris Cherny runs 10-15 parallel sessions, ships 20-30 PRs a day, and calls his setup 'surprisingly vanilla.' The gap is not configuration. It is the operating model.

#claude-code #boris-cherny #production-workflow #spec-driven-development #parallelism #verification #context-hygiene #automation

Boris Cherny joined Anthropic as a senior engineer. His first pull request got rejected - not for logic errors, not for missing tests. For being hand-written. He was at the world's leading AI lab, surrounded by engineers who expected code to come from AI, and he had typed his out by hand. That rejection became the catalyst for everything that followed.

Boris is now the creator and head of Claude Code. He ships 20-30 pull requests a day, runs 10-15 parallel Claude sessions simultaneously, and calls his setup "surprisingly vanilla." He posted his workflow publicly in January 2026. It got 8 million views. Developers called it a watershed moment. Not because the configuration was exotic. Because it was simple - and it worked at a scale most engineers still think requires a larger team.

The gap is not in the tools. Every engineer reading this has access to the same Claude Code. The gap is in the operating model. Boris has four habits that, taken together, constitute a fundamentally different relationship with an AI coding agent. Not prompting it. Not steering it line by line. Running it as a fleet, briefed as a delegation, verified as a system.

This article unpacks each habit, translates it into concrete workflow decisions you can make this week, and connects it to the broader agent architecture series these principles underpin.


The Four Habits: Overview

Boris organizes his practice into four categories that build on each other:

01 - Hygiene: Treat context like a resource, not a trash can.
02 - Delegation: Brief Claude the way you'd brief an engineer. Plan mode first. Then auto.
03 - Parallelism: Five worktrees. Round-robin. Your wait-for-Claude latency drops to zero.
04 - Automation: If you're running it by hand, you're running it too slowly.

Each habit addresses a specific failure mode of naive agent use. Together they describe a complete operating model.


Habit 01: Treat Context Like a Resource, Not a Trash Can

"Treat context like a resource." - Boris Cherny, Slide 02

The most common failure mode with Claude Code is accumulation. You start a session, you work through a problem, Claude makes wrong turns, you correct them inline, and by the time you reach your destination the context is full of every wrong path, every rejected approach, every intermediate artifact from six decisions ago. The model is still running. The quality has quietly degraded.

Boris has four keystrokes for this. Each one is a different context hygiene decision:

/clear - Start fresh between unrelated tasks

/clear wipes the conversation and starts a new session. The critical detail: CLAUDE.md rehydrates automatically. Your project conventions, your stack context, your established rules - everything in CLAUDE.md loads fresh. Nothing else carries over.

When to use it: Every time you switch to a task that has no dependencies on your current session. Finishing a bug fix and moving to a new feature? /clear. Finishing a code review and starting a migration? /clear. The instinct to continue in the same session "because it already knows the context" is the instinct that produces context rot. The model does not know what it needs for the new task - it knows what you showed it for the last one.
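
What rehydrates is whatever you have written into CLAUDE.md. A minimal illustrative example - the sections and rules below are hypothetical, not a prescribed format:

```markdown
# CLAUDE.md - reloaded automatically at the start of every fresh session

## Stack
TypeScript (strict mode), Bun, Playwright for E2E.

## Conventions
- No new `any` annotations; prefer explicit types.
- Run `bun run test` before declaring a task done.

## Rules learned from past mistakes
- Never modify the UserModel schema without asking first.
```

Because this file is the only thing that survives /clear, every durable convention belongs here rather than in session history.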

/compact - Summarize and keep going for long threads

/compact does not clear context - it compresses it. Claude summarizes the conversation so far, retaining the working model of what has been done and decided, and resumes from that summary. You stay in the same session. The context window is freed.

When to use it: Long threads where continuity matters. A multi-hour implementation session where you need Claude to remember decisions made three hours ago but the raw transcript from those three hours is filling the window. The heuristic Boris uses: compact or clear before quality starts to drift, not after. By the time you notice degraded output, you are already paying for context rot. The correct time to compact is proactive, not reactive.

/rewind - Undo a bad turn

/rewind takes you back to the last good state and lets you re-prompt from there. This is not the same as editing your last message. It is resetting the session state to a checkpoint where Claude was still on the right track, and giving it a different direction from that point.

When to use it: When Claude goes off the rails. Not to argue your way back in the existing thread - that approach amplifies the wrong turn by adding more context on top of it. When Claude produces something wrong and you spend three messages trying to redirect it inline, you are paying the context cost of the wrong output plus the cost of the correction attempts. /rewind to the last good state and give it a better brief from there. Faster than correcting in-place.

Auto-compact - The safety net (don't rely on it)

Auto-compact fires automatically when the session approaches the context window limit. It is a safety net. It is not a strategy.

Boris's explicit note: auto-compact is fine as a fallback, but if you are letting sessions run to the limit before they compact automatically, you are letting quality drift before you address it. The proactive moves - /compact when threads get long, /clear between tasks - are what keep sessions sharp. Auto-compact catches what you missed. It does not replace the habit.

The principle underlying all four: Context is not a recording of everything that happened. It is a workspace. You manage a workspace by putting things away when you are done with them, not by letting everything accumulate until the desk collapses. Every token in the context window is a token competing for attention with the task at hand. Every token that doesn't need to be there is noise.


Habit 02: Brief Claude the Way You'd Brief an Engineer

"The single biggest predictor of output quality is the quality of the brief. Boris spends more time on the prompt than most people spend on the review." - Slide 03

The core mental model shift Boris makes explicit: treat Claude like an engineer you delegate to, not a pair programmer you guide line by line. These are categorically different working modes. A pair programmer needs continuous steering - you watch each step, you redirect continuously, you are the primary reasoning agent. An engineer you delegate to needs a clear brief, the right files, a defined success criterion, and the ability to ask if something blocks them.

Most developers using Claude Code are pair programming with it. Boris is delegating.

The Full-Context Brief

The difference between pair-programming with Claude and delegating to Claude shows up immediately in how you write the prompt.

Wrong way - pair programmer prompt:

```text
Help me refactor the auth module. It's getting messy.
```

This is a conversation opener, not a brief. Claude will ask clarifying questions, propose something generic, and require continuous steering. You are the primary reasoning agent. Claude is the typing assistant.

Right way - delegation brief:

```text
Goal: The auth module should validate JWT tokens using our shared AuthService
class. All existing tests must pass. No new `any` type annotations.
Constraints: TypeScript strict mode. No new dependencies. Do not modify
UserModel schema - pending migration in progress.
Files: src/auth/validator.ts, src/auth/AuthService.ts, src/shared/types.ts
Verify: Run `bun run test:auth` - all tests must pass. Run `tsc --noEmit` -
zero type errors.
Escape hatch: If you hit an ambiguous case involving the UserModel schema,
stop and ask. Don't guess around it.
```

Same task. Fundamentally different operating mode. The first prompt produces an ongoing conversation you must steer. The second brief produces an autonomous execution you review at the end.

Boris structures his briefs with five components:

1. The goal - What "done" looks like, stated as an outcome. Not "refactor the auth module." "The auth module should validate JWT tokens using our shared AuthService class, with all existing tests passing and no new any type annotations."

2. The constraints - The stack, the conventions, what not to touch. "TypeScript strict mode. No new dependencies. Don't modify the UserModel schema - it has a pending migration." Constraints prevent Claude from solving the problem in a way that creates a different problem.

3. Relevant files - Point to them by path. Don't make Claude hunt. Hunting costs context. If you know which files are relevant, provide the paths. Claude will read them. If you don't know, /plan and let Claude propose them before it writes a line of code.

4. How to verify - The test, the command, the screenshot to compare against. This is the element most engineers omit and Boris considers most important. "Run bun run test:auth and confirm all tests pass" is a verification criterion. "Make sure it works" is not. Without a concrete verification criterion, the agent will decide for itself when it is done - and agents are systematically overconfident about their own output.

5. The escape hatch - "If blocked, stop and ask. Don't guess." This one instruction prevents the failure mode where Claude encounters an ambiguous case, makes a judgment call you would have made differently, and proceeds for another 30 minutes in the wrong direction before you review the output. The escape hatch gives Claude explicit permission to pause instead of improvise.
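
The five components are mechanical enough to template. The sketch below is a hypothetical helper - the script name, defaults, and file layout are illustrative, not part of Claude Code - that assembles a brief into a spec.md ready to hand to Plan Mode:

```shell
#!/usr/bin/env bash
# write_spec.sh - hypothetical helper: assemble the five brief components
# into spec.md. Defaults echo the auth-module example from this article.
set -euo pipefail

GOAL=${1:-"The auth module should validate JWT tokens via AuthService; all tests pass."}
CONSTRAINTS=${2:-"TypeScript strict mode. No new dependencies. Do not modify UserModel schema."}
FILES=${3:-"src/auth/validator.ts, src/auth/AuthService.ts, src/shared/types.ts"}
VERIFY=${4:-"bun run test:auth && tsc --noEmit"}

cat > spec.md <<EOF
Goal: $GOAL
Constraints: $CONSTRAINTS
Files: $FILES
Verify: $VERIFY
Escape hatch: If blocked on an ambiguous case, stop and ask. Do not guess.
EOF

echo "Wrote spec.md"
```

The escape hatch is baked into the template deliberately: it is the one component that belongs in every brief regardless of task.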

Plan Mode First. Then Auto. The Three-Step Flow.

The single biggest unlock in Boris's workflow is not a configuration - it is a discipline: the Spec-Plan-Auto-Verify Loop, the four-phase cycle that makes one-shot implementation reliably possible:

```mermaid
flowchart LR
    A[Write spec.md\nhalf a page\noutcome + constraints]:::blue --> B[Plan Mode\nshift+tab twice\niterate until solid]:::purple
    B --> C{Plan matches\nwhat you had in mind?}:::purple
    C -->|No| D[Fix the plan\nnot the PR]:::yellow
    D --> B
    C -->|Yes| E[Auto Mode\nno prompts\nClaude one-shots]:::green
    E --> F[Verifier runs\nbun test / playwright\nacceptance criteria]:::teal
    F --> G{Passes?}:::purple
    G -->|No - feedback injected| E
    G -->|Yes| H[PR opens\nhuman reviews result\nnot every step]:::green

    classDef blue fill:#4A90E2,color:#fff,stroke:#3A7BC8
    classDef purple fill:#7B68EE,color:#fff,stroke:#6858DE
    classDef teal fill:#98D8C8,color:#fff,stroke:#88C8B8
    classDef yellow fill:#FFD93D,color:#333,stroke:#EFC92D
    classDef green fill:#6BCF7F,color:#fff,stroke:#5BBF6F
```
```text
STEP 01 · SPEC → STEP 02 · PLAN MODE → STEP 03 · AUTO MODE
```

Step 01: Write the spec. Half a page. User story, constraints, acceptance criteria. Drop it as spec.md in the repo. This is the brief from the previous section, written down. The act of writing it forces clarity that vague prompts avoid. If you cannot write a half-page spec for what you want, you do not know what you want well enough to delegate it.

Step 02: Plan Mode (shift+tab twice). Claude reads the repo. Proposes the files it will touch, the approach it will take, the risks it has flagged. You iterate. You say yes or push back. Zero code has been written. Zero wrong directions have been taken. This is where the leverage is: the plan surfaces misunderstandings before they become lines of code. Boris's rule: if the plan doesn't match what you had in mind, the code wouldn't have either. Fix the plan, not the PR.

/ultraplan offloads plan mode to a cloud session - useful for complex plans where you want to review asynchronously in-browser, then execute locally.

Step 03: Auto Mode. Switch to Auto - no prompts, no steering. Claude implements, runs the verifier (your test command, your acceptance criterion), opens the PR. You review the result, not every step. When the plan is solid, the one-shot implementation rate is high. When the plan is weak, you pay in correction cycles. The investment in Step 02 pays off in Step 03.

AI Review Before Human Review

When PR volume scales, review becomes the bottleneck. Boris's answer: at Anthropic, Claude triages every PR before a human looks at it.

The flow is: PR opens → webhook fires → Claude reviews via subagents (logic, security, perf, style) → inline comments with findings on the diff → human opens a pre-triaged diff.

The compounding mechanism: during human review, tag @claude on any comment with "add this to the review checklist." The reviewer gets sharper with every PR. Every pattern a human catches gets encoded for automated detection on future PRs. The quality system is self-expanding.

This is now available as Managed Code Review - Anthropic-hosted, multi-agent reviewer posting severity-tagged findings, tunable via REVIEW.md. Enable with /install-github-app.


Habit 03: Five Worktrees. Round-Robin.

"Five terminals. Five worktrees. Round-robin." - Boris, describing his setup as "surprisingly vanilla" - Slide 06

Boris's parallelism habit is the one that gets the most attention when people see his setup. The configuration is simple. The operating model shift is significant.

The Setup

```shell
# Create a worktree for each parallel workstream
git worktree add .claude/worktrees/feat-cart origin/main
git worktree add .claude/worktrees/perf origin/main
git worktree add .claude/worktrees/migrate origin/main
git worktree add .claude/worktrees/review origin/main
git worktree add .claude/worktrees/analysis origin/main

# Start Claude in each worktree
cd .claude/worktrees/feat-cart && claude --worktree feat-cart
cd .claude/worktrees/perf && claude --worktree perf
# ...
```

Each worktree is a clean checkout of the same repo. Five tabs in the terminal, each numbered, each with its own worktree, each running its own Claude session. No shared state between workstreams. No clobbering. One repo.

The Flow: Round-Robin Latency Drops to Zero

The workflow is not about doing five things simultaneously. It is about eliminating idle time.

Kick off Tab 1. While Claude works on Tab 1 (which takes 90 seconds, or 5 minutes, or 20 minutes depending on the task), switch to Tab 2 and kick off that task. Tab 3. Tab 4. Tab 5. Come back to Tab 1 - it is done or needs input. Review. Redirect or approve. Move to Tab 2.

Your wait-for-Claude latency drops to zero because you are never waiting. You are always working on the next thing while Claude works on the current thing. The bottleneck shifts from Claude's generation speed to your attention - and attention is the scarce resource that should be respected.

Boris uses system notifications so any Claude that needs input surfaces to his attention without him having to watch all five tabs constantly. iTerm2 notifications work. Claude's built-in notification system on the Desktop app also handles this.

claude --teleport moves a running terminal session into the Claude Desktop app (which has a built-in browser and worktree UI), and back. Start a session on mobile, teleport it to your desk. Start on desktop, land on phone. The same session continues across environments. This is relevant for his parallel mobile sessions: he starts 5-10 more sessions in the browser, monitors them on mobile, teleports the ready ones to his desktop for review.

The total: 5 local terminal sessions + 5-10 browser sessions = 10-15 simultaneous workstreams. One human. Zero waiting.

What Worktrees Enable Beyond Parallelism

The analysis worktree Boris mentions is specifically read-only. No implementation work happens there - it is for reading logs, running BigQuery, exploring large datasets. Keeping it separate prevents context accumulated during exploratory work from polluting implementation sessions.

This is the parallelism pattern from the previous article on subagents applied at the human workflow level, not just the agent level. Separate contexts for separate concerns. The principle is the same whether you are isolating agent subcontexts or isolating human worktree sessions.


Habit 04: If You're Running It By Hand, You're Running It Too Slowly

"Once a workflow is working, Boris moves it off the keyboard. The same three primitives wire up for nearly everything." - Slide 07

The automation habit is the one that explains how Boris ships 20-30 PRs per day while doing less typing. Every workflow that works twice gets moved off the keyboard. This is the ladder:

```text
Manual session → Scripted command → Scheduled job → Headless service / Managed agent
```

Every time something works twice, move it up one rung.

Primitive 1: Claude Code as a Unix Tool

```shell
# --output-format stream-json gives you structured output
claude --output-format stream-json "Run the linter and report any errors"

# Pipe anything in
cat failing-tests.log | claude "Diagnose these failures and propose fixes"

# Claude in a bash script
#!/bin/bash
DIFF=$(git diff main)
echo "$DIFF" | claude --output-format text "Review this diff for security issues"
```

--output-format stream-json is the key flag. It gives you machine-parseable structured output from Claude, which makes it composable with every other CLI tool in your environment. Claude becomes a step in a bash pipeline, a GitHub Action, a cron job - not a product you interact with manually.

Boris uses /commit-push-pr dozens of times daily. That is a slash command (scripted prompt) that takes a task and automates the entire commit → push → PR workflow. One invocation. Zero manual steps.
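
Custom slash commands are plain Markdown prompt files under .claude/commands/. The exact contents of Boris's version are not public; a hedged sketch of what such a command file might contain:

```markdown
<!-- .claude/commands/commit-push-pr.md - illustrative sketch, not Boris's actual file -->
Stage the current changes, write a conventional-commit message that summarizes
the diff, commit, and push the current branch (set an upstream if none exists).
Then open a PR with `gh pr create`: summary of changes, how they were verified,
and any follow-up TODOs in the body. Never force-push.
```

Once saved, typing `/commit-push-pr` in any session expands to this prompt - one invocation, zero manual steps.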

Primitive 2: Claude as a Service (Agent SDK + Managed Agents)

The Agent SDK exposes the same loop the CLI uses programmatically. You spawn Claude inside your own application: triage queues, review pipelines, async agent workflows that run on triggers, not on your keystrokes.

Managed Agents (launched April 8, 2026) takes this further: Anthropic hosts the infrastructure. You define the agent logic, business rules, and integrations. Anthropic handles sandboxing, state management, scaling, and tool execution at $0.08/session-hour plus standard token costs. You do not manage the underlying infrastructure.

The use case Boris describes: automation workflows that run without his involvement. Nightly dependency audits. Morning standup digests. Friday-kickoff refactors. These are not sessions he starts and monitors - they are scheduled agents that run, produce output, and surface results for him to review asynchronously.

Primitive 3: /schedule - Work That Happens Without You

Save a routine once; it runs on Anthropic-managed cloud infrastructure - on a cadence, an API call, or a GitHub event.

The pattern: any workflow you run more than twice on a schedule is a candidate for /schedule. Nightly dependency audits, CI failure triage, documentation freshness checks, performance regression detection. These do not require your attention to initiate. They require your attention only when they surface findings.

The ladder in practice:

A developer first runs "check for deprecated dependencies" manually in Claude. It works. They run it again next week. That pattern triggers the rung-up: they write a bash script that runs Claude with the right context and saves the output. That script goes into a GitHub Action on a weekly schedule. Months later, that action is producing findings every week and the developer has not typed a single character to initiate it since they moved it to the action.
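
That rung-two script might look like the sketch below. Everything here is illustrative - the file names and prompt are assumptions, not a documented workflow - and it degrades gracefully on machines without the claude CLI:

```shell
#!/usr/bin/env bash
# dep_audit.sh - hypothetical "scripted command" rung of the automation ladder.
set -euo pipefail

PROMPT="List deprecated or unmaintained dependencies in package.json.
For each: name, current version, suggested replacement."

OUT_FILE="dep-audit-$(date +%F).md"

if command -v claude >/dev/null 2>&1; then
  # --output-format text keeps the report redirectable, per Primitive 1
  claude --output-format text "$PROMPT" > "$OUT_FILE"
else
  # Fresh CI image or dev box without the CLI: record the skip, do not fail
  echo "claude CLI not found; skipping audit" > "$OUT_FILE"
fi

echo "Report: $OUT_FILE"
```

Dropping this into a weekly GitHub Action is the rung-three move; nothing about the script itself changes.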

Boris's explicit principle: every time something works twice, move it up one rung. The manual session teaches you what the workflow needs. The scripted command removes the friction. The scheduled job removes the initiation. The managed agent removes the monitoring.


The Verification Loop: The Three Reasons Agents Lose the Plot

Underlying all four habits is a verification philosophy that Boris considers the single highest-leverage practice:

"The single most important thing for great results - give Claude a way to verify its work. That feedback loop will 2-3× the quality of the final result." - Boris Cherny

Agents fail for three reasons (Slide 08), and all three are verification failures:

01 - Context: Can't carry state. Finite window, amnesia between sessions, coherence degrades as window fills, wraps up early as the limit nears. Solution: Habit 01 - active context hygiene.

02 - Planning: Can't size the work. Tries to one-shot the whole project, or sees partial progress and calls it done, or runs out of context mid-feature. Solution: Habit 02 - Plan Mode before Auto.

03 - Verification: Can't judge its own output. Tests shallowly (curl, not click), stubs features and moves on, rates mediocre work as good. This is the loop most teams never close.

The verification failure is the most consequential because it is invisible. An agent that cannot carry state shows visible degradation. An agent that fails to size the work produces visible incomplete output. An agent that cannot judge its own output produces visible passing tests and invisible wrong behavior.

Planner → Generator → Evaluator: The GAN-Inspired Pattern

The architectural response to verification failure is role separation (Slide 10):

```text
Planner → Generator → Evaluator
  • Planner: Expands a 1-4 sentence prompt into a full spec and feature list. Hands off to Generator.
  • Generator: Builds the code. Iterates in sprints. Commits to git. Hands off with progress artifacts.
  • Evaluator: Runs the product. Grades against a rubric. Returns actionable feedback - not rubber-stamp approval.
```

Anthropic's internal harness runs 5-15 loops per task, for up to six hours. The Evaluator is not a one-shot review - it drives the next iteration. The Generator does not stop when it thinks it is done. It stops when the Evaluator's rubric passes.
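
The control flow of that harness can be sketched in a few lines of shell. Both roles are stubbed here - a real loop would shell out to claude for the Generator and the Evaluator, and the pass-on-attempt-three behavior is purely illustrative:

```shell
#!/usr/bin/env bash
# Generator/Evaluator loop sketch. Stubs stand in for claude invocations.
set -euo pipefail

generate() { echo "attempt $1" > artifact.txt; }   # stub Generator: writes the artifact
evaluate() {                                        # stub Evaluator: passes on attempt 3
  if [ "$1" -ge 3 ]; then echo "PASS"; else echo "FAIL: needs work"; fi
}

MAX_LOOPS=5
for i in $(seq 1 "$MAX_LOOPS"); do
  generate "$i"
  VERDICT=$(evaluate "$i")
  echo "loop $i: $VERDICT"
  if [ "$VERDICT" = "PASS" ]; then break; fi
done
```

The structural point: the Generator never decides it is done. Only a PASS from the Evaluator's rubric ends the loop.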

Build This Today: The Adversarial Reviewer Subagent

The immediate, buildable implementation of the Planner→Generator→Evaluator pattern is a single subagent definition (Slide 11):

```markdown
# .claude/agents/adversarial-reviewer.md
---
name: adversarial-reviewer
description: >
  Skeptical code reviewer. Invoke before opening a PR.
  Finds bugs, challenges assumptions, penalizes AI-generic output.
model: claude-opus-4-5-20251001   # use latest capable model; update as new versions release
tools: [Read, Grep, Bash]
---

You are an adversarial reviewer. Your job is to find reasons NOT to ship this change.

## Grade on
- Correctness (weight 3): does it do the thing?
- Edge cases (weight 3): null, empty, boundary?
- Security (weight 3): input validation, authz
- Style drift (weight 2): matches house patterns?
- Originality (weight 1): not generic AI output

## Output
Per criterion: PASS / FAIL + specific file:line evidence.
Return the weakest FAIL first. Do not soften your assessment.
```

The main agent produces. It then calls the adversarial reviewer as a subagent. The reviewer returns a critique the main agent must address before landing a PR. The rubric is the loss function. Weighted criteria turn subjective judgment into scores. Put the rubric in the agent's prompt file and iterate on it - it improves with every PR that teaches you something.

Boris's framing: "Where rubrics live. Weighted criteria that turn subjective judgement into scores. Put the rubric in the agent's prompt file and iterate on it - it's a loss function in Markdown."


The Operating Model: What These Four Habits Add Up To - The Fleet Commander Pattern

Taken individually, each habit is a useful productivity improvement. Taken together, they describe a complete shift in how software gets built - what Boris himself has implicitly defined as the Fleet Commander Operating Model: the engineer as director of parallel, delegated, verified workstreams rather than author of sequential, typed-by-hand code.

The human role shifts from author to director. You do not write the code. You write the spec, the brief, the rubric. You review the plan before implementation starts. You review the result after verification completes. The typing is not where your value is. It never was.

The agent is not a tool. It is a worker. Boris schedules Claude sessions the way a manager allocates engineering capacity: queue them, give them clear briefs, let them work, review their output. The failure mode of treating Claude as a pair programmer (guide every step, watch every line) produces the output of a pair programmer session - bounded by how fast you can type instructions.

Verification is not optional. It is the system. Without a way for Claude to verify its own work, you are the verification system. You are reviewing every output manually. That is the bottleneck in the workflow. The Planner→Generator→Evaluator pattern, the adversarial reviewer subagent, the How to verify field in every brief - all of these are mechanisms for making the agent responsible for its own quality instead of making you responsible for catching every mistake.

Every habit that works twice gets automated. The automation ladder (manual → scripted → scheduled → managed) is how a single engineer runs at the output capacity of a small engineering department. Not by working faster. By eliminating the manual initiation of work that has already proven itself.


The Workflow End-to-End: A Day Running Boris's System

To make the habits concrete, here is what a single workstream looks like running all four habits together:

Morning: Open five worktrees. Start five Claude sessions. On mobile, kick off three more browser sessions for longer-running tasks (nightly audit review, documentation freshness check, dependency update PR).

First task: Write a half-page spec for the authentication refactor. Drop it as spec.md. Enter Plan Mode (shift+tab twice). Claude reads the repo, proposes the approach, lists the files it will touch. Iterate until the plan is clean. Switch to Auto Mode. Claude implements, runs bun run test:auth, opens a PR.

While Tab 1 runs: Brief Tab 2 with a new spec. Tab 3 is running the perf regression analysis in the read-only analysis worktree. Tab 4 has a migration running in its own isolated checkout. Zero waiting.

When Tab 1 completes: Review the PR. If it is wrong, it is not wrong by much - the plan was reviewed. The adversarial reviewer subagent has already posted findings inline. Tag @claude on any pattern that should become a review rule. /commit-push-pr.

Context discipline throughout: When a task finishes and the next one is unrelated - /clear. When a long thread is accumulating - /compact before quality drifts. When Claude goes off the rails - /rewind to the last good state.

End of week: Any workflow that ran manually three times this week is a candidate for /schedule. Write the command. Test it. Move it to scheduled.


Operational Checklist: Implementing Boris's Habits

Habit 01 - Context Hygiene

  • Are you clearing context between unrelated tasks (/clear) rather than continuing in the same session?
  • Are you compacting proactively (/compact) before quality drifts, not after?
  • When Claude goes wrong, are you using /rewind instead of correcting inline?
  • Have you stopped relying on auto-compact as a primary strategy, keeping it as a fallback only?

Habit 02 - Delegation Quality

  • Does every brief include: goal (outcome), constraints (what not to touch), relevant files (by path), how to verify (concrete command), escape hatch (stop and ask if blocked)?
  • Are you entering Plan Mode before any Auto Mode session longer than 15 minutes?
  • Is the spec written before Plan Mode starts, not constructed during it?
  • Is your project's CLAUDE.md capturing every mistake as a rule?

Habit 03 - Parallelism

  • Are you using worktrees (git worktree add) rather than branches for parallel sessions?
  • Do you have system notifications configured so Claude sessions surface to your attention without monitoring?
  • Is there a dedicated read-only worktree for analysis and exploration?
  • Is your wait-for-Claude latency actually zero, or are you watching one session at a time?

Habit 04 - Automation

  • Have you identified workflows you run manually more than twice?
  • For those workflows: are they scripted commands, scheduled jobs, or managed agents - or are they still manual sessions?
  • Does your PostToolUse hook run the formatter after every file edit, unconditionally?
  • Have you set up the adversarial reviewer subagent before every PR opens?
