Every engineering team eventually inherits a codebase with no architecture documentation. A new hire joins and spends three weeks mapping dependencies by reading code. A tech lead wants to present the system to stakeholders but has nothing to show. A modernisation project begins and no one can agree on the current boundaries.
The standard solution is to schedule an architecture workshop, assign someone to "write the doc," and watch it stall under competing priorities. The doc either never gets written, or it gets written once and immediately starts drifting from reality.
This guide builds an automated solution: a 12-node LangGraph pipeline that takes a Git repository as input and produces a validated, production-ready architecture document as output — complete with component diagrams, sequence diagrams, a debt register, and ADR stubs. The pipeline runs the 3-pass analysis methodology automatically, pauses for human review at the right moment, and loops back for refinement when gaps are found.
What you will build
By the end of this guide you will have a working pipeline that:
- Clones any Git repository and groups source files by module boundary
- Summarises each module using Claude, applying the 5 architecture compression rules
- Generates a Mermaid component diagram with 5–12 capability-named components
- Traces the top 3 revenue-generating flows and produces sequence diagrams
- Runs 4 validation gates — static analysis, developer review, runtime comparison, and failure scenarios
- Pauses for human review at Gate 2, persists state, and resumes when feedback is submitted
- Loops back for refinement when gaps are found, then publishes a final verified document
You should be comfortable with LangGraph's core concepts — StateGraph, nodes, edges, and conditional routing. This guide does not re-explain those basics. It goes directly to the decisions specific to codebase analysis: why module chunking matters, how to design state that survives a human interrupt, and how to wire a refinement loop without hitting a chain's architectural limits.
This guide implements the 3-pass architecture methodology — a structured approach to reading an unfamiliar codebase covered in the companion guide "From Unknown Codebase to Architecture Document." If terms like "compression rules," "validation gates," or "the 12-section output template" are unfamiliar, reading that guide first will give you the context this one assumes.
The 12-node pipeline at a glance
| # | Node | Responsibility | Phase |
|---|---|---|---|
| 1 | ingest_repo | Clone repo · parse build file · extract dependency graph | Pass 1 |
| 2 | chunk_by_module | Group files by package boundary · never by individual file | Pass 1 |
| 3 | pass1_structure | LLM per module chunk · accumulate module summaries | Pass 1 |
| 4 | pass2_behavior | Trace top 3 flows · map state stores · async boundaries | Pass 2 |
| 5 | pass3_compress | Apply 5 compression rules · generate Mermaid · score anti-patterns | Pass 3 |
| 6 | assemble_draft | Fill 12-section template · mark all claims [unverified] | Assembly |
| 7 | gate1_static | Run jdeps/depcruise · diff import graph vs component diagram | Gate 1 |
| 8 | gate2_human | interrupt() — pause for developer review · collect feedback | ⏸ Human |
| 9 | gate3_runtime | Fetch logs/APM · compare top flows vs sequence diagrams | Gate 3 |
| 10 | gate4_failure | 4 standard failure scenarios · LLM evaluator · gap reporter | Gate 4 |
| 11 | refinement | Update diagrams + debt list · conditional loop back to node 5 | ↻ Loop |
| 12 | publish_doc | Write Markdown · generate ADR stubs · output debt register | Output |
Why LangGraph, not a simple chain
If you've used LangGraph before, you know the answer. But it's worth being precise about which features this pipeline actually requires — because teams that try to build this on a chain hit every one of these walls.
Cycles — the chain killer. Node 11 (refinement) conditionally loops back to node 5 (compress) if major gaps are found. No chain architecture can do this. You discover this requirement on the first real codebase, not in testing.
Human interrupt at node 8. Gate 2 pauses execution and waits hours or days for developer review. LangGraph's interrupt() + checkpointer handles this natively. Without it, you lose all state when the process idles.
State persistence across 12 nodes. Module summaries from node 3 must be readable by node 10. The checkpointer persists all state to SQLite/Postgres automatically — no manual serialisation between steps.
Conditional edges at gates. Each gate can either pass (proceed) or fail (route to refinement). LangGraph's conditional edges handle this with a simple router function — no complex if-else chains in application code.
1 · State design — the most important decision
State is the contract between every node. Get this wrong and you will rewrite it twice. Define it as a TypedDict before touching any node implementation.
The complete state schema
```python
# state.py — define this first, before any node
from typing import TypedDict, Optional


class ModuleChunk(TypedDict):
    module_id: str           # e.g. "com.example.payment"
    layer: str               # "handler" | "service" | "dao" | "domain" | "integration"
    files: list[str]         # absolute paths of every file in this module
    file_contents: dict      # filename → source code string
    import_count: int        # used to sort chunks by complexity


class ModuleSummary(TypedDict):
    module_id: str
    capability: str          # compressed name — "Order Management", not "OrderService"
    responsibility: str      # one sentence
    dependencies: list[str]  # other module_ids this depends on
    patterns: list[str]      # detected design patterns
    anti_patterns: list[str] # detected violations
    raw_llm_output: str      # keep for Gate 2 display


class FlowTrace(TypedDict):
    flow_name: str           # "place_order", "authenticate_user"
    priority: int            # 1=revenue, 2=frequency, 3=complex, 4=failure-prone, 5=auth
    mermaid_sequence: str    # full Mermaid sequenceDiagram source
    consistency_boundary: str
    async_boundaries: list[str]
    external_calls: list[str]


class DebtItem(TypedDict):
    title: str
    location: str
    risk_score: int          # 1-5
    effort_score: int        # 1-5
    decision: str            # "Fix Now" | "Fix Next Sprint" | "Plan" | "Accept" | "Skip"
    owner: Optional[str]
    target_date: Optional[str]


class PipelineState(TypedDict):
    # ── Input ──
    repo_path: str
    target_stack: str             # "java" | "python" | "nodejs" | "dotnet" | "go" | "ruby"
    log_source: Optional[str]     # APM endpoint or log path for Gate 3

    # ── Pass 1 outputs ──
    build_file_summary: dict
    dependency_graph: dict        # module_id → list of module_ids it imports
    module_chunks: list[ModuleChunk]
    module_summaries: list[ModuleSummary]
    entry_points: list[str]
    architecture_style: str

    # ── Pass 2 outputs ──
    flow_traces: list[FlowTrace]
    state_stores: list[dict]
    async_boundaries: list[str]

    # ── Pass 3 outputs ──
    component_diagram: str        # Mermaid source
    anti_pattern_score: int
    anti_pattern_violations: list[str]
    debt_register: list[DebtItem]

    # ── Draft doc ──
    draft_doc: str                # full Markdown with [unverified] markers

    # ── Gate results ──
    gate1_violations: list[str]   # import graph vs diagram mismatches
    gate1_passed: bool
    human_feedback: Optional[str] # structured feedback from Gate 2
    gate2_approved: bool
    gate3_gaps: list[str]         # flows in logs not in diagrams
    gate3_passed: bool
    gate4_gaps: list[str]         # failure scenarios the doc can't explain
    gate4_passed: bool

    # ── Refinement tracking ──
    refinement_count: int         # prevent infinite loops — max 2
    refinement_notes: list[str]   # audit trail of what changed each loop

    # ── Final output ──
    final_doc: str
    adr_stubs: list[str]
```

If you add fields incrementally, Node 8 will fail because it expects fields that Node 3 never wrote. Define the complete schema upfront, use Optional for fields that are populated later, and initialise everything to sensible defaults in the graph entry point.
State flow across nodes
| Node | Reads from state | Writes to state |
|---|---|---|
| ingest_repo | repo_path, target_stack | build_file_summary, dependency_graph, entry_points |
| chunk_by_module | repo_path, dependency_graph, target_stack | module_chunks |
| pass1_structure | module_chunks, build_file_summary | module_summaries, architecture_style |
| pass2_behavior | module_summaries, entry_points | flow_traces, state_stores, async_boundaries |
| pass3_compress | module_summaries, flow_traces | component_diagram, anti_pattern_score, debt_register |
| assemble_draft | Everything above | draft_doc |
| gate1_static | dependency_graph, component_diagram | gate1_violations, gate1_passed |
| gate2_human | draft_doc, gate1_violations | human_feedback, gate2_approved |
| gate3_runtime | flow_traces, log_source | gate3_gaps, gate3_passed |
| gate4_failure | draft_doc, component_diagram | gate4_gaps, gate4_passed |
| refinement | All gate results + human_feedback | component_diagram, draft_doc, refinement_count |
| publish_doc | Everything | final_doc, adr_stubs |
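The table encodes a convention worth keeping strict: each node returns only the keys it writes, and the graph merges that partial dict into the full state. A minimal sketch of that merge behaviour in plain Python (no LangGraph dependency; gate1_static_stub is a stand-in for the real gate):

```python
# Each node returns a partial update; the graph merges it into the full state.
# This mirrors LangGraph's default last-write-wins merge for plain fields.
def apply_node(state: dict, node) -> dict:
    update = node(state)
    # A node must never return keys it does not own per the table above
    return {**state, **update}

def gate1_static_stub(state: dict) -> dict:
    violations = []  # stand-in for the real diagram-vs-imports diff
    return {"gate1_violations": violations, "gate1_passed": not violations}

state = {"dependency_graph": {}, "component_diagram": "graph TD"}
state = apply_node(state, gate1_static_stub)
```

If a node returns a key outside its column in the table, a later refinement loop can silently clobber it, so it is worth asserting ownership in tests.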
2 · Chunking logic — the quality multiplier
Chunking quality determines everything downstream. A PaymentService.java analysed in isolation looks self-contained. Its dependency on OrderRepository, its position in the layer stack, and the boundary decisions around it are all in the surrounding package. Send the file alone and the LLM hallucinates its dependencies.
Sending individual files is the single biggest quality failure in codebase analysis agents. It produces module summaries with invented dependencies, which produces a wrong component diagram, which produces a useless architecture document.
Implementation
```python
# chunker.py
from pathlib import Path
from collections import defaultdict

from state import ModuleChunk, PipelineState


def java_package(path: Path, repo: Path) -> str:
    # Read "package com.example.payment" from the file header
    try:
        for line in path.read_text(errors="replace").split("\n")[:10]:
            if line.strip().startswith("package "):
                return line.strip()[8:].rstrip(";")
    except OSError:
        pass
    # Fallback: use directory as module id
    return str(path.parent.relative_to(repo)).replace("/", ".")


# python_package, node_module, go_package and dotnet_namespace follow the
# same pattern: derive a module id from the file's package/namespace or path.
STACK_CONFIG = {
    "java":   {"ext": [".java"],      "module_from": java_package},
    "python": {"ext": [".py"],        "module_from": python_package},
    "nodejs": {"ext": [".ts", ".js"], "module_from": node_module},
    "go":     {"ext": [".go"],        "module_from": go_package},
    "dotnet": {"ext": [".cs"],        "module_from": dotnet_namespace},
}


def chunk_by_module(state: PipelineState) -> dict:
    repo = Path(state["repo_path"])
    cfg = STACK_CONFIG[state["target_stack"]]

    groups: dict[str, list] = defaultdict(list)
    for path in repo.rglob("*"):
        if path.suffix not in cfg["ext"]:
            continue
        if is_test_file(path):   # skip tests — they add noise
            continue
        if is_generated(path):   # skip generated code
            continue
        module_id = cfg["module_from"](path, repo)
        groups[module_id].append(path)

    chunks = []
    for module_id, files in groups.items():
        contents = {}
        for f in files:
            try:
                contents[f.name] = f.read_text(errors="replace")
            except OSError:
                pass

        # Estimate import count for complexity sorting
        import_count = sum(c.count("import") for c in contents.values())

        chunks.append(ModuleChunk(
            module_id=module_id,
            layer=infer_layer(module_id, list(contents.keys())),
            files=[str(f) for f in files],
            file_contents=contents,
            import_count=import_count,
        ))

    # Sort: high import count first — richer context for early LLM calls
    chunks.sort(key=lambda c: c["import_count"], reverse=True)
    return {"module_chunks": chunks}


def infer_layer(module_id: str, filenames: list[str]) -> str:
    m = module_id.lower()
    if any(x in m for x in ["controller", "handler", "router", "api", "web"]):
        return "handler"
    if any(x in m for x in ["service", "usecase", "business", "facade"]):
        return "service"
    if any(x in m for x in ["dao", "repo", "repository", "store"]):
        return "dao"
    if any(x in m for x in ["model", "domain", "entity"]):
        return "domain"
    if any(x in m for x in ["client", "adapter", "integration"]):
        return "integration"
    return "unknown"
```

If a module chunk exceeds ~8,000 tokens, split it: keep the largest file intact and summarise the smaller ones as one-line stubs. A chunk with one fully detailed file plus brief stubs of its siblings outperforms a truncated chunk every time.
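That splitting rule can be sketched as a small pre-processing step. MAX_TOKENS, the characters-per-token heuristic, and the split_oversized helper are all assumptions for illustration, not part of the chunker above:

```python
MAX_TOKENS = 8_000    # assumed per-chunk budget
CHARS_PER_TOKEN = 4   # rough heuristic for source code

def split_oversized(chunk: dict) -> dict:
    # Hypothetical helper: keep the largest file intact, stub the rest
    budget_chars = MAX_TOKENS * CHARS_PER_TOKEN
    total = sum(len(c) for c in chunk["file_contents"].values())
    if total <= budget_chars:
        return chunk  # already fits; leave untouched
    largest = max(chunk["file_contents"], key=lambda k: len(chunk["file_contents"][k]))
    trimmed = {}
    for name, content in chunk["file_contents"].items():
        if name == largest:
            trimmed[name] = content  # full detail for the anchor file
        else:
            # One-line stub: enough to show the sibling exists and its role
            trimmed[name] = f"[stub] {content.split(chr(10), 1)[0]}"
    return {**chunk, "file_contents": trimmed}
```

A real implementation would count tokens with the model's tokenizer rather than a character heuristic, but the shape of the trade-off is the same.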
3 · Node wiring — LangGraph graph construction
Pass 1 nodes — structure discovery
```python
# nodes/pass1.py
from langchain_anthropic import ChatAnthropic

from state import PipelineState, ModuleSummary

llm_sonnet = ChatAnthropic(model="claude-sonnet-4-6", max_tokens=4096)
llm_haiku = ChatAnthropic(model="claude-haiku-4-5-20251001", max_tokens=1024)


def ingest_repo(state: PipelineState) -> dict:
    repo = state["repo_path"]
    stack = state["target_stack"]

    build_summary = parse_build_file(repo, stack)
    dep_graph = run_dependency_analysis(repo, stack)
    entry_points = find_entry_points(repo, stack)

    return {
        "build_file_summary": build_summary,
        "dependency_graph": dep_graph,
        "entry_points": entry_points,
    }


def pass1_structure(state: PipelineState) -> dict:
    summaries = []
    for chunk in state["module_chunks"]:
        prompt = build_module_prompt(chunk, state["build_file_summary"])
        response = llm_sonnet.invoke(prompt)
        summary = parse_module_summary(response.content, chunk["module_id"])
        summaries.append(summary)

    # Infer architecture style using Haiku (cheaper, simpler task)
    style_prompt = build_style_prompt(summaries, state["dependency_graph"])
    style = llm_haiku.invoke(style_prompt).content.strip()

    return {
        "module_summaries": summaries,
        "architecture_style": style,
    }
```

Pass 2 nodes — behavior tracing
```python
# nodes/pass2.py
# Flow priority mirrors the sequence in Guide 1 — revenue flows first
# See: [*From Unknown Codebase to Architecture Document*](/from-unknown-codebase-to-architecture-document-a-complete-practitioners-guide)

FLOW_PRIORITY = [
    ("revenue", 1),    # order placement, payment, account creation
    ("frequency", 2),  # search, list views, dashboard load
    ("complex", 3),    # multi-step workflows, approval chains
    ("failure", 4),    # external integrations, scheduled jobs
    ("auth", 5),       # login, session management
]


def pass2_behavior(state: PipelineState) -> dict:
    summaries = state["module_summaries"]
    entry_points = state["entry_points"]

    # Step 1: identify top 3 flows to trace
    flows_to_trace = identify_key_flows(summaries, entry_points, llm_haiku)

    # Step 2: trace each flow and generate sequence diagram
    traces = []
    for flow in flows_to_trace[:3]:  # cap at 3
        prompt = build_flow_trace_prompt(flow, summaries)
        response = llm_sonnet.invoke(prompt)
        trace = parse_flow_trace(response.content, flow)
        traces.append(trace)

    # Step 3: identify state stores and async boundaries
    state_prompt = build_state_analysis_prompt(summaries)
    state_response = llm_haiku.invoke(state_prompt)
    state_analysis = parse_state_analysis(state_response.content)

    return {
        "flow_traces": traces,
        "state_stores": state_analysis["stores"],
        "async_boundaries": state_analysis["async"],
    }
```

Pass 3 + assembly nodes
```python
# nodes/pass3.py
def pass3_compress(state: PipelineState) -> dict:
    summaries = state["module_summaries"]

    prompt = build_synthesis_prompt(
        summaries,
        state["flow_traces"],
        state["architecture_style"],
    )
    response = llm_sonnet.invoke(prompt)
    parsed = parse_synthesis_output(response.content)

    # Score anti-patterns using Haiku (35-item checklist — see Guide 1)
    score_prompt = build_antipattern_prompt(summaries, state["dependency_graph"])
    score_response = llm_haiku.invoke(score_prompt)
    violations, score = parse_antipattern_score(score_response.content)

    debt = build_debt_register(violations, llm_haiku)

    return {
        "component_diagram": parsed["mermaid"],
        "anti_pattern_score": score,
        "anti_pattern_violations": violations,
        "debt_register": debt,
    }


def assemble_draft(state: PipelineState) -> dict:
    # Fill the 12-section output template from Guide 1
    # Every claim not verified by a gate gets marked [unverified]
    doc = TEMPLATE.format(
        architecture_style=state["architecture_style"],
        component_diagram=state["component_diagram"],
        flow_traces=format_flows(state["flow_traces"]),
        state_stores=format_state_stores(state["state_stores"]),
        anti_pattern_score=state["anti_pattern_score"],
        debt_register=format_debt(state["debt_register"]),
        security="[unverified — complete after Gate 2]",
        deployment="[unverified — complete after Gate 2]",
    )
    return {"draft_doc": doc}
```

Graph wiring — connecting all 12 nodes
```python
# graph.py
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.sqlite import SqliteSaver


def build_graph():
    g = StateGraph(PipelineState)

    # ── Add all 12 nodes ──
    g.add_node("ingest_repo", ingest_repo)
    g.add_node("chunk_by_module", chunk_by_module)
    g.add_node("pass1_structure", pass1_structure)
    g.add_node("pass2_behavior", pass2_behavior)
    g.add_node("pass3_compress", pass3_compress)
    g.add_node("assemble_draft", assemble_draft)
    g.add_node("gate1_static", gate1_static)
    g.add_node("gate2_human", gate2_human)  # uses interrupt()
    g.add_node("gate3_runtime", gate3_runtime)
    g.add_node("gate4_failure", gate4_failure)
    g.add_node("refinement", refinement)
    g.add_node("publish_doc", publish_doc)

    # ── Linear edges (Pass 1 → 3) ──
    g.set_entry_point("ingest_repo")
    g.add_edge("ingest_repo", "chunk_by_module")
    g.add_edge("chunk_by_module", "pass1_structure")
    g.add_edge("pass1_structure", "pass2_behavior")
    g.add_edge("pass2_behavior", "pass3_compress")
    g.add_edge("pass3_compress", "assemble_draft")
    g.add_edge("assemble_draft", "gate1_static")

    # ── Gate 1 → Gate 2 (always — violations are input to G2 review) ──
    g.add_edge("gate1_static", "gate2_human")

    # ── Gate 2 → conditional ──
    g.add_conditional_edges("gate2_human", route_gate2, {
        "proceed": "gate3_runtime",
        "refine": "refinement",  # major disputes — loop back
    })

    # ── Gates 3 & 4 ──
    g.add_edge("gate3_runtime", "gate4_failure")

    # ── Gate 4 → conditional ──
    g.add_conditional_edges("gate4_failure", route_gate4, {
        "publish": "publish_doc",
        "refine": "refinement",
    })

    # ── Refinement → conditional loop ──
    g.add_conditional_edges("refinement", route_refinement, {
        "compress": "pass3_compress",  # major gaps → re-compress
        "draft": "assemble_draft",     # minor gaps → rebuild draft
        "publish": "publish_doc",      # max iterations hit
    })

    g.add_edge("publish_doc", END)

    # ── Compile with checkpointer ──
    # gate2_human calls interrupt() itself, so no interrupt_before is needed
    checkpointer = SqliteSaver.from_conn_string("checkpoints.db")
    return g.compile(checkpointer=checkpointer)


# Router functions
def route_gate2(state: PipelineState) -> str:
    if state.get("gate2_approved"):
        return "proceed"
    if state.get("refinement_count", 0) >= 2:
        return "proceed"  # max loops
    return "refine"


def route_gate4(state: PipelineState) -> str:
    gaps = state.get("gate4_gaps", [])
    if not gaps or state.get("refinement_count", 0) >= 2:
        return "publish"
    return "refine"


def route_refinement(state: PipelineState) -> str:
    if state.get("refinement_count", 0) >= 2:
        return "publish"
    # Major gaps (multiple component boundaries disputed) → full re-compress
    if len(state.get("gate4_gaps", [])) > 3:
        return "compress"
    return "draft"
```

4 · Validation gates — implementation
Gate 1 — Static validation
Compare the import graph produced by static analysis against the component diagram produced by the LLM. Any arrow in the diagram with no corresponding import, and any import with no corresponding arrow, is a violation. This is the automated form of the manual Gate 1 check described in From Unknown Codebase to Architecture Document.
```python
# nodes/gates.py
import subprocess, json, re


def gate1_static(state: PipelineState) -> dict:
    # 1. Extract component relationships from Mermaid diagram
    diagram_edges = parse_mermaid_edges(state["component_diagram"])

    # 2. Extract relationships from actual import graph
    import_edges = extract_import_edges(
        state["dependency_graph"], state["module_summaries"]
    )

    violations = []

    # In diagram but no imports to support it
    for edge in diagram_edges:
        if edge not in import_edges:
            violations.append(
                f"Diagram shows {edge[0]} → {edge[1]} but no imports support this"
            )

    # Significant imports with no diagram representation
    for edge in import_edges:
        if edge not in diagram_edges and is_significant(edge, state):
            violations.append(
                f"Import {edge[0]} → {edge[1]} has no corresponding diagram arrow"
            )

    return {
        "gate1_violations": violations,
        "gate1_passed": len(violations) == 0,
    }


def parse_mermaid_edges(mermaid: str) -> set[tuple]:
    # Extract "A --> B" or "A --calls--> B" patterns
    edges = set()
    for line in mermaid.split("\n"):
        m = re.search(r"(\w+)\s*--[>\w\s]*-*>\s*(\w+)", line)
        if m:
            edges.add((m.group(1), m.group(2)))
    return edges
```

Gate 2 — Human-in-the-loop interrupt
LLM-generated component diagrams have a ~30% error rate on relationships. This gate directly implements Gate 2 (Developer Validation) from the companion guide — it is the quality checkpoint that makes the output trustworthy. Build the interrupt, then build the UI around it.
```python
# Gate 2 node — uses LangGraph interrupt()
from langgraph.types import interrupt


def gate2_human(state: PipelineState) -> dict:
    # interrupt() pauses execution and surfaces data to the caller
    # The graph checkpoints here and waits until resumed
    feedback = interrupt({
        "action": "review_required",
        "draft_doc": state["draft_doc"],
        "component_diagram": state["component_diagram"],
        "gate1_violations": state["gate1_violations"],
        "questions": [
            "Does this component diagram match how you'd explain the system to a new hire?",
            "Are there components I've merged that you'd keep separate?",
            "Is there significant behavior that doesn't appear in these diagrams?",
            "Does the consistency boundary match what actually commits and rolls back?",
            "Which parts would you argue with?",
        ],
    })

    approved = feedback.get("approved", False)
    return {
        "human_feedback": feedback.get("comments", ""),
        "gate2_approved": approved,
    }
```

Running and resuming the pipeline:
```python
# Caller — how to run and resume the graph
from langgraph.types import Command

graph = build_graph()
thread_id = "run-001"
config = {"configurable": {"thread_id": thread_id}}

# Start the pipeline — it will pause at gate2_human
for event in graph.stream(initial_state(), config=config):
    print(event)
# ↑ pipeline pauses here, interrupt data is in the event stream

# --- hours or days later ---
# Human has reviewed — resume with their feedback
for event in graph.stream(
    Command(resume={
        "approved": True,
        "comments": "PaymentProcessing should be separate from OrderManagement. Split those."
    }),
    config=config,
):
    print(event)
# ↑ pipeline resumes from gate2_human with the human's feedback in state
```

Gate 2 UI — minimal Streamlit implementation:
```python
# review_ui.py — minimal Streamlit UI for Gate 2
import streamlit as st
from langgraph.types import Command

st.set_page_config(page_title="Architecture Review", layout="wide")

thread_id = st.query_params.get("thread", "run-001")
config = {"configurable": {"thread_id": thread_id}}
graph = build_graph()

# Load current state from checkpoint
state = graph.get_state(config)
data = state.tasks[0].interrupts[0].value if state.tasks else {}

col1, col2 = st.columns([1, 1])

with col1:
    st.subheader("Component Diagram")
    st.code(data.get("component_diagram", ""), language="text")
    if data.get("gate1_violations"):
        st.warning("Gate 1 violations:\n" + "\n".join(data["gate1_violations"]))

with col2:
    st.subheader("Your review")
    for q in data.get("questions", []):
        st.markdown(f"• {q}")
    comments = st.text_area("Comments (specific component boundary changes):", height=200)
    approved = st.checkbox("I approve this architecture document to proceed")
    if st.button("Submit review"):
        for _ in graph.stream(
            Command(resume={"approved": approved, "comments": comments}),
            config=config,
        ):
            pass
        st.success("Pipeline resumed. Proceeding to Gates 3 and 4.")
```

Gates 3 & 4
```python
def gate3_runtime(state: PipelineState) -> dict:
    if not state.get("log_source"):
        # No log source configured — mark as skipped, not failed
        return {"gate3_gaps": [], "gate3_passed": True}

    # 1. Fetch top endpoints from logs
    top_endpoints = fetch_top_endpoints(state["log_source"], limit=10)

    # 2. Extract documented flow entry points from sequence diagrams
    documented = {
        t.get("entry_point") for t in state["flow_traces"] if t.get("entry_point")
    }

    # 3. Find high-frequency paths not in any documented flow
    gaps = [
        f"High-frequency path {ep['path']} ({ep['count']} calls) not in any sequence diagram"
        for ep in top_endpoints
        if not any(ep["path"] in d for d in documented)
    ]

    return {"gate3_gaps": gaps, "gate3_passed": len(gaps) == 0}


def gate4_failure(state: PipelineState) -> dict:
    scenarios = [
        "Primary database becomes unavailable for 5 minutes",
        "External payment gateway returns 503 for all requests",
        "Message broker / queue becomes unreachable",
        "Application server restarts mid-request",
    ]

    gaps = []
    for scenario in scenarios:
        prompt = build_failure_eval_prompt(
            scenario, state["draft_doc"], state["component_diagram"]
        )
        response = llm_haiku.invoke(prompt)
        result = parse_failure_result(response.content)
        if not result["doc_explains_behavior"]:
            gaps.append(f"Scenario '{scenario}': {result['gap']}")

    return {"gate4_gaps": gaps, "gate4_passed": len(gaps) == 0}
```

Refinement loop
```python
def refinement(state: PipelineState) -> dict:
    count = state.get("refinement_count", 0) + 1
    notes = list(state.get("refinement_notes", []))
    updates = {}

    # Apply human feedback if present
    if state.get("human_feedback"):
        feedback_prompt = build_feedback_application_prompt(
            state["human_feedback"],
            state["component_diagram"],
            state["module_summaries"],
        )
        updated_diagram = llm_sonnet.invoke(feedback_prompt).content
        updates["component_diagram"] = updated_diagram
        notes.append(f"Refinement {count}: applied human feedback")

    # Apply Gate 3 gaps — annotate draft with unverified markers
    if state.get("gate3_gaps"):
        draft = state["draft_doc"]
        for gap in state["gate3_gaps"]:
            draft += f"\n\n> **[unverified]** {gap}"
        updates["draft_doc"] = draft

    # Apply Gate 4 gaps — add to debt register as high-risk items
    if state.get("gate4_gaps"):
        new_debt = [
            DebtItem(title=g, location="architecture doc",
                     risk_score=4, effort_score=2,
                     decision="Fix Next Sprint", owner=None, target_date=None)
            for g in state["gate4_gaps"]
        ]
        updates["debt_register"] = state["debt_register"] + new_debt

    return {**updates, "refinement_count": count, "refinement_notes": notes}
```

5 · Production concerns
Cost & latency model
A typical medium codebase (~40 modules, 3 flows) costs approximately:
| Stage | Model | Approx tokens | Approx cost | Approx time |
|---|---|---|---|---|
| Pass 1 (40 modules) | Sonnet | ~160k in / 40k out | ~$0.60 | 3–5 min |
| Pass 2 (3 flows) | Sonnet | ~30k in / 15k out | ~$0.12 | 1–2 min |
| Pass 3 (compress) | Sonnet | ~20k in / 8k out | ~$0.08 | 30–60s |
| Anti-pattern scoring | Haiku | ~15k in / 3k out | ~$0.01 | 15–30s |
| Gate 4 (4 scenarios) | Haiku | ~20k in / 4k out | ~$0.01 | 30–60s |
| Total | | ~245k in / 70k out | ~$0.82 | ~7–9 min |
Anti-pattern scoring, architecture style identification, and failure scenario evaluation are classification tasks. Haiku handles them accurately at 1/20th the cost of Sonnet. Reserve Sonnet for module summarisation, flow tracing, and synthesis — tasks where reasoning depth matters.
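A small task-to-model table keeps that split explicit and auditable. The task names and the pick_model helper here are illustrative, not part of the pipeline code above:

```python
# Illustrative task → model routing. Classification-style tasks go to the
# cheaper model; synthesis-heavy tasks stay on the stronger one.
MODEL_FOR_TASK = {
    "module_summary": "sonnet",     # reasoning over full source files
    "flow_trace": "sonnet",
    "synthesis": "sonnet",
    "architecture_style": "haiku",  # short classification
    "anti_pattern_score": "haiku",  # checklist scoring
    "failure_eval": "haiku",
}

def pick_model(task: str) -> str:
    # Default unknown task types to the stronger model: wrong-but-cheap
    # costs more in rework than right-but-slightly-more-expensive
    return MODEL_FOR_TASK.get(task, "sonnet")
```

Centralising the routing in one table also makes it trivial to re-run a single task class on a stronger model when a gate flags its output.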
Parallel module summarisation
Pass 1 processes 40 modules sequentially by default. Parallelise it with asyncio and rate limiting:
```python
import asyncio
from asyncio import Semaphore


async def pass1_structure_parallel(state: PipelineState) -> dict:
    sem = Semaphore(5)  # max 5 concurrent Sonnet calls

    async def summarise_chunk(chunk):
        async with sem:
            prompt = build_module_prompt(chunk, state["build_file_summary"])
            response = await llm_sonnet.ainvoke(prompt)
            return parse_module_summary(response.content, chunk["module_id"])

    summaries = await asyncio.gather(*[
        summarise_chunk(chunk) for chunk in state["module_chunks"]
    ])
    return {"module_summaries": list(summaries)}

# Reduces Pass 1 from 5 min to ~90 seconds on 40 modules
```

Error recovery
Three failure modes to handle explicitly:
```python
import json

# 1. LLM call failure — retry with exponential backoff
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=2, max=30))
def llm_call_with_retry(prompt):
    return llm_sonnet.invoke(prompt)


# 2. Malformed LLM output — validate before writing to state
def parse_module_summary(content: str, module_id: str) -> ModuleSummary:
    try:
        parsed = extract_json_from_llm_output(content)
        assert "capability" in parsed
        assert "responsibility" in parsed
        return ModuleSummary(**parsed, module_id=module_id, raw_llm_output=content)
    except (json.JSONDecodeError, AssertionError, KeyError):
        # Fallback: use a degraded summary rather than crashing the pipeline
        return ModuleSummary(
            module_id=module_id,
            capability=module_id.split(".")[-1],  # use package name as fallback
            responsibility="[parse error — review manually]",
            dependencies=[],
            patterns=[],
            anti_patterns=[],
            raw_llm_output=content,
        )


# 3. Checkpoint recovery — resume after crash
def resume_from_checkpoint(thread_id: str):
    config = {"configurable": {"thread_id": thread_id}}
    state = graph.get_state(config)
    if state.next:
        print(f"Resuming from: {state.next}")
        for event in graph.stream(None, config=config):  # None = resume from checkpoint
            print(event)
    else:
        print("Pipeline complete or not started")
```

Observability
```python
# Enable LangSmith tracing — set env vars:
# LANGCHAIN_TRACING_V2=true
# LANGCHAIN_API_KEY=your-key
# LANGCHAIN_PROJECT=arch-pipeline

# Per-node metadata for filtering in LangSmith
def pass1_structure(state: PipelineState, config: dict) -> dict:
    for i, chunk in enumerate(state["module_chunks"]):
        response = llm_sonnet.invoke(
            build_module_prompt(chunk, state["build_file_summary"]),
            config={"metadata": {
                "node": "pass1_structure",
                "module_id": chunk["module_id"],
                "chunk_index": i,
                "total_chunks": len(state["module_chunks"]),
            }},
        )
```

Prompt templates — complete reference
Module summary prompt
```python
import json

def build_module_prompt(chunk: ModuleChunk, build_summary: dict) -> str:
    files_text = "\n\n".join([
        f"=== {name} ===\n{content[:3000]}"  # cap per file
        for name, content in chunk["file_contents"].items()
    ])

    return f"""You are analyzing a module in a {build_summary.get('framework', 'software')} codebase.

BUILD CONTEXT:
{json.dumps(build_summary, indent=2)}

MODULE: {chunk['module_id']}
INFERRED LAYER: {chunk['layer']}

SOURCE FILES:
{files_text}

Analyze this module. Respond ONLY with valid JSON matching this schema exactly:
{{
  "capability": "string — one capability name (not a class name, e.g. 'Order Management' not 'OrderService')",
  "responsibility": "string — one sentence describing what this module enables",
  "dependencies": ["list of other module IDs this imports from"],
  "patterns": ["list of design patterns identified, e.g. Repository, Service Facade"],
  "anti_patterns": ["list of violations, e.g. 'business logic in handler', 'N+1 query risk'"],
  "layer_violations": ["list of imports that cross layer boundaries incorrectly"]
}}

Rules:
- capability MUST be a business capability name, never a class name
- responsibility MUST be one sentence, max 20 words
- Only list dependencies that are internal modules (not third-party libraries)
- Only list anti_patterns you can point to specific lines of evidence for"""
```

Synthesis prompt (Pass 3)
```python
def build_synthesis_prompt(summaries, flows, style) -> str:
    summary_text = "\n".join([
        f"- {s['module_id']}: {s['capability']} — {s['responsibility']}"
        for s in summaries
    ])

    return f"""You are compressing a codebase analysis into an architecture document.

ARCHITECTURE STYLE: {style}

MODULE SUMMARIES:
{summary_text}

Apply ALL FIVE compression rules strictly:
1. Collapse classes/modules into capabilities (5-12 components MAX, never more)
2. Collapse endpoints into use cases (not HTTP verbs)
3. Collapse tables into domain concepts (aggregate roots)
4. Collapse integrations into roles (not product names)
5. Express each layer boundary in one sentence

Respond ONLY with valid JSON:
{{
  "components": [
    {{
      "name": "capability name",
      "responsibility": "one sentence",
      "dependencies": ["other component names"],
      "source_modules": ["module_ids compressed into this component"]
    }}
  ],
  "mermaid": "complete C4Context or graph TD Mermaid diagram source",
  "layer_boundary_sentences": [
    "The handler layer delegates all business decisions to the service layer.",
    "The service layer owns consistency boundaries and orchestrates repositories.",
    "Repositories are the only components that touch the data store."
  ],
  "domain_concepts": [
    {{"concept": "Order", "tables": ["orders", "order_lines"], "aggregate_root": "Order"}}
  ]
}}

Hard constraints:
- components array MUST have 5-12 items. If you have more, merge.
- Every component name must be a business capability, never a class name.
- mermaid must be syntactically valid Mermaid."""
```

Failure evaluation prompt (Gate 4)
Failure evaluation prompt (Gate 4)
```python
def build_failure_eval_prompt(scenario, draft_doc, diagram) -> str:
    return f"""You are evaluating whether an architecture document adequately covers a failure scenario.

FAILURE SCENARIO: {scenario}

ARCHITECTURE DOCUMENT (excerpt):
{draft_doc[:4000]}

COMPONENT DIAGRAM:
{diagram}

Answer these questions based ONLY on what is explicitly documented:
1. Which components fail immediately when this scenario occurs?
2. Which components can degrade gracefully?
3. Does the document show a circuit breaker or fallback? (yes/no)
4. Does the document explain the user-visible impact? (yes/no)
5. Is there anything about this scenario the document cannot explain?

Respond ONLY with valid JSON:
{{
  "doc_explains_behavior": true or false,
  "immediate_failures": ["component names"],
  "graceful_degradation": ["component names"],
  "has_fallback_documented": true or false,
  "user_impact_documented": true or false,
  "gap": "one sentence describing what the doc fails to explain, or null if no gap"
}}"""
```
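Gate 4 runs this prompt once per failure scenario, so something has to fold the per-scenario answers into `gate4_passed` and `gate4_gaps`. A minimal aggregation sketch, assuming each evaluation has already been parsed into a dict matching the JSON schema above; `aggregate_gate4` is an illustrative name:

```python
def aggregate_gate4(evaluations: list[dict]) -> tuple[bool, list[str]]:
    """Collect gaps from per-scenario evaluations; the gate passes only when none remain."""
    gaps = []
    for ev in evaluations:
        if not ev.get("doc_explains_behavior", False):
            # The doc cannot explain the scenario at all; record the model's
            # gap sentence, or a generic one if it returned null.
            gaps.append(ev.get("gap") or "document cannot explain scenario behavior")
        elif ev.get("gap"):
            # The doc broadly explains the scenario but a specific gap was flagged.
            gaps.append(ev["gap"])
    return (len(gaps) == 0, gaps)
```

The gaps list feeds straight into the refinement node, which uses each sentence as a concrete edit instruction against the draft.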
Where to start — the 3-day plan
Get module chunking working reliably on one real codebase and validate the module summaries before touching compression, diagram generation, or any validation gate. Everything downstream depends on chunking quality.
| Day | Goal | Success criterion |
|---|---|---|
| Day 1 | Build nodes 1–3. Get module summaries for a real codebase. | A developer who knows the codebase reads 5 random module summaries and says they are accurate. |
| Day 2 | Build nodes 4–6. Generate a draft component diagram and sequence diagrams. | The component diagram has 5–12 components named as capabilities (not classes). At least one sequence diagram traces a real flow end to end. |
| Day 3 | Build node 8 (human interrupt) and the review UI. Wire the LangGraph graph. | A developer can pause the pipeline at gate2_human, review the draft, submit feedback, and the pipeline resumes with their comments in state. |
| Week 2 | Build gates 1, 3, 4 and the refinement node. Add error recovery. | The pipeline completes end to end on a medium codebase (~30 modules) and produces a validated, published architecture document. |
The most common failure is building the visual output first and discovering the chunking problem six weeks later when the diagrams consistently contain invented components. Module summaries first. Diagrams second.
Related reading
This guide is part two of a two-part series:
- Part 1 — From Unknown Codebase to Architecture Document: The full 3-pass methodology, compression rules, anti-pattern checklist, validation gates, debt register, and 12-section output template. Read this first.
- Part 2 — This guide: Automating Part 1 as a 12-node LangGraph pipeline with human-in-the-loop review and production error recovery.
Minimal runnable entry point
```python
# main.py — run the pipeline on a local repo
from graph import build_graph
from state import PipelineState

graph = build_graph()

initial = PipelineState(
    repo_path="/path/to/your/repo",
    target_stack="java",  # or python / nodejs / go / dotnet
    log_source=None,      # set to APM endpoint for Gate 3
    build_file_summary={},
    dependency_graph={},
    module_chunks=[],
    module_summaries=[],
    entry_points=[],
    architecture_style="",
    flow_traces=[],
    state_stores=[],
    async_boundaries=[],
    component_diagram="",
    anti_pattern_score=0,
    anti_pattern_violations=[],
    debt_register=[],
    draft_doc="",
    gate1_violations=[],
    gate1_passed=False,
    human_feedback=None,
    gate2_approved=False,
    gate3_gaps=[],
    gate3_passed=False,
    gate4_gaps=[],
    gate4_passed=False,
    refinement_count=0,
    refinement_notes=[],
    final_doc="",
    adr_stubs=[],
)

config = {"configurable": {"thread_id": "run-001"}}

# stream_mode="updates" yields one {node_name: state_update} dict per node,
# so the node name can be read from the event keys; "values" would yield
# full state snapshots instead and break the check below.
for event in graph.stream(initial, config=config, stream_mode="updates"):
    node = list(event.keys())[0]
    print(f"✓ Completed: {node}")
    if node == "gate2_human":
        print("⏸ Paused for human review. Run review UI or resume programmatically.")
        break
```
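When the reviewer submits feedback, the run resumes from the saved checkpoint rather than restarting. A minimal sketch of the resume side, assuming the graph was compiled with a checkpointer and is paused at `gate2_human`; `update_state` and `stream(None, config)` are the standard LangGraph resume calls, while `resume_after_review` is an illustrative helper name:

```python
def resume_after_review(graph, config: dict, feedback: str, approved: bool) -> list:
    """Write reviewer feedback into the paused state, then resume from the checkpoint."""
    graph.update_state(
        config,
        {"human_feedback": feedback, "gate2_approved": approved},
    )
    completed = []
    # Streaming with None as the input resumes the interrupted thread
    # instead of starting a new run.
    for event in graph.stream(None, config=config, stream_mode="updates"):
        completed.extend(event.keys())
    return completed
```

The same `thread_id` in `config` is what ties the resume call back to the paused run, so the review UI only needs to persist that one string alongside the draft it shows the reviewer.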