Every engineering team eventually inherits a codebase with no architecture documentation. A new hire joins and spends three weeks mapping dependencies by reading code. A tech lead wants to present the system to stakeholders but has nothing to show. A modernisation project begins and no one can agree on the current boundaries.
The standard solution is to schedule an architecture workshop, assign someone to "write the doc," and watch it stall under competing priorities. The doc either never gets written, or it gets written once and immediately starts drifting from reality.
This guide builds ArchLens - a 12-node LangGraph pipeline that takes a Git repository as input and produces a validated, production-ready architecture document as output - complete with component diagrams, sequence diagrams, a debt register, and ADR stubs. The pipeline runs the 3-pass analysis methodology automatically, pauses for human review at the right moment, and loops back for refinement when gaps are found.
What you will build
By the end of this guide you will have a working ArchLens pipeline that:
- Clones any Git repository and groups source files by module boundary
- Summarises each module using Claude, applying the 5 architecture compression rules
- Generates a Mermaid component diagram with 5–12 capability-named components
- Traces the top 3 revenue-generating flows and produces sequence diagrams
- Runs 4 validation gates - static analysis, developer review, runtime comparison, and failure scenarios
- Pauses for human review at Gate 2, persists state, and resumes when feedback is submitted
- Loops back for refinement when gaps are found, then publishes a final verified document
You should be comfortable with LangGraph's core concepts - StateGraph, nodes, edges, and conditional routing. This guide does not re-explain those basics. It goes directly to the decisions specific to codebase analysis: why module chunking matters, how to design state that survives a human interrupt, and how to wire a refinement loop without hitting a chain's architectural limits.
This guide implements the 3-pass architecture methodology - a structured approach to reading an unfamiliar codebase covered in the companion guide "From Unknown Codebase to Architecture Document: A Complete Practitioner's Guide." If terms like "compression rules," "validation gates," or "the 12-section output template" are unfamiliar, reading that guide first will give you the context this one assumes.
The 12-node pipeline at a glance
| # | Node | Responsibility | Phase |
|---|---|---|---|
| 1 | ingest_repo | Clone repo · parse build file · extract dependency graph | Pass 1 |
| 2 | chunk_by_module | Group files by package boundary · never by individual file | Pass 1 |
| 3 | pass1_structure | LLM per module chunk · accumulate module summaries | Pass 1 |
| 4 | pass2_behavior | Trace top 3 flows · map state stores · async boundaries | Pass 2 |
| 5 | pass3_compress | Apply 5 compression rules · generate Mermaid · score anti-patterns | Pass 3 |
| 6 | assemble_draft | Fill 12-section template · mark all claims [unverified] | Assembly |
| 7 | gate1_static | Run jdeps/depcruise · diff import graph vs component diagram | Gate 1 |
| 8 | gate2_human | interrupt() - pause for developer review · collect feedback | ⏸ Human |
| 9 | gate3_runtime | Fetch logs/APM · compare top flows vs sequence diagrams | Gate 3 |
| 10 | gate4_failure | 4 standard failure scenarios · LLM evaluator · gap reporter | Gate 4 |
| 11 | refinement | Update diagrams + debt list · conditional loop back to node 5 | ↻ Loop |
| 12 | publish_doc | Write Markdown · generate ADR stubs · output debt register | Output |
Why LangGraph, not a simple chain
If you've used LangGraph before, you know the answer. But it's worth being precise about which features this pipeline actually requires - because teams that try to build this on a chain hit every one of these walls.
Cycles - the chain killer. Node 11 (refinement) conditionally loops back to node 5 (compress) if major gaps are found. No chain architecture can do this. You discover this requirement on the first real codebase, not in testing.
Human interrupt at node 8. Gate 2 pauses execution and waits hours or days for developer review. LangGraph's interrupt() + checkpointer handles this natively. Without it, you lose all state when the process idles.
State persistence across 12 nodes. Module summaries from node 3 must be readable by node 10. The checkpointer persists all state to SQLite/Postgres automatically - no manual serialisation between steps.
Conditional edges at gates. Each gate can either pass (proceed) or fail (route to refinement). LangGraph's conditional edges handle this with a simple router function - no complex if-else chains in application code.
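The shape of such a router is worth seeing up front (a simplified sketch - the real router functions appear in the graph-wiring section below):

```python
def route_gate(state: dict) -> str:
    """Return a label; LangGraph maps it to the next node via the
    dict passed to add_conditional_edges."""
    if state.get("gate_passed", False):
        return "proceed"
    if state.get("refinement_count", 0) >= 2:
        return "proceed"  # safety valve: cap refinement loops
    return "refine"

# Wiring (hypothetical node names):
# g.add_conditional_edges("gate_node", route_gate,
#     {"proceed": "next_node", "refine": "refinement"})
```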
1 · State design - the most important decision
State is the contract between every node. Get this wrong and you will rewrite it twice. Define it as a TypedDict before touching any node implementation.
The complete state schema
```python
# state.py - define this first, before any node
from typing import TypedDict, Optional

class ModuleChunk(TypedDict):
    module_id: str          # e.g. "com.example.payment"
    layer: str              # "handler" | "service" | "dao" | "domain" | "integration"
    files: list[str]        # absolute paths of every file in this module
    file_contents: dict     # filename → source code string
    import_count: int       # used to sort chunks by complexity

class ModuleSummary(TypedDict):
    module_id: str
    capability: str           # compressed name - "Order Management", not "OrderService"
    responsibility: str       # one sentence
    dependencies: list[str]   # other module_ids this depends on
    patterns: list[str]       # detected design patterns
    anti_patterns: list[str]  # detected violations
    raw_llm_output: str       # keep for Gate 2 display

class FlowTrace(TypedDict):
    flow_name: str          # "place_order", "authenticate_user"
    priority: int           # 1=revenue, 2=frequency, 3=complex, 4=failure-prone, 5=auth
    mermaid_sequence: str   # full Mermaid sequenceDiagram source
    consistency_boundary: str
    async_boundaries: list[str]
    external_calls: list[str]

class DebtItem(TypedDict):
    title: str
    location: str
    risk_score: int         # 1-5
    effort_score: int       # 1-5
    decision: str           # "Fix Now" | "Fix Next Sprint" | "Plan" | "Accept" | "Skip"
    owner: Optional[str]
    target_date: Optional[str]

class PipelineState(TypedDict):
    # ── Input ──
    repo_path: str
    target_stack: str           # "java" | "python" | "nodejs" | "dotnet" | "go" | "ruby"
    log_source: Optional[str]   # APM endpoint or log path for Gate 3
    # ── Pass 1 outputs ──
    build_file_summary: dict
    dependency_graph: dict      # module_id → list of module_ids it imports
    module_chunks: list[ModuleChunk]
    module_summaries: list[ModuleSummary]
    entry_points: list[str]
    architecture_style: str
    # ── Pass 2 outputs ──
    flow_traces: list[FlowTrace]
    state_stores: list[dict]
    async_boundaries: list[str]
    # ── Pass 3 outputs ──
    component_diagram: str      # Mermaid source
    anti_pattern_score: int
    anti_pattern_violations: list[str]
    debt_register: list[DebtItem]
    # ── Draft doc ──
    draft_doc: str              # full Markdown with [unverified] markers
    # ── Gate results ──
    gate1_violations: list[str] # import graph vs diagram mismatches
    gate1_passed: bool
    human_feedback: Optional[str]  # structured feedback from Gate 2
    gate2_approved: bool
    gate3_gaps: list[str]       # flows in logs not in diagrams
    gate3_passed: bool
    gate4_gaps: list[str]       # failure scenarios the doc can't explain
    gate4_passed: bool
    # ── Refinement tracking ──
    refinement_count: int       # prevent infinite loops - max 2
    refinement_notes: list[str] # audit trail of what changed each loop
    # ── Final output ──
    final_doc: str
    adr_stubs: list[str]
```
If you add fields incrementally, Node 8 will fail because it expects fields that Node 3 never wrote. Define the complete schema upfront, use Optional for fields that are populated later, and initialise everything to sensible defaults in the graph entry point.
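A sketch of such an entry-point initialiser - the helper name `initial_state` and the specific defaults are assumptions; adjust them to match your schema:

```python
from typing import Optional

def initial_state(repo_path: str, target_stack: str,
                  log_source: Optional[str] = None) -> dict:
    """Initialise every PipelineState field up front so no later node
    ever hits a missing key. Optional fields default to None/empty."""
    return {
        "repo_path": repo_path,
        "target_stack": target_stack,
        "log_source": log_source,
        # Pass 1 outputs
        "build_file_summary": {}, "dependency_graph": {},
        "module_chunks": [], "module_summaries": [],
        "entry_points": [], "architecture_style": "",
        # Pass 2 outputs
        "flow_traces": [], "state_stores": [], "async_boundaries": [],
        # Pass 3 outputs
        "component_diagram": "", "anti_pattern_score": 0,
        "anti_pattern_violations": [], "debt_register": [],
        # Draft + gate results
        "draft_doc": "",
        "gate1_violations": [], "gate1_passed": False,
        "human_feedback": None, "gate2_approved": False,
        "gate3_gaps": [], "gate3_passed": False,
        "gate4_gaps": [], "gate4_passed": False,
        # Refinement tracking + final output
        "refinement_count": 0, "refinement_notes": [],
        "final_doc": "", "adr_stubs": [],
    }
```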
State flow across nodes
| Node | Reads from state | Writes to state |
|---|---|---|
| ingest_repo | repo_path, target_stack | build_file_summary, dependency_graph, entry_points |
| chunk_by_module | repo_path, dependency_graph, target_stack | module_chunks |
| pass1_structure | module_chunks, build_file_summary | module_summaries, architecture_style |
| pass2_behavior | module_summaries, entry_points | flow_traces, state_stores, async_boundaries |
| pass3_compress | module_summaries, flow_traces | component_diagram, anti_pattern_score, debt_register |
| assemble_draft | Everything above | draft_doc |
| gate1_static | dependency_graph, component_diagram | gate1_violations, gate1_passed |
| gate2_human | draft_doc, gate1_violations | human_feedback, gate2_approved |
| gate3_runtime | flow_traces, log_source | gate3_gaps, gate3_passed |
| gate4_failure | draft_doc, component_diagram | gate4_gaps, gate4_passed |
| refinement | All gate results + human_feedback | component_diagram, draft_doc, refinement_count |
| publish_doc | Everything | final_doc, adr_stubs |
2 · Chunking logic - the quality multiplier
Chunking quality determines everything downstream. A PaymentService.java analysed in isolation looks self-contained. Its dependency on OrderRepository, its position in the layer stack, and the boundary decisions around it are all in the surrounding package. Send the file alone and the LLM hallucinates its dependencies.
Sending individual files is the single biggest quality failure in codebase analysis agents. It produces module summaries with invented dependencies, which produces a wrong component diagram, which produces a useless architecture document.
Implementation
```python
# chunker.py
from pathlib import Path
from collections import defaultdict

from state import ModuleChunk, PipelineState

def java_package(path: Path, repo: Path) -> str:
    # Read "package com.example.payment" from the file header
    try:
        for line in path.read_text(errors="replace").split("\n")[:10]:
            if line.strip().startswith("package "):
                return line.strip()[8:].rstrip(";")
    except OSError:
        pass
    # Fallback: use the directory as the module id
    return str(path.parent.relative_to(repo)).replace("/", ".")

# python_package, node_module, go_package and dotnet_namespace follow the
# same pattern as java_package for their respective stacks
STACK_CONFIG = {
    "java":   {"ext": [".java"],      "module_from": java_package},
    "python": {"ext": [".py"],        "module_from": python_package},
    "nodejs": {"ext": [".ts", ".js"], "module_from": node_module},
    "go":     {"ext": [".go"],        "module_from": go_package},
    "dotnet": {"ext": [".cs"],        "module_from": dotnet_namespace},
}

def chunk_by_module(state: PipelineState) -> dict:
    repo = Path(state["repo_path"])
    cfg = STACK_CONFIG[state["target_stack"]]
    groups: dict[str, list] = defaultdict(list)
    for path in repo.rglob("*"):
        if path.suffix not in cfg["ext"]:
            continue
        if is_test_file(path):
            continue  # skip tests - they add noise
        if is_generated(path):
            continue  # skip generated code
        module_id = cfg["module_from"](path, repo)
        groups[module_id].append(path)

    chunks = []
    for module_id, files in groups.items():
        contents = {}
        for f in files:
            try:
                contents[f.name] = f.read_text(errors="replace")
            except OSError:
                pass
        # Estimate import count for complexity sorting
        import_count = sum(c.count("import") for c in contents.values())
        chunks.append(ModuleChunk(
            module_id=module_id,
            layer=infer_layer(module_id, list(contents.keys())),
            files=[str(f) for f in files],
            file_contents=contents,
            import_count=import_count,
        ))
    # Sort: high import count first - richer context for early LLM calls
    chunks.sort(key=lambda c: c["import_count"], reverse=True)
    return {"module_chunks": chunks}

def infer_layer(module_id: str, filenames: list[str]) -> str:
    m = module_id.lower()
    if any(x in m for x in ["controller", "handler", "router", "api", "web"]):
        return "handler"
    elif any(x in m for x in ["service", "usecase", "business", "facade"]):
        return "service"
    elif any(x in m for x in ["dao", "repo", "repository", "store"]):
        return "dao"
    elif any(x in m for x in ["model", "domain", "entity"]):
        return "domain"
    elif any(x in m for x in ["client", "adapter", "integration"]):
        return "integration"
    return "unknown"
```
If a module chunk exceeds ~8,000 tokens, split it: keep the largest file intact and summarise the smaller ones as one-line stubs. A chunk with one fully detailed file plus brief stubs of its siblings outperforms a truncated chunk every time.
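One way to implement that split - a sketch in which the helper name, the ~8,000-token budget, and the rough 4-characters-per-token estimate are all assumptions:

```python
def split_oversized_chunk(chunk: dict, max_tokens: int = 8000) -> dict:
    """If a chunk's estimated token count exceeds the budget, keep the
    largest file's full source and reduce every other file to a stub."""
    contents = chunk["file_contents"]
    est_tokens = sum(len(c) for c in contents.values()) // 4  # rough: ~4 chars/token
    if est_tokens <= max_tokens or len(contents) < 2:
        return chunk
    largest = max(contents, key=lambda f: len(contents[f]))
    stubbed = {
        f: (src if f == largest
            else f"# stub: {f} ({len(src.splitlines())} lines, full source omitted)")
        for f, src in contents.items()
    }
    return {**chunk, "file_contents": stubbed}
```

Run it over `module_chunks` after `chunk_by_module` and before Pass 1, so the LLM always sees one complete file with its siblings sketched around it.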
If more than 20% of your chunks return "unknown" from infer_layer, the heuristic keyword matching isn't working for your codebase's naming conventions. This happens most often with domain-specific naming (SubmissionProcessor, AuditCoordinator) that doesn't contain the expected keywords. The fix is to add stack-specific directory path overrides to STACK_CONFIG rather than modifying infer_layer directly - this keeps fallback rules auditable and per-stack:
```python
STACK_CONFIG["java"]["layer_dirs"] = {
    "controller": "handler", "api": "handler", "rest": "handler",
    "service": "service", "business": "service",
    "repository": "dao", "persistence": "dao", "jpa": "dao",
    "domain": "domain", "model": "domain", "entity": "domain",
    "client": "integration", "adapter": "integration",
}

def infer_layer(module_id: str, filenames: list[str], stack: str = "java") -> str:
    m = module_id.lower()
    # Primary: keyword match on module_id
    if any(x in m for x in ["controller", "handler", "router", "api", "web"]):
        return "handler"
    elif any(x in m for x in ["service", "usecase", "business", "facade"]):
        return "service"
    elif any(x in m for x in ["dao", "repo", "repository", "store"]):
        return "dao"
    elif any(x in m for x in ["model", "domain", "entity"]):
        return "domain"
    elif any(x in m for x in ["client", "adapter", "integration"]):
        return "integration"
    # Fallback: check directory path segments against stack-specific overrides
    dir_map = STACK_CONFIG.get(stack, {}).get("layer_dirs", {})
    for segment in module_id.split("."):
        if segment in dir_map:
            return dir_map[segment]
    return "unknown"
```
Count unknowns before running Pass 1 and abort if the ratio is too high - a Pass 1 run with 40% unknown layers produces module summaries without layer context, which degrades compression quality significantly.
```python
def validate_chunking(chunks: list[ModuleChunk]) -> None:
    unknown = sum(1 for c in chunks if c["layer"] == "unknown")
    ratio = unknown / len(chunks) if chunks else 0
    if ratio > 0.2:
        raise ValueError(
            f"Layer detection too weak: {unknown}/{len(chunks)} chunks are "
            f"'unknown' ({ratio:.0%}). Add directory overrides to STACK_CONFIG "
            f"before running Pass 1."
        )
```
3 · Node wiring - LangGraph graph construction
Pass 1 nodes - structure discovery
```python
# nodes/pass1.py
from langchain_anthropic import ChatAnthropic

from state import PipelineState, ModuleSummary

llm_sonnet = ChatAnthropic(model="claude-sonnet-4-6", max_tokens=4096)
llm_haiku = ChatAnthropic(model="claude-haiku-4-5-20251001", max_tokens=1024)

def ingest_repo(state: PipelineState) -> dict:
    repo = state["repo_path"]
    stack = state["target_stack"]
    build_summary = parse_build_file(repo, stack)
    dep_graph = run_dependency_analysis(repo, stack)
    entry_points = find_entry_points(repo, stack)
    return {
        "build_file_summary": build_summary,
        "dependency_graph": dep_graph,
        "entry_points": entry_points,
    }

def pass1_structure(state: PipelineState) -> dict:
    summaries = []
    for chunk in state["module_chunks"]:
        prompt = build_module_prompt(chunk, state["build_file_summary"])
        response = llm_sonnet.invoke(prompt)
        summary = parse_module_summary(response.content, chunk["module_id"])
        summaries.append(summary)
    # Infer architecture style using Haiku (cheaper, simpler task)
    style_prompt = build_style_prompt(summaries, state["dependency_graph"])
    style = llm_haiku.invoke(style_prompt).content.strip()
    return {
        "module_summaries": summaries,
        "architecture_style": style,
    }
```
Pass 2 nodes - behavior tracing
```python
# nodes/pass2.py
# Flow priority mirrors the sequence in Guide 1 - revenue flows first
# See: [*From Unknown Codebase to Architecture Document*](/from-unknown-codebase-to-architecture-document-a-complete-practitioners-guide)
FLOW_PRIORITY = [
    ("revenue", 1),    # order placement, payment, account creation
    ("frequency", 2),  # search, list views, dashboard load
    ("complex", 3),    # multi-step workflows, approval chains
    ("failure", 4),    # external integrations, scheduled jobs
    ("auth", 5),       # login, session management
]

def pass2_behavior(state: PipelineState) -> dict:
    summaries = state["module_summaries"]
    entry_points = state["entry_points"]
    # Step 1: identify the top 3 flows to trace
    flows_to_trace = identify_key_flows(summaries, entry_points, llm_haiku)
    # Step 2: trace each flow and generate its sequence diagram
    traces = []
    for flow in flows_to_trace[:3]:  # cap at 3
        prompt = build_flow_trace_prompt(flow, summaries)
        response = llm_sonnet.invoke(prompt)
        traces.append(parse_flow_trace(response.content, flow))
    # Step 3: identify state stores and async boundaries
    state_prompt = build_state_analysis_prompt(summaries)
    state_analysis = parse_state_analysis(llm_haiku.invoke(state_prompt).content)
    return {
        "flow_traces": traces,
        "state_stores": state_analysis["stores"],
        "async_boundaries": state_analysis["async"],
    }
```
Pass 3 + assembly nodes
````python
# nodes/pass3.py
from datetime import date
from pathlib import Path

def pass3_compress(state: PipelineState) -> dict:
    summaries = state["module_summaries"]
    prompt = build_synthesis_prompt(
        summaries,
        state["flow_traces"],
        state["architecture_style"],
    )
    response = llm_sonnet.invoke(prompt)
    parsed = parse_synthesis_output(response.content)
    # Score anti-patterns using Haiku (35-item checklist - see Guide 1)
    score_prompt = build_antipattern_prompt(summaries, state["dependency_graph"])
    score_response = llm_haiku.invoke(score_prompt)
    violations, score = parse_antipattern_score(score_response.content)
    debt = build_debt_register(violations, llm_haiku)
    return {
        "component_diagram": parsed["mermaid"],
        "anti_pattern_score": score,
        "anti_pattern_violations": violations,
        "debt_register": debt,
    }

def assemble_draft(state: PipelineState) -> dict:
    # Fill the 12-section output template from Guide 1.
    # Every claim not yet verified by a gate gets marked [unverified].
    # Sections that only later gates can fill are passed as [unverified]
    # stubs so str.format never hits a missing key.
    doc = TEMPLATE.format(
        repo_name=Path(state["repo_path"]).name,
        date=date.today().isoformat(),
        architecture_style=state["architecture_style"],
        executive_summary="[unverified - drafted after Gate 2]",
        component_diagram=state["component_diagram"],
        layer_boundary_sentences="[unverified - drafted after Gate 2]",
        flow_traces=format_flows(state["flow_traces"]),
        state_stores=format_state_stores(state["state_stores"]),
        domain_concepts="[unverified - drafted after Gate 2]",
        integration_map="[unverified - drafted after Gate 2]",
        anti_pattern_score=state["anti_pattern_score"],
        anti_pattern_violations="\n".join(state["anti_pattern_violations"]),
        debt_register=format_debt(state["debt_register"]),
        security="[unverified - complete after Gate 2]",
        deployment="[unverified - complete after Gate 2]",
        adr_stubs="[generated at publish]",
    )
    return {"draft_doc": doc}

# The TEMPLATE variable - 12-section skeleton.
# Sections marked [unverified] will be updated during the validation gates.
TEMPLATE = """# Architecture Document - {repo_name}

> Generated by the agentic pipeline on {date}. Claims marked [unverified] have
> not yet passed a validation gate. Do not remove these markers until Gates 2-4
> have signed off on the relevant section.

## 01 - Executive overview
Architecture style: {architecture_style}
{executive_summary}

## 02 - Component diagram
```mermaid
{component_diagram}
```

## 03 - Layer boundary rules
{layer_boundary_sentences}

## 04 - Key flows
{flow_traces}

## 05 - State and data model
{state_stores}

## 06 - Domain concepts
{domain_concepts}

## 07 - Integration map
{integration_map}

## 08 - Anti-pattern register
Score: {anti_pattern_score}/35
{anti_pattern_violations}

## 09 - Debt register
{debt_register}

## 10 - Security and auth
{security}

## 11 - Deployment and operations
{deployment}

## 12 - Architecture decisions (ADRs)
{adr_stubs}
"""
````
Graph wiring - connecting all 12 nodes
```python
# graph.py - ArchLens pipeline
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.sqlite import SqliteSaver

def build_graph():
    g = StateGraph(PipelineState)

    # ── Add all 12 nodes ──
    g.add_node("ingest_repo", ingest_repo)
    g.add_node("chunk_by_module", chunk_by_module)
    g.add_node("pass1_structure", pass1_structure)
    g.add_node("pass2_behavior", pass2_behavior)
    g.add_node("pass3_compress", pass3_compress)
    g.add_node("assemble_draft", assemble_draft)
    g.add_node("gate1_static", gate1_static)
    g.add_node("gate2_human", gate2_human)  # pauses via interrupt()
    g.add_node("gate3_runtime", gate3_runtime)
    g.add_node("gate4_failure", gate4_failure)
    g.add_node("refinement", refinement)
    g.add_node("publish_doc", publish_doc)

    # ── Linear edges (Pass 1 → 3) ──
    g.set_entry_point("ingest_repo")
    g.add_edge("ingest_repo", "chunk_by_module")
    g.add_edge("chunk_by_module", "pass1_structure")
    g.add_edge("pass1_structure", "pass2_behavior")
    g.add_edge("pass2_behavior", "pass3_compress")
    g.add_edge("pass3_compress", "assemble_draft")
    g.add_edge("assemble_draft", "gate1_static")

    # ── Gate 1 → Gate 2 (always - violations are input to the G2 review) ──
    g.add_edge("gate1_static", "gate2_human")

    # ── Gate 2 → conditional ──
    g.add_conditional_edges("gate2_human", route_gate2, {
        "proceed": "gate3_runtime",
        "refine": "refinement",  # major disputes - loop back
    })

    # ── Gates 3 & 4 ──
    g.add_edge("gate3_runtime", "gate4_failure")

    # ── Gate 4 → conditional ──
    g.add_conditional_edges("gate4_failure", route_gate4, {
        "publish": "publish_doc",
        "refine": "refinement",
    })

    # ── Refinement → conditional loop ──
    g.add_conditional_edges("refinement", route_refinement, {
        "compress": "pass3_compress",  # major gaps → re-compress
        "draft": "assemble_draft",     # minor gaps → rebuild draft
        "publish": "publish_doc",      # max iterations hit
    })

    g.add_edge("publish_doc", END)

    # ── Compile with checkpointer ──
    # gate2_human calls interrupt() itself, so no interrupt_before list is
    # needed here. Note: in recent langgraph-checkpoint-sqlite releases,
    # from_conn_string is a context manager; the saver is assumed to be held
    # open for the lifetime of the process.
    checkpointer = SqliteSaver.from_conn_string("checkpoints.db")
    return g.compile(checkpointer=checkpointer)

# Router functions
def route_gate2(state: PipelineState) -> str:
    if state.get("gate2_approved"):
        return "proceed"
    if state.get("refinement_count", 0) >= 2:
        return "proceed"  # max loops reached
    return "refine"

def route_gate4(state: PipelineState) -> str:
    gaps = state.get("gate4_gaps", [])
    if not gaps or state.get("refinement_count", 0) >= 2:
        return "publish"
    return "refine"

def route_refinement(state: PipelineState) -> str:
    if state.get("refinement_count", 0) >= 2:
        return "publish"
    # Major gaps (multiple component boundaries disputed) → full re-compress
    if len(state.get("gate4_gaps", [])) > 3:
        return "compress"
    return "draft"
```
Graph topology - conditional edges visualised
The linear description above doesn't show the most important part: the conditional edges that make this a graph, not a chain. Here's what the routing actually looks like:
```mermaid
flowchart TD
    A[ingest_repo] --> B[chunk_by_module]
    B --> C[pass1_structure]
    C --> D[pass2_behavior]
    D --> E[pass3_compress]
    E --> F[assemble_draft]
    F --> G[gate1_static]
    G -->|always proceeds| H[gate2_human]
    H -->|approved| I[gate3_runtime]
    H -->|not approved| J[refinement]
    I --> K[gate4_failure]
    K -->|has gaps| J
    J -->|compress| E
    J -->|draft| F
    J -->|publish| L[publish_doc]
    L --> M([END])
    style H fill:#22C55E,color:#FFFFFF,stroke:#0ea5e9
    style J fill:#EC4899,color:#FFFFFF,stroke:#0ea5e9
    style L fill:#10B981,color:#FFFFFF,stroke:#0ea5e9
    style M fill:#2C2C2A,color:#F1EFE8,stroke:#2C2C2A
    style A fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style B fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style C fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style D fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style E fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style F fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style G fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style I fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style K fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
```
Three routing decisions determine the shape of any given run:
After Gate 2: If the developer approves → proceed to Gate 3. If not approved and refinement_count < 2 → refinement. After 2 refinement cycles → proceed regardless (to prevent infinite loops).
After Gate 4: If no gaps found, or refinement_count >= 2 → publish. Otherwise → refinement.
After Refinement: If refinement_count >= 2 → force publish with remaining gaps marked [unverified]. Otherwise, if there are more than 3 Gate 4 gaps (major structural gaps) → re-run pass3_compress from scratch; if the gaps are minor → rebuild the draft only.
The max refinement count of 2 is the safety valve. Without it, a codebase with genuinely ambiguous boundaries could loop indefinitely between Gate 4 and compression.
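That termination property is easy to check in isolation - a sketch that reproduces the Gate 4 router's logic from the graph wiring above and asserts the cap forces publication:

```python
def route_gate4(state: dict) -> str:
    # Same logic as the router in graph.py, reproduced for a standalone check
    gaps = state.get("gate4_gaps", [])
    if not gaps or state.get("refinement_count", 0) >= 2:
        return "publish"
    return "refine"

# A run with persistent gaps still refines below the cap...
assert route_gate4({"gate4_gaps": ["gap"], "refinement_count": 0}) == "refine"
assert route_gate4({"gate4_gaps": ["gap"], "refinement_count": 1}) == "refine"
# ...but terminates once the cap is hit, even if gaps remain
assert route_gate4({"gate4_gaps": ["gap"], "refinement_count": 2}) == "publish"
assert route_gate4({"gate4_gaps": []}) == "publish"
```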
4 · Validation gates - implementation
Gate 1 - Static validation
Compare the import graph produced by static analysis against the component diagram produced by the LLM. Any arrow in the diagram with no corresponding import, and any import with no corresponding arrow, is a violation. This is the automated form of the manual Gate 1 check described in From Unknown Codebase to Architecture Document.
```python
# nodes/gates.py
import re

def gate1_static(state: PipelineState) -> dict:
    # 1. Extract component relationships from the Mermaid diagram
    diagram_edges = parse_mermaid_edges(state["component_diagram"])
    # 2. Extract relationships from the actual import graph
    import_edges = extract_import_edges(
        state["dependency_graph"], state["module_summaries"]
    )
    violations = []
    # In the diagram but with no imports to support it
    for edge in diagram_edges:
        if edge not in import_edges:
            violations.append(
                f"Diagram shows {edge[0]} → {edge[1]} but no imports support this"
            )
    # Significant imports with no diagram representation
    for edge in import_edges:
        if edge not in diagram_edges and is_significant(edge, state):
            violations.append(
                f"Import {edge[0]} → {edge[1]} has no corresponding diagram arrow"
            )
    return {
        "gate1_violations": violations,
        "gate1_passed": len(violations) == 0,
    }

def parse_mermaid_edges(mermaid: str) -> set[tuple]:
    # Extract "A --> B" or "A --calls--> B" patterns
    edges = set()
    for line in mermaid.split("\n"):
        m = re.search(r"(\w+)\s*--[>\w\s]*-*>\s*(\w+)", line)
        if m:
            edges.add((m.group(1), m.group(2)))
    return edges

def extract_import_edges(
    dependency_graph: dict, module_summaries: list[ModuleSummary]
) -> set[tuple]:
    """
    Maps module-level imports (com.example.payment → com.example.order) to
    capability-level edges (Payment Processing → Order Management) so they can
    be compared against the Mermaid component diagram.

    This is the hard part of Gate 1: the diagram uses compressed capability
    names but the dependency graph uses raw module IDs. The mapping goes
    through module_summaries, which holds both the module_id and its capability.
    """
    # Build lookup: module_id → capability name
    module_to_capability = {
        s["module_id"]: s["capability"] for s in module_summaries
    }
    capability_edges = set()
    for src_module, dst_modules in dependency_graph.items():
        src_cap = module_to_capability.get(src_module)
        if not src_cap:
            continue  # module has no summary - skip (chunking gap, not a violation)
        for dst_module in dst_modules:
            dst_cap = module_to_capability.get(dst_module)
            if not dst_cap:
                continue  # same - skip unmapped modules
            if src_cap != dst_cap:  # ignore intra-component imports
                capability_edges.add((src_cap, dst_cap))
    return capability_edges

def is_significant(edge: tuple, state: PipelineState) -> bool:
    """
    An import edge is significant - and worth flagging as a diagram violation -
    if it meets any of the following criteria. Note the edge here is
    capability-level (see extract_import_edges), so lookups go by capability.
    """
    src, dst = edge
    # 1. Crosses a layer boundary (handler → dao, service → handler, etc.)
    src_layer = next((s["layer"] for s in state["module_summaries"]
                      if s["capability"] == src), "unknown")
    dst_layer = next((s["layer"] for s in state["module_summaries"]
                      if s["capability"] == dst), "unknown")
    layer_order = {"handler": 0, "service": 1, "dao": 2, "domain": 3,
                   "integration": 4, "unknown": 5}
    crosses_boundary = abs(
        layer_order.get(src_layer, 5) - layer_order.get(dst_layer, 5)
    ) > 1
    # 2. High import frequency - imported by 3+ other modules
    call_count = sum(
        1 for m in state["dependency_graph"].values() if dst in m
    )
    high_frequency = call_count >= 3
    # 3. One of the modules is a known entry point
    involves_entry = src in state["entry_points"] or dst in state["entry_points"]
    return crosses_boundary or high_frequency or involves_entry
```
Gate 2 - Human-in-the-loop interrupt
LLM-generated component diagrams have a ~30% error rate on relationships. This gate directly implements Gate 2 (Developer Validation) from the companion guide - it is the quality checkpoint that makes the output trustworthy. Build the interrupt, then build the UI around it.
```python
# Gate 2 node - uses LangGraph interrupt()
from langgraph.types import interrupt

def gate2_human(state: PipelineState) -> dict:
    # interrupt() pauses execution and surfaces data to the caller.
    # The graph checkpoints here and waits until resumed.
    feedback = interrupt({
        "action": "review_required",
        "draft_doc": state["draft_doc"],
        "component_diagram": state["component_diagram"],
        "gate1_violations": state["gate1_violations"],
        "questions": [
            "Does this component diagram match how you'd explain the system to a new hire?",
            "Are there components I've merged that you'd keep separate?",
            "Is there significant behavior that doesn't appear in these diagrams?",
            "Does the consistency boundary match what actually commits and rolls back?",
            "Which parts would you argue with?",
        ],
    })
    return {
        "human_feedback": feedback.get("comments", ""),
        "gate2_approved": feedback.get("approved", False),
    }
```
Running and resuming the pipeline:
```python
# Caller - how to run and resume the graph
from langgraph.types import Command

graph = build_graph()
thread_id = "run-001"
config = {"configurable": {"thread_id": thread_id}}

# Start the pipeline - it will pause at gate2_human
for event in graph.stream(initial_state(), config=config):
    print(event)
# ↑ pipeline pauses here; the interrupt data is in the event stream

# --- hours or days later ---
# Human has reviewed - resume with their feedback
for event in graph.stream(
    Command(resume={
        "approved": True,
        "comments": "PaymentProcessing should be separate from OrderManagement. Split those.",
    }),
    config=config,
):
    print(event)
# ↑ pipeline resumes from gate2_human with the human's feedback in state
```
Gate 2 UI - minimal Streamlit implementation:
```python
# review_ui.py - minimal Streamlit UI for Gate 2
import streamlit as st
from langgraph.types import Command

st.set_page_config(page_title="Architecture Review", layout="wide")

thread_id = st.query_params.get("thread", "run-001")
config = {"configurable": {"thread_id": thread_id}}
graph = build_graph()

# Load the current state from the checkpoint
state = graph.get_state(config)
data = state.tasks[0].interrupts[0].value if state.tasks else {}

col1, col2 = st.columns([1, 1])
with col1:
    st.subheader("Component Diagram")
    st.code(data.get("component_diagram", ""), language="text")
    if data.get("gate1_violations"):
        st.warning("Gate 1 violations:\n" + "\n".join(data["gate1_violations"]))
with col2:
    st.subheader("Your review")
    for q in data.get("questions", []):
        st.markdown(f"• {q}")
    comments = st.text_area("Comments (specific component boundary changes):", height=200)
    approved = st.checkbox("I approve this architecture document to proceed")
    if st.button("Submit review"):
        for _ in graph.stream(
            Command(resume={"approved": approved, "comments": comments}),
            config=config,
        ):
            pass
        st.success("Pipeline resumed. Proceeding to Gates 3 and 4.")
```
Gates 3 & 4
```python
def gate3_runtime(state: PipelineState) -> dict:
    if not state.get("log_source"):
        # No log source configured - mark as skipped, not failed
        return {"gate3_gaps": [], "gate3_passed": True}
    # 1. Fetch the top endpoints from logs
    top_endpoints = fetch_top_endpoints(state["log_source"], limit=10)
    # 2. Extract documented flow entry points from the sequence diagrams
    documented = {
        t.get("entry_point") for t in state["flow_traces"] if t.get("entry_point")
    }
    # 3. Find high-frequency paths not in any documented flow
    gaps = [
        f"High-frequency path {ep['path']} ({ep['count']} calls) not in any sequence diagram"
        for ep in top_endpoints
        if not any(ep["path"] in d for d in documented)
    ]
    return {"gate3_gaps": gaps, "gate3_passed": len(gaps) == 0}

def gate4_failure(state: PipelineState) -> dict:
    scenarios = [
        "Primary database becomes unavailable for 5 minutes",
        "External payment gateway returns 503 for all requests",
        "Message broker / queue becomes unreachable",
        "Application server restarts mid-request",
    ]
    gaps = []
    for scenario in scenarios:
        prompt = build_failure_eval_prompt(
            scenario, state["draft_doc"], state["component_diagram"]
        )
        response = llm_haiku.invoke(prompt)
        result = parse_failure_result(response.content)
        if not result["doc_explains_behavior"]:
            gaps.append(f"Scenario '{scenario}': {result['gap']}")
    return {"gate4_gaps": gaps, "gate4_passed": len(gaps) == 0}
```
Refinement loop
```python
def refinement(state: PipelineState) -> dict:
    count = state.get("refinement_count", 0) + 1
    notes = list(state.get("refinement_notes", []))
    updates = {}

    # Apply human feedback if present
    if state.get("human_feedback"):
        feedback_prompt = build_feedback_application_prompt(
            state["human_feedback"],
            state["component_diagram"],
            state["module_summaries"],
        )
        updated_diagram = llm_sonnet.invoke(feedback_prompt).content
        updates["component_diagram"] = updated_diagram
        notes.append(f"Refinement {count}: applied human feedback")

    # Apply Gate 3 gaps - annotate draft with unverified markers
    if state.get("gate3_gaps"):
        draft = state["draft_doc"]
        for gap in state["gate3_gaps"]:
            draft += f"\n\n> **[unverified]** {gap}"
        updates["draft_doc"] = draft

    # Apply Gate 4 gaps - add to debt register as high-risk items
    if state.get("gate4_gaps"):
        new_debt = [
            DebtItem(
                title=g,
                location="architecture doc",
                risk_score=4,
                effort_score=2,
                decision="Fix Next Sprint",
                owner=None,
                target_date=None,
            )
            for g in state["gate4_gaps"]
        ]
        updates["debt_register"] = state["debt_register"] + new_debt

    return {**updates, "refinement_count": count, "refinement_notes": notes}
```

Handling contradictory human feedback
The most common reviewer instruction is "split X and Y into separate components." This is almost always correct from an architectural standpoint - but the dependency graph sometimes doesn't support it yet because the code hasn't been refactored to respect that boundary. The refinement node applies feedback via an LLM call, so it will produce a diagram with the split - but Gate 1 will immediately flag a violation when that new diagram boundary has no corresponding import separation.
The right response is not to ignore the split. It's to document the violation explicitly:
```python
def build_feedback_application_prompt(feedback, diagram, summaries) -> str:
    return f"""Apply this reviewer feedback to the component diagram:

FEEDBACK: {feedback}

CURRENT DIAGRAM:
{diagram}

IMPORTANT: If the requested boundary change is not supported by the current
import graph, still apply the change BUT add a comment to the affected
components noting: "[debt] boundary exists in architecture intent but not
yet in code - import separation required."

This makes the aspirational architecture explicit rather than silently
keeping a wrong diagram to avoid a Gate 1 violation."""
```

This produces a diagram that reflects the intended architecture - which is more useful than one that only reflects the current state - while making the gap explicit as a debt item. Gate 1 will still flag it, but the violation is now documented as an intentional architectural decision rather than a detection error.
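If you want Gate 1 reports to reflect this distinction, you can separate violations that touch a `[debt]`-marked boundary from genuinely unexpected ones. A minimal sketch follows; the string conventions (violations naming components as capitalised tokens, `[debt]` comments on diagram lines) are illustrative assumptions, not fixed ArchLens APIs:

```python
# Sketch: split Gate 1 violations into "acknowledged" (the diagram carries a
# [debt] comment mentioning an affected component) vs "unexpected". Assumes
# violation strings name components as capitalised tokens - an illustrative
# convention, adapt to your own violation format.
def classify_gate1_violations(violations: list[str], diagram: str) -> dict:
    debt_lines = [ln for ln in diagram.splitlines() if "[debt]" in ln]
    acknowledged, unexpected = [], []
    for v in violations:
        # Capitalised tokens in the violation text are treated as component names
        components = [tok for tok in v.replace("->", " ").split() if tok[:1].isupper()]
        if any(c in ln for ln in debt_lines for c in components):
            acknowledged.append(v)
        else:
            unexpected.append(v)
    return {"acknowledged": acknowledged, "unexpected": unexpected}
```

Acknowledged violations can then be reported as intentional debt rather than failures, without weakening the gate itself.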
5 · Production concerns
Cost & latency model
A typical medium codebase (~40 modules, 3 flows) costs approximately:
| Stage | Model | Approx tokens | Approx cost | Approx time |
|---|---|---|---|---|
| Pass 1 (40 modules) | Sonnet | ~160k in / 40k out | ~$0.60 | 3–5 min |
| Pass 2 (3 flows) | Sonnet | ~30k in / 15k out | ~$0.12 | 1–2 min |
| Pass 3 (compress) | Sonnet | ~20k in / 8k out | ~$0.08 | 30–60s |
| Anti-pattern scoring | Haiku | ~15k in / 3k out | ~$0.01 | 15–30s |
| Gate 4 (4 scenarios) | Haiku | ~20k in / 4k out | ~$0.01 | 30–60s |
| Total | - | ~245k in / 70k out | ~$0.82 | ~7–9 min |
Anti-pattern scoring, architecture style identification, and failure scenario evaluation are classification tasks. Haiku handles them accurately at 1/20th the cost of Sonnet. Reserve Sonnet for module summarisation, flow tracing, and synthesis - tasks where reasoning depth matters.
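That routing rule can be made explicit in code so a misrouted task fails loudly instead of silently burning Sonnet tokens. A minimal sketch, where the task names and the returned model labels are illustrative assumptions rather than a fixed ArchLens API:

```python
# Sketch of the model-routing rule above: reasoning-heavy tasks go to Sonnet,
# classification-style tasks to Haiku. Task names are illustrative; map the
# returned label to whatever model identifier your account uses.
SONNET_TASKS = {"module_summary", "flow_trace", "synthesis"}
HAIKU_TASKS = {"anti_pattern_scoring", "style_identification", "failure_eval"}

def model_for(task: str) -> str:
    if task in SONNET_TASKS:
        return "sonnet"  # reasoning depth matters
    if task in HAIKU_TASKS:
        return "haiku"   # classification at roughly 1/20th the cost
    raise ValueError(f"Unknown task: {task}")  # fail loudly, don't default to Sonnet
```

Raising on unknown tasks is deliberate: a new node added without a routing decision should not quietly default to the expensive model.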
Parallel module summarisation
Pass 1 processes 40 modules sequentially by default. Parallelise it with asyncio and rate limiting:
```python
import asyncio
from asyncio import Semaphore


async def pass1_structure_parallel(state: PipelineState) -> dict:
    sem = Semaphore(5)  # max 5 concurrent Sonnet calls

    async def summarise_chunk(chunk):
        async with sem:
            prompt = build_module_prompt(chunk, state["build_file_summary"])
            response = await llm_sonnet.ainvoke(prompt)
            return parse_module_summary(response.content, chunk["module_id"])

    summaries = await asyncio.gather(*[
        summarise_chunk(chunk) for chunk in state["module_chunks"]
    ])
    return {"module_summaries": list(summaries)}

# Reduces Pass 1 from ~5 minutes to ~90 seconds on 40 modules
```

Error recovery
Three failure modes to handle explicitly:
```python
import json

from tenacity import retry, stop_after_attempt, wait_exponential


# 1. LLM call failure - retry with exponential backoff
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=2, max=30))
def llm_call_with_retry(prompt):
    return llm_sonnet.invoke(prompt)


# 2. Malformed LLM output - validate before writing to state
def parse_module_summary(content: str, module_id: str) -> ModuleSummary:
    try:
        parsed = extract_json_from_llm_output(content)
        assert "capability" in parsed
        assert "responsibility" in parsed
        return ModuleSummary(**parsed, module_id=module_id, raw_llm_output=content)
    except (json.JSONDecodeError, AssertionError, KeyError):
        # Fallback: use a degraded summary rather than crashing the pipeline
        return ModuleSummary(
            module_id=module_id,
            capability=module_id.split(".")[-1],  # use package name as fallback
            responsibility="[parse error - review manually]",
            dependencies=[],
            patterns=[],
            anti_patterns=[],
            raw_llm_output=content,
        )


# 3. Checkpoint recovery - resume after crash
def resume_from_checkpoint(thread_id: str):
    config = {"configurable": {"thread_id": thread_id}}
    state = graph.get_state(config)
    if state.next:
        print(f"Resuming from: {state.next}")
        for event in graph.stream(None, config=config):  # None = resume from checkpoint
            print(event)
    else:
        print("Pipeline complete or not started")
```

Observability
Enable LangSmith tracing by setting three environment variables, then instrument every LLM call with node-level metadata so you can filter traces by pass and module:
```
# .env
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-key
LANGCHAIN_PROJECT=arch-pipeline
```

```python
# Per-node metadata - attach to every LLM call
def pass1_structure(state: PipelineState, config: dict) -> dict:
    for i, chunk in enumerate(state["module_chunks"]):
        response = llm_sonnet.invoke(
            build_module_prompt(chunk, state["build_file_summary"]),
            config={"metadata": {
                "node": "pass1_structure",
                "module_id": chunk["module_id"],
                "chunk_index": i,
                "total_chunks": len(state["module_chunks"]),
            }},
        )
        # ... parse the response and collect summaries as in the sequential version
```

What a healthy run looks like in LangSmith:
- `pass1_structure` traces are uniform in latency (~3–5s per module). Outliers (>10s) usually mean a chunk is oversized - check `import_count` and split it.
- `pass3_compress` produces one trace with a single large input (all module summaries concatenated) and a short JSON output. If the output is longer than ~2,000 tokens, the compression prompt isn't enforcing the 5–12 component limit - add that constraint explicitly.
- `gate2_human` shows an `interrupt` event followed by a long gap (hours or days) before the `Command(resume=...)` event. This is normal. If the gap is zero seconds, the interrupt isn't working and you're running Gate 3 without human review.
- `gate4_failure` produces four short traces (one per scenario). If all four return `doc_explains_behavior: true` on the first run, something is wrong - the prompt is too lenient. Tighten the evaluation criteria.
What a bad chunking run looks like:
- `pass1_structure` traces show high variance in output length - some modules produce 50-word summaries, others produce 500-word summaries. This usually means `infer_layer` is returning `"unknown"` for large chunks, and the LLM is trying to summarise everything without a layer hint.
- Module summaries list dependencies that don't appear in `dependency_graph` - the classic hallucination signal. Cross-check any `dependencies` field in a `ModuleSummary` against the actual `dependency_graph` keys. Invented dependencies cascade into wrong component diagrams.
- `pass3_compress` output has more than 12 components despite the constraint. The synthesis prompt isn't strong enough, or the module summaries themselves are too granular. Loop back and check whether `capability` fields in summaries are class names rather than capability names.
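The dependency cross-check is mechanical enough to automate. A minimal sketch, assuming summaries are dicts with `module_id` and `dependencies` fields and that `dependency_graph` is keyed by module ID (as elsewhere in this guide):

```python
# Sketch: flag any dependency a module summary claims that the static
# dependency_graph never recorded - the hallucination signal described above.
def find_hallucinated_deps(summaries: list[dict], dependency_graph: dict) -> dict:
    known = set(dependency_graph.keys())
    flagged = {}
    for s in summaries:
        invented = [d for d in s.get("dependencies", []) if d not in known]
        if invented:
            flagged[s["module_id"]] = invented
    return flagged
```

Running this right after Pass 1 turns a silent diagram error into a visible parse-time warning.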
Prompt templates - complete reference
Module summary prompt
```python
def build_module_prompt(chunk: ModuleChunk, build_summary: dict) -> str:
    files_text = "\n\n".join([
        f"=== {name} ===\n{content[:3000]}"  # cap per file
        for name, content in chunk["file_contents"].items()
    ])
    return f"""You are analyzing a module in a {build_summary.get('framework', 'software')} codebase.

BUILD CONTEXT:
{json.dumps(build_summary, indent=2)}

MODULE: {chunk['module_id']}
INFERRED LAYER: {chunk['layer']}

SOURCE FILES:
{files_text}

Analyze this module. Respond ONLY with valid JSON matching this schema exactly:
{{
  "capability": "string - one capability name (not a class name, e.g. 'Order Management' not 'OrderService')",
  "responsibility": "string - one sentence describing what this module enables",
  "dependencies": ["list of other module IDs this imports from"],
  "patterns": ["list of design patterns identified, e.g. Repository, Service Facade"],
  "anti_patterns": ["list of violations, e.g. 'business logic in handler', 'N+1 query risk'"],
  "layer_violations": ["list of imports that cross layer boundaries incorrectly"]
}}

Rules:
- capability MUST be a business capability name, never a class name
- responsibility MUST be one sentence, max 20 words
- Only list dependencies that are internal modules (not third-party libraries)
- Only list anti_patterns you can point to specific lines of evidence for"""
```

Synthesis prompt (Pass 3)
```python
def build_synthesis_prompt(summaries, flows, style) -> str:
    summary_text = "\n".join([
        f"- {s['module_id']}: {s['capability']} - {s['responsibility']}"
        for s in summaries
    ])
    return f"""You are compressing a codebase analysis into an architecture document.

ARCHITECTURE STYLE: {style}

MODULE SUMMARIES:
{summary_text}

Apply ALL FIVE compression rules strictly:
1. Collapse classes/modules into capabilities (5-12 components MAX, never more)
2. Collapse endpoints into use cases (not HTTP verbs)
3. Collapse tables into domain concepts (aggregate roots)
4. Collapse integrations into roles (not product names)
5. Express each layer boundary in one sentence

Respond ONLY with valid JSON:
{{
  "components": [
    {{
      "name": "capability name",
      "responsibility": "one sentence",
      "dependencies": ["other component names"],
      "source_modules": ["module_ids compressed into this component"]
    }}
  ],
  "mermaid": "complete C4Context or graph TD Mermaid diagram source",
  "layer_boundary_sentences": [
    "The handler layer delegates all business decisions to the service layer.",
    "The service layer owns consistency boundaries and orchestrates repositories.",
    "Repositories are the only components that touch the data store."
  ],
  "domain_concepts": [
    {{"concept": "Order", "tables": ["orders", "order_lines"], "aggregate_root": "Order"}}
  ]
}}

Hard constraints:
- components array MUST have 5-12 items. If you have more, merge.
- Every component name must be a business capability, never a class name.
- mermaid must be syntactically valid Mermaid."""
```

Failure evaluation prompt (Gate 4)
```python
def build_failure_eval_prompt(scenario, draft_doc, diagram) -> str:
    return f"""You are evaluating whether an architecture document adequately covers a failure scenario.

FAILURE SCENARIO: {scenario}

ARCHITECTURE DOCUMENT (excerpt):
{draft_doc[:4000]}

COMPONENT DIAGRAM:
{diagram}

Answer these questions based ONLY on what is explicitly documented:
1. Which components fail immediately when this scenario occurs?
2. Which components can degrade gracefully?
3. Does the document show a circuit breaker or fallback? (yes/no)
4. Does the document explain the user-visible impact? (yes/no)
5. Is there anything about this scenario the document cannot explain?

Respond ONLY with valid JSON:
{{
  "doc_explains_behavior": true or false,
  "immediate_failures": ["component names"],
  "graceful_degradation": ["component names"],
  "has_fallback_documented": true or false,
  "user_impact_documented": true or false,
  "gap": "one sentence describing what the doc fails to explain, or null if no gap"
}}"""
```

Node 12 - publish_doc implementation
Every other node has been shown in full. Node 12 is the simplest, but omitting it would leave the guide incomplete.
```python
# nodes/publish.py
from pathlib import Path
from datetime import date


def publish_doc(state: PipelineState) -> dict:
    """
    Final node - write the validated document to disk and generate ADR stubs.

    At this point all gate results are in state and [unverified] markers
    have been applied by the refinement node to any unresolved gaps.
    """
    repo_name = Path(state["repo_path"]).name
    today = date.today().isoformat()

    # 1. Finalise the document - replace [unverified] with [gate-X-gap]
    #    so the reader knows which gate each open item came from
    final = state["draft_doc"]
    for gap in state.get("gate3_gaps", []):
        final = final.replace("[unverified]", "[gate-3-gap]", 1)
    for gap in state.get("gate4_gaps", []):
        final = final.replace("[unverified]", "[gate-4-gap]", 1)

    # 2. Generate ADR stubs for the top 3 debt items by risk score
    top_debt = sorted(
        state.get("debt_register", []),
        key=lambda d: d["risk_score"],
        reverse=True,
    )[:3]
    adr_stubs = []
    for i, item in enumerate(top_debt, start=1):
        stub = f"""## ADR-{i:03d} - {item['title']}

**Status:** Proposed
**Date:** {today}
**Context:** {item['location']}
**Decision:** [pending - risk score {item['risk_score']}/5, effort {item['effort_score']}/5]
**Consequences:** [to be completed by architecture owner]"""
        adr_stubs.append(stub)

    # 3. Write to output directory
    out_dir = Path(state["repo_path"]).parent / "architecture-docs"
    out_dir.mkdir(exist_ok=True)

    doc_path = out_dir / f"{repo_name}-architecture-{today}.md"
    doc_path.write_text(final)

    adr_path = out_dir / f"{repo_name}-adrs-{today}.md"
    adr_path.write_text("\n\n---\n\n".join(adr_stubs))

    print(f"✓ Architecture doc written to {doc_path}")
    print(f"✓ ADR stubs written to {adr_path}")

    return {"final_doc": final, "adr_stubs": adr_stubs}
```

The two output files land in an `architecture-docs/` directory next to the repo. The ADR stubs are intentionally skeletal - they capture the decision that needs to be made and the context, but leave the actual decision and consequences for the team to fill in.
This is intentional: a stub that names the decision is worth more than a completed ADR that was written by an LLM without human context.
ArchLens in action - end-to-end demo
The best way to understand what ArchLens produces is to see a real run. The following is representative output from the pipeline run against a mid-size Spring Boot e-commerce application (~120 Java files, 14 packages). Your codebase will produce different component names and debt items - the structure and compression ratio will be consistent, but the specifics are always codebase-specific.
Before - what the codebase looks like
```
Repository: spring-ecommerce
Language:   Java 17 · Spring Boot 3.1
Files:      120 source files
Packages:   14 (com.example.order, .payment, .inventory, .user, .notification,
            .report, .admin, .config, .security, .util, .dto, .exception,
            .scheduler, .integration)
```

No architecture document exists. A new engineer joining this team would spend 2–3 weeks reading code to build the mental model that ArchLens produces in 8 minutes.
After - what ArchLens produces
Component diagram (Pass 3 output, 8 components from 120 files):
```mermaid
C4Context
    title ArchLens output - spring-ecommerce

    Boundary(entry, "Entry / Interface") {
        Component(api, "API Gateway", "Spring MVC", "Routes all inbound HTTP · auth middleware · rate limiting")
    }
    Boundary(core, "Business Core") {
        Component(order, "Order Management", "Service", "Place · modify · cancel · track orders")
        Component(payment, "Payment Processing", "Service", "Charge · refund · gateway abstraction")
        Component(inventory, "Inventory Control", "Service", "Reserve · release · restock")
        Component(user, "User Management", "Service", "Registration · auth · profile · notifications")
    }
    Boundary(support, "Supporting Capabilities") {
        Component(report, "Reporting", "Service", "Sales · inventory · audit reports")
        Component(notify, "Notification Service", "Async", "Email · SMS · push via adapter")
    }
    Boundary(data, "Data & Integration") {
        Component(store, "Data Store", "PostgreSQL", "Primary persistence - orders · users · inventory")
    }

    Rel(api, order, "delegates")
    Rel(api, user, "delegates")
    Rel(order, payment, "calls")
    Rel(order, inventory, "reserves")
    Rel(order, notify, "emits event")
    Rel(payment, store, "reads/writes")
    Rel(order, store, "reads/writes")
    Rel(inventory, store, "reads/writes")
```
Key flow - place order (Pass 2 output):
```mermaid
sequenceDiagram
    participant U as User
    participant A as API Gateway
    participant O as Order Management
    participant I as Inventory Control
    participant P as Payment Processing
    participant N as Notification Service
    participant DB as Data Store

    U->>A: POST /api/orders
    A->>O: placeOrder() [consistency boundary begins]
    O->>I: reserve(items)
    I->>DB: UPDATE stock_levels
    I-->>O: reserved
    O->>P: charge(amount) [external - outside boundary]
    P-->>O: charged
    O->>DB: INSERT order + order_lines
    O-->>A: [consistency boundary commits]
    O-)N: order.placed [async, fire-and-forget]
    A-->>U: 201 Created
```
Debt register (Pass 3 anti-pattern scoring, score: 9/35 - all 12 items shown):
| # | Item | Risk | Effort | Decision |
|---|---|---|---|---|
| 1 | ReportService contains business logic (340 lines) | 4 | 2 | Fix Next Sprint |
| 2 | OrderController calls InventoryRepository directly (layer skip) | 4 | 1 | Fix Now |
| 3 | No circuit breaker on PaymentGatewayClient | 4 | 3 | Plan & Track |
| 4 | UserService handles auth, profile, AND notification dispatch | 3 | 3 | Plan & Track |
| 5 | 3 unindexed foreign key columns in order_lines | 3 | 1 | Fix When Passing |
| 6 | SchedulerService has no idempotency guard on job execution | 3 | 2 | Fix Next Sprint |
| 7 | AdminController bypasses auth middleware for 2 endpoints | 4 | 1 | Fix Now |
| 8 | NotificationService retries with no dead-letter queue | 3 | 2 | Fix Next Sprint |
| 9 | OrderRepository runs N+1 query on findWithLines() | 3 | 1 | Fix When Passing |
| 10 | IntegrationLayer has no contract/spec for 3 external calls | 2 | 2 | Backlog With Date |
| 11 | ConfigService loads secrets from env at startup - no rotation | 2 | 3 | Accept & Document |
| 12 | DtoMapper contains domain validation logic | 2 | 1 | Fix When Passing |
The compression in numbers
| Metric | Before | After |
|---|---|---|
| Source files | 120 | - |
| Packages / modules | 14 | - |
| Components in diagram | - | 8 |
| Key flows documented | - | 3 |
| Anti-pattern score | - | 9/35 (moderate) |
| Debt items surfaced | - | 12 |
| Gate 1 violations | - | 3 (all layer skips) |
| Time to produce | 2–3 weeks (manual) | 8 minutes |
The compression from 14 packages to 8 capability-named components is exactly what the 5 compression rules are designed to produce. OrderController, OrderService, and OrderRepository become a single Order Management component. PaymentClient and PaymentGatewayAdapter become Payment Processing. What was a taxonomy of classes becomes a map of what the system actually does.
Where to start - the 3-day plan
```
pip install "langgraph>=0.2.0" \
    langchain-anthropic \
    langgraph-checkpoint-sqlite \
    tenacity \
    gitpython \
    streamlit
```

Key versions to pin: `langgraph>=0.2.0` for `interrupt()` support and `SqliteSaver`; `langchain-anthropic>=0.3.0` for claude-sonnet-4-6 and claude-haiku-4-5-20251001. The `langgraph-checkpoint-sqlite` package is the most commonly missed - without it `SqliteSaver.from_conn_string()` will raise an `ImportError` at graph compilation.
Get module chunking working reliably on one real codebase and validate the module summaries before touching compression, diagram generation, or any validation gate. Everything downstream depends on chunking quality.
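The first cut of chunking can be as simple as grouping source files by package directory, so each chunk maps to one candidate module. A minimal sketch under that assumption - a real chunker would also split oversized packages and infer layers, and the `depth` parameter is an illustrative knob, not a fixed ArchLens setting:

```python
# Sketch: group source files by their package directory (first `depth` path
# segments) to form module chunks. Grouping only - no size limits, no layer
# inference - so you can validate module boundaries before anything else.
from collections import defaultdict
from pathlib import PurePosixPath

def chunk_by_package(file_paths: list[str], depth: int = 3) -> dict[str, list[str]]:
    chunks: dict[str, list[str]] = defaultdict(list)
    for p in file_paths:
        parts = PurePosixPath(p).parent.parts[:depth]
        module_id = ".".join(parts) if parts else "(root)"
        chunks[module_id].append(p)
    return dict(chunks)
```

Run this on a real repo and eyeball the resulting module IDs before spending a single LLM token: if the grouping looks wrong here, every downstream summary will be wrong too.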
| Day | Goal | Success criterion |
|---|---|---|
| Day 1 | Build nodes 1–3. Get module summaries for a real codebase. | A developer who knows the codebase reads 5 random module summaries and says they are accurate. |
| Day 2 | Build nodes 4–6. Generate a draft component diagram and sequence diagrams. | The component diagram has 5–12 components named as capabilities (not classes). At least one sequence diagram traces a real flow end to end. |
| Day 3 | Build node 8 (human interrupt) and the review UI. Wire the LangGraph graph. | A developer can pause the pipeline at gate2_human, review the draft, submit feedback, and the pipeline resumes with their comments in state. |
| Week 2 | Build gates 1, 3, 4 and the refinement node. Add error recovery. | The pipeline completes end to end on a medium codebase (~30 modules) and produces a validated, published architecture document. |
The most common failure is building the visual output first and discovering the chunking problem six weeks later when the diagrams consistently contain invented components. Module summaries first. Diagrams second.
Related reading
This guide is part two of a two-part series:
- Part 1 - From Unknown Codebase to Architecture Document: The full 3-pass methodology, compression rules, anti-pattern checklist, validation gates, debt register, and 12-section output template. Read this first.
- Part 2 - This guide: Automating Part 1 as a 12-node LangGraph pipeline with human-in-the-loop review and production error recovery.
Minimal runnable entry point
```python
# main.py - run ArchLens on a local repo
from graph import build_graph
from state import PipelineState

graph = build_graph()

initial = PipelineState(
    repo_path="/path/to/your/repo",
    target_stack="java",   # or python / nodejs / go / dotnet
    log_source=None,       # set to APM endpoint for Gate 3
    build_file_summary={},
    dependency_graph={},
    module_chunks=[],
    module_summaries=[],
    entry_points=[],
    architecture_style="",
    flow_traces=[],
    state_stores=[],
    async_boundaries=[],
    component_diagram="",
    anti_pattern_score=0,
    anti_pattern_violations=[],
    debt_register=[],
    draft_doc="",
    gate1_violations=[],
    gate1_passed=False,
    human_feedback=None,
    gate2_approved=False,
    gate3_gaps=[],
    gate3_passed=False,
    gate4_gaps=[],
    gate4_passed=False,
    refinement_count=0,
    refinement_notes=[],
    final_doc="",
    adr_stubs=[],
)

config = {"configurable": {"thread_id": "run-001"}}

# "updates" yields one {node_name: state_update} event per step, which is
# what lets us print the node name below
for event in graph.stream(initial, config=config, stream_mode="updates"):
    node = list(event.keys())[0]
    print(f"✓ Completed: {node}")
    if node == "gate2_human":
        print("⏸ Paused for human review. Run review UI or resume programmatically.")
        break
```

What you actually get - and where it still fails
After building this pipeline on a real codebase, the output is genuinely useful in ways that manual documentation isn't.
The document reflects what the code actually does, not what the team remembers it was supposed to do. Gate 1 catches the disconnect between the architecture someone drew on a whiteboard six months ago and the import graph that exists today. Gate 3 catches the flows that are real but undocumented - the three endpoints that account for 60% of traffic but never made it into the sequence diagrams. Gate 4 catches the failure modes the team hadn't thought to document. These are the gaps that only surface when you systematically compare your documented architecture against evidence.
The human interrupt at Gate 2 is where the real value compounds. The developer review isn't just a quality check - it's a forcing function. When someone reads five targeted questions about a diagram the pipeline produced, they engage with the architecture in a way that a blank-page documentation request never produces. The pipeline creates the artefact; the human makes it accurate.
Where it still fails:
The pipeline struggles with three codebase types. First, highly dynamic codebases where runtime behaviour differs significantly from static structure - plugin architectures, script-injection patterns, or heavy reflection. Gate 1 can only check what's in the import graph; it can't see what gets loaded at runtime. Second, monorepos with more than ~80 modules where the synthesis step (Pass 3) receives too much input and produces over-merged components. The 5–12 component ceiling becomes an approximation rather than a compression. Third, polyglot codebases - a Java backend with a Python ML service and a Go gateway - where the chunker runs separately per stack and the synthesis step has to reconcile summaries with no shared module boundary language.
None of these are reasons not to build it. They're reasons to know what the output is and isn't before you publish the document it produces. The pipeline gets you 80% of a good architecture document in 8 minutes. The remaining 20% still requires a human who knows the system - and that's exactly why Gate 2 exists.
What next
If you've read this far and are ready to build:
- Start with the companion guide - From Unknown Codebase to Architecture Document covers the 3-pass methodology, all 5 compression rules, the validation gates, and the 12-section output template that ArchLens generates. It's the conceptual foundation this guide implements.
- Run nodes 1–3 first - get module chunking working on one real codebase before touching anything else. The quality of every downstream output is set at node 2.
- Share your output - if you run ArchLens and get a component diagram you're proud of (or one that surprised you), share it. The most useful thing for this methodology's development is real-world runs on real codebases.
If you have questions on a specific implementation decision, or hit a chunking problem that the fallback strategies in this guide don't cover, the comments are open.