Every engineering team eventually inherits a codebase with no architecture documentation. A new hire joins and spends three weeks mapping dependencies by reading code. A tech lead wants to present the system to stakeholders but has nothing to show. A modernisation project begins and no one can agree on the current boundaries.
The standard solution is to schedule an architecture workshop, assign someone to "write the doc," and watch it stall under competing priorities. The doc either never gets written, or it gets written once and immediately starts drifting from reality.
This guide builds ArchLens - a 12-node LangGraph pipeline that takes a Git repository as input and produces a validated, production-ready architecture document as output - complete with component diagrams, sequence diagrams, a debt register, and ADR stubs. The pipeline runs the 3-pass analysis methodology automatically, pauses for human review at the right moment, and loops back for refinement when gaps are found.
What you will build
By the end of this guide you will have a working ArchLens pipeline that:
- Clones any Git repository and groups source files by module boundary
- Summarises each module using Claude, applying the 5 architecture compression rules
- Generates a Mermaid component diagram with 5–12 capability-named components
- Traces the top 3 revenue-generating flows and produces sequence diagrams
- Runs 4 validation gates - static analysis, developer review, runtime comparison, and failure scenarios
- Pauses for human review at Gate 2, persists state, and resumes when feedback is submitted
- Loops back for refinement when gaps are found, then publishes a final verified document
You should be comfortable with LangGraph's core concepts - StateGraph, nodes, edges, and conditional routing. This guide does not re-explain those basics. It goes directly to the decisions specific to codebase analysis: why module chunking matters, how to design state that survives a human interrupt, and how to wire a refinement loop without hitting a chain's architectural limits.
This guide implements the 3-pass architecture methodology - a structured approach to reading an unfamiliar codebase covered in the companion guide "From Unknown Codebase to Architecture Document: A Complete Practitioner's Guide." If terms like "compression rules," "validation gates," or "the 12-section output template" are unfamiliar, reading that guide first will give you the context this one assumes.
The 12-node pipeline at a glance
| # | Node | Responsibility | Phase |
|---|---|---|---|
| 1 | ingest_repo | Clone repo · parse build file · extract dependency graph | Pass 1 |
| 2 | chunk_by_module | Group files by package boundary · never by individual file | Pass 1 |
| 3 | pass1_structure | LLM per module chunk · accumulate module summaries | Pass 1 |
| 4 | pass2_behavior | Trace top 3 flows · map state stores · async boundaries | Pass 2 |
| 5 | pass3_compress | Apply 5 compression rules · generate Mermaid · score anti-patterns | Pass 3 |
| 6 | assemble_draft | Fill 12-section template · mark all claims [unverified] | Assembly |
| 7 | gate1_static | Run jdeps/depcruise · diff import graph vs component diagram | Gate 1 |
| 8 | gate2_human | interrupt() - pause for developer review · collect feedback | ⏸ Human |
| 9 | gate3_runtime | Fetch logs/APM · compare top flows vs sequence diagrams | Gate 3 |
| 10 | gate4_failure | 4 standard failure scenarios · LLM evaluator · gap reporter | Gate 4 |
| 11 | refinement | Update diagrams + debt list · conditional loop back to node 5 | ↻ Loop |
| 12 | publish_doc | Write Markdown · generate ADR stubs · output debt register | Output |
Why LangGraph, not a simple chain
If you've used LangGraph before, you know the answer. But it's worth being precise about which features this pipeline actually requires - because teams that try to build this on a chain hit every one of these walls.
Cycles - the chain killer. Node 11 (refinement) conditionally loops back to node 5 (compress) if major gaps are found. No chain architecture can do this. You discover this requirement on the first real codebase, not in testing.
Human interrupt at node 8. Gate 2 pauses execution and waits hours or days for developer review. LangGraph's interrupt() + checkpointer handles this natively. Without it, you lose all state when the process idles.
State persistence across 12 nodes. Module summaries from node 3 must be readable by node 10. The checkpointer persists all state to SQLite/Postgres automatically - no manual serialisation between steps.
Conditional edges at gates. Each gate can either pass (proceed) or fail (route to refinement). LangGraph's conditional edges handle this with a simple router function - no complex if-else chains in application code.
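The shape of such a router is worth seeing up front (a simplified sketch - the real router functions appear in the graph-wiring section below):

```python
def route_gate(state: dict) -> str:
    """Return a label; LangGraph maps it to the next node via the
    dict passed to add_conditional_edges."""
    if state.get("gate_passed", False):
        return "proceed"
    if state.get("refinement_count", 0) >= 2:
        return "proceed"  # safety valve: cap refinement loops
    return "refine"

# Wiring (hypothetical node names):
# g.add_conditional_edges("gate_node", route_gate,
#     {"proceed": "next_node", "refine": "refinement"})
```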
1 · State design - the most important decision
State is the contract between every node. Get this wrong and you will rewrite it twice. Define it as a TypedDict before touching any node implementation.
The complete state schema
```python
# state.py - define this first, before any node
from typing import TypedDict, Optional

class ModuleChunk(TypedDict):
    module_id: str          # e.g. "com.example.payment"
    layer: str              # "handler" | "service" | "dao" | "domain" | "integration"
    files: list[str]        # absolute paths of every file in this module
    file_contents: dict     # filename → source code string
    import_count: int       # used to sort chunks by complexity

class ModuleSummary(TypedDict):
    module_id: str
    capability: str           # compressed name - "Order Management", not "OrderService"
    responsibility: str       # one sentence
    dependencies: list[str]   # other module_ids this depends on
    patterns: list[str]       # detected design patterns
    anti_patterns: list[str]  # detected violations
    raw_llm_output: str       # keep for Gate 2 display

class FlowTrace(TypedDict):
    flow_name: str          # "place_order", "authenticate_user"
    priority: int           # 1=revenue, 2=frequency, 3=complex, 4=failure-prone, 5=auth
    mermaid_sequence: str   # full Mermaid sequenceDiagram source
    consistency_boundary: str
    async_boundaries: list[str]
    external_calls: list[str]

class DebtItem(TypedDict):
    title: str
    location: str
    risk_score: int         # 1-5
    effort_score: int       # 1-5
    decision: str           # "Fix Now" | "Fix Next Sprint" | "Plan" | "Accept" | "Skip"
    owner: Optional[str]
    target_date: Optional[str]

class PipelineState(TypedDict):
    # ── Input ──
    repo_path: str
    target_stack: str           # "java" | "python" | "nodejs" | "dotnet" | "go" | "ruby"
    log_source: Optional[str]   # APM endpoint or log path for Gate 3
    # ── Pass 1 outputs ──
    build_file_summary: dict
    dependency_graph: dict      # module_id → list of module_ids it imports
    module_chunks: list[ModuleChunk]
    module_summaries: list[ModuleSummary]
    entry_points: list[str]
    architecture_style: str
    # ── Pass 2 outputs ──
    flow_traces: list[FlowTrace]
    state_stores: list[dict]
    async_boundaries: list[str]
    # ── Pass 3 outputs ──
    component_diagram: str      # Mermaid source
    anti_pattern_score: int
    anti_pattern_violations: list[str]
    debt_register: list[DebtItem]
    # ── Draft doc ──
    draft_doc: str              # full Markdown with [unverified] markers
    # ── Gate results ──
    gate1_violations: list[str] # import graph vs diagram mismatches
    gate1_passed: bool
    human_feedback: Optional[str]  # structured feedback from Gate 2
    gate2_approved: bool
    gate3_gaps: list[str]       # flows in logs not in diagrams
    gate3_passed: bool
    gate4_gaps: list[str]       # failure scenarios the doc can't explain
    gate4_passed: bool
    # ── Refinement tracking ──
    refinement_count: int       # prevent infinite loops - max 2
    refinement_notes: list[str] # audit trail of what changed each loop
    # ── Final output ──
    final_doc: str
    adr_stubs: list[str]
```
If you add fields incrementally, Node 8 will fail because it expects fields that Node 3 never wrote. Define the complete schema upfront, use Optional for fields that are populated later, and initialise everything to sensible defaults in the graph entry point.
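A sketch of such an entry-point initialiser - the helper name `initial_state` and the specific defaults are assumptions; adjust them to match your schema:

```python
from typing import Optional

def initial_state(repo_path: str, target_stack: str,
                  log_source: Optional[str] = None) -> dict:
    """Initialise every PipelineState field up front so no later node
    ever hits a missing key. Optional fields default to None/empty."""
    return {
        "repo_path": repo_path,
        "target_stack": target_stack,
        "log_source": log_source,
        # Pass 1 outputs
        "build_file_summary": {}, "dependency_graph": {},
        "module_chunks": [], "module_summaries": [],
        "entry_points": [], "architecture_style": "",
        # Pass 2 outputs
        "flow_traces": [], "state_stores": [], "async_boundaries": [],
        # Pass 3 outputs
        "component_diagram": "", "anti_pattern_score": 0,
        "anti_pattern_violations": [], "debt_register": [],
        # Draft + gate results
        "draft_doc": "",
        "gate1_violations": [], "gate1_passed": False,
        "human_feedback": None, "gate2_approved": False,
        "gate3_gaps": [], "gate3_passed": False,
        "gate4_gaps": [], "gate4_passed": False,
        # Refinement tracking + final output
        "refinement_count": 0, "refinement_notes": [],
        "final_doc": "", "adr_stubs": [],
    }
```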
State flow across nodes
| Node | Reads from state | Writes to state |
|---|---|---|
| ingest_repo | repo_path, target_stack | build_file_summary, dependency_graph, entry_points |
| chunk_by_module | repo_path, dependency_graph, target_stack | module_chunks |
| pass1_structure | module_chunks, build_file_summary | module_summaries, architecture_style |
| pass2_behavior | module_summaries, entry_points | flow_traces, state_stores, async_boundaries |
| pass3_compress | module_summaries, flow_traces | component_diagram, anti_pattern_score, debt_register |
| assemble_draft | Everything above | draft_doc |
| gate1_static | dependency_graph, component_diagram | gate1_violations, gate1_passed |
| gate2_human | draft_doc, gate1_violations | human_feedback, gate2_approved |
| gate3_runtime | flow_traces, log_source | gate3_gaps, gate3_passed |
| gate4_failure | draft_doc, component_diagram | gate4_gaps, gate4_passed |
| refinement | All gate results + human_feedback | component_diagram, draft_doc, refinement_count |
| publish_doc | Everything | final_doc, adr_stubs |
2 · Chunking logic - the quality multiplier
Chunking quality determines everything downstream. A PaymentService.java analysed in isolation looks self-contained. Its dependency on OrderRepository, its position in the layer stack, and the boundary decisions around it are all in the surrounding package. Send the file alone and the LLM hallucinates its dependencies.
Sending individual files is the single biggest quality failure in codebase analysis agents. It produces module summaries with invented dependencies, which produces a wrong component diagram, which produces a useless architecture document.
Implementation
```python
# chunker.py
from pathlib import Path
from collections import defaultdict

from state import ModuleChunk, PipelineState

def java_package(path: Path, repo: Path) -> str:
    # Read "package com.example.payment" from the file header
    try:
        for line in path.read_text(errors="replace").split("\n")[:10]:
            if line.strip().startswith("package "):
                return line.strip()[8:].rstrip(";")
    except OSError:
        pass
    # Fallback: use the directory as the module id
    return str(path.parent.relative_to(repo)).replace("/", ".")

# python_package, node_module, go_package and dotnet_namespace follow the
# same pattern as java_package for their respective stacks
STACK_CONFIG = {
    "java":   {"ext": [".java"],      "module_from": java_package},
    "python": {"ext": [".py"],        "module_from": python_package},
    "nodejs": {"ext": [".ts", ".js"], "module_from": node_module},
    "go":     {"ext": [".go"],        "module_from": go_package},
    "dotnet": {"ext": [".cs"],        "module_from": dotnet_namespace},
}

def chunk_by_module(state: PipelineState) -> dict:
    repo = Path(state["repo_path"])
    cfg = STACK_CONFIG[state["target_stack"]]
    groups: dict[str, list] = defaultdict(list)
    for path in repo.rglob("*"):
        if path.suffix not in cfg["ext"]:
            continue
        if is_test_file(path):
            continue  # skip tests - they add noise
        if is_generated(path):
            continue  # skip generated code
        module_id = cfg["module_from"](path, repo)
        groups[module_id].append(path)

    chunks = []
    for module_id, files in groups.items():
        contents = {}
        for f in files:
            try:
                contents[f.name] = f.read_text(errors="replace")
            except OSError:
                pass
        # Estimate import count for complexity sorting
        import_count = sum(c.count("import") for c in contents.values())
        chunks.append(ModuleChunk(
            module_id=module_id,
            layer=infer_layer(module_id, list(contents.keys())),
            files=[str(f) for f in files],
            file_contents=contents,
            import_count=import_count,
        ))
    # Sort: high import count first - richer context for early LLM calls
    chunks.sort(key=lambda c: c["import_count"], reverse=True)
    return {"module_chunks": chunks}

def infer_layer(module_id: str, filenames: list[str]) -> str:
    m = module_id.lower()
    if any(x in m for x in ["controller", "handler", "router", "api", "web"]):
        return "handler"
    elif any(x in m for x in ["service", "usecase", "business", "facade"]):
        return "service"
    elif any(x in m for x in ["dao", "repo", "repository", "store"]):
        return "dao"
    elif any(x in m for x in ["model", "domain", "entity"]):
        return "domain"
    elif any(x in m for x in ["client", "adapter", "integration"]):
        return "integration"
    return "unknown"
```
If a module chunk exceeds ~8,000 tokens, split it: keep the largest file intact and summarise the smaller ones as one-line stubs. A chunk with one fully detailed file plus brief stubs of its siblings outperforms a truncated chunk every time.
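One way to implement that split - a sketch in which the helper name, the ~8,000-token budget, and the rough 4-characters-per-token estimate are all assumptions:

```python
def split_oversized_chunk(chunk: dict, max_tokens: int = 8000) -> dict:
    """If a chunk's estimated token count exceeds the budget, keep the
    largest file's full source and reduce every other file to a stub."""
    contents = chunk["file_contents"]
    est_tokens = sum(len(c) for c in contents.values()) // 4  # rough: ~4 chars/token
    if est_tokens <= max_tokens or len(contents) < 2:
        return chunk
    largest = max(contents, key=lambda f: len(contents[f]))
    stubbed = {
        f: (src if f == largest
            else f"# stub: {f} ({len(src.splitlines())} lines, full source omitted)")
        for f, src in contents.items()
    }
    return {**chunk, "file_contents": stubbed}
```

Run it over `module_chunks` after `chunk_by_module` and before Pass 1, so the LLM always sees one complete file with its siblings sketched around it.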
If more than 20% of your chunks return "unknown" from infer_layer, the heuristic keyword matching isn't working for your codebase's naming conventions. This happens most often with domain-specific naming (SubmissionProcessor, AuditCoordinator) that doesn't contain the expected keywords. The fix is to add stack-specific directory path overrides to STACK_CONFIG rather than modifying infer_layer directly - this keeps fallback rules auditable and per-stack:
```python
STACK_CONFIG["java"]["layer_dirs"] = {
    "controller": "handler", "api": "handler", "rest": "handler",
    "service": "service", "business": "service",
    "repository": "dao", "persistence": "dao", "jpa": "dao",
    "domain": "domain", "model": "domain", "entity": "domain",
    "client": "integration", "adapter": "integration",
}

def infer_layer(module_id: str, filenames: list[str], stack: str = "java") -> str:
    m = module_id.lower()
    # Primary: keyword match on module_id
    if any(x in m for x in ["controller", "handler", "router", "api", "web"]):
        return "handler"
    elif any(x in m for x in ["service", "usecase", "business", "facade"]):
        return "service"
    elif any(x in m for x in ["dao", "repo", "repository", "store"]):
        return "dao"
    elif any(x in m for x in ["model", "domain", "entity"]):
        return "domain"
    elif any(x in m for x in ["client", "adapter", "integration"]):
        return "integration"
    # Fallback: check directory path segments against stack-specific overrides
    dir_map = STACK_CONFIG.get(stack, {}).get("layer_dirs", {})
    for segment in module_id.split("."):
        if segment in dir_map:
            return dir_map[segment]
    return "unknown"
```
Count unknowns before running Pass 1 and abort if the ratio is too high - a Pass 1 run with 40% unknown layers produces module summaries without layer context, which degrades compression quality significantly.
```python
def validate_chunking(chunks: list[ModuleChunk]) -> None:
    unknown = sum(1 for c in chunks if c["layer"] == "unknown")
    ratio = unknown / len(chunks) if chunks else 0
    if ratio > 0.2:
        raise ValueError(
            f"Layer detection too weak: {unknown}/{len(chunks)} chunks are "
            f"'unknown' ({ratio:.0%}). Add directory overrides to STACK_CONFIG "
            f"before running Pass 1."
        )
```
3 · Node wiring - LangGraph graph construction
Pass 1 nodes - structure discovery
```python
# nodes/pass1.py
from langchain_anthropic import ChatAnthropic

from state import PipelineState, ModuleSummary

llm_sonnet = ChatAnthropic(model="claude-sonnet-4-6", max_tokens=4096)
llm_haiku = ChatAnthropic(model="claude-haiku-4-5-20251001", max_tokens=1024)

def ingest_repo(state: PipelineState) -> dict:
    repo = state["repo_path"]
    stack = state["target_stack"]
    build_summary = parse_build_file(repo, stack)
    dep_graph = run_dependency_analysis(repo, stack)
    entry_points = find_entry_points(repo, stack)
    return {
        "build_file_summary": build_summary,
        "dependency_graph": dep_graph,
        "entry_points": entry_points,
    }

def pass1_structure(state: PipelineState) -> dict:
    summaries = []
    for chunk in state["module_chunks"]:
        prompt = build_module_prompt(chunk, state["build_file_summary"])
        response = llm_sonnet.invoke(prompt)
        summary = parse_module_summary(response.content, chunk["module_id"])
        summaries.append(summary)
    # Infer architecture style using Haiku (cheaper, simpler task)
    style_prompt = build_style_prompt(summaries, state["dependency_graph"])
    style = llm_haiku.invoke(style_prompt).content.strip()
    return {
        "module_summaries": summaries,
        "architecture_style": style,
    }
```
Pass 2 nodes - behavior tracing
```python
# nodes/pass2.py
# Flow priority mirrors the sequence in Guide 1 - revenue flows first
# See: [*From Unknown Codebase to Architecture Document*](/from-unknown-codebase-to-architecture-document-a-complete-practitioners-guide)
FLOW_PRIORITY = [
    ("revenue", 1),    # order placement, payment, account creation
    ("frequency", 2),  # search, list views, dashboard load
    ("complex", 3),    # multi-step workflows, approval chains
    ("failure", 4),    # external integrations, scheduled jobs
    ("auth", 5),       # login, session management
]

def pass2_behavior(state: PipelineState) -> dict:
    summaries = state["module_summaries"]
    entry_points = state["entry_points"]
    # Step 1: identify the top 3 flows to trace
    flows_to_trace = identify_key_flows(summaries, entry_points, llm_haiku)
    # Step 2: trace each flow and generate its sequence diagram
    traces = []
    for flow in flows_to_trace[:3]:  # cap at 3
        prompt = build_flow_trace_prompt(flow, summaries)
        response = llm_sonnet.invoke(prompt)
        traces.append(parse_flow_trace(response.content, flow))
    # Step 3: identify state stores and async boundaries
    state_prompt = build_state_analysis_prompt(summaries)
    state_analysis = parse_state_analysis(llm_haiku.invoke(state_prompt).content)
    return {
        "flow_traces": traces,
        "state_stores": state_analysis["stores"],
        "async_boundaries": state_analysis["async"],
    }
```
Pass 3 + assembly nodes
````python
# nodes/pass3.py
from datetime import date
from pathlib import Path

def pass3_compress(state: PipelineState) -> dict:
    summaries = state["module_summaries"]
    prompt = build_synthesis_prompt(
        summaries,
        state["flow_traces"],
        state["architecture_style"],
    )
    response = llm_sonnet.invoke(prompt)
    parsed = parse_synthesis_output(response.content)
    # Score anti-patterns using Haiku (35-item checklist - see Guide 1)
    score_prompt = build_antipattern_prompt(summaries, state["dependency_graph"])
    score_response = llm_haiku.invoke(score_prompt)
    violations, score = parse_antipattern_score(score_response.content)
    debt = build_debt_register(violations, llm_haiku)
    return {
        "component_diagram": parsed["mermaid"],
        "anti_pattern_score": score,
        "anti_pattern_violations": violations,
        "debt_register": debt,
    }

def assemble_draft(state: PipelineState) -> dict:
    # Fill the 12-section output template from Guide 1.
    # Every claim not yet verified by a gate gets marked [unverified].
    # Sections that only later gates can fill are passed as [unverified]
    # stubs so str.format never hits a missing key.
    doc = TEMPLATE.format(
        repo_name=Path(state["repo_path"]).name,
        date=date.today().isoformat(),
        architecture_style=state["architecture_style"],
        executive_summary="[unverified - drafted after Gate 2]",
        component_diagram=state["component_diagram"],
        layer_boundary_sentences="[unverified - drafted after Gate 2]",
        flow_traces=format_flows(state["flow_traces"]),
        state_stores=format_state_stores(state["state_stores"]),
        domain_concepts="[unverified - drafted after Gate 2]",
        integration_map="[unverified - drafted after Gate 2]",
        anti_pattern_score=state["anti_pattern_score"],
        anti_pattern_violations="\n".join(state["anti_pattern_violations"]),
        debt_register=format_debt(state["debt_register"]),
        security="[unverified - complete after Gate 2]",
        deployment="[unverified - complete after Gate 2]",
        adr_stubs="[generated at publish]",
    )
    return {"draft_doc": doc}

# The TEMPLATE variable - 12-section skeleton.
# Sections marked [unverified] will be updated during the validation gates.
TEMPLATE = """# Architecture Document - {repo_name}

> Generated by the agentic pipeline on {date}. Claims marked [unverified] have
> not yet passed a validation gate. Do not remove these markers until Gates 2-4
> have signed off on the relevant section.

## 01 - Executive overview
Architecture style: {architecture_style}
{executive_summary}

## 02 - Component diagram
```mermaid
{component_diagram}
```

## 03 - Layer boundary rules
{layer_boundary_sentences}

## 04 - Key flows
{flow_traces}

## 05 - State and data model
{state_stores}

## 06 - Domain concepts
{domain_concepts}

## 07 - Integration map
{integration_map}

## 08 - Anti-pattern register
Score: {anti_pattern_score}/35
{anti_pattern_violations}

## 09 - Debt register
{debt_register}

## 10 - Security and auth
{security}

## 11 - Deployment and operations
{deployment}

## 12 - Architecture decisions (ADRs)
{adr_stubs}
"""
````
Graph wiring - connecting all 12 nodes
```python
# graph.py - ArchLens pipeline
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.sqlite import SqliteSaver

def build_graph():
    g = StateGraph(PipelineState)

    # ── Add all 12 nodes ──
    g.add_node("ingest_repo", ingest_repo)
    g.add_node("chunk_by_module", chunk_by_module)
    g.add_node("pass1_structure", pass1_structure)
    g.add_node("pass2_behavior", pass2_behavior)
    g.add_node("pass3_compress", pass3_compress)
    g.add_node("assemble_draft", assemble_draft)
    g.add_node("gate1_static", gate1_static)
    g.add_node("gate2_human", gate2_human)  # pauses via interrupt()
    g.add_node("gate3_runtime", gate3_runtime)
    g.add_node("gate4_failure", gate4_failure)
    g.add_node("refinement", refinement)
    g.add_node("publish_doc", publish_doc)

    # ── Linear edges (Pass 1 → 3) ──
    g.set_entry_point("ingest_repo")
    g.add_edge("ingest_repo", "chunk_by_module")
    g.add_edge("chunk_by_module", "pass1_structure")
    g.add_edge("pass1_structure", "pass2_behavior")
    g.add_edge("pass2_behavior", "pass3_compress")
    g.add_edge("pass3_compress", "assemble_draft")
    g.add_edge("assemble_draft", "gate1_static")

    # ── Gate 1 → Gate 2 (always - violations are input to the G2 review) ──
    g.add_edge("gate1_static", "gate2_human")

    # ── Gate 2 → conditional ──
    g.add_conditional_edges("gate2_human", route_gate2, {
        "proceed": "gate3_runtime",
        "refine": "refinement",  # major disputes - loop back
    })

    # ── Gates 3 & 4 ──
    g.add_edge("gate3_runtime", "gate4_failure")

    # ── Gate 4 → conditional ──
    g.add_conditional_edges("gate4_failure", route_gate4, {
        "publish": "publish_doc",
        "refine": "refinement",
    })

    # ── Refinement → conditional loop ──
    g.add_conditional_edges("refinement", route_refinement, {
        "compress": "pass3_compress",  # major gaps → re-compress
        "draft": "assemble_draft",     # minor gaps → rebuild draft
        "publish": "publish_doc",      # max iterations hit
    })

    g.add_edge("publish_doc", END)

    # ── Compile with checkpointer ──
    # gate2_human calls interrupt() itself, so no interrupt_before list is
    # needed here. Note: in recent langgraph-checkpoint-sqlite releases,
    # from_conn_string is a context manager; the saver is assumed to be held
    # open for the lifetime of the process.
    checkpointer = SqliteSaver.from_conn_string("checkpoints.db")
    return g.compile(checkpointer=checkpointer)

# Router functions
def route_gate2(state: PipelineState) -> str:
    if state.get("gate2_approved"):
        return "proceed"
    if state.get("refinement_count", 0) >= 2:
        return "proceed"  # max loops reached
    return "refine"

def route_gate4(state: PipelineState) -> str:
    gaps = state.get("gate4_gaps", [])
    if not gaps or state.get("refinement_count", 0) >= 2:
        return "publish"
    return "refine"

def route_refinement(state: PipelineState) -> str:
    if state.get("refinement_count", 0) >= 2:
        return "publish"
    # Major gaps (multiple component boundaries disputed) → full re-compress
    if len(state.get("gate4_gaps", [])) > 3:
        return "compress"
    return "draft"
```
Graph topology - conditional edges visualised
The linear description above doesn't show the most important part: the conditional edges that make this a graph, not a chain. Here's what the routing actually looks like:
```mermaid
flowchart TD
    A[ingest_repo] --> B[chunk_by_module]
    B --> C[pass1_structure]
    C --> D[pass2_behavior]
    D --> E[pass3_compress]
    E --> F[assemble_draft]
    F --> G[gate1_static]
    G -->|always proceeds| H[gate2_human]
    H -->|approved| I[gate3_runtime]
    H -->|not approved| J[refinement]
    I --> K[gate4_failure]
    K -->|has gaps| J
    J -->|compress| E
    J -->|draft| F
    J -->|publish| L[publish_doc]
    L --> M([END])
    style H fill:#22C55E,color:#FFFFFF,stroke:#0ea5e9
    style J fill:#EC4899,color:#FFFFFF,stroke:#0ea5e9
    style L fill:#10B981,color:#FFFFFF,stroke:#0ea5e9
    style M fill:#2C2C2A,color:#F1EFE8,stroke:#2C2C2A
    style A fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style B fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style C fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style D fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style E fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style F fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style G fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style I fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
    style K fill:#06B6D4,color:#FFFFFF,stroke:#0ea5e9
```
Three routing decisions determine the shape of any given run:
After Gate 2: If the developer approves → proceed to Gate 3. If not approved and refinement_count < 2 → refinement. After 2 refinement cycles → proceed regardless (to prevent infinite loops).
After Gate 4: If no gaps found, or refinement_count >= 2 → publish. Otherwise → refinement.
After Refinement: If refinement_count >= 2 → force publish with remaining gaps marked [unverified]. Otherwise, if there are more than 3 Gate 4 gaps (major structural gaps) → re-run pass3_compress from scratch; if the gaps are minor → rebuild the draft only.
The max refinement count of 2 is the safety valve. Without it, a codebase with genuinely ambiguous boundaries could loop indefinitely between Gate 4 and compression.
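That termination property is easy to check in isolation - a sketch that reproduces the Gate 4 router's logic from the graph wiring above and asserts the cap forces publication:

```python
def route_gate4(state: dict) -> str:
    # Same logic as the router in graph.py, reproduced for a standalone check
    gaps = state.get("gate4_gaps", [])
    if not gaps or state.get("refinement_count", 0) >= 2:
        return "publish"
    return "refine"

# A run with persistent gaps still refines below the cap...
assert route_gate4({"gate4_gaps": ["gap"], "refinement_count": 0}) == "refine"
assert route_gate4({"gate4_gaps": ["gap"], "refinement_count": 1}) == "refine"
# ...but terminates once the cap is hit, even if gaps remain
assert route_gate4({"gate4_gaps": ["gap"], "refinement_count": 2}) == "publish"
assert route_gate4({"gate4_gaps": []}) == "publish"
```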
4 · Validation gates - implementation
Gate 1 - Static validation
Compare the import graph produced by static analysis against the component diagram produced by the LLM. Any arrow in the diagram with no corresponding import, and any import with no corresponding arrow, is a violation. This is the automated form of the manual Gate 1 check described in From Unknown Codebase to Architecture Document.
```python
# nodes/gates.py
import re

def gate1_static(state: PipelineState) -> dict:
    # 1. Extract component relationships from the Mermaid diagram
    diagram_edges = parse_mermaid_edges(state["component_diagram"])
    # 2. Extract relationships from the actual import graph
    import_edges = extract_import_edges(
        state["dependency_graph"], state["module_summaries"]
    )
    violations = []
    # In the diagram but with no imports to support it
    for edge in diagram_edges:
        if edge not in import_edges:
            violations.append(
                f"Diagram shows {edge[0]} → {edge[1]} but no imports support this"
            )
    # Significant imports with no diagram representation
    for edge in import_edges:
        if edge not in diagram_edges and is_significant(edge, state):
            violations.append(
                f"Import {edge[0]} → {edge[1]} has no corresponding diagram arrow"
            )
    return {
        "gate1_violations": violations,
        "gate1_passed": len(violations) == 0,
    }

def parse_mermaid_edges(mermaid: str) -> set[tuple]:
    # Extract "A --> B" or "A --calls--> B" patterns
    edges = set()
    for line in mermaid.split("\n"):
        m = re.search(r"(\w+)\s*--[>\w\s]*-*>\s*(\w+)", line)
        if m:
            edges.add((m.group(1), m.group(2)))
    return edges

def extract_import_edges(
    dependency_graph: dict, module_summaries: list[ModuleSummary]
) -> set[tuple]:
    """
    Maps module-level imports (com.example.payment → com.example.order) to
    capability-level edges (Payment Processing → Order Management) so they can
    be compared against the Mermaid component diagram.

    This is the hard part of Gate 1: the diagram uses compressed capability
    names but the dependency graph uses raw module IDs. The mapping goes
    through module_summaries, which holds both the module_id and its capability.
    """
    # Build lookup: module_id → capability name
    module_to_capability = {
        s["module_id"]: s["capability"] for s in module_summaries
    }
    capability_edges = set()
    for src_module, dst_modules in dependency_graph.items():
        src_cap = module_to_capability.get(src_module)
        if not src_cap:
            continue  # module has no summary - skip (chunking gap, not a violation)
        for dst_module in dst_modules:
            dst_cap = module_to_capability.get(dst_module)
            if not dst_cap:
                continue  # same - skip unmapped modules
            if src_cap != dst_cap:  # ignore intra-component imports
                capability_edges.add((src_cap, dst_cap))
    return capability_edges

def is_significant(edge: tuple, state: PipelineState) -> bool:
    """
    An import edge is significant - and worth flagging as a diagram violation -
    if it meets any of the following criteria. Note the edge here is
    capability-level (see extract_import_edges), so lookups go by capability.
    """
    src, dst = edge
    # 1. Crosses a layer boundary (handler → dao, service → handler, etc.)
    src_layer = next((s["layer"] for s in state["module_summaries"]
                      if s["capability"] == src), "unknown")
    dst_layer = next((s["layer"] for s in state["module_summaries"]
                      if s["capability"] == dst), "unknown")
    layer_order = {"handler": 0, "service": 1, "dao": 2, "domain": 3,
                   "integration": 4, "unknown": 5}
    crosses_boundary = abs(
        layer_order.get(src_layer, 5) - layer_order.get(dst_layer, 5)
    ) > 1
    # 2. High import frequency - imported by 3+ other modules
    call_count = sum(
        1 for m in state["dependency_graph"].values() if dst in m
    )
    high_frequency = call_count >= 3
    # 3. One of the modules is a known entry point
    involves_entry = src in state["entry_points"] or dst in state["entry_points"]
    return crosses_boundary or high_frequency or involves_entry
```
Gate 2 - Human-in-the-loop interrupt
LLM-generated component diagrams have a ~30% error rate on relationships. This gate directly implements Gate 2 (Developer Validation) from the companion guide - it is the quality checkpoint that makes the output trustworthy. Build the interrupt, then build the UI around it.
```python
# Gate 2 node - uses LangGraph interrupt()
from langgraph.types import interrupt

def gate2_human(state: PipelineState) -> dict:
    # interrupt() pauses execution and surfaces data to the caller.
    # The graph checkpoints here and waits until resumed.
    feedback = interrupt({
        "action": "review_required",
        "draft_doc": state["draft_doc"],
        "component_diagram": state["component_diagram"],
        "gate1_violations": state["gate1_violations"],
        "questions": [
            "Does this component diagram match how you'd explain the system to a new hire?",
            "Are there components I've merged that you'd keep separate?",
            "Is there significant behavior that doesn't appear in these diagrams?",
            "Does the consistency boundary match what actually commits and rolls back?",
            "Which parts would you argue with?",
        ],
    })
    return {
        "human_feedback": feedback.get("comments", ""),
        "gate2_approved": feedback.get("approved", False),
    }
```
Running and resuming the pipeline:
```python
# Caller - how to run and resume the graph
from langgraph.types import Command

graph = build_graph()
thread_id = "run-001"
config = {"configurable": {"thread_id": thread_id}}

# Start the pipeline - it will pause at gate2_human
for event in graph.stream(initial_state(), config=config):
    print(event)
# ↑ pipeline pauses here; the interrupt data is in the event stream

# --- hours or days later ---
# Human has reviewed - resume with their feedback
for event in graph.stream(
    Command(resume={
        "approved": True,
        "comments": "PaymentProcessing should be separate from OrderManagement. Split those.",
    }),
    config=config,
):
    print(event)
# ↑ pipeline resumes from gate2_human with the human's feedback in state
```
Gate 2 UI - minimal Streamlit implementation:
```python
# review_ui.py - minimal Streamlit UI for Gate 2
import streamlit as st
from langgraph.types import Command

st.set_page_config(page_title="Architecture Review", layout="wide")

thread_id = st.query_params.get("thread", "run-001")
config = {"configurable": {"thread_id": thread_id}}
graph = build_graph()

# Load the current state from the checkpoint
state = graph.get_state(config)
data = state.tasks[0].interrupts[0].value if state.tasks else {}

col1, col2 = st.columns([1, 1])
with col1:
    st.subheader("Component Diagram")
    st.code(data.get("component_diagram", ""), language="text")
    if data.get("gate1_violations"):
        st.warning("Gate 1 violations:\n" + "\n".join(data["gate1_violations"]))
with col2:
    st.subheader("Your review")
    for q in data.get("questions", []):
        st.markdown(f"• {q}")
    comments = st.text_area("Comments (specific component boundary changes):", height=200)
    approved = st.checkbox("I approve this architecture document to proceed")
    if st.button("Submit review"):
        for _ in graph.stream(
            Command(resume={"approved": approved, "comments": comments}),
            config=config,
        ):
            pass
        st.success("Pipeline resumed. Proceeding to Gates 3 and 4.")
```
Gates 3 & 4
```python
def gate3_runtime(state: PipelineState) -> dict:
    if not state.get("log_source"):
        # No log source configured - mark as skipped, not failed
        return {"gate3_gaps": [], "gate3_passed": True}
    # 1. Fetch the top endpoints from logs
    top_endpoints = fetch_top_endpoints(state["log_source"], limit=10)
    # 2. Extract documented flow entry points from the sequence diagrams
    documented = {
        t.get("entry_point") for t in state["flow_traces"] if t.get("entry_point")
    }
    # 3. Find high-frequency paths not in any documented flow
    gaps = [
        f"High-frequency path {ep['path']} ({ep['count']} calls) not in any sequence diagram"
        for ep in top_endpoints
        if not any(ep["path"] in d for d in documented)
    ]
    return {"gate3_gaps": gaps, "gate3_passed": len(gaps) == 0}

def gate4_failure(state: PipelineState) -> dict:
    scenarios = [
        "Primary database becomes unavailable for 5 minutes",
        "External payment gateway returns 503 for all requests",
        "Message broker / queue becomes unreachable",
        "Application server restarts mid-request",
    ]
    gaps = []
    for scenario in scenarios:
        prompt = build_failure_eval_prompt(
            scenario, state["draft_doc"], state["component_diagram"]
        )
        response = llm_haiku.invoke(prompt)
        result = parse_failure_result(response.content)
        if not result["doc_explains_behavior"]:
            gaps.append(f"Scenario '{scenario}': {result['gap']}")
    return {"gate4_gaps": gaps, "gate4_passed": len(gaps) == 0}
```
Refinement loop
```python
def refinement(state: PipelineState) -> dict:
    count = state.get("refinement_count", 0) + 1
    notes = list(state.get("refinement_notes", []))
    updates = {}

    # Apply human feedback if present
    if state.get("human_feedback"):
        feedback_prompt = build_feedback_application_prompt(
            state["human_feedback"],
            state["component_diagram"],
            state["module_summaries"],
        )
        updated_diagram = llm_sonnet.invoke(feedback_prompt).content
        updates["component_diagram"] = updated_diagram
        notes.append(f"Refinement {count}: applied human feedback")

    # Apply Gate 3 gaps - annotate draft with unverified markers
    if state.get("gate3_gaps"):
        draft = state["draft_doc"]
        for gap in state["gate3_gaps"]:
            draft += f"\n\n> **[unverified]** {gap}"
        updates["draft_doc"] = draft

    # Apply Gate 4 gaps - add to debt register as high-risk items
    if state.get("gate4_gaps"):
        new_debt = [
            DebtItem(
                title=g,
                location="architecture doc",
                risk_score=4,
                effort_score=2,
                decision="Fix Next Sprint",
                owner=None,
                target_date=None,
            )
            for g in state["gate4_gaps"]
        ]
        updates["debt_register"] = state["debt_register"] + new_debt

    return {**updates, "refinement_count": count, "refinement_notes": notes}
```

Handling contradictory human feedback
The most common reviewer instruction is "split X and Y into separate components." This is almost always correct from an architectural standpoint - but the dependency graph sometimes doesn't support it yet because the code hasn't been refactored to respect that boundary. The refinement node applies feedback via an LLM call, so it will produce a diagram with the split - but Gate 1 will immediately flag a violation when that new diagram boundary has no corresponding import separation.
The right response is not to ignore the split. It's to document the violation explicitly:
```python
def build_feedback_application_prompt(feedback, diagram, summaries) -> str:
    return f"""Apply this reviewer feedback to the component diagram:

FEEDBACK: {feedback}

CURRENT DIAGRAM:
{diagram}

IMPORTANT: If the requested boundary change is not supported by the current
import graph, still apply the change BUT add a comment to the affected
components noting: "[debt] boundary exists in architecture intent but not
yet in code - import separation required."

This makes the aspirational architecture explicit rather than silently
keeping a wrong diagram to avoid a Gate 1 violation."""
```

This produces a diagram that reflects the intended architecture - which is more useful than one that only reflects the current state - while making the gap explicit as a debt item. Gate 1 will still flag it, but the violation is now documented as an intentional architectural decision rather than a detection error.
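If you want Gate 1 reports to reflect this distinction, you can separate violations that touch a `[debt]`-marked boundary from genuinely unexpected ones. A minimal sketch follows; the string conventions (violations naming components as capitalised tokens, `[debt]` comments on diagram lines) are illustrative assumptions, not fixed ArchLens APIs:

```python
# Sketch: split Gate 1 violations into "acknowledged" (the diagram carries a
# [debt] comment mentioning an affected component) vs "unexpected". Assumes
# violation strings name components as capitalised tokens - an illustrative
# convention, adapt to your own violation format.
def classify_gate1_violations(violations: list[str], diagram: str) -> dict:
    debt_lines = [ln for ln in diagram.splitlines() if "[debt]" in ln]
    acknowledged, unexpected = [], []
    for v in violations:
        # Capitalised tokens in the violation text are treated as component names
        components = [tok for tok in v.replace("->", " ").split() if tok[:1].isupper()]
        if any(c in ln for ln in debt_lines for c in components):
            acknowledged.append(v)
        else:
            unexpected.append(v)
    return {"acknowledged": acknowledged, "unexpected": unexpected}
```

Acknowledged violations can then be reported as intentional debt rather than failures, without weakening the gate itself.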
5 · Production concerns
Cost & latency model
A typical medium codebase (~40 modules, 3 flows) costs approximately:
| Stage | Model | Approx tokens | Approx cost | Approx time |
|---|---|---|---|---|
| Pass 1 (40 modules) | Sonnet | ~160k in / 40k out | ~$0.60 | 3–5 min |
| Pass 2 (3 flows) | Sonnet | ~30k in / 15k out | ~$0.12 | 1–2 min |
| Pass 3 (compress) | Sonnet | ~20k in / 8k out | ~$0.08 | 30–60s |
| Anti-pattern scoring | Haiku | ~15k in / 3k out | ~$0.01 | 15–30s |
| Gate 4 (4 scenarios) | Haiku | ~20k in / 4k out | ~$0.01 | 30–60s |
| Total | - | ~245k in / 70k out | ~$0.82 | ~7–9 min |
Anti-pattern scoring, architecture style identification, and failure scenario evaluation are classification tasks. Haiku handles them accurately at 1/20th the cost of Sonnet. Reserve Sonnet for module summarisation, flow tracing, and synthesis - tasks where reasoning depth matters.
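That routing rule can be made explicit in code so a misrouted task fails loudly instead of silently burning Sonnet tokens. A minimal sketch, where the task names and the returned model labels are illustrative assumptions rather than a fixed ArchLens API:

```python
# Sketch of the model-routing rule above: reasoning-heavy tasks go to Sonnet,
# classification-style tasks to Haiku. Task names are illustrative; map the
# returned label to whatever model identifier your account uses.
SONNET_TASKS = {"module_summary", "flow_trace", "synthesis"}
HAIKU_TASKS = {"anti_pattern_scoring", "style_identification", "failure_eval"}

def model_for(task: str) -> str:
    if task in SONNET_TASKS:
        return "sonnet"  # reasoning depth matters
    if task in HAIKU_TASKS:
        return "haiku"   # classification at roughly 1/20th the cost
    raise ValueError(f"Unknown task: {task}")  # fail loudly, don't default to Sonnet
```

Raising on unknown tasks is deliberate: a new node added without a routing decision should not quietly default to the expensive model.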
Parallel module summarisation
Pass 1 processes 40 modules sequentially by default. Parallelise it with asyncio and rate limiting:
```python
import asyncio
from asyncio import Semaphore


async def pass1_structure_parallel(state: PipelineState) -> dict:
    sem = Semaphore(5)  # max 5 concurrent Sonnet calls

    async def summarise_chunk(chunk):
        async with sem:
            prompt = build_module_prompt(chunk, state["build_file_summary"])
            response = await llm_sonnet.ainvoke(prompt)
            return parse_module_summary(response.content, chunk["module_id"])

    summaries = await asyncio.gather(*[
        summarise_chunk(chunk) for chunk in state["module_chunks"]
    ])
    return {"module_summaries": list(summaries)}

# Reduces Pass 1 from ~5 minutes to ~90 seconds on 40 modules
```

Error recovery
Three failure modes to handle explicitly:
```python
import json

from tenacity import retry, stop_after_attempt, wait_exponential


# 1. LLM call failure - retry with exponential backoff
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=2, max=30))
def llm_call_with_retry(prompt):
    return llm_sonnet.invoke(prompt)


# 2. Malformed LLM output - validate before writing to state
def parse_module_summary(content: str, module_id: str) -> ModuleSummary:
    try:
        parsed = extract_json_from_llm_output(content)
        assert "capability" in parsed
        assert "responsibility" in parsed
        return ModuleSummary(**parsed, module_id=module_id, raw_llm_output=content)
    except (json.JSONDecodeError, AssertionError, KeyError):
        # Fallback: use a degraded summary rather than crashing the pipeline
        return ModuleSummary(
            module_id=module_id,
            capability=module_id.split(".")[-1],  # use package name as fallback
            responsibility="[parse error - review manually]",
            dependencies=[],
            patterns=[],
            anti_patterns=[],
            raw_llm_output=content,
        )


# 3. Checkpoint recovery - resume after crash
def resume_from_checkpoint(thread_id: str):
    config = {"configurable": {"thread_id": thread_id}}
    state = graph.get_state(config)
    if state.next:
        print(f"Resuming from: {state.next}")
        for event in graph.stream(None, config=config):  # None = resume from checkpoint
            print(event)
    else:
        print("Pipeline complete or not started")
```

Observability
Enable LangSmith tracing by setting three environment variables, then instrument every LLM call with node-level metadata so you can filter traces by pass and module:
```
# .env
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-key
LANGCHAIN_PROJECT=arch-pipeline
```

```python
# Per-node metadata - attach to every LLM call
def pass1_structure(state: PipelineState, config: dict) -> dict:
    for i, chunk in enumerate(state["module_chunks"]):
        response = llm_sonnet.invoke(
            build_module_prompt(chunk, state["build_file_summary"]),
            config={"metadata": {
                "node": "pass1_structure",
                "module_id": chunk["module_id"],
                "chunk_index": i,
                "total_chunks": len(state["module_chunks"]),
            }},
        )
        # ... parse the response and collect summaries as in the sequential version
```

What a healthy run looks like in LangSmith:
- `pass1_structure` traces are uniform in latency (~3–5s per module). Outliers (>10s) usually mean a chunk is oversized - check `import_count` and split it.
- `pass3_compress` produces one trace with a single large input (all module summaries concatenated) and a short JSON output. If the output is longer than ~2,000 tokens, the compression prompt isn't enforcing the 5–12 component limit - add that constraint explicitly.
- `gate2_human` shows an `interrupt` event followed by a long gap (hours or days) before the `Command(resume=...)` event. This is normal. If the gap is zero seconds, the interrupt isn't working and you're running Gate 3 without human review.
- `gate4_failure` produces four short traces (one per scenario). If all four return `doc_explains_behavior: true` on the first run, something is wrong - the prompt is too lenient. Tighten the evaluation criteria.
What a bad chunking run looks like:
- `pass1_structure` traces show high variance in output length - some modules produce 50-word summaries, others produce 500-word summaries. This usually means `infer_layer` is returning `"unknown"` for large chunks, and the LLM is trying to summarise everything without a layer hint.
- Module summaries list dependencies that don't appear in `dependency_graph` - the classic hallucination signal. Cross-check any `dependencies` field in a `ModuleSummary` against the actual `dependency_graph` keys. Invented dependencies cascade into wrong component diagrams.
- `pass3_compress` output has more than 12 components despite the constraint. The synthesis prompt isn't strong enough, or the module summaries themselves are too granular. Loop back and check whether `capability` fields in summaries are class names rather than capability names.
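The dependency cross-check is mechanical enough to automate. A minimal sketch, assuming summaries are dicts with `module_id` and `dependencies` fields and that `dependency_graph` is keyed by module ID (as elsewhere in this guide):

```python
# Sketch: flag any dependency a module summary claims that the static
# dependency_graph never recorded - the hallucination signal described above.
def find_hallucinated_deps(summaries: list[dict], dependency_graph: dict) -> dict:
    known = set(dependency_graph.keys())
    flagged = {}
    for s in summaries:
        invented = [d for d in s.get("dependencies", []) if d not in known]
        if invented:
            flagged[s["module_id"]] = invented
    return flagged
```

Running this right after Pass 1 turns a silent diagram error into a visible parse-time warning.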
Prompt templates - complete reference
Module summary prompt
```python
def build_module_prompt(chunk: ModuleChunk, build_summary: dict) -> str:
    files_text = "\n\n".join([
        f"=== {name} ===\n{content[:3000]}"  # cap per file
        for name, content in chunk["file_contents"].items()
    ])
    return f"""You are analyzing a module in a {build_summary.get('framework', 'software')} codebase.

BUILD CONTEXT:
{json.dumps(build_summary, indent=2)}

MODULE: {chunk['module_id']}
INFERRED LAYER: {chunk['layer']}

SOURCE FILES:
{files_text}

Analyze this module. Respond ONLY with valid JSON matching this schema exactly:
{{
  "capability": "string - one capability name (not a class name, e.g. 'Order Management' not 'OrderService')",
  "responsibility": "string - one sentence describing what this module enables",
  "dependencies": ["list of other module IDs this imports from"],
  "patterns": ["list of design patterns identified, e.g. Repository, Service Facade"],
  "anti_patterns": ["list of violations, e.g. 'business logic in handler', 'N+1 query risk'"],
  "layer_violations": ["list of imports that cross layer boundaries incorrectly"]
}}

Rules:
- capability MUST be a business capability name, never a class name
- responsibility MUST be one sentence, max 20 words
- Only list dependencies that are internal modules (not third-party libraries)
- Only list anti_patterns you can point to specific lines of evidence for"""
```

Synthesis prompt (Pass 3)
```python
def build_synthesis_prompt(summaries, flows, style) -> str:
    summary_text = "\n".join([
        f"- {s['module_id']}: {s['capability']} - {s['responsibility']}"
        for s in summaries
    ])
    return f"""You are compressing a codebase analysis into an architecture document.

ARCHITECTURE STYLE: {style}

MODULE SUMMARIES:
{summary_text}

Apply ALL FIVE compression rules strictly:
1. Collapse classes/modules into capabilities (5-12 components MAX, never more)
2. Collapse endpoints into use cases (not HTTP verbs)
3. Collapse tables into domain concepts (aggregate roots)
4. Collapse integrations into roles (not product names)
5. Express each layer boundary in one sentence

Respond ONLY with valid JSON:
{{
  "components": [
    {{
      "name": "capability name",
      "responsibility": "one sentence",
      "dependencies": ["other component names"],
      "source_modules": ["module_ids compressed into this component"]
    }}
  ],
  "mermaid": "complete C4Context or graph TD Mermaid diagram source",
  "layer_boundary_sentences": [
    "The handler layer delegates all business decisions to the service layer.",
    "The service layer owns consistency boundaries and orchestrates repositories.",
    "Repositories are the only components that touch the data store."
  ],
  "domain_concepts": [
    {{"concept": "Order", "tables": ["orders", "order_lines"], "aggregate_root": "Order"}}
  ]
}}

Hard constraints:
- components array MUST have 5-12 items. If you have more, merge.
- Every component name must be a business capability, never a class name.
- mermaid must be syntactically valid Mermaid."""
```

Failure evaluation prompt (Gate 4)
```python
def build_failure_eval_prompt(scenario, draft_doc, diagram) -> str:
    return f"""You are evaluating whether an architecture document adequately covers a failure scenario.

FAILURE SCENARIO: {scenario}

ARCHITECTURE DOCUMENT (excerpt):
{draft_doc[:4000]}

COMPONENT DIAGRAM:
{diagram}

Answer these questions based ONLY on what is explicitly documented:
1. Which components fail immediately when this scenario occurs?
2. Which components can degrade gracefully?
3. Does the document show a circuit breaker or fallback? (yes/no)
4. Does the document explain the user-visible impact? (yes/no)
5. Is there anything about this scenario the document cannot explain?

Respond ONLY with valid JSON:
{{
  "doc_explains_behavior": true or false,
  "immediate_failures": ["component names"],
  "graceful_degradation": ["component names"],
  "has_fallback_documented": true or false,
  "user_impact_documented": true or false,
  "gap": "one sentence describing what the doc fails to explain, or null if no gap"
}}"""
```

Node 12 - publish_doc implementation
Every other node has been shown in full. Node 12 is the simplest, but omitting it would leave the guide incomplete.
```python
# nodes/publish.py
from pathlib import Path
from datetime import date


def publish_doc(state: PipelineState) -> dict:
    """
    Final node - write the validated document to disk and generate ADR stubs.

    At this point all gate results are in state and [unverified] markers
    have been applied by the refinement node to any unresolved gaps.
    """
    repo_name = Path(state["repo_path"]).name
    today = date.today().isoformat()

    # 1. Finalise the document - replace [unverified] with [gate-X-gap]
    #    so the reader knows which gate each open item came from
    final = state["draft_doc"]
    for gap in state.get("gate3_gaps", []):
        final = final.replace("[unverified]", "[gate-3-gap]", 1)
    for gap in state.get("gate4_gaps", []):
        final = final.replace("[unverified]", "[gate-4-gap]", 1)

    # 2. Generate ADR stubs for the top 3 debt items by risk score
    top_debt = sorted(
        state.get("debt_register", []),
        key=lambda d: d["risk_score"],
        reverse=True,
    )[:3]
    adr_stubs = []
    for i, item in enumerate(top_debt, start=1):
        stub = f"""## ADR-{i:03d} - {item['title']}

**Status:** Proposed
**Date:** {today}
**Context:** {item['location']}
**Decision:** [pending - risk score {item['risk_score']}/5, effort {item['effort_score']}/5]
**Consequences:** [to be completed by architecture owner]"""
        adr_stubs.append(stub)

    # 3. Write to output directory
    out_dir = Path(state["repo_path"]).parent / "architecture-docs"
    out_dir.mkdir(exist_ok=True)

    doc_path = out_dir / f"{repo_name}-architecture-{today}.md"
    doc_path.write_text(final)

    adr_path = out_dir / f"{repo_name}-adrs-{today}.md"
    adr_path.write_text("\n\n---\n\n".join(adr_stubs))

    print(f"✓ Architecture doc written to {doc_path}")
    print(f"✓ ADR stubs written to {adr_path}")

    return {"final_doc": final, "adr_stubs": adr_stubs}
```

The two output files land in an `architecture-docs/` directory next to the repo. The ADR stubs are intentionally skeletal - they capture the decision that needs to be made and the context, but leave the actual decision and consequences for the team to fill in.
This is intentional: a stub that names the decision is worth more than a completed ADR that was written by an LLM without human context.
ArchLens in action - end-to-end demo
The best way to understand what ArchLens produces is to see a real run. The following is representative output from the pipeline run against a mid-size Spring Boot e-commerce application (~120 Java files, 14 packages). Your codebase will produce different component names and debt items - the structure and compression ratio will be consistent, but the specifics are always codebase-specific.
Before - what the codebase looks like
```
Repository: spring-ecommerce
Language:   Java 17 · Spring Boot 3.1
Files:      120 source files
Packages:   14 (com.example.order, .payment, .inventory, .user, .notification,
            .report, .admin, .config, .security, .util, .dto, .exception,
            .scheduler, .integration)
```

No architecture document exists. A new engineer joining this team would spend 2–3 weeks reading code to build the mental model that ArchLens produces in 8 minutes.
After - what ArchLens produces
Component diagram (Pass 3 output, 8 components from 120 files):
```mermaid
C4Context
    title ArchLens output - spring-ecommerce

    Boundary(entry, "Entry / Interface") {
        Component(api, "API Gateway", "Spring MVC", "Routes all inbound HTTP · auth middleware · rate limiting")
    }
    Boundary(core, "Business Core") {
        Component(order, "Order Management", "Service", "Place · modify · cancel · track orders")
        Component(payment, "Payment Processing", "Service", "Charge · refund · gateway abstraction")
        Component(inventory, "Inventory Control", "Service", "Reserve · release · restock")
        Component(user, "User Management", "Service", "Registration · auth · profile · notifications")
    }
    Boundary(support, "Supporting Capabilities") {
        Component(report, "Reporting", "Service", "Sales · inventory · audit reports")
        Component(notify, "Notification Service", "Async", "Email · SMS · push via adapter")
    }
    Boundary(data, "Data & Integration") {
        Component(store, "Data Store", "PostgreSQL", "Primary persistence - orders · users · inventory")
    }

    Rel(api, order, "delegates")
    Rel(api, user, "delegates")
    Rel(order, payment, "calls")
    Rel(order, inventory, "reserves")
    Rel(order, notify, "emits event")
    Rel(payment, store, "reads/writes")
    Rel(order, store, "reads/writes")
    Rel(inventory, store, "reads/writes")
```
Key flow - place order (Pass 2 output):
```mermaid
sequenceDiagram
    participant U as User
    participant A as API Gateway
    participant O as Order Management
    participant I as Inventory Control
    participant P as Payment Processing
    participant N as Notification Service
    participant DB as Data Store

    U->>A: POST /api/orders
    A->>O: placeOrder() [consistency boundary begins]
    O->>I: reserve(items)
    I->>DB: UPDATE stock_levels
    I-->>O: reserved
    O->>P: charge(amount) [external - outside boundary]
    P-->>O: charged
    O->>DB: INSERT order + order_lines
    O-->>A: [consistency boundary commits]
    O-)N: order.placed [async, fire-and-forget]
    A-->>U: 201 Created
```
Debt register (Pass 3 anti-pattern scoring, score: 9/35 - all 12 items shown):
| # | Item | Risk | Effort | Decision |
|---|---|---|---|---|
| 1 | ReportService contains business logic (340 lines) | 4 | 2 | Fix Next Sprint |
| 2 | OrderController calls InventoryRepository directly (layer skip) | 4 | 1 | Fix Now |
| 3 | No circuit breaker on PaymentGatewayClient | 4 | 3 | Plan & Track |
| 4 | UserService handles auth, profile, AND notification dispatch | 3 | 3 | Plan & Track |
| 5 | 3 unindexed foreign key columns in order_lines | 3 | 1 | Fix When Passing |
| 6 | SchedulerService has no idempotency guard on job execution | 3 | 2 | Fix Next Sprint |
| 7 | AdminController bypasses auth middleware for 2 endpoints | 4 | 1 | Fix Now |
| 8 | NotificationService retries with no dead-letter queue | 3 | 2 | Fix Next Sprint |
| 9 | OrderRepository runs N+1 query on findWithLines() | 3 | 1 | Fix When Passing |
| 10 | IntegrationLayer has no contract/spec for 3 external calls | 2 | 2 | Backlog With Date |
| 11 | ConfigService loads secrets from env at startup - no rotation | 2 | 3 | Accept & Document |
| 12 | DtoMapper contains domain validation logic | 2 | 1 | Fix When Passing |
The compression in numbers
| Metric | Before | After |
|---|---|---|
| Source files | 120 | - |
| Packages / modules | 14 | - |
| Components in diagram | - | 8 |
| Key flows documented | - | 3 |
| Anti-pattern score | - | 9/35 (moderate) |
| Debt items surfaced | - | 12 |
| Gate 1 violations | - | 3 (all layer skips) |
| Time to produce | 2–3 weeks (manual) | 8 minutes |
The compression from 14 packages to 8 capability-named components is exactly what the 5 compression rules are designed to produce. OrderController, OrderService, and OrderRepository become a single Order Management component. PaymentClient and PaymentGatewayAdapter become Payment Processing. What was a taxonomy of classes becomes a map of what the system actually does.
Where to start - the 3-day plan
```
pip install "langgraph>=0.2.0" \
    langchain-anthropic \
    langgraph-checkpoint-sqlite \
    tenacity \
    gitpython \
    streamlit
```

Key versions to pin: `langgraph>=0.2.0` for `interrupt()` support and `SqliteSaver`; `langchain-anthropic>=0.3.0` for claude-sonnet-4-6 and claude-haiku-4-5-20251001. The `langgraph-checkpoint-sqlite` package is the most commonly missed - without it `SqliteSaver.from_conn_string()` will raise an `ImportError` at graph compilation.
Get module chunking working reliably on one real codebase and validate the module summaries before touching compression, diagram generation, or any validation gate. Everything downstream depends on chunking quality.
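The first cut of chunking can be as simple as grouping source files by package directory, so each chunk maps to one candidate module. A minimal sketch under that assumption - a real chunker would also split oversized packages and infer layers, and the `depth` parameter is an illustrative knob, not a fixed ArchLens setting:

```python
# Sketch: group source files by their package directory (first `depth` path
# segments) to form module chunks. Grouping only - no size limits, no layer
# inference - so you can validate module boundaries before anything else.
from collections import defaultdict
from pathlib import PurePosixPath

def chunk_by_package(file_paths: list[str], depth: int = 3) -> dict[str, list[str]]:
    chunks: dict[str, list[str]] = defaultdict(list)
    for p in file_paths:
        parts = PurePosixPath(p).parent.parts[:depth]
        module_id = ".".join(parts) if parts else "(root)"
        chunks[module_id].append(p)
    return dict(chunks)
```

Run this on a real repo and eyeball the resulting module IDs before spending a single LLM token: if the grouping looks wrong here, every downstream summary will be wrong too.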
| Day | Goal | Success criterion |
|---|---|---|
| Day 1 | Build nodes 1–3. Get module summaries for a real codebase. | A developer who knows the codebase reads 5 random module summaries and says they are accurate. |
| Day 2 | Build nodes 4–6. Generate a draft component diagram and sequence diagrams. | The component diagram has 5–12 components named as capabilities (not classes). At least one sequence diagram traces a real flow end to end. |
| Day 3 | Build node 8 (human interrupt) and the review UI. Wire the LangGraph graph. | A developer can pause the pipeline at gate2_human, review the draft, submit feedback, and the pipeline resumes with their comments in state. |
| Week 2 | Build gates 1, 3, 4 and the refinement node. Add error recovery. | The pipeline completes end to end on a medium codebase (~30 modules) and produces a validated, published architecture document. |
The most common failure is building the visual output first and discovering the chunking problem six weeks later when the diagrams consistently contain invented components. Module summaries first. Diagrams second.
Related reading
This guide is part two of a two-part series:
- Part 1 - From Unknown Codebase to Architecture Document: The full 3-pass methodology, compression rules, anti-pattern checklist, validation gates, debt register, and 12-section output template. Read this first.
- Part 2 - This guide: Automating Part 1 as a 12-node LangGraph pipeline with human-in-the-loop review and production error recovery.
Minimal runnable entry point
```python
# main.py - run ArchLens on a local repo
from graph import build_graph
from state import PipelineState

graph = build_graph()

initial = PipelineState(
    repo_path="/path/to/your/repo",
    target_stack="java",   # or python / nodejs / go / dotnet
    log_source=None,       # set to APM endpoint for Gate 3
    build_file_summary={},
    dependency_graph={},
    module_chunks=[],
    module_summaries=[],
    entry_points=[],
    architecture_style="",
    flow_traces=[],
    state_stores=[],
    async_boundaries=[],
    component_diagram="",
    anti_pattern_score=0,
    anti_pattern_violations=[],
    debt_register=[],
    draft_doc="",
    gate1_violations=[],
    gate1_passed=False,
    human_feedback=None,
    gate2_approved=False,
    gate3_gaps=[],
    gate3_passed=False,
    gate4_gaps=[],
    gate4_passed=False,
    refinement_count=0,
    refinement_notes=[],
    final_doc="",
    adr_stubs=[],
)

config = {"configurable": {"thread_id": "run-001"}}

# "updates" yields one {node_name: state_update} event per step, which is
# what lets us print the node name below
for event in graph.stream(initial, config=config, stream_mode="updates"):
    node = list(event.keys())[0]
    print(f"✓ Completed: {node}")
    if node == "gate2_human":
        print("⏸ Paused for human review. Run review UI or resume programmatically.")
        break
```

What you actually get - and where it still fails
After building this pipeline on a real codebase, the output is genuinely useful in ways that manual documentation isn't.
The document reflects what the code actually does, not what the team remembers it was supposed to do. Gate 1 catches the disconnect between the architecture someone drew on a whiteboard six months ago and the import graph that exists today. Gate 3 catches the flows that are real but undocumented - the three endpoints that account for 60% of traffic but never made it into the sequence diagrams. Gate 4 catches the failure modes the team hadn't thought to document. These are the gaps that only surface when you systematically compare your documented architecture against evidence.
The human interrupt at Gate 2 is where the real value compounds. The developer review isn't just a quality check - it's a forcing function. When someone reads five targeted questions about a diagram the pipeline produced, they engage with the architecture in a way that a blank-page documentation request never produces. The pipeline creates the artefact; the human makes it accurate.
Where it still fails:
The pipeline struggles with three codebase types. First, highly dynamic codebases where runtime behaviour differs significantly from static structure - plugin architectures, script-injection patterns, or heavy reflection. Gate 1 can only check what's in the import graph; it can't see what gets loaded at runtime. Second, monorepos with more than ~80 modules where the synthesis step (Pass 3) receives too much input and produces over-merged components. The 5–12 component ceiling becomes an approximation rather than a compression. Third, polyglot codebases - a Java backend with a Python ML service and a Go gateway - where the chunker runs separately per stack and the synthesis step has to reconcile summaries with no shared module boundary language.
None of these are reasons not to build it. They're reasons to know what the output is and isn't before you publish the document it produces. The pipeline gets you 80% of a good architecture document in 8 minutes. The remaining 20% still requires a human who knows the system - and that's exactly why Gate 2 exists.
What next
If you've read this far and are ready to build:
- Start with the companion guide - From Unknown Codebase to Architecture Document covers the 3-pass methodology, all 5 compression rules, the validation gates, and the 12-section output template that ArchLens generates. It's the conceptual foundation this guide implements.
- Run nodes 1–3 first - get module chunking working on one real codebase before touching anything else. The quality of every downstream output is set at node 2.
- Share your output - if you run ArchLens and get a component diagram you're proud of (or one that surprised you), share it. The most useful thing for this methodology's development is real-world runs on real codebases.
If you have questions on a specific implementation decision, or hit a chunking problem that the fallback strategies in this guide don't cover, the comments are open.