The Production Incident Nobody Talks About
Six weeks after a RAG-based internal knowledge assistant went live, the support tickets started coming in. "The answers are wrong." "It gave me outdated policy information." "It confidently cited a document that doesn't exist."
The engineering team dug in. They had RAGAS scores from their evaluation suite. They had a Langfuse dashboard showing traces. They had retrieval logs in Postgres.
What they did not have was the answer to the question that actually mattered: which specific chunks, retrieved from which queries, under which conditions, were causing the model to hallucinate? And was the failure happening at retrieval — the wrong chunks were coming back — or at generation — the right chunks were being retrieved but ignored?
They could not answer this. Their logging had captured the LLM's inputs and outputs. It had not captured what the retriever actually did, in enough structural detail, at enough volume, to let them run the queries they needed to run.
This is the RAG quality gap. It is not a tooling gap. Most teams have traces. The gap is an analytics architecture gap. You cannot debug a system you cannot query against. And you cannot query against event data that is sitting in a row-oriented Postgres table at ten million rows and growing.
This article is about closing that gap. Not in theory — in production, with real schema designs, real ClickHouse queries, and real operational patterns that surface the failure modes before users do.
Why RAG Quality Is Harder to Measure Than Benchmark Scores Suggest
Before you can measure anything, you need to decompose the problem. "RAG quality" is not a single metric. It is at least four distinct failure modes that require separate instrumentation to distinguish.
Retrieval Failure
The correct chunk does not appear in the top-k results. The embedding space does not position the query vector close enough to the relevant document vector. This is a pure retrieval problem. It can be caused by a weak embedding model, poor chunking strategy, index staleness, or a query type that the embedding model handles poorly (exact lookups, numerical ranges, proper nouns).
You measure this by comparing retrieved chunk sets against a ground truth relevance set. If you do not have ground truth labels, you measure it as a proxy: how often does user feedback correlate negatively with retrieval rank position?
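With a labeled relevance set, the ground-truth version of this check is a few lines. A minimal sketch (the function name and input shapes are illustrative, not from any evaluation framework):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the ground-truth relevant chunks that appear in the
    top-k retrieved results. 1.0 means every relevant chunk surfaced."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids[:k] if cid in relevant_ids)
    return hits / len(relevant_ids)
```

Run this per query over your labeled set and track the mean; a drop after a chunking or embedding change is a retrieval failure signal before any user notices.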
Ranking Failure
The correct chunk appears in the top-k results but not at position one or two. The LLM's context window is finite. Most implementations truncate context to the top three or five chunks. If the right chunk is at position six, it may never reach the model.
This is a reranker problem, or a fusion weight problem if you are running hybrid retrieval. It is invisible if you only log what the LLM received — you need to log the full retrieval result set, with ranks, before truncation.
Generation Failure
The correct chunk was retrieved, ranked appropriately, and passed to the model. The model still produced a wrong answer. It may have ignored the chunk, misread it, or over-weighted a different chunk that was present but less relevant.
This is the failure mode that teams tend to blame first and incorrectly. "Our retrieval is fine, it's the model." Often that is not true. But to determine whether generation failure is genuine or a cover story for retrieval failure, you need chunk-level utilization data: what did the model actually attend to?
Index Quality Failure
The chunk that should answer the query has degraded in relevance because the source document changed and the index was not updated. Or the chunk was split in a way that separates the question from the answer across a chunk boundary. Or the document was ingested with OCR errors that corrupt the embedding.
This is the most operationally tedious failure mode because it requires tracking the relationship between source documents, their ingestion timestamps, their current versions, and the chunks derived from them. It is also the most commonly ignored.
If you cannot distinguish between these four failure modes from your observability data, you are not measuring RAG quality. You are measuring vibes.
The Core Insight: RAG Events Are Columnar Analytics Problems
Here is where most RAG observability stacks go wrong. They reach for Postgres, or they reach for Elasticsearch because it handles unstructured text, or they dump everything into a JSON blob in S3 and run Athena queries that return in forty seconds.
Every retrieval event is a row. But RAG quality analysis is never a single-row operation. Every query you want to run is an aggregation:
- What is the average retrieval rank of chunks that received positive user feedback versus negative feedback?
- Which chunks appear in the top-3 results for more than 15% of all queries? (Your candidates for semantic gravity wells.)
- How has the p95 retrieval latency changed over the past 72 hours, broken out by embedding model version?
- For queries where the retrieved chunk score was above 0.85 but user feedback was negative, what are the common token patterns in the query text?
These are OLAP queries. They scan millions of rows, apply filters, compute aggregations, and return results. Postgres will handle this acceptably at a hundred thousand rows. At ten million rows it starts struggling. At a hundred million rows — which is not an unusual number for a production system that has been live for several months — it becomes unusable for interactive analysis without substantial investment in indexes, partitioning, and query optimization.
ClickHouse is built for this access pattern. Column-oriented storage means it only reads the columns your query touches, not the full row. Its MergeTree engine is optimized for time-series and event data. You can run a query across fifty million retrieval events in under two seconds on a single-node instance.
This is not a performance optimization. It is the difference between being able to run the investigation at all and having to sample your data to make queries tractable.
The Event Schema: What You Actually Need to Capture
The schema is the hardest part to get right, and getting it wrong means backfilling later. Here is a production-tested schema for retrieval events.
CREATE TABLE rag_retrieval_events
(
    -- Identity
    event_id UUID DEFAULT generateUUIDv4(),
    session_id String,
    request_id String,
    user_id String,
    tenant_id String,
    -- Timing
    event_ts DateTime64(3) DEFAULT now64(),
    retrieval_latency_ms UInt32,
    total_latency_ms UInt32,
    -- Query
    query_text String,
    query_embedding_model LowCardinality(String),
    query_embedding_hash String,  -- 16-char prefix of SHA256 of query text, for grouping similar queries
    -- Retrieval results (full set, pre-truncation)
    retrieved_chunk_ids Array(String),
    retrieved_scores Array(Float32),
    retrieved_positions Array(UInt8),
    -- Context window (post-truncation, what the LLM actually received)
    context_chunk_ids Array(String),
    context_chunk_count UInt8,
    -- Generation
    llm_model LowCardinality(String),
    llm_model_version String,
    response_tokens UInt32,
    prompt_tokens UInt32,
    -- Feedback (written back asynchronously)
    feedback_signal Int8,  -- -1 negative, 0 none, 1 positive
    feedback_ts Nullable(DateTime64(3)),
    feedback_correction Nullable(String),
    -- Index metadata
    index_name LowCardinality(String),
    index_version String,
    retrieval_strategy LowCardinality(String),  -- dense, sparse, hybrid
    -- Reranker
    reranker_used UInt8,
    reranker_model LowCardinality(String),
    pre_rerank_chunk_ids Array(String),
    pre_rerank_scores Array(Float32)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_ts)
ORDER BY (tenant_id, index_name, event_ts)
TTL event_ts + INTERVAL 12 MONTH;
A few design choices worth explaining.
Arrays for retrieved chunks and scores. ClickHouse handles arrays natively and efficiently. Storing chunk IDs and scores as parallel arrays is idiomatic ClickHouse — you can use arrayZip, arrayEnumerate, and arrayFilter to process them in queries without joins.
Pre-truncation and post-truncation fields. retrieved_chunk_ids is the full result set from the retriever. context_chunk_ids is what actually went into the LLM prompt. The difference between these two arrays is your ranking loss surface. If a chunk appears in retrieved but not in context, and feedback was negative, that is a ranking failure candidate.
Pre-rerank fields. If you are running a reranker, capture the pre-rerank order. This lets you evaluate whether the reranker is adding value or hurting performance on specific query types.
LowCardinality for model names and strategies. ClickHouse's LowCardinality encoding compresses repeated low-cardinality string values dramatically. Use it for any string that takes a small number of distinct values: model names, index names, retrieval strategies, tenant IDs if your cardinality is under a few thousand.
feedback_signal written back asynchronously. Feedback arrives after the retrieval event. Design for this from the start. Write the event immediately at retrieval time, then issue an ALTER TABLE UPDATE or use a ReplacingMergeTree to update the feedback columns when they arrive. Do not block retrieval on feedback writes.
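One way to batch the writeback, sketched under the assumption that event IDs are server-generated UUIDs (so direct string interpolation is injection-safe): a single multiIf-based mutation covers the whole feedback batch, which keeps the number of part-rewriting mutations low.

```python
def build_feedback_mutation(table: str, feedback: dict[str, int]) -> str:
    """Build one ALTER TABLE UPDATE covering a batch of feedback signals.

    `feedback` maps event_id -> signal (-1, 0, 1). Batching matters because
    each ClickHouse mutation rewrites data parts; one statement per batch
    is far cheaper than one per event.
    """
    branches = ", ".join(
        f"event_id = '{eid}', {sig}" for eid, sig in feedback.items()
    )
    ids = ", ".join(f"'{eid}'" for eid in feedback)
    return (
        f"ALTER TABLE {table} UPDATE "
        f"feedback_signal = multiIf({branches}, feedback_signal), "
        f"feedback_ts = now64(3) "
        f"WHERE event_id IN ({ids})"
    )
```

Hand the resulting statement to your ClickHouse client on a timer (every few minutes), not per event.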
Instrumentation: Hooking Into Your Retrieval Layer
The logging code needs to be non-blocking. Any synchronous write to your analytics store during retrieval adds latency to every request. Use an async queue pattern.
import asyncio
import hashlib
import uuid
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

import clickhouse_connect


@dataclass
class RetrievalEvent:
    session_id: str
    request_id: str
    user_id: str
    tenant_id: str
    query_text: str
    query_embedding_model: str
    query_embedding_hash: str
    retrieved_chunk_ids: list[str]
    retrieved_scores: list[float]
    context_chunk_ids: list[str]
    llm_model: str
    llm_model_version: str
    retrieval_latency_ms: int
    total_latency_ms: int
    index_name: str
    index_version: str
    retrieval_strategy: str
    reranker_used: bool = False
    reranker_model: str = ""
    pre_rerank_chunk_ids: list[str] = field(default_factory=list)
    pre_rerank_scores: list[float] = field(default_factory=list)
    response_tokens: int = 0
    prompt_tokens: int = 0
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    event_ts: datetime = field(default_factory=datetime.utcnow)


class RetrievalEventLogger:
    def __init__(self, clickhouse_host: str, database: str, batch_size: int = 500):
        self.client = clickhouse_connect.get_client(host=clickhouse_host, database=database)
        self.queue: deque = deque()
        self.batch_size = batch_size
        self._flush_task: Optional[asyncio.Task] = None

    async def log(self, event: RetrievalEvent):
        self.queue.append(event)
        if len(self.queue) >= self.batch_size:
            await self._flush()

    async def _flush(self):
        if not self.queue:
            return
        batch = []
        while self.queue:
            batch.append(self.queue.popleft())
        rows = [self._to_row(e) for e in batch]
        # clickhouse-connect insert is synchronous; run in thread pool.
        # get_running_loop() is correct here — get_event_loop() is deprecated in 3.10+
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(
            None,
            lambda: self.client.insert("rag_retrieval_events", rows, column_names=self._columns()),
        )

    def _to_row(self, e: RetrievalEvent) -> list:
        return [
            e.event_id, e.session_id, e.request_id, e.user_id, e.tenant_id,
            e.event_ts, e.retrieval_latency_ms, e.total_latency_ms,
            e.query_text, e.query_embedding_model, e.query_embedding_hash,
            e.retrieved_chunk_ids,
            [round(s, 6) for s in e.retrieved_scores],
            list(range(1, len(e.retrieved_chunk_ids) + 1)),
            e.context_chunk_ids, len(e.context_chunk_ids),
            e.llm_model, e.llm_model_version, e.response_tokens, e.prompt_tokens,
            0, None, None,  # feedback columns, written back later
            e.index_name, e.index_version, e.retrieval_strategy,
            int(e.reranker_used), e.reranker_model,
            e.pre_rerank_chunk_ids,
            [round(s, 6) for s in e.pre_rerank_scores],
        ]

    @staticmethod
    def _columns() -> list[str]:
        return [
            "event_id", "session_id", "request_id", "user_id", "tenant_id",
            "event_ts", "retrieval_latency_ms", "total_latency_ms",
            "query_text", "query_embedding_model", "query_embedding_hash",
            "retrieved_chunk_ids", "retrieved_scores", "retrieved_positions",
            "context_chunk_ids", "context_chunk_count",
            "llm_model", "llm_model_version", "response_tokens", "prompt_tokens",
            "feedback_signal", "feedback_ts", "feedback_correction",
            "index_name", "index_version", "retrieval_strategy",
            "reranker_used", "reranker_model",
            "pre_rerank_chunk_ids", "pre_rerank_scores",
        ]
Now hook this into your retrieval chain. If you are using LangChain or LangGraph, the cleanest insertion point is a custom retriever wrapper:
import time
import hashlib

from langchain_core.retrievers import BaseRetriever
from langchain_core.documents import Document
from langchain_core.callbacks import CallbackManagerForRetrieverRun


class InstrumentedRetriever(BaseRetriever):
    base_retriever: BaseRetriever
    logger: RetrievalEventLogger
    index_name: str
    index_version: str
    retrieval_strategy: str
    embedding_model: str

    def _get_relevant_documents(
        self,
        query: str,
        *,
        run_manager: CallbackManagerForRetrieverRun,
        session_id: str = "",
        user_id: str = "",
        tenant_id: str = "",
    ) -> list[Document]:
        t0 = time.monotonic()
        docs = self.base_retriever.invoke(query)
        latency_ms = int((time.monotonic() - t0) * 1000)

        chunk_ids = [doc.metadata.get("chunk_id", "") for doc in docs]
        scores = [float(doc.metadata.get("score", 0.0)) for doc in docs]
        embedding_hash = hashlib.sha256(query.encode()).hexdigest()[:16]

        event = RetrievalEvent(
            session_id=session_id,
            request_id=str(run_manager.run_id),
            user_id=user_id,
            tenant_id=tenant_id,
            query_text=query,
            query_embedding_model=self.embedding_model,
            query_embedding_hash=embedding_hash,
            retrieved_chunk_ids=chunk_ids,
            retrieved_scores=scores,
            # context_chunk_ids is intentionally empty here.
            # The Context Builder populates it after truncation by writing
            # back to event.context_chunk_ids before the async flush fires.
            context_chunk_ids=[],
            llm_model="",  # filled downstream
            llm_model_version="",
            retrieval_latency_ms=latency_ms,
            total_latency_ms=latency_ms,
            index_name=self.index_name,
            index_version=self.index_version,
            retrieval_strategy=self.retrieval_strategy,
        )
        # Attach event reference to each doc so the Context Builder and
        # Reranker can mutate pre_rerank_chunk_ids and context_chunk_ids
        for doc in docs:
            doc.metadata["_retrieval_event"] = event
        # Yield one event loop turn so the Context Builder can complete
        # before the event is enqueued for flush. Note: create_task requires
        # a running event loop, so this only works when the chain executes
        # inside an async server (e.g. FastAPI, an async LangServe route).
        asyncio.create_task(self._log_after_context(event))
        return docs

    async def _log_after_context(self, event: RetrievalEvent):
        await asyncio.sleep(0)
        await self.logger.log(event)
One implementation note: asyncio.create_task fires and forgets. If your application shuts down mid-batch, events in the queue will be lost. For production, add a shutdown hook that calls await logger._flush() before the process exits.
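A minimal version of that shutdown hook, assuming a Unix platform (asyncio's add_signal_handler is not available on Windows event loops); the helper names are illustrative:

```python
import asyncio
import signal


async def drain(logger) -> None:
    """Flush any queued retrieval events; call this from the shutdown path."""
    await logger._flush()


def install_flush_on_shutdown(logger, loop: asyncio.AbstractEventLoop) -> None:
    """Register SIGTERM/SIGINT handlers that drain the logger's queue
    before the process exits. `logger` is the RetrievalEventLogger above."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, lambda: loop.create_task(drain(logger)))
```

If you deploy behind an orchestrator that sends SIGTERM before SIGKILL (Kubernetes does), this closes the data-loss window to the kill grace period.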
The System Architecture
The following diagram shows how retrieval events flow from your agent runtime into ClickHouse, and how queries flow back out to your monitoring surfaces.
System Architecture: Retrieval Events Flow from Agent Runtime into ClickHouse
The Four Metrics That Actually Matter in Production
Most RAG evaluation frameworks give you RAGAS metrics: faithfulness, answer relevancy, context precision, context recall. These are useful for offline evaluation against labeled datasets. They are not useful for production monitoring because they require running an LLM-as-judge on every event, which is expensive at volume and introduces its own quality variance.
Here are four metrics you can compute directly from ClickHouse event data, with no LLM-as-judge required.
Metric 1: Chunk Utilization Rate
Of the chunks retrieved and ranked into the top-k, what percentage actually made it into the context window? A low utilization rate means your retriever is returning more candidates than your context window can fit — which is fine — but if the dropped chunks are high-scoring, you have a context truncation problem.
-- Chunk utilization rate by index, last 7 days
SELECT
    index_name,
    retrieval_strategy,
    avg(length(context_chunk_ids)) AS avg_context_chunks,
    avg(length(retrieved_chunk_ids)) AS avg_retrieved_chunks,
    avg(length(context_chunk_ids) / nullif(length(retrieved_chunk_ids), 0)) AS utilization_rate,
    count() AS event_count
FROM rag_retrieval_events
WHERE event_ts >= now() - INTERVAL 7 DAY
GROUP BY index_name, retrieval_strategy
ORDER BY utilization_rate ASC;
A utilization rate below 0.4 with consistently high feedback scores is fine — you are over-retrieving and the model is doing well with what it gets. A utilization rate below 0.4 with negative feedback requires a different response depending on your pipeline. If you are not running a reranker, the options are to increase the context window, add a reranker, or reduce your top-k. If a reranker is already in the pipeline, adding another one is not the answer — the reranker is either miscalibrated and not surfacing the right chunks to the top positions, or the context window is too small to include them even after reranking. In that case, investigate reranker performance first using the pre-rerank versus post-rerank comparison query before touching context window size or top-k.
Metric 2: Query-Chunk Affinity (Semantic Gravity Wells)
Which chunks are appearing in the top-3 results for a disproportionate share of all queries? A chunk that dominates retrieval is either genuinely authoritative — or it is a gravity well that is pulling queries away from more relevant content because its embedding happens to land near the centroid of your query distribution.
-- Top chunks by retrieval frequency (gravity well candidates)
SELECT
    chunk_id,
    count() AS retrieval_count,
    avg(score) AS avg_score,
    countIf(feedback_signal = 1) AS positive_feedback,
    countIf(feedback_signal = -1) AS negative_feedback,
    round(countIf(feedback_signal = 1) / nullif(count(), 0), 3) AS positive_rate
FROM
(
    SELECT
        request_id,
        feedback_signal,
        arrayJoin(arrayZip(retrieved_chunk_ids, retrieved_scores)) AS chunk_score_pair,
        chunk_score_pair.1 AS chunk_id,
        chunk_score_pair.2 AS score
    FROM rag_retrieval_events
    WHERE event_ts >= now() - INTERVAL 30 DAY
)
GROUP BY chunk_id
HAVING retrieval_count > 100
ORDER BY retrieval_count DESC
LIMIT 50;
Chunks with high retrieval frequency but low positive rates are your gravity wells. Investigate the chunk content. In most cases, it is an introductory or summary document that uses broad vocabulary, making it semantically close to many queries without actually being the most relevant answer.
Metric 3: Retrieval-Response Correlation
Do higher retrieval scores actually predict better user outcomes? If they do not, your similarity metric is not calibrated for your use case. This is one of the most important signals for deciding whether to switch embedding models.
-- Score-feedback correlation: does score predict quality?
SELECT
    index_name,
    multiIf(
        top_score >= 0.9, '0.9+',
        top_score >= 0.8, '0.8-0.9',
        top_score >= 0.7, '0.7-0.8',
        top_score >= 0.6, '0.6-0.7',
        'below-0.6'
    ) AS score_bucket,
    min(top_score) AS bucket_floor,
    count() AS event_count,
    countIf(feedback_signal = 1) AS positive,
    countIf(feedback_signal = -1) AS negative,
    round(countIf(feedback_signal = 1) / nullif(count(), 0), 3) AS positive_rate
FROM
(
    SELECT
        index_name,
        feedback_signal,
        retrieved_scores[1] AS top_score
    FROM rag_retrieval_events
    WHERE event_ts >= now() - INTERVAL 30 DAY
      AND feedback_signal != 0
      AND length(retrieved_scores) > 0
)
GROUP BY index_name, score_bucket
-- order by bucket_floor, not the bucket label: lexicographic ordering
-- would put 'below-0.6' first
ORDER BY index_name, bucket_floor DESC;
Healthy retrieval shows a monotonic relationship: higher scores produce higher positive feedback rates. If you see a flat or inverted relationship — high-scoring retrievals producing as many negative feedbacks as low-scoring ones — your embedding model is not capturing the semantics your users care about. This is often the case when you deploy a general-purpose embedding model on a domain-specific corpus.
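The monotonicity check itself is simple enough to automate on top of the query results. A sketch, assuming each row arrives as a (bucket_floor, positive_rate) pair; the function name is illustrative:

```python
def is_monotonic_calibration(buckets: list[tuple[float, float]]) -> bool:
    """True when higher score buckets never have a lower positive feedback
    rate than lower buckets, which is the healthy shape for a calibrated
    similarity metric. Input: (bucket_floor_score, positive_rate) pairs."""
    ordered = sorted(buckets)  # sort by bucket floor, ascending
    rates = [rate for _, rate in ordered]
    return all(a <= b for a, b in zip(rates, rates[1:]))
```

Run it per index and per week; a flip from True to False is an early signal that the embedding model and your users have diverged.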
Metric 4: Index Staleness Signal
As your corpus ages and documents are updated, retrieval scores drift. A chunk that was highly relevant when it was indexed becomes misleading after the source document changes. You cannot detect this without tracking index versions.
-- Retrieval score trend by index version (staleness detection)
SELECT
    index_name,
    index_version,
    toStartOfDay(event_ts) AS day,
    avg(retrieved_scores[1]) AS avg_top_score,
    quantile(0.5)(retrieved_scores[1]) AS median_top_score,
    countIf(feedback_signal = 1) / nullif(count(), 0) AS positive_rate,
    count() AS events
FROM rag_retrieval_events
WHERE event_ts >= now() - INTERVAL 60 DAY
  AND length(retrieved_scores) > 0
GROUP BY index_name, index_version, day
ORDER BY index_name, index_version, day;
If you see average top scores declining over time within a fixed index version, while a newer index version shows higher scores on the same query types, your index refresh cadence is too slow relative to your corpus update frequency.
Building the Feedback Dashboard
The metrics above are actionable only if you can see them continuously, not after you run a SQL notebook in response to a support ticket.
The following diagram outlines what a RAG quality dashboard should surface.
RAG Quality Dashboard
The dead chunk query deserves its own entry because it is rarely implemented but consistently valuable:
-- Dead chunks: indexed but never retrieved in the past 30 days
-- Requires a separate chunk_registry table tracking all indexed chunks
-- LEFT ANTI JOIN keeps only registry rows with no retrieval match.
-- A plain LEFT JOIN with an IS NULL filter silently matches nothing here
-- unless join_use_nulls = 1, because ClickHouse fills non-matched String
-- columns with '' rather than NULL.
SELECT
    cr.chunk_id,
    cr.source_document,
    cr.indexed_at,
    cr.chunk_text_preview
FROM chunk_registry cr
LEFT ANTI JOIN
(
    SELECT DISTINCT arrayJoin(retrieved_chunk_ids) AS chunk_id
    FROM rag_retrieval_events
    WHERE event_ts >= now() - INTERVAL 30 DAY
) active ON cr.chunk_id = active.chunk_id
WHERE cr.indexed_at < now() - INTERVAL 7 DAY  -- exclude recently added
ORDER BY cr.indexed_at ASC
LIMIT 500;
Dead chunks are a direct signal of index bloat. They inflate your vector store size, slow down retrieval, and introduce noise into approximate nearest neighbor searches. A chunk that has never been retrieved in thirty days of production traffic is either redundant (covered by a better chunk), malformed (embedded incorrectly), or genuinely irrelevant (should not have been indexed). Prune them.
The Failure Mode Catalog
Beyond the four core metrics, there are specific failure patterns that only emerge after you have been running production instrumentation for several weeks. These are not theoretical — they appear in almost every system that gets past a few thousand daily active users.
The Semantic Gravity Well
A single chunk or small cluster of chunks dominates retrieval across a disproportionate share of queries. The chunk typically comes from an introductory section of a document — an executive summary, a product overview, a glossary — that uses broad vocabulary matching many query types.
The signature is: one chunk appearing in the top-3 of more than 20% of all queries, with a positive feedback rate that is lower than your system average. The chunk is not actually answering most of those queries well; it is just proximate to too many query embeddings.
Fix: identify the gravity well chunks, examine whether their content is genuinely high-value or just broad. If they are summaries that should not be independently retrieved, either remove them from the index or add them to a negative example list for your embedding fine-tuning.
Embedding Model Drift
You update your embedding model. The new model produces better embeddings on your benchmark queries. You reindex. Scores jump.
But some queries that were working before now return different chunks. The semantic space has shifted. Queries that previously resolved to chunk A now resolve to chunk B, which may be worse. The aggregate score improvement masks individual regressions.
The signature: after an index version transition, your score-feedback correlation query shows a temporary drop in positive rate on specific query clusters, while aggregate scores improve. Without per-cluster analysis you will miss this.
Fix: when you reindex with a new model, run both indexes in shadow mode for a week. Log retrievals from both. Compare positive feedback rates by query cluster. Only cut over when the new index is clearly better on clusters where the old index was weak, without regressing on clusters where it was strong.
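The per-cluster comparison reduces to a small aggregation once both indexes are logging. A sketch, assuming you attach your own query cluster label to each event (for example by grouping on query_embedding_hash prefixes; the field names here are illustrative):

```python
from collections import defaultdict


def shadow_comparison(events: list[dict]) -> dict[str, dict[str, float]]:
    """Positive feedback rate per (query_cluster, index_version).

    `events` are dicts with 'query_cluster', 'index_version', and
    'feedback_signal' keys. Returns {cluster: {version: positive_rate}}
    so a regression on one cluster is visible even when the aggregate
    rate improves.
    """
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [positive, total]
    for e in events:
        if e["feedback_signal"] == 0:
            continue  # no explicit feedback, excluded from the rate
        cell = counts[e["query_cluster"]][e["index_version"]]
        cell[1] += 1
        if e["feedback_signal"] == 1:
            cell[0] += 1
    return {
        cluster: {v: pos / total for v, (pos, total) in versions.items()}
        for cluster, versions in counts.items()
    }
```

The cutover rule then becomes mechanical: every cluster where the old version was strong must hold its rate, and at least the weak clusters must improve.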
The Silent Reranker
Rerankers are added to systems to fix ranking failures. Often they work. Sometimes they make things worse on specific query types, and nobody notices because aggregate metrics do not change much.
The signature: comparing pre_rerank_chunk_ids[1] vs context_chunk_ids[1] — the pre-rerank first result versus the post-rerank first result — shows that the reranker is frequently swapping a high-score chunk to a lower position on queries where feedback is positive. The reranker learned a preference that does not match your users.
-- Reranker impact: cases where reranker moved top chunk down
SELECT
    count() AS events,
    countIf(feedback_signal = 1) AS positive,
    countIf(feedback_signal = -1) AS negative
FROM rag_retrieval_events
WHERE reranker_used = 1
  AND length(pre_rerank_chunk_ids) > 0
  AND length(context_chunk_ids) > 0
  AND pre_rerank_chunk_ids[1] != context_chunk_ids[1]  -- reranker changed top result
  AND event_ts >= now() - INTERVAL 30 DAY;
Feedback Signal Decay
Your feedback mechanism (thumbs up/down, correction input) sees heavy engagement at launch and declining engagement over time. Six months in, 90% of your events have feedback_signal = 0. Your quality metrics are now based on 10% of traffic, which is not representative.
This happens in every system. Users who give feedback are systematically different from users who do not — they are more engaged, more likely to be power users, more likely to flag bad responses. Your feedback sample is biased toward the extremes.
Fix: implement implicit feedback signals that do not require user action. Session abandonment after a response (user did not continue the conversation) as a weak negative signal. Copy action (user copied the response) as a weak positive signal. Follow-up question that rephrases the original query as a strong negative signal (the first answer did not satisfy). These signals are noisy but unbiased, and they fill in the 90% gap.
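Translating those behaviors into the -1/0/1 scale of the feedback_signal column can be a small pure function. A sketch, assuming your frontend emits session events shaped like {'type': 'response'|'copy'|'query', 'rephrased': bool}; the event shape and function name are assumptions, not an existing API:

```python
def implicit_feedback(session_events: list[dict]) -> int:
    """Derive a feedback signal from post-response behavior.

    Maps the heuristics described above onto the explicit feedback scale:
    copy action -> weak positive, rephrased follow-up -> strong negative,
    no follow-up at all (abandonment) -> weak negative, other follow-up
    -> neutral.
    """
    followed_up = False
    for ev in session_events:
        if ev["type"] == "copy":
            return 1
        if ev["type"] == "query":
            if ev.get("rephrased"):
                return -1
            followed_up = True
    return 0 if followed_up else -1
```

Write these into a separate implicit_feedback_signal column rather than overloading feedback_signal, so explicit and inferred signals stay distinguishable in queries.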
Closing the Loop: The Operational Cadence
All of this instrumentation creates value only if you act on it. Most teams build logging infrastructure, produce dashboards, and then never schedule time to actually review them. The data accumulates. The system degrades. The alerts fire. The response is reactive.
The minimum viable RAG quality review cadence looks like this:
Weekly — 30 minutes:
Run the gravity well query. If any chunk's retrieval frequency has increased more than 50% week-over-week, investigate it. Run the score-feedback correlation. If the positive rate for any score bucket has dropped more than 5 percentage points, flag it.
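The week-over-week comparison is easy to script so the weekly review starts from a shortlist instead of raw query output. A sketch that takes two result sets from the gravity well query as chunk_id-to-count maps (names illustrative):

```python
def flag_weekly_shifts(this_week: dict[str, int],
                       last_week: dict[str, int],
                       threshold: float = 0.5) -> list[str]:
    """Chunk IDs whose retrieval count grew by more than `threshold`
    (default 50%) week-over-week: the gravity well candidates to
    investigate first. Chunks absent last week are skipped, since a
    ratio against zero is undefined; review new chunks separately."""
    flagged = []
    for chunk_id, count in this_week.items():
        prev = last_week.get(chunk_id, 0)
        if prev and (count - prev) / prev > threshold:
            flagged.append(chunk_id)
    return flagged
```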
Monthly — 2 hours:
Run the dead chunk query. Prune any chunks that have not been retrieved in 30 days and have been in the index for more than 7 days. Review the index staleness signal. If score drift is visible in the current index version, schedule a reindex. Compare embedding model performance across query clusters — if you have explored a new model, run the shadow mode comparison.
On-demand — after any corpus change:
When you add a significant number of new documents, or update existing ones, run the index version comparison query immediately. New documents can introduce gravity wells, can shift embedding centroids, and can deprecate existing chunks that were previously the best answer to a query.
The operational pattern mirrors what precision medicine has done with continuous glucose monitoring: you are not running a monthly checkup, you are maintaining continuous visibility and acting on deviations. The difference is that in RAG systems, the intervention is your own — a reindex, a chunk pruning run, a reranker recalibration — not a prescription.
What This Architecture Actually Costs
Before deploying any of this, understand the cost structure.
A single-node ClickHouse instance on a mid-range cloud machine (16 vCPU, 64GB RAM, 1TB NVMe) handles around 50 million retrieval events per month comfortably. At 10,000 daily active users with 5 queries each, you are at 1.5 million events per month — well within single-node range. At 100,000 daily active users you are approaching the boundary where you want ClickHouse Cloud or a replicated cluster.
The bigger cost is storage. Each retrieval event row, with arrays of chunk IDs and scores, compresses to approximately 800-1200 bytes after ClickHouse's column compression. At 10 million events per month with a 12-month TTL, you are at roughly 100GB of compressed storage. Manageable.
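The arithmetic is worth encoding so you can re-run it against your own traffic numbers. A back-of-the-envelope sketch using the midpoint-ish figure of 1KB per compressed event (an assumption; measure your own compression ratio):

```python
def storage_estimate_gb(events_per_month: int,
                        bytes_per_event: int = 1000,
                        retention_months: int = 12) -> float:
    """Rough compressed storage for the retrieval events table:
    events held under TTL * bytes per event, in decimal gigabytes."""
    return events_per_month * bytes_per_event * retention_months / 1e9
```

At 10 million events per month this lands near the 100GB figure quoted above; the spread from 800 to 1200 bytes per event moves it between roughly 96GB and 144GB.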
The logging infrastructure itself adds approximately 2-5ms of overhead to your retrieval path if you are using the async queue pattern. This is negligible. If you are doing synchronous writes — do not do synchronous writes.
The feedback writeback via ALTER TABLE UPDATE is a batch operation. Run it every 5 minutes for events that have received feedback signals. Do not run it per-event. ALTER TABLE UPDATE in ClickHouse is a mutation that rewrites data parts; batching is essential.
Summary
RAG quality measurement in production is an analytics architecture problem first and a metrics problem second. The failure modes — retrieval failure, ranking failure, generation failure, index quality failure — require different instrumentation to distinguish, and you cannot distinguish them from trace logs alone.
ClickHouse is the right storage layer for this problem not because it is faster than Postgres in benchmarks, but because the query patterns for RAG quality analysis are inherently columnar and aggregate: scans across millions of events, filtering on score buckets, computing feedback correlations, identifying chunk-level patterns. Row-oriented storage makes these queries expensive at production scale. Column-oriented storage makes them routine.
The schema matters more than the tooling. Capturing pre-truncation retrieval results, pre-reranker rankings, and chunk-level IDs at ingest time is what separates a useful observability stack from a logging infrastructure that tells you nothing when something breaks.
The loop is only closed when you have an operational cadence that acts on what the data surfaces. Logging without review is expensive background noise. The disciplines of weekly gravity well checks, monthly dead chunk pruning, and on-demand index comparisons after corpus changes are what transform observability into a quality engineering practice.
Your RAG system is not a pipeline you deploy and monitor from the outside. It is a system whose internal behavior you need to understand at event granularity. That requires a storage layer designed for the access pattern. That requires instrumentation that captures the right data points. And that requires the discipline to actually use what you have built.
References
- ClickHouse MergeTree Engine documentation: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree
- ClickHouse Array Functions: https://clickhouse.com/docs/en/sql-reference/functions/array-functions
- Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217
- Gao, Y., et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401
- Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085