Question Answer Chatbot using RAG, Llama and Qdrant

1. Introduction

I built this teaching chatbot to answer questions from Class IX, subject SST, on the topic “Democratic politics”. It uses RAG (Retrieval-Augmented Generation), a Llama model as the LLM (Large Language Model), Qdrant as the vector database, LangChain, and Streamlit.

2. How to run the code?

GitHub repository link: https://github.com/ranjankumar-gh/teaching-bot/

Steps to run the code

  1. git clone https://github.com/ranjankumar-gh/teaching-bot.git
  2. cd teaching-bot
  3. python -m venv env
  4. Activate the environment from the env directory (for example, source env/bin/activate on Linux/macOS or env\Scripts\activate on Windows).
  5. python -m pip install -r requirements.txt
  6. Before running the following line, Qdrant should be running and available on localhost. If it’s running on a different machine, make appropriate URL changes to the code.
    python data_ingestion.py
    After running this, http://localhost:6333/dashboard#/collections should look like Figure 1.
  7. Run the web application for the chatbot by running the following command. The web application is powered by Streamlit.
    streamlit run app.py
    The interface of the chatbot appears as in Figure 2.

Figure 1: Screenshot of the Qdrant dashboard after running the data_ingestion.py

Figure 2: Screenshot of the chatbot web application

3. Data Ingestion

Data: PDF files have been downloaded from the NCERT website for Class IX, subject SST, from the topic “Democratic politics”. These files are stored in the directory ix-sst-ncert-democratic-politics. The following are the steps for data ingestion:

  1. PDF files are loaded from the directory.
  2. Text contents are extracted from the PDF.
  3. Text content is divided into chunks of text.
  4. These chunks are transformed into vector embeddings.
  5. These vector embeddings are stored in the Qdrant vector database.
  6. The data is stored in Qdrant under the collection name “ix-sst-ncert-democratic-politics”.

The following is the code snippet for data_ingestion.py.

###############################################################
# Data ingestion pipeline 
# 1. Taking the input pdf file
# 2. Extracting the content
# 3. Divide into chunks
# 4. Use embeddings model to convert to the embedding vector
# 5. Store the embedding vectors to the qdrant (vector database)
################################################################
import os
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient

path = "ix-sst-ncert-democratic-politics"
filenames = next(os.walk(path))[2]

for i, file_name in enumerate(filenames):
    print(f"Data ingestion for the chapter: {i}")

    # 1. Load the pdf document and extract text from it
    loader = PDFMinerLoader(path + "/" + file_name)
    pdf_content = loader.load()
    print(pdf_content)

    # 2. Split the text into small chunks
    CHUNK_SIZE = 1000 # target chunk size in characters (a chunk can exceed it if no separator falls within range)
    CHUNK_OVERLAP = 30 # a bit of overlap is required for continued context

    text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
    docs = text_splitter.split_documents(pdf_content)

    # Make a list of split docs
    documents = []
    for doc in docs:
        documents.append(doc.page_content)

    # 3. Create vectordatabase(qdrant) client 
    qdrant_client = QdrantClient(url="http://localhost:6333")

    # 4. Add document chunks in vectordb
    qdrant_client.add(
        collection_name="ix-sst-ncert-democratic-politics",
        documents=documents,
        #metadata=metadata,
        #ids=ids
    )

    # 5. Make a query from the vectordb(qdrant)
    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text="What is democracy?"
    )

    for search_result in search_results:
        print(search_result.document, search_result.score)

4. Chatbot Web Application

The web application is powered by Streamlit. Following are the steps:

  1. A connection to the Qdrant vector database is created.
  2. User questions are captured through the web interface.
  3. The question text is transformed into a vector embedding.
  4. Qdrant is searched with this embedding to find the stored content most similar to the question.
  5. The text returned by Qdrant acts as the context for the LLM.
  6. The query, along with the context, is sent to the Llama model, which generates the answer.
  7. The answer is displayed on the web interface as the answer from the bot.

Following is the code snippet for app.py.

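import streamlit as st
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaLLM
from qdrant_client import QdrantClient

# Connect to the Qdrant vector database (assumed to be running on localhost)
qdrant_client = QdrantClient(url="http://localhost:6333")

# Initialize chat history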
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# React to user input
if query := st.chat_input("What is up?"):
    # Display user message in chat message container
    st.chat_message("user").markdown(query)
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": query})

    # Connect with vector db for getting the context
    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text=query
    )
    context = ""
    no_of_docs = 2
    count = 1
    for search_result in search_results:
        if search_result.score >= 0.8:
            #print(f"Retrieved document: {search_result.document}, Similarity score: {search_result.score}")
            context = context + search_result.document
        if count >= no_of_docs:
            break
        count = count + 1

    # Using LLM for forming the answer
    template = """Instruction: {instruction}
    Context: {context}
    Query: {query}
    """
    prompt = ChatPromptTemplate.from_template(template)

    model = OllamaLLM(model="llama3.2") # Using llama3.2 as llm model

    chain = prompt | model

    bot_response = chain.invoke({"instruction": "Answer the question based on the context below. If you cannot answer the question with the given context, answer with \"I don't know.\"", 
            "context": context,
            "query": query
            })

    print(f'\nBot: {bot_response}')

    #response = f"Echo: {prompt}"
    # Display assistant response in chat message container
    with st.chat_message("assistant"):
        st.markdown(bot_response)
    # Add assistant response to chat history
    st.session_state.messages.append({"role": "assistant", "content": bot_response})

On Emergent Abilities of Large Language Models

An ability is emergent if it is not present in smaller models but is present in larger models. [1]

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. Emergent abilities, by contrast, cannot be predicted simply by extrapolating the performance of smaller models. This raises the question of whether further scaling could expand the range of capabilities of language models. [1]

Today’s language models have been scaled primarily along three factors:

  1. amount of computation,
  2. number of model parameters, and
  3. training data size

The following table lists the emergent abilities of large language models and the scale at which abilities emerge. [1]

Tasks that language models cannot currently do are prime candidates for future emergence; for instance, there are dozens of tasks in BIG-Bench [3] for which even the largest GPT-3 and PaLM models do not achieve above-random performance. [1] Emergent risks could appear in the same way as emergent abilities, for example with respect to truthfulness, bias, and toxicity in LLMs, backdoor vulnerabilities, inadvertent deception, or harmful content synthesis.

However, Rylan Schaeffer et al. claim in their paper [2] that the sudden appearance of emergent abilities is just a consequence of the way researchers measure an LLM’s performance. The article “How Quickly Do Large Language Models Learn Unexpected Skills?” by Stephen Ornes [4] beautifully summarises the two papers.

References

  1. Emergent Abilities of Large Language Models by Jason Wei et al. – https://openreview.net/pdf?id=yzkSU5zdwD
  2. Are Emergent Abilities of Large Language Models a Mirage? by Rylan Schaeffer et al. – https://arxiv.org/pdf/2304.15004
  3. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models by Aarohi Srivastava et al. – https://arxiv.org/pdf/2206.04615
  4. How Quickly Do Large Language Models Learn Unexpected Skills? by Stephen Ornes – https://www.quantamagazine.org/how-quickly-do-large-language-models-learn-unexpected-skills-20240213/

Prompt Engineering Deep Dive: Parameters, Chains, Reasoning, and Guardrails

1. Introduction

Prompt engineering is the practice of designing and refining the text (prompt) that we pass to a Generative AI (GenAI) model. The prompt acts as an instruction or query, and the model generates responses based on it. Prompts can be questions, statements, or detailed instructions.

Prompt engineering serves three purposes:

  1. Enhancing output quality – refining how the model responds.
  2. Evaluating model behavior – testing the output against requirements.
  3. Ensuring safety – reducing harmful or biased responses.

There is no single “perfect” prompt. Instead, prompt design is an iterative process involving optimization and experimentation.

Figure 1: A basic example of the prompt

2. Controlling Model Output by Adjusting Model Parameters

The behavior of large language models (LLMs) can be adjusted at inference time using sampling parameters such as temperature, top_p, and top_k. In libraries such as Hugging Face Transformers, do_sample=True must be set for these to take effect, so that the model samples tokens instead of always choosing the most likely one.

  • Temperature controls randomness.
    • temperature=0: greedy decoding, effectively deterministic output (the most likely token is always chosen).
    • Higher values → more diverse responses.
    • Example: 0.2 = focused, coherent; 0.8 = more creative.
  • Top_p (nucleus sampling) restricts token choices to the smallest set whose cumulative probability ≥ p.
    • top_p=1: consider all tokens.
    • Lower values → more focused output.
  • Top_k limits the selection to the k most likely tokens.

By tuning these, one can strike a balance between deterministic/focused and creative/diverse outputs.
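The interaction of the three parameters can be illustrated with a minimal sketch that samples from a toy distribution. The function name sample_token and the toy logits are assumptions made for illustration; real inference libraries apply the same ideas to the model's full vocabulary and may differ in detail.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Pick one token from a {token: logit} dict using temperature, top-k and top-p."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: always the most likely token.
        return max(logits, key=logits.get)

    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)

    # top_k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in kept]
    probs = [prob for _, prob in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]

# Toy logits, invented for this example.
toy_logits = {"cat": 2.0, "dog": 1.5, "fish": 0.5, "rock": -1.0}
print(sample_token(toy_logits, temperature=0))               # greedy pick
print(sample_token(toy_logits, temperature=0.8, top_k=2))    # "cat" or "dog"
print(sample_token(toy_logits, temperature=1.0, top_p=0.5))  # nucleus holds one token
```

Lowering top_k or top_p shrinks the candidate set before sampling, while temperature reshapes the probabilities inside that set.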

3. Instruction-Based Prompting

Instruction-based prompting is one of the most fundamental and widely used approaches in working with large language models (LLMs). It involves providing the model with explicit, structured, and unambiguous instructions that guide how the response should be generated.

At its core, an instruction-based prompt consists of two essential components:

  1. Instruction – what the model is supposed to do (e.g., “Summarize the text in one sentence.”).
  2. Data – the input on which the instruction operates (e.g., the paragraph to be summarized).

A simple example:

Prompt

Instruction: Summarize the following text in one sentence.  
Data: Artificial Intelligence is revolutionizing industries such as healthcare, finance, and education by automating tasks and enabling data-driven decision-making.  

Output

AI is transforming industries by automating tasks and enabling smarter decisions.  

The following diagram depicts a basic instruction prompt. Please note the instructions and data in the prompt.

Figure 2: Instruction Prompt

3.1 Adding Output Indicators

Sometimes instructions alone are not enough. To make the response more constrained and predictable, we can add output indicators – predefined answer formats or expected categories.

For example:

Prompt

Instruction: Classify the sentiment of the following review.  
Data: “The product is amazing and works perfectly.”  
Output options: Positive | Negative  

Output

Positive  

The following diagram depicts the instruction prompt with an output indicator.

Figure 3: Instruction prompt with output indicators

3.2 Task-Specific Prompt Formats

Different NLP tasks require slightly different instruction structures. For example:

  • Summarization: “Summarize the following paragraph in 2–3 sentences.”
  • Classification: “Classify the following text as spam or not spam.”
  • Named Entity Recognition (NER): “Extract all names of organizations mentioned in the following text and list them as a JSON array.”

These formats not only help the model but also make evaluation easier for humans.

The following diagram illustrates example formats for summarization, classification, and named-entity recognition.

Figure 4: Prompt format for summarization, classification, and NER task

3.3 Prompting Techniques for Better Results

Instruction-based prompting can be improved using several best practices:

  • Specificity
    Be as precise as possible. Instead of “Explain photosynthesis”, say “Explain photosynthesis in 3 simple steps, suitable for a 10-year-old student.”
  • Hallucination Control
    LLMs sometimes generate plausible but false answers (hallucinations). To mitigate this, include safeguards:
    If you are not sure about the answer, respond with: "I don't know."
  • Order Matters (Primacy and Recency Effects)
    LLMs pay more attention to the beginning and end of a prompt. Place the most important instructions either at the start or the end to maximize compliance.

3.4 Beyond the Basics: Modular Prompt Components

While instruction, data, and output indicators are the core building blocks, prompts can be made more modular and adaptive by incorporating additional components such as:

  • Personalization – “You are a financial advisor…”
  • Instruction – “Summarize, compare, explain…”
  • Context – background information that helps the model.
  • Format – JSON, table, bullet points, plain text.
  • Audience – “Explain as if to a high school student.”
  • Tone – formal, casual, persuasive, neutral.
  • Data – the actual input for the model to process.

This modularity makes prompts flexible. You can add or remove components depending on the use case, and observe how the output changes.

The following example uses the prompt components listed above.

Figure 5: Example of a prompt showing the use of the various components.

3.5 Example: Modular Prompt

Prompt

Personalization: You are a career coach.  
Instruction: Provide advice on writing a resume.  
Context: The user is applying for a software engineering role at a mid-sized startup.  
Format: List your advice as 5 bullet points.  
Audience: Entry-level job seeker with 1 year of internship experience.  
Tone: Encouraging and practical.  
Data: “How should I improve my resume?”  

Output

1. Highlight your technical skills (Python, JavaScript, databases).  
2. Add details about your internship projects—focus on impact.  
3. Tailor your resume to the startup by showing adaptability and problem-solving.  
4. Keep the format clean, concise, and one page.  
5. Add a short section for personal projects or open-source contributions.  

This demonstrates how instruction-based prompting becomes powerful and adaptable when multiple components are layered together.
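One way to experiment with adding and removing components is to assemble the prompt programmatically. The following is a minimal sketch; the helper build_prompt and its fixed component order are assumptions for illustration, not part of any library.

```python
def build_prompt(**components):
    """Assemble a prompt from optional labelled components, in a fixed order."""
    order = ["personalization", "instruction", "context", "format",
             "audience", "tone", "data"]
    lines = []
    for name in order:
        value = components.get(name)
        if value:  # skip components that were not supplied
            lines.append(f"{name.capitalize()}: {value}")
    return "\n".join(lines)

full_prompt = build_prompt(
    personalization="You are a career coach.",
    instruction="Provide advice on writing a resume.",
    context="The user is applying for a software engineering role at a mid-sized startup.",
    format="List your advice as 5 bullet points.",
    audience="Entry-level job seeker with 1 year of internship experience.",
    tone="Encouraging and practical.",
    data="How should I improve my resume?",
)

# The same helper with most components removed, for comparison.
minimal_prompt = build_prompt(
    instruction="Provide advice on writing a resume.",
    data="How should I improve my resume?",
)

print(full_prompt)
print("---")
print(minimal_prompt)
```

Sending both prompts to the same model makes it easy to see which components actually change the output.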

4. In-Context Learning – Providing examples

Large Language Models (LLMs) do not “learn” in the traditional sense during inference. Instead, they adapt to patterns given in the prompt itself. This ability to condition their behavior on a few examples provided at runtime is called In-Context Learning (ICL).

4.1 The Idea Behind ICL

By showing the model examples of the task and the desired outputs, we “teach” it on the fly. The model does not change its weights; rather, it uses the examples as a temporary pattern guide to align its responses with the given format.

This makes ICL especially powerful when:

  • We don’t want to fine-tune the model.
  • Training data for fine-tuning is small or unavailable.
  • We want flexibility to change tasks quickly.

4.2 Types of In-Context Learning

1. Zero-shot prompting

    • No examples are provided, only instructions.
    • Works best when the task is common or well-aligned with the model’s pretraining.
    • Example:
      Instruction: Translate the following English sentence into French.  
      Data: "How are you?"  

      Output: “Comment ça va ?”

2. One-shot prompting

    • A single example is given to demonstrate the expected behavior.
    • Useful when the task requires clarity in format or style.
    • Example:
      User: Translate the following English sentence into French.  
      Example Input: "Good morning" → Example Output: "Bonjour"  
      Task Input: "How are you?"  

      Output: “Comment ça va ?”

3. Few-shot prompting

    • Multiple examples are given before the actual task.
    • Works well when tasks are ambiguous or domain-specific.
    • Example:
      Task: Classify the sentiment of the following reviews as Positive or Negative.  

      Review: "I love this phone, the battery lasts long." → Positive  
      Review: "The screen cracked within a week." → Negative  
      Review: "Excellent sound quality and fast processor." → Positive  

      Now classify: "The camera is blurry and disappointing."  

      Output: Negative

The following diagram illustrates the examples of in-context learning.

Figure 6: Examples of in-context learning

      4.3 Importance of Role Differentiation

      When writing few-shot prompts, clearly distinguishing roles (e.g., User: and Assistant: or Q: and A:) helps the model mimic the structure consistently. Without role markers, the model may drift into producing unstructured responses.

      For example:

      User: What is 2 + 2?  
      Assistant: 4  
      User: What is 5 + 3?  
      Assistant: 8  
      User: What is 7 + 6?  
      Assistant:

      This encourages the model to continue in the same call-and-response pattern.
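A few-shot prompt with role markers can be generated from a list of examples. The sketch below is a minimal illustration; the helper name few_shot_prompt is an assumption, and the role labels are parameters so that Q:/A: can be used instead.

```python
def few_shot_prompt(examples, question, user="User", assistant="Assistant"):
    """Format (question, answer) pairs with role markers, ending on an open assistant turn."""
    lines = []
    for q, a in examples:
        lines.append(f"{user}: {q}")
        lines.append(f"{assistant}: {a}")
    lines.append(f"{user}: {question}")
    lines.append(f"{assistant}:")  # left open so the model completes the answer
    return "\n".join(lines)

examples = [("What is 2 + 2?", "4"), ("What is 5 + 3?", "8")]
prompt = few_shot_prompt(examples, "What is 7 + 6?")
print(prompt)
```

Passing zero, one, or several pairs in examples produces a zero-shot, one-shot, or few-shot prompt with the same structure.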

      4.4 Benefits of In-Context Learning

      • Flexibility – You can “train” the model on a new task instantly without modifying its parameters.
      • Rapid prototyping – Great for testing new use cases before investing in fine-tuning.
      • Control – Helps enforce formatting (e.g., JSON, tables, bullet points).

      4.5 Limitations of In-Context Learning

      • Context length constraints – Too many examples may exceed the model’s context window.
      • Random sampling – Even with examples, the model may ignore instructions if randomness (temperature, top_p) is high.
      • Cost & latency – Longer prompts = higher compute and inference cost.
      • Inconsistency – The same examples may yield slightly different outputs.

      4.6 Advanced Variants of ICL

      • Instruction + Demonstration Hybrid: Combine explicit task instructions with few-shot examples for stronger guidance.
      • Chain-of-Thought with ICL: Provide examples that include reasoning steps, so the model learns to “think out loud” before answering.
      • Style Transfer with ICL: Use few-shot examples to enforce a particular writing style (e.g., Shakespearean, academic, casual).

      5. Chain Prompting: Breaking up the Problem

      When dealing with complex tasks, asking a large language model (LLM) to solve everything in a single prompt often leads to suboptimal results. The model may lose focus, misinterpret requirements, or generate incomplete answers. Chain prompting is a structured strategy where we break down a large problem into smaller subtasks, design prompts for each subtask, and then link them sequentially, passing outputs from one prompt as inputs to the next. This creates a pipeline of prompts that together achieve the final solution.

      This approach mirrors how humans naturally solve complex problems—by breaking them into manageable steps rather than attempting everything at once.

      5.1 Key Benefits of Prompt Chaining

      1. Better Performance
        • By focusing each prompt on a single subtask, the LLM can generate more accurate and high-quality responses.
        • Reduces cognitive overload for the model.
      2. Transparency
        • Each intermediate step in the chain is visible and explainable.
        • Makes it easier for developers and users to trace how the final output was constructed.
      3. Controllability and Reliability
        • Developers can adjust or fine-tune only the prompts for the weaker subtasks instead of rewriting the entire large prompt.
        • More control over model behavior.
      4. Debugging
        • Since outputs are broken into stages, it’s easier to identify where an error occurs and fix it.
      5. Incremental Improvement
        • You can evaluate the performance of each subtask independently and selectively improve weak links in the chain.
      6. Conversational Assistants
        • Useful for designing chatbots where conversation naturally involves sequential reasoning (e.g., clarifying intent → retrieving information → generating response).
      7. Personalization
        • Chains can be designed to collect user preferences at one step and then apply those preferences consistently across subsequent prompts.

      5.2 Common Use Cases

      1. Response Validation
        • Prompt 1: Generate an answer.
        • Prompt 2: Ask the model (or another model) to evaluate correctness, consistency, or bias in the answer.
        • Example: LLM generates an explanation of a concept, then another LLM verifies if the explanation is factually correct.
      2. Parallel Prompts
        • Sometimes, different subtasks can be run simultaneously.
        • Example: One prompt generates a list of features, another generates customer pain points, and later prompts merge them to design marketing copy.
      3. Creative Writing / Storytelling
        • Prompt 1: Generate character descriptions.
        • Prompt 2: Use characters to generate a plot outline.
        • Prompt 3: Expand the outline into a full story.
      4. Business Use Case – Marketing Flow
        • Step 1 (Prompt 1): Generate a catchy product name.
        • Step 2 (Prompt 2): Using the product name + product features, generate a short slogan.
        • Step 3 (Prompt 3): Using the product name, features, and slogan, generate a full sales pitch.
        • This modular approach ensures the final pitch is consistent, creative, and logically structured.

      5.3 Prompt Chain Example

      The following example illustrates the prompt chain that first creates a product name, then uses this name with product features to create a slogan, and finally uses features, product name, and slogan to create the sales pitch.

      Figure 7: Example of a prompt chain

      Step 1 – Product Naming

      Instruction: “Suggest a creative name for a new smartwatch that focuses on health tracking and long battery life.”
      Output: “PulseMate”

      Step 2 – Slogan Generation

      Instruction: “Using the product name ‘PulseMate’ and the features (health tracking, long battery life), create a short catchy slogan.”
      Output: “PulseMate – Your Health, Powered All Day.”

      Step 3 – Sales Pitch

      Instruction: “Using the product name ‘PulseMate,’ its slogan ‘Your Health, Powered All Day,’ and the features (health tracking, long battery life), write a compelling sales pitch for customers.”
      Output: “Meet PulseMate, the smartwatch designed to keep up with your lifestyle. Track your health seamlessly while enjoying a battery that lasts for days. PulseMate—Your Health, Powered All Day.”
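The three steps above can be sketched as a small sequential chain. The function run_chain and the stand-in fake_llm are hypothetical; in practice fake_llm would be replaced by a call to a real model, and its canned replies only mirror the example outputs above.

```python
from typing import Callable

def run_chain(llm: Callable[[str], str], features: str) -> dict:
    """Run three prompts in sequence, feeding each output into the next prompt."""
    name = llm(f"Suggest a creative name for a new smartwatch that focuses on {features}.")
    slogan = llm(f"Using the product name '{name}' and the features ({features}), "
                 "create a short catchy slogan.")
    pitch = llm(f"Using the product name '{name}', its slogan '{slogan}', and the "
                f"features ({features}), write a compelling sales pitch for customers.")
    return {"name": name, "slogan": slogan, "pitch": pitch}

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned text."""
    if "sales pitch" in prompt:
        return "Meet PulseMate. Your Health, Powered All Day."
    if "slogan" in prompt:
        return "Your Health, Powered All Day."
    return "PulseMate"

result = run_chain(fake_llm, "health tracking and long battery life")
print(result)
```

Because every intermediate value is returned, each stage can be inspected or replaced on its own.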

      5.4 Variants of Prompt Chaining

      • Sequential Chaining – Output of one prompt feeds directly into the next (step-by-step). The above example in Figure 7 demonstrates sequential chaining.
      • Branching Chaining – One output is used to create multiple different paths of prompts.
      • Merging Chains – Combine results from different parallel chains into a unified final response.
      • Iterative Chaining – Loop a prompt multiple times for refinement (e.g., “revise this until it’s concise and clear”).

        6. Reasoning with Generative Models

        LLMs don’t “reason” like humans. They excel at pattern completion over very large text corpora. With careful prompting, scaffolding, and verification, we can simulate aspects of reasoning and markedly improve reliability.

        6.1 System 1 vs. System 2 (Kahneman) — and LLMs

        • System 1 (fast, intuitive): In LLMs this looks like single-shot answers, low token budget, low/no deliberation. Good for well-trodden tasks (grammar fixes, casual Q&A).
        • System 2 (slow, deliberate): In LLMs this is multi-step prompting, intermediate reasoning, tool use (calculator/RAG), sampling multiple candidates, and verification. Good for math, logic, policy checks, multi-constraint generation, and anything high-stakes.

        In practice: choose System 1 for speed/low risk; escalate to System 2 scaffolds when accuracy, traceability, or multi-constraint synthesis matters.

        6.2 Techniques to Induce Deliberation

        6.2.1 Chain-of-Thought (CoT): “Think before answering”

        Elicit intermediate reasoning steps prior to the final answer.

        Zero-shot CoT trigger (minimal):

        You are solving a reasoning task.
        First, think step-by-step in brief bullet points.
        Then, give the final answer on a new line prefixed with "Answer:".
        Question: <problem>

        Few-shot CoT (when format matters): include 1–3 worked examples showing short, crisp reasoning and a clearly marked Answer line.

        Tips

        • Keep thoughts succinct to reduce cost and drift.
        • For production UIs, you can ask the model to hide the rationale and output only the final answer + a confidence or citation list (see “Reasoning privacy” below).

        When to use: arithmetic/logic puzzles, planning, constraint satisfaction, data transformation with edge cases.

        The following figure compares standard prompting with CoT prompting:

        Figure 9: Chain-of-thought example; reasoning process is highlighted – source [3]

        The following is an example of zero-shot chain-of-thought.

        Figure 10: Example of zero-shot chain-of-thought – source[1]

        6.2.2 Self-Consistency: sample multiple rationales and vote

        Rather than trusting the first reasoning path, sample k solutions and aggregate.

        Template

        Task: <problem>
        
        Instruction:
        Generate a short, step-by-step rationale and final answer.
        Vary your approach each time.
        
        [Run this prompt k times with temperature ~0.7–1.0]
        Aggregator:
        - Extract the final answer from each sample.
        - Choose the majority answer (tie-break: pick the one supported by the clearest rationale).
        - Return "Final:" <answer> and "Support count:" <m/n>.

        Practical defaults

        • k = 5–15 (trade accuracy vs. latency/cost)
        • temperature: 0.7–1.0
        • top_p: 0.9–1.0

        When to use: problems with one correct output but many valid reasoning paths (math, logical deduction, label inference).

        The following diagram illustrates the concept of self-consistency.

        Figure 11: Example of self-consistency in CoT[4]

        6.2.3 Tree of Thoughts (ToT): explore and evaluate branches

        Generalizes CoT into a search over alternative “thoughts” (states). You expand multiple partial solutions, score them, prune weak ones, and continue until a budget is reached.

        Lightweight ToT pseudo-workflow

        state0 = problem description
        frontier = [(state0, 0)]
        
        for depth in 1..D:
          candidates = []
          for (s, _) in frontier:
            thoughts = LLM("Propose 2-3 next-step thoughts for: " + s)
            for t in thoughts:
              v = LLM("Rate this partial approach 1-5 for promise. Be strict.\nThought: " + t)
              candidates.append((s + "\n" + t, v))   # keep the chain so far, not only the last thought
          frontier = top_k(candidates, by=v, k=K)
        
        (best, _) = argmax(frontier, by=v)
        answer = LLM("Given this best chain of thoughts, produce the final answer:\n" + best)

        Tuning knobs

        • D (max depth), K (beam width), value function (how you score thoughts), and token budget.
        • Use “look-ahead” prompts: “Simulate next two steps; if dead-end, backtrack.”
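The pseudo-workflow can be made concrete as a small beam search. This is a sketch under stated assumptions: propose and score stand in for the two LLM calls, and the toy versions below (reaching a total of 8 by adding numbers) are invented only so the example runs without a model.

```python
from typing import Callable, List, Tuple

def tree_of_thoughts(problem: str,
                     propose: Callable[[str], List[str]],
                     score: Callable[[str], float],
                     depth: int = 2,
                     beam_width: int = 2) -> Tuple[str, float]:
    """Beam search over chains of thoughts; returns the best chain and its score."""
    frontier = [(problem, 0.0)]
    for _ in range(depth):
        candidates = []
        for state, _ in frontier:
            for thought in propose(state):
                chain = state + "\n" + thought       # extend the chain, keep its history
                candidates.append((chain, score(chain)))
        if not candidates:
            break
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        frontier = candidates[:beam_width]           # prune the weaker branches
    return max(frontier, key=lambda pair: pair[1])

def toy_propose(state: str) -> List[str]:
    """Hypothetical stand-in for the 'propose next thoughts' LLM call."""
    return ["add 1", "add 3", "add 5"]

def toy_score(chain: str) -> float:
    """Hypothetical stand-in for the 'rate this approach' LLM call."""
    total = sum(int(line.split()[1]) for line in chain.splitlines()[1:])
    return float(-abs(8 - total))                    # closer to 8 scores higher

best_chain, best_score = tree_of_thoughts("Reach a total of 8", toy_propose, toy_score)
print(best_chain)
print(best_score)
```

Here depth plays the role of D and beam_width the role of K; the value function is whatever score returns.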

        When to use: multi-step planning (itineraries, workflows), puzzle solving, coding strategies, complex document transformations.

        The following diagram illustrates the various approaches to problem-solving with LLMs. Each rectangular box represents a thought.

        Figure 12: Various approaches to problem-solving with LLMs.

        6.2.4 Related, practical reasoning scaffolds

        • ReAct (Reason + Act): Interleave “Thought → Action (tool call/RAG) → Observation” until done. Great for tasks that need tools, search, or databases.
        • Program-of-Thoughts (PoT): Ask the model to output code (e.g., Python) to compute the answer; execute it; return result. Excellent for math, data wrangling, and reproducibility.
        • Debate / Critic-Judge: Have model A propose an answer, model B critique it, and a judge (or the same model) select/merge. Pairs well with self-consistency.
        • Plan-then-Execute: Prompt 1 creates a plan/checklist; Prompt 2 executes step by step; Prompt 3 verifies outputs against the plan.
        • Retrieval-Augmented Reasoning: Prepend cited context (docs, policies) and require grounded (“quote-and-justify”) answers.

        6.3 Putting it together: a robust System-2 pipeline

        Use case: Policy compliance check for marketing copy.

        1. Extract constraints (CoT):
          “List policy rules relevant to social ads, each with an ID and short paraphrase.”
        2. Assess violations (ReAct/PoT):
          For each rule, analyze the ad text; return pass|fail with span references.
        3. Self-consistency vote:
          Sample assessments 7× and majority-vote each rule outcome.
        4. Summarize & justify:
          Compose a final verdict with a table of rules, decisions, and cited spans.
        5. Verifier pass:
          A separate prompt re-checks logical consistency and that every failure has evidence.
        6. Guarded output:
          Enforce schema (JSON) and redact PII.

        This gives you accuracy (deliberation), transparency (artifacts per step), and control (schema + verifier).
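The final "guarded output" step can be sketched with the standard library alone. The field names rule_id, decision, and evidence are assumptions chosen for this example, and the e-mail pattern is only a placeholder for a real PII filter.

```python
import json
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
REQUIRED_FIELDS = {"rule_id", "decision", "evidence"}   # assumed schema
ALLOWED_DECISIONS = {"pass", "fail"}

def guard_output(raw: str) -> list:
    """Parse a model verdict, enforce a simple schema, and redact e-mail addresses."""
    records = json.loads(raw)                # raises ValueError on malformed JSON
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of rule verdicts")
    cleaned = []
    for record in records:
        if not isinstance(record, dict) or set(record) != REQUIRED_FIELDS:
            raise ValueError(f"unexpected record: {record!r}")
        if record["decision"] not in ALLOWED_DECISIONS:
            raise ValueError(f"invalid decision: {record['decision']!r}")
        if record["decision"] == "fail" and not record["evidence"]:
            raise ValueError(f"rule {record['rule_id']} failed without evidence")
        evidence = EMAIL_PATTERN.sub("[REDACTED]", record["evidence"])
        cleaned.append({**record, "evidence": evidence})
    return cleaned

# Invented model output, used only to exercise the checks.
raw_verdict = json.dumps([
    {"rule_id": "R1", "decision": "pass", "evidence": ""},
    {"rule_id": "R2", "decision": "fail",
     "evidence": "Contact jane@example.com for a guaranteed cure"},
])
cleaned_verdict = guard_output(raw_verdict)
print(cleaned_verdict)
```

Rejecting a response outright and re-prompting is usually safer than trying to repair a verdict that fails these checks.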

        6.4 Operational Guidance

        6.4.1 Prompt templates

        CoT (short)

        Solve the problem. First give 3-5 brief reasoning bullets. 
        Then output the final result as: "Answer: <value>".
        Question: <...>

        Self-consistency runner (controller code)

        answers = []
        for i in range(k):
          ans = call_llm(prompt, temperature=0.8, top_p=0.95)
          answers.append(extract_final(ans))
        final = majority_vote(answers)
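The runner above leaves call_llm, extract_final, and majority_vote undefined. A minimal runnable sketch follows, assuming the model marks its result with an "Answer:" prefix as in the CoT template; fake_llm and its canned replies are hypothetical stand-ins for sampled model outputs.

```python
import re
from collections import Counter
from itertools import cycle

def extract_final(text: str):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", text)
    return matches[-1].strip() if matches else None

def majority_vote(answers):
    """Return the most common non-empty answer and its support count."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None, 0
    return counts.most_common(1)[0]

def self_consistency(call_llm, prompt, k=5):
    """Sample k rationales, extract each final answer, and take the majority."""
    samples = [call_llm(prompt, temperature=0.8, top_p=0.95) for _ in range(k)]
    answers = [extract_final(sample) for sample in samples]
    final, support = majority_vote(answers)
    return final, support, k

# Hypothetical stand-in for a sampled model: cycles through canned rationales.
_canned = cycle([
    "3 + 4 = 7\nAnswer: 7",
    "4 + 3 = 7\nAnswer: 7",
    "3 * 4 = 12\nAnswer: 12",
    "Count up four from 3.\nAnswer: 7",
    "No idea.",
])

def fake_llm(prompt, temperature, top_p):
    return next(_canned)

final, support, total = self_consistency(fake_llm, "What is 3 + 4?", k=5)
print(f"Final: {final}  Support count: {support}/{total}")
```

Samples without a parsable answer are dropped before voting, so the support count also hints at how often the output format was followed.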

        ReAct skeleton

        Thought: I need the latest spec section.
        Action: search("<query>")
        Observation: <top snippet>
        Thought: Summarize the relevant passage and apply the rule.
        ...
        Final Answer: <concise verdict + citation>
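The skeleton can be driven by a short controller loop that parses each Action line, runs the matching tool, and appends the Observation. This is a sketch under assumptions: the Action: tool("argument") syntax, the scripted_llm stand-in, and the toy search tool with its rule text are all invented for illustration.

```python
import re

ACTION_PATTERN = re.compile(r'Action:\s*(\w+)\("(.*)"\)')

def react_loop(llm, tools, question, max_steps=5):
    """Alternate model steps and tool calls until the model emits a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip(), transcript
        match = ACTION_PATTERN.search(step)
        if match:
            tool_name, argument = match.groups()
            tool = tools.get(tool_name)
            observation = tool(argument) if tool else f"unknown tool {tool_name}"
            transcript += f"Observation: {observation}\n"
    return None, transcript   # step budget exhausted without a final answer

def scripted_llm(transcript: str) -> str:
    """Hypothetical stand-in for a model: searches once, then answers."""
    if "Observation:" not in transcript:
        return 'Thought: I need the relevant rule.\nAction: search("sponsored post labelling")'
    return ("Thought: The rule covers this case.\n"
            "Final Answer: Label the post as sponsored (Rule 4.2).")

# Toy tool registry; the rule text is made up for the example.
tools = {"search": lambda query: "Rule 4.2: sponsored posts must be labelled."}

answer, transcript = react_loop(scripted_llm, tools, "Does this ad need a label?")
print(transcript)
print(answer)
```

The max_steps limit keeps a model that never emits a final answer from looping indefinitely.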

        ToT node expansion

        Propose 3 distinct next-step ideas to advance the solution.
        For each: give a one-sentence rationale and a 1-5 promise score.
        Return JSON: [{"idea":..., "rationale":..., "score":...}]

        6.5 Evaluation & QA

        • Task accuracy: EM/F1, pass@k, BLEU/ROUGE (for summarization), or domain metrics.
        • Process metrics: step validity rate, verifier agreement, citation coverage.
        • Ablations: compare single-shot vs. CoT vs. CoT+SC vs. ToT to quantify lift.
        • Cost/latency: track tokens per step; cache intermediate artifacts.

        6.6 Safety & reliability

        • Reasoning privacy: In user-facing products, prefer internal deliberation (hidden rationale) and a concise final answer to avoid leaking sensitive chain-of-thought.
        • Guardrails: constrain outputs (JSON schema, regex, grammars), require citations for claims, and run a policy/PII filter on intermediate and final outputs.
        • Determinism knobs: lower temperature and/or use self-consistency with majority vote for stable outputs.

        6.7 When not to use heavy reasoning

        • Simple, well-known tasks where single-shot responses are accurate.
        • Ultra-tight latency budgets.
        • Very small models without enough capacity/context window.

        Quick chooser: which technique when?

        Situation                           | Recommended scaffold
        Arithmetic/logic puzzle             | CoT → Self-consistency (k=5–15)
        Multi-step planning / puzzle search | ToT (small D, K), optional ReAct for tools
        Needs external data/tools           | ReAct (with retrieval/calculator/code)
        Deterministic data transformation   | PoT (code execution) + schema constraints
        High-stakes, audited outputs        | CoT/ToT + Verifier + Guardrails + Logged artifacts

        7. Output Verification

        In real-world deployments, verifying and controlling the output of generative AI models is crucial to ensure safety, robustness, and reliability. LLMs, while powerful, are prone to errors, hallucinations, ethical risks, or unstructured responses that can cause failures in production systems.

        Without proper verification, issues such as malformed data, offensive content, or incorrect facts can undermine user trust and lead to business or compliance risks.

        7.1 Why Output Verification Matters

        1. Structured Output
          • Many applications require the output in machine-readable formats (e.g., JSON, XML, CSV).
          • An unstructured answer can break downstream systems expecting strict schemas.
        2. Valid Output Choices
          • Even if the model is instructed to choose among fixed options (e.g., “positive” or “negative”), it may generate something outside the list (e.g., “neutral” or “very positive”).
          • Output validation ensures strict adherence to predefined categories.
        3. Ethical Compliance
          • Outputs must be free of profanity, bias, harmful stereotypes, or PII (Personally Identifiable Information).
          • Regulatory compliance (GDPR, HIPAA, etc.) requires strict filtering of sensitive or discriminatory outputs.
        4. Accuracy and Reliability
          • LLMs can hallucinate — produce factually wrong but confident-sounding information.
          • Verification steps such as grounding with external knowledge bases or post-checking factual claims can prevent misinformation.

        7.2 Methods to Control Output

        Apart from tweaking generation parameters like temperature (randomness) and top_p (nucleus sampling), there are three primary strategies for enforcing correct outputs:

        7.2.1 Providing Examples (Few-Shot Structured Prompts)

        • How it works:
          • Supply the model with examples of desired output in the correct format (e.g., JSON, Markdown tables).
          • The model uses these as patterns to mimic.
        • Example Prompt:
        {
          "name": "Alice",
          "sentiment": "positive"
        }

        Now classify the following:
        Input: “The movie was fantastic!”
        Output:

        Limitations:

        • Models may still deviate, especially under ambiguous inputs.
        • Reliability varies across models — some are better at following formatting instructions than others.

        7.2.2 Grammar-Based Constrained Sampling

        Instead of relying only on examples, grammars and constraints can be enforced at the token generation level. This guarantees that outputs match the expected structure.

        • Techniques & Tools:
          🔹 Guidance
          • A framework for programmatically controlling LLM outputs.
          • Uses regex, context-free grammars (CFGs), and structured templates.
          • Supports conditionals, loops, and tool calls inside prompt templates.
          • Advantage: Reduced cost and latency compared to brute-force fine-tuning.
          🔹 Guardrails
          • Python framework to build safe, reliable AI pipelines.
          • Key features:
            • Input/Output Guards to catch risks (bias, toxicity, PII leaks).
            • Schema enforcement (ensures outputs comply with JSON, XML, etc.).
            • Ecosystem of reusable validators via Guardrails Hub.
          • Example: Ensuring LLM output is a safe, validated JSON object representing a chatbot reply.
          🔹 LMQL (Language Model Query Language)
          • Specialized programming language for LLM prompting.
          • Provides types, templates, and constraints for robust prompting.
          • Runtime ensures the model adheres to the defined schema during decoding.
        • Low-level Constrained Decoding Example (llama-cpp-python):
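        # `llm` is a loaded llama_cpp.Llama instance. response_format is a
        # parameter of create_chat_completion, not of the plain llm(...) call.
        response = llm.create_chat_completion(
            messages=[{"role": "user", "content": "Classify the sentiment."}],
            response_format={"type": "json_object"},
        )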

        Forces the model to output a valid JSON object instead of free text.

        7.2.3 Fine-Tuning for Desired Outputs

        • How it works:
          • Retrain or fine-tune the base model on domain-specific datasets that already contain the desired output style.
          • Example: A customer support LLM fine-tuned only on safe, structured responses in JSON.
        • Benefits:
          • Reduces variance and unpredictability.
          • Makes structured outputs more native to the model (less prompt engineering overhead).
        • Limitations:
          • Requires labeled data in the target output format.
          • Costly and time-consuming compared to prompting or grammar constraints.

        7.3 Output Verification Pipeline (Best Practice)

        A robust production system often combines multiple techniques:

        1. Prompt-level control → Provide few-shot examples of structured output.
        2. Grammar/Constraint enforcement → Enforce schema compliance (Guidance, Guardrails, LMQL, or constrained decoding APIs).
        3. Post-generation validation → Apply validators for ethics, factuality, and compliance.
        4. Fallback mechanism → If verification fails, rerun the model with tighter constraints or route to a human-in-the-loop system.
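
The four stages above can be sketched as a small retry loop. This is a minimal illustration under stated assumptions: the allowed labels, `is_valid`, `classify_with_fallback`, and the `fake_generate` stub are all hypothetical, and a real system would plug in an actual model call and richer validators.

```python
import json

ALLOWED = {"positive", "negative"}

def is_valid(raw):
    """Post-generation validation: valid JSON with an allowed sentiment label."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    label = data.get("sentiment")
    return isinstance(label, str) and label in ALLOWED

def classify_with_fallback(generate, text, max_attempts=3):
    """Call generate(text, attempt) until the output validates, else escalate."""
    for attempt in range(max_attempts):
        raw = generate(text, attempt)
        if is_valid(raw):
            return {"status": "ok", "result": json.loads(raw), "attempts": attempt + 1}
    return {"status": "needs_human_review", "result": None, "attempts": max_attempts}

# Hypothetical generator: free text on the first try, valid JSON on retry
# (standing in for a rerun with tighter constraints).
def fake_generate(text, attempt):
    return "It sounds positive!" if attempt == 0 else '{"sentiment": "positive"}'

outcome = classify_with_fallback(fake_generate, "The movie was fantastic!")
```

Passing the attempt number to the generator lets it tighten constraints on each retry, for example by switching from few-shot prompting to constrained decoding.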

        Output verification transforms LLMs from unpredictable text generators into reliable components of enterprise systems. By combining structured examples, constrained grammar, and fine-tuning, developers can build trustworthy AI applications that are safe, accurate, and production-ready.

        References

        1. Book: O'Reilly – Hands-On Large Language Models – Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
        2. https://www.promptingguide.ai/
        3. Paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei et al., Google Research, Brain Team
        4. Paper: Self-Consistency Improves Chain of Thought Reasoning in Language Models by Wang et al., Google Research, Brain Team
        5. Paper: Tree of Thoughts: Deliberate Problem Solving with Large Language Models by Shunyu Yao et al., NeurIPS 2023
        6. Paper: Report on a General Problem-Solving Program by A. Newell et al., IFIP Congress, 1959

        Text Clustering and Topic Modeling using Large Language Models (LLMs)

        1. Introduction

        Text clustering is an unsupervised approach that helps in discovering patterns in data. Grouping similar texts according to their semantic content, meaning, relationships, etc. is the goal of text clustering. This makes it easier to cluster vast amounts of unstructured text and perform exploratory data analysis quickly. With recent advancements in large language models (LLMs), we can obtain extremely precise contextual and semantic representations of text. This has improved text clustering’s efficacy even more. Use cases for text clustering include identifying outliers, accelerating labelling, identifying data that has been erroneously labelled, and more.

        Topic modelling facilitates the identification of (abstract) topics that arise in huge textual data collections. Clusters of text documents can be given meaning through this method.

        We will learn how to use embedding models for text clustering, how BERTopic builds topic modeling on top of text clustering, and how an LLM can generate a label for a topic from its keywords.

        2. Pipeline for Text Clustering

        The following diagram depicts the pipeline for text clustering that consists of the following three steps:

        1. Use an embedding model to transform the input documents into embeddings.
        2. Using a dimensionality reduction model, lower the dimensionality of embeddings.
        3. Use a cluster model to identify groups of documents that share semantic similarities.

        2.1 Embedding Documents

        Embeddings are numerical representations of text that attempt to capture its meaning. In the first step, we transform the documents into embeddings using a model optimized for semantic similarity tasks. The Massive Text Embedding Benchmark (MTEB) leaderboard helps in selecting an embedding model for our requirements. For example, “thenlper/gte-small” is a small but performant model with fast inference.

        2.2 Dimensionality Reduction of Embeddings

        It is difficult for clustering techniques to identify meaningful clusters if the dimensionality of the data is high. Dimensionality reduction techniques find low-dimensional representations that preserve the global structure of the data. They act as compression techniques, so they do not remove dimensions arbitrarily. Popular algorithms are Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). Compared to PCA, UMAP is better at handling nonlinear relationships and structures.

        2.3 Clustering the Reduced Embeddings

        The following diagram depicts the methods of text clustering:

        Density-based clustering algorithms determine the number of clusters themselves and do not force every data point into a cluster. Data points that do not belong to any cluster are marked as outliers. Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a hierarchical variation of the DBSCAN algorithm that specializes in finding dense (micro-)clusters without being told the number of clusters in advance.
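
To make the outlier behaviour concrete, here is a simplified, pure-Python DBSCAN on toy 2-D points. It is only a sketch of the flat, non-hierarchical algorithm (not HDBSCAN itself), and the `eps` and `min_samples` values are illustrative assumptions; real pipelines would run a library implementation on the reduced embeddings.

```python
import math

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:      # not a core point: provisionally noise
            labels[i] = NOISE
            continue
        cluster += 1                      # start a new cluster from this core point
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:        # border point reachable from a core point
                labels[j] = cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:  # j is a core point too: keep expanding
                queue.extend(nbrs)
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

The number of clusters is never passed in: two dense groups are discovered from the data, and the isolated point at (50, 50) is labelled -1 instead of being forced into a cluster.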

        3. Text Clustering to Topic Modeling

        Topic modelling is the term used to describe the process of identifying themes or hidden topics in a set of textual data. Traditionally, it involves finding a group of keywords or phrases that best represent and capture the essence of the topic. We need to understand the meaning of the topic through these keywords or phrases. Latent Dirichlet Allocation (LDA) is one such algorithm. Let’s discuss BERTopic in the following sections, which is a highly modular text clustering and topic modeling framework.

        3.1 BERTopic: A Modular Topic Modeling Framework

        Topic modeling builds on the three steps of text clustering: the output of the third clustering step is fed to the fourth step. The following diagram depicts the 4th and 5th steps, which are specific to topic modeling:

        The 4th step calculates the class-based term frequency, i.e., the frequency (tf) of word (X) in cluster (C). In the 5th step, this term is multiplied by the IDF (inverse document frequency). The result is known as c-TF-IDF (class-based TF-IDF). The goal is to give more weight to words that are frequent within a cluster and less weight to words that appear across all clusters.
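
The two steps can be sketched in a few lines of Python. This assumes the weighting used in BERTopic's c-TF-IDF formulation, tf(x, c) × log(1 + A / f(x)), where A is the average number of words per cluster and f(x) is the frequency of word x across all clusters. The whitespace tokenizer and toy clusters are illustrative only, and raw counts are used without the normalization a library implementation may apply.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters: {cluster_id: [doc, ...]} -> {cluster_id: {word: weight}}."""
    # Step 4: treat each cluster as one big document and count its words.
    tf = {c: Counter(" ".join(docs).lower().split()) for c, docs in clusters.items()}
    # f(x): frequency of each word across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(tf)
    # Step 5: weight the term frequency by log(1 + A / f(x)).
    return {
        c: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for c, counts in tf.items()
    }

clusters = {
    "sports": ["the team won the match", "the match was close"],
    "food": ["the soup was hot", "the bread was fresh"],
}
weights = c_tf_idf(clusters)
top_sports = max(weights["sports"], key=weights["sports"].get)
```

In this toy data, "the" is the most frequent word in the sports cluster, but because it also appears in the food cluster its weight drops below that of "match".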

        The following diagram depicts the full pipeline from clustering to topic modeling. Though topic modeling follows clustering, they are largely independent of each other, and each component is modular. BERTopic can be customised, and we can choose another algorithm instead of the default ones.

        3.2 Reranking in BERTopic

        c-TF-IDF does not take semantic structure into account, so BERTopic uses representation models (e.g. KeyBERTInspired, Maximal Marginal Relevance (MMR), spaCy) to rerank the keywords of each topic produced in the 5th step. Reranking is applied per topic rather than per document. Many of the representation models are LLMs. With this step, the pipeline extends to the following:

        3.3 Using LLM to generate a label for Topic

        The following diagram explains how keywords combined with documents, along with a prompt, can be passed on to LLM for generating the label for the topic given keywords.

        Final pipeline is as follows:

        Final detailed pipeline:

        References

        1. O'Reilly – Hands-On Large Language Models – Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
        2. Hugging Face – https://huggingface.co/

        Text Classification using Large Language Models (LLMs)

        1. Introduction

        A common task in natural language processing (NLP) is text classification. Use cases of text classification include sentiment analysis, intent detection, entity extraction, and language detection. This article will delve into how to use LLMs for text classification. We will see representation models and generative models. Under the representation model, we will see how to use task-specific models and embedding models to classify the text. Under the generative models, we will see open source and closed source models. While both generative and representation models can be applied to classification, they take different approaches.

        2. Text Classification with Representation Models

        Task-specific models and embedding models are two types of representation models that can be used for text classification. To obtain task-specific models, representation models, like bidirectional encoder representations from transformers (BERT), are trained for a particular task, like sentiment analysis. On the other hand, general-purpose models such as embedding models can be applied to a range of tasks outside classification, such as semantic search.

        As shown in the diagram below, representation models are kept frozen (untrainable) when used for text classification. A task-specific model has already been trained for the given task, so it can classify the input text directly. With an embedding model, we first generate embeddings for the texts in the training set, then train a classifier on those embeddings and their corresponding labels. Once the classifier is trained, it can be used for classification.

        3. Model Selection

        The factors we should look into for selecting the model for text classification:

        1. How does it fit the use case?
        2. What is its language capability?
        3. What is the underlying architecture?
        4. What is the size of the model?
        5. How is the performance? etc.

        Underlying Architecture

        BERT is an encoder-only architecture and is a popular choice for creating task-specific models and embedding models, and falls into the category of representation models. Generative Pre-trained Transformer (GPT) is a decoder-only architecture that falls into the generative models category. Encoder-only models are normally small in size. Variations of BERT are RoBERTa, DistilBERT, DeBERTa, etc. For task-specific use cases such as sentiment analysis, Twitter-RoBERTa-base can be a good starting point. For embedding models sentence-transformers/all-mpnet-base-v2 can be a good starting point as this is a small but performant model.

        4. Text Classification using Task-specific Models

        This is straightforward. Text is passed to the tokenizer, which splits it into tokens. These tokens are consumed by the task-specific model to predict the class of the text.

        This works if we can find a task-specific model for our use case. Otherwise, if we have to fine-tune the model ourselves, we need to check whether we have sufficient budget (time, cost) for it. Another option is to use a general-purpose embedding model.

        5. Text Classification using Embedding Models

        We can generate features using an embedding model rather than directly using the task-specific representation model for classification. These features can be used for training a classifier such as logistic regression.

        5.1 What if we do not have the labeled data?

        If we have the definition of the labels but no labeled data, we can use what is called “zero-shot classification“. A zero-shot model predicts the labels of input text even though it was not trained on them. The following diagram depicts the concept.

        We can use zero-shot classification using embeddings. We can describe our labels based on what they should represent. The following diagram describes the concept.

        To assign the labels to the input text/document, we can calculate the cosine similarity with the label embeddings to check which label it is close to.
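
A minimal sketch of that assignment step follows. The 3-dimensional vectors are hypothetical stand-ins for real embeddings of the label descriptions and the input document; in practice they would come from the embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot(doc_vec, label_vecs):
    """Return the label whose embedding is most similar to the document's."""
    return max(label_vecs, key=lambda name: cosine(doc_vec, label_vecs[name]))

# Hypothetical embeddings of the label descriptions
# ("A negative review", "A positive review") and of one input document.
label_vecs = {"negative": [0.9, 0.1, 0.2], "positive": [0.1, 0.9, 0.3]}
doc_vec = [0.2, 0.8, 0.1]
predicted = zero_shot(doc_vec, label_vecs)
```

Because only the label descriptions are embedded, changing the label set requires no retraining: new descriptions are embedded and compared in the same way.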

        6. Text Classification with Generative Models

        Generative models are trained for a wide variety of tasks, so they will not perform text classification out of the box. To make a generative model understand our context, we need prompt engineering. The prompt must be written so that the model understands what it is expected to do, what the candidate labels are, and so on.

        6.1 Using Text-to-Text Transfer Transformer (T5)

        The following diagram summarizes the different categories of the models:

        The following diagram depicts the training steps:

        We need to prefix each document with the prompt “Is the following sentence positive or negative?”

        6.2 ChatGPT for Classification

        The following diagram describes the training procedure of ChatGPT:

        The model is trained using human preference data to generate text that resembles human preference.

        For text classification, the following is a sample prompt:

        prompt = """Predict whether the following document is a positive or negative movie review:

        [DOCUMENT]

        If it is positive return 1 and if it is negative return 0. Do not give any other answers.
        """

        References

        1. O'Reilly – Hands-On Large Language Models – Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
        2. Hugging Face – https://huggingface.co/

        Inside the LLM Inference Engine: Architecture, Optimizations, Tools, Key Concepts and Best Practices

        1. Introduction

        LLM inference and serving refer to the process of deploying large language models and making them accessible for use — whether locally for personal projects or in production for large-scale applications.
        Depending on your needs, you may opt for a lightweight local deployment or a robust, enterprise-grade solution. The right choice depends on factors like performance, scalability, latency, and infrastructure availability.

        2. LLM Inference & Serving Architecture

        The following is an overview of the architecture for LLMs’ inference and serving. This architecture represents how an application interacts with a deployed Large Language Model (LLM) to generate predictions (inference). It highlights the role of inference servers, inference engines, and the hardware layer.

        Figure: Typical architecture of LLM Inference and Serving

        2.1 Application Layer

        • The Application is the client (e.g., a chatbot, API consumer, or internal tool) that sends requests to the inference server.
        • These requests are typically made over HTTP or gRPC for low-latency, high-performance communication.
        • Example: A frontend UI sends a user’s prompt to the server for processing.

        2.2 Inference Server

        The Inference Server acts as the bridge between the application and the model.
        Its main responsibilities, as shown in the diagram, are:

        2.2.1 Query Queue Scheduler

        • Purpose: Manages incoming queries in a queue to avoid overwhelming the model.
        • Function:
          • Receives requests from the application.
          • Places them in a queue.
          • Uses a scheduler to decide when and how to process them.
        • Batching Opportunity: The scheduler can group multiple requests together into a batch, improving GPU utilisation and throughput.
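
A minimal sketch of such a scheduler, assuming a plain FIFO policy and a fixed maximum batch size. The class and method names are hypothetical; production servers typically add time windows, priorities, or continuous batching on top of this idea.

```python
from collections import deque

class QueueScheduler:
    """FIFO queue that releases requests in batches of at most max_batch_size."""

    def __init__(self, max_batch_size=4):
        self.max_batch_size = max_batch_size
        self.queue = deque()

    def submit(self, request_id, prompt):
        """Enqueue an incoming request from the application."""
        self.queue.append((request_id, prompt))

    def next_batch(self):
        """Pop up to max_batch_size requests, oldest first."""
        batch = []
        while self.queue and len(batch) < self.max_batch_size:
            batch.append(self.queue.popleft())
        return batch

scheduler = QueueScheduler(max_batch_size=2)
for i, prompt in enumerate(["Hi", "Summarise this", "Translate that"]):
    scheduler.submit(i, prompt)
first = scheduler.next_batch()
second = scheduler.next_batch()
```

Keeping the request id next to each prompt is what later allows the batched model outputs to be split back into individual responses.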

        2.2.2 Metrics Module

        • Purpose: Collects real-time statistics on system performance.
        • Metrics Tracked:
          • Throughput: Tokens generated per second.
          • Latency: Total response time.
          • TTFT (Time to First Token): Time from request to first token generated.
          • Resource utilisation (GPU/CPU/memory).
        • This data is essential for monitoring, scaling, and debugging.

        2.3 Inference Engine

        The Inference Engine is the core computation unit that runs the LLM.

        2.3.1 Batching

        • Groups multiple queued queries into a single execution batch on the GPU.
        • Benefits:
          • Reduces overhead from individual GPU calls.
          • Improves parallel processing efficiency.

        2.3.2 Model Execution

        • Runs the LLM model itself.
        • The model takes the batched input and generates output tokens.
        • Can utilise optimisations such as:
          • KV Caching for faster token generation.
          • Quantisation for memory efficiency.
          • Speculative Decoding for speed.

        2.3.3 Query Response

        • Gathers model outputs.
        • Splits them back into individual responses for each original request.
        • Sends results back to the application over HTTP/gRPC.

        2.4 Hardware Layer

        • GPU/CPU hardware actually runs the model computation.
        • For LLMs, GPUs (often with large VRAM) are preferred for:
          • Parallel processing of large tensor computations.
          • Efficient handling of multi-batch workloads.
        • CPUs can be used for smaller models or less latency-sensitive tasks.

        2.5 Workflow Summary

        1. Application sends HTTP/gRPC request with prompt.
        2. Query Queue Scheduler stores and batches incoming requests.
        3. Batching Module groups requests for efficient GPU execution.
        4. Model generates predictions (tokens).
        5. Query Response sends formatted results back.
        6. Metrics Module continuously tracks performance.
        7. Results return to the application.

        3. Evaluation of LLM Inference and Serving

        When evaluating LLM inference (the process of generating outputs from a trained large language model) and LLM serving (the infrastructure and software that delivers model predictions to end users), the two primary performance metrics are:

        1. Throughput – Measures the total volume of output tokens generated per second.
          • Example: An LLM serving system producing 2,000 tokens/sec can handle more concurrent requests or generate longer responses faster than one producing 500 tokens/sec.
          • High throughput is critical for scenarios like batch inference, multi-user chatbots, or real-time content generation in high-traffic applications.
          • Throughput depends on factors such as:
            • Model size and architecture (e.g., LLaMA vs. GPT-style transformers).
            • GPU/TPU hardware capabilities and memory bandwidth.
            • Request batching efficiency.
            • Quantisation and weight compression techniques.
        2. Latency – Measures how quickly a model responds to an individual request.
          • Key metric: Time to First Token (TTFT) – the delay between receiving a prompt and starting to produce output.
          • After TTFT, token generation latency is often measured as milliseconds per token (ms/token).
          • Low latency is especially important for interactive applications like real-time chat, virtual assistants, and code autocompletion.
          • Latency can be influenced by:
            • Model loading time (cold start vs. warm start).
            • Network communication overhead.
            • Prompt length (longer prompts mean longer context processing time).
            • Decoding strategy (e.g., greedy search, beam search, sampling).
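
These metrics can be derived from timestamps recorded while streaming a response. The sketch below is illustrative: the function name and the timestamp values are assumptions, and it measures a single request, whereas system-level throughput would sum tokens across all concurrent requests.

```python
def summarize_request(request_time, token_times):
    """Compute latency metrics from a request timestamp and per-token timestamps (seconds)."""
    ttft = token_times[0] - request_time            # time to first token
    total_latency = token_times[-1] - request_time  # total response time
    n = len(token_times)
    # Average gap between consecutive tokens after the first one.
    ms_per_token = (token_times[-1] - token_times[0]) / (n - 1) * 1000 if n > 1 else 0.0
    return {
        "ttft_s": ttft,
        "latency_s": total_latency,
        "ms_per_token": ms_per_token,
        "tokens_per_s": n / total_latency,
    }

# Hypothetical trace: request at t=0.0 s, five tokens streamed back.
stats = summarize_request(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
```

Separating TTFT from the per-token rate matters because they respond to different causes: a long prompt mostly increases TTFT, while decoding speed shows up in ms/token.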

        3.1 Optimisation Strategies for High Throughput and Low Latency

        LLM inference engines and serving frameworks focus heavily on memory utilisation and computational efficiency in production:

        • Model Optimisation
          • Quantisation – Reduce precision (e.g., FP16 → INT8 or 4-bit) to speed up inference with minimal accuracy loss.
          • Pruning – Remove redundant weights to reduce model size.
          • Speculative Decoding – Use a smaller model to “guess” future tokens and confirm with the main model.
          • LoRA / PEFT – Serve small parameter-efficient adapters on top of one shared base model instead of loading a separate full model for each fine-tuned variant.
        • Serving Architecture
          • Request Batching – Combine multiple user queries into a single forward pass for better GPU utilisation.
          • Pipeline Parallelism – Split model layers across multiple GPUs.
          • Tensor Parallelism – Split the tensor computations across devices.
          • KV Cache Reuse – Store intermediate attention key-value pairs to avoid recomputation in autoregressive decoding.
        • Infrastructure-Level Improvements
          • Use GPUs with high VRAM and bandwidth (e.g., NVIDIA A100/H100).
          • Place inference servers close to end users to reduce network latency (edge inference).
          • Use asynchronous request handling and efficient scheduling.
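
As an illustration of the quantisation idea listed above, here is a minimal sketch of symmetric absmax INT8 quantisation on a handful of toy weights. It is one simple scheme among many; real engines usually quantise per channel or per group and use optimized kernels.

```python
def quantize_int8(weights):
    """Symmetric absmax quantisation: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are zero
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.5, 0.0]  # toy values for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is now stored as one byte plus a shared scale, and the rounding error per weight is bounded by half the scale, which is why accuracy loss is usually small.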

        3.2 Beyond Throughput and Latency – Additional Considerations

        While throughput and latency form the backbone of LLM performance evaluation, real-world deployments often require additional criteria to ensure stability, scalability, and cost-effectiveness.

        3.2.1 Scalability

        • What it means: The ability of the inference infrastructure to handle sudden spikes in traffic without sacrificing speed or accuracy.
        • Why it matters:
          • LLM-based customer support systems may experience traffic surges during product launches.
          • AI coding assistants can see unpredictable query bursts during hackathons or exams.
        • How it’s achieved:
          • Auto-scaling mechanisms in Kubernetes (HPA/VPA) or serverless GPU backends.
          • Load balancing across multiple GPU/TPU nodes.
          • Model sharding for extremely large models (e.g., Megatron-LM, DeepSpeed ZeRO-3).

        3.2.2 Cost Efficiency

        • What it means: Delivering optimal performance per dollar spent, especially important in pay-per-token or per-hour GPU rental models.
        • Why it matters:
          • Cloud GPU instances (A100, H100) are expensive; inefficient deployments can burn budgets fast.
          • Inference for large models may cost more than fine-tuning if poorly optimized.
        • Strategies:
          • Use quantization (e.g., INT8, FP16) to reduce GPU memory usage and increase batch size.
          • Employ dynamic batching to process multiple requests simultaneously.
          • Choose spot/preemptible GPU instances for non-critical workloads.
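
A back-of-the-envelope calculation shows why precision drives cost. The sketch below estimates memory for the weights only, assuming a hypothetical 7-billion-parameter model; the KV cache, activations, and framework overhead come on top of this.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Hypothetical 7B-parameter model.
sizes = {dtype: weight_memory_gb(7e9, dtype) for dtype in BYTES_PER_PARAM}
```

Halving the bytes per parameter halves the weight footprint, and the memory freed can be spent on larger batches or a smaller, cheaper GPU.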

        3.2.3 Ease of Deployment

        • What it means: How quickly and reliably the LLM stack can be set up and updated.
        • Why it matters:
          • Shorter deployment cycles reduce time-to-market.
          • DevOps teams prefer infrastructure that integrates into existing CI/CD pipelines.
        • Implementation best practices:
          • Package inference servers using Docker.
          • Deploy using Helm charts for Kubernetes clusters.
          • Integrate with GitHub Actions / GitLab CI for automated rollouts.

        3.2.4 Fault Tolerance & Reliability

        • What it means: The ability of the system to keep running even if one or more nodes fail.
        • Why it matters:
          • LLM applications like healthcare assistants or financial chatbots can’t afford downtime.
        • Techniques:
          • Redundant model replicas with active-active failover.
          • Checkpointing model states so recovery is quick.
          • Health checks and graceful degradation (e.g., fall back to a smaller, local model if GPU fails).

        3.2.5 Multi-Model Support

        • What it means: Running different LLMs (or different versions of the same LLM) simultaneously.
        • Why it matters:
          • Some applications may require domain-specific models alongside general-purpose ones.
          • Allows A/B testing for performance evaluation before production rollout.
        • Examples:
          • vLLM and Triton Inference Server can host multiple models and route requests accordingly.

        3.2.6 Security & Compliance

        • What it means: Protecting data in transit and ensuring compliance with legal and organizational standards.
        • Why it matters:
          • LLMs often process sensitive data (e.g., PII, financial records, medical notes).
          • Non-compliance with regulations like GDPR, HIPAA, or SOC 2 can lead to heavy penalties.
        • Security Measures:
          • TLS encryption for all API calls.
          • Role-based access control (RBAC) and API key authentication.
          • Audit logs for every request to track usage.

        4. Prominent Products

        When selecting an LLM inference and serving framework, the decision often hinges on performance, scalability, hardware compatibility, and ease of integration with existing workflows. Below is an expanded look at some of the most prominent solutions in the space.

        4.1 vLLM

        • Origin & Background: Developed at the Sky Computing Lab, UC Berkeley, vLLM quickly gained popularity for its advanced scheduling and memory management optimizations.
        • Key Strengths:
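          • PagedAttention: Manages the attention KV cache in fixed-size blocks, similar to virtual-memory paging in operating systems, which reduces memory fragmentation and allows larger batches and higher throughput.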
          • Supports continuous batching, making it ideal for serving multiple requests with minimal latency spikes.
        • Enterprise Adoption: Neural Magic’s nm-vllm (post Red Hat acquisition) adds enterprise-grade optimizations like quantization, CPU acceleration, and Kubernetes deployment tooling.

        4.2 LightLLM

        • Language & Approach: Pure Python implementation for simplicity and developer friendliness.
        • Key Strengths:
          • Extremely lightweight — minimal dependencies.
          • Designed for fast setup and low-resource deployment scenarios.
        • Ideal Use Cases: Edge devices, personal projects, or small-scale cloud deployments where quick prototyping is needed.

        4.3 LMDeploy

        • Focus: Deployment toolkit for compressing, quantizing, and serving large models.
        • Key Strengths:
          • Multi-backend support (ONNX, TensorRT, PyTorch).
          • Integrated with MPT, LLaMA, and other popular architectures.
        • Best Fit: Enterprises that want to reduce model size while keeping reasonable accuracy.

        4.4 SGLang

        • Scope: Targets both text and vision-language models.
        • Key Strengths:
          • Supports fine-tuned and multi-modal model serving.
          • Optimized for GPU and distributed setups.
        • Typical Users: Research teams and startups building chatbots with multi-modal capabilities.

        4.5 OpenLLM

        • Mission: Make running any open-source LLM as easy as running openllm start.
        • Key Strengths:
          • OpenAI-compatible APIs, enabling drop-in replacement in existing code.
          • CLI and Docker-friendly deployment.
        • Example Models Supported: Llama 3.3, Qwen2.5, Phi-3, Mistral, and more.

        4.6 Triton Inference Server with TensorRT-LLM

        • Developed by: NVIDIA.
        • Key Strengths:
          • TensorRT-LLM optimizes transformer inference with CUDA Graphs, FP8 precision, and kernel fusion.
          • Triton supports multi-framework serving (PyTorch, TensorFlow, ONNX) in one server.
        • Best Fit: Enterprises with NVIDIA GPU clusters aiming for maximum throughput and lowest latency.

        4.7 Ray Serve

        • From: The Ray ecosystem.
        • Key Strengths:
          • Horizontal scaling for ML model APIs.
          • Model composition — chain multiple models with Python code.
        • Typical Scenario: Building a complex AI service combining embeddings, retrieval, and multiple LLM calls.

        4.8 Hugging Face – Text Generation Inference (TGI)

        • Purpose: Highly optimized serving backend for text generation models.
        • Key Strengths:
          • Supports FlashAttention, tensor parallelism, and streaming output.
          • Native integration with Hugging Face Hub.
        • Use Case: Deploying HF-hosted models in enterprise or local infrastructure.

        4.9 DeepSpeed-MII

        • From: Microsoft’s DeepSpeed team.
        • Key Strengths:
          • Specializes in low-latency, high-throughput serving with quantization and sparsity support.
          • Leverages DeepSpeed inference optimizations like kernel fusion and ZeRO-offload.
        • Ideal For: Ultra-large models that need aggressive GPU memory optimizations.

        4.10 CTranslate2

        • Focus: Transformer inference for translation and seq2seq tasks.
        • Key Strengths:
          • CPU and GPU support, quantization for small footprint.
          • Exceptional speed for encoder-decoder models.
        • Typical Users: Machine translation systems, document summarization.

        4.11 BentoML

        • Scope: Unified inference and deployment framework for any model type.
        • Key Strengths:
          • Abstracts away serving details; works with ML frameworks like PyTorch, TensorFlow, scikit-learn.
          • Easy API packaging and containerization.
        • Great For: ML engineers who want one platform for models across use cases.

        4.12 MLC LLM

        • Mission: Compile and deploy LLMs anywhere — from cloud GPUs to mobile devices.
        • Key Strengths:
          • Uses ML compilation techniques (TVM) to optimize for target hardware.
          • Browser and mobile-friendly deployment options.
        • Notable Edge: Run LLMs in WebAssembly or Metal (Apple) without server dependencies.

        4.13 Others

        • Ollama: Local LLM runner with easy model downloading and CLI interaction.
        • WebLLM: Runs LLM inference entirely in the browser using WebGPU.
        • LM Studio: Desktop app to run and test local models.
        • GPT4ALL: Open-source, offline-capable chatbot environment.
        • llama.cpp: Lightweight C++ implementation for running LLaMA-family models on CPUs.

          5. LLM Inference & Serving Frameworks – Comparison Table

          Framework / Tool | Language & Ecosystem | Deployment Model | Key Strengths | Optimisations | Best For
          vLLM | Python (PyTorch), UC Berkeley origin | Local & Cloud | High throughput, easy API | PagedAttention, efficient KV cache | General-purpose high-speed serving
          LightLLM | Python | Local, Cloud | Lightweight, minimal dependencies | Async IO, KV cache, batching | Resource-constrained environments
          LMDeploy | Python + C++ | Local, Edge, Cloud | Model compression + serving toolkit | Quantisation, distillation | Deploying smaller/faster LLMs
          SGLang | Python | Local, Cloud | Fast for LLM & VLM | Optimised batching, KV cache | Multimodal LLM serving
          OpenLLM | Python (BentoML) | Local, Cloud | OpenAI-compatible APIs for any model | API standardisation | Developers needing API parity with OpenAI
          Triton + TensorRT-LLM | C++, Python (NVIDIA) | On-prem & Cloud | Enterprise-grade GPU serving | TensorRT optimisations, FP8/INT8 quant | GPU-heavy workloads
          Ray Serve | Python | Cloud, Kubernetes | Multi-model composition | Scaling, autoscheduling | Complex inference pipelines
          Hugging Face TGI | Python | Local, Cloud | Optimised for text generation | Speculative decoding, quantisation | Hugging Face model ecosystem
          DeepSpeed-MII | Python | Cloud & On-Prem | High throughput, low latency | Tensor parallelism, quantisation | Large model production serving
          CTranslate2 | C++, Python | Local | Fast Transformer inference | Quantisation, CPU optimised | CPU-based serving
          BentoML | Python | Local, Cloud | Unified inference for any model | API packaging, scaling | Teams serving mixed model types
          MLC LLM | Python, Rust | Local, Mobile, Web | Runs anywhere (cross-platform) | ML compilation | Deploying LLMs to edge/mobile
          Ollama | CLI | Local (Mac/Linux) | Simple local LLM serving | Pre-packaged models | Non-technical local usage
          WebLLM | JavaScript | Browser | No server needed | WebGPU execution | Running LLMs fully in-browser
          LM Studio | GUI App | Local | Easy local model download & run | Built-in chat interface | Offline local use
          GPT4ALL | Python, GUI | Local | Wide model support | CPU optimised | Privacy-focused offline inference
          llama.cpp | C++ | Local | Lightweight, portable | Quantisation to 4-bit | Running LLMs on low-spec hardware

          6. Important Terminologies

          6.1 KV Cache (Key-Value Cache)

          Key-value caching is a technique used in transformer models where the key and value matrices computed in earlier decoding steps are stored for reuse during subsequent token generation.

          • Benefits: Reduces redundant computation, leading to faster inference times.
          • Trade-offs: Increased memory consumption, especially for long context windows.
          • Optimizations:
            • Cache Invalidation – removing unused portions of the cache when switching contexts.
            • Cache Reuse – sharing parts of the cache between similar prompts or multi-turn conversations.
            • Quantized Cache – storing KV cache in lower precision than the model's compute precision (e.g., FP8, INT8) to save memory.
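To make the caching idea concrete, here is a minimal single-head sketch in plain Python. The `project` function is a hypothetical stand-in for the learned key/value projections, and the token vectors are made up. It shows that appending one key/value pair per step gives the same attention outputs as recomputing all keys and values at every step.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def project(x, scale):
    """Hypothetical stand-in for the learned key/value projection matrices."""
    return [scale * xi for xi in x]

def decode_with_cache(token_vectors):
    keys, values, outputs = [], [], []   # keys and values together form the KV cache
    for x in token_vectors:
        keys.append(project(x, 0.5))     # project only the newest token
        values.append(project(x, 2.0))
        outputs.append(attend(x, keys, values))
    return outputs

def decode_without_cache(token_vectors):
    outputs = []
    for t, x in enumerate(token_vectors):
        past = token_vectors[: t + 1]
        keys = [project(p, 0.5) for p in past]    # recomputed at every step
        values = [project(p, 2.0) for p in past]
        outputs.append(attend(x, keys, values))
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(decode_with_cache(tokens) == decode_without_cache(tokens))
```

Without the cache, the number of projections grows quadratically with the sequence length; with it, each token is projected once, at the cost of keeping every key and value in memory.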

          6.2 PagedAttention

          A memory management strategy for KV cache inspired by virtual memory paging in operating systems. Instead of storing keys and values contiguously, they are stored in fixed-size memory pages, allowing flexible allocation and reuse of GPU memory.

          • Advantages:
            • Efficient use of GPU VRAM.
            • Avoids large contiguous memory allocations.
          • Implementations: Used in libraries like vLLM to enable very large batch sizes.
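The paging idea can be sketched with a small block allocator. This is an illustration only, not vLLM's implementation: the class name `PagedKVAllocator` and its methods are hypothetical. Each sequence keeps a block table that maps its logical token positions to physical blocks, and a new block is taken from the free pool only when the current one is full.

```python
class PagedKVAllocator:
    """Toy block allocator: maps each sequence's tokens to fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:      # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

allocator = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    print(allocator.append_token("request-a"))
```

Because blocks do not need to be contiguous, memory freed by one finished request can be reused immediately by any other request.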

          6.3 Batching

          Combining multiple inference requests into a single forward pass to improve GPU utilization and throughput.

          • Types:
            • Static Batching – fixed batch size; efficient but may introduce latency.
            • Dynamic Batching – requests are grouped on the fly based on arrival time and sequence length.
          • Key Libraries: Hugging Face TGI, vLLM, TensorRT-LLM, Ray Serve.
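Dynamic batching can be sketched as a grouping rule over request arrival times. The function below is a simplified illustration with hypothetical parameter names, not taken from any of the libraries above: a batch is closed when it is full, or when the next request arrives too long after the first request of the batch.

```python
def dynamic_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group request arrival times (in ms) into batches."""
    batches, current = [], []
    for t in sorted(arrivals_ms):
        batch_full = len(current) == max_batch_size
        waited_too_long = bool(current) and t - current[0] > max_wait_ms
        if batch_full or waited_too_long:
            batches.append(current)   # close the batch and start a new one
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

print(dynamic_batches([0, 10, 20, 500, 510], max_batch_size=2, max_wait_ms=50))
```

A larger `max_wait_ms` tends to produce fuller batches (higher throughput) at the cost of extra latency for the first request in each batch.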

          6.4 Support for Quantisation

          Reducing model precision to decrease memory footprint and increase inference speed.

          • Common Precisions: FP32 → FP16/BF16 → INT8 → INT4 (FP16 and BF16 are both 16-bit formats).
          • Benefits: Lower memory bandwidth usage, higher cache efficiency.
          • Popular Methods:
            • GPTQ – post-training quantization.
            • AWQ – activation-aware weight quantization.
            • QLoRA – fine-tuning low-rank adapters on top of a 4-bit quantized base model.
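As a rough illustration of what "reducing precision" means, the sketch below applies simple symmetric per-tensor INT8 quantization to a list of weights. It is a basic round-to-nearest scheme, far simpler than GPTQ or AWQ, and the function names are my own.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized, scale)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # rounding error
```

Each weight now needs one byte instead of four (FP32), and the rounding error per weight is at most half of `scale`.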

          6.5 LoRA (Low-Rank Adaptation)

          A fine-tuning technique that freezes pre-trained weights and injects small trainable rank decomposition matrices into transformer layers. This drastically reduces the number of trainable parameters for downstream tasks, making fine-tuning cost-efficient.
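The parameter savings are easy to verify with a little arithmetic. In the sketch below (plain Python, helper names are my own), a frozen weight matrix W of shape d_out × d_in is augmented with two small matrices, A (r × d_in) and B (d_out × r), and the output is computed as W x + (alpha / r) · B (A x).

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. LoRA matrices A and B."""
    return d_out * d_in, rank * (d_out + d_in)

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, f"{100 * lora / full:.2f}%")

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weights (toy values)
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[0.0], [0.0]]             # B starts at zero, so the model is unchanged at first
print(lora_forward(W, A, B, [2.0, 3.0], alpha=1.0))
```

For a 4096 × 4096 layer with rank 8, LoRA trains 65,536 parameters instead of 16,777,216, which is about 0.39% of the original count.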

          6.6 Tool Calling / Function Calling

          Allows an LLM to invoke external APIs or tools when it needs information it was not trained on, or to take actions in the real world.

          • Example: An LLM calling a weather API when asked “What’s the weather in Mumbai right now?”.
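On the application side, tool calling usually comes down to parsing a structured request emitted by the model and dispatching it to a function. The sketch below assumes the model returns a JSON object with `name` and `arguments` fields. That shape, the `get_weather` stub, and the `TOOLS` registry are all assumptions made for illustration.

```python
import json

def get_weather(city):
    """Hypothetical tool; a real one would call a weather API."""
    return f"Weather data for {city} (stub)"

# Registry of tools the model is allowed to call
TOOLS = {"get_weather": get_weather}

def handle_tool_call(model_output):
    """Parse a JSON tool call emitted by the model and run the matching tool."""
    call = json.loads(model_output)
    name, arguments = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)

model_output = '{"name": "get_weather", "arguments": {"city": "Mumbai"}}'
print(handle_tool_call(model_output))
```

The tool result is normally sent back to the model as an extra message so that it can phrase the final answer.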

          6.7 Reasoning Models

          Models optimized for multi-step problem solving with intermediate reasoning chains.

          • Examples: DeepSeek-R1, OpenAI o1-preview.
          • Techniques: Chain-of-Thought (CoT), Tree-of-Thought (ToT), Graph-of-Thought (GoT).

          6.8 Structured Outputs

          Ensuring the model generates responses in a strict format.

          • Outlines – a library that constrains generation so the output matches a regex, JSON Schema, or grammar.
          • LMFE (Language Model Format Enforcer) – enforces output to match JSON Schema, regex, or XML.
          • xgrammar – flexible grammar-based generation.

          6.9 Automatic Prefix Caching (APC)

          Reuses the cached KV entries of a prompt prefix that a new request shares with an earlier one (for example, a common system prompt or the earlier turns of a conversation), so only the non-shared suffix has to be processed.
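A simplified way to picture prefix caching is a lookup keyed by blocks of prompt tokens. The `PrefixCache` class below is a toy model of the bookkeeping only (no real KV tensors are stored), and its names are hypothetical.

```python
class PrefixCache:
    """Toy prefix cache: remembers which block-aligned prompt prefixes were seen."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.cached = set()

    def process(self, tokens):
        """Return (reused, computed): tokens served from cache vs. newly processed."""
        reused = 0
        hit = True
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            key = tuple(tokens[:end])
            if hit and key in self.cached:
                reused = end              # this whole prefix is already cached
            else:
                hit = False               # first miss: the rest must be computed
                self.cached.add(key)
        return reused, len(tokens) - reused

cache = PrefixCache(block_size=2)
print(cache.process([1, 2, 3, 4, 5]))      # first request: nothing to reuse
print(cache.process([1, 2, 3, 4, 9, 9]))   # shares the first four tokens
```

Only whole blocks are reused here, which is why a shared prefix of four tokens is reused but a partial trailing block is not.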

          6.10 Speculative Decoding

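          A technique where a smaller, faster “draft” model proposes several candidate tokens, and the larger main model verifies them in a single forward pass, keeping the ones it agrees with. This reduces latency because the main model no longer has to run once per generated token.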
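The draft-and-verify loop can be sketched for the greedy case with two toy "models", which are just Python functions returning the next token. This is a simplified illustration: real implementations verify all drafted positions in one batched forward pass and use a probabilistic acceptance rule. The function names are my own.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < num_tokens:
        # 1. The cheap draft model proposes k tokens one by one.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. The target model checks the proposals (one pass in a real system).
        target_passes += 1
        accepted = []
        for token in draft:
            expected = target_next(seq + accepted)
            accepted.append(expected)
            if token != expected:      # first mismatch: keep the correction, stop
                break
        else:
            accepted.append(target_next(seq + accepted))  # all accepted: bonus token
        seq.extend(accepted)
    return seq[: len(prompt) + num_tokens], target_passes

def target_next(seq):
    return (seq[-1] + 1) % 10          # toy target model: counts upward

def draft_next(seq):
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10   # wrong right after a 2

tokens, passes = speculative_decode(target_next, draft_next, [0], num_tokens=8)
print(tokens, passes)
```

The output is identical to what the target model would generate on its own; the gain is that eight tokens needed only two verification passes instead of eight sequential target steps.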

          6.11 Chunked Prefill

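          Splitting the prefill of a long input sequence into smaller chunks that are processed over several steps, so that a single long prompt does not overwhelm GPU memory or stall the decoding of other requests in the batch.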
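A rough sketch of the scheduling idea follows, under the assumption that each step has a fixed token budget shared between requests that are already decoding (one token each) and one prompt that is still being prefilled. The function and field names are hypothetical.

```python
def schedule_prefill(prompt_len, num_decoding, token_budget):
    """Plan how a long prompt is prefilled in chunks alongside ongoing decodes."""
    if token_budget <= num_decoding:
        raise ValueError("token budget must leave room for prefill tokens")
    steps, done = [], 0
    while done < prompt_len:
        # Decoding requests get one token each; the rest of the budget is prefill.
        chunk = min(token_budget - num_decoding, prompt_len - done)
        steps.append({"decode_tokens": num_decoding, "prefill_tokens": chunk})
        done += chunk
    return steps

for step in schedule_prefill(prompt_len=10, num_decoding=2, token_budget=6):
    print(step)
```

With a budget of 6 tokens per step, the 10-token prompt is prefilled in chunks of 4, 4, and 2 while the two decoding requests keep producing a token at every step.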

          6.12 Prompt Adapter

          A lightweight fine-tuning approach (prompt tuning) where a small set of trainable “soft prompt” embeddings is learned and prepended to the input, steering a frozen base LLM toward a task without retraining the entire model.

          6.13 Beam Search

          A decoding strategy that keeps track of multiple candidate sequences at each generation step, selecting the most probable one at the end.
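The sketch below runs beam search over a tiny hand-written probability table (the table values are made up) and shows a case where keeping two candidates finds a more probable sequence than greedy decoding, which is beam search with a width of 1.

```python
import math

# Toy next-token probabilities, keyed by the last token of the sequence
TABLE = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"X": 0.55, "Y": 0.45},
    "B": {"X": 0.9, "Y": 0.1},
}

def next_logprobs(seq):
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}

def beam_search(next_logprobs, start, steps, beam_width):
    """Return the best (sequence, total log-probability) after `steps` tokens."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, logprob in next_logprobs(seq).items():
                candidates.append((seq + [tok], score + logprob))
        # Keep only the `beam_width` highest-scoring candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

print(beam_search(next_logprobs, "<s>", steps=2, beam_width=1))  # greedy
print(beam_search(next_logprobs, "<s>", steps=2, beam_width=2))
```

Greedy decoding commits to "A" (probability 0.6) and ends with a sequence of probability 0.33, while a beam width of 2 keeps "B" alive and finds "B X" with probability 0.36.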

          6.14 Guided Decoding

          Constrained generation to follow specific patterns, constraints, or external logic. Useful for generating SQL queries, code, or structured data.
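One common way to implement this is to mask, at every step, the tokens that cannot lead to a valid output. The toy decoder below assumes a fixed set of allowed output strings and a hypothetical `score_fn` standing in for the model's token scores. Real libraries compile a regex, grammar, or JSON Schema into such a mask instead of enumerating outputs.

```python
def guided_greedy_decode(score_fn, vocab, allowed_outputs, max_steps=20):
    """Greedy decoding restricted to tokens that keep the output valid."""
    out = ""
    for _ in range(max_steps):
        if out in allowed_outputs:       # a complete valid output was produced
            return out
        # Mask: keep only tokens that extend `out` toward some allowed output
        candidates = [
            tok for tok in vocab
            if any(target.startswith(out + tok) for target in allowed_outputs)
        ]
        if not candidates:
            break
        out += max(candidates, key=lambda tok: score_fn(out, tok))
    return out

vocab = ["yes", "no", "maybe", "y", "es"]

def score_fn(prefix, token):
    return len(token)   # toy "model" that simply prefers longer tokens

print(guided_greedy_decode(score_fn, vocab, allowed_outputs={"yes", "no"}))
```

Unconstrained, this toy model would pick "maybe"; with the mask applied, it can only answer "yes" or "no".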

          6.15 AsyncLM

          Enables asynchronous processing, allowing the LLM to generate and execute multiple function calls or tasks concurrently—reducing idle GPU time.

          6.16 Prompt Logprobs

          Log-probability values that the model assigns to the tokens of the prompt itself (as opposed to the generated tokens). Useful for scoring or ranking a given text, computing perplexity, and evaluating model confidence.
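Log probabilities are usually summarised as a total log-likelihood or as perplexity. The sketch below uses made-up token probabilities and a helper name of my own to show the arithmetic.

```python
import math

def sequence_stats(token_probs):
    """Summarise per-token probabilities as log-probs, total, and perplexity."""
    logprobs = [math.log(p) for p in token_probs]
    mean = sum(logprobs) / len(logprobs)
    return {
        "logprobs": logprobs,
        "total_logprob": sum(logprobs),
        "perplexity": math.exp(-mean),   # lower means the model is less "surprised"
        "least_likely_index": logprobs.index(min(logprobs)),
    }

stats = sequence_stats([0.9, 0.1, 0.8])
print(stats["perplexity"], stats["least_likely_index"])
```

A token with an unusually low log probability (index 1 in this example) marks the point where the text surprised the model most.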

          6.17 KServe

          A standardized, serverless ML inference platform for Kubernetes. Supports scaling, canary deployments, and integrates with GPU/TPU backends.

          6.18 KubeAI

          An AI inference operator for Kubernetes that simplifies deployment of LLMs, VLMs, embedding models, and speech-to-text pipelines.

          6.19 Llama Stack

          A composable set of tools, APIs, and services designed for building applications with Meta’s LLaMA models.

          6.20 Additional Serving & Inference Terms

          • Continuous Batching – Dynamically merging new requests into ongoing batches for maximum throughput (used by vLLM).
          • Request Scheduling – Prioritizing inference requests to meet SLAs for latency-sensitive workloads.
          • Token Parallelism – Splitting token generation across multiple GPUs to improve throughput.
          • Pipeline Parallelism – Splitting the model layers across multiple GPUs.
          • Tensor Parallelism – Splitting individual tensors across GPUs for large model inference.
          • MoE (Mixture of Experts) – Activating only a subset of model parameters per token to reduce compute cost.
          • FlashAttention – An optimized attention algorithm that reduces memory usage and speeds up computation.
          • vLLM – A high-performance inference engine with PagedAttention and continuous batching for serving large language models efficiently.
          • TensorRT-LLM – NVIDIA’s optimized LLM serving library with quantization, fused kernels, and multi-GPU support.
          • Serving Gateway – A request router/load balancer for distributing LLM inference requests across multiple workers.

          7. References

          1. Best LLM Inference Engines and Servers to Deploy LLMs in Production – Overview of popular inference backends.
          2. Efficient Memory Management for Large Language Model Serving with PagedAttention – Core memory optimization paper behind vLLM.
          3. LoRA: Low-Rank Adaptation of Large Language Models – Efficient fine-tuning approach for LLMs.
          4. Fast Inference from Transformers via Speculative Decoding – Reducing token generation latency.
          5. Looking Back at Speculative Decoding – Retrospective analysis of speculative decoding trade-offs.
          6. Efficient Generative Large Language Model Serving – Practical techniques for faster inference.
          7. Ten Ways to Serve Large Language Models: A Comprehensive Guide – High-level serving strategies.
          8. The 6 Best LLM Tools To Run Models Locally – Lightweight deployment options.
          9. Benchmarking LLM Inference Backends – Performance metrics comparison.
          10. Transformers Key-Value Caching Explained – How KV caching accelerates LLMs.
          11. LLM Inference Series: 3. KV Caching Explained – Deep dive on caching internals.
          12. vLLM and PagedAttention: A Comprehensive Overview – End-to-end guide to vLLM.
          13. Understanding Reasoning LLMs – Reasoning capabilities in inference.

          8. Further Reads

          14. DeepSpeed-MII: High-Throughput and Low-Latency Inference for Transformers – Microsoft’s optimized inference stack.
          15. Ray Serve for Distributed LLM Inference – Scaling LLM inference across nodes.
          16. Serving Multiple Models Efficiently with NVIDIA Triton – Multi-model and GPU scheduling strategies.
          17. FlashAttention: Fast and Memory-Efficient Attention – Key innovation for attention speedups.
          18. SGLang: Structured Generation for Large Language Models – Efficient structured text generation.
          19. Quantization-Aware LLM Serving with GPTQ – Speed + memory optimizations through quantization.
          20. MII vs vLLM vs HuggingFace Transformers Benchmarks – Comparative analysis of popular inference engines.
          21. Accelerating LLM Inference with TensorRT-LLM – NVIDIA’s low-level optimization library.
          22. Dynamic Batching for LLM Inference – Improving throughput without hurting latency.
          23. LLMOps: Operational Challenges and Best Practices – Managing LLM inference in production environments.
          24. Speculative Beam Search for Faster LLM Inference – Combining speculative decoding with beam search.
          25. Serving LLMs with Kubernetes and KServe – Cloud-native deployment approaches.

          How do you choose among competing open-source products? Example comparison of open-source vector databases.

          Ask yourself the following questions when choosing among competing open-source products:

          1. Are there any other users out there?
          2. Is it the most popular in this category?
          3. Is this technology in decline?

          The popularity and traction of a GitHub project can be inferred from its star history. You can use star-history.com to compare projects on these two metrics. Refer to the tutorial for details.

          Following is the comparison of vector databases: qdrant, chroma, weaviate, marqo, milvus, and vespa.

          Figure: Star History chart

          Hands-on Tutorial on Making an Audio Bot using LLM, and RAG

          1. Learning Outcome

          1. Learn about Large Language Models (LLMs), their installation, access through HTTP API, the Ollama framework, etc.
          2. Introduction of Retrieval-Augmented Generation (RAG)
          3. Learn about the Data Ingestion Pipeline for the Qdrant vector database
          4. Learn about the RAG Pipeline
          5. Access the prototype using audio-based input/output (audio bot).
          6. Audio bot using speech-to-text and text-to-speech
          7. Making Qdrant client for making queries
          8. Creating context using documents retrieved from the Qdrant database
          9. Using Llama 3.2 as a Large Language Model (LLM) using Ollama and Langchain framework
          10. Create a prompt template using instruction and context, and make a query using the template and langchain
          11. Using Llama to generate the answer to the question using the given context

          2. Large Language Model

          A Large Language Model (LLM) is an artificial intelligence (AI) model with billions of parameters, trained on huge amounts of data to comprehend human-generated text and to produce text comparable to that written by humans. LLMs learn linguistic structures, relationships, and patterns from human-generated data. They also acquire a huge amount of internal knowledge, stored in the model weights, from the data on which they are trained.

          The Transformer architecture is the foundation of most large language models, allowing them to:

          1. process sequential data, which makes them suitable for tasks like text generation, translation, and question-answering;
          2. learn contextual relationships such as word meanings, syntax, and semantics;
          3. generate human-like language, i.e., produce coherent, context-specific text that resembles human-generated content.

          Some key characteristics of large language models include:

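          1. Vast training data: They are trained on very large text corpora.
          2. Scalability: They can handle long sequences and complex tasks.
          3. Parallel processing: The Transformer architecture processes the tokens of a sequence in parallel rather than one at a time.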

          Figure 1

          LLMs are divided into three categories: 1) encoder-only, 2) decoder-only, and 3) encoder-decoder models. Examples are BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-to-Text Transfer Transformer), respectively. General use cases are shown in Figure 1 above.

          Major concerns regarding LLMs are:

          1. Data bias: LLMs have the potential to reinforce biases found in the training data.
          2. Interpretability: It can be difficult to comprehend the decision-making process of an LLM.
          3. Security: Adversarial attacks can target LLMs.

          2.1 Llama

          Llama is a popular open-source LLM from Meta (formerly Facebook). Being open-source, it can be fine-tuned, distilled, deployed, and used in our own use cases (provided the above-mentioned concerns are taken care of). Llama is a decoder-only language model, which means that it uses a transformer architecture with only decoder layers to generate text.

          At the time of writing this article, current versions are available in three flavors:

          1. Llama 3.1: With Multilingual capability and available in two versions 1) 8B with 8 billion parameters and 2) 405B with 405 billion parameters
          2. Llama 3.2: Lightweight and Multimodal and available in 1) Lightweight 1B and 3B 2) Multimodal 11B and 90B
          3. Llama 3.3: Multilingual with 70B parameters

          In the current prototype, I have used Llama 3.2.

          2.2 Install, run, and different ways to access Llama

          Please refer to my separate post on this topic, Install, run, and access Llama using Ollama.

          3. Retrieval-Augmented Generation (RAG)

          Large language models are locked in time: they have learned only the knowledge that was available up to the time they were trained and released. These models are trained on Internet-scale open data, so they give very good answers to general questions, but they may fail to answer, or may hallucinate, when asked about your personal or enterprise data. The reason is that they usually lack the context that is specific to your application.

          Retrieval-augmented generation (RAG) combines the strength of generative AI and retrieval techniques. It helps in providing the right context for the LLM along with the question being asked. This way we get the LLM to generate more accurate and relevant content. This is a cost-effective way to improve the output of LLMs without retraining them. The following diagram depicts the RAG architecture:

          Fig1: Conceptual flow of using RAG with LLM

          There are two primary components of the RAG:

          1. Retrieval: This component is responsible for searching and retrieving relevant information related to the user’s query from various knowledge sources such as documents, articles, databases, etc.
          2. Generation: This component (the LLM) uses the retrieved information as context to craft coherent and contextually rich responses to user queries.

          A question submitted by the user is routed to the retrieval component. Using the embedding model, the retrieval component converts the query text to an embedding vector. It then searches the vector database for a small number of vectors that are closest to the query vector under the chosen distance metric and that satisfy the similarity-score threshold. The text chunks stored alongside these vectors are fetched and used as the context. This context, along with the instruction and the query, is put into the prompt template and sent to the LLM. The LLM returns generated text that is more correct and relevant to the user’s query.
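The flow above can be sketched end to end in a few lines of plain Python. To keep the example self-contained, it uses a hypothetical bag-of-words `embed` function and an in-memory list in place of a real embedding model and vector database. The chunk texts, the threshold, and the function names are assumptions for illustration only.

```python
import math
import re
from collections import Counter

def embed(text):
    """Hypothetical embedding: word counts (a real system uses a neural model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k=2, min_score=0.2):
    """Return up to top_k chunks whose similarity to the query clears min_score."""
    query_vec = embed(query)
    scored = sorted(((cosine(query_vec, embed(c)), c) for c in chunks), reverse=True)
    return [chunk for score, chunk in scored[:top_k] if score >= min_score]

def build_prompt(instruction, context_chunks, query):
    context = "\n".join(context_chunks)
    return f"Instruction: {instruction}\nContext: {context}\nQuery: {query}\n"

chunks = [
    "Democracy is a form of government chosen by the people",
    "The monsoon arrives in June",
    "Elections let people choose their government",
]
query = "What is democracy?"
context = retrieve(query, chunks)
print(build_prompt("Answer the question based on the context below.", context, query))
```

The resulting prompt is what gets sent to the LLM; swapping in a real embedding model and a vector database changes `embed` and `retrieve`, not the overall flow.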

          4. RAG (Data Ingestion Pipeline)

          For the retrieval component to have a searchable index of preprocessed data, we must first build a data ingestion pipeline, depicted in Fig2 below. Knowledge sources can be web pages, text documents, PDF documents, etc. Text needs to be extracted from these sources. I am using PDF documents as the only knowledge source for this prototype.

          1. Text Extraction: To extract the text from a PDF, various Python libraries can be used, such as PyPDF2, pdf2txt, PDFMiner, etc. If the PDF is scanned, libraries such as unstructured, pdf2image, and pytesseract can be used. The quality of the text can be improved by cleanups such as removing extraneous characters, fixing formatting issues, normalizing whitespace, special characters, and punctuation, and spell checking. Language detection may also be required if the knowledge sources contain text in multiple languages, or if a single document mixes languages.
          2. Handling Multiple Pages: Maintaining the context across pages is important. It is recommended that the document be segmented into logical units, such as paragraphs or sections, to preserve the context. Extracting metadata such as document titles, authors, page numbers, creation dates, etc., is crucial for improving searchability and answering user queries.
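As an illustration of the cleanup step, here is a small sketch using only the standard library. The specific rules (re-joining words hyphenated across line breaks, collapsing whitespace, treating form feeds as page breaks) are my assumptions about what typical PDF text needs. They are not part of the prototype's code.

```python
import re

def clean_text(raw):
    """Normalise text extracted from a PDF before chunking."""
    text = raw.replace("\x0c", "\n")                # form feed = page break
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)    # re-join hyphenated line breaks
    text = re.sub(r"[ \t]+", " ", text)             # collapse runs of spaces/tabs
    text = re.sub(r" ?\n ?", "\n", text)            # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)          # at most one blank line
    return text.strip()

raw = "Demo-\ncracy   is a form\x0c\n\n\nof government. "
print(repr(clean_text(raw)))
```

Paragraph breaks (blank lines) are preserved on purpose, since they are useful boundaries when the text is later split into chunks.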

          Fig2: RAG data ingestion pipeline

          Note: I have manually downloaded the PDFs of all the chapters of the book “Democratic Politics” of class IX of the NCERT curriculum. These PDFs will be the knowledge source for our application.

          4.1 Implementation step by step

          Step 1: Install the necessary libraries

          1. pip install pdfminer.six
          2. pip install langchain-ollama
          3. pip install langchain langchain-community
          4. pip install "qdrant-client[fastembed]"

          Imports:

          from langchain_community.document_loaders import PDFMinerLoader
          from langchain.text_splitter import CharacterTextSplitter
          from qdrant_client import QdrantClient

          Step 2: Load the pdf file and extract the text from it

          loader = PDFMinerLoader(path + "/" + file_name)
          pdf_content = loader.load()

          Step 3: Split the text into smaller chunks with overlap

          CHUNK_SIZE = 1000 # chunk size not greater than 1000 chars
          CHUNK_OVERLAP = 30 # a bit of overlap is required for continued context
          
          text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
          docs = text_splitter.split_documents(pdf_content)
          
          # Make a list of split docs
          documents = []
          for doc in docs:
              documents.append(doc.page_content)
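To see what chunk size and overlap mean, here is a simplified fixed-size sliding-window splitter in plain Python. It is not how CharacterTextSplitter works internally (that class splits on a separator and then merges pieces); it only illustrates how consecutive chunks share some characters.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-size chunks; consecutive chunks share `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):   # the last chunk reached the end
            break
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk starts with the last character of the previous one, so a sentence cut at a chunk boundary still appears with some of its surrounding context.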

          Step 4: Embed and store the documents in the vector database

          FastEmbed is a lightweight, fast Python library built for embedding generation. The Qdrant Python client's add() and query() methods use FastEmbed to embed the text, so no separate embedding step is needed. The following snippet inserts the chunks into the vector database.

          # 3. Create vectordatabase(qdrant) client 
          qdrant_client = QdrantClient(url="http://localhost:6333")
          
          # 4. Add document chunks in vectordb
          qdrant_client.add(
              collection_name="ix-sst-ncert-democratic-politics",
              documents=documents,
              #metadata=metadata,
              #ids=ids
          )

          Step 5: Making a sample query

          # 5. Make a query from the vectordb(qdrant)
          search_results = qdrant_client.query(
              collection_name="ix-sst-ncert-democratic-politics",
              query_text="What is democracy?"
          )
          
          for search_result in search_results:
              print(search_result.document, search_result.score)

          4.2 Complete Code data_ingestion.py

          ###############################################################
          # Data ingestion pipeline 
          # 1. Taking the input pdf file
          # 2. Extracting the content
          # 3. Divide into chunks
          # 4. Use embeddings model to convert to the embedding vector
          # 5. Store the embedding vectors to the qdrant (vector database)
          ################################################################
          import os
          from langchain_community.document_loaders import PDFMinerLoader
          from langchain.text_splitter import CharacterTextSplitter
          from qdrant_client import QdrantClient
          
          path = "ix-sst-ncert-democratic-politics"
          filenames = next(os.walk(path))[2]
          
          for i, file_name in enumerate(filenames):
              print(f"Data ingestion for the chapter: {i}")
          
              # 1. Load the pdf document and extract text from it
              loader = PDFMinerLoader(path + "/" + file_name)
              pdf_content = loader.load()
              print(pdf_content)
          
              # 2. Split the text into small chunks
              CHUNK_SIZE = 1000 # chunk size not greater than 1000 chars
              CHUNK_OVERLAP = 30 # a bit of overlap is required for continued context
          
              text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
              docs = text_splitter.split_documents(pdf_content)
          
              # Make a list of split docs
              documents = []
              for doc in docs:
                  documents.append(doc.page_content)
          
              # 3. Create vectordatabase(qdrant) client 
              qdrant_client = QdrantClient(url="http://localhost:6333")
          
              # 4. Add document chunks in vectordb
              qdrant_client.add(
                  collection_name="ix-sst-ncert-democratic-politics",
                  documents=documents,
                  #metadata=metadata,
                  #ids=ids
              )
          
              # 5. Make a query from the vectordb(qdrant)
              search_results = qdrant_client.query(
                  collection_name="ix-sst-ncert-democratic-politics",
                  query_text="What is democracy?"
              )
          
              for search_result in search_results:
                  print(search_result.document, search_result.score)

          Qdrant Dashboard can be accessed at: http://localhost:6333/dashboard

          5. RAG (Information Retrieval and Generation) – Audio Bot

          I am making an audio bot that will answer questions from the chapters of the book “Democratic Politics” of class IX of the NCERT (India) curriculum. If you want to learn about making an audio bot, you can read my article “Making a talking bot using Llama3.2:1b running on Raspberry Pi 4 Model-B 4GB”.

          5.1 Audio Bot Implementation

          The following diagram depicts the overall flow of the audio bot and how it interacts with the RAG system. The user speaks into the microphone, which captures the speech signal and passes it to the speech-to-text library (I am using faster_whisper). The resulting text is passed to the RAG system as the query. The response text from the RAG system is passed to the text-to-speech library (I am using pyttsx3), which converts it to audio that is played through the speaker.
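The flow can be written down as one function that wires the four stages together. In the sketch below the stages are passed in as plain callables, and the stub lambdas are placeholders of my own, so the control flow can be read and tested without a microphone, a model, or a speaker.

```python
def run_audio_bot_turn(record_audio, speech_to_text, answer_query, text_to_speech):
    """One question/answer turn: microphone -> text -> RAG -> speech."""
    audio = record_audio()              # capture speech from the microphone
    query = speech_to_text(audio)       # e.g. faster-whisper transcription
    response = answer_query(query)      # RAG: retrieve context, ask the LLM
    text_to_speech(response)            # e.g. pyttsx3 playback
    return query, response

# Stub stages, for illustration only
spoken = []
query, response = run_audio_bot_turn(
    record_audio=lambda: b"fake-audio-bytes",
    speech_to_text=lambda audio: "What is democracy?",
    answer_query=lambda q: f"(answer to: {q})",
    text_to_speech=spoken.append,
)
print(query, response)
```

Keeping the stages separate makes it easy to swap one library for another, for example a different speech-to-text engine, without touching the rest of the loop.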

          5.1.1 Recording Audio from Microphone

          I have a detailed blog on this topic. Please refer: How to Record, Save and Play Audio in Python?

          5.1.2 Speech-to-Text

          faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, which is a fast inference engine for Transformer models.

          Installation: pip install faster-whisper

          Save the following code in a Python file, say "speech-to-text.py", and run python speech-to-text.py

          from faster_whisper import WhisperModel
          
          model_size = "small.en"
          model = WhisperModel(model_size, device="cpu", 
                               compute_type="int8")
          
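          # Transcribe: returns a lazy generator of segments plus transcription info
          segments, info = model.transcribe(
              audio="basic_output1.wav",
              language="en",
          )
          
          # Concatenate the text of every segment (not only the last one)
          seg_text = ''
          for segment in segments:
              seg_text += segment.text
          
          print(seg_text)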

          Sample input audio file:

          Output text: “Please ask me something. I’m listening now”

          5.1.3 Text-to-Speech

          “pyttsx3” is an offline text-to-speech library that works well on resource-constrained devices.

          Installation: pip install pyttsx3

          Save the following code in a Python file, say "text-to-speech.py", and run python text-to-speech.py

          Code Snippet:

          import pyttsx3
          
          engine = pyttsx3.init()
          engine.setProperty('volume', 1)   # volume between 0.0 and 1.0
          engine.setProperty('rate', 130)   # speaking rate in words per minute
          voices = engine.getProperty('voices')
          if len(voices) > 1:
              # Use the second installed voice, if the system has one
              engine.setProperty('voice', voices[1].id)
          # 'english+f3' is an eSpeak voice name (Linux); it overrides the choice above
          engine.setProperty('voice', 'english+f3')
          text_to_speak = "I got your question. Please bear " \
              "with me while I retrieve the answer."
          engine.say(text_to_speak)
          # Following is the optional line: If you want 
          # also to save audio file
          engine.save_to_file(text_to_speak, 'speech.wav') 
          engine.runAndWait()

          Sample input text: “I got your question. Please bear with me while I retrieve the answer.”

          5.1.4 Play Audio on the Speaker

          I have a detailed blog on this topic. Please refer: How to Record, Save and Play Audio in Python?

          5.2 RAG (Generation) using LLM (Llama3.2)

          The following code snippet queries Qdrant, retrieves the relevant documents, and keeps only those whose similarity score is at least 0.7.

          search_results = qdrant_client.query(
              collection_name="ix-sst-ncert-democratic-politics",
              query_text=query
          )
          contaxt = ""
          for search_result in search_results:
              # Keep only the chunks whose similarity score clears the threshold
              if search_result.score >= 0.7:
                  contaxt += search_result.document + "\n"

          The following code snippet creates a template, which is used to build the prompt, and a reference to the Llama model. The prompt and model are combined into a chain (a LangChain pipeline for executing the query), and the chain is invoked to get the response that the LLM forms from the retrieved context.

          # 4. Using LLM for forming the answer
          template = """Instruction: {instruction}
          Contaxt: {contaxt}
          Query: {query}
          """
          
          prompt = ChatPromptTemplate.from_template(template)
          
          model = OllamaLLM(model="llama3.2") # Using llama3.2 as llm model
          
          chain = prompt | model
          
          bot_response = chain.invoke({"instruction": "Answer the question based on the context below. If you cannot answer the question with the given context, answer with \"I don't know.\"", 
                  "contaxt": contaxt,
                  "query": query
                })
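To see what the retrieval and prompt steps above produce, they can be sketched in plain Python without Qdrant or Ollama. This is only an illustration under assumptions: build_context and render_prompt are hypothetical helpers, the search hits are stand-in objects with placeholder text, and str.format merely approximates the text that LangChain fills into the template.

```python
from types import SimpleNamespace

TEMPLATE = """Instruction: {instruction}
Context: {context}
Query: {query}
"""


def build_context(search_results, min_score=0.7, max_docs=2):
    """Join the documents of the best-scoring hits into one context string."""
    selected = [r.document for r in search_results if r.score >= min_score]
    return "\n".join(selected[:max_docs])


def render_prompt(instruction, context, query):
    """Fill the template the same way the placeholders are named in the post."""
    return TEMPLATE.format(instruction=instruction, context=context, query=query)


# Stand-in search hits (placeholder text and scores, not real book content)
hits = [
    SimpleNamespace(document="passage A", score=0.86),
    SimpleNamespace(document="passage B", score=0.74),
    SimpleNamespace(document="passage C", score=0.41),
]

context = build_context(hits)
prompt_text = render_prompt(
    "Answer the question based on the context below.",
    context,
    "What is democracy?",
)
print(prompt_text)
```

With these stand-in hits, only the first two passages clear the 0.7 threshold, so the printed prompt contains "passage A" and "passage B" but not "passage C".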

          5.3 Complete Code audiobot.py

          The following is the complete code for the audio bot. Save the file as audiobot.py

          import pyaudio
          import wave
          import pyttsx3
          from qdrant_client import QdrantClient
          from langchain_ollama.llms import OllamaLLM
          from langchain_core.prompts import ChatPromptTemplate
          from faster_whisper import WhisperModel
          
          # Load the Speech to Text Model (faster-whisper: pip install faster-whisper)
          whishper_model_size = "small.en"
          whishper_model = WhisperModel(whishper_model_size, device="cpu", 
                               compute_type="int8")
          
          CHUNK = 512 
          FORMAT = pyaudio.paInt16 #paInt8
          CHANNELS = 1
          RATE = 44100 #sample rate
          RECORD_SECONDS = 7
          WAVE_OUTPUT_FILENAME = "pyaudio-output.wav"
          
          def speak(text_to_speak):
              engine = pyttsx3.init()
              engine.setProperty('volume', 1)
              engine.setProperty('rate', 130)
              voices = engine.getProperty('voices')
              engine.setProperty('voice', voices[1].id)
              engine.setProperty('voice', 'english+f3')
              engine.say(text_to_speak)
              engine.runAndWait()
          
          speak("I am an AI bot. I have learned the book \"democratic politics\" of class 9 published by N C E R T. You can ask me questions from this book.")
          
          while True:
              speak("I am listening now for your question.")
          
              p = pyaudio.PyAudio()
              stream = p.open(format=FORMAT,
                              channels=CHANNELS,
                              rate=RATE,
                              input=True,
                              frames_per_buffer=CHUNK) #buffer
          
              print("* recording")
              frames = []
          
              for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
                  data = stream.read(CHUNK)
                  frames.append(data) # 2 bytes(16 bits) per channel
          
              stream.stop_stream()
              stream.close()
              p.terminate()
          
              wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
              wf.setnchannels(CHANNELS)
              wf.setsampwidth(p.get_sample_size(FORMAT))
              wf.setframerate(RATE)
              wf.writeframes(b''.join(frames))
              wf.close()
          
              # Transcribe
              transcription = whishper_model.transcribe(
                  audio=WAVE_OUTPUT_FILENAME,
                  language="en",
              )
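              # transcribe() returns (segments, info); join the text of all
              # segments instead of keeping only the last one
              seg_text = ''
              for segment in transcription[0]:
                  seg_text += segment.text
              seg_text = seg_text.strip()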
              
              print(f'\nUser: {seg_text}')
          
              if seg_text == '':
                  speak("Probably you did not say anything.")
                  continue
              else:
                  text_to_speak = "I got your question. Please bear with me " \
                      + "while I retrieve the answer."
                  speak(text_to_speak)
          
              # 1. Create vector database(qdrant) client 
              qdrant_client = QdrantClient(url="http://localhost:6333")
          
              # 2. Make a query to the vectordb (qdrant)
              #query = "explain democracy in estonia?"
              query = seg_text
          
              search_results = qdrant_client.query(
                  collection_name="ix-sst-ncert-democratic-politics",
                  query_text=query
              )
          
              context = ""
              no_of_docs = 2
              count = 1
              for search_result in search_results:
                  if search_result.score >= 0.8:
                      print(f"Retrieved document: {search_result.document}, Similarity score: {search_result.score}")
                      # Separate documents with a newline so they do not run together
                      context = context + search_result.document + "\n"
                  if count >= no_of_docs:
                      break
                  count = count + 1
          
              #print(f"Retrieved Context: {context}")
          
              if context == "":
                  print("Context is blank. Could not find any relevant information in the given sources.")
                  speak("I did not find anything in the book about the question.")
                  continue
          
              # 4. Using LLM for forming the answer
              template = """Instruction: {instruction}
              Context: {context}
              Query: {query}
              """
          
              prompt = ChatPromptTemplate.from_template(template)
          
              model = OllamaLLM(model="llama3.2") # Using llama3.2 as llm model
          
              chain = prompt | model
          
              bot_response = chain.invoke({"instruction": "Answer the question based on the context below. If you cannot answer the question with the given context, answer with \"I don't know.\"", 
                          "context": context,
                          "query": query
                          })
          
              print(f'\nBot: {bot_response}')
           
              speak(bot_response)
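The recording loop in audiobot.py reads the microphone in fixed-size chunks, so the recorded length is slightly shorter than RECORD_SECONDS. The following worked arithmetic uses the same constants as the code above; SAMPLE_WIDTH_BYTES is an assumed name for the 2-byte width of paInt16 samples.

```python
CHUNK = 512             # frames per read, as in audiobot.py
RATE = 44100            # sample rate in Hz
RECORD_SECONDS = 7
CHANNELS = 1
SAMPLE_WIDTH_BYTES = 2  # paInt16 stores each sample in 16 bits

# Number of stream.read(CHUNK) calls made by the recording loop
num_reads = int(RATE / CHUNK * RECORD_SECONDS)

# Frames actually captured, and how many seconds of audio that is
total_frames = num_reads * CHUNK
duration_seconds = total_frames / RATE

# Size of the raw audio data written to the WAV file (header excluded)
audio_bytes = total_frames * CHANNELS * SAMPLE_WIDTH_BYTES

print(num_reads)                   # 602
print(total_frames)                # 308224
print(round(duration_seconds, 3))  # 6.989
print(audio_bytes)                 # 616448
```

Because int() truncates 602.93 to 602 reads, roughly 0.01 seconds at the end of the 7-second window are not recorded, which is negligible for a spoken question.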

          6. Libraries Used

          The following libraries are used in the prototype implementation. They can be installed with pip.

          1. qdrant-client
          2. pdfminer.six
          3. fastembed
          4. pyaudio
          5. pyttsx3
          6. langchain-ollama
          7. faster-whisper
          8. langchain-community
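Before running the bot, it can help to check that these libraries are importable in the active environment. The following is a small sketch using only the standard library; find_missing is a hypothetical helper, and the import names are the module names under which the packages listed above are imported.

```python
import importlib.util

# Import names of the libraries listed above
# (pdfminer.six is imported as "pdfminer")
REQUIRED_MODULES = [
    "qdrant_client",
    "pdfminer",
    "fastembed",
    "pyaudio",
    "pyttsx3",
    "langchain_ollama",
    "faster_whisper",
    "langchain_community",
]


def find_missing(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```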

          7. Steps

          1. Create a virtual environment: python -m venv env1
          2. Activate the environment. On Windows, run env1\Scripts\activate. On Linux, the activate script is in the bin directory of the environment.
          3. python -m pip install -r requirements.txt
          4. python data_ingestion.py
          5. python audiobot.py

          8. My Conversation with the Audio Bot

          9. Further Improvement

          In the current prototype, the documents are split into fixed-length chunks, with CHUNK_SIZE = 1000 and CHUNK_OVERLAP = 30. As a further improvement, the documents can be split into logical units, such as paragraphs or sections, so that each chunk preserves its context better.
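As an illustration of this idea, the following is a minimal sketch of paragraph-based chunking in plain Python. It is not the method used in data_ingestion.py: the function name split_into_paragraph_chunks, the assumption that paragraphs are separated by blank lines, and the sample text are all chosen for illustration.

```python
import re


def split_into_paragraph_chunks(text, max_chars=1000):
    """Pack whole paragraphs into chunks of at most max_chars characters.

    Paragraphs are blocks of text separated by blank lines. A single
    paragraph longer than max_chars is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if not current or len(candidate) <= max_chars:
            # Either the chunk is empty or the paragraph still fits
            current = candidate
        else:
            # The paragraph does not fit: close the chunk and start a new one
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for chunk in split_into_paragraph_chunks(sample, max_chars=40):
    print(repr(chunk))
```

With max_chars=40, the first two paragraphs fit together in one chunk and the third starts a new chunk, so no paragraph is split in the middle.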

          10. References

          1. A Practical Approach to Retrieval Augmented Generation Systems by Mehdi Allahyari and Angelina Yang
          2. Install, run, and access Llama using Ollama
          3. How to Record, Save, and Play Audio in Python?
          4. Making a talking bot using Llama3.2:1b running on Raspberry Pi 4 Model-B 4GB
          5. fastembed library
          6. Qdrant
          7. pyttsx3