Text clustering is an unsupervised approach that helps discover patterns in data. The goal of text clustering is to group similar texts according to their semantic content, meaning, and relationships. It makes it possible to organise vast amounts of unstructured text and perform exploratory data analysis quickly. With recent advancements in large language models (LLMs), we can obtain highly accurate contextual and semantic representations of text, which has improved the efficacy of text clustering even further. Use cases for text clustering include identifying outliers, accelerating labelling, finding mislabelled data, and more.
Topic modelling facilitates the identification of (abstract) topics that arise in large collections of textual data. It is a way of giving meaning to clusters of text documents.
In this article, we will learn how to use embedding models for text clustering and then look at BERTopic, a text-clustering-inspired method of topic modelling, including how to generate a topic label with an LLM given the topic's keywords.
2. Pipeline for Text Clustering
The following diagram depicts the pipeline for text clustering that consists of the following three steps:
Use an embedding model to transform the input documents into embeddings.
Using a dimensionality reduction model, lower the dimensionality of embeddings.
Use a cluster model to identify groups of documents that share semantic similarities.
2.1 Embedding Documents
We know that embeddings are numerical representations of text that attempt to capture its meaning. In the first step, we use an embedding model that is optimized for semantic similarity tasks to transform the documents into embeddings. The Massive Text Embedding Benchmark (MTEB) leaderboard can be used to select an embedding model that fits our requirements. For example, “thenlper/gte-small” is a small but performant model with fast inference.
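As a minimal sketch of this step (assuming the sentence-transformers library and a small list of example documents named docs; replace it with your own corpus):

from sentence_transformers import SentenceTransformer

docs = [
    "The stock market rallied after the earnings report.",
    "The team won the championship in the final minutes.",
    "The new vaccine showed promising results in trials.",
]  # toy example corpus

# Load an embedding model optimized for semantic similarity tasks
embedding_model = SentenceTransformer("thenlper/gte-small")

# Transform each document into a dense embedding vector
embeddings = embedding_model.encode(docs, show_progress_bar=True)
print(embeddings.shape)  # (number of documents, embedding dimension)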
2.2 Dimensionality Reduction of Embeddings
Clustering techniques struggle to identify meaningful clusters when the dimensionality of the data is high. Dimensionality reduction techniques find low-dimensional representations that preserve the global structure of the data; they act as compression and do not remove dimensions arbitrarily. Popular algorithms are Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). Compared to PCA, UMAP is better at handling nonlinear relationships and structures.
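A minimal sketch of this step with the umap-learn package (assuming the embeddings array from the previous step):

from umap import UMAP

# Reduce the embeddings to 5 dimensions; n_neighbors and min_dist trade off
# local versus global structure preservation
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, metric="cosine", random_state=42)
reduced_embeddings = umap_model.fit_transform(embeddings)
print(reduced_embeddings.shape)  # (number of documents, 5)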
2.3 Clustering the Reduced Embeddings
The following diagram depicts the methods of text clustering:
Density-based clustering algorithms determine the number of clusters themselves and do not force every data point into a cluster; points that do not belong to any cluster are marked as outliers. Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a hierarchical variation of the DBSCAN algorithm that specializes in finding dense (micro-)clusters without being told the number of clusters in advance.
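A minimal sketch of this step with the hdbscan package (assuming the reduced_embeddings array from the previous step):

from hdbscan import HDBSCAN

# min_cluster_size controls the smallest group that still counts as a cluster
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean", cluster_selection_method="eom")
clusters = hdbscan_model.fit_predict(reduced_embeddings)

# Data points labelled -1 are outliers that were not assigned to any cluster
print(set(clusters))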
3. Text Clustering to Topic Modeling
Topic modelling is the process of identifying themes or hidden topics in a collection of textual data. Traditionally, it involves finding a group of keywords or phrases that best represent and capture the essence of a topic; the meaning of the topic is understood through these keywords or phrases. Latent Dirichlet Allocation (LDA) is one such algorithm. In the following sections, we discuss BERTopic, a highly modular text clustering and topic modeling framework.
3.1 BERTopic: A Modular Topic Modeling Framework
Topic modeling builds on the three steps of text clustering: the output of the third (clustering) step is fed into the fourth step. The following diagram depicts the 4th and 5th steps, which are specific to topic modeling:
The 4th step calculates the class-based term frequency, i.e., the frequency (tf) of word X in cluster C. This term is then multiplied by the IDF (inverse document frequency) in the 5th step. The goal is to give more weight to words that are frequent within a cluster and less weight to words that appear across all clusters.
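For reference, the class-based TF-IDF weight used in BERTopic can be written as w_{x,c} = tf_{x,c} · log(1 + A / f_x), where tf_{x,c} is the frequency of word x in cluster c, f_x is the frequency of word x across all clusters, and A is the average number of words per cluster (this is the formulation given in the BERTopic paper).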
The following diagram depicts the full pipeline from clustering to topic modeling. Although topic modeling follows clustering, the two are largely independent of each other, and each component is modular: BERTopic can be customised, and we can choose another algorithm in place of any of the defaults.
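A minimal sketch of this modular pipeline with the bertopic package (the sub-models below are illustrative choices rather than the only options, and docs is the document list from the clustering sketches above):

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Every component can be swapped for another algorithm
embedding_model = SentenceTransformer("thenlper/gte-small")
umap_model = UMAP(n_components=5, min_dist=0.0, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean")

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True,
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())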
3.2 Reranking in BERTopic
c-TF-IDF does not take semantic structure into account, so BERTopic leverages representation models (e.g., KeyBERTInspired, Maximal Marginal Relevance (MMR), spaCy) to rerank the topic keywords found in the 5th step discussed above. This reranking is applied to each topic rather than to every document, and many of the representation models are LLMs. With this step, the final pipeline extends to become the following:
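A minimal sketch of adding such a reranking step (assuming the topic_model and docs from the previous sketch):

from bertopic.representation import KeyBERTInspired

# Rerank each topic's c-TF-IDF keywords with a KeyBERT-like strategy
representation_model = KeyBERTInspired()
topic_model.update_topics(docs, representation_model=representation_model)
print(topic_model.get_topic(0))  # inspect the reranked keywords of topic 0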
3.3 Using LLM to generate a label for Topic
The following diagram explains how a topic's keywords and representative documents, along with a prompt, can be passed to an LLM to generate a label for the topic.
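A minimal sketch of this idea using BERTopic's text-generation representation (assuming the topic_model and docs from the earlier sketches; the [KEYWORDS] and [DOCUMENTS] placeholders are filled in by BERTopic with each topic's keywords and representative documents, and the generation model named here is only an example):

from transformers import pipeline
from bertopic.representation import TextGeneration

prompt = """I have a topic described by the following keywords: [KEYWORDS].
The following documents are a small but representative subset of the topic:
[DOCUMENTS]
Based on the information above, give a short label for this topic."""

# Any text-generation pipeline (or a hosted LLM) can be plugged in here
generator = pipeline("text2text-generation", model="google/flan-t5-small")
representation_model = TextGeneration(generator, prompt=prompt)
topic_model.update_topics(docs, representation_model=representation_model)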
Final pipeline is as follows:
Final detailed pipeline:
References
O'Reilly – Hands-On Large Language Models: Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
A common task in natural language processing (NLP) is text classification. Use cases of text classification include sentiment analysis, intent detection, entity extraction, and language detection. This article delves into how to use LLMs for text classification, covering both representation models and generative models. Under representation models, we will see how to use task-specific models and embedding models to classify text; under generative models, we will look at open-source and closed-source models. While both generative and representation models can be applied to classification, they take different approaches.
2. Text Classification with Representation Models
Task-specific models and embedding models are two types of representation models that can be used for text classification. To obtain task-specific models, representation models, like bidirectional encoder representations from transformers (BERT), are trained for a particular task, like sentiment analysis. On the other hand, general-purpose models such as embedding models can be applied to a range of tasks outside classification, such as semantic search.
As can be seen in the diagram below, representation models are kept frozen (untrainable) when used for text classification. Because a task-specific model is trained specifically for the given task, it can classify input text directly. When using an embedding model, however, we first generate embeddings for the texts in the training set and then train a classifier on the dataset of embeddings and their corresponding labels. Once the classifier is trained, it can be used for classification.
3. Model Selection
The factors to consider when selecting a model for text classification include:
How does it fit the use case?
What is its language capability?
What is the underlying architecture?
What is the size of the model?
How is the performance? etc.
Underlying Architecture
BERT is an encoder-only architecture and is a popular choice for creating task-specific models and embedding models, and falls into the category of representation models. Generative Pre-trained Transformer (GPT) is a decoder-only architecture that falls into the generative models category. Encoder-only models are normally small in size. Variations of BERT are RoBERTa, DistilBERT, DeBERTa, etc. For task-specific use cases such as sentiment analysis, Twitter-RoBERTa-base can be a good starting point. For embedding models sentence-transformers/all-mpnet-base-v2 can be a good starting point as this is a small but performant model.
4. Text Classification using Task-specific Models
This is pretty straightforward: the text is passed to a tokenizer that splits it into tokens, and these tokens are consumed by the task-specific model, which predicts the class of the text.
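A minimal sketch using the Hugging Face pipeline API (the Twitter-RoBERTa sentiment checkpoint named here is one concrete example of a task-specific model):

from transformers import pipeline

# A representation model fine-tuned specifically for sentiment analysis
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)
print(classifier("The plot was predictable, but the acting saved the movie."))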
This works well if we can find a task-specific model for our use case. Otherwise, if we have to fine-tune a model ourselves, we need to check whether we have a sufficient budget (time, cost) for it. Another option is to resort to using general-purpose embedding models.
5. Text Classification using Embedding Models
Rather than directly using a task-specific representation model for classification, we can generate features with an embedding model. These features can then be used to train a classifier such as logistic regression.
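A minimal sketch of this approach (assuming train_texts/train_labels and test_texts/test_labels are already available):

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Turn the raw texts into fixed-size feature vectors
X_train = embedding_model.encode(train_texts)
X_test = embedding_model.encode(test_texts)

# Train a lightweight classifier on top of the frozen embeddings
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)
print(classification_report(test_labels, clf.predict(X_test)))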
5.1 What if we do not have the labeled data?
If we have definitions of the labels but no labeled data, we can utilize what is called “zero-shot classification”. A zero-shot model predicts the labels of input text even though it was not trained on them. The following diagram depicts the concept.
We can use zero-shot classification using embeddings. We can describe our labels based on what they should represent. The following diagram describes the concept.
To assign a label to an input text/document, we can calculate the cosine similarity between its embedding and the label embeddings to check which label it is closest to.
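A minimal sketch of this embedding-based zero-shot approach (assuming the embedding_model from the previous sketch and a list of input texts named texts):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Describe what each label should represent
label_descriptions = ["A negative movie review", "A positive movie review"]
label_embeddings = embedding_model.encode(label_descriptions)
text_embeddings = embedding_model.encode(texts)

# Assign each text to the label whose description it is closest to
similarities = cosine_similarity(text_embeddings, label_embeddings)
predictions = np.argmax(similarities, axis=1)  # 0 = negative, 1 = positive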
6. Text Classification with Generative Models
Generative models are trained for a wide variety of tasks, so they do not work for text classification out of the box. To make a generative model understand our context, we need to use prompt engineering: the prompt must be written carefully so that the model understands what it is expected to do, what the candidate labels are, and so on.
6.1 Using Text-to-Text Transfer Model (T5)
The following diagram summarizes the different categories of the models:
The following diagram depicts the training steps:
We need to prefix each document with the prompt “Is the following sentence positive or negative?“
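A minimal sketch of this prompting style (the publicly available Flan-T5 checkpoint used here is an illustrative stand-in for whichever T5-family model you choose):

from transformers import pipeline

t5_pipe = pipeline("text2text-generation", model="google/flan-t5-small")

document = "The film was a waste of two hours."
prompt = f"Is the following sentence positive or negative? {document}"
print(t5_pipe(prompt)[0]["generated_text"])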
6.2 ChatGPT for Classification
The following diagram describes the training procedure of ChatGPT:
The model is trained using human preference data to generate text that resembles human preference.
For text classification, following is the sample prompt:
prompt = """Predict whether the following document is a positive or negative movie review:
[DOCUMENT]
If it is positive return 1 and if it is negative return 0. Do not give any other answers. """
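A minimal sketch of sending this prompt to a hosted chat model with the OpenAI Python client (reusing the prompt string defined above; the model name is illustrative, and an API key is assumed to be configured in the environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

document = "A touching story with brilliant performances."
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt.replace("[DOCUMENT]", document)}],
)
print(response.choices[0].message.content)  # expected: "1" or "0"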
References
O'Reilly – Hands-On Large Language Models: Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
LLM inference and serving refer to the process of deploying large language models and making them accessible for use — whether locally for personal projects or in production for large-scale applications. Depending on your needs, you may opt for a lightweight local deployment or a robust, enterprise-grade solution. The right choice depends on factors like performance, scalability, latency, and infrastructure availability.
2. LLM Inference & Serving Architecture
The following is an overview of the architecture for LLM inference and serving. It represents how an application interacts with a deployed large language model (LLM) to generate predictions (inference), and highlights the role of inference servers, inference engines, and the hardware layer.
Figure: Typical architecture of LLM Inference and Serving
2.1. Application Layer
The Application is the client (e.g., a chatbot, API consumer, or internal tool) that sends requests to the inference server.
These requests are typically made over HTTP or gRPC for low-latency, high-performance communication.
Example: A frontend UI sends a user’s prompt to the server for processing.
2.2 Inference Server
The Inference Server acts as the bridge between the application and the model. It has three main responsibilities, as shown in the diagram:
2.2.1 Query Queue Scheduler
Purpose: Manages incoming queries in a queue to avoid overwhelming the model.
Function:
Receives requests from the application.
Places them in a queue.
Uses a scheduler to decide when and how to process them.
Batching Opportunity: The scheduler can group multiple requests together into a batch, improving GPU utilisation and throughput.
2.2.2. Metrics Module
Purpose: Collects real-time statistics on system performance.
Metrics Tracked:
Throughput: Tokens generated per second.
Latency: Total response time.
TTFT (Time to First Token): Time from request to first token generated.
Resource utilisation (GPU/CPU/memory).
This data is essential for monitoring, scaling, and debugging.
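As a toy illustration of how these metrics can be computed from a token stream (the stream below is a stand-in for a real model's streaming generator):

import time

def measure_generation(stream):
    # Measure TTFT and throughput for an iterable that yields generated tokens
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_time - start           # time to first token (seconds)
    throughput = n_tokens / (end - start)     # tokens per second
    return ttft, throughput

def toy_stream():
    # Stand-in for a streaming LLM response
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)
        yield token

print(measure_generation(toy_stream()))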
2.3. Inference Engine
The Inference Engine is the core computation unit that runs the LLM.
2.3.1 Batching
Groups multiple queued queries into a single execution batch on the GPU.
Benefits:
Reduces overhead from individual GPU calls.
Improves parallel processing efficiency.
2.3.2 Model Execution
Runs the LLM model itself.
The model takes the batched input and generates output tokens.
Can utilise optimisations such as:
KV Caching for faster token generation.
Quantisation for memory efficiency.
Speculative Decoding for speed.
2.3.3 Query Response
Gathers model outputs.
Splits them back into individual responses for each original request.
Sends results back to the application over HTTP/gRPC.
2.4 Hardware Layer
GPU/CPU hardware actually runs the model computation.
For LLMs, GPUs (often with large VRAM) are preferred for:
Parallel processing of large tensor computations.
Efficient handling of multi-batch workloads.
CPUs can be used for smaller models or less latency-sensitive tasks.
2.5. Workflow Summary
Application sends HTTP/gRPC request with prompt.
Query Queue Scheduler stores and batches incoming requests.
Batching Module groups requests for efficient GPU execution.
Model generates predictions (tokens).
Query Response sends formatted results back.
Metrics Module continuously tracks performance.
Results return to the application.
3. Evaluation of LLM Inference and Serving
When evaluating LLM inference (the process of generating outputs from a trained large language model) and LLM serving (the infrastructure and software that delivers model predictions to end users), the two primary performance metrics are:
Throughput – Measures the total volume of output tokens generated per second.
Example: An LLM serving system producing 2,000 tokens/sec can handle more concurrent requests or generate longer responses faster than one producing 500 tokens/sec.
High throughput is critical for scenarios like batch inference, multi-user chatbots, or real-time content generation in high-traffic applications.
Throughput depends on factors such as:
Model size and architecture (e.g., LLaMA vs. GPT-style transformers).
GPU/TPU hardware capabilities and memory bandwidth.
Request batching efficiency.
Quantisation and weight compression techniques.
Latency – Measures how quickly a model responds to an individual request.
Key metric: Time to First Token (TTFT) – the delay between receiving a prompt and starting to produce output.
After TTFT, token generation latency is often measured as milliseconds per token (ms/token).
Low latency is especially important for interactive applications like real-time chat, virtual assistants, and code autocompletion.
Latency can be influenced by:
Model loading time (cold start vs. warm start).
Network communication overhead.
Prompt length (longer prompts mean longer context processing time).
3.1 Optimisation Strategies for High Throughput and Low Latency
LLM inference engines and serving frameworks focus heavily on memory utilisation and computational efficiency in production:
Model Optimisation
Quantisation – Reduce precision (e.g., FP16 → INT8 or 4-bit) to speed up inference with minimal accuracy loss.
Pruning – Remove redundant weights to reduce model size.
Speculative Decoding – Use a smaller model to “guess” future tokens and confirm with the main model.
LoRA / PEFT – Use parameter-efficient fine-tuning to avoid reloading huge models.
Serving Architecture
Request Batching – Combine multiple user queries into a single forward pass for better GPU utilisation.
Pipeline Parallelism – Split model layers across multiple GPUs.
Tensor Parallelism – Split the tensor computations across devices.
KV Cache Reuse – Store intermediate attention key-value pairs to avoid recomputation in autoregressive decoding.
Infrastructure-Level Improvements
Use GPUs with high VRAM and bandwidth (e.g., NVIDIA A100/H100).
Place inference servers close to end users to reduce network latency (edge inference).
Use asynchronous request handling and efficient scheduling.
3.2 Beyond Throughput and Latency – Additional Considerations
While throughput and latency form the backbone of LLM performance evaluation, real-world deployments often require additional criteria to ensure stability, scalability, and cost-effectiveness.
3.2.1 Scalability
What it means: The ability of the inference infrastructure to handle sudden spikes in traffic without sacrificing speed or accuracy.
Why it matters:
LLM-based customer support systems may experience traffic surges during product launches.
AI coding assistants can see unpredictable query bursts during hackathons or exams.
How it’s achieved:
Auto-scaling mechanisms in Kubernetes (HPA/VPA) or serverless GPU backends.
Load balancing across multiple GPU/TPU nodes.
Model sharding for extremely large models (e.g., Megatron-LM, DeepSpeed ZeRO-3).
3.2.2 Cost Efficiency
What it means: Delivering optimal performance per dollar spent, especially important in pay-per-token or per-hour GPU rental models.
Why it matters:
Cloud GPU instances (A100, H100) are expensive; inefficient deployments can burn budgets fast.
Inference for large models may cost more than fine-tuning if poorly optimized.
Strategies:
Use quantization (e.g., INT8, FP16) to reduce GPU memory usage and increase batch size.
Employ dynamic batching to process multiple requests simultaneously.
Choose spot/preemptible GPU instances for non-critical workloads.
3.2.3 Ease of Deployment
What it means: How quickly and reliably the LLM stack can be set up and updated.
Why it matters:
Shorter deployment cycles reduce time-to-market.
DevOps teams prefer infrastructure that integrates into existing CI/CD pipelines.
Implementation best practices:
Package inference servers using Docker.
Deploy using Helm charts for Kubernetes clusters.
Integrate with GitHub Actions / GitLab CI for automated rollouts.
3.2.4 Fault Tolerance & Reliability
What it means: The ability of the system to keep running even if one or more nodes fail.
Why it matters:
LLM applications like healthcare assistants or financial chatbots can’t afford downtime.
Techniques:
Redundant model replicas with active-active failover.
Checkpointing model states so recovery is quick.
Health checks and graceful degradation (e.g., fall back to a smaller, local model if GPU fails).
3.2.5 Multi-Model Support
What it means: Running different LLMs (or different versions of the same LLM) simultaneously.
Why it matters:
Some applications may require domain-specific models alongside general-purpose ones.
Allows A/B testing for performance evaluation before production rollout.
Examples:
vLLM and Triton Inference Server can host multiple models and route requests accordingly.
3.2.6 Security & Compliance
What it means: Protecting data in transit and ensuring compliance with legal and organizational standards.
Why it matters:
LLMs often process sensitive data (e.g., PII, financial records, medical notes).
Non-compliance with regulations like GDPR, HIPAA, or SOC 2 can lead to heavy penalties.
Security Measures:
TLS encryption for all API calls.
Role-based access control (RBAC) and API key authentication.
Audit logs for every request to track usage.
4. Prominent Products
When selecting an LLM inference and serving framework, the decision often hinges on performance, scalability, hardware compatibility, and ease of integration with existing workflows. Below is an expanded look at some of the most prominent solutions in the space.
Origin & Background: Developed at the Sky Computing Lab, UC Berkeley, vLLM quickly gained popularity for its advanced scheduling and memory management optimizations.
Key Strengths:
PagedAttention: An innovative attention mechanism for faster inference and reduced memory footprint.
Supports continuous batching, making it ideal for serving multiple requests with minimal latency spikes.
Enterprise Adoption: Neural Magic’s nm-vllm (post Red Hat acquisition) adds enterprise-grade optimizations like quantization, CPU acceleration, and Kubernetes deployment tooling.
6.1 Key-Value (KV) Caching
Key-value caching is a technique used in transformer models where the key and value matrices computed in earlier decoding steps are stored for reuse during subsequent token generation (a toy illustration of the reuse pattern follows the list below).
Benefits: Reduces redundant computation, leading to faster inference times.
Trade-offs: Increased memory consumption, especially for long context windows.
Optimizations:
Cache Invalidation – removing unused portions of the cache when switching contexts.
Cache Reuse – sharing parts of the cache between similar prompts or multi-turn conversations.
Quantized Cache – storing KV cache in lower precision (e.g., FP16, INT8) to save memory.
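The following deliberately simplified, framework-free sketch shows only the reuse pattern (no real attention math): each decoding step computes key/value entries just for the tokens that have not been seen before.

# Toy illustration of KV caching
def compute_kv(token):
    # Stand-in for the expensive per-token key/value projection
    return (hash(token) % 97, hash(token) % 89)

def decode_step(tokens, kv_cache):
    # Only compute K/V for tokens that are not in the cache yet
    for token in tokens[len(kv_cache):]:
        kv_cache.append(compute_kv(token))
    # A real model would now attend over kv_cache to predict the next token
    return f"<token-after-{tokens[-1]}>"

kv_cache = []
sequence = ["The", "cat"]
for _ in range(3):
    sequence.append(decode_step(sequence, kv_cache))

print(sequence)
print(f"Cached K/V entries: {len(kv_cache)}")  # one per token processed as context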
6.2 PagedAttention
A memory management strategy for KV cache inspired by virtual memory paging in operating systems. Instead of storing keys and values contiguously, they are stored in fixed-size memory pages, allowing flexible allocation and reuse of GPU memory.
Advantages:
Efficient use of GPU VRAM.
Avoids large contiguous memory allocations.
Implementations: Used in libraries like vLLM to enable very large batch sizes.
6.3 Batching
Combining multiple inference requests into a single forward pass to improve GPU utilization and throughput.
Types:
Static Batching – fixed batch size; efficient but may introduce latency.
Dynamic Batching – requests are grouped on the fly based on arrival time and sequence length.
Key Libraries: Hugging Face TGI, vLLM, TensorRT-LLM, Ray Serve.
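A minimal, framework-free sketch of dynamic batching (the model call and request objects are toy stand-ins; production servers such as vLLM or TGI implement far more sophisticated schedulers):

import queue
import threading
import time

request_queue = queue.Queue()
MAX_BATCH_SIZE = 8
MAX_WAIT_SECONDS = 0.05

def model_forward(prompts):
    # Stand-in for a single batched forward pass on the GPU
    return [f"response to: {p}" for p in prompts]

def batching_loop():
    while True:
        batch = [request_queue.get()]  # block until at least one request arrives
        deadline = time.time() + MAX_WAIT_SECONDS
        # Keep adding requests until the batch is full or the wait budget is spent
        while len(batch) < MAX_BATCH_SIZE and time.time() < deadline:
            try:
                batch.append(request_queue.get(timeout=max(deadline - time.time(), 1e-3)))
            except queue.Empty:
                break
        outputs = model_forward([prompt for prompt, _ in batch])
        for (_, result_holder), output in zip(batch, outputs):
            result_holder.append(output)  # hand the result back to the caller

threading.Thread(target=batching_loop, daemon=True).start()

# Simulate a few concurrent requests
holders = []
for prompt in ["hello", "what is RAG?", "translate to French: cat"]:
    holder = []
    holders.append(holder)
    request_queue.put((prompt, holder))

time.sleep(0.2)
print(holders)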
6.4 Support for Quantisation
Reducing model precision to decrease memory footprint and increase inference speed.
6.5 LoRA (Low-Rank Adaptation)
A fine-tuning technique that freezes pre-trained weights and injects small trainable rank decomposition matrices into transformer layers. This drastically reduces the number of trainable parameters for downstream tasks, making fine-tuning cost-efficient.
6.6 Tool Calling / Function Calling
Allows an LLM to invoke external APIs or tools when it needs information it was not trained on, or to take actions in the real world.
Example: An LLM calling a weather API when asked “What’s the weather in Mumbai right now?”.
6.7 Reasoning Models
Models optimized for multi-step problem solving with intermediate reasoning chains.
6.8 Structured Outputs
Ensuring the model generates responses in a strict format. Commonly used tools include:
Outlines – constrained generation that restricts output to regular expressions, JSON Schemas, or context-free grammars.
LMFE (Language Model Format Enforcer) – enforces output to match JSON Schema, regex, or XML.
xgrammar – flexible grammar-based generation.
6.9 Automatic Prefix Caching (APC)
Reuses cached prefix computations for similar queries, reducing token processing time for repeated or partially overlapping prompts.
6.10 Speculative Decoding
A technique where a smaller, faster “draft” model generates candidate tokens, and the larger main model only verifies and finalizes them—reducing latency significantly.
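The following toy sketch shows only the accept-until-mismatch control flow (greedy verification with integer "tokens"; real implementations verify draft tokens against the target model's distribution in a single batched pass):

# Toy greedy speculative decoding
def draft_next(context):       # stand-in for the small, fast draft model
    return context[-1] + 1

def target_next(context):      # stand-in for the large target model
    return context[-1] + 1 if len(context) % 5 else context[-1] + 2

def speculative_step(context, k=4):
    # 1. The draft model cheaply proposes k tokens
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))
    # 2. The target model verifies the proposals position by position
    accepted = []
    for token in proposed:
        expected = target_next(context + accepted)
        if token == expected:
            accepted.append(token)     # agreement: keep the drafted token
        else:
            accepted.append(expected)  # disagreement: take the target's token and stop
            break
    return context + accepted

context = [0]
for _ in range(3):
    context = speculative_step(context)
print(context)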
6.11 Chunked Prefill
Splitting long input sequences into manageable chunks for faster prefill operations without overwhelming GPU memory.
6.12 Prompt Adapter
A lightweight fine-tuning approach where small adapter layers are trained to inject task-specific knowledge into a base LLM without retraining the entire model.
6.13 Beam Search
A decoding strategy that keeps track of multiple candidate sequences at each generation step, selecting the most probable one at the end.
6.14 Guided Decoding
Constrained generation to follow specific patterns, constraints, or external logic. Useful for generating SQL queries, code, or structured data.
6.15 AsyncLM
Enables asynchronous processing, allowing the LLM to generate and execute multiple function calls or tasks concurrently—reducing idle GPU time.
6.16 Prompt Logprobs
Logarithmic probability values for each generated token, useful for evaluating model confidence and detecting hallucinations.
6.17 KServe
A standardized, serverless ML inference platform for Kubernetes. Supports scaling, canary deployments, and integrates with GPU/TPU backends.
6.18 KubeAI
An AI inference operator for Kubernetes that simplifies deployment of LLMs, VLMs, embedding models, and speech-to-text pipelines.
6.19 Llama Stack
A composable set of tools, APIs, and services designed for building applications with Meta’s LLaMA models.
6.20 Additional Serving & Inference Terms
Continuous Batching – Dynamically merging new requests into ongoing batches for maximum throughput (used by vLLM).
Request Scheduling – Prioritizing inference requests to meet SLAs for latency-sensitive workloads.
Token Parallelism – Splitting token generation across multiple GPUs to improve throughput.
Pipeline Parallelism – Splitting the model layers across multiple GPUs.
Tensor Parallelism – Splitting individual tensors across GPUs for large model inference.
MoE (Mixture of Experts) – Activating only a subset of model parameters per token to reduce compute cost.
FlashAttention – An optimized attention algorithm that reduces memory usage and speeds up computation.
vLLM – A high-performance inference engine with PagedAttention and continuous batching for serving large language models efficiently.
TensorRT-LLM – NVIDIA’s optimized LLM serving library with quantization, fused kernels, and multi-GPU support.
Serving Gateway – A request router/load balancer for distributing LLM inference requests across multiple workers.
7. References
Best LLM Inference Engines and Servers to Deploy LLMs in Production – Overview of popular inference backends.
Efficient Memory Management for Large Language Model Serving with PagedAttention – Core memory optimization paper behind vLLM.
LoRA: Low-Rank Adaptation of Large Language Models – Efficient fine-tuning approach for LLMs.
Fast Inference from Transformers via Speculative Decoding – Reducing token generation latency.
Looking Back at Speculative Decoding – Retrospective analysis of speculative decoding trade-offs.
Efficient Generative Large Language Model Serving – Practical techniques for faster inference.
Ten Ways to Serve Large Language Models: A Comprehensive Guide – High-level serving strategies.
The 6 Best LLM Tools To Run Models Locally – Lightweight deployment options.
Transformers Key-Value Caching Explained – How KV caching accelerates LLMs.
LLM Inference Series: 3. KV Caching Explained – Deep dive on caching internals.
vLLM and PagedAttention: A Comprehensive Overview – End-to-end guide to vLLM.
Understanding Reasoning LLMs – Reasoning capabilities in inference.
8. Further Reads
DeepSpeed-MII: High-Throughput and Low-Latency Inference for Transformers – Microsoft's optimized inference stack.
Ray Serve for Distributed LLM Inference – Scaling LLM inference across nodes.
Serving Multiple Models Efficiently with NVIDIA Triton – Multi-model and GPU scheduling strategies.
FlashAttention: Fast and Memory-Efficient Attention – Key innovation for attention speedups.
SGLang: Structured Generation for Large Language Models – Efficient structured text generation.
Quantization-Aware LLM Serving with GPTQ – Speed + memory optimizations through quantization.
MII vs vLLM vs HuggingFace Transformers Benchmarks – Comparative analysis of popular inference engines.
Accelerating LLM Inference with TensorRT-LLM – NVIDIA's low-level optimization library.
Dynamic Batching for LLM Inference – Improving throughput without hurting latency.
LLMOps: Operational Challenges and Best Practices – Managing LLM inference in production environments.
Speculative Beam Search for Faster LLM Inference – Combining speculative decoding with beam search.
Serving LLMs with Kubernetes and KServe – Cloud-native deployment approaches.
Following are the questions that should be running through your mind when choosing between competing open-source products:
Are there any other users out there?
Is it the most popular in this category?
Is this technology in decline?
The popularity and traction of GitHub projects can be inferred from their star histories. You can use star-history.com to compare projects on this basis. Refer to the tutorial for details.
Learn about Large Language Models (LLMs), their installation, access through HTTP API, the Ollama framework, etc.
Introduction of Retrieval-Augmented Generation (RAG)
Learn about the Data Ingestion Pipeline for the Qdrant vector database
Learn about the RAG Pipeline
Access the prototype using audio-based input/output (audio bot).
Audio bot using speech-to-text and text-to-speech
Making Qdrant client for making queries
Creating context using documents retrieved from the Qdrant database
Using Llama 3.2 as a Large Language Model (LLM) using Ollama and Langchain framework
Create a prompt template using instruction and context, and make a query using the template and langchain
Using Llama to generate the answer to the question using the given context
2. Large Language Model
A Large Language Model (LLM) is an artificial intelligence (AI) model with billions of parameters, trained on huge amounts of data to comprehend human-generated text and to produce language comparable to that generated by humans. LLMs learn linguistic structures, relationships, and patterns from human-generated data. They also gain a huge amount of internal knowledge (in the form of model weights) from the data on which they have been trained.
Transformer architecture is often the foundation of large language models, allowing them to
process sequential data, which makes them suitable for tasks like text generation, translation, and question-answering
learn contextual relationships such as word meanings, syntax, and semantics.
generate human-like language, i.e., produce coherent, context-specific text that resembles human-generated content
Some key characteristics of large language models include:
Trained on vast amounts of data
Scalability: They can handle long sequences and complex tasks.
Parallel processing
Figure 1
LLMs are divided into three categories: 1) encoder only, 2) decoder only, and 3) encoder-decoder models. Examples of LLMs are BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-to-Text Transfer Transformer), respectively. General use cases are shown in the Figure 1 above.
Major concerns regarding LLMs are:
Data bias: LLMs have the potential to reinforce biases found in the training data.
Interpretability: It can be difficult to comprehend the decision-making process of an LLM.
Security: Adversarial attacks can target LLMs.
2.1 Llama
Llama is a popular open-source LLM family from Meta (Facebook). Being open source, we can fine-tune, distill, deploy, and use it in our own use cases (provided the above-mentioned concerns are taken care of). Llama is a decoder-only language model, which means that it uses a transformer architecture with only decoder layers to generate text.
At the time of writing this article, current versions are available in three flavors:
Llama 3.1: Multilingual, available in 8B (8 billion parameters), 70B, and 405B (405 billion parameters) versions
Llama 3.2: Lightweight and Multimodal and available in 1) Lightweight 1B and 3B 2) Multimodal 11B and 90B
Llama 3.3: Multilingual with 70B parameters
In the current prototype, I have used Llama 3.2.
2.2 Install, run, and different ways to access Llama
3. Retrieval-Augmented Generation (RAG)
Large language models are locked in time: they have learned only the knowledge that was available up to the time they were trained and released. These models are trained on Internet-scale open data, so when you ask general questions they give very good answers, but they may fail to answer, or may hallucinate, when you ask about something very specific to your personal or enterprise data. The reason is that they usually do not have the right context for requirements that are specific to your application.
Retrieval-augmented generation (RAG) combines the strength of generative AI and retrieval techniques. It helps in providing the right context for the LLM along with the question being asked. This way we get the LLM to generate more accurate and relevant content. This is a cost-effective way to improve the output of LLMs without retraining them. The following diagram depicts the RAG architecture:
Fig1: Conceptual flow of using RAG with LLM
There are two primary components of the RAG:
Retrieval: This component is responsible for searching and retrieving relevant information related to the user’s query from various knowledge sources such as documents, articles, databases, etc.
Generation: This component does an excellent job of crafting coherent and contextually rich responses to user queries.
A question submitted by the user is routed to the retrieval component. Using the embedding model, the retrieval component converts the query text into an embedding vector. It then searches the vector database to locate a small number of vectors that match the query text and satisfy the threshold requirements for the similarity score and distance metric. These vectors are transformed back into text and used as the context. This context, along with the prompt and query, is put into the prompt template and sent to the LLM. The LLM returns generated text that is more correct and relevant to the user's query.
4. RAG (Data Ingestion Pipeline)
For the retrieval component to have a searchable index of preprocessed data, we must first build a data ingestion pipeline. The following diagram (Fig 2) depicts the data ingestion pipeline. Knowledge sources can be web pages, text documents, PDF documents, etc., and text needs to be extracted from these sources. I am using PDF documents as the only knowledge source for this prototype.
Text Extraction: To extract text from a PDF, various Python libraries can be used, such as PyPDF2, pdf2txt, PDFMiner, etc. If the PDF is a scanned document, libraries such as unstructured, pdf2image, and pytesseract can be utilized. The quality of the text can be maintained by performing cleanups such as removing extraneous characters, fixing formatting issues, whitespace, special characters and punctuation, spell checking, etc. Language detection may also be required if the knowledge sources contain text in multiple languages, or if a single document contains multiple languages.
Handling Multiple Pages: Maintaining the context across pages is important. It is recommended that the document be segmented into logical units, such as paragraphs or sections, to preserve the context. Extracting metadata such as document titles, authors, page numbers, creation dates, etc., is crucial for improving searchability and answering user queries.
Fig2: RAG data ingestion pipeline
Note: I have manually downloaded the PDFs of all the chapters of the book “Democratic Politics” of class IX of the NCERT curriculum. These PDFs will be the knowledge source for our application.
4.1 Implementation step by step
Step 1: Install the necessary libraries
pip install pdfminer.six
pip install langchain langchain-community langchain-ollama
pip install "qdrant-client[fastembed]"
(langchain-community provides PDFMinerLoader, and the fastembed extra of qdrant-client enables the add()/query() convenience methods used below.)
Imports:
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient
Step 2: Load the pdf file and extract the text from it
Step 3: Split the text into smaller chunks with overlap
CHUNK_SIZE = 1000  # chunk size not greater than 1000 chars
CHUNK_OVERLAP = 30  # a bit of overlap is required for continued context
text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
docs = text_splitter.split_documents(pdf_content)

# Make a list of split docs
documents = []
for doc in docs:
    documents.append(doc.page_content)
Step 4: Embed and store the documents in the vector database
FastEmbed is a lightweight, fast Python library built for embedding generation; the Qdrant client uses it by default when documents are added with add(). Following is the code snippet for querying the vector database (the add() call that inserts the chunks appears in the complete code below).
# 5. Make a query from the vectordb (qdrant)
search_results = qdrant_client.query(
    collection_name="ix-sst-ncert-democratic-politics",
    query_text="What is democracy?"
)
for search_result in search_results:
    print(search_result.document, search_result.score)
4.2 Complete Code data_ingestion.py
################################################################
# Data ingestion pipeline
# 1. Taking the input pdf file
# 2. Extracting the content
# 3. Divide into chunks
# 4. Use embeddings model to convert to the embedding vector
# 5. Store the embedding vectors to the qdrant (vector database)
################################################################
import os
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient

path = "ix-sst-ncert-democratic-politics"
filenames = next(os.walk(path))[2]

for i, file_name in enumerate(filenames):
    print(f"Data ingestion for the chapter: {i}")

    # 1. Load the pdf document and extract text from it
    loader = PDFMinerLoader(path + "/" + file_name)
    pdf_content = loader.load()
    print(pdf_content)

    # 2. Split the text into small chunks
    CHUNK_SIZE = 1000  # chunk size not greater than 1000 chars
    CHUNK_OVERLAP = 30  # a bit of overlap is required for continued context
    text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
    docs = text_splitter.split_documents(pdf_content)

    # Make a list of split docs
    documents = []
    for doc in docs:
        documents.append(doc.page_content)

    # 3. Create vector database (qdrant) client
    qdrant_client = QdrantClient(url="http://localhost:6333")

    # 4. Add document chunks in vectordb
    qdrant_client.add(
        collection_name="ix-sst-ncert-democratic-politics",
        documents=documents,
        # metadata=metadata,
        # ids=ids
    )

    # 5. Make a query from the vectordb (qdrant)
    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text="What is democracy?"
    )
    for search_result in search_results:
        print(search_result.document, search_result.score)
5. RAG (Information Retrieval and Generation) – Audio Bot
I am making an audio bot that will answer questions from the chapters of the book “Democratic Politics” of class IX of the NCERT(India) curriculum. If you want to learn about making an audio bot, you can read my article on the topic “Making a talking bot using Llama3.2:1b running on Raspberry Pi 4 Model-B 4GB“.
5.1 Audio Bot Implementation
The following diagram depicts the overall flow of the audio bot and how it interacts with the RAG system. A user interacts with the audio bot through the microphone. The microphone captures the speech audio and passes it to the speech-to-text library (I am using faster-whisper), which converts it into a text query that is passed to the RAG system. When the RAG system comes up with the response text, the text is passed to the text-to-speech library (I am using pyttsx3), which converts it to audio that is played on the speaker so the user can listen to the response.
5.1.2 Speech-to-Text
faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models.
Installation: pip install faster-whisper
Save the following code in a Python file, say speech-to-text.py, and run python speech-to-text.py
from faster_whisper import WhisperModel

model_size = "small.en"
model = WhisperModel(model_size, device="cpu", compute_type="int8")

# Transcribe
transcription = model.transcribe(
    audio="basic_output1.wav",
    language="en",
)
seg_text = ''
for segment in transcription[0]:
    seg_text += segment.text  # accumulate the text of all segments
print(seg_text)
Sample input audio file:
Output text: “Please ask me something. I’m listening now”
5.1.3 Text-to-Speech
The best offline text-to-speech library that works on resource-constrained devices is “pyttsx3“.
Installation:pip install pyttsx3
Save the following code in a Python file say "text-to-speech.py" and run python text-to-speech.py
Code Snippet:
import pyttsx3

engine = pyttsx3.init()
engine.setProperty('volume', 1)
engine.setProperty('rate', 130)
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)
engine.setProperty('voice', 'english+f3')

text_to_speak = "I got your question. Please bear " \
                "with me while I retrieve the answer."
engine.say(text_to_speak)

# Following is an optional line: use it if you also
# want to save the audio to a file
engine.save_to_file(text_to_speak, 'speech.wav')
engine.runAndWait()
Sample input text: “I got your question. Please bear with me while I retrieve the answer.”
The following code snippet creates a template, builds the prompt from the template, creates a reference to the Llama model, builds a chain (a LangChain pipeline for executing the query) from the prompt and the model, and finally invokes the chain to execute the query and get the response formed by the LLM using the retrieved context.
# 4. Using LLM for forming the answer
template = """Instruction: {instruction}
Context: {context}
Query: {query}
"""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="llama3.2")  # Using llama3.2 as llm model
chain = prompt | model
bot_response = chain.invoke({
    "instruction": "Answer the question based on the context below. If you cannot answer the question with the given context, answer with \"I don't know.\"",
    "context": context,
    "query": query
})
5.3 Complete Code audiobot.py
Following is the code snippet for the audio bot. Save the file as audiobot.py
import pyaudio
import wave
import pyttsx3
from qdrant_client import QdrantClient
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
from faster_whisper import WhisperModel

# Load the Speech to Text Model (faster-whisper: pip install faster-whisper)
whishper_model_size = "small.en"
whishper_model = WhisperModel(whishper_model_size, device="cpu", compute_type="int8")

CHUNK = 512
FORMAT = pyaudio.paInt16  # paInt8
CHANNELS = 1
RATE = 44100  # sample rate
RECORD_SECONDS = 7
WAVE_OUTPUT_FILENAME = "pyaudio-output.wav"

def speak(text_to_speak):
    engine = pyttsx3.init()
    engine.setProperty('volume', 1)
    engine.setProperty('rate', 130)
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[1].id)
    engine.setProperty('voice', 'english+f3')
    engine.say(text_to_speak)
    engine.runAndWait()

speak("I am an AI bot. I have learned the book \"democratic politics\" of class 9 published by N C E R T. "
      "You can ask me questions from this book.")

while True:
    speak("I am listening now for your question.")

    # Record the user's question from the microphone
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True,
                    frames_per_buffer=CHUNK)  # buffer
    print("* recording")
    frames = []
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)  # 2 bytes (16 bits) per channel
    stream.stop_stream()
    stream.close()
    p.terminate()

    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

    # Transcribe
    transcription = whishper_model.transcribe(
        audio=WAVE_OUTPUT_FILENAME,
        language="en",
    )
    seg_text = ''
    for segment in transcription[0]:
        seg_text += segment.text  # accumulate the text of all segments
    print(f'\nUser: {seg_text}')

    if seg_text == '':
        speak("Probably you did not say anything.")
        continue
    else:
        text_to_speak = "I got your question. Please bear with me " \
                        + "while I retrieve the answer."
        speak(text_to_speak)

    # 1. Create vector database (qdrant) client
    qdrant_client = QdrantClient(url="http://localhost:6333")

    # 2. Make a query to the vectordb (qdrant)
    # query = "explain democracy in estonia?"
    query = seg_text
    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text=query
    )

    # 3. Build the context from the retrieved documents
    context = ""
    no_of_docs = 2
    count = 1
    for search_result in search_results:
        if search_result.score >= 0.8:
            print(f"Retrieved document: {search_result.document}, Similarity score: {search_result.score}")
            context = context + search_result.document
        if count >= no_of_docs:
            break
        count = count + 1
    # print(f"Retrieved Context: {context}")

    if context == "":
        print("Context is blank. Could not find any relevant information in the given sources.")
        speak("I did not find anything in the book about the question.")
        continue

    # 4. Using LLM for forming the answer
    template = """Instruction: {instruction}
    Context: {context}
    Query: {query}
    """
    prompt = ChatPromptTemplate.from_template(template)
    model = OllamaLLM(model="llama3.2")  # Using llama3.2 as llm model
    chain = prompt | model
    bot_response = chain.invoke({
        "instruction": "Answer the question based on the context below. If you cannot answer the question with the given context, answer with \"I don't know.\"",
        "context": context,
        "query": query
    })
    print(f'\nBot: {bot_response}')
    speak(bot_response)
6. Libraries Used
Following is the list of libraries used in the prototype implementation (these can be installed with pip): pdfminer.six, langchain, langchain-community, langchain-ollama, qdrant-client (with the fastembed extra), faster-whisper, pyttsx3, and pyaudio.
Create a Python virtual environment (e.g., env1) and activate it: on Windows, run env1\Scripts\activate; on Linux, the activate script is in the env1/bin directory.
7. Running the Code
python -m pip install -r requirements.txt
python data_ingestion.py
python audiobot.py
8. My Conversation with the Audio Bot
9. Further Improvement
In the current prototype, the chunk size is fixed (CHUNK_SIZE = 1000, CHUNK_OVERLAP = 30). As a further improvement, the document could be split into logical units, such as paragraphs or sections, to maintain better context.
10. References
A Practical Approach to Retrieval Augmented Generation Systems by Mehdi Allahyari and Angelina Yang
Install, run, and access Llama using Ollama. – link
How to Record, Save, and Play Audio in Python? – link
Making a talking bot using Llama3.2:1b running on Raspberry Pi 4 Model-B 4GB – link
In this article, I describe my experiment in making a talking bot using the large language model Llama3.2:1b and running it successfully on a Raspberry Pi 4 Model B with 4 GB of RAM. Llama3.2:1b is the 1-billion-parameter version of Meta's (Facebook's) Llama 3.2 model, which can be run in quantized form on resource-constrained devices. I have kept this bot primarily in question-answering mode to keep things simple. The bot is supposed to answer any question that llama3.2:1b can answer from the knowledge learned in the model. The objective is to run this completely offline without needing the Internet.
My Setup
The following picture describes my setup, which consists of a Raspberry Pi to host the LLM (llama3.2:1b), a mic for asking questions, and a pair of speakers to play the answers from the bot. I used the Internet while doing the installation, but the bot itself works in offline mode.
Following is the overall design explaining the end-to-end flow.
The user asks the question in the external microphone connected to the Raspberry Pi. This audio signal captured by the microphone is converted to text using a speech-to-text library. Text is sent to the Llama model running on the Raspberry Pi. The Llama model answers the question in the form of text that is sent to the text-to-speech library. The output of the text-to-speech is audio that is played and can be listened to by the user on the speaker.
Following are the steps of the setup:
Install, run, and access Llama
Installation and accessing Speech-to-text library
Installation of text-to-speech library
Record, Save, and Play audio
Running the code (the complete code)
1. Install, run, and how to access llama using API
The Llama model is the core of this bot product. So before we move further, this should be installed and running. Please refer to the separate post on the topic, “Install, run, and access Llama using Ollama“. This post also describes the details of how to access the running model using the API.
2. Installation of speech-to-text library and how to use
I tried many speech-to-text libraries and finally settled on “faster-whisper”. faster-whisper is a reimplementation of OpenAI's Whisper model built with CTranslate2, a fast inference engine for Transformer models. The performance of this library on the Raspberry Pi was also satisfactory, and it works offline.
Installation: pip install faster-whisper
Save the following code in a Python file, say speech-to-text.py, and run python speech-to-text.py
Code Snippet:
from faster_whisper import WhisperModel

model_size = "small.en"
model = WhisperModel(model_size, device="cpu", compute_type="int8")

# Transcribe
transcription = model.transcribe(
    audio="basic_output1.wav",
    language="en",
)
seg_text = ''
for segment in transcription[0]:
    seg_text += segment.text  # accumulate the text of all segments
print(seg_text)
Sample input audio file:
Output text: “Please ask me something. I’m listening now”
3. Installation of text-to-speech library and how to use
The best offline text-to-speech library that works on resource-constrained devices is “pyttsx3“.
Installation:pip install pyttsx3
Save the following code in a Python file say "text-to-speech.py" and run python text-to-speech.py
Code Snippet:
import pyttsx3

engine = pyttsx3.init()
engine.setProperty('volume', 1)
engine.setProperty('rate', 130)
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)
engine.setProperty('voice', 'english+f3')

text_to_speak = "I got your question. Please bear " \
                "with me while I retrieve the answer."
engine.say(text_to_speak)

# Following is an optional line: use it if you also
# want to save the audio to a file
engine.save_to_file(text_to_speak, 'speech.wav')
engine.runAndWait()
Sample input text: “I got your question. Please bear with me while I retrieve the answer.”
How do you install the Llama model using the Ollama framework?
Running Llama models
Different ways to access the Ollama model
Access the deployed models using Page Assist plugin in the Web Browsers
Access the Llama model using HTTP API in Python Language
Access the Llama model using the Langchain Library
1. What is Ollama?
Ollama is an open-source tool/framework that facilitates users in running large language models (LLMs) on their local computers, such as PCs, edge devices like Raspberry Pi, etc.
2. How to install it?
Visit https://ollama.com/download. You can download and install it based on your PC’s OS, such as Mac, Linux, and Windows.
3. Running Llama 3.2
Four versions of the Llama 3.2 models are available: 1B, 3B, 11B, and 90B. ‘B’ indicates billions of parameters; for example, 1B means the model has 1 billion parameters. The 1B and 3B versions are text-only, whereas 11B and 90B are multimodal (text and images).
Run the 1B model: ollama run llama3.2:1b
Run the 3B model: ollama run llama3.2
After running these models on the terminal, we can interact with the model using the terminal.
4. Access the deployed models using Web Browsers
Page Assist is an open-source browser extension that provides a sidebar and web UI for your local AI model. It allows you to interact with your model from any webpage.
5. Access the Llama model using HTTP API in Python Language
Before running the following code, you should have completed steps 1 and 2 above; that is, you have installed Ollama and you are running the Llama model under the Ollama environment. When you run Llama under Ollama, it provides access to the model in two ways: 1) through the command line and 2) through an HTTP API on port 11434. The following Python code accesses the model through the HTTP API.
import json
import requests

data = {}
data["model"] = "llama3.2:1b"
data["stream"] = False
data["prompt"] = "What is Newton's law of motion?" + " Answer in short."

# Send to chatbot
r = requests.post('http://127.0.0.1:11434/api/generate', json=data)
response_data = r.json()

# Print user and bot messages
print(f'\nUser: {data["prompt"]}')
bot_response = response_data['response']
print(f'\nBot: {bot_response}')
6. Access the Llama model using the Langchain Library
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate

query = "What is Newton's law of motion?"

template = """Instruction: {instruction}
Query: {query}
"""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="llama3.2")  # Using llama3.2 as llm model
chain = prompt | model
bot_response = chain.invoke({
    "instruction": "Answer the question. If you cannot answer the question, answer with \"I don't know.\"",
    "query": query
})

print(f'\nUser: {query}')
print(f'\nBot: {bot_response}')
7. Running Ollama for remote access
By default, the Ollama service runs locally and is not accessible remotely. To make Ollama remotely accessible, we need to set the following environment variables:
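For example (based on Ollama's commonly documented settings; confirm against the documentation of your installed version), OLLAMA_HOST makes the server listen on all network interfaces instead of only localhost, and OLLAMA_ORIGINS controls which origins may call the API:

OLLAMA_HOST=0.0.0.0:11434 (listen on all interfaces, default port 11434)
OLLAMA_ORIGINS=* (optional: allow requests from any origin)

Export these variables before starting ollama serve (or add them to the Ollama service configuration), restart Ollama, and then access it from another machine at http://<server-ip>:11434.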