An ability is emergent if it is not present in smaller models but is present in larger models. [1]
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. Emergent abilities, by contrast, cannot be predicted simply by extrapolating the performance of smaller models. This raises the question of whether additional scaling could further expand the range of capabilities of language models. [1]
Today’s language models have been scaled primarily along three factors:
amount of computation,
number of model parameters, and
training data size
The following table lists the emergent abilities of large language models and the scale at which abilities emerge. [1]
Tasks that language models cannot currently do are prime candidates for future emergence; for instance, there are dozens of tasks in BIG-Bench[3] for which even the largest GPT-3 and PaLM models do not achieve above-random performance. [1] Just as abilities can emerge with scale, risks can emerge too, for example around truthfulness, bias, and toxicity in LLMs, backdoor vulnerabilities, inadvertent deception, or harmful content synthesis.
However, Rylan Schaeffer et al. claim in their paper [3] that the sudden appearance of emergent abilities is merely a consequence of how researchers measure LLM performance. The article “How Quickly Do Large Language Models Learn Unexpected Skills?” by Stephen Ornes [4] neatly summarises both papers.
A prompt is the text passed to a GenAI model as input. Given the prompt, the model responds with generated text. A prompt can consist of questions, statements, or instructions. Prompt engineering is the practice of designing prompts to improve the generated output. It is also 1) a tool to evaluate the output of the model and 2) a tool for safety mitigation methods. There is no perfect prompt design; prompt optimization and experimentation are done iteratively. Figure 1 below depicts a very basic example of a prompt.
Figure 1: A basic example of the prompt
Controlling Model Output by Adjusting Model Parameters
The temperature and top_p parameters control the randomness of the output. Before a large language model (LLM) generates a token, it assigns a likelihood to every candidate token: some tokens are highly likely, others much less so. For these two parameters to take effect, the do_sample parameter should be set to True, i.e. do_sample=True. This means we allow the next token to be sampled from the set of likely tokens instead of always taking the single most likely one.
temperature defines how likely the model is to choose less probable tokens. With temperature=0 the model generates the same response every time, because it always chooses the most likely token. With higher values of temperature, the model gives less likely tokens a greater chance of being selected as well; as the temperature grows, the candidate tokens approach being equally likely. For example, a value of 0.8 produces more diverse output, whereas a value of 0.2 produces more focused, nearly deterministic output. So we can say that temperature induces stochastic behaviour.
top_p, also known as nucleus sampling, allows the model to sample from only a subset of the candidate tokens: tokens are considered in order of decreasing probability, and sampling stops as soon as their cumulative probability reaches the value of top_p. A value of 1 means all tokens are considered.
The top_k parameter restricts sampling to exactly the k most likely tokens, where k is its value.
We set these parameters based on the requirements of the use case, finding the right balance between random/diverse and deterministic/focused/coherent outputs. The sketch below shows how these parameters can be set in code.
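The following is a minimal sketch, not taken from the book, of how these sampling parameters can be passed to Hugging Face transformers; the model name "gpt2" and the prompt are arbitrary example choices.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # enable sampling so temperature/top_p/top_k take effect
    temperature=0.8,   # higher value -> more diverse output
    top_p=0.9,         # nucleus sampling: keep tokens until cumulative probability reaches 0.9
    top_k=50,          # consider only the 50 most likely tokens
    max_new_tokens=40,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))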
Instruction-Based Prompting
Providing a large language model (LLM) with clear, specific, and structured instructions to guide its response is referred to as instruction-based prompting. The most basic prompt consists of two components:
the instruction itself and
the data required for the instruction.
The following diagram depicts a basic instruction prompt. Note the instruction and data parts of the prompt.
Figure 2: Instruction Prompt
To make the model very specific about the output, for example when we want the output to be either “positive” or “negative”, we can use output indicators. The following diagram depicts an instruction prompt with output indicators.
Figure 3: Instruction prompt with output indicators
Different types of tasks require different formats of the prompt. The following diagram illustrates example formats for summarization, classification, and named-entity recognition.
Figure 4: Prompt format for summarization, classification and NER task
The following is a non-exhaustive list of prompting techniques for improving the quality of the output.
Specificity: Accurately describe what you want to achieve.
Hallucination: LLMs can generate incorrect information with high confidence, which is called hallucination. To mitigate this, we can tell the model to respond with “I don’t know” if it does not know the answer.
Order: Either begin or end your prompt with the instruction. LLMs tend to focus more on the two ends of the prompt (beginning: primacy effect; end: recency effect) and tend to lose track of the middle of a long prompt.
As we saw above, common components of prompts are instruction, data, and output indicators. However, prompts are not limited to these components; we can build up a prompt that is as complex as we want. Other common components are
Persona
Instruction
Context
Format
Audience
Tone
Data
The following is an example from the reference book that uses the above prompt components. This example demonstrates the modular nature of prompting. We can experiment by adding or removing components to see the effect.
Figure 5: Example of prompt showing use of the various components.
In-Context Learning – Providing examples
Giving examples to the LLM greatly influences its output. This is referred to as in-context learning. Zero-shot prompting uses no examples, one-shot prompting uses one example, and few-shot prompting uses two or more examples. The following diagram illustrates examples of in-context learning.
Figure 6: Examples of in-context learning
When giving examples, the user and the assistant turns should be clearly differentiated by marking the role as user or assistant. Examples let us describe the task to the model more clearly, but the model can still ignore the instruction because of random sampling. The sketch below shows few-shot prompting with explicit roles.
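The following is a minimal sketch, assuming a local Ollama server with the llama3.2 model (the same setup used later in this series); the sentiment examples themselves are made up.

import ollama

messages = [
    {"role": "user", "content": "Classify the sentiment: 'The movie was fantastic!'"},
    {"role": "assistant", "content": "positive"},   # example 1
    {"role": "user", "content": "Classify the sentiment: 'I wasted two hours.'"},
    {"role": "assistant", "content": "negative"},   # example 2
    {"role": "user", "content": "Classify the sentiment: 'A truly moving story.'"},
]
response = ollama.chat(model="llama3.2", messages=messages)
print(response["message"]["content"])   # expected: "positive"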
Chain Prompting: Breaking up the Problem
We already know that we can break a prompt into modular components to enhance the output of LLMs. The next level of this strategy is to break the problem/task itself into subproblems/subtasks. We use a separate prompt for each subtask and then chain the prompts in sequence, passing the output of one prompt as input to the next, thus creating a continuous chain of interactions that solves our problem. This is called a chain of prompt operations, or prompt chaining. Prompt chaining can help to
achieve better performance,
boost the transparency of the LLM application,
increase controllability and reliability,
debug problems with model responses more easily,
improve performance in the specific stages that need it,
build LLM-powered conversational assistants, and
improve the personalization and user experience of your application.
Use cases include
Response validation: We can ask the LLM to validate its previously generated output or another LLM’s output.
Parallel prompts: In some use cases, we run multiple prompts in parallel and then merge their outputs.
Writing stories
The following example from the reference book illustrates a prompt chain that first creates a product name, then uses this name together with the product features to create a slogan, and finally uses the features, product name, and slogan to create a sales pitch. A sketch of this chain in code is given after the figure.
Figure 7: Example of prompt chain
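The following is a minimal sketch of such a chain, assuming a local Ollama server with the llama3.2 model; the product features and the exact prompt wording are illustrative, not the book's.

import ollama

def ask(prompt):
    reply = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

features = "portable, rechargeable, waterproof bluetooth speaker"

# Prompt 1: product name
name = ask(f"Suggest a short product name for: {features}. Reply with the name only.")
# Prompt 2: slogan, using the generated name plus the features
slogan = ask(f"Write a one-line slogan for the product '{name}' with these features: {features}.")
# Prompt 3: sales pitch, using the features, name, and slogan
pitch = ask(f"Write a short sales pitch for '{name}' ({features}). Slogan: {slogan}")
print(name, slogan, pitch, sep="\n\n")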
Reasoning with Generative Models
Reasoning is an important trait of human intelligence. Today’s LLMs approximate this reasoning behaviour mainly through memorization of training data and pattern matching. By leveraging prompt engineering, we can get LLMs to better mimic reasoning processes and thereby enhance their output.
System 1 and System 2 Thinking Process by Daniel Kahneman
Daniel Kahneman, in his famous book “Thinking, Fast and Slow”, introduced the concept of System 1 and System 2 thinking in humans. According to him, System 1 represents our fast, automatic, and intuitive mode of thinking, while System 2 is our slower, more deliberate, and conscious mode of thinking, which requires effort and attention; essentially, System 1 is “thinking fast” and System 2 is “thinking slow”.
Figure 8: System 1 and System 2 thinking, Image Source
Inducing System 1 and System 2 Thinking in LLMs
The majority of LLMs today rely on System 1 thinking, but researchers are working on techniques to encourage more System 2-type behaviour by using prompting methods like “Chain of Thought” to elicit intermediate levels of reasoning before arriving at a final response.
Chain-of-Thought: Thinking Before Answering
The main aim of chain-of-thought is to push the model towards System 2 thinking, i.e. thinking before answering, allowing the model to spend more compute on the reasoning process. Here, the reasoning steps are referred to as thoughts.
Chain of thought – a series of intermediate reasoning steps – significantly improves the ability of large language models to perform complex reasoning [3]. Prompting using chain-of-thought is called chain-of-thought prompting. This prompting technique enables LLMs to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. The chain-of-thought reasoning process is highlighted in the following example taken from the paper [3].
Figure 9: Chain-of-thought example; reasoning process is highlighted – source [3]
Chain-of-thought normally requires one or more examples of reasoning. However, zero-shot chain-of-thought can be achieved by simply adding the phrase “Let’s think step-by-step” to the prompt; the phrase does not need to be exactly the same, and small variations work fine. The following is an example of zero-shot chain-of-thought.
Figure 10: Example of zero-shot chain-of-thought – source[1]
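The following is a minimal sketch of zero-shot chain-of-thought, again assuming a local Ollama server with llama3.2; the arithmetic question is a standard illustrative example, and only the appended trigger phrase matters.

import ollama

question = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?")
prompt = question + "\n\nLet's think step-by-step."

reply = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
print(reply["message"]["content"])   # the model should reason step by step before answering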
Self-Consistency: Sampling Outcomes
The paper[2] writes about the self-consistency as follows: “It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer.“
We first prompt the language model with chain-of-thought prompting; then, instead of greedily decoding a single optimal reasoning path, a ‘sample-and-marginalize’ decoding procedure is followed:
prompt the language model with chain-of-thought (CoT) prompting
replace the ‘greedy decode’ in CoT prompting by sampling from the language model’s decode to generate a diverse set of reasoning paths, and
marginalize out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set.
The following diagram, from the paper[4], illustrates the concept.
Figure 11: Example of self-consistency in CoT[4]
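The following is a minimal sketch of the idea, assuming a local Ollama server with llama3.2: several reasoning paths are sampled at a non-zero temperature and the most frequent final answer is kept. The question and the crude answer-extraction logic are illustrative only.

from collections import Counter
import ollama

question = ("If there are 3 cars and each car has 4 wheels, how many wheels are there in total? "
            "Let's think step-by-step, then give the final answer as 'Answer: <number>'.")

answers = []
for _ in range(5):   # sample 5 diverse reasoning paths
    reply = ollama.chat(model="llama3.2",
                        messages=[{"role": "user", "content": question}],
                        options={"temperature": 0.8})
    text = reply["message"]["content"]
    answers.append(text.split("Answer:")[-1].strip())   # crude extraction of the final answer

print(Counter(answers).most_common(1)[0][0])   # the most consistent answer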
Tree of Thoughts: Deliberate Problem Solving
This is another effort toward pushing the model in the direction of human System 2 thinking. The following is a quote from Newell et al. [6]:
“A genuine problem-solving process involves the repeated use of available information to initiate exploration, which discloses, in turn, more information until a way to attain the solution is finally discovered.”
The paper [5] explains Tree-of-Thoughts (ToT) as follows. ToT generalizes over the popular chain-of-thought approach to prompting language models and enables exploration over coherent units of text (“thoughts”) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices.
The following diagram from the paper [5] illustrates the various approaches to problem solving with LLMs. Each rectangular box represents a thought.
Figure 12: Various approaches to problem solving with LLMs.
Output Verification
It is important to verify and control the output of the model to avoid breakdowns in production and to create a robust AI system. Reasons for validating the output include
Structured output: For example, we may need the output in JSON format.
Valid output: Even if we restrict the output to a few choices, the model may still come up with a new one.
Ethics: Free of profanity, personally identifiable information (PII), bias, cultural stereotypes, etc.
Accuracy: Checking whether the output is factually accurate, coherent, and free from hallucination.
Apart from adjusting the parameters temperature and top_p, the following are three ways to control the output of a GenAI model:
Examples: Provide a number of examples of the expected output.
Grammar: Control the token selection process
Fine-tuning: Tune the model on data that contain the expected output.
Providing examples
To control the structure of the output, e.g. JSON format, we can provide a few examples in that format to guide the model towards producing output in the desired shape. Still, there is no guarantee that the model will comply; some models are better than others at following instructions.
Grammar: Constrained Sampling
Libraries have been developed to constrain and validate the output of generative models such as:
Guidance: An efficient programming paradigm for steering language models. With Guidance, you can control how output is structured and get high-quality output for your use case, while reducing latency and cost compared to conventional prompting or fine-tuning. It allows users to constrain generation (e.g. with regex and CFGs) as well as to interleave control (conditionals, loops, tool use) and generation seamlessly.
Guardrails: This is a Python framework that helps build reliable AI applications by performing two key functions:
Guardrails runs Input/Output Guards in your application that detect, quantify and mitigate the presence of specific types of risks. To look at the full suite of risks, check out Guardrails Hub.
Guardrails help you generate structured data from LLMs.
LMQL: This is a programming language for LLMs. Robust and modular LLM prompting using types, templates, constraints and an optimizing runtime.
Another approach is to define grammars or rules that the LLM must follow when choosing the next token. For example, in llama-cpp-python we can set response_format to a JSON object if we want the output in JSON format, as sketched below.
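The following is a minimal sketch using llama-cpp-python’s JSON response format; the model path is a placeholder, and the exact output depends on the model used.

from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf")   # placeholder path to a local GGUF model
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You answer only in JSON."},
        {"role": "user", "content": "Give the name and capital of any one country."},
    ],
    response_format={"type": "json_object"},   # constrain the output to valid JSON
)
print(response["choices"][0]["message"]["content"])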
References
Book: Oreilly – Hands-On Large Language Models – Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
Text clustering is an unsupervised approach that helps in discovering patterns in data. Grouping similar texts according to their semantic content, meaning, relationships, etc. is the goal of text clustering. This makes it easier to cluster vast amounts of unstructured text and perform exploratory data analysis quickly. With recent advancements in large language models (LLMs), we can obtain extremely precise contextual and semantic representations of text. This has improved text clustering’s efficacy even more. Use cases for text clustering include identifying outliers, accelerating labelling, identifying data that has been erroneously labelled, and more.
Topic modelling facilitates the identification of (abstract) topics that arise in huge textual data collections. Clusters of text documents can be given meaning through this method.
We will learn how to use embedding models for text clustering, explore a text-clustering-inspired method of topic modeling called BERTopic, and generate topic labels with an LLM given the keywords of a topic.
2. Pipeline for Text Clustering
The following diagram depicts the pipeline for text clustering that consists of the following three steps:
Use an embedding model to transform the input documents into embeddings.
Using a dimensionality reduction model, lower the dimensionality of embeddings.
Use a cluster model to identify groups of documents that share semantic similarities.
2.1 Embedding Documents
We know that embeddings are numerical representations of text that attempt to capture its meaning. In the first step, we can use embedding models that are optimized for semantic similarity tasks to transform the documents into embeddings. We can consult the Massive Text Embedding Benchmark (MTEB) leaderboard to select an embedding model for our requirements. For example, “thenlper/gte-small” is a small but performant model with fast inference.
2.2 Dimensionality Reduction of Embeddings
It is difficult for clustering techniques to identify meaningful clusters when the dimensionality of the data is high. Dimensionality reduction techniques find low-dimensional representations that try to preserve the global structure of the data; they act as compression techniques and do not remove dimensions arbitrarily. Popular algorithms are Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). Compared to PCA, UMAP tends to handle nonlinear relationships and structures better.
2.3 Clustering the Reduced Embeddings
The following diagram depicts the methods of text clustering:
Density-based clustering algorithms determine the number of clusters themselves and do not force every data point to belong to a cluster; data points that are not part of any cluster are marked as outliers. Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a hierarchical variation of the DBSCAN algorithm that specializes in finding dense (micro-)clusters without being told the number of clusters in advance. A sketch of the full three-step pipeline is given below.
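The following is a minimal sketch of the three-step pipeline (embed, reduce, cluster), assuming the sentence-transformers, umap-learn, hdbscan, and scikit-learn packages; the 20 Newsgroups sample and all parameter values are illustrative choices.

from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Illustrative corpus: a small sample of the 20 Newsgroups dataset
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:500]

# 1. Embed the documents
embeddings = SentenceTransformer("thenlper/gte-small").encode(docs)

# 2. Reduce the dimensionality of the embeddings
reduced = UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(embeddings)

# 3. Cluster the reduced embeddings; label -1 marks outliers
labels = HDBSCAN(min_cluster_size=15).fit_predict(reduced)
print(labels)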
3. Text Clustering to Topic Modeling
Topic modelling is the term used to describe the process of identifying themes or hidden topics in a set of textual data. Traditionally, it involves finding a group of keywords or phrases that best represent and capture the essence of the topic. We need to understand the meaning of the topic through these keywords or phrases. Latent Dirichlet Allocation (LDA) is one such algorithm. Let’s discuss BERTopic in the following sections, which is a highly modular text clustering and topic modeling framework.
3.1 BERTopic: A Modular Topic Modeling Framework
The steps for performing topic modeling follow the three steps of text clustering: the output of the third (clustering) step is fed into the fourth step. The following diagram depicts the 4th and 5th steps, which are specific to topic modeling:
The 4th step calculates the class-based term frequency, i.e., the frequency (tf) of word X in cluster C. This term is then multiplied by the inverse document frequency (IDF) in the 5th step. The goal is to give more weight to words that characterise a cluster and less weight to words that appear across all clusters.
The following diagram depicts the full pipeline from clustering to topic modeling. Though topic modeling follows clustering, they are largely independent of each other, and each component is modular. BERTopic can be customised, and we can choose another algorithm instead of the default ones.
3.2 Reranking in BERTopic
c-TF-IDF does not take semantic structure into account, so BERTopic leverages representation models (e.g. KeyBERTInspired, maximal marginal relevance (MMR), spaCy) to rerank the topic keywords found in the previously discussed 5th step. This reranking is applied per topic rather than per document. Many of the representation models are LLMs. With this step, the final pipeline extends to the following:
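The following is a minimal sketch of a customised BERTopic pipeline with KeyBERTInspired reranking; the component choices and dataset are illustrative assumptions, and every component can be swapped for another.

from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:1000]

topic_model = BERTopic(
    embedding_model=SentenceTransformer("thenlper/gte-small"),   # step 1: embeddings
    umap_model=UMAP(n_components=5, metric="cosine"),            # step 2: dimensionality reduction
    hdbscan_model=HDBSCAN(min_cluster_size=15),                  # step 3: clustering
    representation_model=KeyBERTInspired(),                      # rerank the c-TF-IDF keywords
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())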
3.3 Using LLM to generate a label for Topic
The following diagram explains how the topic keywords, combined with representative documents and a prompt, can be passed to an LLM to generate a label for the topic. A small sketch of this idea follows.
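The following is a minimal sketch of the idea (not BERTopic’s built-in integration), assuming a local Ollama server with llama3.2; the keywords and representative document are made-up examples.

import ollama

keywords = ["election", "vote", "democracy", "parliament"]       # e.g. c-TF-IDF keywords of a topic
representative_docs = ["Citizens cast their votes to elect members of parliament."]

prompt = (
    "The topic is described by the following keywords: " + ", ".join(keywords) + "\n"
    "Representative documents:\n- " + "\n- ".join(representative_docs) + "\n"
    "Give a short, human-readable label for this topic."
)
reply = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
print(reply["message"]["content"])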
Final pipeline is as follows:
Final detailed pipeline:
References
Oreilly – Hands-On Large Language Models – Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
A common task in natural language processing (NLP) is text classification. Use cases of text classification include sentiment analysis, intent detection, entity extraction, and language detection. This article will delve into how to use LLMs for text classification. We will look at representation models and generative models. Under representation models, we will see how to use task-specific models and embedding models to classify text. Under generative models, we will see open-source and closed-source models. While both generative and representation models can be applied to classification, they take different approaches.
2. Text Classification with Representation Models
Task-specific models and embedding models are two types of representation models that can be used for text classification. To obtain task-specific models, representation models, like bidirectional encoder representations from transformers (BERT), are trained for a particular task, like sentiment analysis. On the other hand, general-purpose models such as embedding models can be applied to a range of tasks outside classification, such as semantic search.
As can be seen in the diagram below, when used for text classification, representation models are kept frozen (untrainable). Because a task-specific model is trained specifically for the given task, it can classify input text directly. When using an embedding model, however, we first generate embeddings for the texts in the training set and then train a classifier on the dataset of embeddings and corresponding labels; once the classifier is trained, it can be used for classification.
3. Model Selection
Factors to consider when selecting a model for text classification include:
How does it fit the use case?
What is its language capability?
What is the underlying architecture?
What is the size of the model?
How is the performance? etc.
Underlying Architecture
BERT is an encoder-only architecture and is a popular choice for creating task-specific models and embedding models, and falls into the category of representation models. Generative Pre-trained Transformer (GPT) is a decoder-only architecture that falls into the generative models category. Encoder-only models are normally small in size. Variations of BERT are RoBERTa, DistilBERT, DeBERTa, etc. For task-specific use cases such as sentiment analysis, Twitter-RoBERTa-base can be a good starting point. For embedding models sentence-transformers/all-mpnet-base-v2 can be a good starting point as this is a small but performant model.
4. Text Classification using Task-specific Models
This is pretty straightforward: text is passed to a tokenizer that splits it into tokens, and these tokens are consumed by the task-specific model, which predicts the class of the text.
This works well if we can find a task-specific model for our use case. Otherwise, if we have to fine-tune a model ourselves, we need to check whether we have a sufficient budget (time, cost) for it. Another option is to fall back on general-purpose embedding models. The sketch below shows classification with a ready-made task-specific model.
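The following is a minimal sketch using the Hugging Face pipeline with a Twitter-RoBERTa sentiment checkpoint; the exact model name is an assumption based on the model family mentioned above, and the score shown in the comment is only indicative.

from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="cardiffnlp/twitter-roberta-base-sentiment-latest")
print(classifier("I absolutely loved this movie!"))
# e.g. [{'label': 'positive', 'score': 0.98}]  (actual scores will vary)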
5. Text Classification using Embedding Models
Rather than directly using a task-specific representation model for classification, we can generate features with an embedding model and train a classifier such as logistic regression on those features, as sketched below.
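The following is a minimal sketch of this two-stage approach; the toy training texts and labels are made up, and the embedding model is the one suggested earlier.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

train_texts = ["great movie", "terrible plot", "loved it", "boring and slow"]
train_labels = [1, 0, 1, 0]   # toy labels: 1 = positive, 0 = negative

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
X_train = embedder.encode(train_texts)            # embeddings as features
clf = LogisticRegression().fit(X_train, train_labels)

X_test = embedder.encode(["what a fantastic film"])
print(clf.predict(X_test))                        # expected: [1]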
5.1 What if we do not have the labeled data?
When we have the definitions of the labels but no labeled data, we can utilize what is called “zero-shot classification”. A zero-shot model predicts the labels of input text even though it was not trained on them. The following diagram depicts the concept.
We can perform zero-shot classification using embeddings by describing each label in terms of what it should represent. The following diagram describes the concept.
To assign a label to an input text/document, we calculate the cosine similarity between its embedding and the label embeddings and pick the label it is closest to, as sketched below.
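The following is a minimal sketch of zero-shot classification with embeddings; the label descriptions and the example document are made up.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

label_descriptions = ["a positive movie review", "a negative movie review"]
label_embeddings = embedder.encode(label_descriptions)

doc = "The acting was wooden and the story made no sense."
doc_embedding = embedder.encode(doc)

similarities = util.cos_sim(doc_embedding, label_embeddings)   # 1 x 2 similarity matrix
print(label_descriptions[int(similarities.argmax())])          # expected: "a negative movie review"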
6. Text Classification with Generative Models
Generative models are trained for a wide variety of tasks, so they will not perform text classification out of the box. To make a generative model understand our task, we need to use prompt engineering: the prompt must be written so that the model understands what it is expected to do, what the candidate labels are, and so on.
6.1 Using Text-to-Text Transfer Model (T5)
The following diagram summarizes the different categories of the models:
The following diagram depicts the training steps:
We need to prefix each document with the prompt “Is the following sentence positive or negative?“
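The following is a minimal sketch using a T5-family model via the text2text-generation pipeline; the flan-t5-small checkpoint is an assumed choice, and the prefix is the one mentioned above.

from transformers import pipeline

t5 = pipeline("text2text-generation", model="google/flan-t5-small")
prompt = "Is the following sentence positive or negative? " \
         "The movie kept me on the edge of my seat."
print(t5(prompt)[0]["generated_text"])   # expected: "positive"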
6.2 ChatGPT for Classification
The following diagram describes the training procedure of ChatGPT:
The model is trained using human preference data so that it generates text aligned with human preferences.
For text classification, the following is a sample prompt:
prompt = """Predict whether the following document is a positive or negative movie review:
[DOCUMENT]
If it is positive return 1 and if it is negative return 0. Do not give any other answers. """
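The following is a minimal sketch of sending this prompt to an OpenAI chat model through the official Python client; the model name is an assumption, and an OPENAI_API_KEY environment variable must be set.

from openai import OpenAI

prompt = ("Predict whether the following document is a positive or negative movie review:\n"
          "[DOCUMENT]\n"
          "If it is positive return 1 and if it is negative return 0. Do not give any other answers.")

document = "A heartwarming film with brilliant performances."
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",   # assumed model name
    messages=[{"role": "user", "content": prompt.replace("[DOCUMENT]", document)}],
)
print(response.choices[0].message.content)   # expected: "1"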
References
Oreilly – Hands-On Large Language Models – Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
You might have encountered performance issues while running LLM (Large Language Model) models in a production environment. In that case, you may consider using an inference engine or server that handles many such issues for you off the shelf. The following is an overview of the architecture for LLM inference and serving.
Several LLM inference engines and servers are available to deploy and serve LLMs in production. The following are a few of the most prominent among them:
Throughput and latency are two important metrics for evaluating LLM inference and serving. Throughput refers to the number of output tokens an LLM can generate per second. Latency refers to the time it takes for a large language model (LLM) to process an input and generate a response. For latency, the important metric is “Time to First Token” (TTFT) which refers to the amount of time it takes for a language model to generate the first token of its response after receiving a prompt.
LLM inference engines and servers are intended to optimise LLM memory utilisation and performance in the production environment. They assist you with achieving high throughput and low latency, guaranteeing that your LLMs can handle a huge number of requests while responding rapidly. Based on your specific use case, you may also have additional factors that would influence your decision to select a particular engine/server.
Inference Engine vs. Inference Server
Inference engines run the models and are in charge of everything needed for the generation process. Inference servers handle incoming and outgoing HTTP and gRPC requests from the end users of your application, and also expose metrics for measuring the deployment performance of your LLM.
Techniques/Terminologies used across these Frameworks
KV cache
PagedAttention
Batching
Support for quantisation
LoRA: Low-Rank Adaptation: It is a technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
Tool calling: Tool calling, also known as function calling, is a technique that enables Large Language Models (LLMs) to request information from external tools. This enables LLMs to obtain information for which they were not trained or perform actions that are beyond their own capacities.
The following are the questions you should keep in mind when choosing between competing open-source products:
Are there any other users out there?
Is it the most popular in this category?
Is this technology in decline?
The popularity and traction of GitHub projects can be inferred from their star histories. You can use star-history.com to compare projects on this basis. Refer to the tutorial for details.
Learn about Large Language Models (LLMs), their installation, access through HTTP API, the Ollama framework, etc.
Introduction to Retrieval-Augmented Generation (RAG)
Learn about the Data Ingestion Pipeline for the Qdrant vector database
Learn about the RAG Pipeline
Access the prototype using audio-based input/output (audio bot).
Audio bot using speech-to-text and text-to-speech
Creating a Qdrant client for making queries
Creating context from the documents retrieved from the Qdrant database
Using Llama 3.2 as the Large Language Model (LLM) via the Ollama and LangChain frameworks
Creating a prompt template from the instruction and context, and making a query using the template and LangChain
Using Llama to generate the answer to the question using the given context
2. Large Language Model
A Large Language Model (LLM) is an artificial intelligence (AI) model trained to comprehend and produce language similar to a human’s. It can learn linguistic structures, relationships, and patterns because it has been trained on enormous volumes of text data. Transformer architecture is often the foundation of large language models, allowing them to
process sequential data, which makes them suitable for tasks like text generation, translation, and question-answering,
learn contextual relationships such as word meanings, syntax, and semantics.
generate human-like language, i.e. produce coherent, context-specific text that resembles human-generated content
Some key characteristics of large language models include:
Trained on vast amounts of training data
Scalability: They can handle long sequences and complex tasks.
Parallel processing
Some of the popular examples of Large Language Models are BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly Optimized BERT Pretraining Approach), XLNet, T5 (Text-to-Text Transfer Transformer), Llama, etc.
Some of the popular use cases are:
Language Translation
Text summarization
Sentiment analysis
Chatbots and virtual assistants
Question answering
Content generation
Major concerns regarding LLMs are
Data bias: LLMs have the potential to reinforce biases found in the training data.
Interpretability: It can be difficult to comprehend the decision-making process of an LLM.
Security: Adversarial attacks can target LLMs.
3.1 Llama
Llama is an open-source AI model you can fine-tune, distill, and deploy anywhere. Current versions are available in three flavors:
Llama 3.1: Multilingual capability, available in two versions: 1) 8B with 8 billion parameters and 2) 405B with 405 billion parameters
Llama 3.2: Lightweight and multimodal, available as 1) lightweight 1B and 3B models and 2) multimodal 11B and 90B models
Llama 3.3: Multilingual with 70B parameters
In the current prototype, I have used Llama 3.2.
3.2 Install, run, and different ways to access Llama
Large language models are locked in time: they have learned only the knowledge that was available up to the time they were trained and released. These models are trained on Internet-scale open data, so when you ask general questions they give very good answers, but they may fail to answer, or may hallucinate, when you get very specific about your personal or enterprise data. The reason is that they usually do not have the right context about requirements that are very specific to your application.
Retrieval-augmented generation (RAG) combines the strength of generative AI and retrieval techniques. It helps in providing the right context for the LLM along with the question being asked. This way we get the LLM to generate more accurate and relevant content. This is a cost-effective way to improve the output of LLMs without retraining them. The following diagram depicts the RAG architecture:
Fig1: Conceptual flow of using RAG with LLM
There are two primary components of the RAG:
Retrieval: This component is responsible for searching and retrieving relevant information related to the user’s query from various knowledge sources such as documents, articles, databases, etc.
Generation: This component does an excellent job of crafting coherent and contextually rich responses to user queries.
A question submitted by the user is routed to the retrieval component. Using the embedding model, the retrieval component converts the query text to the embedding vector. After that, it looks through the vector database to locate a small number of vectors that match the query text and satisfy the threshold requirements for the similarity score and distance metric. These vectors are transformed back to the text and used as the context. This context, along with the prompt and query, is put in the prompt template and sent to the LLM. LLM returns the generated text that is more correct and relevant to the user’s query.
4. RAG (Data Ingestion Pipeline)
In order for the retrieval component to have a searchable index of preprocessed data, we must first build a data ingestion pipeline. The following diagram (fig2) depicts the data ingestion pipeline. Knowledge sources can be web pages, text documents, PDF documents, etc. Text needs to be extracted from these sources. I am using PDF documents as the only knowledge source for this prototype.
Text Extraction: To extract the text from a PDF, various Python libraries can be used, such as PyPDF2, pdf2txt, PDFMiner, etc. If the PDF is a scanned PDF, libraries such as unstructured, pdf2image, and pytesseract can be utilized. The quality of the text can be maintained by performing cleanups such as removing extraneous characters, fixing formatting issues, whitespace, special characters, and punctuation, spell checking, etc. Language detection may also be required if the knowledge sources contain text in multiple languages, or if a single document contains multiple languages.
Handling Multiple Pages: Maintaining the context across pages is important. It is recommended that the document be segmented into logical units, such as paragraphs or sections, to preserve the context. Extracting metadata such as document titles, authors, page numbers, creation dates, etc., is crucial for improving searchability and answering user queries.
Fig2: RAG data ingestion pipeline
Note: I have manually downloaded the PDFs of all the chapters of the book “Democratic Politics” of class IX of the NCERT curriculum. These PDFs will be the knowledge source for our application.
4.1 Implementation step by step
Step 1: Install the necessary libraries
pip install pdfminer.six
pip install langchain-ollama
Imports:
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient
Step 2: Load the pdf file and extract the text from it
Step 3: Split the text into smaller chunks with overlap
CHUNK_SIZE = 1000    # chunk size not greater than 1000 chars
CHUNK_OVERLAP = 30   # a bit of overlap is required for continued context
text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
docs = text_splitter.split_documents(pdf_content)

# Make a list of split docs
documents = []
for doc in docs:
    documents.append(doc.page_content)
Step 4: Embed and store the documents in the vector database
FastEmbed is a lightweight, fast Python library built for embedding generation. The Qdrant vector database client uses this embedding library by default when documents are added with qdrant_client.add() (see the complete code in the next section). The following snippet queries the collection once the documents have been inserted.
# 5. Make a query from the vectordb (qdrant)
search_results = qdrant_client.query(
    collection_name="ix-sst-ncert-democratic-politics",
    query_text="What is democracy?"
)
for search_result in search_results:
    print(search_result.document, search_result.score)
4.2 Complete Code data_ingestion.py
################################################################
# Data ingestion pipeline
# 1. Taking the input pdf file
# 2. Extracting the content
# 3. Divide into chunks
# 4. Use embeddings model to convert to the embedding vector
# 5. Store the embedding vectors to the qdrant (vector database)
################################################################
import os
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient

path = "ix-sst-ncert-democratic-politics"
filenames = next(os.walk(path))[2]

for i, file_name in enumerate(filenames):
    print(f"Data ingestion for the chapter: {i}")

    # 1. Load the pdf document and extract text from it
    loader = PDFMinerLoader(path + "/" + file_name)
    pdf_content = loader.load()
    print(pdf_content)

    # 2. Split the text into small chunks
    CHUNK_SIZE = 1000    # chunk size not greater than 1000 chars
    CHUNK_OVERLAP = 30   # a bit of overlap is required for continued context
    text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
    docs = text_splitter.split_documents(pdf_content)

    # Make a list of split docs
    documents = []
    for doc in docs:
        documents.append(doc.page_content)

    # 3. Create vectordatabase (qdrant) client
    qdrant_client = QdrantClient(url="http://localhost:6333")

    # 4. Add document chunks in vectordb
    qdrant_client.add(
        collection_name="ix-sst-ncert-democratic-politics",
        documents=documents,
        # metadata=metadata,
        # ids=ids
    )

    # 5. Make a query from the vectordb (qdrant)
    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text="What is democracy?"
    )
    for search_result in search_results:
        print(search_result.document, search_result.score)
5. RAG (Information Retrieval and Generation) – Audio Bot
I am making an audio bot that will answer questions from the chapters of the book “Democratic Politics” of class IX of the NCERT(India) curriculum. If you want to learn about making an audio bot, you can read my article on the topic “Making a talking bot using Llama3.2:1b running on Raspberry Pi 4 Model-B 4GB“.
5.1 Audio Bot Implementation
The following diagram depicts the overall flow of the audio bot and how it interacts with the RAG system. A user interacts with the audio bot using the microphone. The microphone captures the speech audio signal and passes it to the speech-to-text library (I am using faster_whisper), which converts it into a text query that is passed on to the RAG system. When the RAG system comes up with the response text, this text is passed to the text-to-speech library (I am using pyttsx3), which converts the text to audio that is then played through the speaker so the user can listen to the response.
faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, which is a fast inference engine for Transformer models.
Installation: pip install faster-whisper
Save the following code in a Python file, say "speech-to-text.py", and run python speech-to-text.py
from faster_whisper import WhisperModel

model_size = "small.en"
model = WhisperModel(model_size, device="cpu", compute_type="int8")

# Transcribe
transcription = model.transcribe(
    audio="basic_output1.wav",
    language="en",
)
seg_text = ''
for segment in transcription[0]:
    seg_text = segment.text
print(seg_text)
Sample input audio file:
Output text: “Please ask me something. I’m listening now”
5.1.3 Text-to-Speech
The best offline text-to-speech library that works on resource-constrained devices is “pyttsx3“.
Installation: pip install pyttsx3
Save the following code in a Python file, say "text-to-speech.py", and run python text-to-speech.py
Code Snippet:
import pyttsx3

engine = pyttsx3.init()
engine.setProperty('volume', 1)
engine.setProperty('rate', 130)
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[1].id)
engine.setProperty('voice', 'english+f3')
text_to_speak = "I got your question. Please bear " \
                "with me while I retrieve the answer."
engine.say(text_to_speak)
# Following is an optional line: use it if you also
# want to save the audio to a file
engine.save_to_file(text_to_speak, 'speech.wav')
engine.runAndWait()
Sample input text: “I got your question. Please bear with me while I retrieve the answer.”
The following code snippet creates a template, uses the template to create the prompt, creates a reference to the Llama model, builds a chain (a LangChain pipeline for executing the query) from the prompt and the model, and finally invokes the chain to execute the query and get the response formed by the LLM using the retrieved context.
# 4. Using LLM for forming the answer
template = """Instruction: {instruction}
Context: {context}
Query: {query}
"""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="llama3.2")   # Using llama3.2 as llm model
chain = prompt | model
bot_response = chain.invoke({
    "instruction": "Answer the question based on the context below. If you cannot answer "
                   "the question with the given context, answer with \"I don't know.\"",
    "context": context,
    "query": query
})
5.3 Complete Code audiobot.py
The following is the complete code for the audio bot. Save the file as audiobot.py.
import pyaudio
import wave
import pyttsx3
from qdrant_client import QdrantClient
from langchain_ollama.llms import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate
from faster_whisper import WhisperModel

# Load the Speech to Text Model (faster-whisper: pip install faster-whisper)
whishper_model_size = "small.en"
whishper_model = WhisperModel(whishper_model_size, device="cpu", compute_type="int8")

CHUNK = 512
FORMAT = pyaudio.paInt16  # paInt8
CHANNELS = 1
RATE = 44100  # sample rate
RECORD_SECONDS = 7
WAVE_OUTPUT_FILENAME = "pyaudio-output.wav"

def speak(text_to_speak):
    engine = pyttsx3.init()
    engine.setProperty('volume', 1)
    engine.setProperty('rate', 130)
    voices = engine.getProperty('voices')
    engine.setProperty('voice', voices[1].id)
    engine.setProperty('voice', 'english+f3')
    engine.say(text_to_speak)
    engine.runAndWait()

speak("I am an AI bot. I have learned the book \"democratic politics\" of class 9 "
      "published by N C E R T. You can ask me questions from this book.")

while True:
    speak("I am listening now for your question.")

    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)  # buffer
    print("* recording")
    frames = []
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)  # 2 bytes (16 bits) per channel
    stream.stop_stream()
    stream.close()
    p.terminate()

    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

    # Transcribe
    transcription = whishper_model.transcribe(
        audio=WAVE_OUTPUT_FILENAME,
        language="en",
    )
    seg_text = ''
    for segment in transcription[0]:
        seg_text = segment.text
    print(f'\nUser: {seg_text}')

    if seg_text == '':
        speak("Probably you did not say anything.")
        continue
    else:
        text_to_speak = "I got your question. Please bear with me " \
                        + "while I retrieve the answer."
        speak(text_to_speak)

    # 1. Create vector database (qdrant) client
    qdrant_client = QdrantClient(url="http://localhost:6333")

    # 2. Make a query to the vectordb (qdrant)
    # query = "explain democracy in estonia?"
    query = seg_text
    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text=query
    )

    context = ""
    no_of_docs = 2
    count = 1
    for search_result in search_results:
        if search_result.score >= 0.8:
            print(f"Retrieved document: {search_result.document}, Similarity score: {search_result.score}")
            context = context + search_result.document
        if count >= no_of_docs:
            break
        count = count + 1
    # print(f"Retrieved Context: {context}")

    if context == "":
        print("Context is blank. Could not find any relevant information in the given sources.")
        speak("I did not find anything in the book about the question.")
        continue

    # 4. Using LLM for forming the answer
    template = """Instruction: {instruction}
    Context: {context}
    Query: {query}
    """
    prompt = ChatPromptTemplate.from_template(template)
    model = OllamaLLM(model="llama3.2")  # Using llama3.2 as llm model
    chain = prompt | model
    bot_response = chain.invoke({
        "instruction": "Answer the question based on the context below. If you cannot answer "
                       "the question with the given context, answer with \"I don't know.\"",
        "context": context,
        "query": query
    })
    print(f'\nBot: {bot_response}')
    speak(bot_response)
6. Libraries Used
The following is the list of libraries used in the prototype implementation. They can be installed with the Python pip command.
Activate the virtual environment with env1\Scripts\activate on Windows; on Linux, the activate script is in the bin directory (source env1/bin/activate).
python -m pip install -r requirements.txt
python data_ingestion.py
python audiobot.py
8. My Conversation with the Audio Bot
9. Further Improvement
In the current prototype, the chunk size is of fixed length, CHUNK_SIZE = 1000 and CHUNK_OVERLAP = 30. For further improvement, the document can be split into logical units, such as paragraphs/sections, to maintain a better context.
10. References
A Practical Approach to Retrieval Augmented Generation Systems by Mehdi Allahyari and Angelina Yang
Install, run, and access Llama using Ollama. – link
How to Record, Save, and Play Audio in Python? – link
Making a talking bot using Llama3.2:1b running on Raspberry Pi 4 Model-B 4GB – link