Six months into your LLM project, someone asks: “How does our RAG pipeline actually work?” You dig through Slack. Check Notion. Find three different architecture diagrams—each contradicting the others. None match what’s actually deployed.
Sound familiar? This is the documentation debt that kills AI projects. Not because teams don’t document, but because traditional diagramming tools can’t keep up with how fast AI systems evolve.
I’ve watched this play out dozens of times. A team spends hours crafting beautiful architecture diagrams in Lucidchart or draw.io. Two sprints later, they’ve added a semantic router, switched vector databases, and introduced a reflection loop. The diagrams? Still showing the old design, locked in someone’s Google Drive. The fix isn’t better discipline. It’s better tools.
The Real Cost of Screenshot-Driven Documentation
When I started building production AI systems, I followed the standard playbook: design in Figma, export to PNG, paste into docs. The results were predictably bad.
Here’s what actually happens with static diagrams:
They diverge immediately. You add a cross-encoder reranking stage to your RAG pipeline. The diagram still shows simple vector similarity. Nobody updates it because that requires opening another tool, finding the original file, making edits, re-exporting, and re-uploading.
They’re invisible to code review. Your agent architecture changes during PR review—maybe you split one tool into two, or modified the state transition logic. The code diff shows this. Your diagram? Still wrong, and nobody notices because it’s not in the diff.
They break the development flow. Good documentation happens in context. When you’re deep in implementing a multi-agent workflow, the last thing you want is to switch to a visual editor, recreate your mental model, and then switch back.
I hit this wall hard while writing production-ready agentic systems. The architecture was evolving daily. Keeping diagrams synchronized was either impossible or consumed hours I needed for actual engineering.
Enter Diagram-as-Code
The solution isn’t working harder at diagram maintenance. It’s treating diagrams like we treat code: version-controlled, reviewable, and living alongside the implementation.
This is where Mermaid becomes essential infrastructure.
Instead of drawing boxes and arrows, you describe your system’s structure in plain text. The rendering happens automatically, everywhere your documentation lives—GitHub READMEs, technical blogs, internal wikis, even Jupyter notebooks.
Here’s a simple example. This code:
```mermaid
graph LR
    A[User Query] --> B[Semantic Router]
    B -->|factual| C[Vector DB]
    B -->|conversational| D[LLM Direct]
    C --> E[Reranker]
    E --> F[Context Builder]
    F --> G[LLM Generation]
    D --> G
```
Renders as a clean flowchart showing how queries route through different paths in your RAG system. No exports, no image hosting, no version drift.
The real power emerges when this diagram lives in your repository’s docs/ folder. Now when someone modifies the routing logic, they update both code and diagram in the same commit. Code review catches documentation drift before it happens.
Five Essential Mermaid Patterns for AI Engineers
Let me show you the diagram patterns I use constantly. These aren’t toy examples—they’re templates I’ve refined while building production systems that handle millions of queries.
1. LLM Agent Architecture with Tool Orchestration
Most agent tutorials show you a simple loop. Production agents are messier. They need memory systems, error handling, and complex tool orchestration.
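A rough sketch of this pattern in Mermaid (node names are illustrative; adapt them to your own components):
```mermaid
%% Illustrative node names; adapt to your own agent's components
graph TD
    U[User Request] --> A[Agent LLM]
    A --> D{Tool call needed?}
    D -->|no| R[Final Response]
    D -->|yes| T[Tool Executor]
    T -->|success| M[Update Memory]
    T -->|failure| RT{Retries left?}
    RT -->|yes| T
    RT -->|no| E[Error Handler]
    M --> A
    E --> A
```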
This pattern captures what actually happens: tool failures, retry logic, and memory updates. When you’re debugging why your agent keeps hitting API limits, having this documented makes the problem obvious.
2. Multi-Stage RAG Pipeline
Basic RAG is “embed query, search vectors, generate response.” Production RAG has stages for query rewriting, hybrid search, reranking, and context filtering.
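A hedged sketch of such a pipeline (stage names are illustrative):
```mermaid
%% Illustrative stages; adjust to match your pipeline
graph LR
    Q[User Query] --> RW[Query Rewriter]
    RW --> VS[Vector Search]
    RW --> KS[Keyword Search]
    VS --> FU[Fusion]
    KS --> FU
    FU --> RR[Cross-Encoder Reranker]
    RR --> CF[Context Filter]
    CF --> G[LLM Generation]
```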
When your retrieval quality drops, this diagram tells you exactly which stage to investigate. Is the query rewriter over-generalizing? Is fusion weighting wrong? Is the reranker actually improving results?
3. Multi-Agent Research System
Research agents need more than simple tool calls. They plan, execute, reflect, and revise. This is LangGraph territory.
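A compact sketch of such a state machine (state names and labels are illustrative):
```mermaid
%% Illustrative states; rename to match your workflow
stateDiagram-v2
    [*] --> Planning
    Planning --> Research
    Research --> ToolCall
    ToolCall --> Evaluate
    Evaluate --> Research: quality below threshold
    Evaluate --> Report: quality threshold met
    Report --> [*]
```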
State machines are perfect for agent workflows. You can see the loops (research → tool → eval → research) and the exit conditions (quality threshold met). This maps directly to LangGraph’s state management.
4. LLM Inference Pipeline with Fallbacks
Production systems need graceful degradation. When your primary model is down or rate-limited, what happens?
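A minimal sequence-diagram sketch of a fallback chain (participant names are illustrative):
```mermaid
%% Illustrative participants; adapt to your providers
sequenceDiagram
    participant C as Client
    participant G as Gateway
    participant P as Primary Model
    participant F as Fallback Model
    C->>G: request
    G->>P: generate
    alt primary healthy
        P-->>G: tokens
    else rate limited or down
        G->>F: generate
        F-->>G: tokens
    end
    G-->>C: streamed response
```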
Sequence diagrams excel at showing timing, fallback chains, and interaction patterns. This one shows exactly how your system degrades under load—critical for reliability planning.
5. Agent State Transitions with Error Handling
Real agents don’t just flow forward. They handle errors, timeouts, and invalid states.
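A rough sketch of an agent state diagram with error paths (states and transitions are illustrative):
```mermaid
%% Illustrative states and transitions
stateDiagram-v2
    [*] --> Idle
    Idle --> Executing: task received
    Executing --> Validating: step complete
    Validating --> Executing: next step
    Executing --> Retrying: transient error
    Retrying --> Executing: backoff elapsed
    Executing --> Failed: timeout
    Validating --> Failed: max retries exceeded
    Validating --> Done: all steps complete
    Done --> [*]
    Failed --> [*]
```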
This is the diagram I wish I’d had when debugging why agents were getting stuck. You can trace any execution path and see exactly where state transitions should happen.
Making Mermaid Work in Your Stack
The diagrams are useful, but only if they integrate seamlessly into your workflow. Here’s how I’ve set this up across different contexts.
GitHub Integration
Mermaid renders natively in GitHub. Drop the code in any .md file:
```mermaid
graph LR
    A[Component A] --> B[Component B]
```
That’s it. Your README, PR descriptions, and documentation all render diagrams automatically. No image hosting, no broken links.
The killer feature: diagrams in PR descriptions. When you’re proposing architecture changes, include a Mermaid diagram showing the new flow. Reviewers see the change visually before diving into code.
Documentation Sites
I use Quarto for technical writing, but the pattern works for MkDocs, Docusaurus, and most static site generators.
For Quarto:
```yaml
format:
  html:
    mermaid:
      theme: neutral
```
Then diagrams just work in your .qmd files. The theme setting keeps them readable in both light and dark modes.
Jupyter Notebooks
When prototyping AI systems, I document the architecture right in the notebook:
````python
from IPython.display import display, Markdown

mermaid_code = """```mermaid
graph TD
    A[Data] --> B[Preprocess]
    B --> C[Embed]
    C --> D[Index]
```"""
display(Markdown(mermaid_code))
````
This keeps exploration and documentation together. When the experiment becomes production code, the diagram moves with it.
VS Code
The Mermaid Preview extension lets you see diagrams as you write them. Edit your architecture doc, see the diagram update live. This tight feedback loop makes documentation actually enjoyable.
Advanced Patterns I’ve Found Useful
Once you’re comfortable with basic diagrams, these techniques will level up your documentation game.
Custom Styling for Component Types
Different components deserve different visual treatment:
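A sketch of the classDef approach (class names and colors are arbitrary; pick your own palette):
```mermaid
%% Arbitrary class names and colors
graph LR
    A[User Input]:::input --> B[LLM]:::model
    B --> C[(Vector DB)]:::storage
    B --> D[Response]:::output

    classDef input fill:#cfe2ff
    classDef model fill:#fff3cd
    classDef storage fill:#e2d9f3
    classDef output fill:#d1e7dd
```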
Color coding makes complex diagrams scannable. Blue for inputs, yellow for models, purple for storage, green for outputs. Your brain pattern-matches instantly.
Subgraphs for System Boundaries
When documenting microservices or multi-container deployments:
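A sketch using subgraphs plus click-through links (the service names and repository URL are hypothetical):
```mermaid
%% Hypothetical services and repository URL
graph TD
    subgraph ingest["Ingestion Service"]
        A[Loader] --> B[Chunker]
    end
    subgraph retrieve["Retrieval Service"]
        C[Embedder] --> D[(Vector Store)]
    end
    B --> C
    D --> E[API Gateway]
    click A "https://github.com/your-org/your-repo/tree/main/ingestion" _blank
```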
Your architecture diagram becomes a navigable map of your codebase. Click a component, jump to its implementation.
When Mermaid Isn’t Enough
I’m bullish on diagram-as-code, but it’s not universal. Know the limits.
Complex visual design. If you’re creating marketing materials or presentation slides with custom branding, use proper design tools. Mermaid is for technical documentation, not visual design.
Extremely large graphs. Once you hit 50+ nodes, Mermaid diagrams become hard to read. At that scale, consider breaking into multiple diagrams or using specialized graph visualization tools.
Real-time monitoring. Mermaid is static. If you need live system visualization—metrics flowing through your pipeline, real-time dependency graphs—you want something like Grafana or custom dashboards.
The sweet spot is architectural documentation, system design, and workflow explanation. That covers 90% of what AI engineers need to document.
Making This Stick
Here’s how I’ve built this into my development workflow so it actually happens:
Diagram-first design. When planning a new feature, I sketch it in Mermaid before writing code. The act of documenting the design forces me to think through edge cases and dependencies.
PR templates with diagram prompts. Our PR template asks: “Does this change affect system architecture? If yes, update or add Mermaid diagrams.” Makes documentation part of the review process.
Living architecture docs. We maintain a docs/architecture/ folder with Mermaid diagrams for each major subsystem. When the system changes, the diff shows both code and diagram updates.
Blog post diagrams as code. When I write technical posts, diagrams are Mermaid by default. This means I can update them easily, and readers can fork the code to customize for their needs.
The Bigger Picture
This isn’t really about Mermaid. It’s about treating documentation as code.
When I look at successful AI engineering teams, they share a pattern: their documentation lives close to the implementation. Design docs in the repo. Architecture diagrams version-controlled. API specs generated from code.
The teams struggling with documentation debt? Their diagrams live in Google Slides. Their architecture docs are in Confluence, last updated six months ago. There’s friction between writing code and updating docs, so docs don’t get updated.
Mermaid removes that friction. Your diagram is a text file in your repo. Updating it is as natural as updating a comment. Code review catches documentation drift. Your architecture is always in sync because the alternative is harder.
For AI systems—where complexity grows fast, and architectures evolve constantly—this matters more than most domains. The difference between a team that can onboard new engineers in days versus weeks often comes down to documentation quality.
And documentation quality comes down to whether updating it is painful or painless.
Getting Started Today
If you’re convinced but not sure where to start:
Pick one system to document. Don’t boil the ocean. Choose one complex workflow—maybe your RAG pipeline or agent orchestration logic—and diagram it in Mermaid.
Put it in your repo. Create a docs/architecture.md file. Diagram goes there. Commit it.
Link from your README. Make the documentation discoverable. “See architecture docs for system design.”
Update it in your next PR. When you modify that system, update the diagram in the same commit. Feel how much easier this is than updating a PowerPoint.
Expand gradually. As you see the value, add more diagrams. Sequence diagrams for complex interactions. State machines for agent workflows. Flowcharts for decision logic.
The goal isn’t comprehensive documentation on day one. It’s building a habit where documentation updates are as natural as code updates.
Resources and Templates
I’ve provided production-ready Mermaid templates for common AI system patterns above. Customize them for your needs.
Mermaid’s official documentation is surprisingly good. When you need specific syntax, the live editor’s auto-complete helps.
Final Thoughts
Your AI system is going to change. New techniques will emerge. Your architecture will evolve. That’s the nature of working in a fast-moving field.
The question is whether your documentation will keep up.
Static diagrams won’t. Screenshot-driven workflows can’t. The friction is too high.
Diagram-as-code can. When updating documentation is as easy as updating code, it actually happens.
I’ve seen this transform how teams work. Less time in meetings explaining architecture. Faster onboarding. Fewer “wait, how does this actually work?” moments.
The switch isn’t hard. Pick one diagram you currently maintain in a visual tool. Recreate it in Mermaid. Put it in your repo. Update it once. You’ll feel the difference.
That’s when you’ll know this isn’t just another documentation fad. It’s the infrastructure for how modern AI systems should be documented.
If you’ve built AI agents before, you know the frustration: they work great in demos, then fall apart in production. The agent crashes on step 8 of 10, and you start over from scratch. The LLM decides to do something completely different today than yesterday. You can’t figure out why the agent failed because state is hidden somewhere in conversation history.
I spent months wrestling with these problems before discovering LangGraph. Here’s what I learned about building agents that actually work in production.
The Chain Problem: Why Your Agents Keep Breaking
Most developers start with chains—simple sequential workflows where each step runs in order. They look clean:
result = prompt_template | llm | output_parser | tool_executor
But chains have a fatal flaw: no conditional logic. Every step runs regardless of what happened before. If step 3 fails, you can’t retry just that step. If validation fails, you can’t loop back. If you need human approval, you’re stuck.
Figure: Graph vs. Chains
Production systems need:
Conditional routing based on results
Retry logic for transient failures
Checkpointing to resume from crashes
Observable state you can inspect
Error handling that doesn’t blow up your entire workflow
That’s where graphs come in.
What LangGraph Actually Gives You
LangGraph isn’t just “chains with extra steps.” It’s a fundamentally different approach built around five core concepts:
Figure: LangGraph Core Concepts
1. Explicit State Management
Instead of hiding state in conversation history, you define exactly what your agent tracks:
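A minimal sketch of such a state definition (the class and field names here are illustrative; a fuller version appears in the walkthrough below):
```python
from typing import Annotated, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages


class AgentState(TypedDict):
    # Conversation history; the add_messages reducer appends rather than overwrites
    messages: Annotated[list[BaseMessage], add_messages]
    # Everything else the agent explicitly tracks
    task: str
    retry_count: int
```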
Unlike chains, graphs can loop back. Validation failed? Retry. Output quality low? Refine and try again.
5. Full Observability
Stream execution to see exactly what’s happening:
```python
for step in app.stream(state, config):
    print(f"Node: {step['node']}, Stage: {step['stage']}")
```
No more black boxes.
Building a Real Agent: Research Agent Walkthrough
Let me show you how these concepts work in practice. We’ll build a research agent that:
Plans search queries
Executes searches
Validates results (retries if insufficient)
Extracts key findings
Generates a final report
Here’s the complete flow:
Figure: Research Agent Flow
Step 1: Define Your State
State is your agent’s memory. Everything it knows goes here:
```python
class ResearchAgentState(TypedDict):
    # Conversation
    messages: Annotated[list[BaseMessage], add_messages]
    # Task
    research_query: str
    search_queries: list[str]
    # Results
    search_results: list[dict]
    key_findings: list[str]
    report: str
    # Control flow
    current_stage: Literal["planning", "searching", "validating", ...]
    retry_count: int
    max_retries: int
```
Figure: Agent State Structure
Step 2: Create Nodes
Nodes are functions that transform state. Each does one thing well:
```python
def plan_research(state: ResearchAgentState) -> dict:
    """Generate search queries from research question."""
    query = state["research_query"]
    response = llm.invoke([
        SystemMessage(content="You are a research planner."),
        HumanMessage(content=f"Create 3-5 search queries for: {query}")
    ])
    queries = parse_queries(response.content)
    return {
        "search_queries": queries,
        "current_stage": "searching"
    }
```
Figure: Node Anatomy
Step 3: Connect with Edges
Edges define flow. Static edges always go to the same node. Conditional edges make decisions:
```python
# Always go from plan to search
workflow.add_edge("plan", "search")

# After validation, decide based on results
def route_validation(state):
    if state["current_stage"] == "processing":
        return "process"
    return "handle_error"

workflow.add_conditional_edges(
    "validate",
    route_validation,
    {"process": "process", "handle_error": "handle_error"}
)
```
This pattern handles validation failures, retries, and graceful degradation.
Step 4: Add Checkpointing
Production agents need checkpointing. Period.
```python
from langgraph.checkpoint.sqlite import SqliteSaver

checkpointer = SqliteSaver.from_conn_string("agent.db")
app = workflow.compile(checkpointer=checkpointer)
```
Now state saves after every node. Crash recovery is automatic.
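Concretely, recovery is keyed by the thread identifier in the run config; a sketch (the thread_id and query values are arbitrary, and resuming by invoking with None reflects current LangGraph behavior, so verify against your version):
```python
# Each run is identified by a thread_id; checkpoints are stored per thread
config = {"configurable": {"thread_id": "research-job-42"}}

# First run: state is saved to agent.db after every node
app.invoke({"research_query": "LLM inference frameworks"}, config)

# After a crash, invoking again with the same thread_id picks up from the last checkpoint
app.invoke(None, config)
```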
A 404 error needs a different strategy than a rate limit.
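Roughly, the router inspects whatever error information the failing node recorded in state (the field and route names here are illustrative):
```python
def route_error(state: dict) -> str:
    """Choose a recovery strategy based on the kind of failure recorded in state."""
    error = state.get("error_type")
    if error == "rate_limit":
        return "wait_and_retry"   # back off, then retry the same call
    if error == "not_found":      # e.g. a 404 from a tool: retrying won't help
        return "replan"           # pick a different source or approach instead
    return "handle_error"         # anything else falls through to generic handling
```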
Pattern 3: Validation Loops
Build quality in:
```python
def route_validation(state):
    if validate(state["output"]):
        return "success"
    elif state["retry_count"] >= 3:
        return "fail"
    return "improve"  # Loop back with feedback
```
Code doesn’t compile? Loop back and fix it. Output quality low? Try again with better context.
Common Pitfalls (And How to Avoid Them)
Pitfall 1: Infinite Loops
Always have an exit condition:
```python
# BAD - loops forever if error persists
def route(state):
    if state["error"]:
        return "retry"
    return "continue"

# GOOD - circuit breaker
def route(state):
    if state["retry_count"] >= 5:
        return "fail"
    elif state["error"]:
        return "retry"
    return "continue"
```
One unhandled exception crashes your entire graph.
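One way to guard against this is to catch exceptions inside the node and turn them into state updates the router can act on; a sketch (run_search is a hypothetical helper doing the real work):
```python
def safe_search(state: dict) -> dict:
    """Wrap the risky work so a failure becomes a state update instead of a crash."""
    try:
        results = run_search(state["search_queries"])  # hypothetical helper
        return {"search_results": results, "error": None}
    except Exception as exc:
        # Record the failure; a conditional edge can now route to retry or error handling
        return {"error": str(exc), "current_stage": "error"}
```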
Pitfall 3: Forgetting Checkpointing
Development without checkpointing is fine. Production without checkpointing is disaster. Always compile with a checkpointer:
```python
# Development
app = workflow.compile(checkpointer=MemorySaver())

# Production
app = workflow.compile(checkpointer=SqliteSaver.from_conn_string("agent.db"))
```
Pitfall 4: Ignoring State Reducers
Default behavior loses data:
```python
# BAD - second node overwrites first node's messages
messages: list[BaseMessage]

# GOOD - accumulates messages
messages: Annotated[list[BaseMessage], add_messages]
```
Test your reducers. Make sure state updates as expected.
Pitfall 5: State Bloat
Don’t store large documents in state:
```python
# BAD - checkpointing writes MBs to disk
documents: list[str]  # Entire documents

# GOOD - store references, fetch on demand
document_ids: list[str]  # Just IDs
```
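A related safeguard is rendering the compiled graph and inspecting it before you ship; one way to do this (method names follow recent LangGraph versions, so check against your install):
```python
app = workflow.compile()

# Emit the graph topology as Mermaid text (there is also an ASCII printer) and eyeball it
print(app.get_graph().draw_mermaid())
```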
This catches design flaws before you deploy. Missing edge? Unreachable node? You’ll see it immediately.
Real-World Performance Numbers
Here’s what happened when I moved a research agent from chains to graphs:
Before (chains):
Network timeout on step 8 → restart from step 1
Cost: $0.50 per failure (7 wasted LLM calls)
Debugging time: 2 hours (no observability)
Success rate: 60% (failures compounded)
After (LangGraph):
Network timeout on step 8 → resume from step 8
Cost: $0.05 per retry (1 retried call)
Debugging time: 10 minutes (full logs)
Success rate: 95% (retries work)
The retry logic alone paid for the migration in a week.
Testing Production Agents
Unit test your nodes:
```python
def test_plan_research():
    state = {"research_query": "AI trends"}
    result = plan_research(state)
    assert "search_queries" in result
    assert len(result["search_queries"]) > 0
```
Test your routers:
```python
def test_retry_routing():
    # Should retry
    state = {"retry_count": 1, "max_retries": 3}
    assert route_retry(state) == "retry"

    # Should give up
    state = {"retry_count": 3, "max_retries": 3}
    assert route_retry(state) == "fail"
```
The companion repository for this walkthrough includes:
3 working examples (basic, streaming, checkpointing)
Unit tests
Production-ready configuration
Comprehensive documentation
Quick start: read the instructions at the given GitHub URL.
You’ll see the agent plan, search, validate, process, and generate a report—with full observability and automatic retries.
Key Takeaways
Building production agents isn’t about fancy prompts. It’s about engineering reliability into the system:
Explicit state makes agents debuggable
Conditional routing handles real-world complexity
Checkpointing prevents wasted work
Retry logic turns transient failures into eventual success
Observability shows you exactly what happened
LangGraph gives you all of these. The learning curve is worth it.
Start with the research agent example. Modify it for your use case. Add nodes, adjust routing, customize state. The patterns scale from 3-node prototypes to 20-node production systems.
What’s Next
This covers deterministic workflows—agents that follow explicit paths. The next step is self-correction: agents that reason about their own execution and fix mistakes.
That’s Plan → Execute → Reflect → Refine loops, which we’ll cover in Module 4.
But master graphs first. You can’t build agents that improve themselves if you can’t build agents that execute reliably.
Performance benchmarks, cost analysis, and decision framework for developers worldwide
Here’s something nobody tells you about “open source” AI: the model weights might be free, but running them isn’t.
A developer in San Francisco downloads LLaMA-2 70B. A developer in Bangalore downloads the same model. They both have “open access.” But the San Francisco developer spins up an A100 GPU on AWS for $2.50/hour and starts building. The Bangalore developer looks at their budget, does the math on ₹2 lakhs per month for cloud GPUs, and realizes that “open” doesn’t mean “accessible.”
This is where LLM inference frameworks come in. They’re not just about making models run faster—though they do that. They’re about making the difference between an idea that costs $50,000 a month to run and one that runs on your laptop. Between building something in Singapore that requires data to stay in-region and something that phones home to Virginia with every request. Between a prototype that takes two hours to set up and one that takes two weeks.
The framework you choose determines whether you can actually build what you’re imagining, or whether you’re locked out by hardware requirements you can’t meet. So let’s talk about how to choose one.
What This Guide Covers (And What It Doesn’t)
This guide focuses exclusively on inference and serving constraints for deploying LLMs in production or development environments. It compares frameworks based on performance, cost, setup complexity, and real-world deployment scenarios.
What this guide does NOT cover:
Model quality, alignment, or training techniques
Fine-tuning or model customization approaches
Prompt engineering or application-level optimization
Specific model recommendations (LLaMA vs GPT vs others)
If you’re looking for help choosing which model to use, this isn’t the right guide. This is about deploying whatever model you’ve already chosen.
What You Need to Know
Quick Answer: Choose vLLM if you’re deploying at production scale (100+ concurrent users) and need consistently low latency. Choose TensorRT-LLM if you’re on NVIDIA hardware and can invest 1-2 weeks in setup for maximum throughput efficiency. Choose Ollama if you’re prototyping and want something running in 10 minutes. Choose llama.cpp if you don’t have access to GPUs or need to deploy on edge devices.
The Real Question: This isn’t actually about which framework is “best.” It’s about which constraints you’re operating under. A bootstrapped startup in Pune and a funded company in Singapore are solving fundamentally different problems, even if they’re deploying the same model. The “best” framework is the one you can actually use.
Understanding LLM Inference Frameworks
What is an LLM Inference Framework?
An LLM inference framework is specialized software that handles everything involved in getting predictions out of a trained language model. Think of it as the engine that sits between your model weights and your users.
When someone asks your chatbot a question, the framework manages: loading the model into memory, batching requests from multiple users efficiently, managing the key-value cache that speeds up generation, scheduling GPU computation, handling the token-by-token generation process, and streaming responses back to users.
Without an inference framework, you’d need to write all of this yourself. With one, you get years of optimization work from teams at UC Berkeley, NVIDIA, Hugging Face, and others who’ve solved these problems at scale.
Why This Choice Actually Matters
The framework you choose determines three things that directly impact whether your project succeeds:
Cost. A framework that delivers 120 requests per second versus 180 requests per second means the difference between renting 5 GPUs or 3 GPUs. At $2,500 per GPU per month, that’s $5,000 monthly—$60,000 annually. For a startup, that’s hiring another engineer. For a bootstrapped founder, that’s the difference between profitable and broke.
Time. Ollama takes an hour to set up. TensorRT-LLM can take two weeks of expert time. If you’re a solo developer, two weeks is an eternity. If you’re a funded team with ML engineers, it might be worth it for the performance gains. Time-to-market often matters more than theoretical optimization.
What you can build. Some frameworks need GPUs. Others run on CPUs. Some work on any hardware; others are locked to NVIDIA. If you’re in a region where A100s cost 3x what they cost in Virginia, or if your data can’t leave Singapore, these constraints determine what’s possible before you write a single line of code.
The Six Frameworks You Should Know
Let’s cut through the noise. There are dozens of inference frameworks, but six dominate the landscape in 2025. Each makes different trade-offs, and understanding those trade-offs is how you choose.
vLLM: The Production Workhorse
What it is: Open-source inference engine from UC Berkeley’s Sky Computing Lab, now a PyTorch Foundation project. Built for high-throughput serving with two key innovations—PagedAttention and continuous batching.
Performance: In published benchmarks and production deployments, vLLM typically delivers throughput in the 120-160 requests per second range with 50-80ms time to first token. What makes vLLM special isn’t raw speed—TensorRT-LLM can achieve higher peak throughput—but how well it handles concurrency. It maintains consistently low latency even as you scale from 10 users to 100 users.
Setup complexity: 1-2 days for someone comfortable with Python and CUDA. The documentation is solid, the community is active, and it plays nicely with Hugging Face models out of the box.
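For a sense of what that looks like in practice: vLLM ships an OpenAI-compatible HTTP server, so once a model is being served (for example via `vllm serve <model>`), the standard OpenAI client can talk to it. The model name, port, and prompt below are illustrative:
```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server (default port 8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model you launched
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)
```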
Best for: Production APIs serving multiple concurrent users. Interactive applications where time-to-first-token matters. Teams that want flexibility without weeks of setup time.
Real-world example: A Bangalore-based SaaS company with Series A funding uses vLLM to power their customer support chatbot. They handle 50-100 concurrent users during business hours, running on 2x A100 GPUs in AWS Mumbai region. Monthly cost: ₹4 lakhs ($4,800). They chose vLLM over TensorRT-LLM because their ML engineer could get it production-ready in a week versus a month.
TensorRT-LLM: Maximum Performance, Maximum Complexity
What it is: NVIDIA’s specialized inference library built on TensorRT. Not a general-purpose tool—this is specifically engineered to extract every possible bit of performance from NVIDIA GPUs through CUDA graph optimizations, fused kernels, and Tensor Core acceleration.
Performance: When properly configured on supported NVIDIA hardware, TensorRT-LLM can achieve throughput in the 180-220 requests per second range with 35-50ms time to first token at lower concurrency levels. Published benchmarks from BentoML show it delivering up to 700 tokens per second when serving 100 concurrent users with LLaMA-3 70B quantized to 4-bit. However, under certain batching configurations or high concurrency patterns, time-to-first-token can degrade significantly—in some deployments, TTFT can exceed several seconds when not properly tuned.
Setup complexity: 1-2 weeks of expert time. You need to convert model checkpoints, build TensorRT engines, configure Triton Inference Server, and tune parameters. The documentation exists but assumes you know what you’re doing. For teams without dedicated ML infrastructure engineers, this can feel like climbing a mountain.
Best for: Organizations deep in the NVIDIA ecosystem willing to invest setup time for maximum efficiency. Enterprise deployments where squeezing 20-30% more throughput from the same hardware justifies weeks of engineering work.
Real-world example: A Singapore fintech company processing legal documents uses TensorRT-LLM on H100 GPUs. They handle 200+ concurrent users and need data to stay in the Singapore region for compliance. The two-week setup time was worth it because the performance gains let them use 3 GPUs instead of 5, saving S$8,000 monthly.
Ollama: Developer-Friendly, Production-Limited
What it is: Built on llama.cpp but wrapped in a polished, developer-friendly interface. Think of it as the Docker of LLM inference—you can get a model running with a single command.
Performance: In typical development scenarios, Ollama handles 1-3 requests per second in concurrent situations. This isn’t a production serving framework—it’s optimized for single-user development environments. But for that use case, it’s exceptionally good.
Setup complexity: 1-2 hours. Install Ollama, run ‘ollama pull llama2’, and you’re running a 7B model on your laptop. It handles model downloads, quantization, and serving automatically.
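Once the daemon is running, it also exposes a local HTTP API; a minimal sketch (the model and prompt are arbitrary):
```python
import requests

# Ollama listens on localhost:11434 by default
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```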
Best for: Rapid prototyping. Learning how LLMs work without cloud bills. Individual developers building tools for themselves. Any situation where ease of use matters more than serving many concurrent users.
Real-world example: A solo developer in Austin building a personal research assistant uses Ollama on a MacBook Pro. Zero cloud costs. Zero setup complexity. When they’re ready to scale, they’ll migrate to vLLM, but for prototyping, Ollama gets them building immediately instead of fighting infrastructure.
llama.cpp: The CPU Enabler
What it is: Pure C/C++ implementation with no external dependencies, designed to run LLMs on consumer hardware. This is the framework that makes “I don’t have a GPU” stop being a blocker.
Performance: CPU-bound, meaning it depends heavily on your hardware. But with aggressive quantization (down to 2-bit), you can run a 7B model at usable speeds on a decent CPU. Not fast enough for 100 concurrent users, but fast enough for real applications serving moderate traffic.
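As a sketch of what CPU inference looks like from Python, using the llama-cpp-python bindings (the file path, context size, and thread count are illustrative; the C++ CLI and built-in server work too):
```python
from llama_cpp import Llama

# Load a 4-bit quantized GGUF model entirely on CPU
llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # illustrative path to a quantized model
    n_ctx=4096,
    n_threads=8,
)

out = llm("Explain vector databases in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```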
Setup complexity: Hours to days, depending on your comfort with C++ compilation and quantization. More involved than Ollama, less involved than TensorRT-LLM.
Best for: Edge deployment. Resource-constrained environments. Any scenario where GPU access is impossible or prohibitively expensive. Developers who need maximum control over every optimization.
Real-world example: An ed-tech startup in Pune runs llama.cpp on CPU servers, serving 50,000 queries daily for their AI tutor product. Monthly infrastructure cost: ₹15,000 ($180). They tried GPU options first, but ₹2 lakhs per month wasn’t sustainable at their revenue. CPU inference is slower, but their users don’t notice the difference between 200ms and 800ms response times.
Hugging Face TGI: Ecosystem Integration
What it is: Text Generation Inference from Hugging Face, built for teams already using the HF ecosystem. It’s not the fastest framework, but the integration with Hugging Face’s model hub and tooling makes it valuable for certain workflows.
Performance: In practice, TGI delivers throughput in the 100-140 requests per second range with 60-90ms time to first token. Competitive but not leading-edge.
Best for: Teams already standardized on Hugging Face tooling. Organizations that value comprehensive documentation and established patterns over cutting-edge performance.
SGLang: Structured Generation Specialist
What it is: Framework built around structured generation with a dedicated scripting language for chaining operations. RadixAttention enables efficient cache reuse for sequences with similar prefixes.
Performance: SGLang shows remarkably stable per-token latency (4-21ms) across different load patterns. Not the highest throughput, but notably consistent.
Best for: Multi-step reasoning tasks, agentic applications, integration with vision and retrieval models. Teams building complex LLM workflows beyond simple chat.
Understanding Performance Metrics
When people talk about inference performance, they’re usually talking about three numbers. Understanding what they actually mean helps you choose the right framework.
Performance Benchmark Caveat
Performance metrics vary significantly based on:
Model size and quantization level
Prompt length and output length
Batch size and concurrency patterns
GPU memory configuration and hardware specs
Framework version and configuration tuning
The figures cited in this guide represent observed ranges from published benchmarks (BentoML, SqueezeBits, Clarifai) and real-world deployment reports from 2024-2025. They are not guarantees and should be validated against your specific workload before making infrastructure decisions.
Time to First Token (TTFT)
This is the delay between when a user sends a request and when they see the first word of the response. For interactive applications—chatbots, coding assistants, anything with humans waiting—this is what determines whether your app feels fast or sluggish.
Think about asking ChatGPT a question. That pause before the first word appears? That’s TTFT. Below 100ms feels instant. Above 500ms starts feeling slow. Above 1 second, users notice and get frustrated.
In published benchmarks, vLLM excels here, maintaining 50-80ms TTFT even with 100 concurrent users. TensorRT-LLM achieves faster times at low concurrency (35-50ms) but can degrade under certain high-load configurations.
Throughput (Requests per Second)
This measures how many requests your system can handle simultaneously. Higher throughput means you need fewer servers to handle the same traffic, which directly translates to lower costs.
In optimized deployments, TensorRT-LLM can achieve 180-220 req/sec, vLLM typically delivers 120-160 req/sec, and TGI manages 100-140 req/sec. At scale, these differences matter. Going from 120 to 180 req/sec means you can serve 50% more users on the same hardware.
But here’s the catch: throughput measured in isolation can be misleading. What matters is sustained throughput while maintaining acceptable latency. A framework that delivers 200 req/sec but with 2-second TTFT isn’t actually useful for interactive applications.
Tokens Per Second (Decoding Speed)
After that first token appears, this measures how fast the model generates the rest of the response. This is what makes responses feel fluid when they’re streaming.
Most modern frameworks deliver 40-60 tokens per second on decent hardware. The differences here are smaller than TTFT or throughput, and honestly, most users don’t notice the difference between 45 and 55 tokens per second when watching a response stream in.
The Real Cost Analysis
Let’s talk about what it actually costs to run these frameworks. The numbers vary wildly depending on where you are and what you’re building.
Pricing Disclaimer
Cloud provider pricing fluctuates based on region, commitment level, and market conditions. The figures below reflect typical 2024-2025 ranges from AWS, GCP, and Azure. Always check current pricing for your specific region and usage pattern before making budget decisions.
Hardware Costs
Purchasing an A100 GPU:
United States: $10,000-$15,000
Singapore: S$13,500-S$20,000
India: ₹8-12 lakhs
Cloud GPU rental (monthly):
AWS/GCP US regions: $2,000-3,000/month per A100
AWS Singapore: S$2,700-4,000/month per A100
AWS Mumbai: ₹1.5-2.5 lakhs/month per A100
That’s just the GPU. You also need CPU, memory, storage, and bandwidth. A realistic production setup costs 20-30% more than just the GPU rental.
The Setup Cost Nobody Talks About
Engineering time is real money, even if it doesn’t show up on your AWS bill.
Ollama: 1-2 hours of developer time. At ₹5,000/hour for a senior developer in India, that’s ₹10,000. At $150/hour in the US, that’s $300. Basically free.
vLLM: 1-2 days of ML engineer time. In India, maybe ₹80,000. In the US, $2,400. In Singapore, S$1,600. Not trivial, but manageable.
TensorRT-LLM: 1-2 weeks of expert time. In India, ₹4-5 lakhs. In the US, $12,000-15,000. In Singapore, S$8,000-10,000. Now you’re talking about real money.
For a bootstrapped startup, that TensorRT-LLM setup cost might be more than their entire monthly runway. For a funded company with dedicated ML infrastructure engineers, it’s a rounding error worth paying for the performance gains.
Regional Considerations
The framework choice looks different depending on where you’re building. Not because the technology is different, but because the constraints are different.
For Developers in India
Primary challenge: Limited GPU access and import costs that make hardware 3x more expensive than in the US.
The llama.cpp advantage: When cloud GPUs cost ₹2 lakhs per month and that’s your entire team’s salary budget, CPU-only inference stops being a compromise and starts being the only viable path. Advanced quantization techniques in llama.cpp can compress models down to 2-4 bits, making a 7B model run acceptably on a ₹15,000/month CPU server.
Real scenario: You’re building a SaaS product for Indian SMEs. Your customers won’t pay enterprise prices, so your margins are tight. Spending ₹2 lakhs monthly on infrastructure when your MRR is ₹8 lakhs doesn’t work. But ₹15,000 on CPU servers? That’s sustainable. You’re not trying to serve Google-scale traffic anyway—you’re trying to build a profitable business.
For Developers in Singapore and Southeast Asia
Primary challenge: Data sovereignty requirements and regional latency constraints.
The deployment reality: Financial services, healthcare, and government sectors in Singapore often require data to stay in-region. That means you can’t just use the cheapest cloud region—you need Singapore infrastructure. AWS Singapore costs about 10% more than US regions, but that’s the cost of compliance.
Framework choice: vLLM or TGI on AWS Singapore or Google Cloud Singapore. The emphasis is less on absolute cheapest and more on reliable, compliant, production-ready serving. Teams here often have the budget for proper GPU infrastructure but need frameworks with enterprise support and proven reliability.
For Developers in the United States
Primary challenge: Competitive pressure for maximum performance and scale.
The optimization game: US companies often compete on features and scale where milliseconds matter and serving 10,000 concurrent users is table stakes. The cost of cloud infrastructure is high, but the cost of being slow or unable to scale is higher. Losing users to a faster competitor hurts more than spending an extra $10,000 monthly on GPUs.
Framework choice: Funded startups tend toward vLLM for the balance of performance and deployment speed. Enterprises with dedicated ML infrastructure teams often invest in TensorRT-LLM for that last 20% of performance optimization. The two-week setup time is justified because the ongoing savings from better GPU utilization pays for itself.
Quick Decision Matrix
Use this table as a starting point for framework selection based on your primary constraint:
| Your Primary Constraint | Recommended Framework | Why |
| --- | --- | --- |
| No GPU access | llama.cpp | CPU-only inference with aggressive quantization |
| Prototyping / Learning | Ollama | Zero-config, runs on laptops |
| 10-100 concurrent users | vLLM | Best balance of performance and setup complexity |
| 100+ users, NVIDIA GPUs | TensorRT-LLM | Maximum throughput when properly configured |
| Hugging Face ecosystem | TGI | Seamless integration with HF tools |
| Agentic/multi-step workflows | SGLang | Structured generation and cache optimization |
| Tight budget, moderate traffic | llama.cpp | Lowest infrastructure cost |
| Data sovereignty requirements | vLLM or TGI | Regional deployment flexibility |
How to Actually Choose
Stop looking for the “best” framework. Start asking which constraints matter most to your situation.
Question 1: What’s Your Budget Reality?
Can’t afford GPUs at all: llama.cpp is your path. It’s not a compromise; it’s how you build something rather than nothing. Many successful products run on CPU inference because their users care about reliability and features, not sub-100ms response times.
Can afford 1-2 GPUs: vLLM or TGI. Both will get you production-ready inference serving reasonable traffic. vLLM probably has the edge on performance; TGI has the edge on ecosystem integration if you’re already using Hugging Face.
Can afford a GPU cluster: Now TensorRT-LLM becomes interesting. When you’re running 5+ GPUs, that 20-30% efficiency gain from TensorRT means you might only need 4 GPUs instead of 5. The setup complexity is still painful, but the ongoing savings justify it.
Question 2: How Much Time Do You Have?
Need something running today: Ollama. Install it, pull a model, start building. You’ll migrate to something else later when you need production scale, but Ollama gets you from zero to functional in an afternoon.
Have a week: vLLM or TGI. Both are production-capable and well-documented enough that a competent engineer can get them running in a few days.
Have dedicated ML infrastructure engineers: TensorRT-LLM becomes viable. The complexity only makes sense when you have people whose job is dealing with complexity.
Question 3: What Scale Are You Actually Targeting?
Personal project or internal tool (1-10 users): Ollama or llama.cpp. The overhead of vLLM’s production serving capabilities doesn’t make sense when you have 3 users.
Small SaaS (10-100 concurrent users): vLLM or TGI. You’re in the sweet spot where their optimizations actually matter but you don’t need absolute maximum performance.
Enterprise scale (100+ concurrent users): vLLM or TensorRT-LLM depending on whether you optimize for deployment speed or runtime efficiency. At this scale, the performance differences translate to actual money.
Question 4: What’s Your Hardware Situation?
NVIDIA GPUs available: All options are on the table. If it’s specifically A100/H100 hardware and you have time, TensorRT-LLM will give you the best performance when properly configured.
AMD GPUs or non-NVIDIA hardware: vLLM has broader hardware support. TensorRT-LLM is NVIDIA-only.
CPU only: llama.cpp is your only real option. But it’s a good option—don’t treat it as second-class.
Real-World Deployment Scenarios
Let’s look at how actual teams made these choices.
Scenario 1: Bootstrapped Startup in Bangalore
Company: Ed-tech platform, 5 person team, 50,000 daily users
Choice: llama.cpp on CPU-only servers
Outcome: ₹15,000/month infrastructure cost. Response times are 400-800ms, which their users don’t complain about because the recommendations are actually useful. The business is profitable, which wouldn’t be true with GPU costs.
Scenario 2: Fintech Company in Singapore
Regulatory constraint: Data must stay in Singapore region for compliance
Technical requirement: Process 10M financial documents monthly, 200+ concurrent users during business hours
Choice: TensorRT-LLM on 3x H100 GPUs in AWS Singapore region
Outcome: S$12,000/month infrastructure cost. The two-week setup time was painful, but the performance optimization meant they could handle their load on 3 GPUs instead of the 5 GPUs vLLM would have required. Monthly savings of S$8,000 justified the initial investment.
Scenario 3: AI Startup in San Francisco
Company: Developer tools company, 25 employees, $8M Series A
Market constraint: Competing with well-funded incumbents on performance
Technical requirement: Code completion with sub-100ms latency, 500+ concurrent developers
Choice: vLLM on 8x A100 GPUs
Outcome: $20,000/month infrastructure cost. They prioritized getting to market fast over squeezing out maximum performance. vLLM gave them production-quality serving in one week versus the month TensorRT-LLM would have taken. At their stage, speed to market mattered more than 20% better GPU efficiency.
The Uncomfortable Truth About Framework Choice
Here’s what nobody wants to say: for most developers, the framework choice is constrained by things that have nothing to do with the technology.
A developer in San Francisco and a developer in Bangalore might both download the same LLaMA-2 weights. They both have “open access” to the model. But they don’t have the same access to the infrastructure needed to run it at scale. The San Francisco developer can spin up A100 GPUs without thinking about it. The Bangalore developer does the math and realizes it would consume their entire salary budget.
This is why llama.cpp matters so much. Not because it’s the fastest or the most elegant solution, but because it’s the solution that works when GPUs aren’t an option. It’s the difference between building something and building nothing.
We talk about “democratizing AI” by releasing model weights. But if running those models costs $5,000 per month and your monthly income is $1,000, those weights aren’t democratized—they’re just decorative. The framework you can actually use determines whether you can build at all.
This isn’t a technical problem. It’s a structural one. And it’s why framework comparisons that only look at benchmarks miss the point. The “best” framework isn’t the one with the highest throughput. It’s the one that lets you build what you’re trying to build with the constraints you actually face.
Practical Recommendations
Based on everything we’ve covered, here’s how I’d think about the choice:
Start with Ollama for Prototyping
Unless you have unusual constraints, begin with Ollama. Get your idea working, validate that it’s useful, prove to yourself that LLM inference solves your problem. You’ll learn what performance characteristics actually matter to your users.
Don’t optimize prematurely. Don’t spend two weeks setting up TensorRT-LLM before you know if anyone wants what you’re building.
Graduate to vLLM for Production
When you have actual users and actual scale requirements, vLLM is probably your best bet. It’s the sweet spot between performance and deployment complexity. You can get it running in a few days, it handles production loads well, and the community is active if you run into issues.
vLLM’s superpower isn’t being the absolute fastest—it’s being fast enough while remaining deployable by teams without dedicated ML infrastructure engineers.
Consider TensorRT-LLM When Scale Justifies Complexity
If you’re running 5+ GPUs and burning $15,000+ monthly on infrastructure, now the two-week setup time for TensorRT-LLM starts making sense. A 25% performance improvement means you might only need 4 GPUs instead of 5, saving $3,000 monthly. That pays for the setup time in a few months.
But be honest about whether you’re at that scale. Most projects aren’t.
Don’t Dismiss llama.cpp
If your budget is tight or you need edge deployment, llama.cpp isn’t a fallback option—it’s the primary option. Many successful products run on CPU inference. Your users care about whether the product works, not whether it uses GPUs.
A working product on CPU infrastructure beats a hypothetical perfect product that you can’t afford to build.
Frequently Asked Questions
Which LLM inference framework should I choose?
It depends on your constraints. Choose vLLM for production scale (100+ concurrent users) with balanced setup complexity. Choose TensorRT-LLM if you’re on NVIDIA hardware and can invest 1-2 weeks for maximum performance. Choose Ollama for rapid prototyping and getting started quickly. Choose llama.cpp if you don’t have GPU access or need edge deployment.
Can I run LLM inference without a GPU?
Yes. llama.cpp enables CPU-only LLM inference with advanced quantization techniques that reduce memory requirements by up to 75%. While slower than GPU inference, it’s fast enough for many real-world applications, especially those serving moderate traffic rather than thousands of concurrent users. Many successful products run entirely on CPU infrastructure.
How much does LLM inference actually cost?
Cloud GPU rental varies by region: $2,000-3,000/month per A100 in the US, S$2,700-4,000/month in Singapore, ₹1.5-2.5 lakhs/month in India. CPU-only deployment with llama.cpp can cost as little as ₹10-15K/month ($120-180) for moderate workloads. The total cost includes setup time: Ollama takes hours, vLLM takes 1-2 days, TensorRT-LLM takes 1-2 weeks of expert engineering time.
Is vLLM better than TensorRT-LLM?
They optimize for different things. vLLM prioritizes ease of deployment and consistent low latency across varying loads. TensorRT-LLM prioritizes maximum throughput on NVIDIA hardware but requires significantly more setup effort. vLLM is better for teams that need production-ready serving quickly. TensorRT-LLM is better for teams running at massive scale where spending weeks on optimization saves thousands monthly in infrastructure costs.
What’s the difference between Ollama and llama.cpp?
Ollama is built on top of llama.cpp but adds a user-friendly layer with automatic model management, one-command installation, and simplified configuration. llama.cpp is the underlying inference engine that gives you more control but requires more manual setup. Think of Ollama as the Docker of LLM inference—optimized for developer experience. Use Ollama for quick prototyping, use llama.cpp directly when you need fine-grained control or CPU-optimized production deployment.
Which framework is fastest for LLM inference?
TensorRT-LLM can deliver the highest throughput (180-220 req/sec range) and lowest time-to-first-token (35-50ms) on supported NVIDIA hardware when properly configured. However, vLLM maintains better performance consistency under high concurrent load, keeping 50-80ms TTFT even with 100+ users. “Fastest” depends on your workload pattern—peak performance versus sustained performance under load—and proper configuration.
Do I need different frameworks for different regions?
No, the framework choice is the same globally, but regional constraints affect which framework is practical. Data sovereignty requirements in Singapore might push you toward regional cloud deployment. Hardware costs in India might make CPU-only inference with llama.cpp the only viable option. US companies often have easier access to GPU infrastructure but face competitive pressure for maximum performance. The technology is the same; the constraints differ.
How do I choose between cloud and on-premise deployment?
Cloud deployment (AWS, GCP, Azure) offers flexibility and faster scaling but with ongoing costs of $2,000-3,000 per GPU monthly. On-premise makes sense when you have sustained high load that justifies the $10,000-15,000 upfront GPU cost, or when regulatory requirements mandate keeping data in specific locations. Break-even is typically around 4-6 months of sustained usage. For startups and variable workloads, cloud is usually better. For established companies with predictable load, on-premise can be cheaper long-term.
What about quantization—do I need it?
Quantization (reducing model precision from 16-bit to 8-bit, 4-bit, or even 2-bit) is essential for running larger models on limited hardware. It can reduce memory requirements by 50-75% with minimal quality degradation. All modern frameworks support quantization, but llama.cpp has the most aggressive quantization options, making it possible to run 7B models on consumer CPUs. For GPU deployment, 4-bit or 8-bit quantization is standard practice for balancing performance and resource usage.
The Bottom Line
The framework landscape in 2025 is mature enough that you have real choices. vLLM for production serving, TensorRT-LLM for maximum performance, Ollama for prototyping, llama.cpp for resource-constrained deployment—each is legitimately good at what it does.
But the choice isn’t just technical. It’s about which constraints you’re operating under. A developer in Bangalore trying to build something profitable on a tight budget faces different constraints than a funded startup in San Francisco optimizing for scale. The “open” models are the same, but the paths to actually deploying them look completely different.
Here’s what I wish someone had told me when I started: don’t optimize for the perfect framework. Optimize for shipping something that works. Start with Ollama, prove your idea has value, then migrate to whatever framework makes sense for your scale and constraints. The best framework is the one that doesn’t stop you from building.
And if you’re choosing between a framework that requires GPUs you can’t afford versus llama.cpp on hardware you already have—choose llama.cpp. A working product beats a hypothetical perfect one every time.
The weights might be open, but the infrastructure isn’t equal. Choose the framework that works with your reality, not the one that works in someone else’s benchmarks.
References & Further Reading
Benchmark Studies & Performance Analysis
BentoML Team. “Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and Hugging Face TGI.” BentoML Blog. Retrieved from: https://www.bentoml.com/blog/benchmarking-llm-inference-backends
SqueezeBits Team. “[vLLM vs TensorRT-LLM] #1. An Overall Evaluation.” SqueezeBits Blog, October 2024. Retrieved from: https://blog.squeezebits.com/vllm-vs-tensorrtllm-1-an-overall-evaluation-30703
Clarifai. “Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B.” Clarifai Blog, September 2025. Retrieved from: https://www.clarifai.com/blog/comparing-sglang-vllm-and-tensorrt-llm-with-gpt-oss-120b
ITECS Online. “vLLM vs Ollama vs llama.cpp vs TGI vs TensorRT-LLM: 2025 Guide.” October 2025. Retrieved from: https://itecsonline.com/post/vllm-vs-ollama-vs-llama.cpp-vs-tgi-vs-tensort
Framework Documentation & Official Sources
vLLM Project. “vLLM: High-throughput and memory-efficient inference and serving engine for LLMs.” GitHub Repository. Retrieved from: https://github.com/vllm-project/vllm
SGLang Project. “SGLang: Efficient Execution of Structured Language Model Programs.” GitHub Repository. Retrieved from: https://github.com/sgl-project/sglang
Technical Analysis & Comparisons
Northflank. “vLLM vs TensorRT-LLM: Key differences, performance, and how to run them.” Northflank Blog. Retrieved from: https://northflank.com/blog/vllm-vs-tensorrt-llm-and-how-to-run-them
Inferless. “vLLM vs. TensorRT-LLM: In-Depth Comparison for Optimizing Large Language Model Inference.” Inferless Learn. Retrieved from: https://www.inferless.com/learn/vllm-vs-tensorrt-llm-which-inference-library-is-best-for-your-llm-needs
Neural Bits (Substack). “The AI Engineer’s Guide to Inference Engines and Frameworks.” August 2025. Retrieved from: https://multimodalai.substack.com/p/the-ai-engineers-guide-to-inference
The New Stack. “Six Frameworks for Efficient LLM Inferencing.” September 2025. Retrieved from: https://thenewstack.io/six-frameworks-for-efficient-llm-inferencing/
Zilliz Blog. “10 Open-Source LLM Frameworks Developers Can’t Ignore in 2025.” January 2025. Retrieved from: https://zilliz.com/blog/10-open-source-llm-frameworks-developers-cannot-ignore-in-2025
Regional Deployment & Cost Analysis
House of FOSS. “Ollama vs llama.cpp vs vLLM: Local LLM Deployment in 2025.” July 2025. Retrieved from: https://www.houseoffoss.com/post/ollama-vs-llama-cpp-vs-vllm-local-llm-deployment-in-2025
Picovoice. “llama.cpp vs. ollama: Running LLMs Locally for Enterprises.” July 2024. Retrieved from: https://picovoice.ai/blog/local-llms-llamacpp-ollama/
Google Cloud Pricing. “A2 VMs and GPUs pricing.” Retrieved Q4 2024 from: https://cloud.google.com/compute/gpus-pricing
Research Papers & Academic Sources
Kwon, Woosuk et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. arXiv:2309.06180
Yu, Gyeong-In et al. “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022.
NVIDIA Research. “TensorRT: High Performance Deep Learning Inference.” NVIDIA Technical Blog.
Community Resources & Tools
Awesome LLM Inference (GitHub). “A curated list of Awesome LLM Inference Papers with Codes.” Retrieved from: https://github.com/xlite-dev/Awesome-LLM-Inference
Hugging Face. “Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference.” Hugging Face Blog. Retrieved from: https://huggingface.co/blog/tgi-multi-backend
Sebastian Raschka. “Noteworthy LLM Research Papers of 2024.” January 2025. Retrieved from: https://sebastianraschka.com/blog/2025/llm-research-2024.html
Additional Technical Resources
Label Your Data. “LLM Inference: Techniques for Optimized Deployment in 2025.” December 2024. Retrieved from: https://labelyourdata.com/articles/llm-inference
Medium (Zain ul Abideen). “Best LLM Inference Engine? TensorRT vs vLLM vs LMDeploy vs MLC-LLM.” July 2024. Retrieved from: https://medium.com/@zaiinn440/best-llm-inference-engine-tensorrt-vs-vllm-vs-lmdeploy-vs-mlc-llm-e8ff033d7615
Rafay Documentation. “Choosing Your Engine for LLM Inference: The Ultimate vLLM vs. TensorRT LLM Guide.” April 2025. Retrieved from: https://docs.rafay.co/blog/2025/04/28/choosing-your-engine-for-llm-inference-the-ultimate-vllm-vs-tensorrt-llm-guide/
Hivenet Compute. “vLLM vs TGI vs TensorRT‑LLM vs Ollama.” Retrieved from: https://compute.hivenet.com/post/vllm-vs-tgi-vs-tensorrt-llm-vs-ollama
Survey Papers & Comprehensive Guides
Heisler, Morgan Lindsay et al. “LLM Inference Scheduling: A Survey of Techniques, Frameworks, and Trade-offs.” Huawei Technologies, 2025. Retrieved from: https://www.techrxiv.org/
Search Engine Land. “International SEO: Everything you need to know in 2025.” January 2025. Retrieved from: https://searchengineland.com/international-seo-everything-you-need-to-know-450866
Note on Sources
All benchmark figures, performance metrics, and pricing data cited in this guide were retrieved during Q4 2024 and early 2025. Framework capabilities, cloud pricing, and performance characteristics evolve rapidly in the LLM infrastructure space.
For the most current information:
Check official framework documentation for latest features
Verify cloud provider pricing in your specific region
Run your own benchmarks with your specific workload
Consult community forums (Reddit r/LocalLLaMA, Hacker News) for recent real-world experiences
Benchmark Reproducibility Note: Performance varies significantly based on:
Exact framework versions used
Model architecture and size
Quantization settings
Hardware configuration
Batch size and concurrency patterns
Prompt and completion lengths
The figures in this guide represent typical ranges observed across multiple independent benchmark studies. Your mileage will vary.
Acknowledgments
This guide benefited from:
Public benchmark studies from BentoML, SqueezeBits, and Clarifai teams
Open discussions in the vLLM, llama.cpp, and broader LLM communities
Real-world deployment experiences shared by developers in India, Singapore, and US tech communities
Technical documentation from framework maintainers and NVIDIA research
Special thanks to the open-source maintainers of vLLM, llama.cpp, Ollama, SGLang, and related projects who make this ecosystem possible.
From understanding concepts to building systems – this comprehensive guide takes you through every component needed to build reliable, production-ready AI agents.
Introduction: From Theory to Implementation
Building an AI agent isn’t about chaining a few LLM calls together and hoping for the best. It’s about understanding the fundamental mechanics that make agents actually work in production environments.
If you’ve experimented with agent frameworks like LangChain or AutoGPT, you’ve probably noticed something: they make agents look simple on the surface, but when things break (and they will), you’re left debugging a black box. The agent gets stuck in loops, picks wrong tools, forgets context, or hallucinates operations that don’t exist.
The problem? Most developers treat agents as magical systems without understanding what’s happening under the hood.
This guide changes that. We’re deconstructing agents into their core building blocks – the execution loop, tool interfaces, memory architecture, and state transitions. By the end, you’ll not only understand how agents work, but you’ll be able to build robust, debuggable systems that handle real-world tasks.
What makes this different from other agent tutorials?
Instead of showing you how to call agent.run() and praying it works, we’re breaking down each component with production-grade implementations. You’ll see working code for every pattern, understand why each piece matters, and learn where systems typically fail.
Who is this guide for?
AI engineers and software developers who want to move beyond toy examples. If you’ve built demos that work 70% of the time but can’t figure out why they fail the other 30%, this is for you. If you need agents that handle errors gracefully, maintain context across conversations, and execute tools reliably, keep reading.
The Fundamental Truth About Agents
Here’s what most tutorials won’t tell you upfront: An agent is not a monolith – it’s a loop with state, tools, and memory.
Every agent system, regardless of complexity, follows the same pattern: observe the current context, think about what to do, decide on an action, act, update state, and then check whether to stop.
This six-step pattern (five actions plus termination check) appears everywhere:
Simple chatbots implement it minimally
Complex multi-agent systems run multiple instances simultaneously
Production systems add error handling and recovery to each step
The sophistication varies, but the structure stays constant.
Why this matters for production systems:
When you call agent.run() in LangChain or set up a workflow in LangGraph, this loop executes behind the scenes. When something breaks – the agent loops infinitely, selects wrong tools, or loses context – you need to know which step failed:
Observe: Did it lack necessary context?
Think: Was the prompt unclear or misleading?
Decide: Were tool descriptions ambiguous?
Act: Did the tool crash or return unexpected data?
Update State: Did memory overflow or lose information?
Without understanding the loop, you’re debugging blind.
Anatomy of the Agent Execution Loop
Let’s examine the agent loop with precision. This isn’t pseudocode – this is the actual pattern every agent implements:
def agent_loop(task: str, max_iterations: int = 10) -> str:
    """
    The canonical agent execution loop.
    This foundation appears in every agent system.
    """
    # Initialize state
    state = {
        "task": task,
        "conversation_history": [],
        "iteration": 0,
        "completed": False
    }

    while not state["completed"] and state["iteration"] < max_iterations:
        # STEP 1: OBSERVE
        # Gather current context: task, history, available tools, memory
        observation = observe(state)

        # STEP 2: THINK
        # LLM reasons about what to do next
        reasoning = llm_think(observation)

        # STEP 3: DECIDE
        # Choose an action based on reasoning
        action = decide_action(reasoning)

        # STEP 4: ACT
        # Execute the chosen action (tool call or final answer)
        result = execute_action(action)

        # STEP 5: UPDATE STATE
        # Store the outcome and update memory
        state = update_state(state, action, result)

        # TERMINATION CHECK
        if is_task_complete(state):
            state["completed"] = True

        state["iteration"] += 1

    return extract_final_answer(state)
The state dictionary is the agent’s working memory. It tracks everything: the original task, conversation history, current iteration, and completion status. This state persists across iterations, accumulating context as the agent progresses.
The while condition has two critical parts:
not state["completed"] – checks if the task is finished
state["iteration"] < max_iterations – enforces a hard cap on the number of loop iterations
Without the second condition, a logic error or unclear task makes your agent run forever, burning through API credits and compute resources.
The five steps must execute in order:
You can’t decide without observing
You can’t act without deciding
You can’t update state without seeing results
This sequence is fundamental, not arbitrary.
Step 1: Observe – Information Gathering
Purpose: Assemble all relevant information for decision-making
def observe(state: dict) -> dict:
    """
    Observation packages everything the LLM needs:
    - Original task/goal
    - Conversation history
    - Available tools
    - Current memory/context
    - Previous action outcomes
    """
    return {
        "task": state["task"],
        "history": state["conversation_history"][-5:],  # Last 5 turns
        "available_tools": get_available_tools(),
        "iteration": state["iteration"],
        "previous_result": state.get("last_result")
    }
Why observation matters:
The LLM can’t see your entire system state – you must explicitly package relevant information. Think of it as preparing a briefing document before a meeting. Miss critical context, and decisions suffer.
Key considerations:
Context window management: You can’t include unlimited history. The code above keeps the last 5 conversation turns. This prevents token overflow while maintaining recent context. For longer conversations, implement summarization or semantic filtering.
Tool visibility: The agent needs to know what actions it can take. This seems obvious until you’re debugging why an agent doesn’t use a tool you just added. Make tool descriptions visible in every observation.
Iteration tracking: Including the current iteration helps the LLM understand how long it’s been working. After iteration 8 of 10, it might change strategy or provide intermediate results.
Previous results: The outcome of the last action directly influences the next decision. Did the API call succeed? What data came back? This feedback is essential.
Common failures:
Token limit exceeded because you included entire conversation history
Missing tool descriptions causing the agent to ignore available functions
No previous result context making the agent repeat failed actions
Task description missing causing goal drift over multiple iterations
Step 2: Think – LLM Reasoning
Purpose: Process observations and reason about next steps
def llm_think(observation: dict) -> str:
    """
    The LLM receives context and generates reasoning.
    This is where the intelligence happens.
    """
    prompt = f"""
    Task: {observation['task']}

    Previous conversation:
    {format_history(observation['history'])}

    Available tools:
    {format_tools(observation['available_tools'])}

    Previous result: {observation.get('previous_result', 'None')}

    Based on this context, what should you do next?
    Think step-by-step about:
    1. What information do you have?
    2. What information do you need?
    3. Which tool (if any) should you use?
    4. Can you provide a final answer?
    """
    return llm.generate(prompt)
This is where reasoning happens. The LLM analyzes the current situation and determines the next action. Quality of thinking depends entirely on prompt design.
Prompt engineering for agents:
Structure matters: Notice the prompt breaks down reasoning into steps. “What should you do next?” is too vague. “Think step-by-step about information you have, information you need, tools to use, and whether you can answer” guides better reasoning.
Context ordering: Put the most important information first. Task description comes before history. Tool descriptions come before previous results. LLMs perform better with well-structured input.
Tool descriptions in reasoning: The agent needs clear descriptions of each tool’s purpose, inputs, and outputs. Ambiguous descriptions lead to wrong tool selection.
ReAct pattern: Many production systems use “Reason + Act” prompting. The LLM explicitly writes its reasoning (“I need weather data, so I’ll use the weather tool”) before selecting actions. This improves decision quality and debuggability.
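For illustration, here is a minimal sketch of a ReAct-style format instruction you can append to an agent prompt. The marker names ("Thought", "Action", "Action Input") are a common convention, not a requirement of any specific library.

# A minimal ReAct-style prompt suffix; the marker names are a convention, not a library API
REACT_FORMAT = """
Respond using exactly this format:

Thought: <your reasoning about what to do next>
Action: <tool name, or the word "finish">
Action Input: <JSON arguments for the tool, or the final answer if finishing>
"""

def build_react_prompt(base_prompt: str) -> str:
    """Append the ReAct format instructions to an existing agent prompt."""
    return base_prompt + REACT_FORMAT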
Common reasoning failures:
Generic prompts that don’t guide step-by-step thinking
Missing tool descriptions causing the agent to hallucinate functions
Unclear task specifications leading to goal confusion
No explicit reasoning step making decisions opaque
Step 3: Decide – Action Selection
Purpose: Convert reasoning into a specific, executable action
def decide_action(reasoning: str) -> dict:
    """
    Parse the LLM's reasoning and extract a structured action.
    This bridges thinking and execution.
    """
    # Parse LLM output for tool calls or final answers
    if "Tool:" in reasoning:
        tool_name = extract_tool_name(reasoning)
        tool_args = extract_tool_arguments(reasoning)
        return {
            "type": "tool_call",
            "tool": tool_name,
            "arguments": tool_args
        }
    elif "Final Answer:" in reasoning:
        answer = extract_final_answer(reasoning)
        return {
            "type": "final_answer",
            "content": answer
        }
    else:
        # Unclear reasoning - request clarification
        return {
            "type": "continue",
            "message": "Need more information"
        }
Decision making converts reasoning to structure. The LLM output is text. Execution requires structured data. This step parses reasoning into actionable commands.
Structured output formats:
Modern LLMs support structured outputs via function calling or JSON mode. Instead of parsing text, you can get typed responses:
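For example (a minimal sketch assuming Pydantic v2; the exact JSON-mode invocation depends on your LLM client), the parsing step becomes schema validation instead of string matching:

from typing import Literal, Optional
from pydantic import BaseModel, ValidationError

class AgentAction(BaseModel):
    """Schema the LLM's JSON-mode output must satisfy."""
    type: Literal["tool_call", "final_answer"]
    tool: Optional[str] = None       # required in practice when type == "tool_call"
    arguments: dict = {}
    content: Optional[str] = None    # required in practice when type == "final_answer"

def decide_action_structured(raw_json: str) -> dict:
    """Validate the model's JSON output against the schema instead of parsing free text."""
    try:
        return AgentAction.model_validate_json(raw_json).model_dump()
    except ValidationError as e:
        return {"type": "error", "message": f"Invalid structured output: {e}"}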
This approach eliminates parsing errors and guarantees valid tool calls.
Decision validation:
Before executing, validate the decision against a short checklist (a validate_action sketch follows it):
Does the requested tool exist?
Are all required arguments provided?
Do argument types match the schema?
Are argument values reasonable?
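The fallback wrapper below relies on a validate_action helper. A minimal sketch of it might look like this; ValidationError is a simple custom exception here, and the tool-schema layout (name, parameters, required) follows the conventions used elsewhere in this guide, so treat both as assumptions.

class ValidationError(Exception):
    """Raised when a decided action fails validation (simple custom exception for this sketch)."""

def validate_action(action: dict) -> None:
    """Raise ValidationError if the decided action cannot be executed."""
    if action["type"] != "tool_call":
        return  # final answers and clarifications need no argument checks

    # Assumes get_available_tools() returns dicts with "name" and "parameters" keys
    tools = {t["name"]: t for t in get_available_tools()}
    if action["tool"] not in tools:
        raise ValidationError(f"Unknown tool: {action['tool']}")

    required = tools[action["tool"]]["parameters"].get("required", [])
    missing = [p for p in required if p not in action.get("arguments", {})]
    if missing:
        raise ValidationError(f"Missing required arguments: {missing}")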
Failure handling:
What happens when the LLM generates invalid output? You need fallback logic:
def decide_action_safe(reasoning: str) -> dict:
    try:
        action = decide_action(reasoning)
        validate_action(action)
        return action
    except ParseError:
        return {
            "type": "error",
            "message": "Could not parse action from reasoning"
        }
    except ValidationError as e:
        return {
            "type": "error",
            "message": f"Invalid action: {str(e)}"
        }
Common decision failures:
LLM hallucinates non-existent tools
Missing required arguments in tool calls
Type mismatches between provided and expected arguments
No validation before execution causing downstream crashes
Step 4: Act – Execution
Purpose: Execute the decided action and return results
def execute_action(action: dict) -> dict:
    """
    Execute tool calls or generate final answers.
    This is where the agent interacts with the world.
    """
    if action["type"] == "tool_call":
        tool = get_tool(action["tool"])
        try:
            result = tool.execute(**action["arguments"])
            return {
                "success": True,
                "result": result,
                "tool": action["tool"]
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "tool": action["tool"]
            }
    elif action["type"] == "final_answer":
        return {
            "success": True,
            "result": action["content"],
            "final": True
        }
    elif action["type"] == "error":
        return {
            "success": False,
            "error": action["message"]
        }
This is where theory meets reality. Tools interact with external systems: APIs, databases, file systems, calculators. External systems fail, timeout, return unexpected data, or change their interfaces.
Production execution considerations:
Error handling is mandatory: Every external call can fail. Network issues, API rate limits, authentication failures, malformed responses – expect everything.
def safe_tool_execution(tool, arguments, timeout=30):
    """Production-grade tool execution with comprehensive error handling"""
    try:
        # Set timeout to prevent hanging
        with time_limit(timeout):
            result = tool.execute(**arguments)

        # Validate result format
        validate_result_schema(result)
        return {"success": True, "result": result}
    except TimeoutError:
        return {"success": False, "error": "Tool execution timeout"}
    except ValidationError as e:
        return {"success": False, "error": f"Invalid result format: {e}"}
    except APIError as e:
        return {"success": False, "error": f"API error: {e}"}
    except Exception:
        # Log unexpected errors for debugging
        logger.exception(f"Unexpected error in {tool.name}")
        return {"success": False, "error": "Tool execution failed"}
Retry logic: Transient failures (network issues, temporary API problems) should trigger retries with exponential backoff:
def execute_with_retry(tool, arguments, max_retries=3):
    for attempt in range(max_retries):
        result = tool.execute(**arguments)
        if result["success"]:
            return result
        if not is_retryable_error(result["error"]):
            return result
        # Exponential backoff: 1s, 2s, 4s
        time.sleep(2 ** attempt)
    return result
Result formatting: Tools should return consistent result structures. Standardize on success/error patterns:
# Good: Consistent structure
{
    "success": True,
    "result": "42",
    "metadata": {"tool": "calculator", "execution_time": 0.01}
}

# Bad: Inconsistent structure
"42"  # Just a string - no error information
Common execution failures:
Missing timeout handling causing agents to hang
No retry logic for transient failures
Poor error messages making debugging impossible
Inconsistent result formats breaking downstream processing
Step 5: Update State – Memory Management
Purpose: Incorporate action results into agent state
def update_state(state: dict, action: dict, result: dict) -> dict:
    """
    Update state with action outcomes.
    This is how the agent learns and remembers.
    """
    # Add to conversation history
    state["conversation_history"].append({
        "iteration": state["iteration"],
        "action": action,
        "result": result,
        "timestamp": datetime.now()
    })

    # Update last result for next observation
    state["last_result"] = result

    # Check for completion
    if result.get("final"):
        state["completed"] = True
        state["final_answer"] = result["result"]

    # Trim history if too long
    if len(state["conversation_history"]) > 20:
        state["conversation_history"] = state["conversation_history"][-20:]

    return state
State management is how agents remember. Without proper updates, agents repeat actions, forget results, and lose context.
What to store:
Conversation history: Every action and result. This creates the narrative of what happened. Essential for debugging and providing context in future observations.
Last result: The most recent outcome directly influences the next decision. Store it separately for easy access.
Metadata: Timestamps, iteration numbers, execution times. Useful for debugging and performance analysis.
State trimming strategies:
States grow indefinitely if not managed. Implement one of these strategies (a summarization sketch follows the list):
Fixed window: Keep last N interactions (shown above)
Summarization: Use an LLM to summarize old history into concise context
Semantic filtering: Keep only relevant interactions based on similarity to current task
Hierarchical storage: Recent items in full detail, older items summarized, ancient items removed
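Here is a rough sketch of the summarization strategy, reusing the llm.generate() style call from earlier; the prompt wording and the "system" summary message are just one possible choice.

def summarize_old_history(history: list, llm, keep_recent: int = 5) -> list:
    """Collapse everything older than the last few turns into a single summary message."""
    if len(history) <= keep_recent:
        return history

    old, recent = history[:-keep_recent], history[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)

    summary = llm.generate(
        "Summarize the following conversation in a few sentences, "
        "keeping facts, decisions, and unresolved questions:\n" + transcript
    )

    # Replace the old turns with one compact summary message
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent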
Memory types explained:
Figure 2: Three types of agent memory – Short-term (conversation), Long-term (persistent), and Episodic (learning from past interactions)
Short-term memory:
Current conversation context
Lasts for a single session
Stored in the state dictionary
Used for maintaining coherence within a task
Long-term memory:
Persistent information across sessions
User preferences, learned facts, configuration
Stored in databases or vector stores
Requires explicit saving and loading
Episodic memory:
Past successful/failed strategies
Patterns of what works in specific situations
Used for learning and improvement
Stored as embeddings of past interactions
Common state management failures:
Unbounded state growth causing memory issues
Not trimming history leading to token limit errors
Missing metadata making debugging impossible
No persistent storage losing context between sessions
Termination Check – Knowing When to Stop
Purpose: Determine if the agent should continue or finish
def is_task_complete(state: dict) -> bool:
    """
    Multiple termination conditions for safety and correctness.
    Never rely on a single condition.
    """
    # Success: Explicit completion
    if state.get("completed"):
        return True

    # Safety: Maximum iterations
    if state["iteration"] >= MAX_ITERATIONS:
        logger.warning("Max iterations reached")
        return True

    # Safety: Cost limits
    if calculate_cost(state) >= MAX_COST:
        logger.warning("Cost budget exceeded")
        return True

    # Safety: Time limits
    if time_elapsed(state) >= MAX_TIME:
        logger.warning("Time limit exceeded")
        return True

    # Detection: Loop/stuck state
    if detect_loop(state):
        logger.warning("Loop detected")
        return True

    return False
Termination is critical and complex. A single condition isn’t enough. You need multiple safety valves.
Termination conditions explained:
Task completion (success): The agent explicitly generated a final answer and marked itself complete. This is the happy path.
Max iterations (safety): After N iterations, stop regardless. Prevents infinite loops from logic errors or unclear tasks. Set this based on task complexity – simple tasks might need 5 iterations, complex ones might need 20.
Cost limits (budget): Each LLM call costs money. Set a budget (in dollars or tokens) and stop when exceeded. Protects against runaway costs.
Time limits (performance): User-facing agents need responsiveness. If execution exceeds time budget, return partial results rather than making users wait indefinitely.
Loop detection (stuck states): If the agent repeats the same action multiple times or cycles through the same states, it’s stuck. Detect this and terminate.
Loop detection implementation:
def detect_loop(state: dict, window=3) -> bool:
    """
    Detect if agent is repeating actions.
    Compares last N actions for similarity.
    """
    if len(state["conversation_history"]) < window:
        return False

    recent_actions = [
        h["action"] for h in state["conversation_history"][-window:]
    ]

    # Check if all recent actions are identical
    if all(a == recent_actions[0] for a in recent_actions):
        return True

    # Check if cycling through same set of actions
    if len(set(str(a) for a in recent_actions)) < window / 2:
        return True

    return False
Graceful degradation:
When terminating due to safety conditions, provide useful output:
def extract_final_answer(state: dict) -> str:
    """
    Extract final answer, handling different termination reasons.
    """
    if state.get("final_answer"):
        return state["final_answer"]

    # Terminated due to safety condition
    if state["iteration"] >= MAX_ITERATIONS:
        return "Could not complete task within iteration limit. " + \
            summarize_progress(state)

    if detect_loop(state):
        return "Task appears stuck. Last attempted: " + \
            describe_last_action(state)

    # Fallback
    return "Task incomplete. Progress: " + summarize_progress(state)
Common termination failures:
Single termination condition causing infinite loops
No cost limits burning through API budgets
Missing timeout making user-facing agents unresponsive
Poor loop detection allowing stuck states to continue
Tool Calling: The Action Interface
Tools are how agents interact with the world. Without properly designed tools, agents are just chatbots. With them, agents can query databases, call APIs, perform calculations, and manipulate systems.
The three-part tool structure:
Every production tool needs three components:
1. Function Implementation:
def search_web(query: str, num_results: int = 5) -> str:
    """
    Search the web and return results.

    Args:
        query: Search query string
        num_results: Number of results to return (default: 5)

    Returns:
        Formatted search results
    """
    try:
        # Implementation
        results = web_search_api.search(query, num_results)
        return format_results(results)
    except Exception as e:
        return f"Search failed: {str(e)}"
2. Schema Definition:
search_tool_schema = {
    "name": "search_web",
    "description": "Search the web for current information. Use this when you need recent data, news, or information not in your training data.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query"
            },
            "num_results": {
                "type": "integer",
                "description": "Number of results (1-10)",
                "default": 5
            }
        },
        "required": ["query"]
    }
}
3. Wrapper Class:
class Tool:
    """Base tool interface"""

    def __init__(self, name: str, description: str, function: callable, schema: dict):
        self.name = name
        self.description = description
        self.function = function
        self.schema = schema

    def execute(self, **kwargs) -> dict:
        """Execute tool with validation and error handling"""
        # Validate inputs against schema
        self._validate_inputs(kwargs)

        # Execute with error handling
        try:
            result = self.function(**kwargs)
            return {"success": True, "result": result}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def _validate_inputs(self, kwargs: dict):
        """Validate inputs match schema"""
        required = self.schema["parameters"].get("required", [])
        for param in required:
            if param not in kwargs:
                raise ValueError(f"Missing required parameter: {param}")
Why all three components matter:
Function implementation does the actual work. This is where you integrate with external systems.
Schema definition tells the LLM how to use the tool. Clear descriptions and parameter documentation are essential. The LLM decides which tool to use based entirely on this information.
Wrapper class provides standardization. All tools follow the same interface, simplifying agent logic and error handling.
Tool description best practices:
# Bad description
"search_web: Searches the web"

# Good description
"search_web: Search the internet for current information, news, and recent events. Use this when you need information published after your knowledge cutoff or want to verify current facts. Returns the top search results with titles and snippets."
Good descriptions answer:
What does it do?
When should you use it?
What does it return?
Figure 3: Tool calling flow – LLM generates tool call → Schema validation → Function execution → Result formatting → State update
Real-world tool examples:
Calculator tool:
def calculator(expression: str) -> str:
    """
    Evaluate mathematical expressions safely.
    Supports: +, -, *, /, **, (), and common functions.
    """
    try:
        # Safe evaluation without exec/eval
        result = eval_expression_safe(expression)
        return f"Result: {result}"
    except Exception as e:
        return f"Error: {str(e)}"


calculator_schema = {
    "name": "calculator",
    "description": "Perform mathematical calculations. Supports arithmetic, exponents, and parentheses. Use for any computation.",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {
                "type": "string",
                "description": "Mathematical expression (e.g., '2 + 2', '(10 * 5) / 2')"
            }
        },
        "required": ["expression"]
    }
}
Database query tool:
def query_database(query: str, table: str) -> str:
    """
    Execute SQL query on specified table.
    Supports: SELECT statements only (read-only).
    """
    # Validate query is SELECT only
    if not query.strip().upper().startswith("SELECT"):
        return "Error: Only SELECT queries allowed"

    try:
        results = db.execute(query, table)
        return format_db_results(results)
    except Exception as e:
        return f"Query error: {str(e)}"


database_schema = {
    "name": "query_database",
    "description": "Query the database for stored information. Use this to retrieve user data, preferences, past orders, or historical records. Read-only access.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "SQL SELECT query"
            },
            "table": {
                "type": "string",
                "description": "Table name to query",
                "enum": ["users", "orders", "products", "preferences"]
            }
        },
        "required": ["query", "table"]
    }
}
API call tool:
def api_call(endpoint: str, method: str = "GET", data: dict = None) -> str:
    """
    Make API requests to external services.
    Handles authentication and error responses.
    """
    try:
        response = requests.request(
            method=method,
            url=f"{API_BASE_URL}/{endpoint}",
            json=data,
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30
        )
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        return "Error: Request timeout"
    except requests.RequestException as e:
        return f"Error: {str(e)}"


api_call_schema = {
    "name": "api_call",
    "description": "Call external APIs for real-time data. Use for weather, stock prices, exchange rates, or other external services.",
    "parameters": {
        "type": "object",
        "properties": {
            "endpoint": {
                "type": "string",
                "description": "API endpoint path (e.g., 'weather', 'stocks/AAPL')"
            },
            "method": {
                "type": "string",
                "enum": ["GET", "POST"],
                "default": "GET"
            },
            "data": {
                "type": "object",
                "description": "Request body for POST requests"
            }
        },
        "required": ["endpoint"]
    }
}
Common tool design failures:
Vague descriptions causing the LLM to misuse tools
Missing input validation allowing invalid data through
No timeout handling causing hung executions
Poor error messages making debugging impossible
Inconsistent return formats breaking state updates
Memory Architecture: Short-term, Long-term, and Episodic
Memory separates toy demos from production systems. Conversation without memory frustrates users. But not all memory is the same – different types serve different purposes.
Short-term Memory: Conversation Context
Purpose: Maintain coherence within a single conversation
Implementation:
class ShortTermMemory:
    """
    Manages conversation context for current session.
    Stored in-memory, not persisted.
    """

    def __init__(self, max_messages: int = 20):
        self.messages = []
        self.max_messages = max_messages

    def add_message(self, role: str, content: str):
        """Add message to history"""
        self.messages.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now()
        })

        # Trim if too long
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self) -> list:
        """Get recent conversation context"""
        return self.messages

    def clear(self):
        """Clear conversation history"""
        self.messages = []
Use cases:
Current conversation flow
Immediate context for next response
Temporary task state
Within-session coherence
Limitations:
Lost when session ends
Grows unbounded without trimming
Token limits force summarization
Long-term Memory: Persistent Storage
Purpose: Remember information across sessions
Implementation:
class LongTermMemory:
    """
    Persistent storage for facts and preferences.
    Uses database or key-value store.
    """

    def __init__(self, user_id: str, db_connection):
        self.user_id = user_id
        self.db = db_connection

    def store_fact(self, key: str, value: str, category: str = "general"):
        """Store a fact about the user"""
        self.db.upsert(
            table="user_facts",
            data={
                "user_id": self.user_id,
                "key": key,
                "value": value,
                "category": category,
                "updated_at": datetime.now()
            }
        )

    def retrieve_fact(self, key: str) -> str:
        """Retrieve a stored fact"""
        result = self.db.query(
            "SELECT value FROM user_facts WHERE user_id = ? AND key = ?",
            (self.user_id, key)
        )
        return result["value"] if result else None

    def get_all_facts(self, category: str = None) -> dict:
        """Get all facts, optionally filtered by category"""
        query = "SELECT key, value FROM user_facts WHERE user_id = ?"
        params = [self.user_id]
        if category:
            query += " AND category = ?"
            params.append(category)

        results = self.db.query(query, params)
        return {r["key"]: r["value"] for r in results}
Use cases:
User preferences (communication style, format preferences)
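The hybrid memory shown next also relies on an episodic store. Here is a minimal sketch of that layer, assuming a generic vector-store client with add() and search() methods plus an embed() helper (all three are assumptions, not a specific library's API); the method names match how HybridMemory uses them.

class EpisodicMemory:
    """
    Stores complete task episodes as embeddings so similar past work can be recalled.
    The vector_db client (add/search) and embed() helper are assumptions for this sketch;
    search() is assumed to return the stored metadata dicts.
    """

    def __init__(self, user_id: str, vector_db):
        self.user_id = user_id
        self.vector_db = vector_db

    def store_episode(self, task: str, actions: list, outcome: dict):
        """Persist one finished episode: what was asked, what was done, how it ended."""
        self.vector_db.add(
            embedding=embed(task),
            metadata={
                "user_id": self.user_id,
                "task": task,
                "actions": actions,
                "success": outcome.get("success", False)
            }
        )

    def retrieve_similar_episodes(self, task: str, k: int = 3) -> list:
        """Find past episodes whose task looks like the current one."""
        return self.vector_db.search(
            embedding=embed(task),
            top_k=k,
            filter={"user_id": self.user_id}
        )

    def get_successful_strategies(self, task: str) -> list:
        """Return only the action sequences from similar episodes that succeeded."""
        episodes = self.retrieve_similar_episodes(task)
        return [e["actions"] for e in episodes if e.get("success")]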
Combining all three layers into a single interface:

class HybridMemory:
    """
    Complete memory system combining short-term, long-term, and episodic.
    """

    def __init__(self, user_id: str):
        self.short_term = ShortTermMemory()
        self.long_term = LongTermMemory(user_id, get_db())
        self.episodic = EpisodicMemory(user_id, get_vector_db())

    def prepare_context(self, task: str) -> dict:
        """Prepare complete context for agent"""
        return {
            # Current conversation
            "recent_messages": self.short_term.get_context(),
            # User facts and preferences
            "user_facts": self.long_term.get_all_facts(),
            # Similar past successes
            "similar_episodes": self.episodic.retrieve_similar_episodes(task),
            # Learned strategies
            "successful_strategies": self.episodic.get_successful_strategies(task)
        }

    def update(self, role: str, content: str, metadata: dict = None):
        """Update all memory types"""
        # Update short-term
        self.short_term.add_message(role, content)

        # Extract and store facts
        if facts := extract_facts(content):
            for key, value in facts.items():
                self.long_term.store_fact(key, value)

    def finalize_episode(self, task: str, outcome: dict):
        """Store complete episode after task completion"""
        actions = self.short_term.get_context()
        self.episodic.store_episode(task, actions, outcome)
Memory selection guide:
Need → Memory type
Current conversation → Short-term
User preferences → Long-term
Past successful strategies → Episodic
Temporary task state → Short-term
Learned behaviors → Long-term + Episodic
Session-specific context → Short-term
Cross-session facts → Long-term
Strategy learning → Episodic
Observations vs Actions: The Critical Distinction
This seems obvious until you’re debugging a broken agent. Did it fail because it didn’t observe the right information, or because it took the wrong action based on correct observations?
The distinction:
Observations are information inputs:
Current task description
Conversation history
Available tools
Previous results
Memory context
System state
Actions are operations:
Tool calls
Final answer generation
Follow-up questions
State updates
Memory writes
Why this matters for debugging:
# Example debugging scenario
task = "Find weather in San Francisco and convert temperature to Celsius"

# Agent fails - but where?

# Possibility 1: Observation failure
# - Task not in context
# - Tool description missing
# - Previous result not included

# Possibility 2: Action failure
# - Selected wrong tool
# - Provided invalid arguments
# - Didn't chain actions properly
2. Check reasoning:

def debug_reasoning(observation: dict, reasoning: str):
    """Verify reasoning quality"""
    checks = {
        "task_referenced": observation["task"] in reasoning,
        "tools_considered": any(
            tool["name"] in reasoning
            for tool in observation["available_tools"]
        ),
        "explicit_decision": any(
            marker in reasoning
            for marker in ["I will", "I should", "Next step"]
        ),
        "reasoning_present": len(reasoning) > 100
    }

    print("Reasoning Quality:")
    for check, passed in checks.items():
        status = "✓" if passed else "✗"
        print(f"  {status} {check}")
3. Check actions:
def debug_action(action: dict):
    """Verify action validity"""
    checks = {
        "valid_type": action["type"] in ["tool_call", "final_answer"],
        "tool_exists": action.get("tool") in get_available_tools(),
        "has_arguments": "arguments" in action if action["type"] == "tool_call" else True,
        "arguments_valid": validate_arguments(action) if action["type"] == "tool_call" else True
    }

    print("Action Quality:")
    for check, passed in checks.items():
        status = "✓" if passed else "✗"
        print(f"  {status} {check}")
Common failure patterns:
Observation failures:
Missing tool descriptions → Agent doesn’t know what’s available
Truncated history → Lost context from earlier conversation
No previous result → Repeats failed actions
Task not included → Goal drift
Reasoning failures:
Generic thinking → No specific strategy
Ignores tools → Tries to answer without external data
No step-by-step breakdown → Jumps to conclusions
Contradictory logic → Internal inconsistency
Action failures:
Hallucinated tools → Tries to call non-existent functions
Invalid arguments → Wrong types or missing required parameters
Wrong tool selection → Has right tools but picks wrong one
No action → Gets stuck in analysis paralysis
The debugging workflow:
Figure 5: Agent Failure Debugging Flow
Building a Production Agent: Complete Implementation
Let’s tie everything together with a complete, production-ready agent implementation:
import logging
import json
from datetime import datetime
from typing import Dict, List, Any

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class ProductionAgent:
    """
    Complete agent implementation with:
    - Multiple tools
    - Conversation memory
    - Error handling
    - Execution tracking
    - Debug capabilities
    """

    def __init__(
        self,
        llm,
        tools: List[Tool],
        max_iterations: int = 10,
        max_cost: float = 1.0
    ):
        self.llm = llm
        self.tools = {tool.name: tool for tool in tools}
        self.max_iterations = max_iterations
        self.max_cost = max_cost
        self.memory = ShortTermMemory()

        # Execution tracking
        self.stats = {
            "total_iterations": 0,
            "successful_completions": 0,
            "tool_calls": 0,
            "errors": 0,
            "total_cost": 0.0
        }

    def run(self, task: str, debug: bool = False) -> Dict[str, Any]:
        """
        Execute agent loop for given task.

        Args:
            task: The task to accomplish
            debug: Enable debug output

        Returns:
            Result dictionary with answer and metadata
        """
        # Initialize state
        state = {
            "task": task,
            "iteration": 0,
            "completed": False,
            "start_time": datetime.now()
        }

        logger.info(f"Starting task: {task}")

        try:
            # Main agent loop
            while not self._should_terminate(state):
                if debug:
                    print(f"\n=== Iteration {state['iteration']} ===")

                # OBSERVE
                observation = self._observe(state)
                if debug:
                    print(f"Observation: {json.dumps(observation, indent=2)}")

                # THINK
                reasoning = self._think(observation)
                if debug:
                    print(f"Reasoning: {reasoning[:200]}...")

                # DECIDE
                action = self._decide(reasoning)
                if debug:
                    print(f"Action: {action}")

                # ACT
                result = self._act(action)
                if debug:
                    print(f"Result: {result}")

                # UPDATE STATE
                state = self._update_state(state, action, result)

                # Check completion
                if result.get("final"):
                    state["completed"] = True
                    state["final_answer"] = result["result"]

                state["iteration"] += 1
                self.stats["total_iterations"] += 1

            # Extract final answer
            answer = self._extract_answer(state)

            if state["completed"]:
                self.stats["successful_completions"] += 1

            return {
                "success": True,
                "answer": answer,
                "iterations": state["iteration"],
                "execution_time": (datetime.now() - state["start_time"]).total_seconds(),
                "termination_reason": self._get_termination_reason(state)
            }

        except Exception as e:
            logger.exception("Agent execution failed")
            self.stats["errors"] += 1
            return {
                "success": False,
                "error": str(e),
                "iterations": state["iteration"]
            }

    def _observe(self, state: dict) -> dict:
        """Gather context for decision making"""
        return {
            "task": state["task"],
            "conversation": self.memory.get_context(),
            "available_tools": [
                {
                    "name": tool.name,
                    "description": tool.description,
                    "parameters": tool.schema["parameters"]
                }
                for tool in self.tools.values()
            ],
            "iteration": state["iteration"],
            "max_iterations": self.max_iterations,
            "previous_result": state.get("last_result")
        }

    def _think(self, observation: dict) -> str:
        """LLM reasoning step"""
        prompt = self._build_prompt(observation)

        # Track cost
        response = self.llm.generate(prompt)
        self.stats["total_cost"] += estimate_cost(response)

        return response

    def _build_prompt(self, observation: dict) -> str:
        """Construct prompt for LLM"""
        tools_desc = "\n".join([
            f"- {t['name']}: {t['description']}"
            for t in observation["available_tools"]
        ])

        history = "\n".join([
            f"{msg['role']}: {msg['content']}"
            for msg in observation["conversation"][-5:]
        ])

        return f"""You are a helpful agent that can use tools to accomplish tasks.

Task: {observation['task']}

Available tools:
{tools_desc}

Conversation history:
{history}

Previous result: {observation.get('previous_result', 'None')}

You are on iteration {observation['iteration']} of {observation['max_iterations']}.

Think step by step:
1. What is the current situation?
2. What information do I have?
3. What information do I need?
4. Should I use a tool or provide a final answer?

If using a tool, respond with:
Tool: <tool_name>
Arguments: <arguments_as_json>

If providing final answer, respond with:
Final Answer: <your_answer>

Your reasoning:"""

    def _decide(self, reasoning: str) -> dict:
        """Parse reasoning into structured action"""
        try:
            if "Tool:" in reasoning:
                # Extract tool call
                tool_line = [l for l in reasoning.split("\n") if l.startswith("Tool:")][0]
                tool_name = tool_line.split("Tool:")[1].strip()

                args_line = [l for l in reasoning.split("\n") if l.startswith("Arguments:")][0]
                args_json = args_line.split("Arguments:")[1].strip()
                arguments = json.loads(args_json)

                return {
                    "type": "tool_call",
                    "tool": tool_name,
                    "arguments": arguments
                }

            elif "Final Answer:" in reasoning:
                # Extract final answer
                answer = reasoning.split("Final Answer:")[1].strip()
                return {
                    "type": "final_answer",
                    "content": answer
                }

            else:
                return {
                    "type": "continue",
                    "message": "No clear action determined"
                }

        except Exception as e:
            logger.error(f"Failed to parse action: {e}")
            return {
                "type": "error",
                "message": f"Could not parse action: {str(e)}"
            }

    def _act(self, action: dict) -> dict:
        """Execute action"""
        try:
            if action["type"] == "tool_call":
                # Validate tool exists
                if action["tool"] not in self.tools:
                    return {
                        "success": False,
                        "error": f"Tool '{action['tool']}' not found"
                    }

                # Execute tool
                tool = self.tools[action["tool"]]
                result = tool.execute(**action["arguments"])
                self.stats["tool_calls"] += 1
                return result

            elif action["type"] == "final_answer":
                return {
                    "success": True,
                    "result": action["content"],
                    "final": True
                }

            elif action["type"] == "continue":
                return {
                    "success": False,
                    "error": "No action taken - agent is uncertain"
                }

            elif action["type"] == "error":
                return {
                    "success": False,
                    "error": action["message"]
                }

        except Exception as e:
            logger.exception("Action execution failed")
            return {
                "success": False,
                "error": str(e)
            }

    def _update_state(self, state: dict, action: dict, result: dict) -> dict:
        """Update state with action outcome"""
        # Add to memory
        self.memory.add_message(
            role="assistant",
            content=f"Action: {action['type']} | Result: {result.get('result', result.get('error'))}"
        )

        # Store last result
        state["last_result"] = result

        return state

    def _should_terminate(self, state: dict) -> bool:
        """Check termination conditions"""
        # Success
        if state.get("completed"):
            return True

        # Max iterations
        if state["iteration"] >= self.max_iterations:
            logger.warning("Max iterations reached")
            return True

        # Cost limit
        if self.stats["total_cost"] >= self.max_cost:
            logger.warning("Cost limit exceeded")
            return True

        # Time limit (5 minutes)
        elapsed = (datetime.now() - state["start_time"]).total_seconds()
        if elapsed > 300:
            logger.warning("Time limit exceeded")
            return True

        return False

    def _extract_answer(self, state: dict) -> str:
        """Extract final answer from state"""
        if "final_answer" in state:
            return state["final_answer"]

        # Fallback for incomplete tasks
        last_result = state.get("last_result", {})
        if last_result.get("success"):
            return f"Task incomplete. Last result: {last_result['result']}"
        else:
            return f"Task incomplete. Last error: {last_result.get('error', 'Unknown')}"

    def _get_termination_reason(self, state: dict) -> str:
        """Determine why execution terminated"""
        if state.get("completed"):
            return "task_completed"
        elif state["iteration"] >= self.max_iterations:
            return "max_iterations"
        elif self.stats["total_cost"] >= self.max_cost:
            return "cost_limit"
        else:
            return "unknown"

    def get_stats(self) -> dict:
        """Get execution statistics"""
        return self.stats.copy()

    def reset_stats(self):
        """Reset execution statistics"""
        for key in self.stats:
            self.stats[key] = 0 if isinstance(self.stats[key], (int, float)) else 0.0
Usage example:
# Define tools
calculator = Tool(
    name="calculator",
    description="Perform mathematical calculations",
    function=calculator_function,
    schema=calculator_schema
)

weather = Tool(
    name="weather",
    description="Get current weather for a location",
    function=weather_function,
    schema=weather_schema
)

search = Tool(
    name="search_web",
    description="Search the web for information",
    function=search_function,
    schema=search_schema
)

# Create agent
agent = ProductionAgent(
    llm=get_llm(),
    tools=[calculator, weather, search],
    max_iterations=10,
    max_cost=0.50
)

# Run task
result = agent.run(
    task="What's the weather in San Francisco? Convert the temperature to Celsius.",
    debug=True
)

print(f"Answer: {result['answer']}")
print(f"Iterations: {result['iterations']}")
print(f"Time: {result['execution_time']:.2f}s")
print(f"Reason: {result['termination_reason']}")

# Check stats
print("\nExecution Statistics:")
print(json.dumps(agent.get_stats(), indent=2))
This implementation includes:
✅ Complete agent loop
✅ Multiple tools with validation
✅ Conversation memory
✅ Error handling at every step
✅ Execution tracking and statistics
✅ Debug mode for development
✅ Multiple termination conditions
✅ Cost tracking
✅ Comprehensive logging
Testing and Debugging Strategies
Production agents require systematic testing. Here’s how to validate each component:
def test_agent_with_calculator():
    """Test agent executing calculator tool"""
    agent = ProductionAgent(
        llm=get_test_llm(),
        tools=[calculator_tool],
        max_iterations=5
    )

    result = agent.run("What is 15 * 23?")

    assert result["success"] == True
    assert "345" in result["answer"]
    assert result["iterations"] <= 3


def test_agent_multi_step():
    """Test multi-step reasoning"""
    agent = ProductionAgent(
        llm=get_test_llm(),
        tools=[calculator_tool, weather_tool],
        max_iterations=10
    )

    result = agent.run(
        "Get weather in Boston. If temperature is above 20C, calculate 20 * 3."
    )

    assert result["success"] == True
    stats = agent.get_stats()
    assert stats["tool_calls"] >= 2  # Weather + calculator
End-to-End Tests
Test complete user flows:
def test_conversation_memory():
    """Test memory across multiple turns"""
    agent = ProductionAgent(
        llm=get_test_llm(),
        tools=[],
        max_iterations=5
    )

    # First turn
    result1 = agent.run("My name is Alice")
    assert result1["success"] == True

    # Second turn - should remember name
    result2 = agent.run("What's my name?")
    assert result2["success"] == True
    assert "Alice" in result2["answer"]


def test_error_recovery():
    """Test agent handling tool errors"""
    faulty_tool = Tool(
        name="faulty",
        description="A tool that fails",
        function=lambda x: raise_exception(),
        schema={"parameters": {"properties": {}}}
    )

    agent = ProductionAgent(
        llm=get_test_llm(),
        tools=[faulty_tool, calculator_tool],
        max_iterations=10
    )

    result = agent.run("Try the faulty tool, then calculate 2+2")

    assert result["success"] == True  # Should recover and complete
    assert "4" in result["answer"]
Pitfall 1: Infinite loops
Problem: The agent repeats the same action without making progress
Solution:

def detect_loop(state: dict, window: int = 3) -> bool:
    """Detect repeated actions"""
    if len(state["history"]) < window:
        return False

    recent = state["history"][-window:]
    actions = [h["action"] for h in recent]

    # All identical
    if all(a == actions[0] for a in actions):
        return True

    return False
Pitfall 2: Context window overflow
Problem: Too much history exceeds token limits
Solution:
def manage_context(history: list, max_tokens: int = 4000) -> list:
    """Keep context within token limits"""
    while estimate_tokens(history) > max_tokens:
        if len(history) <= 2:  # Keep minimum context
            break
        # Remove oldest message
        history = history[1:]
    return history
Pitfall 3: Tool hallucination
Problem: LLM invents non-existent tools
Solution:
def validate_tool_call(tool_name: str, available_tools: list) -> bool:
    """Validate tool exists before execution"""
    if tool_name not in [t.name for t in available_tools]:
        logger.warning(f"Attempted to call non-existent tool: {tool_name}")
        return False
    return True
Key takeaways:
The agent loop is fundamental: Every agent implements observe → think → decide → act → update state. Understanding this pattern helps you work with any framework.
Tools enable action: Without properly designed tools, agents are just chatbots. Invest time in clear descriptions, robust schemas, and comprehensive error handling.
Observations ≠ Actions: When debugging, distinguish between information gathering failures and execution failures. They require different fixes.
Production requires robustness: Max iterations, cost limits, timeouts, error handling, and logging aren’t optional – they’re essential.
Start simple, add complexity: Build single-loop agents first. Master the basics before moving to multi-agent systems and complex workflows.
What’s Next: LangGraph and Deterministic Flows
You now understand agent building blocks. But there’s a problem: the loop we built is still somewhat opaque.
Questions remain:
How do you guarantee certain steps happen in order?
How do you create branches (if-then logic)?
How do you make agent behavior deterministic and testable?
How do you visualize complex workflows?
The next blog will introduce LangGraph – a framework for building agents as explicit state machines. You’ll learn:
Why graphs beat loops for complex agents
How to define states, nodes, and edges
Conditional routing and branching logic
Checkpointing and retry mechanisms
Building deterministic, debuggable workflows
The key shift: From implicit loops to explicit state graphs
Instead of a while loop where logic is hidden in functions, you’ll define explicit graphs showing exactly how the agent moves through states. This makes complex behaviors clear, testable, and debuggable.
Conclusion: From Components to Systems
Building production-ready agents isn’t about calling agent.run() and hoping for the best. It’s about understanding each component – the execution loop, tool interfaces, memory architecture, and state management – and how they work together.
This guide gave you working implementations of every pattern. You’ve seen:
The canonical agent loop with all five steps
Tool design with schemas, validation, and error handling
Memory systems for short-term, long-term, and episodic storage
The observation-action distinction for systematic debugging
A complete production agent with tracking and statistics
The code isn’t pseudocode or simplified examples. It’s production-grade implementation you can adapt for real systems.
Start building: Take the patterns here and apply them to your problems. Build tools for your APIs. Implement memory for your users. Create agents that handle real tasks reliably.
The fundamentals transfer across frameworks. Whether you use LangChain, LangGraph, or custom solutions, you’ll recognize these patterns. More importantly, you’ll know how to debug them when they break.
Next up: LangGraph for deterministic, visual workflows. But first, implement the patterns here. Build a single-loop agent. Add tools. Test memory. Experience the challenges firsthand.
I’ve been building AI systems for production. The shift from LLMs to agents seemed small at first. Then I hit an infinite loop bug and watched costs spike. These aren’t the same thing at all.
Here’s what nobody tells you upfront: an LLM responds to you. An agent decides what to do. That difference is everything.
The Weather Test
Look at these two conversations:
System A:
You: "What's the weather in San Francisco?"
Bot: "I don't have access to real-time weather data, but San Francisco typically has mild temperatures year-round..."
System B:
You: "What's the weather in San Francisco?"
Agent: "It's currently 62°F and cloudy in San Francisco."
What happened differently? System B looked at your question, decided it needed weather data, called an API, got the result, and gave you an answer. Four distinct steps that System A couldn’t do.
That’s the line between generation and action. Between completing text and making decisions.
What LLMs Actually Do (And Don’t)
When you call GPT-4 or Claude, you’re using a completion engine. Feed it text, get text back. That’s it.
LLMs are genuinely impressive at pattern completion, synthesizing knowledge from training data, and understanding context. But they can’t remember your last conversation, access current information, execute code, or fix their own mistakes. Each API call is independent. No state. No feedback loop.
This isn’t a flaw. It’s what they are. A calculator doesn’t fail at making coffee. LLMs don’t fail at taking actions. They were built for a different job.
Five Things That Make Something an Agent
After building enough systems, you start seeing patterns. Real agents have these five capabilities:
Figure 1: Agentic Capabilities
Autonomy – The system figures out what to do, not just how to phrase things. You say “find last year’s sales data” and it determines which database to query, what filters to apply, and how to format results.
Planning – Breaking “analyze this dataset” into actual executable steps. Find the file, check the schema, run calculations, generate visualizations. Multi-step reasoning that adapts based on what it discovers.
Tool Use – APIs, databases, code execution. Ways to actually do things in the world beyond generating tokens.
Memory – Remembering the conversation two messages ago. Keeping track of what worked and what failed. Building context across interactions.
Feedback Loops – When something breaks, the agent sees the error and tries a different approach. Observation, action, observation, adaptation.
Strip away any of these and you’re back to an LLM with extra steps.
How Agents Actually Work
The core mechanism is simpler than you’d expect. It’s an observe-plan-act-observe loop:
Observe – Process the user’s request and current state
Plan – Decide what actions to take
Act – Execute those actions (call tools, run code)
Observe again – See what happened, decide next step
Let’s trace a real interaction:
User: "Book me a flight to NYC next Tuesday and add it to my calendar"

OBSERVATION:
- Two tasks: book flight + calendar entry
- Need: destination (NYC), date (next Tuesday), available tools

PLANNING:
1. Search flights to NYC for next Tuesday
2. Present options to user
3. Wait for user selection
4. Book selected flight
5. Add to calendar with flight details

ACTION:
- Execute: flight_search(destination="NYC", date="2025-12-17")

OBSERVATION (Result):
- Received 3 flight options with prices
- Status: Success

DECISION:
- Present options, wait for selection
- Update state: awaiting_user_selection
The agent isn’t just completing text. It’s making a decision at each step about what to do next based on what it observes.
The Spectrum of Agency
Not everything needs full autonomy. There’s a spectrum:
Figure 2: The Spectrum of Agency
Chatbots (low autonomy) – No tools, no state. Pure conversation. This is fine for FAQ bots where all you need is text generation.
Tool-using assistants – Fixed set of tools, simple state. The assistant can call your CRM API or check documentation, but it’s not planning multi-step operations.
Planning agents – Dynamic plans, complex state management. These can break down “analyze Q3 sales and generate a presentation” into actual steps that adapt based on intermediate results.
Multi-agent systems – Multiple agents coordinating, shared state. One agent handles research, another writes, another fact-checks. They communicate and negotiate task division.
Fully autonomous systems – Long-running, open-ended goals. These operate for extended periods with minimal supervision.
Most production systems land somewhere in the middle. You rarely need full autonomy. Often, you just need tools and basic memory.
Where Agents Break in Production
These six failure modes show up constantly in production:
Figure 3: Agent Failures in Production
Infinite loops – Agent calls web_search, doesn’t find what it needs, calls web_search again with slightly different parameters, repeats forever. Solution: set max iterations.
Tool hallucination – Agent tries to call send_email_to_team() which doesn’t exist. The LLM confidently invents plausible-sounding tool names. Solution: strict tool validation.
Context overflow – After 50 messages, the conversation history exceeds the context window. Agent forgets the original goal. Solution: smart context management and summarization.
Cost explosion – No cost caps, agent makes 200 API calls trying to debug something. Your bill hits $10,000 before you notice. Solution: per-request budget limits.
Non-deterministic failures – Same input, different outputs. Sometimes it works, sometimes it doesn’t. Hard to debug. Solution: extensive logging and trace analysis.
Silent failures – Tool call fails, agent doesn’t handle the error, just continues. User gets incorrect results with no indication that something went wrong. Solution: explicit error handling everywhere.
The common thread? These all happen because the agent is making decisions, and decisions can be wrong. With pure LLMs, you can validate outputs. With agents, you need to validate the entire decision-making process.
Memory: Short-term, Long-term, and Procedural
Memory turns out to be more nuanced than “remember the conversation.”
Figure 4: Agent Memory
Short-term memory (working memory) – Holds the current conversation and immediate context. This is what keeps the agent from asking “what’s your name?” after you just told it.
Long-term memory (episodic) – Stores information across sessions. “Last time we talked, you mentioned you preferred Python over JavaScript.” This is harder to implement but crucial for personalized experiences.
Procedural memory – Learned patterns and behaviors. “When the user asks about sales data, they usually want year-over-year comparisons, not raw numbers.” This often comes from fine-tuning or RLHF (Reinforcement Learning from Human Feedback).
Most systems implement short-term memory (conversation history) and skip the rest. That’s often fine. Long-term memory adds complexity quickly, especially around retrieval and relevance.
Tools: The Actual Interface to the World
Tool calling is how agents affect reality. The LLM generates structured output that your code executes:
# LLM generates this structured decision
{
    "tool": "send_email",
    "arguments": {
        "to": "team@company.com",
        "subject": "Q3 Results",
        "body": "Attached are the Q3 metrics we discussed."
    }
}

# Your code executes it
result = tools["send_email"](**arguments)

# Agent sees the result and decides what to do next
The critical part is validation. Before executing any tool call, check that the tool exists, the parameters are valid, and you have permission to run it. Tool hallucination is common and dangerous.
Also, most tools fail sometimes. APIs timeout, databases lock, network connections drop. Your agent needs explicit error handling for every tool call. Assume failure. Build retry logic. Log everything.
The Planning Problem
“Book a flight and add it to my calendar” sounds simple until you break it down:
Extract destination and date from natural language
Check if you have enough context (do you know which airport? which calendar?)
Search for flights
Evaluate options based on unstated preferences
Present choices without overwhelming the user
Wait for selection (this is a state transition)
Execute booking (this might fail)
Extract flight details from booking confirmation
Format for calendar API
Add to calendar (this might also fail)
Confirm success to user
That’s 11 steps with multiple decision points and error states. An LLM can’t do this. It can generate text that looks like it did this, but it can’t actually execute the steps and adapt based on outcomes.
Planning means breaking fuzzy goals into executable steps and handling the inevitable failures along the way.
When You Actually Need an Agent
Not every problem needs an agent. Most don’t. Here’s a rough guide:
Use an LLM directly when:
You just need text generation (summaries, rewrites, explanations)
The task is single-shot (one input, one output)
You don’t need current data or external actions
Latency matters (agents add overhead)
Use an agent when:
You need to call multiple APIs based on conditions
The task requires multi-step reasoning
You need error recovery and retry logic
Users expect the system to “figure it out” rather than follow explicit instructions
The deciding factor is usually decision-making under uncertainty. If you can write a script with if-statements that handles all cases, use the script. If you need the system to figure out what to do based on context, that’s when agents make sense.
Three Real Examples
Customer support bot – Most don’t need to be agents. They’re fine at looking up articles and answering questions. But if you want them to check order status, process refunds, and escalate to humans when needed? Now you need autonomy, tools, and decision-making.
Research assistant – A system that searches papers, extracts key findings, and generates summaries is perfect for agents. It needs to decide which papers are relevant, adapt search strategies based on initial results, and synthesize information from multiple sources.
Code reviewer – Analyzing pull requests, running tests, checking style guides, and posting comments. This needs tools (Git API, test runners), multi-step planning, and error handling. Classic agent territory.
Starting Simple
When you build your first agent, resist the temptation to add everything at once. Start with:
One tool (maybe a web search or database query)
Basic conversation memory (just track the last few messages)
Simple decision logic (if user asks about X, call tool Y)
Explicit error handling (what happens when the tool fails?)
Get that working reliably before adding planning, reflection, or multi-agent coordination. The complexity multiplies fast.
I learned this the hard way. Built a “research agent” with 12 tools, complex planning logic, and multi-step reasoning. Spent three weeks debugging edge cases. Rebuilt it with 3 tools and simple logic. Worked better and shipped in two days.
Production Realities
Running agents in production means dealing with issues you don’t face with static LLM calls:
Observability – You need to see what the agent is doing. Log every LLM call, every tool invocation, every decision point. When something breaks (and it will), you need to reconstruct exactly what happened.
Cost control – Set maximum token budgets per request. Cap the number of tool calls. Use caching aggressively for repeated operations. An agent can burn through thousands of tokens if it gets stuck in a loop.
Safety guardrails – Which tools can execute automatically vs requiring human approval? What actions are never allowed? How do you handle sensitive data in tool arguments?
Graceful degradation – When a tool fails, can the agent accomplish the goal another way? Or should it just tell the user it can’t help? Design for partial success, not just all-or-nothing.
These aren’t optional. They’re the difference between a demo and a production system.
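For cost control in particular, even a crude per-request budget guard beats discovering a runaway loop on the invoice. A sketch, with limits chosen arbitrarily:

# Crude per-request budget guard; the limits are arbitrary placeholders.
class BudgetExceeded(Exception):
    pass

class RequestBudget:
    def __init__(self, max_tokens=20_000, max_tool_calls=10):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.tokens_used = 0
        self.tool_calls = 0

    def charge_tokens(self, n):
        self.tokens_used += n
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget exhausted ({self.tokens_used})")

    def charge_tool_call(self):
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"too many tool calls ({self.tool_calls})")

# In the agent loop: charge_tokens() before each LLM call and
# charge_tool_call() before each tool, catching BudgetExceeded so the
# agent stops gracefully instead of spinning.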
The Mental Model Shift
The hardest part isn’t the code. It’s changing how you think about the system.
With LLMs, you’re optimizing prompts to get better completions. With agents, you’re designing decision spaces and constraining behavior. It’s closer to building an API than writing a prompt.
You stop asking “how do I get it to say this?” and start asking “what decisions does it need to make?” and “how do I prevent bad decisions?”
This shift took me longer than learning the technical pieces. I kept trying to solve agent problems with better prompts when I needed better architecture.
What I Wish I’d Known
Before building my first production agent, I wish someone had told me:
Logging is not optional. You will spend hours debugging. Good logs make the difference between “I have no idea what happened” and “oh, it’s calling the wrong tool on step 7.”
Start with deterministic baselines. Before building the agent, write a script that solves the problem with if-statements. This gives you something to compare against and helps you understand the decision logic.
Most complexity is not AI complexity. It’s error handling, state management, API retries, and data validation. The LLM is often the simplest part.
Users don’t care about your architecture. They care whether it works. A simple agent that reliably solves their problem beats a sophisticated agent that’s impressive but breaks often.
Building Your First Agent
If you’re ready to try this, here’s what I’d recommend:
Start with a weather agent. It’s simple enough to finish but complex enough to teach you the core concepts:
Tools:
get_weather(location) – fetches current weather
geocode(city_name) – converts city names to coordinates
Decision logic:
Does user query include a location?
If yes, call get_weather directly
If no, ask for location or use default
Handle API failures gracefully
Memory:
Remember the user’s last location
Don’t ask again if they query weather multiple times
Build this and you’ll hit most of the core challenges: tool calling, error handling, state management, and decision logic. It’s a good litmus test for whether you understand the fundamentals.
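Here’s roughly what that looks like wired together. Both tools are stubs and the query parsing is deliberately naive; the point is the shape of the loop, not the details:

# Minimal weather agent: two stub tools, last-location memory,
# explicit failure handling. All names here are illustrative stubs.
def geocode(city_name):
    # Stand-in for a real geocoding API call.
    known = {"mumbai": (19.08, 72.88), "delhi": (28.61, 77.21)}
    return known.get(city_name.lower())

def get_weather(coords):
    # Stand-in for a real weather API call; in real life this can raise.
    return {"temp_c": 31, "condition": "humid"}

memory = {"last_city": None}

def handle(query: str) -> str:
    # Decision logic: does the query name a city we can geocode?
    words = [w.strip("?.,").lower() for w in query.split()]
    city = next((w for w in words if geocode(w)), None) or memory["last_city"]

    if city is None:
        return "Which city do you want the weather for?"

    memory["last_city"] = city            # remember for follow-up questions
    coords = geocode(city)
    try:
        report = get_weather(coords)
    except Exception:
        return f"Sorry, the weather service for {city} isn't responding."
    return f"{city.title()}: {report['temp_c']} C, {report['condition']}"

print(handle("What's the weather in Mumbai?"))
print(handle("And tomorrow?"))   # falls back to the remembered location

Swap the stubs for real API calls and the naive keyword match for an LLM call, and the structure stays the same.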
Where This Goes Next
Once you have a basic agent working, the natural progression is:
Better planning algorithms (ReAct, Tree of Thoughts, etc.)
More sophisticated memory (vector databases, episodic storage)
Multi-agent coordination (specialized agents working together)
Evaluation frameworks (how do you know if it’s working?)
Production infrastructure (monitoring, cost controls, safety)
But all of that builds on the core loop: observe, plan, act, observe. Master that first. Everything else is elaboration.
The Real Difference
The shift from LLMs to agents isn’t about better models or fancier prompts. It’s about giving language models the ability to do things.
Text generation is powerful. But generation plus action? That’s when things get genuinely useful. When your system can not just tell you the answer but actually execute the steps to get there.
That’s the promise of agents. And also why they’re harder to build than you expect.
Have you built any AI agents? What surprised you most about the difference from working with LLMs directly? Let me know what patterns you’ve discovered.
Code and Resources
If you want to dive deeper, I’ve put together a complete codebase with working examples of everything discussed here:
The repository includes baseline chatbots, tool-calling agents, weather agents, and all the production patterns we covered. Start with module-01 for the fundamentals.
Further Reading
ReAct: Synergizing Reasoning and Acting (Yao et al., 2023) – The foundation paper for modern agent architectures. Shows how interleaving reasoning and acting improves agent performance.
Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023) – Explores how agents can learn from mistakes through self-reflection.
Toolformer (Schick et al., 2023) – Deep dive into how models learn to use tools effectively.
About This Series
This post is part of Building Real-World Agentic AI Systems with LangGraph, a comprehensive guide to production-ready AI agents. The series covers:
I spent the last few days trying to understand how Google’s text watermarking works, and honestly, most explanations I found were either too technical or too vague. So I built a visual explainer to make sense of it for myself—and hopefully for you too.
We’re generating billions of words with AI every day. ChatGPT writes essays, Gemini drafts emails, Claude helps with code. The question everyone’s asking is simple: how do we tell which text comes from an AI and which comes from a person?
You can’t just look at writing quality anymore. AI-generated text sounds natural, flows well, makes sense. Sometimes it’s better than what humans write. So we need something invisible, something embedded in the text itself that only computers can detect.
That’s what SynthID does.
3. Starting With How Language Models Think
Before we get to watermarking, you need to understand how these models actually generate text. They don’t just pick the “best” word for each position. They work with probabilities.
Think about this sentence: “My favorite tropical fruits are mango and ___”
What comes next? Probably “bananas” or “papaya” or “pineapple,” right? The model assigns each possible word a probability score. Bananas might get 85%, papaya gets 10%, pineapple gets 3%, and something completely random like “airplanes” gets 0.001%.
Then it picks from these options, usually choosing high-probability words but occasionally throwing in something less likely to keep things interesting. This randomness is why you get different responses when you ask the same question twice.
Here’s the key insight that makes watermarking possible: when multiple words have similar probabilities, the model has flexibility in which one to choose. And that’s where Google hides the watermark.
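In code, that final "pick a word" step is just a weighted random draw. Using the made-up probabilities from the fruit example:

import random

# Made-up next-word probabilities for:
# "My favorite tropical fruits are mango and ___"
next_word_probs = {
    "bananas": 0.85,
    "papaya": 0.10,
    "pineapple": 0.03,
    "airplanes": 0.00001,
}

words = list(next_word_probs)
weights = list(next_word_probs.values())

# Sample a few completions -- mostly "bananas", occasionally something else.
print([random.choices(words, weights=weights, k=1)[0] for _ in range(5)])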
4. The Secret Ingredient: A Cryptographic Key
Google generates a secret key—basically a very long random number that only they know. This key determines everything about how the watermark gets embedded.
Think of it like a recipe. The key tells the system exactly which words to favor slightly and which ones to avoid. Without this key, you can’t create the watermark pattern, and you definitely can’t detect it.
This is important for security. If anyone could detect watermarks without the key, they could also forge them or remove them easily. The cryptographic approach makes both much harder.
5. Green Lists and Red Lists
Using the secret key, SynthID splits the entire vocabulary into two groups for each position in the text. Some words go on the “green list” and get a slight boost. Others go on the “red list” and get slightly suppressed.
Let’s say you’re writing about weather. For a particular spot in a sentence, the word “perfect” might be on the green list while “ideal” is on the red list. Both words mean roughly the same thing and both sound natural. But SynthID will nudge the model toward “perfect” just a tiny bit.
How tiny? We’re talking about 2-3% probability adjustments. If “perfect” and “ideal” both had 30% probability, SynthID might bump “perfect” up to 32% and drop “ideal” to 28%. Small enough that it doesn’t change how the text reads, but consistent enough to create a pattern.
And here’s the clever part: these lists change based on the words that came before. The same word might be green in one context and red in another. The pattern looks completely random unless you have the secret key.
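Here’s a stripped-down sketch of the idea. This is not Google’s actual algorithm (the real system uses a more elaborate tournament sampling scheme), just the green-list intuition: a keyed hash of the preceding context splits the candidates, and green words get a small boost before sampling:

import hashlib
import random

SECRET_KEY = b"only-the-provider-knows-this"   # stand-in for the real key

def is_green(key: bytes, context: str, word: str) -> bool:
    # Keyed hash of (context, word); the low bit decides green vs red.
    digest = hashlib.sha256(key + context.encode() + word.encode()).digest()
    return digest[0] % 2 == 0

def watermark_sample(probs: dict, context: str, bias: float = 0.1) -> str:
    # Nudge green-list words up by a small factor, renormalize, then sample.
    adjusted = {
        w: p * (1 + bias) if is_green(SECRET_KEY, context, w) else p
        for w, p in probs.items()
    }
    total = sum(adjusted.values())
    words = list(adjusted)
    weights = [adjusted[w] / total for w in words]
    return random.choices(words, weights=weights, k=1)[0]

print(watermark_sample({"perfect": 0.30, "ideal": 0.30, "nice": 0.40},
                       context="The weather today is"))

Because the hash depends on the context, the same word flips between green and red from one position to the next, which is why the pattern looks random without the key.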
6. Building the Statistical Pattern
As the model generates more and more text, it keeps favoring green list words. Not always—that would be obvious—but more often than chance would predict.
If you’re flipping a coin, you expect roughly 50% heads and 50% tails. With SynthID, you might see 65% green words and 35% red words. That 15-point skew above chance is your watermark.
But you need enough text for this pattern to become statistically significant. Google found that 200 words is about the minimum. With shorter text, there isn’t enough data to separate the watermark signal from random noise.
Think of it like this: if you flip a coin three times and get three heads, that’s not surprising. But if you flip it 200 times and get 130 heads, something’s definitely up with that coin.
7. Detection: Finding the Fingerprint
When you want to check if text is watermarked, you need access to Google’s secret key. Then you reconstruct what the green and red lists would have been for that text and count how many green words actually appear.
If the percentage is significantly above 50%, you’ve found a watermark. The more words you analyze, the more confident you can be. Google’s system outputs a score that tells you how likely it is that the text came from their watermarked model.
This is why watermarking isn’t perfect for short text. A tweet or a caption doesn’t have enough words to build up a clear pattern. You might see 60% green words just by chance. But a full essay? A 65% green-word rate across 500 words is virtually impossible to get by chance.
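Detection in this simplified picture is just recounting. Rebuild the green/red split with the key, count how many green words actually appear, and ask how far above 50% that is. Continuing the toy scheme from the sketch above:

import hashlib
import math

SECRET_KEY = b"only-the-provider-knows-this"   # must match the generation-time key

def is_green(key: bytes, context: str, word: str) -> bool:
    digest = hashlib.sha256(key + context.encode() + word.encode()).digest()
    return digest[0] % 2 == 0

def detection_score(text: str) -> float:
    # Count green words, using the preceding words as each position's context.
    words = text.split()
    greens = sum(
        is_green(SECRET_KEY, " ".join(words[:i]), w) for i, w in enumerate(words)
    )
    n = len(words)
    # z-score against the 50/50 null hypothesis. Scores around 2 or higher
    # suggest a watermark, and they only become meaningful with a few
    # hundred words -- exactly the length limitation described above.
    return (greens - 0.5 * n) / math.sqrt(0.25 * n)

print(detection_score("some sample text to score " * 40))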
8. Why Humans Can’t See It
The adjustments are so small that they don’t change which words the model would naturally choose. Both “perfect” and “ideal” sound fine in most contexts. Both “delicious” and “tasty” work for describing food. The model is just picking between equally good options.
To a human reader, watermarked and unwatermarked text are indistinguishable. Google tested this with 20 million actual Gemini responses. They let users rate responses with thumbs up or thumbs down. Users showed absolutely no preference between watermarked and regular text.
The quality is identical. The style is identical. The meaning is identical. The only difference is a statistical bias that emerges when you analyze hundreds of words with the secret key.
9. What Actually Works and What Doesn’t
Google’s been pretty honest about SynthID’s limitations, which I appreciate.
It works great for:
Long-form creative writing
Essays and articles
Stories and scripts
Open-ended generation where many word choices are possible
It struggles with:
Factual questions with one right answer (What’s the capital of France? It’s Paris—no flexibility there)
Short text under 200 words
Code generation (syntax is too rigid)
Text that gets heavily edited or translated
The watermark can survive light editing. If you change a few words here and there, the overall pattern still holds up. But if you rewrite everything or run it through Google Translate, the pattern breaks down.
And here’s the uncomfortable truth: determined attackers can remove the watermark. Researchers showed you can do it for about $50 worth of API calls. You query the watermarked model thousands of times, figure out the pattern statistically, and then use that knowledge to either remove watermarks or forge them.
10. The Bigger Context
SynthID isn’t just a technical demo. It’s the first large-scale deployment of text watermarking that actually works in production. Millions of people use Gemini every day, and most of that text is now watermarked. They just don’t know it.
Google open-sourced the code in October 2024, which was a smart move. It lets researchers study the approach, find weaknesses, and build better systems. It also gives other companies a working example if they want to implement something similar.
The EU AI Act is starting to require “machine-readable markings” for AI content. Other jurisdictions are considering similar rules. SynthID gives everyone something concrete to point to when discussing what’s actually possible with current technology.
11. My Takeaway After Building This
The more I learned about watermarking, the more I realized it’s not the complete solution everyone wants it to be. It’s more like one tool in a toolkit.
You can’t watermark everything. You can’t make it unremovable. You can’t prove something wasn’t AI-generated just because you don’t find a watermark. And it only works if major AI providers actually implement it, which many won’t.
But for what it does—allowing companies to verify that text came from their models when it matters—it works remarkably well. The fact that it adds almost no overhead and doesn’t affect quality is genuinely impressive engineering.
What struck me most is the elegance of the approach. Using the natural randomness in language model generation to hide a detectable pattern is clever. It doesn’t require changing the model architecture or training process. It just tweaks the final step where words get selected.
12. If You Want to Try It Yourself
Google released the SynthID code on GitHub. If you’re comfortable with Python and have access to a language model, you can experiment with it. The repository includes examples using Gemma and GPT-2.
Fair warning: it’s not plug-and-play. You need to understand how to modify model output distributions, and you need a way to run the model locally or through an API that gives you token-level access. But it’s all there if you want to dig into the details.
The Nature paper is also worth reading if you want the full technical treatment. They go into the mathematical foundations, describe the tournament sampling approach, and share detailed performance metrics across different scenarios.
13. Where This Goes Next
Watermarking is just getting started. Google proved it can work at scale, but there’s still a lot to figure out.
Researchers are working on making watermarks more robust against attacks. They’re exploring ways to watermark shorter text. They’re trying to handle code and factual content better. They’re designing systems that work across multiple languages and survive translation.
There’s also the question of whether we need universal standards. Right now, each company could implement their own watermarking scheme with their own secret keys. That fragments the ecosystem and makes detection harder. But getting competitors to coordinate on technical standards is always tricky.
And of course, there’s the bigger question of whether watermarking is even the right approach for AI governance. It helps with attribution and accountability, but it doesn’t prevent misuse. It doesn’t stop bad actors from using unwatermarked models. It doesn’t solve the fundamental problem of AI-generated misinformation.
Those are harder problems that probably need policy solutions alongside technical ones.
14. Final Thoughts
I worked on this visual explainer because I wanted to understand how SynthID actually works, beyond the marketing language and vague descriptions. Building it forced me to understand every detail: you can’t visualize something you don’t really get.
What I came away with is respect for how well-engineered the system is, combined with realism about its limitations. It’s impressive technical work that solves a real problem within specific constraints. It’s also not magic and won’t fix everything people hope it will.
If you’re interested in AI safety, content authenticity, or just how these systems work under the hood, it’s worth understanding. Not because watermarking is the answer, but because it shows what’s actually possible with current approaches and where the hard limits are.
And sometimes those limits tell you more than the capabilities do.
When Meta released LLaMA as “open source” in February 2023, the AI community celebrated. Finally, the democratization of AI we’d been promised. No more gatekeeping by OpenAI and Google. Anyone could now build, modify, and deploy state-of-the-art language models.
Except that’s not what happened. A year and a half later, the concentration of AI power hasn’t decreased—it’s just shifted. The models are “open,” but the ability to actually use them remains locked behind the same economic barriers that closed models had. We traded one form of gatekeeping for another, more insidious one.
The Promise vs. The Reality
The open source AI narrative goes something like this: releasing model weights levels the playing field. Small startups can compete with tech giants. Researchers in developing countries can access cutting-edge technology. Independent developers can build without permission. Power gets distributed.
But look at who’s actually deploying these “open” models at scale. It’s the same handful of well-funded companies and research institutions that dominated before. The illusion of access masks the reality of a new kind of concentration—one that’s harder to see and therefore harder to challenge.
The Compute Barrier
Running Base Models
LLaMA-2 70B requires approximately 140GB of VRAM just to load into memory. A single NVIDIA A100 GPU (80GB) costs around $10,000 and you need at least two for inference. That’s $20,000 in hardware before you serve a single request.
Most developers can’t afford this. So they turn to cloud providers. AWS charges roughly $4-5 per hour per A100, so an instance with 8x A100 GPUs runs $30-40 per hour. Running 24/7 costs $25,000 or more per month. For a single model. Before any users.
Compare this to GPT-4’s API: $0.03 per 1,000 tokens. You can build an application serving thousands of users for hundreds of dollars. The “closed” model is more economically accessible than the “open” one for anyone without serious capital.
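The back-of-the-envelope arithmetic, using the figures above plus an assumed traffic level (both the instance rate and the request volume are assumptions, not quotes from any provider):

# Self-hosting an open 70B model vs. calling a hosted API -- rough arithmetic.
hours_per_month = 24 * 30
instance_rate = 35.0                 # assumed $/hour for an 8x A100 instance
self_host_monthly = instance_rate * hours_per_month        # ~$25,200, before any traffic

requests_per_month = 100_000         # assumed traffic for a small app
tokens_per_request = 1_000
api_price_per_1k = 0.03              # the GPT-4 rate cited above
api_monthly = requests_per_month * tokens_per_request / 1_000 * api_price_per_1k   # ~$3,000

print(self_host_monthly, api_monthly)

The self-hosted cost is fixed whether you serve one user or a million; the API cost scales with usage, which is exactly what makes it accessible to anyone without capital.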
The Quantization Trap
“Just quantize it,” they say. Run it on consumer hardware. And yes, you can compress LLaMA-2 70B down to 4-bit precision and squeeze it onto a high-end gaming PC with 48GB of RAM. But now your inference speed is 2-3 tokens per second. GPT-4 through the API serves 40-60 tokens per second.
You’ve traded capability for access. The model runs, but it’s unusable for real applications. Your users won’t wait 30 seconds for a response. So you either scale up to expensive infrastructure or accept that your “open source” model is a toy.
The Fine-Tuning Fortress
Training Costs
Base models are rarely production-ready. They need fine-tuning for specific tasks. Full fine-tuning of LLaMA-2 70B for a specialized domain costs $50,000-$100,000 in compute. That’s training for maybe a week on 32-64 GPUs.
LoRA and other parameter-efficient methods reduce this, but you still need $5,000-$10,000 for serious fine-tuning. OpenAI’s fine-tuning API? $8 per million tokens for training, then standard inference pricing. For most use cases, it’s an order of magnitude cheaper than self-hosting an open model.
Data Moats
But money is only part of the barrier. Fine-tuning requires high-quality training data. Thousands of examples, carefully curated, often hand-labeled. Building this dataset costs more than the compute—you need domain experts, data labelers, quality control infrastructure.
Large companies already have this data from their existing products. Startups don’t. The open weights are theoretically available to everyone, but the data needed to make them useful is concentrated in the same hands that controlled closed models.
Who Actually Benefits
Cloud Providers
Amazon, Microsoft, and Google are the real winners of open source AI. Every developer who can’t afford hardware becomes a cloud customer. AWS now offers “SageMaker JumpStart” with pre-configured LLaMA deployments. Microsoft has “Azure ML” with one-click open model hosting. They’ve turned the open source movement into a customer acquisition funnel.
The more compute-intensive open models become, the more revenue flows to cloud providers. They don’t need to own the models—they own the infrastructure required to run them. It’s a better business model than building proprietary AI because they capture value from everyone’s models.
Well-Funded Startups
Companies that raised $10M+ can afford to fine-tune and deploy open models. They get the benefits of customization without the transparency costs of closed APIs. Your fine-tuned LLaMA doesn’t send data to OpenAI for training. This is valuable.
But this creates a new divide. Funded startups can compete using open models. Bootstrapped founders can’t. The barrier isn’t access to weights anymore—it’s access to capital. We’ve replaced technical gatekeeping with economic gatekeeping.
Research Institutions
Universities with GPU clusters benefit enormously. They can experiment, publish papers, train students. This is genuinely valuable for advancing the field. But it doesn’t democratize AI deployment—it democratizes AI research. Those are different things.
A researcher at Stanford can fine-tune LLaMA and publish results. A developer in Lagos trying to build a business cannot. The knowledge diffuses, but the economic power doesn’t.
The Developer Experience Gap
OpenAI’s API takes 10 minutes to integrate. Three lines of code and you’re generating text. LLaMA requires setting up infrastructure, managing deployments, monitoring GPU utilization, handling model updates, implementing rate limiting, building evaluation pipelines. It’s weeks of engineering work before you write your first application line.
Yes, there are platforms like Hugging Face Inference Endpoints and Replicate that simplify this. But now you’re paying them instead of OpenAI, often at comparable prices. The “open” model stopped being open the moment you need it to actually work.
The Regulatory Capture
Here’s where it gets really interesting. As governments start regulating AI, compute requirements become a regulatory moat. The EU AI Act, for instance, has different tiers based on model capabilities and risk. High-risk models face stringent requirements.
Who can afford compliance infrastructure? Companies with capital. Who benefits from regulations that require extensive testing, monitoring, and safety measures? Companies that can amortize these costs across large user bases. Open source was supposed to prevent regulatory capture, but compute requirements ensure it anyway.
We might end up with a future where model weights are technically open, but only licensed entities can legally deploy them at scale. Same outcome as closed models, just with extra steps.
The Geographic Divide
NVIDIA GPUs are concentrated in North America, Europe, and parts of Asia. A developer in San Francisco can buy or rent A100s easily. A developer in Nairobi faces import restrictions, limited cloud availability, and 3-5x markup on hardware.
Open source was supposed to help developers in emerging markets. Instead, it created a new form of digital colonialism: we’ll give you the recipe, but the kitchen costs $100,000. The weights are free, but the compute isn’t. Same power concentration, new mechanism.
The Environmental Cost
Every startup running its own LLaMA instance is replicating infrastructure that could be shared. If a thousand companies each deploy their own 70B model, that’s thousands of GPUs running 24/7 instead of one shared cluster serving everyone through an API.
Ironically, centralized APIs are more energy-efficient. OpenAI’s shared infrastructure has better utilization than thousands of individually deployed models. We’re burning extra carbon for the ideology of openness without achieving actual decentralization.
What Real Democratization Would Look Like
If we’re serious about democratizing AI, we need to address the compute bottleneck directly.
Public compute infrastructure. Government-funded GPU clusters accessible to researchers and small businesses. Like public libraries for AI. The EU could build this for a fraction of what they’re spending on AI regulation.
Efficient model architectures. Research into models that actually run on consumer hardware without quality degradation. We’ve been scaling up compute instead of optimizing efficiency. The incentives are wrong—bigger models generate more cloud revenue.
Federated fine-tuning. Techniques that let multiple parties contribute to fine-tuning without centralizing compute or data. This is technically possible but underdeveloped because it doesn’t serve cloud providers’ interests.
Compute co-ops. Developer collectives that pool resources to share inference clusters. Like how small farmers form cooperatives to share expensive equipment. This exists in limited forms but needs better tooling and organization.
Transparent pricing. If you’re charging for “open source” model hosting, you’re not democratizing—you’re arbitraging. True democratization means commodity pricing on inference, not vendor lock-in disguised as open source.
The Uncomfortable Truth
Open source AI benefits the same people that closed AI benefits, just through different mechanisms. It’s better for researchers and well-funded companies. It’s not better for individual developers, small businesses in emerging markets, or people without access to capital.
We convinced ourselves that releasing weights was democratization. It’s not. It’s shifting the bottleneck from model access to compute access. For most developers, that’s a distinction without a difference.
The original sin isn’t releasing open models—that’s genuinely valuable. The sin is calling it democratization while ignoring the economic barriers that matter more than technical ones. We’re building cathedrals and wondering why only the wealthy enter, forgetting that doors without ramps aren’t really open.
Real democratization would mean a developer in any country can fine-tune and deploy a state-of-the-art model for $100 and an afternoon of work. We’re nowhere close. Until we address that, open source AI remains an aspiration, not a reality.
Modern healthcare and artificial intelligence face a common challenge in how they handle individual variation. Both systems rely on population-level statistics to guide optimization, which can inadvertently push individuals toward averages that may not serve them well. More interesting still, both fields are independently discovering similar solutions—a shift from standardized targets to personalized approaches that preserve beneficial diversity.
Population Averages as Universal Targets
Healthcare’s Reference Ranges
Traditional medical practice establishes “normal” ranges by measuring population distributions. Blood pressure guidelines from the American Heart Association define 120/80 mmHg as optimal. The World Health Organization sets body mass index between 18.5 and 24.9 as the normal range. The American Diabetes Association considers fasting glucose optimal when it falls between 70 and 100 mg/dL. These ranges serve an essential function in identifying pathology, but their origin as population statistics rather than individual optima creates tension in clinical practice.
Elite endurance athletes routinely maintain resting heart rates between 40 and 50 beats per minute, well below the standard range of 60 to 100 bpm. This bradycardia reflects cardiac adaptation rather than dysfunction: their hearts pump more efficiently per beat, requiring fewer beats to maintain circulation. Intervening to "normalize" these athletes' heart rates would be counterproductive, yet the scenario illustrates how population-derived ranges can mislead when applied universally.
The feedback mechanism compounds over time. When clinicians routinely intervene to move patients toward reference ranges, the population distribution narrows. Subsequent range calculations derive from this more homogeneous population, potentially tightening targets further. Natural variation that was once common becomes statistically rare, then clinically suspicious.
Language Models and Statistical Patterns
Large language models demonstrate a parallel phenomenon in their optimization behavior. These systems learn probability distributions over sequences of text, effectively encoding which expressions are most common for conveying particular meanings. When users request improvements to their writing, the model suggests revisions that shift the text toward higher-probability regions of this learned distribution—toward the statistical mode of how millions of other people have expressed similar ideas.
This process systematically replaces less common stylistic choices with more typical alternatives. Unusual metaphors get smoothed into familiar comparisons. Regional variations in vocabulary and grammar get normalized to a global standard. Deliberate syntactic choices that create specific rhetorical effects get “corrected” to conventional structures. The model isn’t making errors in this behavior—it’s doing exactly what training optimizes it to do: maximize the probability of generating text that resembles its training distribution.
Similar feedback dynamics appear here. Models train on diverse human writing and learn statistical patterns. People use these models to refine their prose, shifting it toward common patterns. That AI-influenced writing becomes training data for subsequent models. With each iteration, the style space that models learn contracts around increasingly dominant modes.
Precision Medicine’s Response
The healthcare industry has recognized that population averages make poor universal targets and developed precision medicine as an alternative framework. Rather than asking whether a patient’s metrics match population norms, precision medicine asks whether those metrics are optimal given that individual’s genetic makeup, microbiome composition, environmental context, and lifestyle factors.
Commercial genetic testing services like 23andMe and AncestryDNA have made personal genomic data accessible to millions of people. This genetic information reveals how individuals metabolize medications differently, process nutrients through distinct biochemical pathways, and carry polymorphisms that alter their baseline risk profiles. A cholesterol level that predicts cardiovascular risk in one genetic context may carry different implications in another.
Microbiome analysis adds another layer of personalization. Research published by Zeevi et al. in Cell (2015) demonstrated that individuals show dramatically different glycemic responses to identical foods based on their gut bacterial composition. Companies like Viome and DayTwo now offer commercial services that analyze personal microbiomes to generate nutrition recommendations tailored to individual metabolic responses rather than population averages.
Continuous monitoring technologies shift the focus from population comparison to individual trend analysis. Continuous glucose monitors from Dexcom and Abbott’s FreeStyle Libre track glucose dynamics throughout the day. Smartwatches monitor heart rate variability as an indicator of autonomic nervous system function. These devices establish personal baselines and detect deviations from an individual’s normal patterns rather than measuring deviation from population norms.
Applying Precision Concepts to Language Models
The techniques that enable precision medicine suggest analogous approaches for language models. Current systems could be modified to learn and preserve individual stylistic signatures while still improving clarity and correctness. The technical foundations already exist in various forms across the machine learning literature.
Fine-tuning methodology, now standard for adapting models to specific domains, could be applied at the individual level. A model fine-tuned on a person’s past writing would learn their characteristic sentence rhythms, vocabulary preferences, and stylistic patterns. Rather than suggesting edits that move text toward a global statistical mode, such a model would optimize toward patterns characteristic of that individual writer.
Research on style transfer, including work by Lample et al. (2019) on multiple-attribute text rewriting, has shown that writing style can be represented as vectors in latent space. Conditioning text generation on these style vectors enables controlled variation in output characteristics. A system that extracted style embeddings from an author’s corpus could use those embeddings to preserve stylistic consistency while making other improvements.
Constrained generation techniques allow models to optimize for multiple objectives simultaneously. Constraints could maintain statistical properties of an individual’s writing—their typical vocabulary distribution, sentence length patterns, or syntactic structures—while still optimizing for clarity within those boundaries. This approach parallels precision medicine’s goal of optimizing health outcomes within the constraints of an individual’s genetic and metabolic context.
Reinforcement learning from human feedback, as described by Ouyang et al. (2022), currently aggregates preferences across users to train generally applicable models. Implementing RLHF at the individual level would allow models to learn person-specific preferences about which edits preserve voice and which introduce unwanted homogenization. The system would learn not just what makes text “better” in general, but what makes this particular person’s writing more effective without losing its distinctive character.
Training objectives could explicitly reward stylistic diversity rather than purely minimizing loss against a training distribution. Instead of convergence toward a single mode, such objectives would encourage models to maintain facility with a broad range of stylistic choices. This mirrors precision medicine’s recognition that healthy human variation spans a range rather than clustering around a single optimum.
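As a toy illustration of what constraining generation toward an individual's patterns could look like mechanically (this is not how any production system works, only the shape of the idea): estimate the author's own word frequencies from their past writing and blend them into the model's candidate probabilities during decoding.

from collections import Counter

def author_profile(corpus: list[str]) -> Counter:
    # Unigram frequencies from the author's past writing
    # (a toy stand-in for a learned style representation).
    counts = Counter()
    for doc in corpus:
        counts.update(doc.lower().split())
    return counts

def personalize(candidates: dict, profile: Counter, strength: float = 0.5) -> dict:
    # Blend the model's probabilities with the author's own usage frequencies.
    total = sum(profile.values()) or 1
    blended = {
        w: (1 - strength) * p + strength * (profile[w] / total)
        for w, p in candidates.items()
    }
    norm = sum(blended.values())
    return {w: v / norm for w, v in blended.items()}

profile = author_profile(["I reckon the results are grand",
                          "the plan is grand I reckon"])
candidates = {"great": 0.5, "grand": 0.2, "excellent": 0.3}   # hypothetical model proposals
print(personalize(candidates, profile))   # "grand" gains weight: this author favors it

A real system would work with richer signals than unigram counts, but the principle is the same: the optimization target becomes "text like this writer's" rather than "text like everyone's."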
Implementation Challenges
Precision medicine didn’t emerge from purely technical innovation. It developed through sustained institutional commitment, including recognition that population-based approaches were failing certain patients, substantial investment in genomic infrastructure and data systems, regulatory frameworks for handling personal genetic data, and cultural shifts in how clinicians think about treatment targets. Building precision language systems faces analogous challenges beyond the purely technical.
Data requirements differ significantly from current practice. Personalized models need sufficient examples of an individual’s writing to learn their patterns, raising questions about privacy and data ownership. Training infrastructure would need to support many distinct model variants rather than a single universal system. Evaluation metrics would need to measure style preservation alongside traditional measures of fluency and correctness.
More fundamentally, building such systems demands a shift from treating diversity as noise to be averaged away toward treating it as signal to be preserved. This parallels the conceptual shift in medicine from viewing outliers as problems requiring correction toward understanding them as potentially healthy variations. The technical capabilities exist, but deploying them intentionally requires first recognizing that convergence toward statistical modes, while appearing optimal locally, may be problematic globally.
Both healthcare and AI have built optimization systems that push toward population averages. Healthcare recognized the limitations of this approach and developed precision medicine as an alternative. AI can learn from that trajectory, building systems that help individuals optimize for their own patterns rather than converging everyone toward a statistical mean.
References
American Heart Association. Blood pressure guidelines. https://www.heart.org
World Health Organization. BMI Classification. https://www.who.int
American Diabetes Association. Standards of Medical Care in Diabetes.
Zeevi, D., Korem, T., Zmora, N., et al. (2015). Personalized Nutrition by Prediction of Glycemic Responses. Cell, 163(5), 1079-1094. DOI: 10.1016/j.cell.2015.11.001
Think about how you use the internet today. You Google something, or you ask ChatGPT. You scroll through Twitter or Instagram. You read the news on your phone. Simple, right?
But something big is changing. The internet is splitting into three different worlds. They’ll all exist on your phone, but they’ll be completely different experiences. And most people won’t even know which one they’re using.
Layer 1: The Premium Internet (Only for Those Who Can Pay)
Imagine The Hindu or Indian Express, but they charge you ₹5,000 per month. Why so much? Because they promise that no AI has touched their content. Every article is written by real journalists, edited by real editors, and meant to be read completely—not just summarized by ChatGPT.
This isn’t just about paywalls. It’s about the full experience. Like reading a well-written book versus reading chapter summaries on Wikipedia. You pay for the writing style, the depth, and the way the story is told.
Think about this: A Bloomberg Terminal costs lakhs per year. Why? Because traders need real, unfiltered information. Now imagine that becoming normal for all good content.
Here’s the problem for India: If quality information becomes expensive, only the rich get the full story. Everyone else gets summaries, shortcuts, and AI-filtered versions. This isn’t just unfair—it’s dangerous for democracy.
Layer 2: The AI Internet (Where Bots Read for You)
This is where most Indians will spend their time. It’s free, but there’s a catch.
You don’t read articles anymore—your AI does. You ask ChatGPT or Google’s Bard: “What happened in the Parliament session today?” The AI reads 50 news articles and gives you a 3-paragraph summary.
Sounds convenient, right? But think about what you’re missing:
The reporter’s perspective and context
The details that didn’t fit the summary
The minority opinions that the AI filtered out
The emotional weight of the story
Now add another problem: Most content will be written by AI, too. AI writing for AI readers. News websites will generate hundreds of articles daily because that’s what gets picked up by ChatGPT and Google.
Think about how WhatsApp forwards spread misinformation in India. Now imagine that happening at an internet scale, with AI systems copying from each other. One wrong fact gets repeated by 10 AI systems, and suddenly it becomes “truth” because everyone agrees.
Layer 3: The Dark Forest (Where Real People Hide)
This is the most interesting part. When the internet becomes full of AI-generated content and surveillance, real human conversation goes underground.
This is like how crypto communities use private Discord servers. Or how some journalists now share real stories only in closed WhatsApp groups.
These spaces are:
Invite-only (you need to know someone to get in)
Hard to find (no Google search will show them)
High-trust (everyone vouches for everyone else)
Small and slow (quality over quantity)
Here’s what happens in these hidden spaces: Real discussions. People actually listening to each other. Long conversations over days and weeks. Experts sharing knowledge freely. Communities solving problems together.
But there’s a big problem: to stay hidden from AI and algorithms, you have to stay hidden from everyone. Great ideas get trapped in small circles. The smartest people become the hardest to find.
Why This Matters for India
India has 750 million internet users. Most are on free platforms—YouTube, Instagram, WhatsApp. Very few pay for premium content.
So what happens when the internet splits?
Rich Indians will pay for premium content. They’ll read full articles, get complete context, and make informed decisions.
Middle-class and poor Indians will use AI summaries. They’ll get the quick version, filtered by algorithms, missing important details.
Tech-savvy Indians will find the dark forest communities. But most people won’t even know these exist.
This creates a new kind of digital divide. Not about who has internet access, but about who has access to real information.
The Election Problem
Imagine the 2029 elections. Different people are getting their news from different layers:
Premium readers get in-depth analysis
AI layer users get simplified summaries (maybe biased, maybe incomplete)
Dark forest people get unfiltered discussions, but only within their small groups
How do we have a fair election when people aren’t even seeing the same information? And how does fact-checking work when each layer shows a different version of events?
The Education Problem
Students from rich families will pay for premium learning resources. Clear explanations, quality content, and verified information.
Students from middle-class families will use free AI tools. They’ll get answers, but not always the full understanding. Copy-paste education.
The gap between haves and have-nots becomes a gap between those who understand deeply and those who only know summaries.
Can We Stop This?
Maybe, if we act now. Here’s what could help:
Government-funded quality content: Like Doordarshan once provided free TV, we need free, high-quality internet content. Not AI-generated. Real journalism, real education, accessible to everyone.
AI transparency rules: AI should show its sources. When ChatGPT gives you a summary, you should see which articles it read and what it left out.
Digital literacy programs: People need to understand which layer they’re using and what its limits are. Like how we teach people to spot fake news on WhatsApp, we need to teach them about AI filtering.
Public internet infrastructure: Community spaces that aren’t controlled by big tech. Like public libraries, but for the internet age.
But honestly? The market doesn’t want this. Premium content companies want to charge more. AI companies want to collect more data. Tech platforms want to keep people in their ecosystem.
What You Can Do Right Now
While we can’t stop the internet from splitting, we can be smart about it:
Read actual articles sometimes, not just summaries. Your brain works differently when you read the full story.
Pay for at least one good news source if you can afford it. Support real journalism.
When using AI, ask for sources. Don’t just trust the summary.
Join or create small, trusted communities. WhatsApp groups with real discussions, not just forwards.
Teach your kids to think critically. To question summaries. To seek original sources.
The Bottom Line
The internet is changing fast. In a few years, we’ll have three different internets:
The expensive one with real content
The free one where AI does your reading
The hidden one where real people connect
Most Indians will end up in the middle layer—the AI layer. Getting quick answers, but missing the full picture. This isn’t just about technology. It’s about who gets to know the truth. Who gets to make informed decisions? Who gets to participate fully in democracy?
We need to talk about this now, while we still have a common internet to have this conversation on.
The question is not whether the internet will split. It’s already happening. The question is: Will we let it create a new class divide in India, or will we fight to keep quality information accessible to everyone?
Which internet are you using right now? Do you even know?
Imagine a photocopier making copies of copies. Each generation gets a little blurrier, a little more degraded. That’s essentially what’s happening with Gen AI models today, and this diagram maps out exactly how.
The Cycle Begins
It starts innocently enough. An AI model (Generation N) creates content—articles, images, code, whatever. This content gets posted online, where it mingles with everything else on the web. So far, so good.
The Contamination Point
Here’s where things get interesting. Web scrapers come along, hoovering up data to build training datasets for the next generation of AI. They can’t always tell what’s human-made and what’s AI-generated. So both get scooped up together.
The diagram highlights this as the critical “Dataset Composition” decision point—that purple node where synthetic and human data merge. With each cycle, the ratio shifts. More AI content, less human content. The dataset is slowly being poisoned by its own output.
The Degradation Cascade
Train a new model (Generation N+1) on this contaminated data, and four things happen:
Accuracy drops: The model makes more mistakes
Creativity diminishes: It produces more generic, derivative work
Biases amplify: Whatever quirks existed get exaggerated
Reliability tanks: You can’t trust the outputs as much
The Vicious Circle Closes
Now here’s the kicker: this degraded Generation N+1 model goes out into the world and creates more content, which gets scraped again, which trains Generation N+2, which is even worse. Round and round it goes, each loop adding another layer of synthetic blur.
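You can watch the dynamic with a few lines of arithmetic. The numbers are arbitrary; only the direction of the trend matters:

# Toy simulation of the feedback loop: each generation scrapes a web that
# already contains the previous generation's output. Numbers are arbitrary.
human_pages = 100.0             # fresh human-written pages added per cycle
synthetic_pages = 0.0           # AI-generated pages accumulated on the web
output_per_generation = 150.0   # pages each model generation pours back onto the web

for gen in range(1, 7):
    dataset = human_pages + synthetic_pages      # scrapers can't tell them apart
    human_fraction = human_pages / dataset
    print(f"Generation {gen}: {human_fraction:.0%} of training data is human-written")
    synthetic_pages += output_per_generation     # the new model floods the web again

Even with a steady supply of new human writing, the human share of each successive training set shrinks, which is the squeeze described next.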
The Human Data Squeeze
Meanwhile, clean human-generated data becomes the gold standard—and increasingly rare. The blue pathway in the diagram shows this economic reality. As AI floods the web with synthetic content, finding authentic human data becomes harder and more expensive. It’s basic supply and demand, except the supply is being drowned in synthetic noise.
Why This Matters
This isn’t just a theoretical problem. We’re watching it happen in real-time. The diagram shows a self-reinforcing cycle with no natural brake. Unless we actively intervene—by filtering training data, marking AI content, or preserving human data sources—each generation of AI models will be trained on an increasingly polluted dataset.
The arrows loop back on themselves for a reason. This is a feedback system, and feedback systems can spiral. Understanding this flow is the first step to breaking it.