Teaching Chatbot

I have built this teaching chatbot using RAG (retrieval-augmented generation), a Llama model, Qdrant (vector database), LangChain, and Streamlit.

Github repository link: https://github.com/ranjankumar-gh/teaching-bot/

Screenshot of the app:

Data: PDF files were downloaded from the NCERT website for Class IX, subject Social Science (SST), topic “Democratic Politics”.

For more details, visit the GitHub link.

On Emergent Abilities of Large Language Models

An ability is emergent if it is not present in smaller models but is present in larger models. [1]

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. Emergent abilities, however, cannot be predicted simply by extrapolating the performance of smaller models. This raises the question of whether additional scaling could further expand the range of capabilities of language models. [1]

Today’s language models have been scaled primarily along three factors:

  1. amount of computation,
  2. number of model parameters, and
  3. training data size

The following table lists the emergent abilities of large language models and the scale at which abilities emerge. [1]

Tasks that language models cannot currently do are prime candidates for future emergence; for instance, there are dozens of tasks in BIG-bench [3] for which even the largest GPT-3 and PaLM models do not achieve above-random performance. [1] Just as abilities emerge with scale, risks may also emerge, for example with respect to truthfulness, bias, and toxicity in LLMs, backdoor vulnerabilities, inadvertent deception, or harmful content synthesis.

But Rylan Schaeffer et al., in their paper [2], claim that the sudden appearance of emergent abilities is just a consequence of the way researchers measure the LLM’s performance. The article “How Quickly Do Large Language Models Learn Unexpected Skills?” by Stephen Ornes [4] beautifully summarises both papers [1][2].

References

  1. Emergent Abilities of Large Language Models by Jason Wei et al. – https://openreview.net/pdf?id=yzkSU5zdwD
  2. Are Emergent Abilities of Large Language Models a Mirage? by Rylan Schaeffer et al. – https://arxiv.org/pdf/2304.15004
  3. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models by Aarohi Srivastava et al. – https://arxiv.org/pdf/2206.04615
  4. How Quickly Do Large Language Models Learn Unexpected Skills? by Stephen Ornes – https://www.quantamagazine.org/how-quickly-do-large-language-models-learn-unexpected-skills-20240213/

Prompt Engineering

A prompt is the text passed to a generative AI model as input. Given the prompt, the model responds with generated text. A prompt can consist of questions, statements, or instructions. Prompt engineering is the practice of designing prompts to improve the generated output. It also serves 1) as a tool to evaluate the output of a model and 2) as a building block for safety mitigation methods. There is no single perfect prompt design; prompt optimization and experimentation are done iteratively. Figure 1 below depicts a very basic example of a prompt.

Figure 1: A basic example of the prompt

Controlling Model Output by Adjusting Model Parameters

The temperature and top_p parameters control the randomness of the output. Before a large language model (LLM) generates a token, it assigns likelihoods to many candidate tokens; some tokens are more likely, others less. For these two parameters to take effect, the do_sample parameter should be set to True, i.e. do_sample=True. This means we allow the next token to be sampled from the set of likely tokens rather than always taking the single most likely one.

temperature controls how willing the model is to choose less likely tokens. temperature=0 means the model generates the same response every time, because it always picks the most likely token. With higher values of temperature, the model gives less likely tokens a greater chance of being selected; as temperature grows well above 1, the distribution flattens and the candidate tokens approach being equally likely. For example, a value of 0.8 produces more diverse output, whereas a value of 0.2 produces more focused, nearly deterministic output. In short, temperature induces stochastic behaviour.

top_p, also known as nucleus sampling, restricts the model to a subset of the most likely tokens: candidates are considered in decreasing order of probability, and the model stops adding candidates as soon as their cumulative probability reaches the value of top_p. A value of 1 means all tokens are considered.

The top_k parameter restricts sampling to the k most likely tokens, i.e., exactly as many tokens as its value.

We set these parameters based on the requirements of the use case, finding the right balance between random/diverse and deterministic/focused/coherent outputs.
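To make this concrete, the following is a minimal sketch (not from the reference book) using the Hugging Face transformers text-generation pipeline; the model name is a placeholder and the parameter values are only illustrative.

from transformers import pipeline

# Placeholder model; any causal (chat/instruct) language model behaves similarly.
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")

prompt = "Write a one-line tagline for a chatbot that teaches civics."

# do_sample=True enables sampling; temperature, top_p and top_k then control
# how adventurous the token selection is.
diverse = generator(prompt, do_sample=True, temperature=0.8, top_p=0.9,
                    top_k=50, max_new_tokens=40)
focused = generator(prompt, do_sample=True, temperature=0.2, top_p=0.5,
                    top_k=20, max_new_tokens=40)

print(diverse[0]["generated_text"])
print(focused[0]["generated_text"])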

Instruction-Based Prompting

Providing a large language model (LLM) with clear, specific, and structured instructions to guide its response is referred to as instruction-based prompting. The most basic prompt consists of two components:

  1. the instruction itself and
  2. the data required for the instruction.

The following diagram depicts a basic instruction prompt. Note the instruction and data parts of the prompt.

Figure 2: Instruction Prompt

To make the model very specific about the output, for example when we want the output to be either “positive” or “negative”, we can use output indicators. The following diagram depicts an instruction prompt with an output indicator.

Figure 3: Instruction prompt with output indicators

Different types of tasks require different formats of the prompt. The following diagram illustrates example formats for summarization, classification, and named-entity recognition.

Figure 4: Prompt format for summarization, classification and NER task
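As an illustration, the following hedged sketch assembles a classification prompt from an instruction, the input data, and an output indicator; the review text is invented.

review = "The lecture videos were clear and the quizzes were genuinely helpful."

# Instruction + data + output indicator restricting the answer to two labels.
prompt = f"""Classify the sentiment of the following review.

Review: {review}

Answer with exactly one word: positive or negative.
Sentiment:"""

print(prompt)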

The following is a non-exhaustive list of prompting techniques for improving the quality of the output.

  1. Specificity: Accurately describe what you want to achieve.
  2. Hallucination: LLMs can generate incorrect information with high confidence, which is called hallucination. To mitigate this, instruct the model to respond with “I don’t know” when it does not know the answer.
  3. Order: Either begin or end your prompt with the instruction. LLMs tend to focus more on the two ends of a prompt (beginning: primacy effect; end: recency effect) and often lose track of the middle of a long prompt.

As we saw above, the common components of prompts are the instruction, the data, and output indicators. However, prompts are not limited to these; we can build a prompt that is as complex as we want. Other common components are

  1. Persona
  2. Instruction
  3. Context
  4. Format
  5. Audience
  6. Tone
  7. Data

The following is an example from the reference book [1] that uses the above prompt components. It demonstrates the modular nature of prompting; we can experiment by adding or removing components to see their effect.

Figure 5: Example of prompt showing use of the various components.
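The sketch below (with component texts invented purely for illustration) shows how such a prompt can be assembled programmatically; components can be removed or swapped to observe their effect.

# Each component is a separate string, so they can be added or removed freely.
persona     = "You are an experienced civics teacher."
instruction = "Explain the concept of universal adult franchise."
context     = "The explanation is part of a Class IX lesson on democratic politics."
data        = "Chapter excerpt: 'In a democracy, every adult citizen has one vote...'"
audience    = "Write for 14-year-old students with no prior background."
tone        = "Keep the tone friendly and encouraging."
out_format  = "Return three short bullet points followed by a one-line summary."

prompt = "\n".join([persona, instruction, context, data, audience, tone, out_format])
print(prompt)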

In-Context Learning – Providing examples

Giving examples to the LLM greatly influences its output. This is referred to as in-context learning. Zero-shot prompting uses no examples, one-shot prompting uses one example, and few-shot prompting uses two or more examples. The following diagram illustrates examples of in-context learning.

Figure 6: Examples of in-context learning

While giving the examples, the user and assistant turns should be clearly differentiated by marking the role as user or assistant. Giving examples lets us describe the desired behaviour to the model more clearly, but the model can still ignore the instruction because of random sampling.
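A hedged sketch of a few-shot prompt in chat format is shown below; the labelled reviews are invented, and the message list can be passed to any chat-completion style API.

# Two-shot prompt: examples are given as alternating user/assistant turns so the
# model can distinguish demonstrations from the actual query.
messages = [
    {"role": "system", "content": "Classify each movie review as positive or negative."},
    {"role": "user", "content": "Review: A wonderful, heartfelt film."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Review: Two hours of my life I will never get back."},
    {"role": "assistant", "content": "negative"},
    # The actual query comes last.
    {"role": "user", "content": "Review: The plot was thin but the acting saved it."},
]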

Chain Prompting: Breaking up the Problem

We already know that we can break a prompt into modular components to enhance the output of LLMs. The next level of this strategy is to break the problem/task itself into subproblems/subtasks. We use a separate prompt for each subtask and then chain the prompts in sequence, passing the output of one prompt as input to the next, creating a continuous chain of interactions that solves our problem. This is called prompt chaining. Prompt chaining can help to

  1. achieve better performance,
  2. boost the transparency of the LLM application,
  3. increase controllability and reliability,
  4. debug problems with model responses more easily,
  5. improve performance in the specific stages that need improvement,
  6. build LLM-powered conversational assistants, and
  7. improve the personalization and user experience of your application.

Use cases include

  1. Response validation: We can ask the LLM to validate previously generated output or another LLM’s output.
  2. Parallel prompts: Some use cases run multiple prompts in parallel and then merge their outputs.
  3. Writing stories

The following is an example from the reference book [1]. It illustrates a prompt chain that first creates a product name, then uses this name together with the product features to create a slogan, and finally uses the features, product name, and slogan to create a sales pitch.

Figure 7: Example of prompt chain
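A minimal sketch of this chain is shown below; generate is a hypothetical helper standing in for whatever LLM call you use, and the product features are invented.

def generate(prompt: str) -> str:
    """Hypothetical helper: call your LLM of choice and return its text output."""
    ...

features = "lightweight, waterproof, solar-charged backpack"

# Step 1: product name -> Step 2: slogan -> Step 3: sales pitch.
name = generate(f"Suggest a catchy product name for a {features}. Return only the name.")
slogan = generate(f"Write a one-line slogan for a product named '{name}' "
                  f"with these features: {features}.")
pitch = generate(f"Write a short sales pitch for '{name}'.\n"
                 f"Features: {features}\nSlogan: {slogan}")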

Reasoning with Generative Models

Reasoning is an important trait of human intelligence. Today’s LLMs resemble this reasoning behaviour mostly through memorization of training data and pattern matching. By leveraging prompt engineering, we can get them to mimic reasoning processes and thereby enhance their output.

System 1 and System 2 Thinking Process by Daniel Kahneman

Daniel Kahneman, in his famous book “Thinking, Fast and Slow”, introduced the concept of the System 1 and System 2 thinking processes in humans. According to him, System 1 represents our fast, automatic, and intuitive mode of thinking, while System 2 is our slower, more deliberate, and conscious mode, which requires effort and attention; essentially, System 1 is “thinking fast” and System 2 is “thinking slow”.

Figure 8: System 1 and System 2 thinking, Image Source

Inducing System 1 and System 2 Thinking in LLMs

The majority of LLMs today rely on System 1 thinking, but researchers are working on techniques to encourage more System 2-type behaviour by using prompting methods like “Chain of Thought” to elicit intermediate levels of reasoning before arriving at a final response.

Chain-of-Thought: Thinking Before Answering

The main aim of chain-of-thought is to push the model towards System 2 thinking, i.e., thinking before answering, allowing the model to spend more compute on the reasoning process. Here, the reasoning steps are referred to as thoughts.

Chain of thought – a series of intermediate reasoning steps – significantly improves the ability of large language models to perform complex reasoning [3]. Prompting using chain-of-thought is called chain-of-thought prompting. This prompting technique enables LLMs to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. Chain-of-thought reasoning process is highlighted in the following example taken from the paper[3].

Figure 9: Chain-of-thought example; reasoning process is highlighted – source [3]

Chain-of-thought normally requires one or more examples of reasoning. However, there is also zero-shot chain-of-thought, which can be achieved by simply adding the phrase “Let’s think step-by-step” to the prompt. The phrase does not need to be exactly the same; small variations are fine. The following is an example of zero-shot chain-of-thought.

Figure 10: Example of zero-shot chain-of-thought – source[1]
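As a small illustration (the question is invented), the only change needed for zero-shot chain-of-thought is appending the trigger phrase to the prompt:

question = ("A school bought 3 boxes of chalk with 12 pieces each and used 9 pieces. "
            "How many pieces are left?")

standard_prompt = question
# Appending the trigger phrase nudges the model to write out its reasoning first.
zero_shot_cot_prompt = question + "\nLet's think step-by-step."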

Self-Consistency: Sampling Outcomes

The paper [4] describes self-consistency as follows: “It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer.”

We first prompt the language model with chain-of-thought prompting; then, instead of greedily decoding a single reasoning path, a ‘sample-and-marginalize’ decoding procedure is followed:

  1. prompt the language model with chain-of-thought (CoT) prompting,
  2. replace the ‘greedy decode’ in CoT prompting by sampling from the language model’s decoder to generate a diverse set of reasoning paths, and
  3. marginalize out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set.

The following diagram, from the paper[4], illustrates the concept.

Figure 11: Example of self-consistency in CoT[4]
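The following sketch shows the idea under stated assumptions: generate is a hypothetical sampled LLM call, and each completion is assumed to end with a line like "Answer: <value>".

from collections import Counter

def generate(cot_prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical sampled LLM call returning a full reasoning chain."""
    ...

def final_answer(completion: str) -> str:
    # Assumes the reasoning chain ends with "Answer: <value>".
    return completion.rsplit("Answer:", 1)[-1].strip()

cot_prompt = "..."  # few-shot CoT prompt ending with the target question

# Sample several reasoning paths and keep the most consistent (majority) answer.
answers = [final_answer(generate(cot_prompt)) for _ in range(10)]
most_consistent_answer, votes = Counter(answers).most_common(1)[0]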

Tree of Thoughts: Deliberate Problem Solving

This is another effort in the direction of pushing the model towards System 2-level thinking. The following is a quote from Newell et al. [6]:

“A genuine problem-solving process involves the repeated use of available information to initiate exploration, which discloses, in turn, more information until a way to attain the solution is finally discovered.”

Paper [5] explains Tree of Thoughts (ToT) as follows: ToT generalizes over the popular chain-of-thought approach to prompting language models and enables exploration over coherent units of text (“thoughts”) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices.

The following diagram from the paper[5] illustrates the various approaches to problem solving with LLMs. Each rectangle box represents a thought.

Figure 12: Various approaches to problem solving with LLMs.

Output Verification

It is important to verify and control the output of the model to avoid breakdowns in production and to build a robust AI system. Reasons for validating the output include:

  1. Structured output: For example, we may need the output in JSON format.
  2. Valid output: Even if we restrict the output to a few choices, the model may still come up with a new one.
  3. Ethics: Free of profanity, personally identifiable information (PII), bias, cultural stereotypes, etc.
  4. Accuracy: Checking whether the output is factually accurate, coherent, and free from hallucination.

Apart from controlling the parameters temperature and top_p, the following are three ways to control the output of a GenAI model:

  1. Examples: Provide a few examples of the expected output.
  2. Grammar: Control the token selection process
  3. Fine-tuning: Tune the model on data that contain the expected output.

Providing examples

To control the structure of the output, e.g., to get JSON, we can provide a few examples in that format to guide the model towards the desired output. Still, there is no guarantee that the model will comply; some models are better than others at following instructions.

Grammar: Constrained Sampling

Libraries have been developed to constrain and validate the output of generative models such as:

  1. Guidance: An efficient programming paradigm for steering language models. With Guidance, you can control how output is structured and get high-quality output for your use case—while reducing latency and cost vs. conventional prompting or fine-tuning. It allows users to constrain generation (e.g. with regex and CFGs) as well as to interleave control (conditionals, loops, tool use) and generation seamlessly.
  2. Guardrails: This is a Python framework that helps build reliable AI applications by performing two key functions:
    • Guardrails runs Input/Output Guards in your application that detect, quantify and mitigate the presence of specific types of risks. To look at the full suite of risks, check out Guardrails Hub.
    • Guardrails help you generate structured data from LLMs.
  3. LMQL: This is a programming language for LLMs. Robust and modular LLM prompting using types, templates, constraints and an optimizing runtime.

There is another approach where we define grammars or rules that the LLM must follow when choosing the next token. For example, in llama-cpp-python we can set response_format to a JSON object if we want the output in JSON format.
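A hedged sketch with llama-cpp-python is shown below; the model path is a placeholder, and exact response_format support may vary with the library version.

from llama_cpp import Llama

llm = Llama(model_path="./models/instruct-model.Q4_K_M.gguf")  # placeholder path

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You output JSON only."},
        {"role": "user", "content": "Give the name and capital of any one country."},
    ],
    response_format={"type": "json_object"},  # constrain generation to valid JSON
    temperature=0.2,
)
print(response["choices"][0]["message"]["content"])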

References

  1. Book: O’Reilly – Hands-On Large Language Models: Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
  2. https://www.promptingguide.ai/
  3. Paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei et al., Google Research, Brain Team
  4. Paper: Self-Consistency Improves Chain-of-Thought Reasoning in Language Models by Wang et al., Google Research, Brain Team
  5. Tree of Thoughts: Deliberate Problem Solving with Large Language Models by Shunyu Yao et al., NeurIPS 2023
  6. Report on a General Problem Solving Program by A. Newell et al., IFIP Congress, 1959

Text Clustering and Topic Modeling using Large Language Models (LLMs)

1. Introduction

Text clustering is an unsupervised approach that helps in discovering patterns in data. Grouping similar texts according to their semantic content, meaning, relationships, etc. is the goal of text clustering. This makes it easier to cluster vast amounts of unstructured text and perform exploratory data analysis quickly. With recent advancements in large language models (LLMs), we can obtain extremely precise contextual and semantic representations of text. This has improved text clustering’s efficacy even more. Use cases for text clustering include identifying outliers, accelerating labelling, identifying data that has been erroneously labelled, and more.

Topic modelling facilitates the identification of (abstract) topics that arise in huge textual data collections. Clusters of text documents can be given meaning through this method.

We will learn how to use embedding models for text clustering and a text-clustering-inspired method of topic modeling, namely BERTopic, and how to generate topic labels with an LLM given the keywords of a topic.

2. Pipeline for Text Clustering

The following diagram depicts the text clustering pipeline, which consists of three steps:

  1. Use an embedding model to transform the input documents into embeddings.
  2. Using a dimensionality reduction model, lower the dimensionality of embeddings.
  3. Use a cluster model to identify groups of documents that share semantic similarities.

2.1 Embedding Documents

We know that embeddings are numerical representations of text that attempt to capture its meaning. In this first step, we can use embedding models that are optimized for semantic similarity tasks to transform the documents into embeddings. The Massive Text Embedding Benchmark (MTEB) leaderboard can be used to select an embedding model for our requirements. For example, “thenlper/gte-small” is a small but performant model with fast inference.
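A minimal sketch of this step with the sentence-transformers library is shown below; the two example documents are invented.

from sentence_transformers import SentenceTransformer

documents = [
    "Democracy means rule by the people.",
    "Photosynthesis takes place in the leaves of plants.",
]

embedding_model = SentenceTransformer("thenlper/gte-small")
embeddings = embedding_model.encode(documents, show_progress_bar=True)
print(embeddings.shape)  # (number of documents, embedding dimension)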

2.2 Dimensionality Reduction of Embeddings

It is difficult for clustering techniques to identify meaningful clusters when the dimensionality of the data is high. Dimensionality reduction techniques find low-dimensional representations that preserve the global structure of the data; they act as compression techniques and do not remove dimensions arbitrarily. Popular algorithms are Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). Compared to PCA, UMAP is better at handling nonlinear relationships and structures.
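Continuing the sketch, the embeddings from the previous step can be reduced with UMAP (from the umap-learn package); the parameter values are illustrative, not prescriptive.

from umap import UMAP

# Reduce the high-dimensional embeddings to a handful of dimensions before clustering.
umap_model = UMAP(n_components=5, min_dist=0.0, metric="cosine", random_state=42)
reduced_embeddings = umap_model.fit_transform(embeddings)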

2.3 Clustering the Reduced Embeddings

The following diagram depicts the methods of text clustering:

Density-based clustering algorithms determine the number of clusters themselves and do not force every data point to be part of a cluster; data points that are not part of any cluster are marked as outliers. Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a hierarchical variation of the DBSCAN algorithm that specializes in finding dense (micro-)clusters without being told the number of clusters in advance.
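Continuing the sketch, the reduced embeddings can be clustered with HDBSCAN; min_cluster_size is illustrative, and points labelled -1 are outliers.

from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean",
                        cluster_selection_method="eom")
clusters = hdbscan_model.fit_predict(reduced_embeddings)  # -1 marks outliers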

3. Text Clustering to Topic Modeling

Topic modelling is the term used to describe the process of identifying themes or hidden topics in a set of textual data. Traditionally, it involves finding a group of keywords or phrases that best represent and capture the essence of the topic. We need to understand the meaning of the topic through these keywords or phrases. Latent Dirichlet Allocation (LDA) is one such algorithm. Let’s discuss BERTopic in the following sections, which is a highly modular text clustering and topic modeling framework.

3.1 BERTopic: A Modular Topic Modeling Framework

The steps for topic modeling follow the three steps of text clustering: the output of the third (clustering) step is fed into the fourth step, where topic modeling begins. The following diagram depicts steps 4 and 5 for topic modeling:

Step 4 calculates the class-based term frequency, i.e., the frequency (tf) of word X in cluster C. In step 5, this term frequency is multiplied by the IDF (inverse document frequency). The goal is to give more weight to words that are frequent within a cluster and less weight to words appearing across all clusters.

The following diagram depicts the full pipeline from clustering to topic modeling. Although topic modeling follows clustering, the two are largely independent of each other, and each component is modular. BERTopic can be customised, and we can swap in another algorithm in place of any of the defaults.

3.2 Reranking in BERTopic

c-TF-IDF does not take semantic structure into account, so BERTopic leverages representation models (e.g., KeyBERTInspired, maximal marginal relevance (MMR), spaCy) to rerank the topic keywords found in step 5 above. This reranking is applied per topic rather than per document. Many of the representation models are LLMs. With this step, the final pipeline extends to become the following:
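A hedged sketch of the resulting modular pipeline in BERTopic, reusing the models from the earlier steps and adding a KeyBERTInspired reranker, could look like this:

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

topic_model = BERTopic(
    embedding_model=embedding_model,        # step 1: embedding the documents
    umap_model=umap_model,                  # step 2: dimensionality reduction
    hdbscan_model=hdbscan_model,            # step 3: clustering
    representation_model=KeyBERTInspired()  # reranking of the c-TF-IDF keywords
)
topics, probs = topic_model.fit_transform(documents, embeddings)
print(topic_model.get_topic_info().head())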

3.3 Using LLM to generate a label for Topic

The following diagram explains how the topic keywords and representative documents, together with a prompt, can be passed to an LLM to generate a label for the topic.

Final pipeline is as follows:

Final detailed pipeline:

References

  1. O’Reilly – Hands-On Large Language Models: Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
  2. Hugging Face – https://huggingface.co/

Text Classification using Large Language Models (LLMs)

1. Introduction

A common task in natural language processing (NLP) is text classification. Use cases include sentiment analysis, intent detection, entity extraction, and language detection. This article delves into how to use LLMs for text classification, covering both representation models and generative models. Under representation models, we will see how to use task-specific models and embedding models to classify text. Under generative models, we will look at open-source and closed-source models. While both generative and representation models can be applied to classification, they take different approaches.

2. Text Classification with Representation Models

Task-specific models and embedding models are two types of representation models that can be used for text classification. Task-specific models are obtained by training representation models, such as Bidirectional Encoder Representations from Transformers (BERT), for a particular task, like sentiment analysis. On the other hand, general-purpose models such as embedding models can be applied to a range of tasks beyond classification, such as semantic search.

As can be seen in the diagram below, when used for text classification, representation models are kept frozen (untrainable). Since task-specific models are trained specifically for the given task, they can classify the input text directly. When we use an embedding model instead, we first generate embeddings for the texts in the training set and then train a classifier on the training dataset of embeddings and corresponding labels. Once the classifier is trained, it can be used for classification.

3. Model Selection

The factors we should consider when selecting a model for text classification:

  1. How does it fit the use case?
  2. What is its language capability?
  3. What is the underlying architecture?
  4. What is the size of the model?
  5. How is the performance? etc.

Underlying Architecture

BERT is an encoder-only architecture and a popular choice for creating task-specific models and embedding models; it falls into the category of representation models. Generative Pre-trained Transformer (GPT) is a decoder-only architecture that falls into the generative models category. Encoder-only models are normally small in size. Variations of BERT include RoBERTa, DistilBERT, and DeBERTa. For task-specific use cases such as sentiment analysis, Twitter-RoBERTa-base can be a good starting point. For embedding models, sentence-transformers/all-mpnet-base-v2 can be a good starting point, as it is a small but performant model.

4. Text Classification using Task-specific Models

This is pretty straightforward. The text is passed to the tokenizer, which splits it into tokens. These tokens are then consumed by the task-specific model, which predicts the class of the text.

This works well if we can find a task-specific model for our use case. Otherwise, if we have to fine-tune a model ourselves, we need to check whether we have a sufficient budget (time, cost) for it. Another option is to resort to general-purpose embedding models.
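A minimal sketch using the Twitter-RoBERTa sentiment model mentioned above (the exact model id is an assumption) with the transformers pipeline:

from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",  # assumed model id
)
print(classifier("The course material was surprisingly engaging."))
# Expected output: a label such as 'positive' with a confidence score.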

5. Text Classification using Embedding Models

We can generate features using an embedding model rather than directly using the task-specific representation model for classification. These features can be used for training a classifier such as logistic regression.
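A hedged sketch with toy data is shown below; it embeds the training texts with sentence-transformers and fits a scikit-learn logistic regression on top of the frozen embeddings.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Toy data purely for illustration.
train_texts = ["Loved it", "Terrible experience", "Fantastic!", "Not good at all"]
train_labels = [1, 0, 1, 0]
test_texts = ["Really enjoyable", "A complete waste of time"]

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
X_train = embedder.encode(train_texts)
X_test = embedder.encode(test_texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)
print(clf.predict(X_test))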

5.1 What if we do not have the labeled data?

If we have definitions of the labels but no labeled data, we can use what is called “zero-shot classification”. A zero-shot model predicts the labels of input text even if it was not trained on them. The following diagram depicts the concept.

We can perform zero-shot classification using embeddings by describing our labels in terms of what they should represent. The following diagram describes the concept.

To assign a label to an input text/document, we can calculate the cosine similarity between its embedding and each label embedding and check which label it is closest to.
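A minimal sketch of this embedding-based zero-shot approach (label descriptions and document invented):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Describe the labels in natural language and embed the descriptions.
label_descriptions = ["a negative movie review", "a positive movie review"]
label_embeddings = embedder.encode(label_descriptions)

document = "The pacing dragged, but the final act was worth the wait."
doc_embedding = embedder.encode([document])

similarities = cosine_similarity(doc_embedding, label_embeddings)[0]
print(label_descriptions[similarities.argmax()])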

6. Text Classification with Generative Models

Generative models are trained for a wide variety of tasks, so they will not work for text classification out of the box. To make a generative model understand our context, we need to use prompt engineering: the prompt has to be written carefully so that the model understands what it is expected to do, what the candidate labels are, and so on.

6.1 Using Text-to-Text Transfer Model (T5)

The following diagram summarizes the different categories of the models:

The following diagram depicts the training steps:

We need to prefix each document with the prompt “Is the following sentence positive or negative?”.
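A hedged sketch of this prefixing with a text-to-text pipeline is shown below; the Flan-T5 checkpoint is an assumption, and any instruction-tuned T5 variant should behave similarly.

from transformers import pipeline

t2t = pipeline("text2text-generation", model="google/flan-t5-small")  # assumed checkpoint

document = "The tutor explained every doubt patiently. Highly recommended."
prompt = f"Is the following sentence positive or negative? {document}"
print(t2t(prompt)[0]["generated_text"])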

6.2 ChatGPT for Classification

The following diagram describes the training procedure of ChatGPT:

The model is trained using human preference data to generate text that resembles human preference.

For text classification, following is the sample prompt:

prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

References

  1. O’Reilly – Hands-On Large Language Models: Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
  2. Hugging Face – https://huggingface.co/

LLM (Large Language Models) Inference and Serving

1. Introduction

This article discusses the various available solutions, techniques, and underlying architectures for LLM inference and serving. LLM inference and serving is simply deploying LLM models and providing access to them. Depending on whether you want a local deployment for your own use or a production-grade deployment, various solutions are available.

2. LLM Inference & Serving Architecture

The following is an overview of the architecture for LLMs’ inference and serving.

Figure: Typical architecture of LLM Inference and Serving

Before we go into detail, let’s first understand the difference between an inference engine and a server. Inference engines run the models and are in charge of everything needed for the generation process. Inference servers handle incoming and outgoing HTTP and gRPC queries from end users of the application, as well as metrics for measuring the deployment performance of the LLM.

The inference server primarily consists of the query queue scheduler and the inference engine.

Query Queue Scheduler: This component consists of a queue and a scheduler. When a query first arrives, it is added to the queue. The scheduler takes a query from the queue and hands it over to the inference engine. The queue helps the scheduler pick multiple requests and put them into the same batch to be processed on the GPU.

Inference Engine: This component consists of batching, model, and query response modules.

Batching: This module is responsible for creating batches of query requests. Batches are created because calculations on the GPU are more performant and resource-efficient when done in batches.

The model is the LLM that does the actual inference, e.g., next-token prediction. The query response module produces the final response that is sent back to the original requester. Additionally, the inference server provides interfaces for access via HTTP, gRPC, etc.

The metrics module keeps track of metrics such as throughput, latency, etc.

3. Evaluation of LLM Inference and Serving

Throughput and latency are two important metrics for evaluating LLM inference and serving. Throughput refers to the number of output tokens an LLM can generate per second. Latency refers to the time it takes for a large language model (LLM) to process an input and generate a response. For latency, the important metric is “Time to First Token” (TTFT), which refers to the amount of time it takes for a language model to generate the first token of its response after receiving a prompt.

To achieve high throughput and low latency, LLM inference engines and servers focus on optimising LLM memory utilisation and performance in the production environment. Though throughput and latency are important factors, based on your specific use case, you may choose additional factors that would influence your decision to select a particular engine/server.

4. Prominent Products

  1. vLLM: Originally developed in the Sky Computing Lab at UC Berkeley, vLLM is a fast and easy-to-use library for LLM inference and serving. Red Hat recently acquired Neural Magic, which has an enterprise-ready product, nm-vllm, based on vLLM. (A minimal usage sketch follows this list.)
  2. LightLLM: LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
  3. LMDeploy: A toolkit for compressing, deploying, and serving LLM.
  4. SGLang: SGLang is a fast-serving framework for large language models and vision language models.
  5. OpenLLM: OpenLLM allows developers to run any open-source LLMs (Llama 3.3, Qwen2.5, Phi3 and more…) or custom models as OpenAI-compatible APIs with a single command.
  6. Triton Inference Server with TensorRT-LLM: TensorRT-LLM is an open-sourced library from Nvidia for optimising Large Language Model (LLM) inference.
  7. Ray Serve: Ray Serve is a scalable model serving library for building online inference APIs. Ray Serve is particularly well suited for model composition and many model serving, enabling you to build a complex inference service consisting of multiple ML models and business logic all in Python code.
  8. Hugging Face – Text Generation Inference (TGI): Text Generation Inference (TGI) from Hugging Face, is a toolkit for deploying and serving Large Language Models (LLMs).
  9. DeepSpeed-MII: Low-latency and high-throughput inference.
  10. CTranslate2: Fast inference engine for Transformer models.
  11. BentoML: Unified Inference Platform for any model, on any cloud.
  12. MLC LLM: Universal LLM Deployment Engine with ML Compilation.
  13. Others: Ollama, WebLLM, LM Studio, GPT4All, llama.cpp, etc.
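As a concrete example for item 1 above, the following is a hedged sketch of offline batched inference with vLLM; the model name is a placeholder and a suitably sized GPU is assumed.

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

prompts = [
    "Explain KV caching in one sentence.",
    "What does PagedAttention optimise?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)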

5. Important Terminologies

  1. KV cache: Key-value (K-V) caching is a technique used in transformer models. Key and value matrices from previous steps are stored and then reused during the generation of subsequent tokens. This helps in the reduction of redundant computations and speeding up inference time. But this comes at the cost of increased memory consumption. The increased memory consumption can be reduced by techniques such as cache invalidation and cache reuse. [10]
  2. PagedAttention: Because of the KV cache, memory management becomes critical. Under this technique, the KV cache is partitioned into blocks. Because of this partition, storage of keys and values in memory can happen in a non-contiguous manner. This memory management strategy is inspired by the concept of virtual memory and paging in operating systems.
  3. Batching: Under this technique, multiple inference requests (or prompts) are grouped/batched and then processed simultaneously to improve GPU utilisation and throughput.
  4. Support for quantisation: Larger LLM models need hardware with larger specifications and thereby increase the overall cost. If you are on a tight budget, quantisation is one technique you can try: the precision of the model’s weights and activations is lowered to decrease the memory footprint. This reduces memory-bandwidth requirements and increases cache utilisation. Typically, the precision can go from FP32 down to INT8 or even INT4.
  5. LoRA: Low-Rank Adaptation: It is a technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
  6. Tool calling: Tool calling, also known as function calling, is a technique that enables LLMs to request information from external tools. This enables LLMs to obtain information for which they were not trained or perform actions that are beyond their own capacities.
  7. Reasoning models: Reasoning models employ complex, multi-step generation with intermediate steps to solve complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. Examples include DeepSeek R1. [13]
  8. Structured Outputs:
    • outlines: Structured Text Generation
    • LMFE (Language Model Format Enforcer): Enforce the output format (JSON Schema, Regex etc) of a language model.
    • xgrammar: Efficient, Flexible and Portable Structured Generation
  9. Automatic Prefix Caching (APC): APC is a technique that speeds up Large Language Models (LLMs) by reusing cached results for similar queries.
  10. Speculative Decoding [4][5]
  11. Chunked Prefill
  12. Prompt Adapter
  13. Beam Search
  14. Guided decoding
  15. AsyncLM: It improves LLM’s operational efficiency by enabling LLMs to generate and execute function calls concurrently.
  16. Prompt logprobs (logarithm of probability)
  17. kserve: Standardized Serverless ML Inference Platform on Kubernetes
  18. kubeai: AI Inference Operator for Kubernetes. The easiest way to serve ML models in production. Supports VLMs, LLMs, embeddings, and speech-to-text.
  19. llama-stack: Composable building blocks to build Llama Apps

References / Further Reads

  1. Best LLM Inference Engines and Servers to deploy LLMs in Production
  2. Efficient Memory Management for Large Language Model Serving with PagedAttention
  3. LoRA: Low-Rank Adaptation of Large Language Models
  4. Fast Inference from Transformers via Speculative Decoding
  5. Looking back at speculative decoding
  6. Efficient Generative Large Language Model Serving
  7. Ten ways to Serve Large Language Models: A Comprehensive Guide
  8. The 6 Best LLM Tools To Run Models Locally
  9. Benchmarking LLM Inference Backends
  10. Transformers Key-Value Caching Explained
  11. LLM Inference Series: 3. KV caching explained
  12. vLLM and PagedAttention: A Comprehensive Overview
  13. Understanding Reasoning LLMs

How do you choose among competing open-source products? Example comparison of open-source vector databases.

The following are the questions that should be running through your mind when you have to choose from competing open-source products:

  1. Are there any other users out there?
  2. Is it the most popular in this category?
  3. Is this technology in decline?

The popularity and traction of GitHub projects can be inferred from their star histories. You can use star-history.com to make a comparison based on these metrics. Refer to the tutorial for details.

Following is the comparison of vector databases: qdrant, chroma, weaviate, marqo, milvus, and vespa.

Star History