Introduction
When you send a prompt to ChatGPT, Claude, or any other LLM-powered application, what actually happens behind the scenes? The journey from your input to the model’s response involves a sophisticated inference engine—a critical piece of infrastructure that determines everything from response latency to deployment costs.
Understanding LLM inference engines isn’t just academic curiosity. If you’re building production AI systems, these engines directly impact your application’s performance, scalability, and economics. A well-optimized inference setup can mean the difference between a responsive user experience and one that feels sluggish, between manageable costs and budget overruns.
This article breaks down the complete picture: the architectural components that power inference, the optimization techniques that make it feasible, the tools available for deployment, and the practical considerations you need to navigate when moving from prototype to production.
What is LLM Inference?
LLM inference is the process of using a trained language model to generate predictions or outputs based on input data. Unlike training—which involves adjusting billions of parameters over massive datasets—inference uses those fixed parameters to produce responses.
Think of it this way: training teaches the model language patterns, world knowledge, and reasoning capabilities. Inference puts that learned knowledge to work, transforming your prompt into coherent text, code, or structured outputs.
Inference vs Training: Key Differences
The computational characteristics of inference differ fundamentally from training:
Training processes enormous batches of data in parallel, updates model weights iteratively, runs for days or weeks on clusters of GPUs or TPUs, and prioritizes throughput over latency. The goal is learning—accuracy improves with more compute time.
Inference handles individual or small batches of requests sequentially, uses frozen model weights, completes in milliseconds to seconds, and prioritizes latency and cost efficiency. The goal is production readiness—users expect immediate responses.
These different priorities drive completely different optimization strategies. Training benefits from larger batch sizes and longer computation times. Inference demands the opposite: minimal latency, efficient memory usage, and cost-effective scaling.
The Two Phases of LLM Inference
LLM inference actually consists of two distinct computational phases, each with different characteristics and bottlenecks.
Prefill Phase: Processing the Input
The prefill phase processes your entire input prompt in parallel. The model ingests all input tokens simultaneously, computing attention across the complete context.
This phase is compute-bound. The GPU cores work intensively to process the prompt through all transformer layers, calculating self-attention and feed-forward operations. For long prompts, prefill can dominate total inference time.
During prefill, the model generates the KV (key-value) cache—a crucial data structure containing computed attention states for all input tokens. This cache prevents redundant calculations in the next phase.
Decode Phase: Generating the Output
The decode phase generates output tokens one at a time, autoregressively. Each new token depends on all previously generated tokens, making parallelization impossible within a single sequence.
This phase is memory-bound. The GPU repeatedly loads the KV cache from memory for each token generation. As the cache grows with output length, memory bandwidth becomes the bottleneck rather than computational throughput.
The autoregressive nature creates another challenge: you can’t know how many tokens you’ll generate until the model decides to stop. This unpredictability complicates batching and resource allocation.
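To make the two phases concrete, here is a minimal sketch using the HuggingFace transformers API; gpt2 is chosen purely as a small illustrative model, and the greedy loop and 16-token limit are arbitrary. The first forward pass is the prefill (whole prompt processed at once, KV cache built); each subsequent pass is one decode step that reuses the cache.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the entire prompt in parallel and build the KV cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    for _ in range(16):
        # Decode: one token at a time, reusing cached keys/values for all prior tokens.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))

Note how the prefill call sees the full prompt while every decode call sees only a single new token; that asymmetry is exactly why the two phases hit different bottlenecks.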
Why This Distinction Matters
Understanding these two phases shapes optimization strategy:
- Prefill optimization focuses on parallel processing efficiency and maximizing GPU utilization
- Decode optimization targets memory bandwidth, cache management, and reducing data movement
- Batching strategies must balance prefill compute intensity with decode memory constraints
- Hardware selection depends on whether your workload is prefill-heavy (long prompts) or decode-heavy (long outputs)
Most production deployments see mixed workloads. Chat applications might have shorter prompts (fast prefill) but longer responses (extended decode). Summarization tasks might invert this pattern. Your infrastructure needs to handle both effectively.
Core Architectural Components
An LLM inference engine comprises several interconnected systems working together to process requests efficiently. Let’s examine each component and its role.
Request Router and Load Balancer
The router sits at the entry point, receiving incoming inference requests and distributing them across available model instances. This isn’t simple round-robin distribution—intelligent routing considers:
- Current load on each instance
- Request characteristics (prompt length, expected output length)
- Model variant or version requirements
- Geographic proximity for latency optimization
- Instance health and availability
Advanced routers implement request queueing, priority handling, and dynamic scaling triggers. When traffic spikes, the router coordinates with orchestration systems to spin up additional instances.
KV Cache Manager
The KV cache stores attention key-value pairs computed during inference. For a 70B-parameter model with full multi-head attention, the cache costs roughly 2.5MB per token in FP16, so even a modest batch of 2K-token contexts can consume 40GB+ of GPU memory.
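A back-of-the-envelope sizing helper makes the scale clear; the shapes below (80 layers, 128-dim heads, FP16 cache, a batch of eight 2K-token sequences) are illustrative for a 70B-class model rather than exact for any specific one.

def kv_cache_bytes(layers, kv_heads, head_dim, tokens, batch, bytes_per_value=2):
    # 2x for keys and values, per layer, per KV head, per token, per sequence.
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_value

full_mha = kv_cache_bytes(80, kv_heads=64, head_dim=128, tokens=2048, batch=8)
gqa_8    = kv_cache_bytes(80, kv_heads=8,  head_dim=128, tokens=2048, batch=8)
print(full_mha / 2**30, gqa_8 / 2**30)   # roughly 40 GB vs 5 GB

The 8x gap between the two calls is the grouped-query attention saving discussed below.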
Efficient cache management becomes critical:
PagedAttention (used by vLLM) breaks the KV cache into fixed-size blocks, similar to virtual memory paging in operating systems. This eliminates fragmentation and enables efficient sharing across sequences.
Multi-query and grouped-query attention architectures reduce cache size by sharing key-value pairs across multiple attention heads, cutting memory requirements without significant quality loss.
Cache eviction policies determine which cached data to keep when memory pressure increases. Simple LRU (Least Recently Used) can be effective, but more sophisticated policies consider factors like prompt prefix overlap and request priority.
Batch Scheduler
The scheduler orchestrates batch formation and execution. Simple batching groups requests that arrive simultaneously, but this leaves gaps when requests trickle in asynchronously.
Continuous batching (also called iteration-level batching) dynamically adds new requests to ongoing batches between decoding steps. When a sequence completes, the scheduler immediately slots in waiting requests. This dramatically improves throughput and GPU utilization.
The scheduler must also handle:
- Priority queueing for time-sensitive requests
- Sequence length prediction to avoid out-of-memory errors
- Fairness to prevent starvation of long-running requests
- Preemption to pause low-priority work when urgent requests arrive
Memory Management System
Beyond the KV cache, the memory manager handles model weights, activation tensors, and intermediate computations.
Model weight loading strategies include:
- Loading complete models into GPU memory (fastest, most memory-intensive)
- CPU-GPU streaming for models larger than GPU memory
- Tensor parallelism to split weights across multiple GPUs
Activation checkpointing trades computation for memory by recomputing intermediate activations rather than storing them. This allows larger batch sizes at the cost of additional forward passes.
Memory pooling pre-allocates memory blocks to avoid allocation overhead during inference. The pool manager tracks usage and handles fragmentation.
Tokenization Pipeline
Before the model sees any text, tokenizers convert strings into numerical token IDs. This seems straightforward but has performance implications.
Vocabulary size affects embedding layer computation. Larger vocabularies (100K+ tokens) support more languages and reduce sequence length but increase memory footprint.
Tokenization algorithms (BPE, WordPiece, SentencePiece) have different characteristics for handling rare words, numbers, and code. The choice impacts both model quality and tokenization overhead.
Detokenization converts generated token IDs back to text. For streaming responses, partial detokenization must handle incomplete UTF-8 sequences correctly.
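As a small illustration (GPT-2's byte-level tokenizer used as an example), decoding token-by-token can produce replacement characters when a multi-byte character spans token boundaries, which is why streaming detokenizers buffer output until they reach a clean UTF-8 boundary.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("naïve café ☕")
print(ids)                       # the token IDs the model actually sees
for i in range(1, len(ids) + 1):
    # Partial decodes may show "�" until all bytes of a character have arrived.
    print(repr(tok.decode(ids[:i])))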
Sampling and Generation Control
After the model produces logits (raw prediction scores) for the next token, the sampling module determines which token to select.
Greedy sampling always picks the highest-probability token. Simple and deterministic, but can produce repetitive outputs.
Temperature scaling flattens or sharpens the probability distribution. Higher temperatures increase randomness; lower temperatures make outputs more deterministic.
Top-k and top-p (nucleus) sampling constrain selection to high-probability tokens, balancing diversity with coherence.
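A rough sketch of how temperature scaling and nucleus (top-p) sampling combine on a logits vector; the default parameter values are illustrative.

import torch

def sample_next_token(logits, temperature=0.8, top_p=0.95):
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    probs = torch.softmax(logits / max(temperature, 1e-5), dim=-1)
    # Nucleus sampling: keep the smallest set of tokens covering top_p probability mass.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_probs, dim=-1) <= top_p
    keep[..., 0] = True                        # always keep the most likely token
    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    filtered = filtered / filtered.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(filtered, num_samples=1)
    return sorted_idx.gather(-1, choice)       # map back to vocabulary indices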
Advanced techniques include:
- Repetition penalties to discourage repeated phrases
- Frequency and presence penalties for vocabulary diversity
- Constrained decoding to ensure valid JSON, code syntax, or grammar
- Beam search for exploring multiple generation paths
Output Streaming and Response Handling
Modern inference engines support streaming responses—sending tokens as they’re generated rather than waiting for completion.
Server-Sent Events (SSE) or WebSocket connections deliver tokens incrementally to clients. This improves perceived latency dramatically; users see responses appearing word-by-word rather than waiting seconds for complete output.
Buffering strategies determine when to send token chunks. Character-by-character streaming maximizes responsiveness but increases overhead. Word-level or phrase-level buffering balances latency and efficiency.
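A minimal sketch of word-level buffering over a stream of already-decoded text pieces; the flush characters and buffer limit are illustrative.

def buffered_chunks(text_stream, flush_on=(" ", "\n"), max_buffer=16):
    # Accumulate decoded text and flush on whitespace or when the buffer grows large,
    # trading a little latency for fewer, larger chunks on the wire.
    buffer = ""
    for piece in text_stream:
        buffer += piece
        if piece.endswith(flush_on) or len(buffer) >= max_buffer:
            yield buffer
            buffer = ""
    if buffer:
        yield buffer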
Error handling during streaming requires careful design. If generation fails midway, the system must gracefully notify clients and clean up resources.
Memory Optimization Techniques
Memory is the primary constraint in LLM inference. A 70B parameter model in FP16 precision requires 140GB just for weights. Add KV cache, activations, and batching, and you quickly exceed available GPU memory. These techniques make large models deployable.
Quantization
Quantization reduces numerical precision, trading accuracy for memory savings and faster computation.
Weight-only quantization compresses model parameters from FP16 (16 bits) to INT8 (8 bits) or even INT4 (4 bits). A 4-bit quantized 70B model fits in 35GB instead of 140GB. The model performs computations in higher precision internally but loads compressed weights.
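A minimal sketch of symmetric per-channel INT8 weight quantization in PyTorch shows the store-compressed, compute-in-higher-precision pattern; production methods such as GPTQ and AWQ are considerably more careful about rounding error.

import torch

def quantize_int8(weight):
    # One scale per output channel; weights are stored as signed 8-bit integers.
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_fp16(q, scale):
    # Expand back to FP16 just before the matmul; only the INT8 copy lives in memory.
    return q.to(torch.float16) * scale.to(torch.float16)

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
print(q.element_size() * q.nelement() / 2**20, "MB stored")   # ~16 MB vs ~32 MB in FP16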
Activation quantization also compresses intermediate tensors during inference. This is trickier—activations have different distributions than weights and require calibration datasets for optimal quantization parameters.
Quantization methods include:
Post-Training Quantization (PTQ) converts a trained model without additional training. GPTQ and AWQ are popular PTQ methods that minimize accuracy loss through careful weight rounding.
Quantization-Aware Training (QAT) incorporates quantization into the training process, allowing the model to adapt. This produces better quality but requires access to training infrastructure.
Dynamic quantization adjusts precision per-layer or per-operation based on runtime characteristics. Some layers tolerate aggressive quantization; others need higher precision.
Mixed precision uses different precisions for different model components. Attention might use FP16, feed-forward layers INT8, and embeddings INT4. This balances quality and efficiency.
Real-world impact: 4-bit quantization can reduce memory by 75% with only 1-2% accuracy degradation for many models. This makes deployment feasible on consumer GPUs or reduces cloud costs substantially.
Model Pruning
Pruning removes unnecessary model parameters, creating smaller models that maintain most of the original’s capabilities.
Unstructured pruning removes individual weights based on magnitude or importance. This maximizes compression but requires specialized sparse matrix operations to achieve speedups.
Structured pruning removes entire neurons, attention heads, or layers. Less aggressive compression but compatible with standard matrix operations, making it easier to deploy.
Knowledge distillation trains a smaller model to mimic a larger one’s outputs. The “student” model learns to approximate the “teacher’s” behavior in a compressed form. This isn’t pruning per se but achieves similar goals—smaller models with retained capabilities.
Layer dropping removes entire transformer layers. Surprisingly, models often tolerate losing 20-30% of layers with minimal quality degradation, especially when combined with fine-tuning.
KV Cache Optimization
The KV cache grows linearly with sequence length and batch size, quickly consuming available memory.
PagedAttention divides the KV cache into fixed-size pages (typically 16-64 tokens). Sequences share page tables, enabling efficient memory use and eliminating fragmentation. When a sequence completes, its pages return to the free pool immediately.
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV cache size by sharing keys and values across multiple query heads. Instead of maintaining separate KV pairs for each attention head, MQA uses single shared pairs. GQA groups heads for a middle ground between MQA and full multi-head attention.
A 70B model with 64 attention heads might reduce KV cache size by 8x using GQA with 8 groups.
Prefix caching shares KV cache entries for common prompt prefixes. Many requests start with system prompts or similar context. Computing these once and reusing the cached KV pairs eliminates redundant prefill computation.
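A toy sketch of the idea: requests sharing the same system-prompt tokens reuse one cached prefill result. Here compute_kv_state stands in for an engine-specific prefill call, and the cached object is whatever KV structure the engine uses.

import hashlib

prefix_cache = {}

def get_prefix_state(prefix_token_ids, compute_kv_state):
    # Key the cache on the exact token sequence of the shared prefix.
    key = hashlib.sha256(repr(tuple(prefix_token_ids)).encode()).hexdigest()
    if key not in prefix_cache:
        prefix_cache[key] = compute_kv_state(prefix_token_ids)   # prefill runs once
    return prefix_cache[key]                                     # later requests reuse it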
Prompt compression techniques reduce effective context length by selecting only the most relevant tokens or summarizing earlier context. This maintains semantic content while reducing cache requirements.
Flash Attention
Flash Attention restructures the attention computation to minimize memory reads/writes—the primary bottleneck in transformer inference.
Standard attention materializes the full attention matrix (sequence_length × sequence_length), which scales quadratically and requires massive memory bandwidth. Flash Attention never materializes the full matrix, instead computing attention in blocks and fusing operations.
Key innovations:
- Block-sparse computation processes attention in tiles that fit in fast SRAM
- Kernel fusion combines multiple operations (softmax, matmul) into single GPU kernels
- Recomputation trades a small amount of compute to drastically reduce memory I/O
Flash Attention 2 and 3 iterate on the original, achieving 2-3x speedups for long sequences with no accuracy loss. For inference with long contexts (32K+ tokens), Flash Attention is nearly mandatory.
Speculative Decoding
Speculative decoding accelerates the inherently sequential decode phase through a clever trick: use a small, fast “draft” model to generate candidate tokens, then verify them with the full target model in parallel.
The draft model runs autoregressively, generating several tokens quickly. The target model then processes all draft tokens simultaneously, accepting or rejecting each. On acceptance, you gained multiple tokens for only one target model forward pass.
Acceptance rate determines effectiveness. If the draft model predicts well (70-90% acceptance), you achieve 2-3x speedup. Poor draft models provide no benefit.
Draft model options:
- Smaller versions of the target model (e.g., 7B drafting for 70B)
- Specialized models trained for fast speculation
- Previous layers of the same model (early exit strategies)
This technique works because validation is parallelizable even though generation isn’t. You’re trading draft model computation for target model decode steps—a favorable trade when the draft model is much smaller.
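A conceptual sketch of the draft-then-verify loop; next_token and greedy_predictions are hypothetical helpers, and the exact-match acceptance rule here is a simplification of the probabilistic accept/reject scheme used in practice.

def speculative_step(draft_model, target_model, context, k=4):
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(context)
    for _ in range(k):
        t = draft_model.next_token(ctx)          # hypothetical helper
        draft.append(t)
        ctx.append(t)
    # 2. The target model scores all k positions in one parallel forward pass.
    target = target_model.greedy_predictions(context, draft)   # hypothetical helper
    # 3. Accept draft tokens until the first disagreement, then take the target's token.
    accepted = []
    for d, t in zip(draft, target):
        accepted.append(d if d == t else t)
        if d != t:
            break
    return accepted     # between 1 and k tokens gained per target forward pass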
Compute Optimization Techniques
Memory optimizations help models fit and run; compute optimizations make them run faster.
Kernel Fusion
Modern GPU code executes as kernels—independent programs that run on the GPU. Each kernel launch has overhead, and data must travel between GPU memory and compute cores for each operation.
Kernel fusion combines multiple operations into single kernels, reducing overhead and data movement.
Operator fusion examples:
- Combining matrix multiplication and activation function (matmul + ReLU)
- Fusing layer normalization operations
- Merging attention computation steps (QK^T, softmax, attention output)
Custom CUDA kernels hand-written for specific operation sequences can achieve 2-5x speedups over sequential execution. Projects like FasterTransformer and Flash Attention demonstrate massive gains from careful kernel engineering.
Graph compilation frameworks like TorchScript, ONNX Runtime, and TensorRT automatically analyze computation graphs and generate fused kernels. These tools make optimization accessible without hand-coding CUDA.
Tensor Parallelism
Tensor parallelism splits individual operations across multiple GPUs. Instead of each GPU holding the complete model, each holds slices of weight matrices.
Layer-wise splitting: Each GPU computes a portion of each layer’s output. For a matrix multiplication A × B, split B column-wise across GPUs. Each GPU computes its slice, then results are gathered through inter-GPU communication.
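The column-parallel idea can be illustrated on a single device, with torch.chunk standing in for the per-GPU weight shards and torch.cat for the gather step that would run over NVLink or InfiniBand.

import torch

A = torch.randn(4, 8)                      # activations (replicated on every GPU)
B = torch.randn(8, 6)                      # weight matrix to be split column-wise
shards = torch.chunk(B, 2, dim=1)          # "GPU 0" and "GPU 1" each hold 3 columns
partials = [A @ shard for shard in shards] # each device computes its output slice locally
result = torch.cat(partials, dim=1)        # gather slices (inter-GPU communication)
assert torch.allclose(result, A @ B)       # identical to the unsplit computation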
Communication overhead is the challenge. All-reduce operations synchronize results across GPUs. High-bandwidth interconnects (NVLink, InfiniBand) are essential—PCIe bottlenecks destroy performance.
Megatron-LM from NVIDIA pioneered efficient tensor parallelism for transformers, carefully partitioning attention and feed-forward layers to minimize communication.
When to use: Tensor parallelism shines when single models exceed single-GPU memory or when you need to reduce per-GPU memory usage within a node. It’s common to combine tensor parallelism within nodes with pipeline parallelism across nodes.
Pipeline Parallelism
Pipeline parallelism assigns different model layers to different GPUs. GPU 1 handles layers 1-10, GPU 2 handles layers 11-20, and so on.
Micro-batching divides each input batch into micro-batches that flow through the pipeline. While GPU 2 processes micro-batch 1, GPU 1 works on micro-batch 2. This keeps all GPUs busy rather than waiting for sequential processing.
Bubble overhead occurs during pipeline fill and drain—periods when some GPUs are idle. Smaller micro-batches reduce bubbles but increase communication overhead.
GPipe and PipeDream are influential frameworks for pipeline parallelism, implementing different strategies for schedule optimization and weight updates.
Inference considerations: Pipeline parallelism works better for high-throughput batch inference than low-latency single requests. The pipeline needs sustained traffic to stay filled.
Continuous Batching
Traditional static batching waits until enough requests accumulate, processes them as a batch, then waits again. This leaves GPUs idle and increases latency for early-arriving requests.
Continuous batching operates at the iteration level. Between generating each token, the scheduler checks for new requests and dynamically expands or contracts the batch.
The Orca serving system pioneered this approach, showing order-of-magnitude throughput improvements over static batching. Modern inference servers like vLLM and TGI implement continuous batching as standard.
Implementation challenges:
- Tracking completion state for each sequence independently
- Handling variable sequence lengths within batches
- Managing KV cache allocation/deallocation dynamically
- Balancing fairness—ensuring long sequences don’t starve short ones
Practical impact: Continuous batching is perhaps the single most impactful optimization for multi-user inference systems. It transforms GPU utilization from 20-30% (static batching) to 70-80%+ while simultaneously reducing average latency.
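A conceptual sketch of iteration-level scheduling, where decode_one_step stands in for the engine's batched decode call and capacity checking is reduced to a simple batch-size cap.

from collections import deque

waiting = deque()      # requests queued for admission
running = []           # sequences currently in the batch
MAX_BATCH = 32         # illustrative cap; real schedulers also track KV cache pages

def scheduler_iteration(engine):
    # Admit new requests between decode steps, up to available capacity.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    # Run one decode step for every running sequence in a single batched call.
    finished = engine.decode_one_step(running)   # placeholder engine call
    for seq in finished:
        running.remove(seq)                      # free its slot (and KV cache) immediately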
Operator Optimization
Even within fused kernels, specific optimizations for individual operators matter.
Matrix multiplication (GEMM) dominates transformer computation. Highly optimized GEMM libraries (cuBLAS, cuDNN, CUTLASS) implement sophisticated tiling and register allocation strategies. Using the right GEMM configuration for your matrix shapes can yield 2x speedups.
Softmax optimization is critical for attention. Numerically stable softmax requires finding the maximum value across inputs before exponentiation. Clever implementations fuse max-finding with exponentiation and reduce memory bandwidth.
Layer normalization appears in every transformer layer. Fused implementations compute mean, variance, and normalization in single passes rather than three separate operations.
Embedding lookup can bottleneck models with large vocabularies. Optimized implementations use GPU shared memory effectively and handle irregular access patterns.
Model Compilation
Compilation frameworks analyze model computation graphs and generate optimized execution plans.
TensorRT from NVIDIA performs layer fusion, precision calibration, and kernel auto-tuning. It can achieve 5-10x speedups for inference compared to eager execution, especially for smaller models.
ONNX Runtime provides cross-platform optimization, converting models to ONNX format and applying hardware-specific optimizations. It’s particularly strong for deployment across diverse environments (cloud, edge, mobile).
TorchScript and TorchInductor compile PyTorch models, eliminating Python overhead and enabling graph-level optimizations while maintaining PyTorch’s ecosystem compatibility.
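As one example of the workflow, torch.compile in PyTorch 2.x wraps a model with a one-line change; the layer and input shapes below are arbitrary.

import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8).eval()
compiled = torch.compile(layer)        # TorchInductor backend by default

x = torch.randn(32, 128, 512)
with torch.no_grad():
    y = compiled(x)                    # first call triggers compilation; later calls reuse it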
Compilation trade-offs: Compilation adds upfront overhead (seconds to minutes) and reduces flexibility. Dynamic control flow and dynamic shapes complicate optimization. Most production systems pre-compile models during deployment rather than on-the-fly.
Popular Inference Tools and Frameworks
The inference ecosystem offers numerous tools, each with different strengths. Choosing the right one depends on your requirements: throughput, latency, hardware support, and deployment environment.
vLLM
vLLM focuses on high-throughput serving through PagedAttention and continuous batching.
Key features:
- Exceptional throughput for multi-user scenarios
- Automatic management of KV cache through paging
- Support for major open-source models out-of-box
- OpenAI-compatible API
- Tensor parallelism and pipeline parallelism
Best for: High-traffic production deployments where throughput matters more than absolute lowest latency. Excellent for API services handling hundreds of concurrent requests.
Limitations: Primarily targets NVIDIA GPUs. Setup complexity is moderate—requires understanding of parallelism strategies for multi-GPU deployments.
TensorRT-LLM
NVIDIA’s TensorRT-LLM provides heavily optimized inference for NVIDIA hardware, combining TensorRT compilation with LLM-specific optimizations.
Key features:
- State-of-the-art performance on NVIDIA GPUs
- Extensive quantization support (INT8, INT4, FP8)
- Flash Attention and custom fused kernels
- Multi-GPU and multi-node scaling
- Production-grade C++ backend
Best for: Maximum performance on NVIDIA hardware. When you need the absolute fastest inference and are willing to invest in optimization.
Limitations: NVIDIA-only. Steeper learning curve than higher-level frameworks. Model conversion can be complex.
Text Generation Inference (TGI)
HuggingFace’s TGI balances ease of use with performance, integrating deeply with the HuggingFace ecosystem.
Key features:
- Simple deployment with Docker
- Extensive model support from HuggingFace Hub
- Continuous batching
- Streaming responses
- Good observability and monitoring
Best for: Teams already using HuggingFace models who want quick deployment without extensive optimization work. Great for rapid prototyping to production.
Limitations: Performance sometimes lags specialized frameworks like vLLM or TensorRT-LLM for specific workloads.
llama.cpp
llama.cpp enables running LLMs on CPU with reasonable performance, plus support for Apple Silicon and other non-NVIDIA hardware.
Key features:
- Pure C/C++ implementation with minimal dependencies
- CPU inference with optimized BLAS operations
- Metal (Apple), Vulkan, and OpenCL backends
- Extreme quantization (2-bit, 3-bit)
- Low memory footprint
Best for: Edge deployment, running on consumer hardware, Apple devices, or environments without GPUs. Development and testing on local machines.
Limitations: Slower than GPU inference for large models. Best suited for smaller models (7B-13B parameters) or extremely quantized larger models.
DeepSpeed-Inference
Microsoft’s DeepSpeed-Inference extends their training framework into inference territory with strong multi-GPU support.
Key features:
- Kernel optimizations specifically for transformer architectures
- Tensor and pipeline parallelism
- ZeRO-style optimization for memory efficiency
- Integration with DeepSpeed training
Best for: Teams already using DeepSpeed for training who want consistent infrastructure. Large-scale deployments requiring sophisticated parallelism.
Limitations: Complexity—DeepSpeed has many configuration options. Better suited for researchers and engineers comfortable with deep learning systems.
LM Studio and Ollama
These tools target local model running with user-friendly interfaces.
LM Studio provides a GUI for downloading, configuring, and running models locally. It’s built for end-users rather than developers, but useful for quick testing.
Ollama offers CLI-based local model management with Docker-like simplicity. A single command such as ollama run llama2 has a model running in seconds.
Best for: Individual developers wanting local models for development/testing. Prototyping before cloud deployment. Privacy-sensitive applications requiring local inference.
Limitations: Not designed for production multi-user serving. Performance optimization is limited compared to specialized frameworks.
BentoML and Ray Serve
These MLOps frameworks provide infrastructure for deploying ML models, including LLMs, as production services.
BentoML offers model packaging, versioning, and deployment with strong API integration. It handles the operational concerns: logging, monitoring, A/B testing.
Ray Serve excels at distributed serving, leveraging Ray’s distributed computing capabilities. It can coordinate complex multi-model pipelines.
Best for: Organizations needing complete MLOps workflows, not just inference. When you want infrastructure that handles multiple model types, not just LLMs.
Limitations: Additional abstraction layer adds complexity. For pure LLM serving, specialized tools might offer better performance.
Hardware Considerations
Hardware choices dramatically affect inference performance and economics. The right hardware depends on your workload characteristics and constraints.
GPU Selection
NVIDIA A100/H100 represent top-tier inference performance. 80GB memory handles large models, high-bandwidth memory (HBM) accelerates memory-bound decode, and Tensor Cores provide specialized acceleration. These are expensive but deliver maximum throughput.
NVIDIA L4/L40 offer better cost-performance for inference-specific workloads. Lower power consumption than A100/H100 makes them attractive for large-scale deployment where TCO matters.
NVIDIA T4 remains popular for moderate-scale inference. Older generation but widely available and cost-effective for smaller models or lower-traffic scenarios.
AMD MI250/MI300 provide alternatives with competitive performance and sometimes better memory bandwidth. The software ecosystem is maturing but still lags NVIDIA’s.
Considerations:
- Memory capacity determines maximum model size (before sharding across multiple GPUs)
- Memory bandwidth affects decode phase throughput
- FP16/BF16 Tensor Core support accelerates computation
- NVLink/interconnect bandwidth matters for multi-GPU setups
CPU Inference
Modern CPUs can handle inference for smaller models or lower-throughput scenarios.
AMD EPYC processors with many cores and AVX-512 support provide reasonable inference performance. The ONNX Runtime and OpenVINO optimize well for AMD CPUs.
Intel Xeon with AMX (Advanced Matrix Extensions) accelerates matrix operations. 4th gen Xeon (Sapphire Rapids) and beyond include specific AI acceleration.
Advantages:
- Much lower cost than GPUs
- Already available in existing infrastructure
- No need for specialized GPU environments
- Lower power consumption
Limitations:
- 10-50x slower than GPU inference for large models
- Practical only for smaller models (<13B parameters) or batch processing where latency is relaxed
- Quantization becomes essential (4-bit or 3-bit)
Edge and Mobile Devices
Running LLMs on edge devices opens new possibilities but requires aggressive optimization.
Apple Silicon (M1/M2/M3) provides impressive performance through its unified memory architecture and Neural Engine acceleration. Machines with 16GB+ of unified memory can run 7B-parameter models comfortably.
Qualcomm Snapdragon mobile processors increasingly include NPU (Neural Processing Unit) cores for on-device AI. Models must be tiny (1-3B parameters) and heavily quantized.
Edge TPUs and specialized accelerators like Google Coral offer efficient inference for specific models but require conversion and sometimes training with quantization-aware techniques.
Considerations:
- Memory is severely constrained (4-16GB typical)
- Power consumption critical for battery devices
- Thermal limits prevent sustained high performance
- Quantization to 4-bit or lower nearly mandatory
Specialized AI Accelerators
Google TPUs excel at high-throughput inference with efficient matrix operations. TPU v4 and v5 provide strong performance, especially for Google’s own models.
AWS Inferentia/Trainium chips optimize for inference workloads with lower cost than GPU equivalents. Tight integration with AWS makes deployment straightforward.
Graphcore IPUs offer unique architecture with massive SRAM and explicit graph compilation. Strong for certain workloads but require significant optimization effort.
Cerebras wafer-scale engines provide enormous computational capacity but are expensive and specialized for specific use cases.
Trade-offs: Specialized accelerators often provide better performance per dollar and per watt than GPUs, but software ecosystem maturity varies. You may need custom optimization work and sacrifice flexibility.
Key Metrics for Inference Performance
Measuring inference performance requires tracking several metrics that capture different aspects of system behavior.
Latency Metrics
Time to First Token (TTFT) measures how long until the model generates the first output token. This captures prefill time plus any queueing delay. Critical for user experience—users perceive systems with low TTFT as more responsive.
Time Per Output Token (TPOT) measures average decode speed. Multiply TPOT by expected output length to estimate total generation time. TPOT dominates total latency for longer outputs.
End-to-End Latency is the complete time from request arrival to final response. Includes TTFT, all decode iterations, and any post-processing.
P50, P95, P99 latencies show latency distribution. Median (P50) indicates typical performance; P95 and P99 reveal worst-case behavior. Production systems must optimize for tail latencies—P99 often matters more than average.
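Computing these from raw measurements is a one-liner with numpy; the sample latencies below are made up for illustration.

import numpy as np

def latency_report(latencies_ms):
    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
    return {"p50_ms": p50, "p95_ms": p95, "p99_ms": p99}

print(latency_report([120, 135, 150, 180, 240, 900]))   # note how one slow request drives P99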
Throughput Metrics
Requests Per Second (RPS) measures system capacity—how many requests the system handles per second under load.
Tokens Per Second (TPS) counts total output tokens generated per second across all requests. This normalizes for variable output lengths.
GPU Utilization shows what percentage of GPU compute capacity is actively used. Healthy inference systems achieve 70-80%+ utilization through effective batching.
Batch Size indicates average number of requests processed simultaneously. Larger batches generally improve throughput but increase per-request latency.
Cost Metrics
Cost Per 1000 Tokens normalizes costs across different deployments and providers. Industry standard for pricing LLM API access.
Total Cost of Ownership (TCO) includes hardware depreciation, power, cooling, and operational overhead—not just compute costs.
GPU Utilization vs Cost reveals efficiency. Low-utilization GPUs waste money; optimizing batching and scheduling directly impacts economics.
Quality Metrics
Token Acceptance Rate (for speculative decoding) shows how often draft model predictions are accepted. Higher rates mean better speedup.
Quantization Accuracy measures quality degradation from quantization. Typically evaluated on benchmarks like MMLU, HellaSwag, or task-specific evaluations.
Cache Hit Rate (for prefix caching) indicates how often shared prompt prefixes avoid redundant computation.
Production Deployment Best Practices
Moving from prototype to production requires addressing concerns beyond inference speed.
Model Versioning and Management
Model Registry tracks model versions, quantization configurations, and associated metadata. MLflow, Weights & Biases, or custom solutions provide this foundation.
A/B Testing Infrastructure enables comparing model variants in production. Route a percentage of traffic to each variant and measure performance differences.
Rollback Capability allows quick reversion when new models underperform. Keep previous versions warm and ready to take traffic.
Model Validation before deployment should include:
- Accuracy evaluation on held-out benchmarks
- Latency profiling under representative load
- Safety testing for harmful outputs
- Edge case handling verification
Scaling and Auto-scaling
Horizontal Scaling adds more inference instances as load increases. Kubernetes or orchestration platforms automate this process.
Vertical Scaling provisions larger instances with more GPUs or memory. Less flexible but simpler than managing distributed state.
Autoscaling Metrics should incorporate:
- Request queue depth (scale up when requests are waiting)
- GPU utilization (scale up approaching 90%+ sustained utilization)
- Latency P95/P99 (scale up when tail latencies degrade)
Cold Start Mitigation keeps minimum capacity running even during low traffic. Starting GPU instances and loading large models takes minutes—unacceptable for sudden traffic spikes.
Geographic Distribution deploys models in multiple regions to reduce latency for global users and provide fault tolerance.
Monitoring and Observability
System Metrics:
- GPU/CPU utilization, memory usage, power consumption
- Request rate, latency distributions, error rates
- Network bandwidth usage for distributed setups
Business Metrics:
- Cost per request/token
- User satisfaction signals (early abandonment, retries)
- Revenue attribution for different model versions
Alerting Thresholds:
- P99 latency exceeding SLA
- Error rate above baseline
- GPU out-of-memory events
- Throughput drop indicating issues
Distributed Tracing tracks requests across multiple services (router → inference → post-processing) to identify bottlenecks.
Error Handling and Reliability
Graceful Degradation: When primary models are unavailable or overloaded, fallback to faster (smaller) models or cached responses rather than failing completely.
Timeout Management: Set reasonable timeouts for generation. For interactive applications, 30-60 seconds is typical; batch jobs may allow much longer.
Retry Logic: Implement exponential backoff for transient failures. Distinguish retryable errors (temporary overload) from permanent failures (invalid input).
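A minimal sketch of that policy, with the caller supplying the function to invoke and a predicate that classifies errors as retryable; delays and attempt counts are illustrative.

import random
import time

def call_with_retries(call, is_retryable, max_attempts=4, base_delay=0.5):
    # Exponential backoff with jitter; permanent failures are re-raised immediately.
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as err:
            if attempt == max_attempts - 1 or not is_retryable(err):
                raise
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))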
Circuit Breakers: Automatically stop sending requests to failing instances, giving them time to recover.
Health Checks: Lightweight endpoints that verify instances can serve requests. Load balancers use these to route traffic only to healthy instances.
Security Considerations
Input Validation: Check prompt lengths, filter potentially malicious inputs, apply content moderation before inference.
Rate Limiting: Per-user rate limits prevent abuse and ensure fair resource allocation. Implement both request-level and token-level limits.
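A token bucket is one common way to enforce both levels; the sketch below charges an arbitrary cost per call, so generated-token counts can be charged against the same per-user bucket (rates are illustrative).

import time

class TokenBucket:
    def __init__(self, rate_per_sec=10.0, capacity=100.0):
        self.rate, self.capacity = rate_per_sec, capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost=1.0):
        # Refill based on elapsed time, then spend if the remaining budget covers the cost.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False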
Output Filtering: Screen generated content for PII, malicious code, or policy-violating content before returning to users.
Model Protection: Prevent model extraction attacks through:
- Output randomness (avoid deterministic greedy decoding for public APIs)
- Query limits per user
- Anomaly detection for suspicious access patterns
API Authentication: Secure API access with tokens, keys, or OAuth. Track usage per credential for billing and abuse detection.
Prompt and Request Optimization
Prompt Engineering for Efficiency:
- Minimize token usage while maintaining clarity
- Use structured prompts consistently for prefix caching benefits
- Request shorter outputs when possible
Streaming Configuration: Enable streaming for long-running requests to improve perceived latency and allow early user feedback.
Temperature and Sampling Tuning: Use lower temperature (0.2-0.5) for deterministic tasks to reduce output variance, and higher temperature (0.7-1.0) for creative tasks.
Context Management: For multi-turn conversations:
- Summarize or truncate history to fit context windows
- Implement sliding window approaches
- Cache common conversation starters
Debugging and Profiling Inference Performance
When inference doesn’t meet performance targets, systematic debugging reveals bottlenecks.
Profiling Tools
NVIDIA Nsight Systems provides timeline views of CPU and GPU activity. It shows kernel launches, memory transfers, and identifies gaps where GPUs sit idle.
PyTorch Profiler instruments PyTorch code, reporting time spent in each operation. Particularly useful for identifying inefficient operators.
TensorBoard Profiling integrates with TensorFlow/PyTorch to visualize computation graphs and operation timing.
Custom Instrumentation: Add timing code around critical sections:
import time

# Assumes model, input_ids, and num_tokens are already defined for your setup.
start = time.perf_counter()
# prefill
output = model(input_ids)
prefill_time = time.perf_counter() - start

decode_times = []
for _ in range(num_tokens):
    start = time.perf_counter()
    # decode one token
    token = model.generate_next(...)
    decode_times.append(time.perf_counter() - start)

Common Bottlenecks
Memory Bandwidth Saturation: Decode phase loads KV cache repeatedly. Solutions include quantization (less data to load) or better hardware (HBM3 vs HBM2).
Small Batch Sizes: GPUs thrive on parallel work. Single-request inference may utilize <10% of GPU compute. Enable continuous batching to keep GPUs fed.
Python Overhead: Eager execution in Python adds significant overhead. Compile models with TorchScript, TensorRT, or ONNX Runtime.
Inefficient Data Loading: CPU-to-GPU transfers can bottleneck if not pipelined correctly. Use asynchronous transfers and pinned memory.
Poor KV Cache Management: Fragmentation or inefficient allocation causes OOM errors even when sufficient memory exists. Use PagedAttention or implement careful memory pooling.
Suboptimal Quantization: Naive quantization can lose significant accuracy. Use calibration datasets and advanced methods like GPTQ or AWQ.
Latency Analysis Workflow
- Measure baseline: Profile end-to-end latency and identify prefill vs decode contribution
- Isolate components: Time tokenization, prefill, each decode iteration, detokenization separately
- Check batching: Verify batch sizes match expectations and continuous batching is working
- Examine GPU utilization: Low utilization indicates feeding issues; high utilization with slow performance suggests compute bottlenecks
- Profile memory: Check memory bandwidth usage and KV cache efficiency
- Compare to theoretical peaks: Calculate theoretical maximum throughput given hardware specs; significant gaps indicate optimization opportunity
Regression Testing
Performance Benchmarks: Maintain suites of representative requests covering diverse prompt lengths and output requirements. Run regularly to catch regressions.
Continuous Integration: Automate performance testing in CI/CD pipelines. Block deployments that regress key metrics beyond thresholds.
Historical Tracking: Log performance metrics over time to identify trends and correlate changes with code/configuration updates.
Future Directions in Inference Optimization
Inference optimization continues evolving rapidly. Several trends promise significant improvements.
Mixture of Experts (MoE)
MoE models activate only subsets of parameters per input, achieving the capacity of large models with the computational cost of smaller ones.
Routing mechanisms direct each input to relevant experts. This reduces FLOPs but creates irregular memory access patterns that challenge efficient batching.
Recent models like Mixtral demonstrate that MoE can be practical for inference, though specialized infrastructure is required to handle expert routing efficiently.
Speculative Decoding Evolution
Self-speculative decoding uses early layers of the target model as the draft model, eliminating need for separate models.
Multi-token prediction models trained to predict multiple future tokens simultaneously enable more aggressive speculation.
Adaptive speculation adjusts draft length based on prediction confidence, maximizing accepted tokens while minimizing wasted computation.
Hardware-Software Co-design
Custom inference accelerators designed specifically for transformer architectures will likely proliferate, optimizing for memory bandwidth and specific operation patterns.
Sparse attention mechanisms in hardware could enable longer contexts efficiently by computing only relevant attention scores.
Processing-in-memory architectures reduce data movement by performing computations where data resides, directly addressing the memory bandwidth bottleneck.
Model Architecture Innovation
Linear attention alternatives replace quadratic self-attention with linear-complexity mechanisms, enabling much longer contexts without proportional computational increases.
State space models (SSMs) like Mamba offer constant memory and compute per token regardless of context length, though with different capability trade-offs than transformers.
Hybrid architectures might combine transformers for reasoning with more efficient mechanisms for context processing.
Learned Optimization
Neural compilers use machine learning to generate optimized kernels, potentially surpassing hand-tuned implementations.
Learned scheduling applies reinforcement learning to batch scheduling and resource allocation decisions.
Adaptive quantization adjusts precision dynamically based on input characteristics and quality requirements.
Practical Guidelines for Getting Started
If you’re building an inference system, here’s how to approach it systematically.
Start with Existing Tools
Don’t build from scratch. Use vLLM, TGI, or similar frameworks that handle complexity for you. Focus on your application logic, not infrastructure.
Initial choices:
- vLLM for multi-user serving with good defaults
- TGI if you’re deeply invested in HuggingFace ecosystem
- Ollama for local development and testing
Profile Before Optimizing
Measure actual performance before applying optimizations. Premature optimization wastes time. Profile to find real bottlenecks.
Baseline measurements needed:
- TTFT and TPOT for representative requests
- GPU utilization during typical load
- Memory usage (model weights, KV cache, activations)
- Cost per 1000 tokens
Optimize in Priority Order
Address bottlenecks by impact:
- Enable continuous batching if not already active—often the single biggest throughput improvement
- Apply quantization (4-bit or 8-bit) if memory-constrained
- Implement prefix caching for common prompt patterns
- Consider speculative decoding if latency-critical and you have compute budget
- Explore advanced techniques like Flash Attention 3 or custom kernels only if needed
Establish Monitoring Early
Set up observability from the start. You can’t optimize what you don’t measure.
Essential metrics:
- Request latency (P50, P95, P99)
- Throughput (requests/sec, tokens/sec)
- Error rates and types
- Resource utilization (GPU, memory, network)
- Cost tracking
Plan for Scale
Even if starting small, design for growth:
- Use load balancers that support adding instances
- Implement health checks for automated management
- Design APIs to support versioning
- Build monitoring to detect scaling needs early
Iterate Based on Data
Make changes incrementally and measure impact. A/B test significant modifications. Trust data over intuition.
Conclusion
LLM inference engines represent a complex intersection of algorithms, systems engineering, and hardware optimization. The architecture must balance conflicting demands: low latency, high throughput, cost efficiency, and quality maintenance.
Understanding the two-phase inference process (prefill and decode), recognizing that they have different bottlenecks, shapes effective optimization strategy. Memory management—particularly KV cache optimization—determines what models you can deploy and how many requests you can serve. Compute optimizations like continuous batching and kernel fusion transform theoretical hardware capability into realized throughput.
The tooling landscape provides solid foundations. vLLM, TensorRT-LLM, TGI, and others encapsulate years of optimization work, making efficient inference accessible without rebuilding everything from scratch. Choose tools matching your requirements: throughput or latency, cloud or edge, flexibility or maximum performance.
Production deployment extends beyond raw speed. Monitoring, scaling, error handling, and security separate proofs-of-concept from reliable systems. Measure continuously, optimize based on evidence, and design for inevitable growth and change.
As models grow larger and applications more demanding, inference efficiency becomes increasingly critical. The techniques covered here—from quantization to speculative decoding to specialized hardware—will continue evolving. Stay current with new developments, but master fundamentals first. A well-architected inference system built on solid principles will adapt as the technology advances.
The economics of AI applications depend directly on inference efficiency. Understanding these systems deeply gives you the capability to build responsive, cost-effective, scalable AI products. Start with existing tools, measure relentlessly, optimize systematically, and iterate based on real-world performance data.