You might have encountered performance issues when running LLMs (Large Language Models) in a production environment. In that case, consider using an inference engine or server that handles many of these issues for you out of the box. What follows is an overview of LLM inference and serving.

Several LLM inference engines and servers are available for deploying and serving LLMs in production. The following are among the most prominent:
- vLLM
- LightLLM
- LMDeploy
- SGLang
- OpenLLM
- Triton Inference Server with TensorRT-LLM
- Ray Serve
- Hugging Face – Text Generation Inference (TGI)
- DeepSpeed-MII
- CTranslate2
- BentoML
- MLC LLM
Throughput vs. Latency
Throughput and latency are two important metrics for evaluating LLM inference and serving. Throughput refers to the number of output tokens an LLM can generate per second. Latency refers to the time it takes for the model to process an input and generate a response. A key latency metric is “Time to First Token” (TTFT): the time it takes for the model to emit the first token of its response after receiving a prompt.
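As a rough illustration, the sketch below measures TTFT and output-token throughput for any iterable that yields generated tokens one at a time; this streaming interface is an assumption for illustration, not tied to any particular engine.

```python
import time

def measure_ttft_and_throughput(token_stream):
    """Measure Time to First Token (TTFT) and output-token throughput.

    `token_stream` is assumed to be any iterable that yields generated
    tokens one at a time (a hypothetical streaming interface)."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # latency of the first token
        n_tokens += 1
    total = time.perf_counter() - start
    throughput = n_tokens / total if total > 0 else 0.0  # tokens per second
    return ttft, throughput

# Example with a dummy stream standing in for a real engine's output:
ttft, tps = measure_ttft_and_throughput(iter(["Hello", ",", " world"]))
print(f"TTFT: {ttft:.6f}s, throughput: {tps:.1f} tokens/s")
```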
LLM inference engines and servers are intended to optimise LLM memory utilisation and performance in production. They help you achieve high throughput and low latency, ensuring that your LLMs can handle a large number of requests while responding quickly. Depending on your specific use case, additional factors may also influence which engine or server you select.
Inference Engine vs. Inference Server
Inference engines run the models and are responsible for everything needed for the generation process. Inference servers handle incoming and outgoing HTTP and gRPC requests from your application's end users, and typically expose metrics for measuring the deployment performance of your LLM.
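To make the distinction concrete, here is a minimal sketch that uses vLLM's offline `LLM` API as the engine and an OpenAI-compatible HTTP endpoint as the server. The model name and the localhost URL are placeholders, and the server half assumes you have already deployed such an endpoint.

```python
# Engine: run the model directly in-process (vLLM's offline API).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # model name is just an example
params = SamplingParams(max_tokens=64, temperature=0.8)
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)

# Server: talk to an already-deployed, OpenAI-compatible HTTP endpoint.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # assumed local deployment
    json={
        "model": "facebook/opt-125m",
        "prompt": "Explain KV caching in one sentence.",
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["text"])
```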
Techniques/Terminology Used Across These Frameworks
- KV cache
- PagedAttention
- Batching
- Support for quantisation
- LoRA (Low-Rank Adaptation): a technique that freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks (see the sketch at the end of this section).
- Tool calling (also known as function calling): a technique that enables LLMs to request information from external tools. This lets them obtain information they were not trained on or perform actions beyond their own capabilities (see the sketch at the end of this section).
- Reasoning models: e.g. DeepSeek R1
- Structured Outputs
- Automatic Prefix Caching (APC): a technique that speeds up inference by caching the KV cache of existing queries so that a new query sharing the same prefix can reuse it, skipping the prefill computation for the shared portion (see the sketch at the end of this section).
- Speculative Decoding [4][5]
- Chunked Prefill
- Prompt Adapter
- Beam Search
- Guided decoding
- AsyncLM: improves an LLM's operational efficiency by enabling it to generate and execute function calls concurrently (see the sketch at the end of this section).
- Prompt logprobs (logarithm of probability)
- kserve: Standardized Serverless ML Inference Platform on Kubernetes
- kubeai: AI Inference Operator for Kubernetes. The easiest way to serve ML models in production. Supports VLMs, LLMs, embeddings, and speech-to-text.
- llama-stack: Composable building blocks to build Llama Apps
Many of the terms and techniques above have been left without elaboration; they are pointers for exploring in detail. A few of them are illustrated with minimal sketches below.
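First, a minimal sketch of the LoRA idea for a single linear layer in PyTorch: the pre-trained weight is frozen and only a low-rank update (matrices A and B of rank r) is trained. The class name and hyperparameters are illustrative, not taken from any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: the pre-trained weight is frozen and a
    trainable low-rank update B @ A (rank r) is added to its output."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze pre-trained weights
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus low-rank trainable path.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(768, 768)
y = layer(torch.randn(2, 768))  # only lora_A and lora_B receive gradients
```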
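Next, a library-agnostic sketch of the tool-calling loop. The message shape (`"tool_call"` with a name and JSON arguments) and the `get_weather` tool are assumptions chosen for illustration; real APIs define their own schemas, but the flow of executing the requested function and feeding its result back to the model is the same.

```python
import json

# Hypothetical tool the model may ask us to call.
def get_weather(city: str) -> str:
    return f"Sunny and 22 degrees C in {city}"

TOOLS = {"get_weather": get_weather}

def handle_model_turn(model_message: dict, messages: list) -> None:
    """If the model's reply requests a tool call (assumed here to arrive as
    {"tool_call": {"name": ..., "arguments": "<json>"}}), execute the tool
    and append the result so the model can use it on its next turn."""
    call = model_message.get("tool_call")
    if call is None:
        return
    fn = TOOLS[call["name"]]
    args = json.loads(call["arguments"])
    result = fn(**args)
    messages.append({"role": "tool", "name": call["name"], "content": result})

# Example: the model asked for the weather in Pune.
messages = []
handle_model_turn(
    {"tool_call": {"name": "get_weather", "arguments": '{"city": "Pune"}'}},
    messages,
)
print(messages)
```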
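For Automatic Prefix Caching, the sketch below assumes vLLM's offline API with its `enable_prefix_caching` option; the model name and prompts are placeholders. Two requests that share a long common prefix let the second one reuse the KV cache blocks computed for the first.

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets requests that share a long common prefix
# (e.g. the same system prompt plus the same document) reuse its KV cache.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

shared_prefix = "You are a helpful assistant. Document: <long document text> "
params = SamplingParams(max_tokens=32)

# The first call populates the cache for the shared prefix; the second call
# reuses those KV blocks, so only its distinct suffix needs a fresh prefill.
llm.generate([shared_prefix + "Question: summarise it."], params)
llm.generate([shared_prefix + "Question: list the key points."], params)
```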
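Finally, a rough sketch of the idea behind concurrent function calling: instead of executing the model's tool calls one after another, pending calls run at the same time. The two tools here are hypothetical stand-ins; this illustrates the concurrency pattern only, not the AsyncLM system itself.

```python
import asyncio

# Hypothetical tools, standing in for calls the model has emitted.
async def search_web(query: str) -> str:
    await asyncio.sleep(1.0)   # simulated I/O latency
    return f"results for {query!r}"

async def query_database(sql: str) -> str:
    await asyncio.sleep(1.0)
    return f"rows for {sql!r}"

async def main():
    # Instead of waiting for one call to finish before issuing the next,
    # the pending calls run concurrently and finish in ~1 s, not ~2 s.
    results = await asyncio.gather(
        search_web("LLM inference engines"),
        query_database("SELECT * FROM models"),
    )
    print(results)

asyncio.run(main())
```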