1. Introduction
LLM inference and serving refer to the process of deploying large language models and making them accessible for use — whether locally for personal projects or in production for large-scale applications.
Depending on your needs, you may opt for a lightweight local deployment or a robust, enterprise-grade solution. The right choice depends on factors like performance, scalability, latency, and infrastructure availability.
2. LLM Inference & Serving Architecture
The following is an overview of the architecture for LLMs’ inference and serving. This architecture represents how an application interacts with a deployed Large Language Model (LLM) to generate predictions (inference). It highlights the role of inference servers, inference engines, and the hardware layer.

Figure: Typical architecture of LLM Inference and Serving
2.1. Application Layer
- The Application is the client (e.g., a chatbot, API consumer, or internal tool) that sends requests to the inference server.
- These requests are typically made over HTTP or gRPC for low-latency, high-performance communication.
- Example: A frontend UI sends a user’s prompt to the server for processing.
2.2 Inference Server
The Inference Server acts as the bridge between the application and the model.
It contains three main responsibilities as shown in the diagram:
2.2.1 Query Queue Scheduler
- Purpose: Manages incoming queries in a queue to avoid overwhelming the model.
- Function:
- Receives requests from the application.
- Places them in a queue.
- Uses a scheduler to decide when and how to process them.
- Batching Opportunity: The scheduler can group multiple requests together into a batch, improving GPU utilisation and throughput.
2.2.2. Metrics Module
- Purpose: Collects real-time statistics on system performance.
- Metrics Tracked:
- Throughput: Tokens generated per second.
- Latency: Total response time.
- TTFT (Time to First Token): Time from request to first token generated.
- Resource utilisation (GPU/CPU/memory).
- This data is essential for monitoring, scaling, and debugging.
2.3. Inference Engine
The Inference Engine is the core computation unit that runs the LLM.
2.3.1 Batching
- Groups multiple queued queries into a single execution batch on the GPU.
- Benefits:
- Reduces overhead from individual GPU calls.
- Improves parallel processing efficiency.
2.3.2 Model Execution
- Runs the LLM model itself.
- The model takes the batched input and generates output tokens.
- Can utilise optimisations such as:
- KV Caching for faster token generation.
- Quantisation for memory efficiency.
- Speculative Decoding for speed.
2.3.3 Query Response
- Gathers model outputs.
- Splits them back into individual responses for each original request.
- Sends results back to the application over HTTP/gRPC.
2.4 Hardware Layer
- GPU/CPU hardware actually runs the model computation.
- For LLMs, GPUs (often with large VRAM) are preferred for:
- Parallel processing of large tensor computations.
- Efficient handling of multi-batch workloads.
- CPUs can be used for smaller models or less latency-sensitive tasks.
2.5. Workflow Summary
- Application sends HTTP/gRPC request with prompt.
- Query Queue Scheduler stores and batches incoming requests.
- Batching Module groups requests for efficient GPU execution.
- Model generates predictions (tokens).
- Query Response sends formatted results back.
- Metrics Module continuously tracks performance.
- Results return to the application.
3. Evaluation of LLM Inference and Serving
When evaluating LLM inference (the process of generating outputs from a trained large language model) and LLM serving (the infrastructure and software that delivers model predictions to end users), the two primary performance metrics are:
- Throughput – Measures the total volume of output tokens generated per second.
- Example: An LLM serving system producing 2,000 tokens/sec can handle more concurrent requests or generate longer responses faster than one producing 500 tokens/sec.
- High throughput is critical for scenarios like batch inference, multi-user chatbots, or real-time content generation in high-traffic applications.
- Throughput depends on factors such as:
- Model size and architecture (e.g., LLaMA vs. GPT-style transformers).
- GPU/TPU hardware capabilities and memory bandwidth.
- Request batching efficiency.
- Quantisation and weight compression techniques.
- Latency – Measures how quickly a model responds to an individual request.
- Key metric: Time to First Token (TTFT) – the delay between receiving a prompt and starting to produce output.
- After TTFT, token generation latency is often measured as milliseconds per token (ms/token).
- Low latency is especially important for interactive applications like real-time chat, virtual assistants, and code autocompletion.
- Latency can be influenced by:
- Model loading time (cold start vs. warm start).
- Network communication overhead.
- Prompt length (longer prompts mean longer context processing time).
- Decoding strategy (e.g., greedy search, beam search, sampling).
3.1 Optimisation Strategies for High Throughput and Low Latency
LLM inference engines and serving frameworks focus heavily on memory utilisation and computational efficiency in production:
- Model Optimisation
- Quantisation – Reduce precision (e.g., FP16 → INT8 or 4-bit) to speed up inference with minimal accuracy loss.
- Pruning – Remove redundant weights to reduce model size.
- Speculative Decoding – Use a smaller model to “guess” future tokens and confirm with the main model.
- LoRA / PEFT – Use parameter-efficient fine-tuning to avoid reloading huge models.
- Serving Architecture
- Request Batching – Combine multiple user queries into a single forward pass for better GPU utilisation.
- Pipeline Parallelism – Split model layers across multiple GPUs.
- Tensor Parallelism – Split the tensor computations across devices.
- KV Cache Reuse – Store intermediate attention key-value pairs to avoid recomputation in autoregressive decoding.
- Infrastructure-Level Improvements
- Use GPUs with high VRAM and bandwidth (e.g., NVIDIA A100/H100).
- Place inference servers close to end users to reduce network latency (edge inference).
- Use asynchronous request handling and efficient scheduling.
3.2 Beyond Throughput and Latency – Additional Considerations
While throughput and latency form the backbone of LLM performance evaluation, real-world deployments often require additional criteria to ensure stability, scalability, and cost-effectiveness.
3.2.1 Scalability
- What it means: The ability of the inference infrastructure to handle sudden spikes in traffic without sacrificing speed or accuracy.
- Why it matters:
- LLM-based customer support systems may experience traffic surges during product launches.
- AI coding assistants can see unpredictable query bursts during hackathons or exams.
- How it’s achieved:
- Auto-scaling mechanisms in Kubernetes (HPA/VPA) or serverless GPU backends.
- Load balancing across multiple GPU/TPU nodes.
- Model sharding for extremely large models (e.g., Megatron-LM, DeepSpeed ZeRO-3).
3.2.2 Cost Efficiency
- What it means: Delivering optimal performance per dollar spent, especially important in pay-per-token or per-hour GPU rental models.
- Why it matters:
- Cloud GPU instances (A100, H100) are expensive; inefficient deployments can burn budgets fast.
- Inference for large models may cost more than fine-tuning if poorly optimized.
- Strategies:
- Use quantization (e.g., INT8, FP16) to reduce GPU memory usage and increase batch size.
- Employ dynamic batching to process multiple requests simultaneously.
- Choose spot/preemptible GPU instances for non-critical workloads.
3.2.3 Ease of Deployment
- What it means: How quickly and reliably the LLM stack can be set up and updated.
- Why it matters:
- Shorter deployment cycles reduce time-to-market.
- DevOps teams prefer infrastructure that integrates into existing CI/CD pipelines.
- Implementation best practices:
- Package inference servers using Docker.
- Deploy using Helm charts for Kubernetes clusters.
- Integrate with GitHub Actions / GitLab CI for automated rollouts.
3.2.4 Fault Tolerance & Reliability
- What it means: The ability of the system to keep running even if one or more nodes fail.
- Why it matters:
- LLM applications like healthcare assistants or financial chatbots can’t afford downtime.
- Techniques:
- Redundant model replicas with active-active failover.
- Checkpointing model states so recovery is quick.
- Health checks and graceful degradation (e.g., fall back to a smaller, local model if GPU fails).
3.2.5 Multi-Model Support
- What it means: Running different LLMs (or different versions of the same LLM) simultaneously.
- Why it matters:
- Some applications may require domain-specific models alongside general-purpose ones.
- Allows A/B testing for performance evaluation before production rollout.
- Examples:
- vLLM and Triton Inference Server can host multiple models and route requests accordingly.
3.2.6 Security & Compliance
- What it means: Protecting data in transit and ensuring compliance with legal and organizational standards.
- Why it matters:
- LLMs often process sensitive data (e.g., PII, financial records, medical notes).
- Non-compliance with regulations like GDPR, HIPAA, or SOC 2 can lead to heavy penalties.
- Security Measures:
- TLS encryption for all API calls.
- Role-based access control (RBAC) and API key authentication.
- Audit logs for every request to track usage.
4. Prominent Products
When selecting an LLM inference and serving framework, the decision often hinges on performance, scalability, hardware compatibility, and ease of integration with existing workflows. Below is an expanded look at some of the most prominent solutions in the space.
4.1 vLLM
- Origin & Background: Developed at the Sky Computing Lab, UC Berkeley, vLLM quickly gained popularity for its advanced scheduling and memory management optimizations.
- Key Strengths:
- PagedAttention: An innovative attention mechanism for faster inference and reduced memory footprint.
- Supports continuous batching, making it ideal for serving multiple requests with minimal latency spikes.
- Enterprise Adoption: Neural Magic’s nm-vllm (post Red Hat acquisition) adds enterprise-grade optimizations like quantization, CPU acceleration, and Kubernetes deployment tooling.
4.2 LightLLM
- Language & Approach: Pure Python implementation for simplicity and developer friendliness.
- Key Strengths:
- Extremely lightweight — minimal dependencies.
- Designed for fast setup and low-resource deployment scenarios.
- Ideal Use Cases: Edge devices, personal projects, or small-scale cloud deployments where quick prototyping is needed.
4.3 LMDeploy
- Focus: Deployment toolkit for compressing, quantizing, and serving large models.
- Key Strengths:
- Multi-backend support (ONNX, TensorRT, PyTorch).
- Integrated with MPT, LLaMA, and other popular architectures.
- Best Fit: Enterprises that want to reduce model size while keeping reasonable accuracy.
4.4 SGLang
- Scope: Targets both text and vision-language models.
- Key Strengths:
- Supports fine-tuned and multi-modal model serving.
- Optimized for GPU and distributed setups.
- Typical Users: Research teams and startups building chatbots with multi-modal capabilities.
4.5 OpenLLM
- Mission: Make running any open-source LLM as easy as running
openllm start
. - Key Strengths:
- OpenAI-compatible APIs, enabling drop-in replacement in existing code.
- CLI and Docker-friendly deployment.
- Example Models Supported: Llama 3.3, Qwen2.5, Phi-3, Mistral, and more.
4.6 Triton Inference Server with TensorRT-LLM
- Developed by: NVIDIA.
- Key Strengths:
- TensorRT-LLM optimizes transformer inference with CUDA Graphs, FP8 precision, and kernel fusion.
- Triton supports multi-framework serving (PyTorch, TensorFlow, ONNX) in one server.
- Best Fit: Enterprises with NVIDIA GPU clusters aiming for maximum throughput and lowest latency.
4.7 Ray Serve
- From: The Ray ecosystem.
- Key Strengths:
- Horizontal scaling for ML model APIs.
- Model composition — chain multiple models with Python code.
- Typical Scenario: Building a complex AI service combining embeddings, retrieval, and multiple LLM calls.
4.8 Hugging Face – Text Generation Inference (TGI)
- Purpose: Highly optimized serving backend for text generation models.
- Key Strengths:
- Supports FlashAttention, tensor parallelism, and streaming output.
- Native integration with Hugging Face Hub.
- Use Case: Deploying HF-hosted models in enterprise or local infrastructure.
4.9 DeepSpeed-MII
- From: Microsoft’s DeepSpeed team.
- Key Strengths:
- Specializes in low-latency, high-throughput serving with quantization and sparsity support.
- Leverages DeepSpeed inference optimizations like kernel fusion and ZeRO-offload.
- Ideal For: Ultra-large models that need aggressive GPU memory optimizations.
4.10 CTranslate2
- Focus: Transformer inference for translation and seq2seq tasks.
- Key Strengths:
- CPU and GPU support, quantization for small footprint.
- Exceptional speed for encoder-decoder models.
- Typical Users: Machine translation systems, document summarization.
4.11 BentoML
- Scope: Unified inference and deployment framework for any model type.
- Key Strengths:
- Abstracts away serving details; works with ML frameworks like PyTorch, TensorFlow, scikit-learn.
- Easy API packaging and containerization.
- Great For: ML engineers who want one platform for models across use cases.
4.12 MLC LLM
- Mission: Compile and deploy LLMs anywhere — from cloud GPUs to mobile devices.
- Key Strengths:
- Uses ML compilation techniques (TVM) to optimize for target hardware.
- Browser and mobile-friendly deployment options.
- Notable Edge: Run LLMs in WebAssembly or Metal (Apple) without server dependencies.
4.13 Others
- Ollama: Local LLM runner with easy model downloading and CLI interaction.
- WebLLM: Runs LLM inference entirely in the browser using WebGPU.
- LM Studio: Desktop app to run and test local models.
- GPT4ALL: Open-source, offline-capable chatbot environment.
- llama.cpp: Lightweight C++ implementation for running LLaMA-family models on CPUs.
5. LLM Inference & Serving Frameworks – Comparison Table
Framework / Tool | Language & Ecosystem | Deployment Model | Key Strengths | Optimisations | Best For |
---|---|---|---|---|---|
vLLM | Python (PyTorch), UC Berkeley origin | Local & Cloud | High throughput, easy API | PagedAttention, efficient KV cache | General-purpose high-speed serving |
LightLLM | Python | Local, Cloud | Lightweight, minimal dependencies | Async IO, KV cache, batching | Resource-constrained environments |
LMDeploy | Python + C++ | Local, Edge, Cloud | Model compression + serving toolkit | Quantisation, distillation | Deploying smaller/faster LLMs |
SGLang | Python | Local, Cloud | Fast for LLM & VLM | Optimised batching, KV cache | Multimodal LLM serving |
OpenLLM | Python (BentoML) | Local, Cloud | OpenAI-compatible APIs for any model | API standardisation | Developers needing API parity with OpenAI |
Triton + TensorRT-LLM | C++, Python (NVIDIA) | On-prem & Cloud | Enterprise-grade GPU serving | TensorRT optimisations, FP8/INT8 quant | GPU-heavy workloads |
Ray Serve | Python | Cloud, Kubernetes | Multi-model composition | Scaling, autoscheduling | Complex inference pipelines |
Hugging Face TGI | Python | Local, Cloud | Optimised for text generation | Speculative decoding, quantisation | Hugging Face model ecosystem |
DeepSpeed-MII | Python | Cloud & On-Prem | High throughput, low latency | Tensor parallelism, quantisation | Large model production serving |
CTranslate2 | C++, Python | Local | Fast Transformer inference | Quantisation, CPU optimised | CPU-based serving |
BentoML | Python | Local, Cloud | Unified inference for any model | API packaging, scaling | Teams serving mixed model types |
MLC LLM | Python, Rust | Local, Mobile, Web | Runs anywhere (cross-platform) | ML compilation | Deploying LLMs to edge/mobile |
Ollama | CLI | Local (Mac/Linux) | Simple local LLM serving | Pre-packaged models | Non-technical local usage |
WebLLM | JavaScript | Browser | No server needed | WebGPU execution | Running LLMs fully in-browser |
LM Studio | GUI App | Local | Easy local model download & run | Built-in chat interface | Offline local use |
GPT4ALL | Python, GUI | Local | Wide model support | CPU optimised | Privacy-focused offline inference |
llama.cpp | C++ | Local | Lightweight, portable | Quantisation to 4-bit | Running LLMs on low-spec hardware |
6. Important Terminologies
6.1 KV Cache (Key-Value Cache)
Key-value caching is a technique used in transformer models where the key and value matrices computed in earlier decoding steps are stored for reuse during subsequent token generation.
- Benefits: Reduces redundant computation, leading to faster inference times.
- Trade-offs: Increased memory consumption, especially for long context windows.
- Optimizations:
- Cache Invalidation – removing unused portions of the cache when switching contexts.
- Cache Reuse – sharing parts of the cache between similar prompts or multi-turn conversations.
- Quantized Cache – storing KV cache in lower precision (e.g., FP16, INT8) to save memory.
6.2 PagedAttention
A memory management strategy for KV cache inspired by virtual memory paging in operating systems. Instead of storing keys and values contiguously, they are stored in fixed-size memory pages, allowing flexible allocation and reuse of GPU memory.
- Advantages:
- Efficient use of GPU VRAM.
- Avoids large contiguous memory allocations.
- Implementations: Used in libraries like vLLM to enable very large batch sizes.
6.3 Batching
Combining multiple inference requests into a single forward pass to improve GPU utilization and throughput.
- Types:
- Static Batching – fixed batch size; efficient but may introduce latency.
- Dynamic Batching – requests are grouped on the fly based on arrival time and sequence length.
- Key Libraries: Hugging Face TGI, vLLM, TensorRT-LLM, Ray Serve.
6.4 Support for Quantisation
Reducing model precision to decrease memory footprint and increase inference speed.
- Common Precisions: FP32 → FP16 → BF16 → INT8 → INT4.
- Benefits: Lower memory bandwidth usage, higher cache efficiency.
- Popular Methods:
- GPTQ – post-training quantization.
- AWQ – activation-aware weight quantization.
- QLoRA – quantized low-rank adapters.
6.5 LoRA (Low-Rank Adaptation)
A fine-tuning technique that freezes pre-trained weights and injects small trainable rank decomposition matrices into transformer layers. This drastically reduces the number of trainable parameters for downstream tasks, making fine-tuning cost-efficient.
6.6 Tool Calling / Function Calling
Allows an LLM to invoke external APIs or tools when it needs information it was not trained on, or to take actions in the real world.
- Example: An LLM calling a weather API when asked “What’s the weather in Mumbai right now?”.
6.7 Reasoning Models
Models optimized for multi-step problem solving with intermediate reasoning chains.
- Examples: DeepSeek-R1, OpenAI o1-preview.
- Techniques: Chain-of-Thought (CoT), Tree-of-Thought (ToT), Graph-of-Thought (GoT).
6.8 Structured Outputs
Ensuring the model generates responses in a strict format.
- Outlines – hierarchical text planning before full generation.
- LMFE (Language Model Format Enforcer) – enforces output to match JSON Schema, regex, or XML.
- xgrammar – flexible grammar-based generation.
6.9 Automatic Prefix Caching (APC)
Reuses cached prefix computations for similar queries, reducing token processing time for repeated or partially overlapping prompts.
6.10 Speculative Decoding
A technique where a smaller, faster “draft” model generates candidate tokens, and the larger main model only verifies and finalizes them—reducing latency significantly.
6.11 Chunked Prefill
Splitting long input sequences into manageable chunks for faster prefill operations without overwhelming GPU memory.
6.12 Prompt Adapter
A lightweight fine-tuning approach where small adapter layers are trained to inject task-specific knowledge into a base LLM without retraining the entire model.
6.13 Beam Search
A decoding strategy that keeps track of multiple candidate sequences at each generation step, selecting the most probable one at the end.
6.14 Guided Decoding
Constrained generation to follow specific patterns, constraints, or external logic. Useful for generating SQL queries, code, or structured data.
6.15 AsyncLM
Enables asynchronous processing, allowing the LLM to generate and execute multiple function calls or tasks concurrently—reducing idle GPU time.
6.16 Prompt Logprobs
Logarithmic probability values for each generated token, useful for evaluating model confidence and detecting hallucinations.
6.17 KServe
A standardized, serverless ML inference platform for Kubernetes. Supports scaling, canary deployments, and integrates with GPU/TPU backends.
6.18 KubeAI
An AI inference operator for Kubernetes that simplifies deployment of LLMs, VLMs, embedding models, and speech-to-text pipelines.
6.19 Llama Stack
A composable set of tools, APIs, and services designed for building applications with Meta’s LLaMA models.
6.20 Additional Serving & Inference Terms
- Continuous Batching – Dynamically merging new requests into ongoing batches for maximum throughput (used by vLLM).
- Request Scheduling – Prioritizing inference requests to meet SLAs for latency-sensitive workloads.
- Token Parallelism – Splitting token generation across multiple GPUs to improve throughput.
- Pipeline Parallelism – Splitting the model layers across multiple GPUs.
- Tensor Parallelism – Splitting individual tensors across GPUs for large model inference.
- MoE (Mixture of Experts) – Activating only a subset of model parameters per token to reduce compute cost.
- FlashAttention – An optimized attention algorithm that reduces memory usage and speeds up computation.
- vLLM – A high-performance inference engine with PagedAttention and continuous batching for serving large language models efficiently.
- TensorRT-LLM – NVIDIA’s optimized LLM serving library with quantization, fused kernels, and multi-GPU support.
- Serving Gateway – A request router/load balancer for distributing LLM inference requests across multiple workers.
7. References
- Best LLM Inference Engines and Servers to Deploy LLMs in Production – Overview of popular inference backends.
- Efficient Memory Management for Large Language Model Serving with PagedAttention – Core memory optimization paper behind vLLM.
- LoRA: Low-Rank Adaptation of Large Language Models – Efficient fine-tuning approach for LLMs.
- Fast Inference from Transformers via Speculative Decoding – Reducing token generation latency.
- Looking Back at Speculative Decoding – Retrospective analysis of speculative decoding trade-offs.
- Efficient Generative Large Language Model Serving – Practical techniques for faster inference.
- Ten Ways to Serve Large Language Models: A Comprehensive Guide – High-level serving strategies.
- The 6 Best LLM Tools To Run Models Locally – Lightweight deployment options.
- Benchmarking LLM Inference Backends – Performance metrics comparison.
- Transformers Key-Value Caching Explained – How KV caching accelerates LLMs.
- LLM Inference Series: 3. KV Caching Explained – Deep dive on caching internals.
- vLLM and PagedAttention: A Comprehensive Overview – End-to-end guide to vLLM.
- Understanding Reasoning LLMs – Reasoning capabilities in inference.
8. Further Reads
14. DeepSpeed-MII: High-Throughput and Low-Latency Inference for Transformers – Microsoft’s optimized inference stack.
15. Ray Serve for Distributed LLM Inference – Scaling LLM inference across nodes.
16. Serving Multiple Models Efficiently with NVIDIA Triton – Multi-model and GPU scheduling strategies.
17. FlashAttention: Fast and Memory-Efficient Attention – Key innovation for attention speedups.
18. SGLang: Structured Generation for Large Language Models – Efficient structured text generation.
19. Quantization-Aware LLM Serving with GPTQ – Speed + memory optimizations through quantization.
20. MII vs vLLM vs HuggingFace Transformers Benchmarks – Comparative analysis of popular inference engines.
21. Accelerating LLM Inference with TensorRT-LLM – NVIDIA’s low-level optimization library.
22. Dynamic Batching for LLM Inference – Improving throughput without hurting latency.
23. LLMOps: Operational Challenges and Best Practices – Managing LLM inference in production environments.
24. Speculative Beam Search for Faster LLM Inference – Combining speculative decoding with beam search.
25. Serving LLMs with Kubernetes and KServe – Cloud-native deployment approaches.