Large Language Models (LLMs) Inference and Serving

You might have encountered performance issues while running Large Language Models (LLMs) in a production environment. In that case, consider using an inference engine or server that handles many of these issues off the shelf for you. What follows is an overview of LLM inference and serving.

Several LLM inference engines and servers are available for deploying and serving LLMs in production. The following are among the most prominent:

  1. vLLM
  2. LightLLM
  3. LMDeploy
  4. SGLang
  5. OpenLLM
  6. Triton Inference Server with TensorRT-LLM
  7. Ray Serve
  8. Hugging Face – Text Generation Inference (TGI)
  9. DeepSpeed-MII
  10. CTranslate2
  11. BentoML
  12. MLC LLM

Throughput vs. Latency

Throughput and latency are two important metrics for evaluating LLM inference and serving. Throughput is the number of output tokens an LLM can generate per second. Latency is the time it takes for an LLM to process an input and generate a response. For latency, the key metric is Time to First Token (TTFT): the time the model takes to emit the first token of its response after receiving a prompt.
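
As a rough illustration, the sketch below measures TTFT and output-token throughput around a streaming generation call. The stream_tokens generator is a hypothetical stand-in for whatever streaming API your engine or server exposes.

    import time

    def stream_tokens(prompt):
        # Hypothetical stand-in: yields tokens with an artificial delay
        # to mimic a streaming LLM endpoint.
        for token in ("Hello", ",", " world", "!"):
            time.sleep(0.05)
            yield token

    def measure(prompt):
        start = time.perf_counter()
        first_token_at = None
        n_tokens = 0
        for _ in stream_tokens(prompt):
            if first_token_at is None:
                first_token_at = time.perf_counter()  # marks TTFT
            n_tokens += 1
        end = time.perf_counter()
        ttft = first_token_at - start
        throughput = n_tokens / (end - start)  # output tokens per second
        return ttft, throughput

    ttft, tps = measure("Explain KV caching in one sentence.")
    print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.1f} tokens/s")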

LLM inference engines and servers are designed to optimise LLM memory utilisation and performance in production. They help you achieve high throughput and low latency, ensuring that your LLMs can handle a large number of requests while responding quickly. Depending on your specific use case, other factors may also influence which engine or server you choose.

Inference Engine vs. Inference Server

Inference engines run the models and are in charge of everything needed for the generation process. Inference servers handle the incoming and outgoing HTTP and gRPC requests from the end users of your application, and expose metrics for measuring the deployment performance of your LLM.
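
A minimal sketch of this split, assuming FastAPI for the server layer: the server accepts and validates HTTP requests, while generation is delegated to an engine object. The ToyEngine below is a hypothetical stand-in for a real engine such as vLLM or TGI.

    from fastapi import FastAPI
    from pydantic import BaseModel

    class ToyEngine:
        # Hypothetical stand-in for a real inference engine. A real engine
        # would batch requests, manage the KV cache and run the model;
        # this stand-in just echoes the prompt.
        def generate(self, prompt: str, max_tokens: int) -> str:
            return f"(echo) {prompt}"[:max_tokens]

    class GenerateRequest(BaseModel):
        prompt: str
        max_tokens: int = 64

    app = FastAPI()
    engine = ToyEngine()

    @app.post("/generate")
    def generate(req: GenerateRequest):
        # The server layer handles the HTTP request/response cycle;
        # the engine handles the actual token generation.
        return {"text": engine.generate(req.prompt, req.max_tokens)}

    # Run with: uvicorn server:app --port 8000  (assuming this file is saved as server.py)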

Techniques and Terminology Used Across These Frameworks

  1. KV cache
  2. PagedAttention
  3. Batching
  4. Support for quantisation
  5. LoRA (Low-Rank Adaptation): a technique that freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks (a minimal sketch follows this list).
  6. Tool calling (also known as function calling): a technique that enables LLMs to request information from external tools, allowing them to obtain information they were not trained on or to perform actions beyond their own capabilities (a minimal sketch follows this list).
  7. Reasoning models: e.g. DeepSeek R1
  8. Structured Outputs:
    • outlines: Structured Text Generation
    • LMFE (Language Model Format Enforcer): Enforce the output format (JSON Schema, Regex, etc.) of a language model.
    • xgrammar: Efficient, Flexible and Portable Structured Generation
  9. Automatic Prefix Caching (APC): a technique that speeds up LLM serving by reusing the KV cache of earlier queries that share a prefix with the new query, so the shared prefix does not have to be recomputed.
  10. Speculative Decoding [4][5]
  11. Chunked Prefill
  12. Prompt Adapter
  13. Beam Search
  14. Guided decoding
  15. AsyncLM: improves an LLM’s operational efficiency by enabling it to generate and execute function calls concurrently.
  16. Prompt logprobs (logarithm of probability)
  17. kserve: Standardized Serverless ML Inference Platform on Kubernetes
  18. kubeai: AI Inference Operator for Kubernetes. The easiest way to serve ML models in production. Supports VLMs, LLMs, embeddings, and speech-to-text.
  19. llama-stack: Composable building blocks to build Llama Apps

Many of the terms and techniques above have been left without elaboration; they are pointers for exploring further.
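
To give a flavour of two of them, here is a minimal numpy sketch of the LoRA idea: the pre-trained weight W stays frozen while a low-rank update B @ A of rank r is trained in its place. The dimensions and initialisation below are illustrative only, not taken from any particular implementation.

    import numpy as np

    d_out, d_in, r = 64, 64, 4            # rank r is much smaller than d_in, d_out
    W = np.random.randn(d_out, d_in)      # frozen pre-trained weight
    A = np.random.randn(r, d_in) * 0.01   # trainable
    B = np.zeros((d_out, r))              # trainable, initialised to zero

    def lora_forward(x):
        # Effective weight is W + B @ A, but only A and B receive updates.
        return x @ W.T + x @ (B @ A).T

    x = np.random.randn(1, d_in)
    print(lora_forward(x).shape)                       # (1, 64)
    print("trainable:", A.size + B.size, "vs full:", W.size)

And a minimal sketch of tool calling, assuming an OpenAI-style tool-call structure. Both fake_model and get_weather are hypothetical stand-ins for a real LLM and a real external tool.

    import json

    def get_weather(city: str) -> str:
        # Stand-in for a real external API call.
        return f"22 degrees and sunny in {city}"

    TOOLS = {"get_weather": get_weather}

    def fake_model(messages):
        # A real LLM would decide, from the prompt and the tool schema,
        # to emit this tool call instead of a direct answer.
        return {"tool_call": {"name": "get_weather",
                              "arguments": json.dumps({"city": "Pune"})}}

    def answer(prompt):
        reply = fake_model([{"role": "user", "content": prompt}])
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**json.loads(call["arguments"]))
        # The tool result would normally be fed back to the model for a
        # final natural-language answer; we return it directly here.
        return result

    print(answer("What's the weather in Pune?"))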

References

  1. Best LLM Inference Engines and Servers to deploy LLMs in Production
  2. Efficient Memory Management for Large Language Model Serving with PagedAttention
  3. LoRA: Low-Rank Adaptation of Large Language Models
  4. Fast Inference from Transformers via Speculative Decoding
  5. Looking back at speculative decoding
