If you’ve built conversational AI applications with ChatGPT, Claude, or other large language models, you’ve likely encountered a fundamental challenge: how do you maintain consistent, reliable conversations across multiple turns?
The answer lies in ChatML (Chat Markup Language) — a lightweight, structured format that transforms the art of prompting into an engineering discipline.
The Problem ChatML Solves
Early LLM implementations suffered from “prompt fragility” — minor wording changes would break expected behavior. Consider this problematic approach:
```python
# ❌ Fragile approach
prompt = "You are a helpful assistant. User: What's the capital of France? Assistant:"
```
Issues with this approach:
No clear role separation
Ambiguous message boundaries
Difficult to maintain multi-turn conversations
Hard to debug when things go wrong
The ChatML Solution
<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What's the capital of France?
<|im_end|>
<|im_start|>assistant
ChatML provides:
✅ Clear role separation: System, user, and assistant roles are explicit
✅ Defined boundaries: Special tokens mark where messages begin and end
✅ Conversation continuity: Easy to maintain context across turns
✅ Debugging clarity: Immediately see structure issues
2. Understanding ChatML Fundamentals
What is ChatML?
ChatML is a plain-text markup format designed to give large language models a structured way to understand conversation history. It’s similar to markdown or XML but optimized specifically for LLM conversations.
Key Characteristics:
Lightweight: Minimal overhead, easy to parse
Human-readable: Developers can read and debug it directly
Model-agnostic: Core concepts work across different LLMs
Extensible: Can add new roles or metadata as needed
Why Structure Matters
LLMs are trained on vast amounts of unstructured text, but they perform better with clear structural cues. ChatML provides these cues through:
Role tokens — Identify who’s speaking
Boundary markers — Separate distinct messages
Metadata support — Add context like timestamps or user IDs
Nesting capability — Support complex conversations
The Evolution of Prompt Engineering
Before ChatML:
```python
# Unstructured, fragile prompts
prompt = """System: You are helpful.
User: Hello
AI: Hi there!
User: What's the weather?"""
```
With ChatML:
<|im_start|>system
You are helpful.
<|im_end|>
<|im_start|>user
Hello
<|im_end|>
<|im_start|>assistant
Hi there!
<|im_end|>
<|im_start|>user
What's the weather?
<|im_end|>
<|im_start|>assistant
The structured format eliminates ambiguity and provides clear parsing rules for both humans and models.
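As a minimal illustration of these parsing rules, the following sketch renders a list of role/content messages into the ChatML string shown above (illustrative only; a fuller formatter appears in Section 5):

```python
def render_chatml(messages):
    """Render a list of {'role', 'content'} dicts as a ChatML string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}\n<|im_end|>")
    # Leave an open assistant turn for the model to complete
    parts.append("<|im_start|>assistant")
    return "\n".join(parts)

print(render_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "What's the weather?"},
]))
```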
3. The Anatomy of ChatML
Core Components
1. Special Tokens
<|im_start|> — Marks the beginning of a message
<|im_end|> — Marks the end of a message
These tokens are specifically chosen to:
Rarely appear in natural text
Be easily tokenized by LLM tokenizers
Provide clear visual boundaries
2. Role Identifiers
ChatML supports four primary roles:
| Role | Purpose | Example Use Case |
| --- | --- | --- |
| system | Sets behavior, constraints, personality | “You are a Python expert who explains code clearly” |
| user | Represents end-user input | “How do I sort a list in Python?” |
| assistant | Represents the AI’s response | “To sort a list in Python, use the sorted() function…” |
| tool | Represents external tool outputs | {"status": "success", "data": [...]} |
3. Message Structure
Complete message format:
<|im_start|>{role}
{content}
<|im_end|>
Complete Example:
<|im_start|>system
You are a helpful, concise AI assistant specializing in Python programming.
<|im_end|>
<|im_start|>user
Write a function to calculate factorial.
<|im_end|>
<|im_start|>assistant
Here's a Python function to calculate factorial:

```python
def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)
```

This uses recursion to calculate the factorial efficiently.
<|im_end|>
Token Efficiency
ChatML is designed to be token-efficient:
Start token: <|im_start|> = 1 token
End token: <|im_end|> = 1 token
Role identifier: system/user/assistant = 1 token each
Total overhead per message: ~3-4 tokens (negligible compared to content)
4. Roles and Message Boundaries in Depth
System Role: The Foundation
The system role is your primary control mechanism for AI behavior.
Best Practices for System Prompts
✅ Effective system prompts:
<|im_start|>system
You are a senior Python developer with 10 years of experience.

Guidelines:
- Always include error handling
- Prioritize code readability
- Add docstrings to functions
- Suggest performance optimizations when relevant

Tone: Professional but friendly
Format: Provide code examples with explanations
<|im_end|>
❌ Vague system prompts:
<|im_start|>system
You are helpful.
<|im_end|>
System Prompt Structure
A well-structured system prompt includes:
Identity/Role: Who is the AI?
Expertise: What domain knowledge does it have?
Guidelines: How should it behave?
Constraints: What should it avoid?
Tone: How should it communicate?
Format: How should responses be structured?
Example with all components:
<|im_start|>system
# Identity
You are an expert technical documentation writer for AI/ML projects.

# Expertise
- 15+ years documenting complex software systems
- Specialization in Python, machine learning, and API documentation
- Deep understanding of developer workflows

# Guidelines
1. Use clear, concise language
2. Include practical code examples
3. Explain "why" not just "how"
4. Anticipate edge cases
5. Provide troubleshooting tips

# Constraints
- Never invent functionality that doesn't exist
- Always cite sources when referencing external documentation
- Avoid jargon without explanation
- Don't assume prior knowledge

# Tone
Professional, encouraging, and patient. Treat readers as intelligent but learning.

# Format
- Start with a brief overview
- Use code blocks for all examples
- Include both simple and advanced examples
- End with common pitfalls or tips
<|im_end|>
User Role: Context and Intent
The user role should contain:
The actual user query
Relevant context
Constraints or preferences
Expected output format
Minimal user message:
<|im_start|>user
Sort this list: [3, 1, 4, 1, 5, 9]
<|im_end|>
Rich user message with context:
<|im_start|>user
I'm building a REST API with Flask for an e-commerce platform. I need to implement rate limiting for my endpoints to prevent abuse.

Requirements:
- Limit: 100 requests per hour per user
- Use Redis for storage (already set up)
- Return proper HTTP status codes (429 for rate limit exceeded)
- Include Retry-After header in responses
- Support both authenticated users (by user_id) and anonymous users (by IP)

Current tech stack:
- Python 3.11
- Flask 3.0
- Redis 7.0
- Running on AWS EC2

Please provide a production-ready implementation with proper error handling.
<|im_end|>
User Message Best Practices
Be specific: Clear requirements lead to better responses
Provide context: Mention tech stack, constraints, environment
State expected output: Code? Explanation? Both?
Include examples: Show input/output if applicable
Mention edge cases: What scenarios should be handled?
Assistant Role: Structured Responses
The assistant role can include:
Direct answers
Code examples
Explanations
Follow-up questions
Reasoning steps
Well-structured assistant response:
<|im_start|>assistant
I'll help you implement rate limiting with Flask and Redis. Here's a production-ready solution:

## 1. Install Dependencies

```bash
pip install Flask redis flask-limiter
```

## 2. Implementation

```python
from flask import Flask, jsonify, request
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
import redis

app = Flask(__name__)

# Configure Redis connection
redis_client = redis.Redis(
    host='localhost',
    port=6379,
    db=0,
    decode_responses=True
)

# Initialize rate limiter
limiter = Limiter(
    app=app,
    key_func=get_remote_address,
    storage_uri="redis://localhost:6379",
    strategy="fixed-window"
)

# Custom key function for authenticated users
def get_user_identifier():
    """Get user ID if authenticated, otherwise use IP address."""
    user_id = request.headers.get('X-User-ID')
    if user_id:
        return f"user:{user_id}"
    return f"ip:{get_remote_address()}"

# Apply rate limiting to routes
@app.route('/api/products')
@limiter.limit("100 per hour", key_func=get_user_identifier)
def get_products():
    return jsonify({"products": [...], "status": "success"})

@app.route('/api/orders', methods=['POST'])
@limiter.limit("100 per hour", key_func=get_user_identifier)
def create_order():
    return jsonify({"order_id": "12345", "status": "created"})

# Custom error handler
@app.errorhandler(429)
def ratelimit_handler(e):
    return jsonify({
        "error": "Rate limit exceeded",
        "message": "Too many requests. Please try again later.",
        "retry_after": e.description
    }), 429

if __name__ == '__main__':
    app.run(debug=True)
```

## 3. Key Features

✅ **Dual identification**: Uses user ID for authenticated users, IP for anonymous
✅ **Automatic retry-after**: Flask-Limiter adds this header automatically
✅ **Redis persistence**: Survives application restarts
✅ **Custom error handling**: Professional error messages

## 4. Testing

```python
import requests

# Test rate limiting
for i in range(105):
    response = requests.get('http://localhost:5000/api/products')
    print(f"Request {i+1}: {response.status_code}")
    if response.status_code == 429:
        print(f"Rate limited! Retry after: {response.headers.get('Retry-After')}")
        break
```

## 5. Production Considerations

⚠️ **Important notes**:
- Adjust limits based on your API capacity
- Monitor Redis memory usage
- Consider implementing a token bucket algorithm for smoother rate limiting
- Add logging for rate limit violations
- Implement different limits for different endpoints
- Consider geographic distribution if using multiple Redis instances

Would you like me to show you how to implement more advanced features like:
- Burst allowance
- Different limits for premium vs. free users
- Rate limit analytics dashboard
<|im_end|>
Assistant Response Structure
A quality assistant response includes:
Acknowledgment: Confirm understanding
Structure: Use headings for clarity
Code examples: Complete, runnable code
Explanations: Why, not just how
Testing: How to verify it works
Warnings: Edge cases and gotchas
Next steps: Optional follow-ups
Tool Role: External Integrations
The tool role represents outputs from external systems:
<|im_start|>tool
{
  "tool_name": "web_search",
  "query": "ChatML documentation OpenAI",
  "results": [
    {
      "title": "ChatML Format Specification - OpenAI",
      "url": "https://platform.openai.com/docs/guides/chatml",
      "snippet": "ChatML is a structured format for representing conversations..."
    },
    {
      "title": "Understanding ChatML - Developer Guide",
      "url": "https://example.com/chatml-guide",
      "snippet": "Learn how to use ChatML for building production chatbots..."
    }
  ],
  "timestamp": "2025-12-17T10:30:00Z"
}
<|im_end|>
Tool Integration Pattern
<|im_start|>user
What's the current weather in San Francisco?
<|im_end|>
<|im_start|>assistant
I'll check the current weather in San Francisco for you.
<|im_end|>
<|im_start|>tool
{
  "tool_name": "weather_api",
  "location": "San Francisco, CA",
  "data": {
    "temperature": 62,
    "condition": "Partly Cloudy",
    "humidity": 75,
    "wind_speed": 12
  }
}
<|im_end|>
<|im_start|>assistant
The current weather in San Francisco is:
- Temperature: 62°F
- Condition: Partly Cloudy
- Humidity: 75%
- Wind Speed: 12 mph

It's a pleasant day with comfortable temperatures!
<|im_end|>
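At the application level, the tool turn is usually filled in by your own code between two model calls. The following is a minimal sketch of that loop; call_model and run_tool are hypothetical placeholders for your LLM client and tool executor:

```python
import json

def handle_turn(messages, call_model, run_tool):
    """Hypothetical tool loop: the model asks for a tool, the app runs it, the model answers."""
    reply = call_model(messages)                  # e.g. returns {"tool_request": {...}} or {"text": "..."}
    if "tool_request" in reply:
        result = run_tool(reply["tool_request"])  # e.g. {"tool_name": "weather_api", "data": {...}}
        messages.append({"role": "assistant", "content": reply.get("text", "")})
        messages.append({"role": "tool", "content": json.dumps(result)})
        reply = call_model(messages)              # the second call sees the tool output
    messages.append({"role": "assistant", "content": reply["text"]})
    return messages
```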
5. Implementing ChatML in Python
Basic Implementation
```python
class ChatMLFormatter:
    """Production-ready ChatML formatter with validation."""

    VALID_ROLES = {'system', 'user', 'assistant', 'tool'}
    START_TOKEN = '<|im_start|>'
    END_TOKEN = '<|im_end|>'

    def __init__(self):
        self.messages = []

    def add_message(self, role: str, content: str) -> 'ChatMLFormatter':
        """Add a message with validation."""
        if role not in self.VALID_ROLES:
            raise ValueError(f"Invalid role: {role}. Must be one of {self.VALID_ROLES}")
        if not content or not content.strip():
            raise ValueError("Message content cannot be empty")
        self.messages.append({
            'role': role,
            'content': content.strip()
        })
        return self  # Enable chaining

    def to_chatml(self, include_assistant_start: bool = True) -> str:
        """Convert messages to ChatML format."""
        chatml = []
        for msg in self.messages:
            chatml.append(f"{self.START_TOKEN}{msg['role']}")
            chatml.append(msg['content'])
            chatml.append(self.END_TOKEN)
        # Add assistant start token for model completion
        if include_assistant_start:
            chatml.append(f"{self.START_TOKEN}assistant")
        return '\n'.join(chatml)

    def from_chatml(self, chatml_string: str) -> 'ChatMLFormatter':
        """Parse a ChatML string back to messages."""
        import re
        pattern = rf"{re.escape(self.START_TOKEN)}(\w+)\n(.*?){re.escape(self.END_TOKEN)}"
        matches = re.findall(pattern, chatml_string, re.DOTALL)
        self.messages = []
        for role, content in matches:
            if role in self.VALID_ROLES:
                self.messages.append({
                    'role': role,
                    'content': content.strip()
                })
        return self

    def to_dict(self) -> list:
        """Convert to OpenAI API format."""
        return [{'role': msg['role'], 'content': msg['content']} for msg in self.messages]

    def __len__(self) -> int:
        return len(self.messages)

    def __repr__(self) -> str:
        return f"ChatMLFormatter({len(self)} messages)"


# Usage example
formatter = ChatMLFormatter()
formatter.add_message('system', 'You are a helpful AI assistant.') \
    .add_message('user', 'What is ChatML?') \
    .add_message('assistant', 'ChatML is a structured format for LLM conversations.')

# Generate ChatML
chatml_output = formatter.to_chatml()
print(chatml_output)

# Convert to OpenAI format
openai_format = formatter.to_dict()
print(openai_format)
```
```python
import time
from collections import deque
from threading import Lock

class RateLimiter:
    """Sliding-window rate limiter for ChatML requests."""

    def __init__(self, requests_per_minute: int = 60):
        self.rpm = requests_per_minute
        self.requests = deque()
        self.lock = Lock()

    def acquire(self) -> bool:
        """Acquire permission to make a request."""
        with self.lock:
            now = time.time()
            # Remove requests older than 1 minute
            while self.requests and self.requests[0] < now - 60:
                self.requests.popleft()
            # Check if we can make a request
            if len(self.requests) < self.rpm:
                self.requests.append(now)
                return True
            return False

    def wait_if_needed(self):
        """Block until a request can be made."""
        while not self.acquire():
            time.sleep(0.1)


# Usage
limiter = RateLimiter(requests_per_minute=60)
for i in range(100):
    limiter.wait_if_needed()
    # Make API call
    print(f"Request {i+1}")
```
```python
def verify_tokenization(text: str, model: str = "gpt-4") -> None:
    """Verify ChatML tokens are properly recognized."""
    import tiktoken
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    # Check if special tokens are single tokens
    im_start_tokens = encoding.encode('<|im_start|>')
    im_end_tokens = encoding.encode('<|im_end|>')

    print(f"<|im_start|> tokens: {len(im_start_tokens)}")
    print(f"<|im_end|> tokens: {len(im_end_tokens)}")

    if len(im_start_tokens) != 1 or len(im_end_tokens) != 1:
        print("⚠️ Warning: Special tokens not recognized as single tokens")
        print("Solution: Ensure you're using a ChatML-compatible model")


verify_tokenization('<|im_start|>system\nHello<|im_end|>')
```
Issue 2: Conversation Context Loss
Problem: Model “forgets” earlier parts of conversation
Solution:
```python
class ContextPreserver:
    """Preserve important context across long conversations."""

    def __init__(self, max_context_messages: int = 10):
        self.max_context = max_context_messages
        self.important_indices = set()

    def mark_important(self, index: int):
        """Mark a message as important (always keep)."""
        self.important_indices.add(index)

    def compress_messages(self, messages: list) -> list:
        """Compress messages while preserving important ones."""
        if len(messages) <= self.max_context:
            return messages

        # Always keep the system message
        result = [messages[0]] if messages[0]['role'] == 'system' else []

        # Keep important messages
        for idx in sorted(self.important_indices):
            if idx < len(messages):
                result.append(messages[idx])

        # Fill remaining slots with recent messages
        recent_count = self.max_context - len(result)
        result.extend(messages[-recent_count:])
        return result


# Usage
preserver = ContextPreserver(max_context_messages=10)
preserver.mark_important(2)  # Keep message at index 2
compressed = preserver.compress_messages(long_conversation)
```
Issue 3: Malformed ChatML
Problem: Generated ChatML is syntactically incorrect
Solution:
```python
import re
from typing import List, Tuple


def validate_chatml_syntax(chatml_string: str) -> Tuple[bool, List[str]]:
    """Validate ChatML syntax."""
    errors = []

    # Check matching start/end tokens
    start_count = chatml_string.count('<|im_start|>')
    end_count = chatml_string.count('<|im_end|>')
    if start_count != end_count:
        errors.append(f"Mismatched tokens: {start_count} starts, {end_count} ends")

    # Check role validity
    roles = re.findall(r'<\|im_start\|>(\w+)', chatml_string)
    valid_roles = {'system', 'user', 'assistant', 'tool'}
    for role in roles:
        if role not in valid_roles:
            errors.append(f"Invalid role: {role}")

    # Check for empty messages
    messages = re.findall(
        r'<\|im_start\|>\w+\n(.*?)<\|im_end\|>',
        chatml_string,
        re.DOTALL
    )
    for i, msg in enumerate(messages):
        if not msg.strip():
            errors.append(f"Empty message at position {i}")

    return len(errors) == 0, errors


# Usage
chatml = "<|im_start|>system\nHello<|im_end|>"
valid, errors = validate_chatml_syntax(chatml)
if not valid:
    print("Validation errors:")
    for error in errors:
        print(f" - {error}")
```
Issue 4: Performance Bottlenecks
Problem: Slow response times in production
Solutions:
```python
import functools

# 1. Caching
@functools.lru_cache(maxsize=128)
def cached_format(messages_tuple: tuple) -> str:
    """Cache formatted ChatML strings.

    messages_tuple must be hashable, e.g. a tuple of (role, content) pairs.
    """
    formatter = ChatMLFormatter()
    for role, content in messages_tuple:
        formatter.add_message(role, content)
    return formatter.to_chatml()


# 2. Connection Pooling
from openai import OpenAI

class ConnectionPool:
    """Manage OpenAI client connections."""

    def __init__(self, api_key: str, pool_size: int = 5):
        self.clients = [OpenAI(api_key=api_key) for _ in range(pool_size)]
        self.current = 0

    def get_client(self) -> OpenAI:
        """Get the next available client (round-robin)."""
        client = self.clients[self.current]
        self.current = (self.current + 1) % len(self.clients)
        return client


# 3. Batch Processing
class BatchProcessor:
    """Process multiple ChatML requests efficiently."""

    def __init__(self, batch_size: int = 10):
        self.batch_size = batch_size
        self.queue = []

    async def add_request(self, messages: list):
        """Add a request to the batch queue."""
        self.queue.append(messages)
        if len(self.queue) >= self.batch_size:
            await self.process_batch()

    async def process_batch(self):
        """Process accumulated requests."""
        results = []
        for messages in self.queue:
            # Make the API call (self.call_api is application-specific)
            result = await self.call_api(messages)
            results.append(result)
        self.queue.clear()
        return results
```
10. Future of Structured Prompting
Emerging Trends
1. Extended Role Types
<!-- Critic role for self-evaluation -->
<|im_start|>critic
Let me evaluate the previous response:
- Code quality: 8/10
- Completeness: 9/10
- Error handling: 7/10
Suggestions: Add input validation
<|im_end|>

<!-- Planner role for multi-step reasoning -->
<|im_start|>planner
Task breakdown:
1. Parse user requirements
2. Research available APIs
3. Design architecture
4. Implement solution
5. Test and validate
<|im_end|>

<!-- Observer role for monitoring -->
<|im_start|>observer
Monitoring conversation health:
- Token usage: 1,234/8,192 (15%)
- Turn count: 5
- Average response time: 2.3s
- User satisfaction: High (inferred)
<|im_end|>
2. Metadata Enrichment
<|im_start|>user
content: "What's the weather in New York?"
metadata:
  timestamp: "2025-12-17T10:30:00Z"
  location: "New York, NY"
  user_id: "user_123"
  device: "mobile"
  session_id: "sess_abc"
  intent: "weather_query"
  priority: "normal"
<|im_end|>
3. Nested Conversations
<|im_start|>assistant
I'll break this complex task into subtasks:

<|im_start|>planner
Primary task: Build REST API
Subtasks:
1. Design database schema
2. Implement authentication
3. Create CRUD endpoints
<|im_end|>

Let me start with subtask 1...

<|im_start|>assistant
For the database schema, I recommend:
[detailed response]
<|im_end|>
<|im_end|>
4. Structured Outputs
<|im_start|>assistant
{
  "response_type": "structured",
  "sections": [
    {
      "heading": "Solution Overview",
      "content": "Here's how to implement rate limiting..."
    },
    {
      "heading": "Code Implementation",
      "content": "[code block]",
      "language": "python"
    },
    {
      "heading": "Testing Strategy",
      "content": "Run these tests..."
    }
  ],
  "confidence": 0.95,
  "sources_cited": 3
}
<|im_end|>
Industry Standardization Efforts
Current initiatives:
Cross-provider working groups
Open specifications (ChatML RFC proposals)
Interoperability testing frameworks
Unified metadata schemas
Expected timeline:
2025: Broader adoption of ChatML-inspired formats
2026: First cross-provider standards
2027: Industry-wide standardization
11. Frequently Asked Questions
Q1: Is ChatML only for OpenAI models?
A: No. While ChatML originated with OpenAI, the core concepts (role-based messaging, clear boundaries) are now used or adapted by many LLMs including:
Qwen (full support)
Claude (adapted format)
Mistral (partial support)
Various open-source models
The structured approach has proven so effective that it’s becoming a de facto standard.
Q2: Can I use ChatML with local models?
A: Yes, many fine-tuned open-source models support ChatML or similar formats:
Vicuna
WizardLM
Alpaca
Many LLaMA 2/3 fine-tunes
Check the model card on Hugging Face for specific format requirements.
Q3: What’s the performance overhead of ChatML?
A: Minimal. ChatML tokens typically add <50 tokens per conversation:
Start token: 1 token
End token: 1 token
Role identifier: 1 token
Example: A 5-turn conversation adds ~30 tokens total (negligible compared to message content which may be 1000+ tokens).
Q4: How do I handle multi-language conversations?
A: ChatML works with any language. The structure remains the same:
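For example, a conversation in French keeps exactly the same markup (illustrative):

<|im_start|>system
Tu es un assistant utile.
<|im_end|>
<|im_start|>user
Quelle est la capitale de la France ?
<|im_end|>
<|im_start|>assistant
La capitale de la France est Paris.
<|im_end|>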
Or use base64 encoding for small files (check model’s file handling capabilities).
Q10: What’s the future of ChatML?
Near-term (2025-2026):
Broader adoption across LLM providers
Extended role types (critic, planner, observer)
Richer metadata support
Better tooling and validation libraries
Long-term (2027+):
Industry standardization efforts
Cross-provider interoperability
Advanced nesting and structured outputs
Integration with agent frameworks
12. Conclusion: Building Better AI with ChatML
ChatML transforms conversational AI from an art into an engineering discipline. By providing clear structure, role separation, cross-model compatibility, and debugging clarity, you’re equipped to build reliable, maintainable AI systems.
Key Takeaways
✅ Clear structure — Eliminate prompt ambiguity with defined roles and boundaries
✅ Role separation — System, user, assistant, and tool roles provide semantic clarity
✅ Cross-model compatibility — Build once, adapt easily for different LLMs
✅ Debugging clarity — Spot structural issues immediately with validation tools
✅ Production readiness — Scale with confidence using best practices
✅ Future-proof — Industry moving toward standardization around these concepts
Implementation Checklist
Week 1: Foundation
Implement basic ChatMLFormatter class
Add input validation
Create simple test cases
Test with your target LLM
Week 2: Enhancement
Add context window management
Implement error handling with retries
Create conversation templates
Add logging and monitoring
Week 3: Production
Deploy with rate limiting
Set up monitoring dashboards
Document your implementation
Train team on ChatML concepts
Ongoing
Monitor performance metrics
Iterate based on user feedback
Stay updated on ChatML developments
Contribute to open-source tools
Next Steps
Start small: Implement the basic formatter and test with simple conversations
Validate thoroughly: Use the validation tools before deploying
Test across models: Ensure compatibility with your target LLMs
Monitor in production: Track token usage, errors, and performance
Iterate continuously: Improve based on real-world usage patterns
The Road Ahead
As AI systems become more complex, structured prompting will become increasingly critical. ChatML provides:
A foundation for building reliable conversational systems
A framework for multi-agent orchestration
A standard for cross-platform compatibility
A path forward as the industry matures
By mastering ChatML today, you’re positioning yourself at the forefront of AI engineering best practices.
Before understanding what face search is, what the use cases are, and why performing face search fast is so crucial, let us understand the following two key terms used in this domain:
Face Verification: This is a one-to-one comparison of faces that confirms an individual’s identity by comparing his/her face against a face or face template stored on the identity card, or against the face image captured directly from the card by a camera. An example is when an organisation authenticates a user by comparing the image stored in the offline eKYC XML of Aadhaar with the face captured through a camera. This face capture can happen through cameras mounted at the entry point, or it may be done by a web application using a computer camera. Other use cases include online banking and passport checks. In face verification, the comparison of faces is always one-to-one.
Face Recognition: The purpose of face recognition is to identify/recognise the person from a database of faces by performing a one-to-many comparison.
Face images are not compared directly; instead, deep learning-based models transform the faces into embeddings. An embedding is a vector: a mathematical representation of the face in the embedding space learnt by the model. By computing a distance metric, such as cosine similarity, and comparing it against a threshold, we can tell whether two faces belong to the same person. Other distance metrics include dot product, squared Euclidean, Manhattan, and Hamming distance.
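As a quick illustration of the comparison step, the following sketch computes cosine similarity between two embedding vectors with NumPy; the 0.7 threshold is only illustrative, as real thresholds depend on the embedding model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb1 = np.random.rand(128)  # stand-ins for real face embeddings
emb2 = np.random.rand(128)
same_person = cosine_similarity(emb1, emb2) > 0.7  # illustrative threshold
print(same_person)
```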
There are many use cases where the database may hold millions, even billions, of images. A brute-force one-to-many comparison against such a huge number of images is not feasible for real-time use cases.
In this article and accompanying code, I have used Facebook AI Similarity Search (Faiss), a library that helps in quickly searching across multimedia documents that are similar to each other. The first step is data ingestion, where multimedia documents (a face image in this case) are transformed into vector embeddings and then saved in the database. Once queried, this database returns the k-nearest neighbours of the queried face, that is, k faces that are most similar to the queried face images. Other competing vector databases provide similar functionality. Read more about Faiss in the article “Faiss: A library for efficient similarity search“.
2. Data Ingestion
I used Labelled Faces in the Wild (LFW) dataset, which has over 13,000 images of faces collected from the web. The face images are stored in a directory with the same name as the person whose face images they belong to. All these directories are located in a directory named lfw-deepfunneled. The following is the code snippet to
Load the face images from the directory.
Transform the loaded face images to face embeddings.
To perform both operations, I used the face-recognition library, a Python library built on dlib’s state-of-the-art face recognition. The loading step additionally detects the face region in the original image, crops it, and returns the cropped face. The transformation step converts the cropped face into a vector embedding. In the code, representations is a list of [key, value] pairs, where the key is the file name and the value is the corresponding vector embedding, and embeddings is a list that stores all the vector embeddings.
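A minimal sketch of these two steps with the face-recognition library is shown below (assuming the directory layout described above; the exact code may differ):

```python
import os
import face_recognition

path = "lfw-deepfunneled"
representations = []  # list of [file name, embedding] pairs
embeddings = []       # list of all embeddings

for person_dir in os.listdir(path):
    person_path = os.path.join(path, person_dir)
    for file_name in os.listdir(person_path):
        # Load the image; the face region is detected during encoding
        image = face_recognition.load_image_file(os.path.join(person_path, file_name))
        # Transform the detected face into a 128-dimensional embedding
        encodings = face_recognition.face_encodings(image)
        if len(encodings) > 0:
            representations.append([file_name, encodings[0]])
            embeddings.append(encodings[0])
```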
The next step is to initialise the Faiss database and store the vector embeddings in it. Then, serialise the index to disk. Finally, serialise the representations list to disk. The intent is that when the face search module starts, it loads the serialised index and list into memory. Following is the code snippet:
```python
# Initialize the vector store and save the embeddings
print("Storing embeddings in faiss.")
index = faiss.IndexFlatL2(128)
index.add(np.array(embeddings, dtype="f"))

# Save the index
faiss.write_index(index, "face_index.bin")

# Save the representations
with open('face_representations.txt', 'wb') as fp:
    pickle.dump(representations, fp)

print("Done")
```
3. Face Search
The following are the steps for face search:
Load the database; load the representations list.
Create a search interface (web interface using streamlit in this case)
Upload the query face image, crop the face, and transform it into a vector embedding
Pass the query vector embedding to the Faiss database
Faiss database returns the k nearest neighbours from the database.
Perform 1 to k comparisons (similarity check) of the query face with k face embeddings returned from the database.
Based on the comparison of this similarity value with a certain threshold, it is decided whether the person is found or not. If found, then show the face images found.
Following is the code snippet:
```python
is_dataset_loaded = False

# Load the face embeddings from the saved face_representations.txt file
def get_data():
    with st.spinner("Wait for the dataset to load...", show_time=True):
        representations = None
        with open('face_representations.txt', 'rb') as fp:
            representations = pickle.load(fp)
        print(representations)
        # Load the index
        face_index = faiss.read_index("face_index.bin")
    return representations, face_index

# Load the face embeddings at startup and store them in the session
if st.button('Rerun'):
    st.session_state.representations, st.session_state.index = get_data()
if 'index' not in st.session_state:
    st.session_state.representations, st.session_state.index = get_data()
index = st.session_state.index
representations = st.session_state.representations

# Search web interface
with st.form("search-form"):
    uploaded_face_image = st.file_uploader("Choose face image for search", key="search_face_image_uploader")
    if uploaded_face_image is not None:
        tic = time.time()
        st.text("Saving the query image...")
        print("Saving the query image in the directory: " + "query-images")
        random_query_image_name = uuid.uuid4().hex
        query_image_full_path = "query-images/" + random_query_image_name + ".jpg"
        with open(query_image_full_path, "wb") as binary_file:
            binary_file.write(uploaded_face_image.getvalue())
        st.image(uploaded_face_image, caption="Image uploaded for search")

        query_image = face_recognition.load_image_file(query_image_full_path)
        query_image_embedding = face_recognition.face_encodings(query_image)
        if len(query_image_embedding) > 0:
            query_image_embedding = query_image_embedding[0]
            query_image_embedding = np.expand_dims(query_image_embedding, axis=0)

            # Search
            st.text("Searching the images...")
            k = 1
            distances, neighbours = index.search(query_image_embedding, k)
            # print(neighbours)
            # print(distances)
            i = 0
            is_image_found = False
            for distance in distances[0]:
                if distance < 0.3:
                    st.text("Found the image.")
                    st.text("Similarity: " + str(distance))
                    image_file_name = representations[neighbours[0][i]][0]
                    image_path = "lfw-deepfunneled/" + image_file_name[:-9] + "/" + image_file_name
                    st.image(image_path)
                    is_image_found = True
                i = i + 1
            if is_image_found == False:
                st.text("Could not find the image.")

        toc = time.time()
        st.text("Total time taken: " + str(toc - tic) + " seconds")
    st.form_submit_button('Submit')
```
I have created a teaching chatbot that can answer questions from class IX, subject SST, on the topic “Democratic politics”. I have used RAG (Retrieval-Augmented Generation), a Llama model as the LLM (Large Language Model), Qdrant as the vector database, LangChain, and Streamlit.
Before running the following command, Qdrant should be running and available on localhost. If it’s running on a different machine, make appropriate URL changes to the code.

```bash
python data_ingestion.py
```

After running this, http://localhost:6333/dashboard#/collections should appear like Figure 1.
Run the web application for the chatbot by running the following command. The web application is powered by Streamlit.

```bash
streamlit run app.py
```

The interface of the chatbot appears as in Figure 2.
Figure 1: Screenshot of the Qdrant dashboard after running the data_ingestion.py
Figure 2: Screenshot of the chatbot web application
3. Data Ingestion
Data: PDF files have been downloaded from the NCERT website for Class IX, subject SST, from the topic “Democratic politics”. These files are stored in the directory ix-sst-ncert-democratic-politics. The following are the steps for data ingestion:
PDF files are loaded from the directory.
Text contents are extracted from the PDF.
Text content is divided into chunks of text.
These chunks are transformed into vector embeddings.
These vector embeddings are stored in the Qdrant vector database.
This data is stored in Qdrant with the collection name “ix-sst-ncert-democratic-politics“.
The following is the code snippet for data_ingestion.py.
```python
###############################################################
# Data ingestion pipeline
# 1. Take the input pdf file
# 2. Extract the content
# 3. Divide it into chunks
# 4. Use an embeddings model to convert chunks to embedding vectors
# 5. Store the embedding vectors in qdrant (vector database)
###############################################################
import os
from langchain_community.document_loaders import PDFMinerLoader
from langchain.text_splitter import CharacterTextSplitter
from qdrant_client import QdrantClient

path = "ix-sst-ncert-democratic-politics"
filenames = next(os.walk(path))[2]

for i, file_name in enumerate(filenames):
    print(f"Data ingestion for the chapter: {i}")

    # 1. Load the pdf document and extract text from it
    loader = PDFMinerLoader(path + "/" + file_name)
    pdf_content = loader.load()
    print(pdf_content)

    # 2. Split the text into small chunks
    CHUNK_SIZE = 1000   # chunk size not greater than 1000 chars
    CHUNK_OVERLAP = 30  # a bit of overlap is required for continued context
    text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
    docs = text_splitter.split_documents(pdf_content)

    # Make a list of split docs
    documents = []
    for doc in docs:
        documents.append(doc.page_content)

    # 3. Create a vector database (qdrant) client
    qdrant_client = QdrantClient(url="http://localhost:6333")

    # 4. Add document chunks to the vector db
    qdrant_client.add(
        collection_name="ix-sst-ncert-democratic-politics",
        documents=documents,
        # metadata=metadata,
        # ids=ids
    )

    # 5. Make a query from the vector db (qdrant)
    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text="What is democracy?"
    )
    for search_result in search_results:
        print(search_result.document, search_result.score)
```
4. Chatbot Web Application
The web application is powered by Streamlit. Following are the steps:
A connection to the Qdrant vector database is created.
User questions are captured through the web interface.
The question text is transformed into a vector embedding.
This vector embedding is searched in the Qdrant vector database to find the most relevant content similar to the question.
The text returned by the Qdrant acts as the context for the LLM.
I have used Llama LLM. The query, along with context, is sent to the Llama for an answer to be generated.
The answer is displayed on the web interface as the answer from the bot.
Following is the code snippet for app.py.
```python
# Initialize chat history
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# React to user input
if query := st.chat_input("What is up?"):
    # Display user message in chat message container
    st.chat_message("user").markdown(query)
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": query})

    # Connect with the vector db for getting the context
    search_results = qdrant_client.query(
        collection_name="ix-sst-ncert-democratic-politics",
        query_text=query
    )
    context = ""
    no_of_docs = 2
    count = 1
    for search_result in search_results:
        if search_result.score >= 0.8:
            # print(f"Retrieved document: {search_result.document}, Similarity score: {search_result.score}")
            context = context + search_result.document
        if count >= no_of_docs:
            break
        count = count + 1

    # Using the LLM for forming the answer
    template = """Instruction: {instruction}
Context: {context}
Query: {query}
"""
    prompt = ChatPromptTemplate.from_template(template)
    model = OllamaLLM(model="llama3.2")  # Using llama3.2 as the llm model
    chain = prompt | model
    bot_response = chain.invoke({
        "instruction": "Answer the question based on the context below. If you cannot answer the question with the given context, answer with \"I don't know.\"",
        "context": context,
        "query": query
    })
    print(f'\nBot: {bot_response}')

    # Display assistant response in chat message container
    with st.chat_message("assistant"):
        st.markdown(bot_response)
    # Add assistant response to chat history
    st.session_state.messages.append({"role": "assistant", "content": bot_response})
```
An ability is emergent if it is not present in smaller models but is present in larger models. [1]
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. Emergent abilities, however, cannot be predicted simply by extrapolating the performance of smaller models. This raises the question of whether additional scaling could further expand the range of capabilities of language models. [1]
Today’s language models have been scaled primarily along three factors:
amount of computation,
number of model parameters, and
training data size
The following table lists the emergent abilities of large language models and the scale at which abilities emerge. [1]
Tasks that language models cannot currently do are prime candidates for future emergence; for instance, there are dozens of tasks in BIG-Bench[3] for which even the largest GPT-3 and PaLM models do not achieve above-random performance. [1] Similar to emergent abilities, emergent risks could also arise, for example with respect to truthfulness, bias, and toxicity in LLMs, backdoor vulnerabilities, inadvertent deception, or harmful content synthesis.
But Rylan Schaeffer et al., in their paper [3], claim that the sudden appearance of emergent abilities is just a consequence of the way researchers measure the LLM’s performance. The article “How Quickly Do Large Language Models Learn Unexpected Skills?” by Stephen Ornes [4] beautifully summarises the two papers.
Prompt engineering is the practice of designing and refining the text (prompt) that we pass to a Generative AI (GenAI) model. The prompt acts as an instruction or query, and the model generates responses based on it. Prompts can be questions, statements, or detailed instructions.
Prompt engineering serves three purposes:
Enhancing output quality – refining how the model responds.
Evaluating model behavior – testing the output against requirements.
Ensuring safety – reducing harmful or biased responses.
There is no single “perfect” prompt. Instead, prompt design is an iterative process involving optimization and experimentation.
Figure 1: A basic example of the prompt
2. Controlling Model Output by Adjusting Model Parameters
The behavior of large language models (LLMs) can be adjusted using parameters such as temperature, top_p, and top_k. For these to take effect, do_sample=True must be set, allowing the model to sample tokens instead of always choosing the most likely one.
Temperature controls randomness.
temperature=0: deterministic output (always the same response).
Higher values → more diverse responses.
Example: 0.2 = focused, coherent; 0.8 = more creative.
Top_p (nucleus sampling) restricts token choices to the smallest set whose cumulative probability ≥ p.
top_p=1: consider all tokens.
Lower values → more focused output.
Top_k limits the selection to the k most likely tokens.
By tuning these, one can strike a balance between deterministic/focused and creative/diverse outputs.
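As a sketch of how these parameters are passed in practice, here is an example using the Hugging Face transformers API (the model name is only a placeholder for illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Write a short tagline for a coffee shop.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # enable sampling so temperature/top_p/top_k take effect
    temperature=0.8,   # higher values give more diverse output
    top_p=0.9,         # nucleus sampling
    top_k=50,          # restrict choices to the 50 most likely tokens
    max_new_tokens=30,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```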
3. Instruction-Based Prompting
Instruction-based prompting is one of the most fundamental and widely used approaches in working with large language models (LLMs). It involves providing the model with explicit, structured, and unambiguous instructions that guide how the response should be generated.
At its core, an instruction-based prompt consists of two essential components:
Instruction – what the model is supposed to do (e.g., “Summarize the text in one sentence.”).
Data – the input on which the instruction operates (e.g., the paragraph to be summarized).
A simple example:
Prompt
Instruction: Summarize the following text in one sentence.
Data: Artificial Intelligence is revolutionizing industries such as healthcare, finance, and education by automating tasks and enabling data-driven decision-making.
Output
AI is transforming industries by automating tasks and enabling smarter decisions.
The following diagram depicts a basic instruction prompt. Please note the instructions and data in the prompt.
Figure 2: Instruction Prompt
3.1 Adding Output Indicators
Sometimes instructions alone are not enough. To make the response more constrained and predictable, we can add output indicators – predefined answer formats or expected categories.
For example:
Prompt
Instruction: Classify the sentiment of the following review.
Data: “The product is amazing and works perfectly.”
Output options: Positive | Negative
Output
Positive
The following diagram depicts the instruction prompt with an output indicator.
Figure 3: Instruction prompt with output indicators
3.2 Task-Specific Prompt Formats
Different NLP tasks require slightly different instruction structures. For example:
Summarization: “Summarize the following paragraph in 2–3 sentences.”
Classification: “Classify the following text as spam or not spam.”
Named Entity Recognition (NER): “Extract all names of organizations mentioned in the following text and list them as a JSON array.”
These formats not only help the model but also make evaluation easier for humans.
The following diagram illustrates example formats for summarization, classification, and named-entity recognition.
Figure 4: Prompt format for summarization, classification, and NER task
3.3 Prompting Techniques for Better Results
Instruction-based prompting can be improved using several best practices:
Specificity: Be as precise as possible. Instead of “Explain photosynthesis”, say “Explain photosynthesis in 3 simple steps, suitable for a 10-year-old student.”
Hallucination Control: LLMs sometimes generate plausible but false answers (hallucinations). To mitigate this, include safeguards: If you are not sure about the answer, respond with: "I don't know."
Order Matters (Primacy and Recency Effects): LLMs pay more attention to the beginning and end of a prompt. Place the most important instructions either at the start or the end to maximize compliance.
3.4 Beyond the Basics: Modular Prompt Components
While instruction, data, and output indicators are the core building blocks, prompts can be made more modular and adaptive by incorporating additional components such as:
Personalization – “You are a financial advisor…”
Instruction – “Summarize, compare, explain…”
Context – background information that helps the model.
Format – JSON, table, bullet points, plain text.
Audience – “Explain as if to a high school student.”
Tone – formal, casual, persuasive, neutral.
Data – the actual input for the model to process.
This modularity makes prompts flexible. You can add or remove components depending on the use case, and observe how the output changes.
The following is an example that uses the above prompt components. This example demonstrates the modular nature of prompting. We can experiment by adding or removing components to see the effect.
Figure 5: Example of a prompt showing the use of the various components.
3.5 Example: Modular Prompt
Prompt
Personalization: You are a career coach.
Instruction: Provide advice on writing a resume.
Context: The user is applying for a software engineering role at a mid-sized startup.
Format: List your advice as 5 bullet points.
Audience: Entry-level job seeker with 1 year of internship experience.
Tone: Encouraging and practical.
Data: “How should I improve my resume?”
Output
1. Highlight your technical skills (Python, JavaScript, databases).
2. Add details about your internship projects—focus on impact.
3. Tailor your resume to the startup by showing adaptability and problem-solving.
4. Keep the format clean, concise, and one page.
5. Add a short section for personal projects or open-source contributions.
This demonstrates how instruction-based prompting becomes powerful and adaptable when multiple components are layered together.
4. In-Context Learning – Providing examples
Large Language Models (LLMs) do not “learn” in the traditional sense during inference. Instead, they adapt to patterns given in the prompt itself. This ability to condition their behavior on a few examples provided at runtime is called In-Context Learning (ICL).
4.1 The Idea Behind ICL
By showing the model examples of the task and the desired outputs, we “teach” it on the fly. The model does not change its weights; rather, it uses the examples as a temporary pattern guide to align its responses with the given format.
This makes ICL especially powerful when:
We don’t want to fine-tune the model.
Training data for fine-tuning is small or unavailable.
We want flexibility to change tasks quickly.
4.2 Types of In-Context Learning
1. Zero-shot prompting
No examples are provided, only instructions.
Works best when the task is common or well-aligned with the model’s pretraining.
Example:
Instruction: Translate the following English sentence into French.
Data: "How are you?"
Output: “Comment ça va ?”
2. One-shot prompting
A single example is given to demonstrate the expected behavior.
Useful when the task requires clarity in format or style.
Example:
User: Translate the following English sentence into French.
Example Input: "Good morning" → Example Output: "Bonjour"
Task Input: "How are you?"
Output: “Comment ça va ?”
3. Few-shot prompting
Multiple examples are given before the actual task.
Works well when tasks are ambiguous or domain-specific.
Example:
Task: Classify the sentiment of the following reviews as Positive or Negative.
Review: "I love this phone, the battery lasts long." → Positive
Review: "The screen cracked within a week." → Negative
Review: "Excellent sound quality and fast processor." → Positive
Now classify: "The camera is blurry and disappointing."
Output: Negative
The following diagram illustrates the examples of in-context learning.
Figure 6: Examples of in-context learning
4.3 Importance of Role Differentiation
When writing few-shot prompts, clearly distinguishing roles (e.g., User: and Assistant: or Q: and A:) helps the model mimic the structure consistently. Without role markers, the model may drift into producing unstructured responses.
For example:
User: What is 2 + 2?
Assistant: 4
User: What is 5 + 3?
Assistant: 8
User: What is 7 + 6?
Assistant:
This encourages the model to continue in the same call-and-response pattern.
4.4 Benefits of In-Context Learning
Flexibility – You can “train” the model on a new task instantly without modifying its parameters.
Rapid prototyping – Great for testing new use cases before investing in fine-tuning.
Control – Helps enforce formatting (e.g., JSON, tables, bullet points).
4.5 Limitations of In-Context Learning
Context length constraints – Too many examples may exceed the model’s context window.
Random sampling – Even with examples, the model may ignore instructions if randomness (temperature, top_p) is high.
Inconsistency – The same examples may yield slightly different outputs.
4.6 Advanced Variants of ICL
Instruction + Demonstration Hybrid: Combine explicit task instructions with few-shot examples for stronger guidance.
Chain-of-Thought with ICL: Provide examples that include reasoning steps, so the model learns to “think out loud” before answering (see the example after this list).
Style Transfer with ICL: Use few-shot examples to enforce a particular writing style (e.g., Shakespearean, academic, casual).
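For instance, a Chain-of-Thought-with-ICL prompt embeds the reasoning steps in the demonstrations themselves (illustrative example):

Q: A shop sells pens at 5 rupees each. How much do 3 pens cost?
A: Each pen costs 5 rupees, so 3 pens cost 3 × 5 = 15 rupees. Answer: 15 rupees
Q: A train travels 60 km per hour. How far does it travel in 2.5 hours?
A: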
5. Chain Prompting: Breaking up the Problem
When dealing with complex tasks, asking a large language model (LLM) to solve everything in a single prompt often leads to suboptimal results. The model may lose focus, misinterpret requirements, or generate incomplete answers. Chain prompting is a structured strategy where we break down a large problem into smaller subtasks, design prompts for each subtask, and then link them sequentially, passing outputs from one prompt as inputs to the next. This creates a pipeline of prompts that together achieve the final solution.
This approach mirrors how humans naturally solve complex problems—by breaking them into manageable steps rather than attempting everything at once.
5.1 Key Benefits of Prompt Chaining
Better Performance
By focusing each prompt on a single subtask, the LLM can generate more accurate and high-quality responses.
Reduces cognitive overload for the model.
Transparency
Each intermediate step in the chain is visible and explainable.
Makes it easier for developers and users to trace how the final output was constructed.
Controllability and Reliability
Developers can adjust or fine-tune only the prompts for the weaker subtasks instead of rewriting the entire large prompt.
More control over model behavior.
Debugging
Since outputs are broken into stages, it’s easier to identify where an error occurs and fix it.
Incremental Improvement
You can evaluate the performance of each subtask independently and selectively improve weak links in the chain.
Conversational Assistants
Useful for designing chatbots where conversation naturally involves sequential reasoning (e.g., clarifying intent → retrieving information → generating response).
Personalization
Chains can be designed to collect user preferences at one step and then apply those preferences consistently across subsequent prompts.
5.2 Common Use Cases
Response Validation
Prompt 1: Generate an answer.
Prompt 2: Ask the model (or another model) to evaluate correctness, consistency, or bias in the answer.
Example: LLM generates an explanation of a concept, then another LLM verifies if the explanation is factually correct.
Parallel Prompts
Sometimes, different subtasks can be run simultaneously.
Example: One prompt generates a list of features, another generates customer pain points, and later prompts merge them to design marketing copy.
Creative Writing / Storytelling
Prompt 1: Generate character descriptions.
Prompt 2: Use characters to generate a plot outline.
Prompt 3: Expand the outline into a full story.
Business Use Case – Marketing Flow
Step 1 (Prompt 1): Generate a catchy product name.
Step 2 (Prompt 2): Using the product name + product features, generate a short slogan.
Step 3 (Prompt 3): Using the product name, features, and slogan, generate a full sales pitch.
This modular approach ensures the final pitch is consistent, creative, and logically structured.
5.3 Prompt Chain Example
The following example illustrates the prompt chain that first creates a product name, then uses this name with product features to create a slogan, and finally uses features, product name, and slogan to create the sales pitch.
Figure 7: Example of a prompt chain
Step 1 – Product Naming
Instruction: “Suggest a creative name for a new smartwatch that focuses on health tracking and long battery life.”
Output: “PulseMate”
Step 2 – Slogan Generation
Instruction: “Using the product name ‘PulseMate’ and the features (health tracking, long battery life), create a short catchy slogan.”
Output: “PulseMate – Your Health, Powered All Day.”
Step 3 – Sales Pitch
Instruction: “Using the product name ‘PulseMate,’ its slogan ‘Your Health, Powered All Day,’ and the features (health tracking, long battery life), write a compelling sales pitch for customers.”
Output: “Meet PulseMate, the smartwatch designed to keep up with your lifestyle. Track your health seamlessly while enjoying a battery that lasts for days. PulseMate—Your Health, Powered All Day.”
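The same chain can be scripted. The following is a minimal sketch; call_llm is a hypothetical helper that wraps whatever model API you use:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM API of choice."""
    raise NotImplementedError

features = "health tracking, long battery life"

# Step 1 – product name
name = call_llm(f"Suggest a creative name for a new smartwatch with these features: {features}. Return only the name.")

# Step 2 – slogan, reusing the name
slogan = call_llm(f"Using the product name '{name}' and the features ({features}), create a short catchy slogan.")

# Step 3 – sales pitch, reusing name + slogan + features
pitch = call_llm(f"Using the product name '{name}', its slogan '{slogan}', and the features ({features}), write a compelling sales pitch.")
print(pitch)
```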
5.4 Variants of Prompt Chaining
Sequential Chaining – Output of one prompt feeds directly into the next (step-by-step). The above example in Figure 7 demonstrates sequential chaining.
Branching Chaining – One output is used to create multiple different paths of prompts.
Merging Chains – Combine results from different parallel chains into a unified final response.
Iterative Chaining – Loop a prompt multiple times for refinement (e.g., “revise this until it’s concise and clear”).
6. Reasoning with Generative Models
LLMs don’t “reason” like humans. They excel at pattern completion over very large text corpora. With careful prompting, scaffolding, and verification, we can simulate aspects of reasoning and markedly improve reliability.
6.1 System 1 vs. System 2 (Kahneman) — and LLMs
System 1 (fast, intuitive): In LLMs this looks like single-shot answers, low token budget, low/no deliberation. Good for well-trodden tasks (grammar fixes, casual Q&A).
System 2 (slow, deliberate): In LLMs this is multi-step prompting, intermediate reasoning, tool use (calculator/RAG), sampling multiple candidates, and verification. Good for math, logic, policy checks, multi-constraint generation, and anything high-stakes.
In practice: choose System 1 for speed/low risk; escalate to System 2 scaffolds when accuracy, traceability, or multi-constraint synthesis matters.
6.2 Techniques to Induce Deliberation
6.2.1 Chain-of-Thought (CoT): “Think before answering”
Elicit intermediate reasoning steps prior to the final answer.
Zero-shot CoT trigger (minimal):
You are solving a reasoning task.
First, think step-by-step in brief bullet points.
Then, give the final answer on a new line prefixed with "Answer:".

Question: <problem>
Few-shot CoT (when format matters): include 1–3 worked examples showing short, crisp reasoning and a clearly marked Answer line.
Tips
Keep thoughts succinct to reduce cost and drift.
For production UIs, you can ask the model to hide the rationale and output only the final answer + a confidence or citation list (see “Reasoning privacy” below).
When to use: arithmetic/logic puzzles, planning, constraint satisfaction, data transformation with edge cases.
The following figure demonstrates standard prompting vs. CoT prompting:
Figure 9: Chain-of-thought example; reasoning process is highlighted – source [3]
The following is an example of zero-shot chain-of-thought.
Figure 10: Example of zero-shot chain-of-thought – source[1]
6.2.2 Self-Consistency: sample multiple rationales and vote
Rather than trusting the first reasoning path, sample k solutions and aggregate.
Template
Task: <problem>
Instruction: Generate a short, step-by-step rationale and final answer. Vary your approach each time.

[Run this prompt k times with temperature ~0.7–1.0]

Aggregator:
- Extract the final answer from each sample.
- Choose the majority answer (tie-break: pick the one supported by the clearest rationale).
- Return "Final:" <answer> and "Support count:" <m/n>.
Practical defaults
k = 5–15 (trade accuracy vs. latency/cost)
temperature: 0.7–1.0
top_p: 0.9–1.0
When to use: problems with one correct output but many valid reasoning paths (math, logical deduction, label inference).
The following diagram illustrates the concept of self-consistency.
Figure 11: Example of self-consistency in CoT[4]
6.2.3 Tree of Thoughts (ToT): explore and evaluate branches
Generalizes CoT into a search over alternative “thoughts” (states). You expand multiple partial solutions, score them, prune weak ones, and continue until a budget is reached.
Lightweight ToT pseudo-workflow
```
state0 = problem description
frontier = [state0]
for depth in 1..D:
    candidates = []
    for s in frontier:
        thoughts = LLM("Propose 2-3 next-step thoughts for: " + s)
        for t in thoughts:
            v = LLM("Rate this partial approach 1-5 for promise. Be strict.\nThought: " + t)
            candidates.append((t, v))
    frontier = top_k(candidates, by=v, k=K)
best = argmax(frontier, by=v)
answer = LLM("Given this best chain of thoughts, produce the final answer:\n" + best)
```
Tuning knobs
D (max depth), K (beam width), value function (how you score thoughts), and token budget.
Use “look-ahead” prompts: “Simulate next two steps; if dead-end, backtrack.”
When to use: multi-step planning (itineraries, workflows), puzzle solving, coding strategies, complex document transformations.
The following diagram illustrates the various approaches to problem-solving with LLMs. Each rectangular box represents a thought.
Figure 12: Various approaches to problem-solving with LLMs.
6.2.4 Related, practical reasoning scaffolds
ReAct (Reason + Act): Interleave “Thought → Action (tool call/RAG) → Observation” until done. Great for tasks that need tools, search, or databases.
Program-of-Thoughts (PoT): Ask the model to output code (e.g., Python) to compute the answer; execute it; return the result. Excellent for math, data wrangling, and reproducibility (a minimal sketch follows this list).
Debate / Critic-Judge: Have model A propose an answer, model B critique it, and a judge (or the same model) select/merge. Pairs well with self-consistency.
Plan-then-Execute: Prompt 1 creates a plan/checklist; Prompt 2 executes step by step; Prompt 3 verifies outputs against the plan.
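A minimal Program-of-Thoughts sketch, again with a hypothetical call_llm helper; note that executing model-generated code should only ever be done in a sandboxed environment:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM API of choice."""
    raise NotImplementedError

question = "What is the sum of the first 100 positive integers?"
code = call_llm(
    "Write Python code that computes the answer to the question below "
    "and stores it in a variable named `result`. Return only the code.\n"
    f"Question: {question}"
)

namespace = {}
exec(code, namespace)  # ⚠️ run only in a sandbox; model output is untrusted
print(namespace.get("result"))
```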
6.3 Putting it together: a robust System-2 pipeline
Use case: Policy compliance check for marketing copy.
Extract constraints (CoT): “List policy rules relevant to social ads, each with an ID and short paraphrase.”
Assess violations (ReAct/PoT): For each rule, analyze the ad text; return pass|fail with span references.
Self-consistency vote: Sample assessments 7× and majority-vote each rule outcome.
Summarize & justify: Compose a final verdict with a table of rules, decisions, and cited spans.
Verifier pass: A separate prompt re-checks logical consistency and that every failure has evidence.
Guarded output: Enforce a JSON schema and redact PII (personally identifiable information).
This gives you accuracy (deliberation), transparency (artifacts per step), and control (schema + verifier).
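A minimal controller sketch of how these steps might be wired together, assuming `call_llm()`, `majority_vote()`, and `validate_schema()` helpers that you supply:

```python
# Illustrative wiring of the compliance-check pipeline above (not a specific library).
def check_ad_compliance(ad_text: str, policy_doc: str, k: int = 7) -> dict:
    # Step 1: extract relevant rules as structured data
    rules = call_llm(f"List policy rules relevant to social ads from:\n{policy_doc}\n"
                     "Return JSON: [{'id': ..., 'rule': ...}]")
    # Steps 2-3: assess each rule k times and majority-vote the outcomes
    votes = []
    for _ in range(k):
        votes.append(call_llm(
            f"For each rule in {rules}, decide pass|fail for this ad, citing spans:\n{ad_text}",
            temperature=0.8))
    decisions = majority_vote(votes)
    # Step 4: compose the verdict
    verdict = call_llm(f"Summarize a final verdict with a table of rules, decisions, and cited spans:\n{decisions}")
    # Step 5: verifier pass
    verified = call_llm(f"Re-check that every failure in {verdict} cites evidence. "
                        "Return the corrected verdict as JSON.")
    # Step 6: guarded output (schema enforcement, PII redaction)
    return validate_schema(verified)
```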
6.4 Operational Guidance
6.4.1 Prompt templates
CoT (short)
```
Solve the problem. First give 3-5 brief reasoning bullets.
Then output the final result as: "Answer: <value>".
Question: <...>
```
Self-consistency runner (controller code)
```python
# call_llm(), extract_final(), and majority_vote() are helpers you supply.
answers = []
for i in range(k):
    ans = call_llm(prompt, temperature=0.8, top_p=0.95)
    answers.append(extract_final(ans))
final = majority_vote(answers)
```
ReAct skeleton
```
Thought: I need the latest spec section.
Action: search("<query>")
Observation: <top snippet>
Thought: Summarize the relevant passage and apply the rule.
...
Final Answer: <concise verdict + citation>
```
ToT node expansion
```
Propose 3 distinct next-step ideas to advance the solution.
For each: give a one-sentence rationale and a 1-5 promise score.
Return JSON: [{"idea": ..., "rationale": ..., "score": ...}]
```
6.5 Evaluation & QA
Once we design and deploy prompts, evaluation and quality assurance (QA) become critical. Unlike traditional software, where behavior is deterministic, LLM outputs are probabilistic and context-dependent. This means even well-designed prompts may fail in certain conditions. A structured evaluation strategy helps measure reliability, accuracy, and efficiency of your prompt-engineering pipeline.
Evaluation can be broadly divided into four dimensions: task accuracy, process metrics, ablations, and cost/latency.
1. Task Accuracy – Measuring End Results
The first dimension is whether the model actually solves the task correctly. Depending on the nature of the application, different metrics apply:
Exact Match (EM): Used for tasks where there is a single correct answer (e.g., classification, math problems, SQL query generation). Checks if the model output matches the ground truth exactly.
F1 Score: Measures overlap between predicted tokens and ground-truth tokens, balancing precision and recall. Common for QA and NER tasks.
pass@k: Especially used in code generation, where we test if any of the top-k sampled outputs are correct (e.g., pass@1, pass@10).
BLEU / ROUGE: Standard metrics for summarization, translation, and text generation tasks, where multiple valid outputs may exist.
Domain-Specific Metrics:
Medical: accuracy of ICD codes, dosage consistency.
Finance: correctness of risk scores, compliance alignment.
Task accuracy answers: Did the model get it right?
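As a quick illustration, here are minimal sketches of two of the metrics above (exact match and pass@k); real evaluations would normalize answers and handle edge cases more carefully:

```python
# Illustrative accuracy metrics (toy data, no external dependencies).
def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip().lower() == gold.strip().lower()

def pass_at_k(samples: list[str], is_correct) -> bool:
    # pass@k for one problem: did any of the k sampled outputs pass the check?
    return any(is_correct(s) for s in samples)

# Example: exact match over a small eval set
preds = ["Paris", "42", "negative"]
golds = ["Paris", "41", "negative"]
em = sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
print(f"Exact match: {em:.2f}")  # 0.67
```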
2. Process Metrics – Evaluating the Reasoning Path
Sometimes the final answer looks right, but the process is flawed. Evaluating intermediate reasoning steps ensures robustness:
Step Validity Rate: In CoT or ToT prompting, check if each intermediate reasoning step is logically valid.
Verifier Agreement: Use an external verifier model (or human annotators) to check whether the reasoning aligns with domain knowledge.
Citation Coverage: For knowledge-grounded tasks, measure how many claims in the output are backed by explicit references (retrieved documents, database entries).
Hallucination Rate: % of outputs containing unsupported or fabricated facts.
Process metrics answer: Did the model follow a sound reasoning path, not just guess the final answer?
3. Ablation Studies – Quantifying the Effect of Prompting Techniques
Prompt engineering often involves experimenting with different prompting strategies. Ablation studies allow us to isolate what works best by systematically varying one factor at a time.
Single-Shot vs. CoT (Chain-of-Thought): Compare baseline prompts against CoT prompting to measure reasoning improvements.
CoT vs. CoT+SC (Self-Consistency): Test whether sampling multiple reasoning paths and aggregating improves accuracy.
ToT (Tree-of-Thought): Compare CoT vs. ToT to see if deliberate multi-path exploration boosts complex problem-solving.
Role of Examples: Zero-shot vs. one-shot vs. few-shot performance.
This helps quantify the lift (improvement in accuracy or reasoning reliability) due to advanced prompting.
Ablations answer: Which prompting strategy gives the best performance trade-off?
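A sketch of what such an ablation loop can look like, assuming `call_llm()`, `build_prompt()`, and `extract_final()` helpers and a toy eval set:

```python
# Run each prompting variant over the same eval set and compare accuracy.
variants = ["single_shot", "cot", "cot_sc"]
eval_set = [("What is 17 * 3?", "51"), ("Capital of France?", "Paris")]  # toy examples

results = {}
for variant in variants:
    correct = 0
    for question, gold in eval_set:
        answer = extract_final(call_llm(build_prompt(variant, question)))
        correct += int(answer.strip() == gold)
    results[variant] = correct / len(eval_set)

print(results)  # e.g. {"single_shot": 0.5, "cot": 1.0, "cot_sc": 1.0}
```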
4. Cost & Latency – Operational Constraints
In production, even the most accurate system fails if it’s too slow or expensive. Evaluation must include efficiency metrics:
Tokens per Step: Track how many tokens are consumed per prompt and per reasoning step. Helps understand scaling behavior.
Cache Intermediate Artifacts: Save partial reasoning outputs (e.g., retrieved documents, intermediate JSONs) to avoid recomputation.
Latency per Request: Time taken for one query end-to-end (prompt → LLM → post-processing).
Cost per Query: Estimate the dollar cost per API call or GPU inference, especially with multi-step chains (CoT, ToT).
Trade-off Curves: Accuracy vs. cost/latency curves, to decide the optimal configuration for production.
Cost/latency metrics answer: Is the solution practical at scale?
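A small sketch of per-request tracking, assuming a `call_llm()` wrapper that returns the generated text plus token usage, and a hypothetical `PRICE_PER_1K_TOKENS` rate:

```python
import time

PRICE_PER_1K_TOKENS = 0.002  # hypothetical rate; substitute your provider's pricing

def timed_call(prompt: str) -> dict:
    # Measure end-to-end latency around the LLM call.
    start = time.perf_counter()
    text, prompt_tokens, completion_tokens = call_llm(prompt)  # assumed wrapper
    latency = time.perf_counter() - start

    total_tokens = prompt_tokens + completion_tokens
    return {
        "latency_s": round(latency, 3),
        "tokens": total_tokens,
        "cost_usd": round(total_tokens / 1000 * PRICE_PER_1K_TOKENS, 6),
        "text": text,
    }
```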
A robust evaluation framework should combine accuracy, process validity, ablations, and cost tracking. Only then can we say our prompt engineering strategy is not just clever in theory, but reliable, efficient, and production-ready.
6.6 Safety & reliability
Ensuring safety and reliability in prompt engineering is one of the most critical aspects when deploying LLM-powered applications in production. Without guardrails, models may generate unsafe, incoherent, or unpredictable responses that can result in privacy leaks, reputational damage, or compliance violations. This section outlines key strategies for strengthening the robustness of generative AI systems.
🔒 Reasoning Privacy
Hidden rationale vs. exposed reasoning:
When using techniques like Chain-of-Thought (CoT) prompting, models produce intermediate reasoning steps. While useful for debugging or internal evaluation, exposing these steps to end users may inadvertently leak sensitive information, such as internal rules, confidential business logic, or hints about training data.
Best practice: Allow the model to perform its reasoning “behind the scenes,” but only expose the concise, final answer in the user-facing product. This keeps user interactions clean, prevents information leakage, and reduces the risk of misuse.
🛡️ Guardrails
Guardrails act as safety filters and structural enforcements that make outputs predictable, secure, and policy-compliant. They operate at two levels:
Structural Guardrails
Constrain model outputs using:
JSON schemas → ensure the output always matches a machine-parseable format.
Frameworks such as Guardrails AI, LMQL, or Guidance provide programmatic ways to enforce these constraints.
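As a minimal illustration of a structural guardrail, the sketch below validates a model's raw output against a JSON schema using the `jsonschema` package; the schema itself is a hypothetical example:

```python
import json
from jsonschema import validate

# Hypothetical schema for a compliance verdict.
REPLY_SCHEMA = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string", "enum": ["pass", "fail"]},
        "evidence": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["verdict", "evidence"],
}

def guard_output(raw_output: str) -> dict:
    data = json.loads(raw_output)                 # raises if not valid JSON
    validate(instance=data, schema=REPLY_SCHEMA)  # raises ValidationError on mismatch
    return data
```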
⚙️ Determinism Knobs
LLMs are inherently probabilistic, meaning the same prompt may yield different outputs on different runs. For enterprise-grade reliability, determinism can be controlled via:
Lowering Temperature
Reduces randomness in token sampling.
At temperature = 0, the model becomes nearly deterministic, always picking the most probable token.
Self-Consistency with Majority Voting
Instead of accepting a single output, the model generates multiple reasoning paths (using Chain-of-Thought).
A majority vote across outputs ensures stability and reduces the impact of outlier generations.
Example: In a math problem, the model might produce 5 possible solutions; by selecting the most common final answer, reliability improves.
Safety and reliability in prompt engineering require balancing privacy (hidden reasoning), structural/content guardrails (schemas, filters, citations), and deterministic controls (temperature, self-consistency). These practices make LLM-powered systems not only smarter but also trustworthy, compliant, and production-ready.
6.7 When not to use heavy reasoning
While techniques like Chain-of-Thought (CoT), Tree-of-Thought (ToT), or self-consistency sampling can significantly improve reasoning quality in Large Language Models (LLMs), they are not always the right choice. Heavy reasoning often comes at the cost of latency, cost, and computational overhead. In certain contexts, it is better to avoid them altogether and rely on simpler, faster prompting strategies.
Here are situations where heavy reasoning is unnecessary or even counterproductive:
🔹 1. Simple, Well-Known Tasks Where Single-Shot Responses Are Accurate
Not every task requires multiple reasoning steps. If the task has a clear, deterministic answer and can be handled with a single-shot prompt, adding chain-of-thought or multi-step reasoning only adds complexity without benefit.
Examples:
Asking factual questions with unambiguous answers: “What is the capital of France?” → “Paris”
Formatting tasks: Convert 1234 to Roman numerals → MCCXXXIV
Standardized classification: Sentiment analysis of short product reviews → “Positive/Negative”
👉 In such cases, heavy reasoning only increases inference time and may even introduce noise (e.g., overthinking a trivial fact).
🔹 2. Ultra-Tight Latency Budgets
Reasoning methods like CoT or ToT require more tokens because they expand the answer into intermediate steps before concluding. This makes them slower and more expensive.
If the application has strict response time requirements, such as:
Customer support chatbots expected to respond in <1 second.
Voice assistants where delays break the conversational flow.
High-frequency trading AI where every millisecond counts.
👉 In these scenarios, it’s better to stick with direct, single-shot answers or pre-validated responses instead of reasoning chains. Latency constraints make heavy reasoning impractical.
🔹 3. Very Small Models Without Enough Capacity or Context Window
Advanced reasoning prompts assume the model has sufficient capacity (parameters) and context length to simulate multi-step reasoning. Very small models (e.g., <1B parameters, or edge-deployed models with small context windows) often fail to benefit from reasoning prompts because they:
Forget earlier reasoning steps due to short context limits.
Generate incoherent or circular reasoning when asked to “think step by step”.
Struggle to hold multiple candidate reasoning paths in memory (needed for self-consistency or ToT).
Example:
Running CoT on a mobile LLM with 1B parameters may just produce verbose, repetitive text instead of genuine reasoning.
👉 For such models, it is better to use direct, concise prompting and offload complex reasoning to a larger backend model if required.
⚖️ Trade-Offs: Accuracy vs. Efficiency
| Scenario | Reasoning Style | Why |
|---|---|---|
| Simple, factual Q&A | Single-shot | Faster, cheaper, reliable |
| Creative writing | CoT / ToT | Requires exploration & coherence |
| Real-time chatbot | Single-shot / lightweight CoT | Minimize latency |
| Legal/medical analysis | CoT + self-consistency | Accuracy more important than speed |
| Edge device app | Single-shot | Small models can't handle CoT well |
Heavy reasoning should be used selectively. It shines in complex, ambiguous, or multi-step reasoning problems, but for simple, latency-sensitive, or resource-constrained scenarios, sticking with direct prompting leads to better user experience and system efficiency.
In real-world deployments, verifying and controlling the output of generative AI models is crucial to ensure safety, robustness, and reliability. LLMs, while powerful, are prone to errors, hallucinations, ethical risks, or unstructured responses that can cause failures in production systems.
Without proper verification, issues such as malformed data, offensive content, or incorrect facts can undermine user trust and lead to business or compliance risks.
7.1 Why Output Verification Matters
Structured Output
Many applications require the output in machine-readable formats (e.g., JSON, XML, CSV).
An unstructured answer can break downstream systems expecting strict schemas.
Valid Output Choices
Even if the model is instructed to choose among fixed options (e.g., “positive” or “negative”), it may generate something outside the list (e.g., “neutral” or “very positive”).
Output validation ensures strict adherence to predefined categories.
Ethical Compliance
Outputs must be free of profanity, bias, harmful stereotypes, or PII (Personally Identifiable Information).
Regulatory compliance (GDPR, HIPAA, etc.) requires strict filtering of sensitive or discriminatory outputs.
Accuracy and Reliability
LLMs can hallucinate — produce factually wrong but confident-sounding information.
Verification steps such as grounding with external knowledge bases or post-checking factual claims can prevent misinformation.
7.2 Methods to Control Output
Apart from tweaking generation parameters like temperature (randomness) and top_p (nucleus sampling), there are three primary strategies for enforcing correct outputs:
7.2.1 Providing Examples of the Desired Output
Supply the model with examples of desired output in the correct format (e.g., JSON, Markdown tables).
The model uses these as patterns to mimic.
Example Prompt:
{"name": "Alice","sentiment": "positive"}
Now classify the following: Input: “The movie was fantastic!” Output:
Limitations:
Models may still deviate, especially under ambiguous inputs.
Reliability varies across models — some are better at following formatting instructions than others.
7.2.2 Grammar-Based Constrained Sampling
Instead of relying only on examples, grammars and constraints can be enforced at the token generation level. This guarantees that outputs match the expected structure.
Techniques & Tools:
🔹 Guidance
A framework for programmatically controlling LLM outputs.
Uses regex, context-free grammars (CFGs), and structured templates.
Supports conditionals, loops, and tool calls inside prompt templates.
Advantage: Reduced cost and latency compared to brute-force fine-tuning.
🔹 Guardrails
Python framework to build safe, reliable AI pipelines.
Key features:
Input/Output Guards to catch risks (bias, toxicity, PII leaks).
Schema enforcement (ensures outputs comply with JSON, XML, etc.).
Ecosystem of reusable validators via Guardrails Hub.
Example: Ensuring LLM output is a safe, validated JSON object representing a chatbot reply.
🔹 LMQL (Language Model Query Language)
Specialized programming language for LLM prompting.
Provides types, templates, and constraints for robust prompting.
Runtime ensures the model adheres to the defined schema during decoding.
Low-level Constrained Decoding Example (llama-cpp-python):
response = llm("Classify the sentiment.",response_format= {"type": "json_object"})
Forces the model to output a valid JSON object instead of free text.
7.2.3 Fine-Tuning for Desired Outputs
How it works:
Retrain or fine-tune the base model on domain-specific datasets that already contain the desired output style.
Example: A customer support LLM fine-tuned only on safe, structured responses in JSON.
Benefits:
Reduces variance and unpredictability.
Makes structured outputs more native to the model (less prompt engineering overhead).
Limitations:
Requires labeled data in the target output format.
Costly and time-consuming compared to prompting or grammar constraints.
7.3 Output Verification Pipeline (Best Practice)
A robust production system often combines multiple techniques:
Prompt-level control → Provide few-shot examples of structured output.
Post-generation validation → Apply validators for ethics, factuality, and compliance.
Fallback mechanism → If verification fails, rerun the model with tighter constraints or route to a human-in-the-loop system.
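A compact sketch of this pipeline, assuming a `call_llm()` wrapper and an `escalate_to_human()` hook for the fallback path:

```python
import json
from jsonschema import validate, ValidationError

def generate_verified(prompt: str, schema: dict, max_retries: int = 2):
    for attempt in range(max_retries + 1):
        # Tighten sampling on retries for more deterministic output.
        raw = call_llm(prompt, temperature=0.2 if attempt else 0.7)
        try:
            data = json.loads(raw)
            validate(instance=data, schema=schema)  # structural check
            return data                             # passed verification
        except (json.JSONDecodeError, ValidationError):
            prompt += "\nReturn ONLY valid JSON matching the schema."
    return escalate_to_human(prompt)                # fallback: human-in-the-loop
```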
Output verification transforms LLMs from unpredictable text generators into reliable components of enterprise systems. By combining structured examples, constrained grammar, and fine-tuning, developers can build trustworthy AI applications that are safe, accurate, and production-ready.
References
Book: O'Reilly – Hands-On Large Language Models: Language Understanding and Generation by Jay Alammar & Maarten Grootendorst
Master semantic document clustering using embeddings, UMAP, HDBSCAN, and BERTopic
Quick Start: 5-Minute Text Clustering
Want to see text clustering in action immediately? Here’s a minimal working example:
```python
# Install: pip install bertopic
from bertopic import BERTopic

# Your documents
docs = [
    "Machine learning is transforming how we process data",
    "Deep learning uses neural networks with multiple layers",
    "Natural language processing enables computers to understand text",
    "Computer vision allows machines to interpret visual information",
    "Reinforcement learning trains agents through rewards and penalties",
    "Neural networks are inspired by biological brain structures",
    "Text mining extracts valuable insights from unstructured data",
    "Image recognition has applications in healthcare and security",
]

# Cluster in 3 lines
topic_model = BERTopic(min_topic_size=2, verbose=True)
topics, probs = topic_model.fit_transform(docs)

# View results
print(topic_model.get_topic_info())
```
That’s it! Now let’s dive into how this works and how to customize it for production use.
1. What is Text Clustering with LLMs?
Have you ever needed to organize thousands of customer feedback responses? Or group millions of research papers by topic? Or identify emerging themes in social media conversations?
Text clustering is the unsupervised machine learning technique of grouping similar documents based on their semantic content, meaning, and relationships. Unlike classification, which requires labeled data, clustering automatically discovers patterns in unstructured text.
The Problem We’re Solving
With recent advancements in Large Language Models (LLMs), we can now obtain extremely precise contextual and semantic representations of text. This has revolutionized text clustering’s effectiveness compared to traditional methods.
Traditional approach limitations:
❌ Bag-of-words models ignore context (“bank” always means the same thing)
❌ TF-IDF misses semantic relationships (“car” and “automobile” treated as different)
❌ Fixed vocabulary can’t handle synonyms or related concepts
❌ No understanding of actual word meanings
LLM-based clustering advantages:
✅ Contextual embeddings capture meaning (“river bank” vs “financial bank”)
✅ Semantic similarity across different words (“ML” ≈ “machine learning”)
✅ Handles synonyms and related concepts automatically
✅ Works across multiple languages with same model
Key Use Cases for Text Clustering
1. Customer Feedback Analysis
Group support tickets by issue type automatically
Identify recurring problems in product reviews
Discover emerging customer concerns before they become critical
Prioritize feature requests based on cluster size
2. Research Organization
Cluster academic papers by research topic
Discover emerging research trends in your field
Find related work for literature reviews
Organize large document repositories semantically
3. Content Management
Automatically categorize blog posts and articles
Group similar documents for easier navigation
Improve search relevance with semantic grouping
Enable topic-based content discovery
4. Data Quality & Labeling
Identify outliers and anomalies in datasets
Detect mislabeled data by finding odd cluster assignments
Find duplicate or near-duplicate content
Accelerate manual labeling by clustering first
5. Market Intelligence
Analyze competitor mentions and sentiment
Track brand perception across clusters
Identify market segments in customer data
Monitor emerging trends in social media
2. Why Use LLMs for Text Clustering?
Traditional Clustering Methods and Their Limitations
Before LLMs, text clustering relied on simpler, less effective representations:
TF-IDF + K-Means: The Classic Approach
```python
# Traditional approach
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Convert text to numbers (loses meaning!)
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(documents)

# Must specify number of clusters upfront
kmeans = KMeans(n_clusters=10, random_state=42)
clusters = kmeans.fit_predict(tfidf_matrix)
```
Critical limitations:
Treats “king” and “queen” as completely unrelated words
Misses that “car” and “automobile” mean the same thing
Can’t understand negation: “not good” vs “good” look similar
Latent Dirichlet Allocation (LDA), another common baseline, has similar drawbacks:
```python
from sklearn.decomposition import LatentDirichletAllocation

# Also requires specifying the number of topics
lda = LatentDirichletAllocation(n_components=10, random_state=42)
topic_distributions = lda.fit_transform(tfidf_matrix)
```
Key problems:
Assumes bag-of-words (word order doesn’t matter)
No semantic understanding of relationships
Fixed number of topics must be specified
Poor performance on short texts (tweets, reviews)
Topics are just probability distributions over words
2. Semantic Similarity
Embeddings place related terms close together even when the surface words differ:
“NLP” ≈ “natural language processing” ≈ “text analysis”
3. Multilingual Magic
Same model handles multiple languages in unified semantic space:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = [
    "Machine learning is powerful",           # English
    "El aprendizaje automático es poderoso",  # Spanish
    "機械学習は強力です",                       # Japanese
    "机器学习很强大",                           # Chinese
]

# All embedded in the same semantic space!
embeddings = model.encode(docs)
# Similar concepts cluster together regardless of language
```
4. Understanding Negation
```python
# Traditional TF-IDF treats these as 75% similar (3 of 4 words match)
doc1 = "The movie was not good at all"
doc2 = "The movie was very good overall"

# LLM embeddings correctly identify opposite sentiments
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
emb1 = model.encode([doc1])[0]
emb2 = model.encode([doc2])[0]

similarity = cosine_similarity([emb1], [emb2])[0][0]
print(f"Similarity: {similarity:.3f}")  # Low similarity (0.234)
# LLM correctly understands "not good" ≠ "good"
```
3. The Text Clustering Pipeline: 3-Stage Approach
Text clustering with LLMs follows a systematic three-stage pipeline. Each stage is crucial for high-quality results.
Figure 1: Complete text clustering pipeline
Pipeline Overview
Let’s explore each stage in detail.
Stage 1: Document Embedding
Goal: Transform text documents into dense numerical vectors that preserve semantic meaning.
Why Embeddings?
Computers can’t process text directly. We need numbers that capture meaning:
```
# ❌ Bad: Simple encoding loses all meaning
"I love this movie" → [1, 2, 3, 4]
"I hate this movie" → [1, 5, 3, 4]
# These look similar (3 of 4 numbers match) but mean opposite things!

# ✅ Good: Embeddings capture semantic relationships
"I love this movie" → [0.8, 0.2, 0.9, -0.1, ...]   # Positive sentiment
"I hate this movie" → [-0.7, 0.1, -0.8, 0.2, ...]  # Negative sentiment
# These are far apart in embedding space (correct!)
```
Choosing the Right Embedding Model
This is the most important decision for your clustering system!
Popular Embedding Models:
| Model Name | Dimensions | Size | Speed | Quality | Best For |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80MB | ⚡⚡⚡ Fast | ⭐⭐⭐⭐ Good | General purpose, prototyping |
| all-mpnet-base-v2 | 768 | 420MB | ⚡⚡ Medium | ⭐⭐⭐⭐⭐ Excellent | Production systems |
| stella-en-400M-v5 | 1024 | 1.6GB | ⚡ Slow | ⭐⭐⭐⭐⭐ Best | Maximum accuracy |
| e5-large-v2 | 1024 | 1.3GB | ⚡ Slow | ⭐⭐⭐⭐⭐ Best | Research |
| paraphrase-multilingual | 768 | 970MB | ⚡⚡ Medium | ⭐⭐⭐⭐ Good | 50+ languages |
How to choose:
```python
# For speed and efficiency (good starting point)
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# ✅ Fast: ~1000 docs/second
# ✅ Small: 80MB download
# ⚠️ Quality: Good but not best

# For best quality (recommended for production)
embedding_model = SentenceTransformer("all-mpnet-base-v2")
# ✅ Quality: Excellent accuracy
# ⚠️ Speed: ~400 docs/second
# ⚠️ Size: 420MB

# For domain-specific text
embedding_model = SentenceTransformer("allenai/specter")  # Scientific papers
embedding_model = SentenceTransformer("finbert")          # Financial documents
```
Generating Embeddings: Complete Implementation
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Step 1: Initialize model
print("Loading embedding model...")
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
print(f"✓ Model loaded")
print(f"✓ Embedding dimensions: {embedding_model.get_sentence_embedding_dimension()}")

# Step 2: Prepare your documents
documents = [
    "Deep learning revolutionizes computer vision applications",
    "Neural networks learn hierarchical feature representations",
    "Natural language processing enables human-computer interaction",
    "Reinforcement learning optimizes decision-making agents",
    # ... add thousands more documents
]
print(f"✓ Prepared {len(documents)} documents for embedding")

# Step 3: Generate embeddings (batch processing for efficiency)
print("Generating embeddings...")
embeddings = embedding_model.encode(
    documents,
    batch_size=32,              # Process 32 documents at once
    show_progress_bar=True,     # Visual progress feedback
    convert_to_numpy=True,      # Return as NumPy array
    normalize_embeddings=True,  # L2 normalize for faster cosine similarity
)
print(f"✓ Generated embeddings: {embeddings.shape}")
# Output: (5000, 384) means 5000 documents, 384 dimensions each

# Step 4: Quality check - verify semantic similarity works
from sklearn.metrics.pairwise import cosine_similarity

doc1 = "Machine learning models learn from data"
doc2 = "ML algorithms are trained on datasets"
doc3 = "I enjoy eating chocolate cake"

emb1 = embedding_model.encode([doc1])[0]
emb2 = embedding_model.encode([doc2])[0]
emb3 = embedding_model.encode([doc3])[0]

sim_related = cosine_similarity([emb1], [emb2])[0][0]
sim_unrelated = cosine_similarity([emb1], [emb3])[0][0]

print(f"\n✓ Quality check:")
print(f"  Similar docs similarity: {sim_related:.3f}")      # Should be high (>0.7)
print(f"  Different docs similarity: {sim_unrelated:.3f}")  # Should be low (<0.3)

if sim_related > 0.7 and sim_unrelated < 0.3:
    print("  ✅ Embeddings are working correctly!")
else:
    print("  ⚠️ Warning: Embeddings may need adjustment")
```
Expected output:
```
Loading embedding model...
✓ Model loaded
✓ Embedding dimensions: 384
✓ Prepared 5000 documents for embedding
Generating embeddings...
100%|██████████| 157/157 [00:45<00:00, 3.47it/s]
✓ Generated embeddings: (5000, 384)

✓ Quality check:
  Similar docs similarity: 0.847
  Different docs similarity: 0.156
  ✅ Embeddings are working correctly!
```
Pro Tips for Embeddings
```python
# Tip 1: Save embeddings to avoid recomputing
np.save("my_embeddings.npy", embeddings)
# Later: embeddings = np.load("my_embeddings.npy")

# Tip 2: Use GPU for faster processing
embedding_model = SentenceTransformer("all-mpnet-base-v2", device="cuda")

# Tip 3: Handle very long documents (>512 tokens)
long_embeddings = embedding_model.encode(
    long_documents,
    batch_size=16,  # Reduce batch size for long docs
    show_progress_bar=True,
)

# Tip 4: Process in chunks for massive datasets
def embed_large_dataset(docs, chunk_size=10000):
    all_embeddings = []
    for i in range(0, len(docs), chunk_size):
        chunk = docs[i:i + chunk_size]
        chunk_emb = embedding_model.encode(chunk, show_progress_bar=True)
        all_embeddings.append(chunk_emb)
    return np.vstack(all_embeddings)
```
Stage 2: Dimensionality Reduction with UMAP
Goal: Reduce embedding dimensions while preserving cluster structure.
Why Reduce Dimensions?
The problem: Embeddings are high-dimensional (384-1024 dimensions)
📊 Impossible to visualize
🐌 Clustering algorithms slow on high dimensions
📏 Distance metrics less meaningful (“curse of dimensionality”)
💾 Memory intensive for large datasets
The solution: UMAP (Uniform Manifold Approximation and Projection)
UMAP reduces dimensions while preserving both local neighborhoods and global structure.
UMAP Implementation
```python
from umap import UMAP
import numpy as np

# Configure UMAP
print("Configuring UMAP for dimensionality reduction...")
umap_model = UMAP(
    n_neighbors=15,   # Balance local vs global structure
    n_components=5,   # Reduce to 5 dimensions for clustering
    min_dist=0.0,     # Allow tight clusters (0.0 = tightest)
    metric='cosine',  # Best for normalized embeddings
    random_state=42,  # Reproducibility
    verbose=True,     # Show progress
)

# Apply dimensionality reduction
print(f"Reducing {embeddings.shape} embeddings to 5 dimensions...")
print("(This takes 5-15 minutes for large datasets)")
reduced_embeddings = umap_model.fit_transform(embeddings)
print(f"✓ Reduced to: {reduced_embeddings.shape}")
# Output: (5000, 5) - 5000 documents, 5 dimensions

# Verify dimensionality reduction quality
from sklearn.metrics import pairwise_distances

# Sample 1000 points for speed
sample_idx = np.random.choice(len(embeddings), 1000, replace=False)
orig_distances = pairwise_distances(embeddings[sample_idx], metric='cosine')
reduced_distances = pairwise_distances(reduced_embeddings[sample_idx], metric='euclidean')

# Calculate correlation (should be high)
correlation = np.corrcoef(orig_distances.flatten(), reduced_distances.flatten())[0, 1]
print(f"✓ Distance preservation: {correlation:.3f}")
# Good: >0.7, Excellent: >0.8
```
Understanding UMAP Parameters
1. n_neighbors (default: 15)
Controls balance between local and global structure:
```python
# Small values: Focus on local structure, more small clusters
umap_local = UMAP(n_neighbors=5, n_components=5)
# Use when: Want to find fine-grained clusters

# Medium values: Balanced (recommended)
umap_balanced = UMAP(n_neighbors=15, n_components=5)  # ✅ Start here
# Use when: General purpose clustering

# Large values: Focus on global structure, fewer large clusters
umap_global = UMAP(n_neighbors=50, n_components=5)
# Use when: Want broad topic categories
```
2. n_components (default: 5)
Number of dimensions to reduce to:
```python
# For clustering: 5-10 dimensions
umap_clustering = UMAP(n_components=5)  # ✅ Recommended for clustering

# For 2D visualization
umap_viz = UMAP(n_components=2)  # For scatter plots only

# For 3D visualization
umap_3d = UMAP(n_components=3)  # For interactive 3D plots
```
3. min_dist (default: 0.0)
Controls cluster tightness:
```python
# Tight clusters (recommended for clustering)
umap_tight = UMAP(min_dist=0.0)  # ✅ For clustering

# Spread out (better for visualization)
umap_spread = UMAP(min_dist=0.3)  # For visual exploration
```
4. metric (default: ‘euclidean’)
Distance metric to use:
```python
# For normalized embeddings (recommended)
umap_cosine = UMAP(metric='cosine')  # ✅ Best for text embeddings

# For non-normalized embeddings
umap_euclidean = UMAP(metric='euclidean')

# Other options
umap_manhattan = UMAP(metric='manhattan')
```
UMAP vs PCA Comparison
```python
from sklearn.decomposition import PCA

# PCA: Linear, fast, but loses non-linear structure
pca_model = PCA(n_components=5, random_state=42)
pca_reduced = pca_model.fit_transform(embeddings)
# ⚠️ Only captures linear relationships
# ✅ Very fast (seconds vs minutes)
# ⚠️ May lose important cluster structure

# UMAP: Non-linear, slower, preserves structure
umap_model = UMAP(n_components=5, random_state=42)
umap_reduced = umap_model.fit_transform(embeddings)
# ✅ Preserves non-linear cluster topology
# ⚠️ Slower (minutes vs seconds)
# ✅ Better for clustering quality

# Recommendation: Use UMAP for production, PCA for quick prototyping
```
Stage 3: Clustering with HDBSCAN
HDBSCAN stands out among clustering algorithms because it:
1. Discovers the number of topics automatically (no K specification)
2. Handles both dense and sparse clusters equally well
3. Explicitly identifies outliers (rather than forcing noise into good clusters)
4. Provides hierarchical structure (see topic relationships)
5. Has robust parameters (less tuning, more stable)
This is why BERTopic, the leading topic modeling framework, chose HDBSCAN as its default clustering algorithm.
Goal: Group similar documents into clusters and identify outliers.
Why HDBSCAN?
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is perfect for text clustering because it:
✅ No K required – Automatically finds the optimal number of clusters
✅ Finds outliers – Explicitly identifies documents that don't fit (cluster -1)
✅ Varying densities – Can find both large and small clusters
✅ Hierarchical – Shows topic relationships
✅ Deterministic – The same reduced embeddings always produce the same clusters
HDBSCAN Implementation
```python
from hdbscan import HDBSCAN
import numpy as np
import pandas as pd

# Configure HDBSCAN
print("Configuring HDBSCAN for clustering...")
hdbscan_model = HDBSCAN(
    min_cluster_size=15,             # Minimum 15 documents per cluster
    min_samples=10,                  # Conservative outlier detection
    metric='euclidean',              # Standard for reduced embeddings
    cluster_selection_method='eom',  # Excess of Mass (recommended)
    prediction_data=True,            # Enable soft clustering prediction
    core_dist_n_jobs=-1,             # Use all CPU cores
)

# Fit and predict clusters
print(f"Clustering {len(reduced_embeddings)} documents...")
print("(This takes 2-5 minutes)")
clusters = hdbscan_model.fit_predict(reduced_embeddings)

# Analyze results
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_outliers = list(clusters).count(-1)
outlier_pct = n_outliers / len(clusters) * 100

print(f"\n{'='*60}")
print("CLUSTERING RESULTS")
print(f"{'='*60}")
print(f"✓ Total documents: {len(clusters):,}")
print(f"✓ Clusters discovered: {n_clusters}")
print(f"✓ Outliers identified: {n_outliers:,} ({outlier_pct:.1f}%)")

# Cluster size distribution
cluster_sizes = pd.Series(clusters[clusters != -1]).value_counts().sort_values(ascending=False)

print(f"\n✓ Cluster size statistics:")
print(f"  • Largest cluster: {cluster_sizes.max()} documents")
print(f"  • Smallest cluster: {cluster_sizes.min()} documents")
print(f"  • Average cluster size: {cluster_sizes.mean():.1f} documents")
print(f"  • Median cluster size: {cluster_sizes.median():.0f} documents")

# Show size distribution
print(f"\n✓ Cluster size distribution:")
bins = [(15, 50), (50, 100), (100, 500), (500, float('inf'))]
for min_size, max_size in bins:
    count = sum((cluster_sizes >= min_size) & (cluster_sizes < max_size))
    label = f"{min_size}-{int(max_size)}" if max_size != float('inf') else f"{min_size}+"
    print(f"  • {label} docs: {count} clusters")
```
Understanding HDBSCAN Parameters
1. min_cluster_size (Controls cluster granularity)
```python
# Small clusters (fine-grained topics)
hdbscan_fine = HDBSCAN(min_cluster_size=10)
# Result: Many small, specific clusters
# Use when: Want detailed topic granularity

# Medium clusters (balanced - recommended)
hdbscan_balanced = HDBSCAN(min_cluster_size=20)  # ✅ Start here
# Result: Moderate number of interpretable clusters
# Use when: General purpose clustering

# Large clusters (broad topics)
hdbscan_broad = HDBSCAN(min_cluster_size=50)
# Result: Few large, general clusters
# Use when: Want high-level categories
```
2. min_samples (Controls outlier sensitivity)
How conservative to be about outliers:
```python
# Inclusive (fewer outliers)
hdbscan_inclusive = HDBSCAN(
    min_cluster_size=15,
    min_samples=1,   # Lower = fewer points flagged as noise
)
# Use when: Want to cluster most documents

# Moderate (balanced)
hdbscan_balanced = HDBSCAN(
    min_cluster_size=15,
    min_samples=5,   # ✅ Good default
)

# Conservative (stricter, more outliers)
hdbscan_strict = HDBSCAN(
    min_cluster_size=15,
    min_samples=10,  # Higher = more points flagged as noise
)
# Use when: Want only very coherent clusters
```
3. cluster_selection_method
How to select clusters from hierarchy:
```python
# Excess of Mass (recommended, default)
hdbscan_eom = HDBSCAN(cluster_selection_method='eom')  # ✅
# Selects the most stable clusters across the hierarchy
# Result: Balanced cluster sizes

# Leaf (more clusters)
hdbscan_leaf = HDBSCAN(cluster_selection_method='leaf')
# Selects all leaf clusters in the hierarchy
# Result: Many fine-grained clusters
```
Understanding Outliers (-1 Cluster)
The -1 cluster is special – it contains documents that don’t fit any cluster:
```python
# Inspect outliers
outlier_mask = clusters == -1
outlier_docs = [documents[i] for i, is_outlier in enumerate(outlier_mask) if is_outlier]

print(f"Outlier examples ({len(outlier_docs)} total):")
for i, doc in enumerate(outlier_docs[:5]):
    print(f"{i+1}. {doc[:100]}...")
```
Common reasons a document ends up as an outlier:
⚠️ Too short – Not enough content for meaningful embedding
🌐 Different language – If using single-language model
How to handle outliers:
```python
# Strategy 1: Keep separate for manual review
outlier_docs = [doc for doc, c in zip(documents, clusters) if c == -1]
# Review these manually - they may contain insights

# Strategy 2: Assign each outlier to the nearest cluster centroid
import numpy as np

def assign_outliers(embeddings, clusters):
    outlier_indices = np.where(clusters == -1)[0]
    cluster_ids = set(clusters) - {-1}

    # Calculate cluster centroids
    centroids = {}
    for cid in cluster_ids:
        cluster_points = embeddings[clusters == cid]
        centroids[cid] = cluster_points.mean(axis=0)

    # Assign each outlier to its nearest centroid
    for idx in outlier_indices:
        distances = {
            cid: np.linalg.norm(embeddings[idx] - centroid)
            for cid, centroid in centroids.items()
        }
        nearest = min(distances, key=distances.get)
        # Only assign if within a reasonable distance
        if distances[nearest] < 0.5:  # Threshold
            clusters[idx] = nearest
    return clusters

# Apply
clusters_with_assigned = assign_outliers(reduced_embeddings, clusters.copy())

# Strategy 3: Create a "Miscellaneous" topic
# Keep as -1 but give it a descriptive label later

# Strategy 4: Adjust parameters to reduce outliers
hdbscan_less_outliers = HDBSCAN(
    min_cluster_size=15,
    min_samples=3,  # More lenient
)
```
Complete 3-Stage Pipeline Example
"""Complete text clustering pipeline in one placeFrom raw documents to cluster assignments"""from sentence_transformers import SentenceTransformerfrom umap importUMAPfrom hdbscan importHDBSCANimport numpy as npdefcluster_documents(documents, save_path=None):""" Complete clustering pipeline Args: documents: List of text strings save_path: Optional path to save results Returns: clusters: Array of cluster assignments embeddings: Document embeddings reduced: Reduced embeddings """print(f"Starting clustering pipeline for {len(documents)} documents...\n")# STAGE 1: EMBEDDINGprint("[1/3] Generating embeddings...") embedding_model = SentenceTransformer("all-MiniLM-L6-v2") embeddings = embedding_model.encode( documents,batch_size=32,show_progress_bar=True,normalize_embeddings=True )print(f"✓ Embeddings: {embeddings.shape}\n")# STAGE 2: DIMENSIONALITY REDUCTIONprint("[2/3] Reducing dimensions with UMAP...") umap_model = UMAP(n_neighbors=15,n_components=5,min_dist=0.0,metric='cosine',random_state=42 ) reduced_embeddings = umap_model.fit_transform(embeddings)print(f"✓ Reduced: {reduced_embeddings.shape}\n")# STAGE 3: CLUSTERINGprint("[3/3] Clustering with HDBSCAN...") hdbscan_model = HDBSCAN(min_cluster_size=15,min_samples=10,metric='euclidean',cluster_selection_method='eom',prediction_data=True ) clusters = hdbscan_model.fit_predict(reduced_embeddings)# Results n_clusters =len(set(clusters)) - (1if-1in clusters else0) n_outliers =list(clusters).count(-1)print(f"✓ Clustering complete!\n")print(f"Results:")print(f" • Clusters: {n_clusters}")print(f" • Outliers: {n_outliers} ({n_outliers/len(clusters)*100:.1f}%)")# Save if requestedif save_path: np.save(f"{save_path}_embeddings.npy", embeddings) np.save(f"{save_path}_reduced.npy", reduced_embeddings) np.save(f"{save_path}_clusters.npy", clusters)print(f"\n✓ Saved results to {save_path}_*.npy")return clusters, embeddings, reduced_embeddings# Usagedocuments = [...] # Your documentsclusters, embeddings, reduced = cluster_documents( documents,save_path="my_clustering_results")
4. BERTopic Framework: Complete Guide
Now that we understand the 3-stage clustering pipeline, let’s explore BERTopic – the modular framework that extends clustering with powerful topic modeling capabilities.
BERTopic is a topic modeling technique that leverages transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
The key innovation: BERTopic takes our 3-stage clustering pipeline and adds three more components for topic extraction and refinement.
BERTopic’s 6-Component Architecture
```
Input: Documents
  ↓
1. Embedding Model (SBERT)          → Dense vectors
  ↓
2. Dimensionality Reduction (UMAP)  → 5D vectors
  ↓
3. Clustering (HDBSCAN)             → Cluster IDs
  ↓
4. Tokenization (CountVectorizer)   → Word frequencies per cluster
  ↓
5. Weighting (c-TF-IDF)             → Important words per cluster
  ↓
6. Representation Model (Optional)  → Refined topic labels
  ↓
Output: Topics with keywords
```
Components 1-3: Same as our pipeline (Embedding → UMAP → HDBSCAN)
Components 4-6: NEW! These extract meaningful topic keywords
Component 4: Tokenization (CountVectorizer)
Instead of analyzing individual documents, BERTopic concatenates all documents in a cluster and treats each cluster as one "mega-document":
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(
    ngram_range=(1, 2),    # Both single words and two-word phrases
    stop_words="english",  # Remove common words (the, is, and, etc.)
    min_df=5,              # Ignore words appearing in < 5 documents
    max_df=0.7,            # Ignore words appearing in > 70% of documents
)

# Traditional: Each document analyzed separately
# BERTopic: All documents in cluster 0 → one big document
# Result: Cluster-level patterns emerge
```
Component 5: c-TF-IDF (The Secret Sauce!)
c-TF-IDF (class-based Term Frequency-Inverse Document Frequency) is what makes BERTopic topics so coherent.
Traditional TF-IDF:
```
# Finds important words in a DOCUMENT compared to the corpus
TF-IDF(word, document) = TF(word, doc) × log(N / df(word))
```
c-TF-IDF:
```
# Finds important words in a CLUSTER compared to other clusters
c-TF-IDF(word, cluster) = TF(word, cluster) × log(n_clusters / cf(word))
```
```python
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(
    reduce_frequent_words=True  # Further reduce weight of very common words
)
```
Component 6: Representation Models (Optional Refinement)
Option A: KeyBERTInspired – Refine keywords using semantic similarity
```python
from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()
# Before (c-TF-IDF): ["neural", "network", "deep", "learning"]
# After (KeyBERT):   ["neural networks", "deep learning", "network architecture"]
```
Option B: Maximal Marginal Relevance – Balance relevance and diversity
```python
from bertopic.representation import MaximalMarginalRelevance

representation_model = MaximalMarginalRelevance(diversity=0.3)
# Ensures keywords are relevant AND different from each other
# Bad:  ["learning", "learner", "learned", "learns"] (too similar)
# Good: ["learning", "optimization", "regularization", "validation"] (diverse)
```
Option C: LLM-based – Natural language labels
```python
from bertopic.representation import OpenAI

prompt = """
I have a topic with keywords: [KEYWORDS]
Generate a 2-5 word topic label.
Only return the label.
"""

representation_model = OpenAI(
    model="gpt-4",
    prompt=prompt,
)
# Keywords:  ["neural", "deep", "learning", "network"]
# LLM Label: "Deep Neural Network Training"
```
Installing BERTopic
```bash
# Basic installation
pip install bertopic

# With all optional dependencies
pip install bertopic[all]

# Or install components individually
pip install bertopic sentence-transformers umap-learn hdbscan scikit-learn
```
Basic BERTopic Implementation
Simplest possible usage (3 lines):
```python
from bertopic import BERTopic

# Initialize with defaults
topic_model = BERTopic(language="english", verbose=True)

# Fit and get topics
topics, probabilities = topic_model.fit_transform(documents)

# View results
print(topic_model.get_topic_info())
```
```python
# Get all topic information
topic_info = topic_model.get_topic_info()
print(topic_info.head())

# Get specific topic keywords
topic_0_keywords = topic_model.get_topic(0)
print("\nTopic 0 keywords:")
for word, score in topic_0_keywords[:10]:
    print(f"  {word:20s} {score:.4f}")

# Get documents in a topic
topic_0_docs = [documents[i] for i, t in enumerate(topics) if t == 0]
print(f"\nTopic 0 has {len(topic_0_docs)} documents")

# Search for topics
similar_topics, similarity = topic_model.find_topics("deep learning", top_n=5)
print(f"\nTopics related to 'deep learning': {similar_topics}")

# Get topic distribution for a new document
new_doc = ["Latest advances in transformer models"]
new_topic, new_prob = topic_model.transform(new_doc)
print(f"\nNew document assigned to topic: {new_topic[0]}")
```
5. Topic Modeling with LLMs
Topic modeling goes beyond clustering by assigning meaningful labels and descriptions to each cluster.
Figure 3: From keywords to topic labels using LLMs
What is Topic Modeling?
Topic modeling identifies abstract “topics” that occur in a collection of documents. While clustering groups documents, topic modeling names and describes those groups.
Clustering: “These 100 documents are similar” Topic Modeling: “These documents are about ‘Neural Machine Translation'”
Plus optional LLM-generated label: “Deep Neural Network Training”
Semantically meaningful
Generating Topic Labels with LLMs
The most powerful feature: using GPT-4 or Claude to create human-readable topic labels.
Method 1: BERTopic Built-in LLM Support
```python
from bertopic.representation import OpenAI
from bertopic import BERTopic

# Configure LLM labeling
prompt = """
I have a topic described by these keywords: [KEYWORDS]
Based on these keywords, generate a concise, descriptive topic label (2-6 words).
The label should capture the main theme.
Only return the label, nothing else.

Keywords: [KEYWORDS]
Label:
"""

llm_model = OpenAI(
    model="gpt-4",
    chat=True,
    prompt=prompt,
    exponential_backoff=True,
)

# Create BERTopic with LLM labeling
topic_model = BERTopic(
    representation_model=llm_model,
    verbose=True,
)
topics, probs = topic_model.fit_transform(documents)

# View generated labels
topic_info = topic_model.get_topic_info()
print(topic_info[['Topic', 'Count', 'Name']])
```
Output:
```
Topic  Count  Name
-1     234    -1_outlier_documents
0      1247   Deep Neural Network Training
1      982    Natural Language Processing
2      756    Computer Vision and Object Detection
3      543    Reinforcement Learning Algorithms
```
Method 2: Post-hoc Labeling with Claude
```python
import anthropic

def generate_topic_label_claude(keywords, sample_docs):
    """Generate a topic label using Claude."""
    client = anthropic.Anthropic(api_key="your-api-key")

    keywords_str = ", ".join([word for word, _ in keywords[:10]])
    docs_str = "\n".join([f"• {doc[:150]}" for doc in sample_docs[:3]])

    prompt = f"""<documents>
Keywords that characterize this topic: {keywords_str}

Sample documents from this cluster:
{docs_str}
</documents>

Based on the keywords and sample documents, generate a concise, descriptive label (2-6 words) that captures the main theme of this topic.
Respond with ONLY the label."""

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=50,
        temperature=0.3,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text.strip()


# Generate labels for all topics
for topic_id in range(len(set(topics)) - 1):
    keywords = topic_model.get_topic(topic_id)
    topic_docs = [documents[i] for i, t in enumerate(topics) if t == topic_id][:5]
    label = generate_topic_label_claude(keywords, topic_docs)
    print(f"Topic {topic_id}: {label}")
```
6. Real-World Implementation: ArXiv Dataset Example
Let’s implement a complete, production-ready clustering system using the ArXiv NLP papers dataset.
Q1: What’s the difference between text clustering and topic modeling?
A: Text clustering groups similar documents together based on semantic meaning, while topic modeling labels and describes those groups with keywords or phrases.
Think of it this way:
Clustering: “These 100 documents are similar” (grouping)
Topic modeling: “These documents are about ‘neural networks and deep learning'” (labeling)
In practice, topic modeling often follows clustering. BERTopic combines both: it clusters documents (Stage 1-3) then extracts topics (Stage 4-6).
Q2: Why use BERTopic instead of traditional LDA for topic modeling?
A: BERTopic offers several advantages:
Contextual understanding: Uses BERT embeddings that understand “bank” (river) vs “bank” (financial)
No K specification: Discovers optimal number of topics automatically
Topic coherence: Generally produces more interpretable topics
When to use LDA:
Very large datasets (millions of documents)
Extremely limited compute resources
Need probabilistic topic distributions
Academic research requiring traditional methods
Q3: What does the -1 cluster in HDBSCAN represent?
A: The -1 cluster represents outliers – documents that don’t fit well into any cluster. These are data points too far from dense regions to be assigned to a cluster.
Common outlier types:
🔍 Legitimate edge cases (rare, unique topics)
🗑️ Noise, spam, or low-quality text
📝 Multi-topic documents blending themes
⚠️ Very short documents lacking context
Handling strategies:
Keep separate: Review manually, may contain insights
Assign to nearest: Use distance to cluster centroids
Adjust parameters: Reduce min_samples to be less strict
Accept as normal: 5-10% outliers is typical and healthy
Q4: Which embedding model should I use for text clustering?
A: Choose based on your priorities:
Speed priority:
all-MiniLM-L6-v2 (384-dim, 80MB, ~1000 docs/sec)
Best for: Prototyping, large datasets, real-time systems
Quality priority:
all-mpnet-base-v2 (768-dim, 420MB, ~400 docs/sec)
Best for: Production systems, critical applications
```python
# Cluster 0: ML papers
#   Contains words: learning (500×), neural (300×), model (250×)
# Cluster 1: NLP papers
#   Contains words: language (400×), text (350×), nlp (200×)

# c-TF-IDF identifies:
#   Cluster 0 distinctive words: "neural", "deep", "network"
#   Cluster 1 distinctive words: "language", "syntax", "semantic"

# Regular TF-IDF would miss these cluster-level patterns!
```
Why it matters: c-TF-IDF produces more coherent, interpretable topics because it considers the cluster context.
Q8: Is BERTopic suitable for short texts like tweets?
A: Yes! BERTopic handles short texts much better than traditional methods like LDA.
Why it works:
Embeddings capture semantics even from few words
Contextual understanding helps with abbreviations
Handles informal language, emojis, hashtags
Best practices for short texts:
```python
# 1. Use a smaller min_cluster_size
hdbscan_model = HDBSCAN(
    min_cluster_size=10,  # Lower for short texts
    min_samples=5,
)

# 2. Keep bigrams
vectorizer_model = CountVectorizer(
    ngram_range=(1, 2),   # Unigrams and bigrams
    stop_words="english",
)

# 3. Use fewer keywords per topic
topic_model = BERTopic(
    top_n_words=5,        # Fewer keywords for short texts
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
)
```
Minimum text length: Generally works well with 5-10 words. Below that, consider:
Combining related short texts
Using bigrams/trigrams
Specialized short-text embeddings
Q9: How do I handle imbalanced datasets where some topics dominate?
A: Several strategies:
```python
# Strategy 1: Reduce the number of topics with nr_topics
topic_model = BERTopic(nr_topics=50)  # Merge down to ~50 topics

# Strategy 2: Adjust cluster selection to merge similar clusters
hdbscan_model = HDBSCAN(
    min_cluster_size=15,
    cluster_selection_epsilon=0.5,  # Merge clusters that are close together
)

# Strategy 3: Oversample minority topics
import pandas as pd
from sklearn.utils import resample

# Identify small clusters
cluster_counts = pd.Series(topics).value_counts()
small_clusters = cluster_counts[cluster_counts < 100].index

# Oversample documents from small clusters
oversampled_docs = []
oversampled_topics = []
for cluster_id in small_clusters:
    cluster_docs = [documents[i] for i, t in enumerate(topics) if t == cluster_id]
    # Oversample to 100 documents
    resampled = resample(cluster_docs, n_samples=100, replace=True)
    oversampled_docs.extend(resampled)
    oversampled_topics.extend([cluster_id] * 100)

# Retrain topic representations with the balanced data
topic_model.update_topics(
    documents + oversampled_docs,
    topics=list(topics) + oversampled_topics,
)
```
Q10: Can I update topics with new documents without retraining?
A: Yes! BERTopic supports incremental updates:
```python
# Initial training
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(initial_documents)

# New documents arrive
new_documents = ["Latest AI research paper", "Novel NLP technique", ...]

# Option 1: Assign to existing topics (fast)
new_topics, new_probs = topic_model.transform(new_documents)

# Option 2: Update the model with new documents (slower, more accurate)
all_documents = initial_documents + new_documents
all_topics = list(topics) + list(new_topics)
topic_model.update_topics(
    docs=all_documents,
    topics=all_topics,
    vectorizer_model=updated_vectorizer,
)
# Topics now reflect both old and new documents
```
When to use each:
transform(): Daily/weekly updates, streaming data
update_topics(): Monthly/quarterly, significant new content
13. Conclusion
Text clustering with LLMs represents a paradigm shift from keyword-matching to semantic understanding. By combining powerful embedding models (SBERT), effective dimensionality reduction (UMAP), and density-based clustering (HDBSCAN), we can automatically discover meaningful patterns in unstructured text at scale.
Key Takeaways
✅ LLM embeddings capture semantics that traditional bag-of-words methods miss ✅ BERTopic provides a modular framework that’s both powerful and customizable ✅ Three-stage pipeline (embed → reduce → cluster) is the foundation ✅ c-TF-IDF extracts distinctive keywords by analyzing clusters, not documents ✅ Production deployment requires careful attention to scalability and monitoring ✅ No one-size-fits-all solution – tune parameters for your specific use case
When to Use Text Clustering
Perfect for:
Organizing large document collections (research papers, support tickets)
References
Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 205.
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
A common task in natural language processing (NLP) is text classification. Use cases include sentiment analysis, intent detection, entity extraction, and language detection. This article delves into how to use LLMs for text classification, covering both representation models and generative models. Under representation models, we will see how task-specific models and embedding models can classify text; under generative models, we will look at open-source and closed-source models. While both kinds of models can be applied to classification, they take different approaches.
2. Text Classification with Representation Models
Task-specific models and embedding models are two types of representation models that can be used for text classification. To obtain task-specific models, representation models, like bidirectional encoder representations from transformers (BERT), are trained for a particular task, like sentiment analysis. On the other hand, general-purpose models such as embedding models can be applied to a range of tasks outside classification, such as semantic search.
As can be seen in the diagram below, when used for text classification, representation models are kept frozen (untrainable). Because a task-specific model is trained specifically for the given task, it can classify input text directly. When using an embedding model instead, we first generate embeddings for the texts in the training set, then train a classifier on those embeddings and their corresponding labels. Once the classifier is trained, it can be used for classification.
3. Model Selection
Factors to consider when selecting a model for text classification:
How does it fit the use case?
What is its language capability?
What is the underlying architecture?
What is the size of the model?
How is the performance? etc.
Underlying Architecture
BERT is an encoder-only architecture and is a popular choice for creating task-specific models and embedding models, and falls into the category of representation models. Generative Pre-trained Transformer (GPT) is a decoder-only architecture that falls into the generative models category. Encoder-only models are normally small in size. Variations of BERT are RoBERTa, DistilBERT, DeBERTa, etc. For task-specific use cases such as sentiment analysis, Twitter-RoBERTa-base can be a good starting point. For embedding models sentence-transformers/all-mpnet-base-v2 can be a good starting point as this is a small but performant model.
4. Text Classification using Task-specific Models
This is pretty straightforward. Text is passed to the tokenizer, which splits it into tokens. These tokens are consumed by the task-specific model, which predicts the class of the text.
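As a minimal sketch, a task-specific sentiment model can be used via the Hugging Face pipeline API; the checkpoint below (cardiffnlp/twitter-roberta-base-sentiment-latest) is one possible choice, not the only option:

```python
from transformers import pipeline

# The pipeline handles tokenization and model inference in one step.
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

print(classifier("The movie was fantastic!"))
# e.g. [{'label': 'positive', 'score': 0.98}]
```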
This is fine if we could find the task-specific models for our use case. Otherwise, if we have to fine-tune the model ourselves, we would need to check if we have sufficient budget(time, cost) for it. Another option is to resort to using the general-purpose embedding models.
5. Text Classification using Embedding Models
Rather than using a task-specific representation model directly, we can generate features with an embedding model and train a classifier, such as logistic regression, on top of those features.
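A minimal sketch of this two-step approach, assuming the sentence-transformers and scikit-learn libraries and the all-mpnet-base-v2 model mentioned above:
```python
# Minimal sketch: frozen embedding model + trainable classifier head.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

train_texts = ["I loved this film", "Terrible plot and worse acting"]
train_labels = [1, 0]                              # 1 = positive, 0 = negative

X_train = embedder.encode(train_texts)             # one vector per text
clf = LogisticRegression().fit(X_train, train_labels)

X_test = embedder.encode(["An absolute masterpiece"])
print(clf.predict(X_test))                         # expected: [1]
```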
5.1 What if we do not have the labeled data?
If we have definitions for the labels but no labeled data, we can use what is called "zero-shot classification": a zero-shot model predicts the labels of input text even though it was never trained on them. The following diagram depicts the concept.
We can perform zero-shot classification with embeddings by describing each label in terms of what it should represent. The following diagram describes the concept.
To assign a label to an input text or document, we compute the cosine similarity between the document embedding and each label embedding and choose the closest label.
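Here is a minimal sketch of that idea, again assuming sentence-transformers and scikit-learn; the label descriptions are illustrative.
```python
# Minimal sketch: zero-shot classification via label-description embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

label_descriptions = ["a positive movie review", "a negative movie review"]
documents = [
    "A touching story with superb acting",
    "Two hours I will never get back",
]

label_emb = embedder.encode(label_descriptions)
doc_emb = embedder.encode(documents)

similarity = cosine_similarity(doc_emb, label_emb)  # shape: (n_docs, n_labels)
print(similarity.argmax(axis=1))                    # index of closest label per doc
```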
6. Text Classification with Generative Models
Generative models are trained for a wide variety of tasks, so they do not handle text classification out of the box. To give a generative model the necessary context, we rely on prompt engineering: the prompt must be written so the model clearly understands what it is expected to do and what the candidate labels are.
6.1 Using the Text-to-Text Transfer Transformer (T5)
The following diagram summarizes the different categories of the models:
The following diagram depicts the training steps:
We need to prefix each document with the prompt "Is the following sentence positive or negative?"
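A minimal sketch of this prefixing approach, assuming the transformers library; the instruction-tuned flan-t5-small checkpoint stands in for whichever T5 variant you use.
```python
# Minimal sketch: prompting a T5-style text-to-text model for classification.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-small")

document = "The plot was predictable and the pacing dragged."
prompt = f"Is the following sentence positive or negative? {document}"

print(generator(prompt)[0]["generated_text"])   # e.g. "negative"
```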
6.2 ChatGPT for Classification
The following diagram describes the training procedure of ChatGPT:
The model is trained on human preference data so that the text it generates aligns with what humans prefer.
For text classification, a sample prompt looks like this:
```python
prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers."""
```
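A minimal sketch of sending that kind of prompt through a chat-completion API, assuming the openai Python client (v1+), an OPENAI_API_KEY in the environment, and an illustrative model name:
```python
# Minimal sketch: classification via a chat-completion API.
from openai import OpenAI

client = OpenAI()

document = "A beautifully shot film with a forgettable script."
prompt = (
    "Predict whether the following document is a positive or negative movie review:\n"
    f"{document}\n"
    "If it is positive return 1 and if it is negative return 0. "
    "Do not give any other answers."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",                              # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)            # expected: "1" or "0"
```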
References
Hands-On Large Language Models: Language Understanding and Generation by Jay Alammar & Maarten Grootendorst (O'Reilly).
When you send a prompt to ChatGPT, Claude, or any other LLM-powered application, what actually happens behind the scenes? The journey from your input to the model’s response involves a sophisticated inference engine—a critical piece of infrastructure that determines everything from response latency to deployment costs.
Understanding LLM inference engines isn’t just academic curiosity. If you’re building production AI systems, these engines directly impact your application’s performance, scalability, and economics. A well-optimized inference setup can mean the difference between a responsive user experience and one that feels sluggish, between manageable costs and budget overruns.
This article breaks down the complete picture: the architectural components that power inference, the optimization techniques that make it feasible, the tools available for deployment, and the practical considerations you need to navigate when moving from prototype to production.
What is LLM Inference?
LLM inference is the process of using a trained language model to generate predictions or outputs based on input data. Unlike training—which involves adjusting billions of parameters over massive datasets—inference uses those fixed parameters to produce responses.
Think of it this way: training teaches the model language patterns, world knowledge, and reasoning capabilities. Inference puts that learned knowledge to work, transforming your prompt into coherent text, code, or structured outputs.
Inference vs Training: Key Differences
The computational characteristics of inference differ fundamentally from training:
Training processes enormous batches of data in parallel, updates model weights iteratively, runs for days or weeks on clusters of GPUs or TPUs, and prioritizes throughput over latency. The goal is learning—accuracy improves with more compute time.
Inference handles individual or small batches of requests sequentially, uses frozen model weights, completes in milliseconds to seconds, and prioritizes latency and cost efficiency. The goal is production readiness—users expect immediate responses.
These different priorities drive completely different optimization strategies. Training benefits from larger batch sizes and longer computation times. Inference demands the opposite: minimal latency, efficient memory usage, and cost-effective scaling.
The Two Phases of LLM Inference
LLM inference actually consists of two distinct computational phases, each with different characteristics and bottlenecks.
Prefill Phase: Processing the Input
The prefill phase processes your entire input prompt in parallel. The model ingests all input tokens simultaneously, computing attention across the complete context.
This phase is compute-bound. The GPU cores work intensively to process the prompt through all transformer layers, calculating self-attention and feed-forward operations. For long prompts, prefill can dominate total inference time.
During prefill, the model generates the KV (key-value) cache—a crucial data structure containing computed attention states for all input tokens. This cache prevents redundant calculations in the next phase.
Decode Phase: Generating the Output
The decode phase generates output tokens one at a time, autoregressively. Each new token depends on all previously generated tokens, making parallelization impossible within a single sequence.
This phase is memory-bound. The GPU repeatedly loads the KV cache from memory for each token generation. As the cache grows with output length, memory bandwidth becomes the bottleneck rather than computational throughput.
The autoregressive nature creates another challenge: you can’t know how many tokens you’ll generate until the model decides to stop. This unpredictability complicates batching and resource allocation.
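To make the two phases concrete, here is a minimal sketch using a small GPT-2 model via the transformers library: one parallel forward pass builds the KV cache (prefill), then each subsequent step feeds only the newest token back in (decode).
```python
# Minimal sketch: prefill builds the KV cache, decode reuses it token by token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")

# Prefill: process the whole prompt in parallel and capture the KV cache.
with torch.no_grad():
    out = model(**inputs, use_cache=True)
past = out.past_key_values
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

generated = [next_token]
# Decode: one token at a time, passing only the new token plus the cache.
for _ in range(10):
    with torch.no_grad():
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated.append(next_token)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```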
Why This Distinction Matters
Understanding these two phases shapes optimization strategy:
Prefill optimization focuses on parallel processing efficiency and maximizing GPU utilization
Decode optimization targets memory bandwidth, cache management, and reducing data movement
Batching strategies must balance prefill compute intensity with decode memory constraints
Hardware selection depends on whether your workload is prefill-heavy (long prompts) or decode-heavy (long outputs)
Most production deployments see mixed workloads. Chat applications might have shorter prompts (fast prefill) but longer responses (extended decode). Summarization tasks might invert this pattern. Your infrastructure needs to handle both effectively.
Core Architectural Components
An LLM inference engine comprises several interconnected systems working together to process requests efficiently. Let’s examine each component and its role.
Request Router and Load Balancer
The router sits at the entry point, receiving incoming inference requests and distributing them across available model instances. This isn't simple round-robin distribution: intelligent routing weighs factors such as current instance load, queue depth, and which instances already hold relevant cached state.
Advanced routers implement request queueing, priority handling, and dynamic scaling triggers. When traffic spikes, the router coordinates with orchestration systems to spin up additional instances.
KV Cache Manager
The KV cache stores attention key-value pairs computed during inference. For a 70B-parameter model serving a batch of requests with 2K-token contexts, the KV cache alone can consume 40GB+ of GPU memory.
Efficient cache management becomes critical:
PagedAttention (used by vLLM) breaks the KV cache into fixed-size blocks, similar to virtual memory paging in operating systems. This eliminates fragmentation and enables efficient sharing across sequences.
Multi-query and grouped-query attention architectures reduce cache size by sharing key-value pairs across multiple attention heads, cutting memory requirements without significant quality loss.
Cache eviction policies determine which cached data to keep when memory pressure increases. Simple LRU (Least Recently Used) can be effective, but more sophisticated policies consider factors like prompt prefix overlap and request priority.
Batch Scheduler
The scheduler orchestrates batch formation and execution. Simple batching groups requests that arrive simultaneously, but this leaves gaps when requests trickle in asynchronously.
Continuous batching (also called iteration-level batching) dynamically adds new requests to ongoing batches between decoding steps. When a sequence completes, the scheduler immediately slots in waiting requests. This dramatically improves throughput and GPU utilization.
The scheduler must also handle:
Priority queueing for time-sensitive requests
Sequence length prediction to avoid out-of-memory errors
Fairness to prevent starvation of long-running requests
Preemption to pause low-priority work when urgent requests arrive
Memory Management System
Beyond the KV cache, the memory manager handles model weights, activation tensors, and intermediate computations.
Model weight loading strategies include:
Loading complete models into GPU memory (fastest, most memory-intensive)
CPU-GPU streaming for models larger than GPU memory
Tensor parallelism to split weights across multiple GPUs
Activation checkpointing trades computation for memory by recomputing intermediate activations rather than storing them. This allows larger batch sizes at the cost of additional forward passes.
Memory pooling pre-allocates memory blocks to avoid allocation overhead during inference. The pool manager tracks usage and handles fragmentation.
Tokenization Pipeline
Before the model sees any text, tokenizers convert strings into numerical token IDs. This seems straightforward but has performance implications.
Vocabulary size affects embedding layer computation. Larger vocabularies (100K+ tokens) support more languages and reduce sequence length but increase memory footprint.
Tokenization algorithms (BPE, WordPiece, SentencePiece) have different characteristics for handling rare words, numbers, and code. The choice impacts both model quality and tokenization overhead.
Detokenization converts generated token IDs back to text. For streaming responses, partial detokenization must handle incomplete UTF-8 sequences correctly.
Sampling and Generation Control
After the model produces logits (raw prediction scores) for the next token, the sampling module determines which token to select.
Greedy sampling always picks the highest-probability token. Simple and deterministic, but can produce repetitive outputs.
Temperature scaling flattens or sharpens the probability distribution. Higher temperatures increase randomness; lower temperatures make outputs more deterministic.
Top-k and top-p (nucleus) sampling constrain selection to high-probability tokens, balancing diversity with coherence.
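For intuition, here is a minimal NumPy sketch of temperature scaling combined with nucleus (top-p) sampling over a vector of logits; real engines implement this on the GPU, but the logic is the same.
```python
# Minimal sketch: temperature + top-p (nucleus) sampling from raw logits.
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    # Temperature scaling: <1 sharpens the distribution, >1 flattens it.
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Nucleus: keep the smallest set of tokens whose cumulative mass >= top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]

    kept_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=kept_probs))

# Toy example with a five-token vocabulary.
print(sample_next_token(np.array([2.0, 1.0, 0.5, 0.1, -1.0])))
```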
Advanced techniques include:
Repetition penalties to discourage repeated phrases
Frequency and presence penalties for vocabulary diversity
Constrained decoding to ensure valid JSON, code syntax, or grammar
Beam search for exploring multiple generation paths
Output Streaming and Response Handling
Modern inference engines support streaming responses—sending tokens as they’re generated rather than waiting for completion.
Server-Sent Events (SSE) or WebSocket connections deliver tokens incrementally to clients. This improves perceived latency dramatically; users see responses appearing word-by-word rather than waiting seconds for complete output.
Buffering strategies determine when to send token chunks. Character-by-character streaming maximizes responsiveness but increases overhead. Word-level or phrase-level buffering balances latency and efficiency.
Error handling during streaming requires careful design. If generation fails midway, the system must gracefully notify clients and clean up resources.
Memory Optimization Techniques
Memory is the primary constraint in LLM inference. A 70B parameter model in FP16 precision requires 140GB just for weights (70 billion parameters × 2 bytes each). Add KV cache, activations, and batching, and you quickly exceed available GPU memory. These techniques make large models deployable.
Quantization
Quantization reduces numerical precision, trading accuracy for memory savings and faster computation.
Weight-only quantization compresses model parameters from FP16 (16 bits) to INT8 (8 bits) or even INT4 (4 bits). A 4-bit quantized 70B model fits in 35GB instead of 140GB. The model performs computations in higher precision internally but loads compressed weights.
Activation quantization also compresses intermediate tensors during inference. This is trickier—activations have different distributions than weights and require calibration datasets for optimal quantization parameters.
Quantization methods include:
Post-Training Quantization (PTQ) converts a trained model without additional training. GPTQ and AWQ are popular PTQ methods that minimize accuracy loss through careful weight rounding.
Quantization-Aware Training (QAT) incorporates quantization into the training process, allowing the model to adapt. This produces better quality but requires access to training infrastructure.
Dynamic quantization adjusts precision per-layer or per-operation based on runtime characteristics. Some layers tolerate aggressive quantization; others need higher precision.
Mixed precision uses different precisions for different model components. Attention might use FP16, feed-forward layers INT8, and embeddings INT4. This balances quality and efficiency.
Real-world impact: 4-bit quantization can reduce memory by 75% with only 1-2% accuracy degradation for many models. This makes deployment feasible on consumer GPUs or reduces cloud costs substantially.
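As one example of weight-only quantization in practice, here is a minimal sketch of loading a model in 4-bit precision with transformers and bitsandbytes; it assumes a CUDA GPU, and the checkpoint name is illustrative.
```python
# Minimal sketch: 4-bit weight-only quantization at load time.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,   # store weights in 4-bit, compute in FP16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # illustrative checkpoint
    quantization_config=quant_config,
    device_map="auto",
)
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")   # roughly a quarter of FP16
```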
Model Pruning
Pruning removes unnecessary model parameters, creating smaller models that maintain most of the original’s capabilities.
Unstructured pruning removes individual weights based on magnitude or importance. This maximizes compression but requires specialized sparse matrix operations to achieve speedups.
Structured pruning removes entire neurons, attention heads, or layers. Less aggressive compression but compatible with standard matrix operations, making it easier to deploy.
Knowledge distillation trains a smaller model to mimic a larger one’s outputs. The “student” model learns to approximate the “teacher’s” behavior in a compressed form. This isn’t pruning per se but achieves similar goals—smaller models with retained capabilities.
Layer dropping removes entire transformer layers. Surprisingly, models often tolerate losing 20-30% of layers with minimal quality degradation, especially when combined with fine-tuning.
KV Cache Optimization
The KV cache grows linearly with sequence length and batch size, quickly consuming available memory.
PagedAttention divides the KV cache into fixed-size pages (typically 16-64 tokens). Sequences share page tables, enabling efficient memory use and eliminating fragmentation. When a sequence completes, its pages return to the free pool immediately.
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV cache size by sharing keys and values across multiple query heads. Instead of maintaining separate KV pairs for each attention head, MQA uses single shared pairs. GQA groups heads for a middle ground between MQA and full multi-head attention.
A 70B model with 64 attention heads might reduce KV cache size by 8x using GQA with 8 groups.
Prefix caching shares KV cache entries for common prompt prefixes. Many requests start with system prompts or similar context. Computing these once and reusing the cached KV pairs eliminates redundant prefill computation.
Prompt compression techniques reduce effective context length by selecting only the most relevant tokens or summarizing earlier context. This maintains semantic content while reducing cache requirements.
Flash Attention
Flash Attention restructures the attention computation to minimize memory reads/writes—the primary bottleneck in transformer inference.
Standard attention materializes the full attention matrix (sequence_length × sequence_length), which scales quadratically and requires massive memory bandwidth. Flash Attention never materializes the full matrix, instead computing attention in blocks and fusing operations.
Key innovations:
Block-sparse computation processes attention in tiles that fit in fast SRAM
Kernel fusion combines multiple operations (softmax, matmul) into single GPU kernels
Recomputation trades a small amount of compute to drastically reduce memory I/O
Flash Attention 2 and 3 iterate on the original, achieving 2-3x speedups for long sequences with no accuracy loss. For inference with long contexts (32K+ tokens), Flash Attention is nearly mandatory.
Speculative Decoding
Speculative decoding accelerates the inherently sequential decode phase through a clever trick: use a small, fast “draft” model to generate candidate tokens, then verify them with the full target model in parallel.
The draft model runs autoregressively, generating several tokens quickly. The target model then processes all draft tokens simultaneously, accepting or rejecting each. On acceptance, you gained multiple tokens for only one target model forward pass.
Acceptance rate determines effectiveness. If the draft model predicts well (70-90% acceptance), you achieve 2-3x speedup. Poor draft models provide no benefit.
Draft model options:
Smaller versions of the target model (e.g., 7B drafting for 70B)
Specialized models trained for fast speculation
Previous layers of the same model (early exit strategies)
This technique works because validation is parallelizable even though generation isn’t. You’re trading draft model computation for target model decode steps—a favorable trade when the draft model is much smaller.
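Recent versions of the transformers library expose this as assisted generation; here is a minimal sketch pairing GPT-2 XL with the much smaller GPT-2 as the draft model (both share a tokenizer, which assisted generation requires).
```python
# Minimal sketch: speculative (assisted) decoding with a small draft model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
target = AutoModelForCausalLM.from_pretrained("gpt2-xl")
draft = AutoModelForCausalLM.from_pretrained("gpt2")    # small, fast draft model

inputs = tok("Speculative decoding speeds up inference by", return_tensors="pt")
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=40)
print(tok.decode(outputs[0], skip_special_tokens=True))
```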
Compute Optimization Techniques
Memory optimizations help models fit and run; compute optimizations make them run faster.
Kernel Fusion
Modern GPU code executes as kernels—independent programs that run on the GPU. Each kernel launch has overhead, and data must travel between GPU memory and compute cores for each operation.
Kernel fusion combines multiple operations into single kernels, reducing overhead and data movement.
A classic operator fusion example is combining a matrix multiplication with the activation function that follows it (matmul + ReLU) into one kernel, so the intermediate result never round-trips through GPU memory.
Custom CUDA kernels hand-written for specific operation sequences can achieve 2-5x speedups over sequential execution. Projects like FasterTransformer and Flash Attention demonstrate massive gains from careful kernel engineering.
Graph compilation frameworks like TorchScript, ONNX Runtime, and TensorRT automatically analyze computation graphs and generate fused kernels. These tools make optimization accessible without hand-coding CUDA.
Tensor Parallelism
Tensor parallelism splits individual operations across multiple GPUs. Instead of each GPU holding the complete model, each holds slices of weight matrices.
Layer-wise splitting: Each GPU computes a portion of each layer’s output. For a matrix multiplication A × B, split B column-wise across GPUs. Each GPU computes its slice, then results are gathered through inter-GPU communication.
Communication overhead is the challenge. All-reduce operations synchronize results across GPUs. High-bandwidth interconnects (NVLink, InfiniBand) are essential—PCIe bottlenecks destroy performance.
Megatron-LM from NVIDIA pioneered efficient tensor parallelism for transformers, carefully partitioning attention and feed-forward layers to minimize communication.
When to use: Tensor parallelism shines when single models exceed single-GPU memory or when you need to reduce per-GPU memory usage within a node. It’s common to combine tensor parallelism within nodes with pipeline parallelism across nodes.
Pipeline Parallelism
Pipeline parallelism assigns different model layers to different GPUs. GPU 1 handles layers 1-10, GPU 2 handles layers 11-20, and so on.
Micro-batching divides each input batch into micro-batches that flow through the pipeline. While GPU 2 processes micro-batch 1, GPU 1 works on micro-batch 2. This keeps all GPUs busy rather than waiting for sequential processing.
Bubble overhead occurs during pipeline fill and drain—periods when some GPUs are idle. Smaller micro-batches reduce bubbles but increase communication overhead.
GPipe and PipeDream are influential frameworks for pipeline parallelism, implementing different strategies for schedule optimization and weight updates.
Inference considerations: Pipeline parallelism works better for high-throughput batch inference than low-latency single requests. The pipeline needs sustained traffic to stay filled.
Continuous Batching
Traditional static batching waits until enough requests accumulate, processes them as a batch, then waits again. This leaves GPUs idle and increases latency for early-arriving requests.
Continuous batching operates at the iteration level. Between generating each token, the scheduler checks for new requests and dynamically expands or contracts the batch.
Orca (from Microsoft) pioneered this approach, showing 10x+ throughput improvements over static batching. Modern inference servers like vLLM and TGI implement continuous batching as standard.
Implementation challenges:
Tracking completion state for each sequence independently
Balancing fairness—ensuring long sequences don’t starve short ones
Practical impact: Continuous batching is perhaps the single most impactful optimization for multi-user inference systems. It transforms GPU utilization from 20-30% (static batching) to 70-80%+ while simultaneously reducing average latency.
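The core scheduling loop is simple to sketch. The following toy Python sketch assumes a hypothetical model_step interface and a request queue; production schedulers add priorities, memory accounting, and preemption on top of this idea.
```python
# Toy sketch of iteration-level (continuous) batching: finished sequences leave
# the batch between decode steps and queued requests immediately take their slots.
from collections import deque

def serve(model_step, waiting: deque, max_batch_size: int = 8):
    active = []
    while waiting or active:
        # Admit new requests into any free batch slots before the next step.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())

        # One decode iteration for every active sequence; returns the sequences
        # that emitted an end-of-sequence token this step (hypothetical interface).
        finished = model_step(active)

        # Completed sequences free their slots immediately.
        active = [seq for seq in active if seq not in finished]
```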
Operator Optimization
Even within fused kernels, specific optimizations for individual operators matter.
Matrix multiplication (GEMM) dominates transformer computation. Highly optimized GEMM libraries (cuBLAS, cuDNN, CUTLASS) implement sophisticated tiling and register allocation strategies. Using the right GEMM configuration for your matrix shapes can yield 2x speedups.
Softmax optimization is critical for attention. Numerically stable softmax requires finding the maximum value across inputs before exponentiation. Clever implementations fuse max-finding with exponentiation and reduce memory bandwidth.
Layer normalization appears in every transformer layer. Fused implementations compute mean, variance, and normalization in single passes rather than three separate operations.
Embedding lookup can bottleneck models with large vocabularies. Optimized implementations use GPU shared memory effectively and handle irregular access patterns.
Model Compilation
Compilation frameworks analyze model computation graphs and generate optimized execution plans.
TensorRT from NVIDIA performs layer fusion, precision calibration, and kernel auto-tuning. It can achieve 5-10x speedups for inference compared to eager execution, especially for smaller models.
ONNX Runtime provides cross-platform optimization, converting models to ONNX format and applying hardware-specific optimizations. It’s particularly strong for deployment across diverse environments (cloud, edge, mobile).
TorchScript and TorchInductor compile PyTorch models, eliminating Python overhead and enabling graph-level optimizations while maintaining PyTorch’s ecosystem compatibility.
Compilation trade-offs: Compilation adds upfront overhead (seconds to minutes) and reduces flexibility. Dynamic control flow and dynamic shapes complicate optimization. Most production systems pre-compile models during deployment rather than on-the-fly.
Popular Inference Tools and Frameworks
The inference ecosystem offers numerous tools, each with different strengths. Choosing the right one depends on your requirements: throughput, latency, hardware support, and deployment environment.
vLLM
vLLM focuses on high-throughput serving through PagedAttention and continuous batching.
Key features:
Exceptional throughput for multi-user scenarios
Automatic management of KV cache through paging
Support for major open-source models out-of-box
OpenAI-compatible API
Tensor parallelism and pipeline parallelism
Best for: High-traffic production deployments where throughput matters more than absolute lowest latency. Excellent for API services handling hundreds of concurrent requests.
Limitations: Primarily targets NVIDIA GPUs. Setup complexity is moderate—requires understanding of parallelism strategies for multi-GPU deployments.
TensorRT-LLM
NVIDIA’s TensorRT-LLM provides heavily optimized inference for NVIDIA hardware, combining TensorRT compilation with LLM-specific optimizations.
Key features:
State-of-the-art performance on NVIDIA GPUs
Extensive quantization support (INT8, INT4, FP8)
Flash Attention and custom fused kernels
Multi-GPU and multi-node scaling
Production-grade C++ backend
Best for: Maximum performance on NVIDIA hardware. When you need the absolute fastest inference and are willing to invest in optimization.
Limitations: NVIDIA-only. Steeper learning curve than higher-level frameworks. Model conversion can be complex.
Text Generation Inference (TGI)
HuggingFace’s TGI balances ease of use with performance, integrating deeply with the HuggingFace ecosystem.
Key features:
Simple deployment with Docker
Extensive model support from HuggingFace Hub
Continuous batching
Streaming responses
Good observability and monitoring
Best for: Teams already using HuggingFace models who want quick deployment without extensive optimization work. Great for rapid prototyping to production.
Limitations: Performance sometimes lags specialized frameworks like vLLM or TensorRT-LLM for specific workloads.
llama.cpp
llama.cpp enables running LLMs on CPU with reasonable performance, plus support for Apple Silicon and other non-NVIDIA hardware.
Key features:
Pure C/C++ implementation with minimal dependencies
CPU inference with optimized BLAS operations
Metal (Apple), Vulkan, and OpenCL backends
Extreme quantization (2-bit, 3-bit)
Low memory footprint
Best for: Edge deployment, running on consumer hardware, Apple devices, or environments without GPUs. Development and testing on local machines.
Limitations: Slower than GPU inference for large models. Best suited for smaller models (7B-13B parameters) or extremely quantized larger models.
DeepSpeed-Inference
Microsoft’s DeepSpeed-Inference extends their training framework into inference territory with strong multi-GPU support.
Key features:
Kernel optimizations specifically for transformer architectures
Tensor and pipeline parallelism
ZeRO-style optimization for memory efficiency
Integration with DeepSpeed training
Best for: Teams already using DeepSpeed for training who want consistent infrastructure. Large-scale deployments requiring sophisticated parallelism.
Limitations: Complexity—DeepSpeed has many configuration options. Better suited for researchers and engineers comfortable with deep learning systems.
LM Studio and Ollama
These tools target local model running with user-friendly interfaces.
LM Studio provides a GUI for downloading, configuring, and running models locally. It’s built for end-users rather than developers, but useful for quick testing.
Ollama offers CLI-based local model management with Docker-like simplicity. Running ollama run llama2 gets a model serving locally in seconds.
Best for: Individual developers wanting local models for development/testing. Prototyping before cloud deployment. Privacy-sensitive applications requiring local inference.
Limitations: Not designed for production multi-user serving. Performance optimization is limited compared to specialized frameworks.
BentoML and Ray Serve
These MLOps frameworks provide infrastructure for deploying ML models, including LLMs, as production services.
BentoML offers model packaging, versioning, and deployment with strong API integration. It handles the operational concerns: logging, monitoring, A/B testing.
Ray Serve excels at distributed serving, leveraging Ray’s distributed computing capabilities. It can coordinate complex multi-model pipelines.
Best for: Organizations needing complete MLOps workflows, not just inference. When you want infrastructure that handles multiple model types, not just LLMs.
Limitations: Additional abstraction layer adds complexity. For pure LLM serving, specialized tools might offer better performance.
Hardware Considerations
Hardware choices dramatically affect inference performance and economics. The right hardware depends on your workload characteristics and constraints.
GPU Selection
NVIDIA A100/H100 represent top-tier inference performance. 80GB memory handles large models, high-bandwidth memory (HBM) accelerates memory-bound decode, and Tensor Cores provide specialized acceleration. These are expensive but deliver maximum throughput.
NVIDIA L4/L40 offer better cost-performance for inference-specific workloads. Lower power consumption than A100/H100 makes them attractive for large-scale deployment where TCO matters.
NVIDIA T4 remains popular for moderate-scale inference. Older generation but widely available and cost-effective for smaller models or lower-traffic scenarios.
AMD MI250/MI300 provide alternatives with competitive performance and sometimes better memory bandwidth. The software ecosystem is maturing but still lags NVIDIA’s.
Considerations:
Memory capacity determines maximum model size (before sharding across multiple GPUs)
Memory bandwidth affects decode phase throughput
FP16/BF16 Tensor Core support accelerates computation
NVLink/interconnect bandwidth matters for multi-GPU setups
CPU Inference
Modern CPUs can handle inference for smaller models or lower-throughput scenarios.
AMD EPYC processors with many cores and AVX-512 support provide reasonable inference performance. The ONNX Runtime and OpenVINO optimize well for AMD CPUs.
Intel Xeon with AMX (Advanced Matrix Extensions) accelerates matrix operations. 4th gen Xeon (Sapphire Rapids) and beyond include specific AI acceleration.
Advantages:
Much lower cost than GPUs
Already available in existing infrastructure
No need for specialized GPU environments
Lower power consumption
Limitations:
10-50x slower than GPU inference for large models
Practical only for smaller models (<13B parameters) or batch processing where latency is relaxed
Quantization becomes essential (4-bit or 3-bit)
Edge and Mobile Devices
Running LLMs on edge devices opens new possibilities but requires aggressive optimization.
Apple Silicon (M1/M2/M3) provides impressive performance through its unified memory architecture and Neural Engine acceleration. Machines with 16GB+ of unified memory can run 7B-parameter models comfortably.
Qualcomm Snapdragon mobile processors increasingly include NPU (Neural Processing Unit) cores for on-device AI. Models must be tiny (1-3B parameters) and heavily quantized.
Edge TPUs and specialized accelerators like Google Coral offer efficient inference for specific models but require conversion and sometimes training with quantization-aware techniques.
Considerations:
Memory is severely constrained (4-16GB typical)
Power consumption critical for battery devices
Thermal limits prevent sustained high performance
Quantization to 4-bit or lower nearly mandatory
Specialized AI Accelerators
Google TPUs excel at high-throughput inference with efficient matrix operations. TPU v4 and v5 provide strong performance, especially for Google’s own models.
AWS Inferentia/Trainium chips optimize for inference workloads with lower cost than GPU equivalents. Tight integration with AWS makes deployment straightforward.
Graphcore IPUs offer unique architecture with massive SRAM and explicit graph compilation. Strong for certain workloads but require significant optimization effort.
Cerebras wafer-scale engines provide enormous computational capacity but are expensive and specialized for specific use cases.
Trade-offs: Specialized accelerators often provide better performance per dollar and per watt than GPUs, but software ecosystem maturity varies. You may need custom optimization work and sacrifice flexibility.
Key Metrics for Inference Performance
Measuring inference performance requires tracking several metrics that capture different aspects of system behavior.
Latency Metrics
Time to First Token (TTFT) measures how long until the model generates the first output token. This captures prefill time plus any queueing delay. Critical for user experience—users perceive systems with low TTFT as more responsive.
Time Per Output Token (TPOT) measures average decode speed. Multiply TPOT by expected output length to estimate total generation time. TPOT dominates total latency for longer outputs.
End-to-End Latency is the complete time from request arrival to final response. Includes TTFT, all decode iterations, and any post-processing.
P50, P95, P99 latencies show latency distribution. Median (P50) indicates typical performance; P95 and P99 reveal worst-case behavior. Production systems must optimize for tail latencies—P99 often matters more than average.
Throughput Metrics
Requests Per Second (RPS) measures system capacity—how many requests the system handles per second under load.
Tokens Per Second (TPS) counts total output tokens generated per second across all requests. This normalizes for variable output lengths.
GPU Utilization shows what percentage of GPU compute capacity is actively used. Healthy inference systems achieve 70-80%+ utilization through effective batching.
Batch Size indicates average number of requests processed simultaneously. Larger batches generally improve throughput but increase per-request latency.
Cost Metrics
Cost Per 1000 Tokens normalizes costs across different deployments and providers. Industry standard for pricing LLM API access.
Total Cost of Ownership (TCO) includes hardware depreciation, power, cooling, and operational overhead—not just compute costs.
GPU Utilization vs Cost reveals efficiency. Low-utilization GPUs waste money; optimizing batching and scheduling directly impacts economics.
Quality Metrics
Token Acceptance Rate (for speculative decoding) shows how often draft model predictions are accepted. Higher rates mean better speedup.
Quantization Accuracy measures quality degradation from quantization. Typically evaluated on benchmarks like MMLU, HellaSwag, or task-specific evaluations.
Cache Hit Rate (for prefix caching) indicates how often shared prompt prefixes avoid redundant computation.
Production Deployment Best Practices
Moving from prototype to production requires addressing concerns beyond inference speed.
Model Versioning and Management
Model Registry tracks model versions, quantization configurations, and associated metadata. MLflow, Weights & Biases, or custom solutions provide this foundation.
A/B Testing Infrastructure enables comparing model variants in production. Route a percentage of traffic to each variant and measure performance differences.
Rollback Capability allows quick reversion when new models underperform. Keep previous versions warm and ready to take traffic.
Model Validation before deployment should include:
Accuracy evaluation on held-out benchmarks
Latency profiling under representative load
Safety testing for harmful outputs
Edge case handling verification
Scaling and Auto-scaling
Horizontal Scaling adds more inference instances as load increases. Kubernetes or orchestration platforms automate this process.
Vertical Scaling provisions larger instances with more GPUs or memory. Less flexible but simpler than managing distributed state.
Autoscaling Metrics should incorporate:
Request queue depth (scale up when requests are waiting)
GPU utilization (scale up approaching 90%+ sustained utilization)
Latency P95/P99 (scale up when tail latencies degrade)
Cold Start Mitigation keeps minimum capacity running even during low traffic. Starting GPU instances and loading large models takes minutes—unacceptable for sudden traffic spikes.
Geographic Distribution deploys models in multiple regions to reduce latency for global users and provide fault tolerance.
Monitoring and Observability
System Metrics:
GPU/CPU utilization, memory usage, power consumption
Request rate, latency distributions, error rates
Network bandwidth usage for distributed setups
Business Metrics:
Cost per request/token
User satisfaction signals (early abandonment, retries)
Revenue attribution for different model versions
Alerting Thresholds:
P99 latency exceeding SLA
Error rate above baseline
GPU out-of-memory events
Throughput drop indicating issues
Distributed Tracing tracks requests across multiple services (router → inference → post-processing) to identify bottlenecks.
Error Handling and Reliability
Graceful Degradation: When primary models are unavailable or overloaded, fallback to faster (smaller) models or cached responses rather than failing completely.
Timeout Management: Set reasonable timeouts for generation. For interactive applications, 30-60 seconds is typical; batch jobs may allow much longer.
Retry Logic: Implement exponential backoff for transient failures. Distinguish retryable errors (temporary overload) from permanent failures (invalid input).
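A minimal sketch of that pattern, with a hypothetical TransientError standing in for whatever retryable exceptions your client raises:
```python
# Minimal sketch: retry with exponential backoff and jitter for transient failures.
import random
import time

class TransientError(Exception):
    """Hypothetical marker for retryable failures (overload, timeouts)."""

def call_with_retries(call_inference, request, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return call_inference(request)
        except TransientError:
            if attempt == max_attempts - 1:
                raise                                   # out of attempts: give up
            # Backoff grows 1s, 2s, 4s, ... with jitter to avoid thundering herds.
            time.sleep((2 ** attempt) + random.uniform(0, 0.5))
```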
Circuit Breakers: Automatically stop sending requests to failing instances, giving them time to recover.
Health Checks: Lightweight endpoints that verify instances can serve requests. Load balancers use these to route traffic only to healthy instances.
Rate Limiting: Per-user rate limits prevent abuse and ensure fair resource allocation. Implement both request-level and token-level limits.
Output Filtering: Screen generated content for PII, malicious code, or policy-violating content before returning to users.
Model Protection: Prevent model extraction attacks through:
Output randomness (avoid deterministic greedy decoding for public APIs)
Query limits per user
Anomaly detection for suspicious access patterns
API Authentication: Secure API access with tokens, keys, or OAuth. Track usage per credential for billing and abuse detection.
Prompt and Request Optimization
Prompt Engineering for Efficiency:
Minimize token usage while maintaining clarity
Use structured prompts consistently for prefix caching benefits
Request shorter outputs when possible
Streaming Configuration: Enable streaming for long-running requests to improve perceived latency and allow early user feedback.
Temperature and Sampling Tuning: Lower temperature (0.2-0.5) for deterministic tasks reduces variance and can slightly improve throughput. Higher temperature (0.7-1.0) for creative tasks.
Context Management: For multi-turn conversations:
Summarize or truncate history to fit context windows
Implement sliding window approaches
Cache common conversation starters
Debugging and Profiling Inference Performance
When inference doesn’t meet performance targets, systematic debugging reveals bottlenecks.
Profiling Tools
NVIDIA Nsight Systems provides timeline views of CPU and GPU activity. It shows kernel launches, memory transfers, and identifies gaps where GPUs sit idle.
PyTorch Profiler instruments PyTorch code, reporting time spent in each operation. Particularly useful for identifying inefficient operators.
TensorBoard Profiling integrates with TensorFlow/PyTorch to visualize computation graphs and operation timing.
Custom Instrumentation: Add timing code around critical sections:
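For example, here is a minimal sketch that wraps a streaming generation call and reports time to first token and time per output token; the generate_stream callable is a hypothetical stand-in for your serving stack.
```python
# Minimal sketch: timing instrumentation around a streamed generation call.
import time

def timed_generation(generate_stream, prompt):
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0

    for _token in generate_stream(prompt):       # assumed to yield tokens one by one
        if first_token_at is None:
            first_token_at = time.perf_counter()  # captures queueing + prefill
        n_tokens += 1

    end = time.perf_counter()
    if first_token_at is None:
        raise RuntimeError("no tokens generated")

    ttft = first_token_at - start
    tpot = (end - first_token_at) / max(n_tokens - 1, 1)
    print(f"TTFT {ttft * 1000:.1f} ms | TPOT {tpot * 1000:.1f} ms | {n_tokens} tokens")
```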
Common Bottlenecks and Remedies
Memory Bandwidth Saturation: The decode phase loads the KV cache repeatedly. Solutions include quantization (less data to load) or better hardware (HBM3 vs HBM2).
Small Batch Sizes: GPUs thrive on parallel work. Single-request inference may utilize <10% of GPU compute. Enable continuous batching to keep GPUs fed.
Python Overhead: Eager execution in Python adds significant overhead. Compile models with TorchScript, TensorRT, or ONNX Runtime.
Inefficient Data Loading: CPU-to-GPU transfers can bottleneck if not pipelined correctly. Use asynchronous transfers and pinned memory.
Poor KV Cache Management: Fragmentation or inefficient allocation causes OOM errors even when sufficient memory exists. Use PagedAttention or implement careful memory pooling.
Suboptimal Quantization: Naive quantization can lose significant accuracy. Use calibration datasets and advanced methods like GPTQ or AWQ.
Latency Analysis Workflow
Measure baseline: Profile end-to-end latency and identify prefill vs decode contribution
Isolate components: Time tokenization, prefill, each decode iteration, detokenization separately
Check batching: Verify batch sizes match expectations and continuous batching is working
Examine GPU utilization: Low utilization indicates feeding issues; high utilization with slow performance suggests compute bottlenecks
Profile memory: Check memory bandwidth usage and KV cache efficiency
Compare to theoretical peaks: Calculate theoretical maximum throughput given hardware specs; significant gaps indicate optimization opportunity
Regression Testing
Performance Benchmarks: Maintain suites of representative requests covering diverse prompt lengths and output requirements. Run regularly to catch regressions.
Continuous Integration: Automate performance testing in CI/CD pipelines. Block deployments that regress key metrics beyond thresholds.
Historical Tracking: Log performance metrics over time to identify trends and correlate changes with code/configuration updates.
Future Directions in Inference Optimization
Inference optimization continues evolving rapidly. Several trends promise significant improvements.
Mixture of Experts (MoE)
MoE models activate only subsets of parameters per input, achieving the capacity of large models with the computational cost of smaller ones.
Routing mechanisms direct each input to relevant experts. This reduces FLOPs but creates irregular memory access patterns that challenge efficient batching.
Recent models like Mixtral demonstrate that MoE can be practical for inference, though specialized infrastructure is required to handle expert routing efficiently.
Speculative Decoding Evolution
Self-speculative decoding uses early layers of the target model as the draft model, eliminating need for separate models.
Multi-token prediction models trained to predict multiple future tokens simultaneously enable more aggressive speculation.
Adaptive speculation adjusts draft length based on prediction confidence, maximizing accepted tokens while minimizing wasted computation.
Hardware-Software Co-design
Custom inference accelerators designed specifically for transformer architectures will likely proliferate, optimizing for memory bandwidth and specific operation patterns.
Sparse attention mechanisms in hardware could enable longer contexts efficiently by computing only relevant attention scores.
Processing-in-memory architectures reduce data movement by performing computations where data resides, directly addressing the memory bandwidth bottleneck.
Model Architecture Innovation
Linear attention alternatives replace quadratic self-attention with linear-complexity mechanisms, enabling much longer contexts without proportional computational increases.
State space models (SSMs) like Mamba offer constant memory and compute per token regardless of context length, though with different capability trade-offs than transformers.
Hybrid architectures might combine transformers for reasoning with more efficient mechanisms for context processing.
Learned Optimization
Neural compilers use machine learning to generate optimized kernels, potentially surpassing hand-tuned implementations.
Learned scheduling applies reinforcement learning to batch scheduling and resource allocation decisions.
Adaptive quantization adjusts precision dynamically based on input characteristics and quality requirements.
Practical Guidelines for Getting Started
If you’re building an inference system, here’s how to approach it systematically.
Start with Existing Tools
Don’t build from scratch. Use vLLM, TGI, or similar frameworks that handle complexity for you. Focus on your application logic, not infrastructure.
Initial choices:
vLLM for multi-user serving with good defaults
TGI if you’re deeply invested in HuggingFace ecosystem
Ollama for local development and testing
Profile Before Optimizing
Measure actual performance before applying optimizations; premature optimization wastes time. Profile to find real bottlenecks, then work through optimizations in rough order of impact:
Enable continuous batching if not already active—often the single biggest throughput improvement
Apply quantization (4-bit or 8-bit) if memory-constrained
Implement prefix caching for common prompt patterns
Consider speculative decoding if latency-critical and you have compute budget
Explore advanced techniques like Flash Attention 3 or custom kernels only if needed
Establish Monitoring Early
Set up observability from the start. You can’t optimize what you don’t measure.
Essential metrics:
Request latency (P50, P95, P99)
Throughput (requests/sec, tokens/sec)
Error rates and types
Resource utilization (GPU, memory, network)
Cost tracking
Plan for Scale
Even if starting small, design for growth:
Use load balancers that support adding instances
Implement health checks for automated management
Design APIs to support versioning
Build monitoring to detect scaling needs early
Iterate Based on Data
Make changes incrementally and measure impact. A/B test significant modifications. Trust data over intuition.
Conclusion
LLM inference engines represent a complex intersection of algorithms, systems engineering, and hardware optimization. The architecture must balance conflicting demands: low latency, high throughput, cost efficiency, and quality maintenance.
Understanding the two-phase inference process (prefill and decode) and recognizing that each phase has different bottlenecks shapes an effective optimization strategy. Memory management—particularly KV cache optimization—determines what models you can deploy and how many requests you can serve. Compute optimizations like continuous batching and kernel fusion transform theoretical hardware capability into realized throughput.
The tooling landscape provides solid foundations. vLLM, TensorRT-LLM, TGI, and others encapsulate years of optimization work, making efficient inference accessible without rebuilding everything from scratch. Choose tools matching your requirements: throughput or latency, cloud or edge, flexibility or maximum performance.
Production deployment extends beyond raw speed. Monitoring, scaling, error handling, and security separate proofs-of-concept from reliable systems. Measure continuously, optimize based on evidence, and design for inevitable growth and change.
As models grow larger and applications more demanding, inference efficiency becomes increasingly critical. The techniques covered here—from quantization to speculative decoding to specialized hardware—will continue evolving. Stay current with new developments, but master fundamentals first. A well-architected inference system built on solid principles will adapt as the technology advances.
The economics of AI applications depend directly on inference efficiency. Understanding these systems deeply gives you the capability to build responsive, cost-effective, scalable AI products. Start with existing tools, measure relentlessly, optimize systematically, and iterate based on real-world performance data.