
LLM Text Clustering and Topic Modeling: HDBSCAN and BERTopic Tutorial

Categories: AI/ML · GenAI · Natural Language Processing (NLP)
Tags: #bertopic #cluster-analysis #document-clustering #hdbscan #llm-clustering #nlp-clustering #semantic-clustering #sentence-transformers #text-clustering #text-mining

Master semantic document clustering using embeddings, UMAP, HDBSCAN, and BERTopic

Quick Start: 5-Minute Text Clustering

Want to see text clustering in action immediately? Here's a minimal working example:

code
# Install: pip install bertopic
from bertopic import BERTopic

# Your documents
docs = [
    "Machine learning is transforming how we process data",
    "Deep learning uses neural networks with multiple layers",
    "Natural language processing enables computers to understand text",
    "Computer vision allows machines to interpret visual information",
    "Reinforcement learning trains agents through rewards and penalties",
    "Neural networks are inspired by biological brain structures",
    "Text mining extracts valuable insights from unstructured data",
    "Image recognition has applications in healthcare and security"
]

# Cluster in 3 lines
topic_model = BERTopic(min_topic_size=2, verbose=True)
topics, probs = topic_model.fit_transform(docs)

# View results
print(topic_model.get_topic_info())

Output:

code
Topic  Count  Name
-1     0      -1_outliers
 0     3      0_neural_networks_deep_learning
 1     2      1_language_processing_text
 2     2      2_visual_computer_vision

That's it! Now let's dive into how this works and how to customize it for production use.

1. What is Text Clustering with LLMs?

Have you ever needed to organize thousands of customer feedback responses? Or group millions of research papers by topic? Or identify emerging themes in social media conversations?

Text clustering is the unsupervised machine learning technique of grouping similar documents based on their semantic content, meaning, and relationships. Unlike classification, which requires labeled data, clustering automatically discovers patterns in unstructured text.

The Problem We're Solving

With recent advancements in Large Language Models (LLMs), we can now obtain extremely precise contextual and semantic representations of text. This has revolutionized text clustering's effectiveness compared to traditional methods.

Traditional approach limitations:

  • ❌ Bag-of-words models ignore context ("bank" always means the same thing)

  • ❌ TF-IDF misses semantic relationships ("car" and "automobile" treated as different)

  • ❌ Fixed vocabulary can't handle synonyms or related concepts

  • ❌ No understanding of actual word meanings

LLM-based clustering advantages:

  • ✅ Contextual embeddings capture meaning ("river bank" vs "financial bank")

  • ✅ Semantic similarity across different words ("ML" ≈ "machine learning")

  • ✅ Handles synonyms and related concepts automatically

  • ✅ Works across multiple languages with same model

Key Use Cases for Text Clustering

1. Customer Feedback Analysis

  • Group support tickets by issue type automatically

  • Identify recurring problems in product reviews

  • Discover emerging customer concerns before they become critical

  • Prioritize feature requests based on cluster size

2. Research Organization

  • Cluster academic papers by research topic

  • Discover emerging research trends in your field

  • Find related work for literature reviews

  • Organize large document repositories semantically

3. Content Management

  • Automatically categorize blog posts and articles

  • Group similar documents for easier navigation

  • Improve search relevance with semantic grouping

  • Enable topic-based content discovery

4. Data Quality & Labeling

  • Identify outliers and anomalies in datasets

  • Detect mislabeled data by finding odd cluster assignments

  • Find duplicate or near-duplicate content

  • Accelerate manual labeling by clustering first

5. Market Intelligence

  • Analyze competitor mentions and sentiment

  • Track brand perception across clusters

  • Identify market segments in customer data

  • Monitor emerging trends in social media

2. Why Use LLMs for Text Clustering?

Traditional Clustering Methods and Their Limitations

Before LLMs, text clustering relied on simpler, less effective representations:

TF-IDF + K-Means: The Classic Approach

code
# Traditional approach
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Convert text to numbers (loses meaning!)
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(documents)

# Must specify number of clusters upfront
kmeans = KMeans(n_clusters=10, random_state=42)
clusters = kmeans.fit_predict(tfidf_matrix)

Critical limitations:

  • Treats "king" and "queen" as completely unrelated words

  • Misses that "car" and "automobile" mean the same thing

  • Can't understand negation: "not good" vs "good" look similar

  • Requires knowing number of clusters in advance

  • No contextual understanding whatsoever

LDA (Latent Dirichlet Allocation): Probabilistic Topic Modeling

code
from sklearn.decomposition import LatentDirichletAllocation

# Also requires specifying number of topics
lda = LatentDirichletAllocation(n_components=10, random_state=42)
topic_distributions = lda.fit_transform(tfidf_matrix)

Key problems:

  • Assumes bag-of-words (word order doesn't matter)

  • No semantic understanding of relationships

  • Fixed number of topics must be specified

  • Poor performance on short texts (tweets, reviews)

  • Topics are just probability distributions over words

LSA (Latent Semantic Analysis): Dimensionality Reduction

code
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, random_state=42)
lsa_features = svd.fit_transform(tfidf_matrix)

Limitations:

  • Only captures linear relationships between words

  • Loses word order information completely

  • Requires careful dimensionality tuning

  • Still no contextual awareness

How LLMs Transform Text Clustering

Modern LLM embeddings bring contextual understanding to clustering:

1. Contextual Understanding

Traditional (same embedding everywhere):

code
"bank" [0.2, 0.5, 0.1, 0.8, ...]  # Always identical

LLM-based (context-aware):

code
"river bank" [0.8, 0.1, 0.3, 0.2, ...]  # Geographic context"bank account" [0.1, 0.9, 0.2, 0.7, ...]  # Financial context

The same word gets different embeddings based on context!
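
You can check this on your own machine with a minimal sketch along these lines (the model choice and example sentences are illustrative assumptions, not from the original example; exact scores will vary by model):

code
# Hedged sketch: model and sentences are illustrative assumptions
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "We walked along the river bank at sunset",   # geographic sense of "bank"
    "The bank approved my mortgage application",  # financial sense of "bank"
    "She opened a savings account at the bank",   # financial sense of "bank"
]
emb = model.encode(sentences, normalize_embeddings=True)

# The two financial sentences should score closer to each other
# than either does to the geographic one.
print(cosine_similarity([emb[1]], [emb[2]])[0][0])  # expected: higher
print(cosine_similarity([emb[0]], [emb[1]])[0][0])  # expected: lower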

2. Semantic Similarity

LLMs understand that these are similar concepts:

  • "automobile" ≈ "car" ≈ "vehicle" ≈ "auto"

  • "happy" ≈ "joyful" ≈ "pleased" ≈ "delighted"

  • "ML" ≈ "machine learning" ≈ "artificial intelligence"

  • "NLP" ≈ "natural language processing" ≈ "text analysis"

3. Multilingual Magic

Same model handles multiple languages in unified semantic space:

code
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = [
    "Machine learning is powerful",           # English
    "El aprendizaje automático es poderoso",  # Spanish
    "機械学習は強力です",                      # Japanese
    "机器学习很强大",                          # Chinese
]

# All embedded in the same semantic space!
embeddings = model.encode(docs)
# Similar concepts cluster together regardless of language

4. Handles Real-World Text Variations

LLMs naturally understand:

  • Synonyms: "start" = "begin" = "commence" = "initiate"

  • Acronyms: "ML" = "Machine Learning"

  • Misspellings: "recieve" ≈ "receive" (close embeddings)

  • Abbreviations: "AI" = "Artificial Intelligence"

  • Slang: "LOL" = "laughing" = "funny"

Comparison: Traditional vs LLM-Based Methods

| Feature | TF-IDF + K-Means | LDA | LLM + BERTopic |
| Context awareness | ❌ None | ❌ None | ✅ Full contextual |
| Semantic understanding | ❌ Keyword only | ⚠️ Limited | ✅ Deep semantic |
| Handles synonyms | ❌ No | ❌ No | ✅ Yes |
| Multilingual | ❌ Separate models | ❌ Separate models | ✅ Single model |
| # clusters needed | ⚠️ Must specify K | ⚠️ Must specify K | ✅ Auto-discovers |
| Outlier detection | ❌ Forces all docs | ❌ Poor | ✅ Explicit -1 cluster |
| Short text (tweets) | ⚠️ Moderate | ❌ Poor | ✅ Excellent |
| Topic coherence | ⚠️ Moderate | ⚠️ Moderate | ✅ High |
| Interpretability | ✅ Clear keywords | ✅ Probabilities | ✅ Keywords + context |
| Speed | ✅ Very fast | ✅ Fast | ⚠️ Slower |
| Memory usage | ✅ Low | ✅ Low | ⚠️ Higher |
| Training needed | ✅ None | ✅ None | ✅ None (pre-trained) |
| Best for | Simple, clean text | Academic research | Production systems |

Real-World Example: Why Context Matters

Let's see the difference in action:

code
# Traditional TF-IDF sees these as highly similar (they share most of their words)
doc1 = "The movie was not good at all"
doc2 = "The movie was very good overall"

# LLM embeddings correctly identify opposite sentiments
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
emb1 = model.encode([doc1])[0]
emb2 = model.encode([doc2])[0]

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([emb1], [emb2])[0][0]
print(f"Similarity: {similarity:.3f}")  # Low similarity (0.234)
# LLM correctly understands "not good" ≠ "good"

3. The Text Clustering Pipeline: 3-Stage Approach

Text clustering with LLMs follows a systematic three-stage pipeline. Each stage is crucial for high-quality results.

Complete three-stage text clustering pipeline - Embedding → Dimensionality Reduction → Clustering

Pipeline Overview

Let's explore each stage in detail.

Stage 1: Document Embedding

Goal: Transform text documents into dense numerical vectors that preserve semantic meaning.

Why Embeddings?

Computers can't process text directly. We need numbers that capture meaning:

code
# ❌ Bad: Simple encoding loses all meaning
"I love this movie" [1, 2, 3, 4]
"I hate this movie" [1, 5, 3, 4]
# These look similar (3 of 4 numbers match) but mean opposite things!

# ✅ Good: Embeddings capture semantic relationships
"I love this movie" [0.8, 0.2, 0.9, -0.1, ...]   # Positive sentiment
"I hate this movie" [-0.7, 0.1, -0.8, 0.2, ...]  # Negative sentiment
# These are far apart in embedding space (correct!)

Choosing the Right Embedding Model

This is the most important decision for your clustering system!

Popular Embedding Models:

| Model Name | Dimensions | Size | Speed | Quality | Best For |
| all-MiniLM-L6-v2 | 384 | 80MB | ⚡⚡⚡ Fast | ⭐⭐⭐⭐ Good | General purpose, prototyping |
| all-mpnet-base-v2 | 768 | 420MB | ⚡⚡ Medium | ⭐⭐⭐⭐⭐ Excellent | Production systems |
| stella-en-400M-v5 | 1024 | 1.6GB | ⚡ Slow | ⭐⭐⭐⭐⭐ Best | Maximum accuracy |
| e5-large-v2 | 1024 | 1.3GB | ⚡ Slow | ⭐⭐⭐⭐⭐ Best | Research |
| paraphrase-multilingual | 768 | 970MB | ⚡⚡ Medium | ⭐⭐⭐⭐ Good | 50+ languages |

How to choose:

code
# For speed and efficiency (good starting point)
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# ✅ Fast: ~1000 docs/second
# ✅ Small: 80MB download
# ⚠️ Quality: Good but not best

# For best quality (recommended for production)
embedding_model = SentenceTransformer("all-mpnet-base-v2")
# ✅ Quality: Excellent accuracy
# ⚠️ Speed: ~400 docs/second
# ⚠️ Size: 420MB

# For domain-specific text
embedding_model = SentenceTransformer("allenai/specter")  # Scientific papers
embedding_model = SentenceTransformer("finbert")          # Financial documents

Generating Embeddings: Complete Implementation

code
from sentence_transformers import SentenceTransformer
import numpy as np

# Step 1: Initialize model
print("Loading embedding model...")
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
print(f"✓ Model loaded")
print(f"✓ Embedding dimensions: {embedding_model.get_sentence_embedding_dimension()}")

# Step 2: Prepare your documents
documents = [
    "Deep learning revolutionizes computer vision applications",
    "Neural networks learn hierarchical feature representations",
    "Natural language processing enables human-computer interaction",
    "Reinforcement learning optimizes decision-making agents",
    # ... add thousands more documents
]
print(f"✓ Prepared {len(documents)} documents for embedding")

# Step 3: Generate embeddings (batch processing for efficiency)
print("Generating embeddings...")
embeddings = embedding_model.encode(
    documents,
    batch_size=32,              # Process 32 documents at once
    show_progress_bar=True,     # Visual progress feedback
    convert_to_numpy=True,      # Return as NumPy array
    normalize_embeddings=True   # L2 normalize for faster cosine similarity
)
print(f"✓ Generated embeddings: {embeddings.shape}")
# Output: (5000, 384) means 5000 documents, 384 dimensions each

# Step 4: Quality check - verify semantic similarity works
from sklearn.metrics.pairwise import cosine_similarity

doc1 = "Machine learning models learn from data"
doc2 = "ML algorithms are trained on datasets"
doc3 = "I enjoy eating chocolate cake"

emb1 = embedding_model.encode([doc1])[0]
emb2 = embedding_model.encode([doc2])[0]
emb3 = embedding_model.encode([doc3])[0]

sim_related = cosine_similarity([emb1], [emb2])[0][0]
sim_unrelated = cosine_similarity([emb1], [emb3])[0][0]

print(f"\n✓ Quality check:")
print(f"  Similar docs similarity: {sim_related:.3f}")     # Should be high (>0.7)
print(f"  Different docs similarity: {sim_unrelated:.3f}") # Should be low (<0.3)

if sim_related > 0.7 and sim_unrelated < 0.3:
    print("  ✅ Embeddings are working correctly!")
else:
    print("  ⚠️ Warning: Embeddings may need adjustment")

Expected output:

code
Loading embedding model...
✓ Model loaded
✓ Embedding dimensions: 384
✓ Prepared 5000 documents for embedding
Generating embeddings...
100%|██████████| 157/157 [00:45<00:00,  3.47it/s]
✓ Generated embeddings: (5000, 384)

✓ Quality check:
  Similar docs similarity: 0.847
  Different docs similarity: 0.156
  ✅ Embeddings are working correctly!

Pro Tips for Embeddings

code
# Tip 1: Save embeddings to avoid recomputing
np.save("my_embeddings.npy", embeddings)
# Later: embeddings = np.load("my_embeddings.npy")

# Tip 2: Use GPU for faster processing
embedding_model = SentenceTransformer("all-mpnet-base-v2", device="cuda")

# Tip 3: Handle very long documents (>512 tokens)
long_embeddings = embedding_model.encode(
    long_documents,
    batch_size=16,  # Reduce batch size for long docs
    show_progress_bar=True
)

# Tip 4: Process in chunks for massive datasets
def embed_large_dataset(docs, chunk_size=10000):
    all_embeddings = []
    for i in range(0, len(docs), chunk_size):
        chunk = docs[i:i+chunk_size]
        chunk_emb = embedding_model.encode(chunk, show_progress_bar=True)
        all_embeddings.append(chunk_emb)
    return np.vstack(all_embeddings)

Stage 2: Dimensionality Reduction with UMAP

Goal: Reduce embedding dimensions while preserving cluster structure.

Why Reduce Dimensions?

The problem: Embeddings are high-dimensional (384-1024 dimensions)

  • 📊 Impossible to visualize

  • 🐌 Clustering algorithms slow on high dimensions

  • 📏 Distance metrics less meaningful ("curse of dimensionality")

  • 💾 Memory intensive for large datasets

The solution: UMAP (Uniform Manifold Approximation and Projection)

UMAP reduces dimensions while preserving both local neighborhoods and global structure.

UMAP Implementation

code
from umap import UMAP
import numpy as np

# Configure UMAP
print("Configuring UMAP for dimensionality reduction...")
umap_model = UMAP(
    n_neighbors=15,      # Balance local vs global structure
    n_components=5,      # Reduce to 5 dimensions for clustering
    min_dist=0.0,        # Allow tight clusters (0.0 = tightest)
    metric='cosine',     # Best for normalized embeddings
    random_state=42,     # Reproducibility
    verbose=True         # Show progress
)

# Apply dimensionality reduction
print(f"Reducing {embeddings.shape} embeddings to 5 dimensions...")
print("(This takes 5-15 minutes for large datasets)")
reduced_embeddings = umap_model.fit_transform(embeddings)
print(f"✓ Reduced to: {reduced_embeddings.shape}")
# Output: (5000, 5) - 5000 documents, 5 dimensions

# Verify dimensionality reduction quality
from sklearn.metrics import pairwise_distances

# Sample 1000 points for speed
sample_idx = np.random.choice(len(embeddings), 1000, replace=False)
orig_distances = pairwise_distances(embeddings[sample_idx], metric='cosine')
reduced_distances = pairwise_distances(reduced_embeddings[sample_idx], metric='euclidean')

# Calculate correlation (should be high)
correlation = np.corrcoef(orig_distances.flatten(), reduced_distances.flatten())[0, 1]
print(f"✓ Distance preservation: {correlation:.3f}")
# Good: >0.7, Excellent: >0.8

Understanding UMAP Parameters

1. n_neighbors (default: 15)

Controls balance between local and global structure:

code
# Small values: Focus on local structure, more small clusters
umap_local = UMAP(n_neighbors=5, n_components=5)
# Use when: Want to find fine-grained clusters

# Medium values: Balanced (recommended)
umap_balanced = UMAP(n_neighbors=15, n_components=5)  # ✅ Start here
# Use when: General purpose clustering

# Large values: Focus on global structure, fewer large clusters
umap_global = UMAP(n_neighbors=50, n_components=5)
# Use when: Want broad topic categories

2. n_components (default: 5)

Number of dimensions to reduce to:

code
# For clustering: 5-10 dimensions
umap_clustering = UMAP(n_components=5)   # ✅ Recommended for clustering

# For 2D visualization
umap_viz = UMAP(n_components=2)          # For scatter plots only

# For 3D visualization
umap_3d = UMAP(n_components=3)           # For interactive 3D plots

3. min_dist (default: 0.0)

Controls cluster tightness:

code
# Tight clusters (recommended for clustering)
umap_tight = UMAP(min_dist=0.0)          # ✅ For clustering

# Spread out (better for visualization)
umap_spread = UMAP(min_dist=0.3)         # For visual exploration

4. metric (default: 'euclidean')

Distance metric to use:

code
# For normalized embeddings (recommended)
umap_cosine = UMAP(metric='cosine')      # ✅ Best for text embeddings

# For non-normalized embeddings
umap_euclidean = UMAP(metric='euclidean')

# Other options
umap_manhattan = UMAP(metric='manhattan')

UMAP vs PCA Comparison

code
from sklearn.decomposition import PCA

# PCA: Linear, fast, but loses non-linear structure
pca_model = PCA(n_components=5, random_state=42)
pca_reduced = pca_model.fit_transform(embeddings)
# ⚠️ Only captures linear relationships
# ✅ Very fast (seconds vs minutes)
# ⚠️ May lose important cluster structure

# UMAP: Non-linear, slower, preserves structure
umap_model = UMAP(n_components=5, random_state=42)
umap_reduced = umap_model.fit_transform(embeddings)
# ✅ Preserves non-linear cluster topology
# ⚠️ Slower (minutes vs seconds)
# ✅ Better for clustering quality

# Recommendation: Use UMAP for production, PCA for quick prototyping

Stage 3: Clustering with HDBSCAN

HDBSCAN stands out among clustering algorithms because it:

  1. Discovers the number of topics automatically (no K specification)
  2. Handles both dense and sparse clusters well
  3. Explicitly identifies outliers (instead of forcing noise into good clusters)
  4. Provides hierarchical structure (see topic relationships)
  5. Has robust parameters (less tuning, more stable)

This combination is why BERTopic, the leading topic modeling framework, chose HDBSCAN as its default clustering algorithm.

Goal: Group similar documents into clusters and identify outliers.

Why HDBSCAN?

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is perfect for text clustering because it:

  • No K required - Automatically finds the optimal number of clusters
  • Finds outliers - Explicitly identifies documents that don't fit (cluster -1)
  • Varying densities - Can find both large and small clusters
  • Hierarchical - Shows topic relationships
  • Deterministic - Same data = same clusters (given a fixed random_state in the upstream UMAP step)

HDBSCAN Implementation

code
from hdbscan import HDBSCAN
import numpy as np
import pandas as pd

# Configure HDBSCAN
print("Configuring HDBSCAN for clustering...")
hdbscan_model = HDBSCAN(
    min_cluster_size=15,             # Minimum 15 documents per cluster
    min_samples=10,                  # Conservative outlier detection
    metric='euclidean',              # Standard for reduced embeddings
    cluster_selection_method='eom',  # Excess of Mass (recommended)
    prediction_data=True,            # Enable soft clustering predictions
    core_dist_n_jobs=-1              # Use all CPU cores
)

# Fit and predict clusters
print(f"Clustering {len(reduced_embeddings)} documents...")
print("(This takes 2-5 minutes)")
clusters = hdbscan_model.fit_predict(reduced_embeddings)

# Analyze results
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_outliers = list(clusters).count(-1)
outlier_pct = n_outliers / len(clusters) * 100

print(f"\n{'='*60}")
print("CLUSTERING RESULTS")
print(f"{'='*60}")
print(f"✓ Total documents: {len(clusters):,}")
print(f"✓ Clusters discovered: {n_clusters}")
print(f"✓ Outliers identified: {n_outliers:,} ({outlier_pct:.1f}%)")

# Cluster size distribution
cluster_sizes = pd.Series(clusters[clusters != -1]).value_counts().sort_values(ascending=False)

print(f"\n✓ Cluster size statistics:")
print(f"  • Largest cluster: {cluster_sizes.max()} documents")
print(f"  • Smallest cluster: {cluster_sizes.min()} documents")
print(f"  • Average cluster size: {cluster_sizes.mean():.1f} documents")
print(f"  • Median cluster size: {cluster_sizes.median():.0f} documents")

# Show size distribution
print(f"\n✓ Cluster size distribution:")
bins = [(15, 50), (50, 100), (100, 500), (500, float('inf'))]
for min_size, max_size in bins:
    count = sum((cluster_sizes >= min_size) & (cluster_sizes < max_size))
    label = f"{min_size}-{int(max_size)}" if max_size != float('inf') else f"{min_size}+"
    print(f"  • {label} docs: {count} clusters")

Expected output:

code
Configuring HDBSCAN for clustering...
Clustering 5,000 documents...
(This takes 2-5 minutes)

============================================================
CLUSTERING RESULTS
============================================================
✓ Total documents: 5,000
✓ Clusters discovered: 47
✓ Outliers identified: 234 (4.7%)

✓ Cluster size statistics:
  • Largest cluster: 456 documents
  • Smallest cluster: 15 documents
  • Average cluster size: 101.4 documents
  • Median cluster size: 87 documents

✓ Cluster size distribution:
  • 15-50 docs: 12 clusters
  • 50-100 docs: 18 clusters
  • 100-500 docs: 16 clusters
  • 500+ docs: 1 clusters

Understanding HDBSCAN Parameters

1. min_cluster_size (Most important!)

Minimum number of documents to form a cluster:

code
# Small clusters (fine-grained topics)
hdbscan_fine = HDBSCAN(min_cluster_size=10)
# Result: Many small, specific clusters
# Use when: Want detailed topic granularity

# Medium clusters (balanced - recommended)
hdbscan_balanced = HDBSCAN(min_cluster_size=20)  # ✅ Start here
# Result: Moderate number of interpretable clusters
# Use when: General purpose clustering

# Large clusters (broad topics)
hdbscan_broad = HDBSCAN(min_cluster_size=50)
# Result: Few large, general clusters
# Use when: Want high-level categories

2. min_samples (Controls outlier sensitivity)

How conservative to be about outliers:

code
# Inclusive (fewer outliers)
hdbscan_inclusive = HDBSCAN(
    min_cluster_size=15,
    min_samples=1   # Lower = fewer outliers
)
# Result: roughly ~5% outliers
# Use when: Want to cluster most documents

# Moderate (balanced)
hdbscan_balanced = HDBSCAN(
    min_cluster_size=15,
    min_samples=5   # ✅ Good default
)
# Result: roughly ~10% outliers

# Conservative (more outliers, stricter clusters)
hdbscan_strict = HDBSCAN(
    min_cluster_size=15,
    min_samples=10  # Higher = more outliers
)
# Result: roughly ~20% outliers
# Use when: Want only very coherent clusters

3. cluster_selection_method

How to select clusters from hierarchy:

code
# Excess of Mass (recommended, default)
hdbscan_eom = HDBSCAN(cluster_selection_method='eom')  # ✅
# Selects most stable clusters across hierarchy
# Result: Balanced cluster sizes

# Leaf (more clusters)
hdbscan_leaf = HDBSCAN(cluster_selection_method='leaf')
# Selects all leaf clusters in hierarchy
# Result: Many fine-grained clusters

Understanding Outliers (-1 Cluster)

The -1 cluster is special - it contains documents that don't fit any cluster:

code
# Inspect outliers
outlier_mask = clusters == -1
outlier_docs = [documents[i] for i, is_outlier in enumerate(outlier_mask) if is_outlier]

print(f"Outlier examples ({len(outlier_docs)} total):")
for i, doc in enumerate(outlier_docs[:5]):
    print(f"{i+1}. {doc[:100]}...")

What outliers might be:

  • 🔍 Legitimate edge cases - Rare but valid topics

  • 🗑️ Noise - Spam, gibberish, low-quality text

  • 📝 Multi-topic documents - Blending multiple themes

  • ⚠️ Too short - Not enough content for meaningful embedding

  • 🌐 Different language - If using single-language model

How to handle outliers:

code
# Strategy 1: Keep separate for manual review
outlier_docs = [doc for doc, c in zip(documents, clusters) if c == -1]
# Review these manually - may contain insights

# Strategy 2: Assign to nearest cluster
def assign_outliers(embeddings, clusters):
    outlier_indices = np.where(clusters == -1)[0]
    cluster_ids = set(clusters) - {-1}

    # Calculate cluster centroids
    centroids = {}
    for cid in cluster_ids:
        cluster_points = embeddings[clusters == cid]
        centroids[cid] = cluster_points.mean(axis=0)

    # Assign each outlier to nearest centroid
    for idx in outlier_indices:
        distances = {
            cid: np.linalg.norm(embeddings[idx] - centroid)
            for cid, centroid in centroids.items()
        }
        nearest = min(distances, key=distances.get)

        # Only assign if within reasonable distance
        if distances[nearest] < 0.5:  # Threshold
            clusters[idx] = nearest

    return clusters

# Apply
clusters_with_assigned = assign_outliers(reduced_embeddings, clusters.copy())

# Strategy 3: Create "Miscellaneous" topic
# Keep as -1 but give it a descriptive label later

# Strategy 4: Adjust parameters to reduce outliers
hdbscan_less_outliers = HDBSCAN(
    min_cluster_size=15,
    min_samples=3  # More lenient
)

Complete 3-Stage Pipeline Example

code
"""Complete text clustering pipeline in one placeFrom raw documents to cluster assignments"""from sentence_transformers import SentenceTransformerfrom umap import UMAPfrom hdbscan import HDBSCANimport numpy as npdef cluster_documents(documents, save_path=None):    """    Complete clustering pipeline        Args:        documents: List of text strings        save_path: Optional path to save results            Returns:        clusters: Array of cluster assignments        embeddings: Document embeddings        reduced: Reduced embeddings    """    print(f"Starting clustering pipeline for {len(documents)} documents...\n")        # STAGE 1: EMBEDDING    print("[1/3] Generating embeddings...")    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")    embeddings = embedding_model.encode(        documents,        batch_size=32,        show_progress_bar=True,        normalize_embeddings=True    )    print(f"✓ Embeddings: {embeddings.shape}\n")        # STAGE 2: DIMENSIONALITY REDUCTION    print("[2/3] Reducing dimensions with UMAP...")    umap_model = UMAP(        n_neighbors=15,        n_components=5,        min_dist=0.0,        metric='cosine',        random_state=42    )    reduced_embeddings = umap_model.fit_transform(embeddings)    print(f"✓ Reduced: {reduced_embeddings.shape}\n")        # STAGE 3: CLUSTERING    print("[3/3] Clustering with HDBSCAN...")    hdbscan_model = HDBSCAN(        min_cluster_size=15,        min_samples=10,        metric='euclidean',        cluster_selection_method='eom',        prediction_data=True    )    clusters = hdbscan_model.fit_predict(reduced_embeddings)        # Results    n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)    n_outliers = list(clusters).count(-1)        print(f"✓ Clustering complete!\n")    print(f"Results:")    print(f"  • Clusters: {n_clusters}")    print(f"  • Outliers: {n_outliers} ({n_outliers/len(clusters)*100:.1f}%)")        # Save if requested    if save_path:        np.save(f"{save_path}_embeddings.npy", embeddings)        np.save(f"{save_path}_reduced.npy", reduced_embeddings)        np.save(f"{save_path}_clusters.npy", clusters)        print(f"\n✓ Saved results to {save_path}_*.npy")        return clusters, embeddings, reduced_embeddings# Usagedocuments = [...]  # Your documentsclusters, embeddings, reduced = cluster_documents(    documents,    save_path="my_clustering_results")

4. BERTopic Framework: Complete Guide

Now that we understand the 3-stage clustering pipeline, let's explore BERTopic - the modular framework that extends clustering with powerful topic modeling capabilities.

Full Pipeline - Clustering to Topic Modeling

What is BERTopic?

BERTopic is a topic modeling technique that leverages transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

The key innovation: BERTopic takes our 3-stage clustering pipeline and adds three more components for topic extraction and refinement.

BERTopic's 6-Component Architecture

code
Input: Documents
1. Embedding Model (SBERT)           → Dense vectors
2. Dimensionality Reduction (UMAP)   → 5D vectors
3. Clustering (HDBSCAN)              → Cluster IDs
4. Tokenization (CountVectorizer)    → Word frequencies per cluster
5. Weighting (c-TF-IDF)              → Important words per cluster
6. Representation Model (Optional)   → Refined topic labels
Output: Topics with keywords

Components 1-3: Same as our pipeline (Embedding → UMAP → HDBSCAN)

Components 4-6: NEW! These extract meaningful topic keywords

Component 4: CountVectorizer (Per-Cluster Tokenization)

Instead of analyzing individual documents, BERTopic concatenates all documents in a cluster and treats each cluster as one "mega-document":

code
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(
    ngram_range=(1, 2),     # Both single words and two-word phrases
    stop_words="english",   # Remove common words (the, is, and, etc.)
    min_df=5,               # Ignore words appearing in < 5 documents
    max_df=0.7              # Ignore words appearing in > 70% of documents
)

# Traditional: Each document analyzed separately
# BERTopic: All documents in cluster 0 → one big document
# Result: Cluster-level patterns emerge
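
To make the "mega-document" idea concrete, here is a minimal hand-rolled sketch (illustrative only, not BERTopic's internal code; the docs and cluster assignments are made up) that concatenates each cluster's documents before counting words:

code
# Hedged sketch of per-cluster concatenation (illustrative, not BERTopic internals)
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer

docs = ["neural nets learn features", "deep networks train fast",
        "parsing syntax trees", "semantic parsing of language"]
clusters = [0, 0, 1, 1]  # assumed cluster assignments

# Build one "mega-document" per cluster
merged = defaultdict(list)
for doc, c in zip(docs, clusters):
    merged[c].append(doc)
mega_docs = [" ".join(merged[c]) for c in sorted(merged)]

# Word counts are now computed per cluster, not per document
counts = CountVectorizer().fit_transform(mega_docs)
print(counts.shape)  # (n_clusters, vocabulary_size)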

Component 5: c-TF-IDF (The Secret Sauce!)

c-TF-IDF (class-based Term Frequency-Inverse Document Frequency) is what makes BERTopic topics so coherent.

Traditional TF-IDF:

code
# Finds important words in a DOCUMENT compared to the corpus
TF-IDF(word, document) = TF(word, doc) × log(N / df(word))

c-TF-IDF:

code
# Finds important words in a CLUSTER compared to other clusters
c-TF-IDF(word, cluster) = TF(word, cluster) × log(n_clusters / cf(word))

Example in action:

code
Cluster 0 (Machine Learning papers):
Words: learning(500×), neural(300×), model(250×), network(200×)

Cluster 1 (NLP papers):
Words: language(400×), text(350×), nlp(200×), processing(180×)

c-TF-IDF identifies distinctive words:
Cluster 0: "neural", "deep", "training", "architecture"
Cluster 1: "language", "syntax", "semantic", "parsing"

Without c-TF-IDF, generic words like "learning" would dominate both!

Implementation:

code
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(
    reduce_frequent_words=True  # Further reduce weight of very common words
)

Component 6: Representation Models (Optional Refinement)

Refine topics beyond c-TF-IDF keywords:

Option A: KeyBERTInspired - Extract keyphrases

code
from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()

# Before (c-TF-IDF): ["neural", "network", "deep", "learning"]
# After (KeyBERT): ["neural networks", "deep learning", "network architecture"]

Option B: Maximal Marginal Relevance - Balance relevance and diversity

code
from bertopic.representation import MaximalMarginalRelevance

representation_model = MaximalMarginalRelevance(diversity=0.3)

# Ensures keywords are relevant AND different from each other
# Bad: ["learning", "learner", "learned", "learns"] (too similar)
# Good: ["learning", "optimization", "regularization", "validation"] (diverse)

Option C: LLM-based - Natural language labels

code
from bertopic.representation import OpenAI

prompt = """I have a topic with keywords: [KEYWORDS]
Generate a 2-5 word topic label.
Only return the label."""

representation_model = OpenAI(
    model="gpt-4",
    prompt=prompt
)

# Keywords: ["neural", "deep", "learning", "network"]
# LLM Label: "Deep Neural Network Training"

Installing BERTopic

code
# Basic installation
pip install bertopic

# With all optional dependencies
pip install bertopic[all]

# Or install components individually
pip install bertopic sentence-transformers umap-learn hdbscan scikit-learn

Basic BERTopic Implementation

Simplest possible usage (3 lines):

code
from bertopic import BERTopic

# Initialize with defaults
topic_model = BERTopic(language="english", verbose=True)

# Fit and get topics
topics, probabilities = topic_model.fit_transform(documents)

# View results
print(topic_model.get_topic_info())

Output:

code
Topic  Count  Name
-1     234    -1_outlier_documents
 0     1247   0_machine_learning_neural_deep
 1     982    1_natural_language_processing_text
 2     756    2_computer_vision_image_detection
 3     543    3_reinforcement_learning_agent_reward
...

Advanced BERTopic Configuration

Full customization of all components:

code
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Component 1: Embedding
embedding_model = SentenceTransformer("all-mpnet-base-v2")

# Component 2: UMAP
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric='cosine',
    random_state=42
)

# Component 3: HDBSCAN
hdbscan_model = HDBSCAN(
    min_cluster_size=15,
    min_samples=10,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True
)

# Component 4: Vectorizer
vectorizer_model = CountVectorizer(
    ngram_range=(1, 2),
    stop_words="english",
    min_df=5,
    max_df=0.7
)

# Component 5: c-TF-IDF
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Component 6: Representation
representation_model = KeyBERTInspired()

# Create BERTopic with all custom components
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
    top_n_words=10,                    # Keywords per topic
    nr_topics="auto",                  # Auto topic reduction
    calculate_probabilities=True,      # Enable soft clustering
    verbose=True
)

# Fit model
topics, probabilities = topic_model.fit_transform(documents)
print(f"✓ Discovered {len(set(topics)) - 1} topics")

Working with BERTopic Results

code
# Get all topic information
topic_info = topic_model.get_topic_info()
print(topic_info.head())

# Get specific topic keywords
topic_0_keywords = topic_model.get_topic(0)
print("\nTopic 0 keywords:")
for word, score in topic_0_keywords[:10]:
    print(f"  {word:20s} {score:.4f}")

# Get documents in a topic
topic_0_docs = [documents[i] for i, t in enumerate(topics) if t == 0]
print(f"\nTopic 0 has {len(topic_0_docs)} documents")

# Search for topics
similar_topics, similarity = topic_model.find_topics("deep learning", top_n=5)
print(f"\nTopics related to 'deep learning': {similar_topics}")

# Get topic distribution for new document
new_doc = ["Latest advances in transformer models"]
new_topic, new_prob = topic_model.transform(new_doc)
print(f"\nNew document assigned to topic: {new_topic[0]}")

5. Topic Modeling with LLMs

Topic modeling goes beyond clustering by assigning meaningful labels and descriptions to each cluster.

From keywords to topic labels using LLMs

What is Topic Modeling?

Topic modeling identifies abstract "topics" that occur in a collection of documents. While clustering groups documents, topic modeling names and describes those groups.

Clustering: "These 100 documents are similar"
Topic Modeling: "These documents are about 'Neural Machine Translation'"

Traditional vs LLM-Based Topic Modeling

Traditional (LDA):

  • Topics are probability distributions over words

  • Output: [("neural", 0.05), ("network", 0.04), ("deep", 0.03), ...]

  • Hard to interpret

  • No semantic understanding

LLM-Based (BERTopic):

  • Topics are coherent keyword clusters

  • Output: ["neural networks", "deep learning", "training"]

  • Plus optional LLM-generated label: "Deep Neural Network Training"

  • Semantically meaningful

Generating Topic Labels with LLMs

The most powerful feature: using GPT-4 or Claude to create human-readable topic labels.

Method 1: BERTopic Built-in LLM Support

code
from bertopic.representation import OpenAI
from bertopic import BERTopic

# Configure LLM labeling
prompt = """I have a topic described by these keywords: [KEYWORDS]
Based on these keywords, generate a concise, descriptive topic label (2-6 words).
The label should capture the main theme.
Only return the label, nothing else.

Keywords: [KEYWORDS]
Label:"""

llm_model = OpenAI(
    model="gpt-4",
    chat=True,
    prompt=prompt,
    exponential_backoff=True
)

# Create BERTopic with LLM labeling
topic_model = BERTopic(
    representation_model=llm_model,
    verbose=True
)

topics, probs = topic_model.fit_transform(documents)

# View generated labels
topic_info = topic_model.get_topic_info()
print(topic_info[['Topic', 'Count', 'Name']])

Output:

code
Topic  Count  Name
-1     234    -1_outlier_documents
 0     1247   Deep Neural Network Training
 1     982    Natural Language Processing
 2     756    Computer Vision and Object Detection
 3     543    Reinforcement Learning Algorithms

Method 2: Post-hoc Labeling with Claude

code
import anthropic

def generate_topic_label_claude(keywords, sample_docs):
    """Generate topic label using Claude"""

    client = anthropic.Anthropic(api_key="your-api-key")

    keywords_str = ", ".join([word for word, _ in keywords[:10]])
    docs_str = "\n".join([f"• {doc[:150]}" for doc in sample_docs[:3]])

    prompt = f"""<documents>
Keywords that characterize this topic: {keywords_str}

Sample documents from this cluster:
{docs_str}
</documents>

Based on the keywords and sample documents, generate a concise, descriptive label (2-6 words) that captures the main theme of this topic.

Respond with ONLY the label."""

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=50,
        temperature=0.3,
        messages=[{"role": "user", "content": prompt}]
    )

    return message.content[0].text.strip()

# Generate labels for all topics
for topic_id in range(len(set(topics)) - 1):
    keywords = topic_model.get_topic(topic_id)
    topic_docs = [documents[i] for i, t in enumerate(topics) if t == topic_id][:5]

    label = generate_topic_label_claude(keywords, topic_docs)
    print(f"Topic {topic_id}: {label}")

6. Real-World Implementation: ArXiv Dataset Example

Let's implement a complete, production-ready clustering system using the ArXiv NLP papers dataset.

Dataset Overview

ArXiv NLP Dataset:

  • Documents: 44,949 research paper abstracts

  • Domain: Computation and Language (cs.CL)

  • Years: 1991-2024

  • Source: Hugging Face (maartengr/arxiv_nlp)

Complete Implementation

code
"""ArXiv NLP Paper ClusteringComplete pipeline from data loading to visualization"""# Step 1: Load Datafrom datasets import load_datasetprint("Loading ArXiv NLP dataset...")dataset = load_dataset("maartengr/arxiv_nlp")["train"]abstracts = dataset["Abstracts"]titles = dataset["Titles"]print(f"✓ Loaded {len(abstracts)} papers")# Step 2: Generate Embeddingsfrom sentence_transformers import SentenceTransformerprint("\n[1/3] Generating embeddings...")embedding_model = SentenceTransformer("all-mpnet-base-v2")embeddings = embedding_model.encode(    abstracts,    batch_size=32,    show_progress_bar=True,    normalize_embeddings=True)print(f"✓ Generated {embeddings.shape} embeddings")# Step 3: Create BERTopic Modelfrom bertopic import BERTopicfrom umap import UMAPfrom hdbscan import HDBSCANfrom sklearn.feature_extraction.text import CountVectorizerfrom bertopic.vectorizers import ClassTfidfTransformerprint("\n[2/3] Configuring BERTopic...")umap_model = UMAP(    n_neighbors=15,    n_components=5,    min_dist=0.0,    metric='cosine',    random_state=42)hdbscan_model = HDBSCAN(    min_cluster_size=15,    min_samples=10,    metric='euclidean',    cluster_selection_method='eom')vectorizer_model = CountVectorizer(    ngram_range=(1, 2),    stop_words="english",    min_df=5)ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)topic_model = BERTopic(    embedding_model=embedding_model,    umap_model=umap_model,    hdbscan_model=hdbscan_model,    vectorizer_model=vectorizer_model,    ctfidf_model=ctfidf_model,    top_n_words=10,    verbose=True)# Step 4: Fit Modelprint("\n[3/3] Clustering and extracting topics...")topics, probabilities = topic_model.fit_transform(abstracts, embeddings)n_topics = len(set(topics)) - 1n_outliers = list(topics).count(-1)print(f"\n{'='*60}")print("RESULTS")print(f"{'='*60}")print(f"✓ Discovered {n_topics} topics")print(f"✓ Outliers: {n_outliers} ({n_outliers/len(topics)*100:.1f}%)")# Step 5: Inspect Topicstopic_info = topic_model.get_topic_info()print(f"\nTop 10 Topics:")print(topic_info.head(11)[['Topic', 'Count', 'Name']])# Step 6: Detailed Topic Inspectionprint(f"\n{'='*60}")print("TOPIC DETAILS")print(f"{'='*60}")for i in range(min(5, n_topics)):    topic_id = topic_info.iloc[i+1]['Topic']  # Skip -1    topic_size = topic_info.iloc[i+1]['Count']        print(f"\nTOPIC {topic_id} ({topic_size} papers)")    print("-" * 60)        # Keywords    topic_words = topic_model.get_topic(topic_id)    print("Keywords:")    for word, score in topic_words[:8]:        print(f"  {word:25s} {score:.4f}")        # Sample papers    topic_papers = [(i, titles[i]) for i, t in enumerate(topics) if t == topic_id]    print(f"\nSample papers:")    for j, (idx, title) in enumerate(topic_papers[:3]):        print(f"  {j+1}. {title}")# Step 7: Save Modelprint(f"\n{'='*60}")print("SAVING")print(f"{'='*60}")topic_model.save("arxiv_bertopic_model")print("✓ Model saved to arxiv_bertopic_model/")# Step 8: Visualizationsprint("\nGenerating visualizations...")# Topic mapfig1 = topic_model.visualize_topics()fig1.write_html("arxiv_topics_map.html")print("✓ Saved: arxiv_topics_map.html")# Bar chartfig2 = topic_model.visualize_barchart(top_n_topics=20, n_words=8)fig2.write_html("arxiv_barchart.html")print("✓ Saved: arxiv_barchart.html")# Hierarchyfig3 = topic_model.visualize_hierarchy()fig3.write_html("arxiv_hierarchy.html")print("✓ Saved: arxiv_hierarchy.html")print(f"\n{'='*60}")print("✓ COMPLETE!")print(f"{'='*60}")

Expected Output:

code
Loading ArXiv NLP dataset...
✓ Loaded 44,949 papers

[1/3] Generating embeddings...
100%|██████████| 1405/1405 [08:32<00:00,  2.74it/s]
✓ Generated (44949, 768) embeddings

[2/3] Configuring BERTopic...

[3/3] Clustering and extracting topics...

============================================================
RESULTS
============================================================
✓ Discovered 127 topics
✓ Outliers: 2,341 (5.2%)

Top 10 Topics:
Topic  Count  Name
-1     2341   -1_outliers
 0     3456   0_neural_machine_translation
 1     2234   1_question_answering_reading
 2     1876   2_sentiment_analysis_opinion
 3     1654   3_named_entity_recognition
 4     1432   4_speech_recognition_acoustic
...

============================================================
TOPIC DETAILS
============================================================

TOPIC 0 (3456 papers)
------------------------------------------------------------
Keywords:
  neural                   0.0234
  machine                  0.0198
  translation              0.0187
  nmt                      0.0156
  encoder                  0.0142
  decoder                  0.0138
  attention                0.0121
  transformer              0.0109

Sample papers:
  1. Attention Is All You Need
  2. Neural Machine Translation by Jointly Learning to Align and Translate
  3. Effective Approaches to Attention-based Neural Machine Translation

[... continues for topics 1-4 ...]

============================================================
SAVING
============================================================
✓ Model saved to arxiv_bertopic_model/

Generating visualizations...
✓ Saved: arxiv_topics_map.html
✓ Saved: arxiv_barchart.html
✓ Saved: arxiv_hierarchy.html

============================================================
✓ COMPLETE!
============================================================

Results Analysis

code
# Calculate quality metrics
import numpy as np
from sklearn.metrics import silhouette_score

topics_array = np.array(topics)  # fit_transform returns a list; convert for masking
mask = topics_array != -1

silhouette = silhouette_score(
    embeddings[mask][:10000],  # Sample for speed
    topics_array[mask][:10000],
    metric='cosine'
)
print(f"Clustering Quality:")
print(f"  • Silhouette Score: {silhouette:.4f}")
print(f"    {'Excellent' if silhouette > 0.7 else 'Good' if silhouette > 0.5 else 'Moderate'}")

# Find most interesting topics
import pandas as pd

topic_sizes = pd.Series(topics_array[mask]).value_counts()
print(f"\n  • Largest cluster: {topic_sizes.max()} papers")
print(f"  • Smallest cluster: {topic_sizes.min()} papers")
print(f"  • Average size: {topic_sizes.mean():.1f} papers")

# Niche but significant topics
niche_topics = topic_info[
    (topic_info['Count'] > 50) &
    (topic_info['Count'] < 200)
]
print(f"\nNiche Topics (50-200 papers):")
for idx, row in niche_topics.head(5).iterrows():
    if row['Topic'] == -1:
        continue
    print(f"  • Topic {row['Topic']}: {row['Name']} ({row['Count']} papers)")

8. Production Deployment Best Practices

Model Selection Guidelines

Decision Matrix:

| Dataset Size | Speed Priority | Quality Priority | Recommended Setup |
| < 1K docs | High | Medium | MiniLM + UMAP + K-Means |
| 1K-10K | Medium | High | mpnet + UMAP + HDBSCAN |
| 10K-100K | High | High | mpnet + UMAP + HDBSCAN + batching |
| 100K+ | High | Medium | MiniLM + PCA + MiniBatchKMeans |
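
If you want this decision encoded in code, a small helper could mirror the matrix above (a hypothetical sketch; the function name and thresholds are illustrative, not an established API):

code
# Hypothetical helper mirroring the decision matrix above (illustrative only)
def recommend_setup(n_docs: int) -> dict:
    """Suggest an embedding model and clustering stack by dataset size."""
    if n_docs < 1_000:
        return {"embedding": "all-MiniLM-L6-v2", "reduction": "UMAP", "clustering": "K-Means"}
    if n_docs < 10_000:
        return {"embedding": "all-mpnet-base-v2", "reduction": "UMAP", "clustering": "HDBSCAN"}
    if n_docs < 100_000:
        return {"embedding": "all-mpnet-base-v2", "reduction": "UMAP",
                "clustering": "HDBSCAN", "note": "embed in batches"}
    return {"embedding": "all-MiniLM-L6-v2", "reduction": "PCA", "clustering": "MiniBatchKMeans"}

print(recommend_setup(25_000))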

Performance Optimization

code
# 1. Use GPU acceleration
embedding_model = SentenceTransformer("all-mpnet-base-v2", device="cuda")

# 2. Batch encode with normalized embeddings
embeddings = embedding_model.encode(
    documents,
    batch_size=64,
    convert_to_numpy=True,
    show_progress_bar=True,
    normalize_embeddings=True  # Faster cosine similarity
)

# 3. Cache embeddings
import joblib
# Save
joblib.dump(embeddings, "embeddings.pkl")
# Load
embeddings = joblib.load("embeddings.pkl")

# 4. Use approximate nearest neighbors for large datasets
from annoy import AnnoyIndex

def build_annoy_index(embeddings, n_trees=10):
    dimension = embeddings.shape[1]
    index = AnnoyIndex(dimension, 'angular')

    for i, emb in enumerate(embeddings):
        index.add_item(i, emb)

    index.build(n_trees)
    return index

Monitoring and Evaluation

code
import logging
from datetime import datetime

class ClusteringMonitor:
    """Monitor clustering performance in production"""

    def __init__(self, log_file="clustering_metrics.log"):
        logging.basicConfig(
            filename=log_file,
            level=logging.INFO,
            format='%(asctime)s - %(message)s'
        )
        self.logger = logging.getLogger(__name__)

    def log_clustering_run(self, n_docs, n_clusters, n_outliers,
                           silhouette, runtime):
        """Log clustering metrics"""
        metrics = {
            'timestamp': datetime.now().isoformat(),
            'n_documents': n_docs,
            'n_clusters': n_clusters,
            'n_outliers': n_outliers,
            'outlier_pct': n_outliers / n_docs * 100,
            'silhouette_score': silhouette,
            'runtime_seconds': runtime
        }

        self.logger.info(f"Clustering run: {metrics}")

        # Alert if quality drops
        if silhouette < 0.3:
            self.logger.warning(f"Low silhouette score: {silhouette}")

        if n_outliers / n_docs > 0.2:
            self.logger.warning(f"High outlier rate: {n_outliers/n_docs*100:.1f}%")

    def log_topic_update(self, topic_id, old_keywords, new_keywords):
        """Log topic changes"""
        added = set(new_keywords) - set(old_keywords)
        removed = set(old_keywords) - set(new_keywords)

        if added or removed:
            self.logger.info(
                f"Topic {topic_id} changed - Added: {added}, Removed: {removed}"
            )

# Usage
monitor = ClusteringMonitor()

import time
start = time.time()

# Run clustering
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(documents)
runtime = time.time() - start

# Calculate metrics
import numpy as np
from sklearn.metrics import silhouette_score

topics_array = np.array(topics)  # fit_transform returns a list; convert for masking
n_outliers = list(topics).count(-1)
mask = topics_array != -1
silhouette = silhouette_score(embeddings[mask], topics_array[mask], metric='cosine')

# Log
monitor.log_clustering_run(
    n_docs=len(documents),
    n_clusters=len(set(topics)) - 1,
    n_outliers=n_outliers,
    silhouette=silhouette,
    runtime=runtime
)

9. Common Challenges and Solutions

Challenge 1: Too Many Small Clusters

Problem: HDBSCAN creates 100+ tiny clusters instead of meaningful groups

Symptoms:

  • Many clusters with 10-20 documents

  • Fragmented topics

  • Hard to interpret

Solutions:

code
# Solution 1: Increase min_cluster_size
hdbscan_model = HDBSCAN(
    min_cluster_size=50,  # Increase from 15
    min_samples=10
)

# Solution 2: Use topic reduction in BERTopic
topic_model = BERTopic(
    hdbscan_model=hdbscan_model,
    nr_topics=20  # Reduce to 20 topics
)

# Solution 3: Post-hoc merging
topic_model.reduce_topics(documents, topics, nr_topics=20)

# Solution 4: Adjust UMAP parameters for less granularity
umap_model = UMAP(
    n_neighbors=30,  # Increase (was 15)
    n_components=5,
    min_dist=0.1     # Increase (was 0.0)
)

Challenge 2: Poor Topic Coherence

Problem: Topics contain unrelated or nonsensical keywords

Solutions:

code
# Solution 1: Better preprocessing
import re

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove emails
    text = re.sub(r'\S+@\S+', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text.lower()

documents_clean = [preprocess_text(doc) for doc in documents]

# Solution 2: Better stopword handling
vectorizer_model = CountVectorizer(
    ngram_range=(1, 2),
    stop_words="english",
    min_df=10,  # Increase (ignore very rare words)
    max_df=0.5  # Decrease (ignore very common words)
)

# Solution 3: Use representation models
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

representation_models = [
    KeyBERTInspired(),
    MaximalMarginalRelevance(diversity=0.3)
]

topic_model = BERTopic(
    representation_model=representation_models
)

# Solution 4: Manual topic refinement
# Merge similar topics
topics_to_merge = [[1, 5], [3, 7], [9, 12]]
topic_model.merge_topics(documents, topics, topics_to_merge)

Challenge 3: High Outlier Rate (>20%)

Problem: Too many documents classified as outliers (-1 cluster)

Solutions:

code
# Solution 1: Reduce min_samples
hdbscan_model = HDBSCAN(
    min_cluster_size=15,
    min_samples=5,  # Decrease (was 10)
    metric='euclidean'
)

# Solution 2: Try different distance metric
hdbscan_model = HDBSCAN(
    min_cluster_size=15,
    metric='manhattan',  # Try instead of euclidean
    cluster_selection_method='eom'
)

# Solution 3: Reduce dimensionality less aggressively
umap_model = UMAP(
    n_neighbors=15,
    n_components=10,  # Increase (was 5)
    min_dist=0.0
)

# Solution 4: Assign outliers to nearest cluster
def assign_outliers_to_nearest_cluster(embeddings, clusters):
    outlier_mask = clusters == -1
    outlier_indices = np.where(outlier_mask)[0]

    # Get cluster centroids
    cluster_ids = set(clusters) - {-1}
    centroids = {}

    for cluster_id in cluster_ids:
        cluster_points = embeddings[clusters == cluster_id]
        centroids[cluster_id] = cluster_points.mean(axis=0)

    # Assign each outlier to nearest centroid
    for idx in outlier_indices:
        distances = {
            cid: np.linalg.norm(embeddings[idx] - centroid)
            for cid, centroid in centroids.items()
        }

        nearest_cluster = min(distances, key=distances.get)
        clusters[idx] = nearest_cluster

    return clusters

# Apply
clusters_fixed = assign_outliers_to_nearest_cluster(embeddings, clusters.copy())

Challenge 4: Slow Performance

Problem: Clustering takes too long on large datasets

Solutions:

code
# Solution 1: Use smaller embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # Fast, 80MB

# Solution 2: Sample large datasets
sample_size = 10000
sample_indices = np.random.choice(len(documents), sample_size, replace=False)
sample_docs = [documents[i] for i in sample_indices]

# Cluster sample
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(sample_docs)

# Predict on full dataset
all_topics, all_probs = topic_model.transform(documents)

# Solution 3: Use approximate UMAP
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    metric='cosine',
    low_memory=True,  # Use less memory, slightly slower
    random_state=42
)

# Solution 4: Parallel processing
from multiprocessing import Pool

def process_batch(batch):
    return embedding_model.encode(batch)

# Split into batches
batch_size = 1000
batches = [documents[i:i+batch_size] for i in range(0, len(documents), batch_size)]

# Process in parallel
with Pool(processes=4) as pool:
    batch_embeddings = pool.map(process_batch, batches)

embeddings = np.vstack(batch_embeddings)

10. Comparison: BERTopic vs Alternatives

vs Traditional LDA

| Aspect | LDA | BERTopic |
| Input | Bag-of-words | Embeddings |
| Context | ❌ None | ✅ Contextual |
| # Topics | ⚠️ Must specify | ✅ Auto-discovers |
| Outliers | ❌ Forces assignment | ✅ Explicit -1 cluster |
| Short text | ❌ Poor | ✅ Excellent |
| Speed | ✅ Fast | ⚠️ Slower |
| Interpretability | ⚠️ Probability distributions | ✅ Clear keywords |
| Reproducibility | ⚠️ Varies | ✅ Deterministic (with seed) |

vs Top2Vec

| Aspect | Top2Vec | BERTopic |
| Architecture | Doc2Vec + UMAP + HDBSCAN | SBERT + UMAP + HDBSCAN |
| Modularity | ❌ Fixed pipeline | ✅ Fully modular |
| Customization | ⚠️ Limited | ✅ Extensive |
| Topic refinement | ❌ Basic | ✅ Multiple representation models |
| Online learning | ✅ Yes | ✅ Yes |
| Hierarchical | ❌ No | ✅ Yes |
| Dynamic modeling | ❌ No | ✅ Yes |

vs CTM (Contextualized Topic Models)

| Aspect | CTM | BERTopic |
| Base model | BERT + Neural Variational | SBERT + HDBSCAN |
| Complexity | ⚠️ High | ✅ Medium |
| Training time | ⚠️ Slow | ✅ Fast |
| Stability | ⚠️ Requires tuning | ✅ Stable defaults |
| Zero-shot | ❌ Needs training | ✅ Immediate |
| Documentation | ⚠️ Limited | ✅ Extensive |
| Community | ⚠️ Small | ✅ Large |

When to Use What

Use LDA when:

  • You need very fast processing

  • Working with large, clean corpora

  • Interpretable probability distributions are important

  • Limited computational resources

Use Top2Vec when:

  • You want Doc2Vec embeddings specifically

  • Simpler API preferred

  • Don't need customization

Use CTM when:

  • Academic research context

  • Need probabilistic framework

  • Have computational resources for training

Use BERTopic when:

  • Need production-ready solution ✅

  • Want modular, customizable pipeline ✅

  • Working with diverse text types ✅

  • Need hierarchical topics ✅

  • Want dynamic/online modeling ✅

  • Require extensive documentation ✅

11. Tools and Resources

Python Libraries

Core Libraries:

code
# Essential
pip install bertopic sentence-transformers umap-learn hdbscan

# Visualization
pip install plotly datamapplot

# Optional enhancements
pip install spacy
python -m spacy download en_core_web_sm

# For LLM labeling
pip install openai anthropic

Alternative Libraries:

code
# Traditional topic modeling
pip install gensim        # For LDA
pip install scikit-learn  # For NMF, LSA

# Other embedding models
pip install transformers torch

# Approximate nearest neighbors
pip install annoy faiss-cpu

Embedding Models

General Purpose (Recommended):

  • all-MiniLM-L6-v2 - Fast, 384-dim, 80MB

  • all-mpnet-base-v2 - High quality, 768-dim, 420MB

  • stella-en-400M-v5 - State-of-the-art, 1024-dim, 1.6GB

Domain-Specific:

  • allenai/specter - Scientific papers

  • biobert-base-cased - Biomedical text

  • finbert - Financial documents

  • legal-bert-base-uncased - Legal documents

Multilingual:

  • paraphrase-multilingual-MiniLM-L12-v2 - 50+ languages

  • distiluse-base-multilingual-cased-v2 - Fast multilingual

Find more: MTEB Leaderboard

Datasets for Practice

  1. 20 Newsgroups - Classic text classification
code
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all')['data']
  2. ArXiv Papers - Academic abstracts
code
from datasets import load_dataset
dataset = load_dataset("maartengr/arxiv_nlp")
  3. BBC News - News articles
code
# Download from: http://mlg.ucd.ie/datasets/bbc.html
  4. Amazon Reviews - Product reviews
code
from datasets import load_dataset
dataset = load_dataset("amazon_polarity")
  5. Twitter Sentiment - Short texts
code
from datasets import load_dataset
dataset = load_dataset("tweet_eval", "sentiment")

Documentation & Tutorials

Official Documentation:

  • BERTopic: https://maartengr.github.io/BERTopic/

  • Sentence-Transformers: https://www.sbert.net

  • UMAP: https://umap-learn.readthedocs.io

  • HDBSCAN: https://hdbscan.readthedocs.io

Tutorials:

Books:

  • Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst

Research Papers: see the References section at the end of this article.


12. Frequently Asked Questions

Q1: What's the difference between text clustering and topic modeling?

A: Text clustering groups similar documents together based on semantic meaning, while topic modeling labels and describes those groups with keywords or phrases.

Think of it this way:

  • Clustering: "These 100 documents are similar" (grouping)

  • Topic modeling: "These documents are about 'neural networks and deep learning'" (labeling)

In practice, topic modeling often follows clustering. BERTopic combines both: it clusters documents (Stage 1-3) then extracts topics (Stage 4-6).
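To see the two outputs side by side, here is a minimal sketch, assuming `topic_model` and `topics` come from an earlier `fit_transform` call:

code
# Clustering output: one cluster id per document (grouping)
print(topics[:10])                    # e.g. [0, 0, 1, -1, 2, ...]

# Topic modeling output: keywords that describe each group (labeling)
print(topic_model.get_topic(0))       # [(word, weight), ...] for cluster 0
print(topic_model.get_topic_info())   # one row per topic with count and name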

Q2: Why use BERTopic instead of traditional LDA for topic modeling?

A: BERTopic offers several advantages:

  1. Contextual understanding: Uses BERT embeddings that understand "bank" (river) vs "bank" (financial)

  2. No K specification: Discovers optimal number of topics automatically

  3. Better outlier handling: HDBSCAN explicitly identifies outliers (-1 cluster)

  4. Short text performance: Works well with tweets, reviews (LDA struggles)

  5. Modularity: Swap components (embedding model, clustering algorithm, etc.)

  6. Topic coherence: Generally produces more interpretable topics

When to use LDA:

  • Very large datasets (millions of documents)

  • Extremely limited compute resources

  • Need probabilistic topic distributions

  • Academic research requiring traditional methods

Q3: What does the -1 cluster in HDBSCAN represent?

A: The -1 cluster represents outliers - documents that don't fit well into any cluster. These are data points too far from dense regions to be assigned to a cluster.

Common outlier types:

  • 🔍 Legitimate edge cases (rare, unique topics)

  • 🗑️ Noise, spam, or low-quality text

  • 📝 Multi-topic documents blending themes

  • ⚠️ Very short documents lacking context

Handling strategies:

  1. Keep separate: Review manually, may contain insights

  2. Assign to nearest: Use distance to cluster centroids (see the sketch after this list)

  3. Adjust parameters: Reduce min_samples to be less strict

  4. Accept as normal: 5-10% outliers is typical and healthy
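If you go with strategy 2, recent BERTopic versions ship an outlier-reduction helper. A minimal sketch, assuming `topic_model`, `docs`, and `topics` come from an earlier `fit_transform`:

code
# Reassign -1 documents to their nearest topic using c-TF-IDF similarity
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")

# Alternative: use embedding similarity instead
# (pass the same precomputed embeddings you trained on)
# new_topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings", embeddings=embeddings)

# Persist the reassignment so topic representations reflect it
topic_model.update_topics(docs, topics=new_topics)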

Q4: Which embedding model should I use for text clustering?

A: Choose based on your priorities:

Speed priority:

  • all-MiniLM-L6-v2 (384-dim, 80MB, ~1000 docs/sec)

  • Best for: Prototyping, large datasets, real-time systems

Quality priority:

  • all-mpnet-base-v2 (768-dim, 420MB, ~400 docs/sec)

  • Best for: Production systems, critical applications

Bleeding edge:

  • stella-en-400M-v5 (1024-dim, 1.6GB, ~200 docs/sec)

  • Best for: Research, maximum accuracy

Domain-specific:

  • Scientific papers: allenai/specter

  • Medical: biobert-base-cased

  • Financial: finbert

  • Legal: legal-bert-base-uncased

Always test on a sample of your data! Embeddings that work well for news articles may not be optimal for tweets.
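For a quick, rough comparison on your own data, here is a minimal sketch that times encoding for two candidate models; `sample_docs` is assumed to be a few hundred of your documents, and throughput will vary with hardware:

code
import time
from sentence_transformers import SentenceTransformer

candidates = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]

for name in candidates:
    model = SentenceTransformer(name)
    start = time.time()
    embeddings = model.encode(sample_docs, show_progress_bar=False)
    elapsed = time.time() - start
    print(f"{name}: {embeddings.shape[1]}-dim, "
          f"{len(sample_docs) / elapsed:.0f} docs/sec on this hardware")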

Q5: Can BERTopic handle documents in multiple languages?

A: Yes! Use a multilingual embedding model:

code
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Supports 50+ languages
multilingual_model = SentenceTransformer(
    "paraphrase-multilingual-MiniLM-L12-v2"
)

topic_model = BERTopic(
    embedding_model=multilingual_model,
    language="multilingual"
)

# Works with mixed-language documents
# (three documents are shown only for illustration; use a realistically sized corpus)
docs = [
    "Machine learning is powerful",           # English
    "El aprendizaje automático es poderoso",  # Spanish
    "機械学習は強力です"                        # Japanese
]
topics, probs = topic_model.fit_transform(docs)

Important notes:

  • All documents embedded in same semantic space

  • Similar concepts cluster together regardless of language

  • Topic keywords may include multiple languages

  • For best single-language results, use language-specific models

Q6: How do I choose the right number of clusters?

A: Great news - you don't have to! HDBSCAN automatically discovers the optimal number based on data density.

What HDBSCAN does:

  1. Finds dense regions in embedding space

  2. Groups documents in dense regions into clusters

  3. Marks isolated documents as outliers (-1)

  4. Number of clusters emerges naturally from data

If you want control:

code
from hdbscan import HDBSCAN

# More clusters (smaller groups)
hdbscan_model = HDBSCAN(
    min_cluster_size=10,  # Smaller minimum
    min_samples=5
)

# Fewer clusters (larger groups)
hdbscan_model = HDBSCAN(
    min_cluster_size=50,  # Larger minimum
    min_samples=10
)

# Or reduce topics post hoc
# (recent BERTopic versions take only docs and nr_topics)
topic_model.reduce_topics(documents, nr_topics=20)

Rule of thumb:

  • 1,000 docs → expect 10-30 clusters

  • 10,000 docs → expect 50-150 clusters

  • 100,000 docs → expect 100-300 clusters

Q7: How does c-TF-IDF differ from regular TF-IDF?

A: The key difference is the level of analysis:

Traditional TF-IDF:

  • Analyzes individual documents

  • Finds important words per document

  • Formula: TF(word, doc) × IDF(word, corpus)

c-TF-IDF (class-based):

  • Analyzes clusters (treats each cluster as one "mega-document")

  • Finds important words per cluster

  • Formula: TF(word, cluster) × IDF(word, all_clusters)

Example:

code
# Cluster 0: ML papers
#   contains: learning (500×), neural (300×), model (250×)
# Cluster 1: NLP papers
#   contains: language (400×), text (350×), nlp (200×)

# c-TF-IDF identifies:
#   Cluster 0 distinctive words: "neural", "deep", "network"
#   Cluster 1 distinctive words: "language", "syntax", "semantic"

# Regular TF-IDF would miss these cluster-level patterns!

Why it matters: c-TF-IDF produces more coherent, interpretable topics because it considers the cluster context.
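For the curious, here is a minimal NumPy sketch of the class-based weighting idea (term frequency per cluster scaled by an inverse frequency across clusters). It follows the spirit of the BERTopic formulation but is a simplified illustration, not the library's exact implementation; in practice you would read the fitted weights from `topic_model.c_tf_idf_`.

code
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# One "mega-document" per cluster: concatenate the documents assigned to it
cluster_docs = [
    "learning neural model neural learning network deep",  # cluster 0 (ML)
    "language text nlp language syntax semantic text",     # cluster 1 (NLP)
]

vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(cluster_docs).toarray().astype(float)  # term counts per cluster

# Class-based IDF: average words per cluster divided by each term's total frequency
avg_words_per_cluster = tf.sum(axis=1).mean()
idf = np.log(1 + avg_words_per_cluster / tf.sum(axis=0))

ctfidf = tf * idf  # shape: (n_clusters, n_terms)

terms = vectorizer.get_feature_names_out()
for cluster_id, row in enumerate(ctfidf):
    top = [terms[i] for i in row.argsort()[::-1][:3]]
    print(f"Cluster {cluster_id} distinctive words: {top}")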

Q8: Is BERTopic suitable for short texts like tweets?

A: Yes! BERTopic handles short texts much better than traditional methods like LDA.

Why it works:

  • Embeddings capture semantics even from few words

  • Contextual understanding helps with abbreviations

  • Handles informal language, emojis, hashtags

Best practices for short texts:

code
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# 1. Use a smaller min_cluster_size
hdbscan_model = HDBSCAN(
    min_cluster_size=10,  # Lower for short texts
    min_samples=5
)

# 2. Keep bigrams
vectorizer_model = CountVectorizer(
    ngram_range=(1, 2),  # Unigrams and bigrams
    stop_words="english"
)

# 3. Consider shorter keyword lists
topic_model = BERTopic(
    top_n_words=5,  # Fewer keywords for short texts
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model
)

Minimum text length: Generally works well with 5-10 words. Below that, consider:

  • Combining related short texts

  • Using bigrams/trigrams

  • Specialized short-text embeddings

Q9: How do I handle imbalanced datasets where some topics dominate?

A: Several strategies:

code
from hdbscan import HDBSCAN
from bertopic import BERTopic
from sklearn.utils import resample
import pandas as pd

# Strategy 1: Cap the number of topics with nr_topics
topic_model = BERTopic(nr_topics=50)  # Merge down to at most 50 topics

# Strategy 2: Adjust cluster granularity
hdbscan_model = HDBSCAN(
    min_cluster_size=15,
    cluster_selection_epsilon=0.5  # Merge similar clusters
)

# Strategy 3: Oversample minority topics
# Identify small clusters
cluster_counts = pd.Series(topics).value_counts()
small_clusters = cluster_counts[cluster_counts < 100].index

# Oversample documents from small clusters
oversampled_docs = []
oversampled_topics = []
for cluster_id in small_clusters:
    cluster_docs = [documents[i] for i, t in enumerate(topics) if t == cluster_id]

    # Oversample each small cluster up to 100 documents
    resampled = resample(cluster_docs, n_samples=100, replace=True)
    oversampled_docs.extend(resampled)
    oversampled_topics.extend([cluster_id] * len(resampled))

# Combine the original corpus with the oversampled minority documents
balanced_docs = list(documents) + oversampled_docs
balanced_topics = list(topics) + oversampled_topics

# Refresh topic representations with the balanced data
topic_model.update_topics(balanced_docs, topics=balanced_topics)

Q10: Can I update topics with new documents without retraining?

A: Yes! BERTopic supports incremental updates:

code
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Initial training
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(initial_documents)

# New documents arrive
new_documents = ["Latest AI research paper", "Novel NLP technique", ...]

# Option 1: Assign to existing topics (fast)
new_topics, new_probs = topic_model.transform(new_documents)

# Option 2: Update topic representations with new documents (slower, more accurate)
all_documents = initial_documents + new_documents
all_topics = list(topics) + list(new_topics)

updated_vectorizer = CountVectorizer(stop_words="english")  # or any refreshed vectorizer
topic_model.update_topics(
    docs=all_documents,
    topics=all_topics,
    vectorizer_model=updated_vectorizer
)
# Topics now reflect both old and new documents

When to use each:

  • transform(): Daily/weekly updates, streaming data

  • update_topics(): Monthly/quarterly, significant new content

13. Conclusion

Text clustering with LLMs represents a paradigm shift from keyword-matching to semantic understanding. By combining powerful embedding models (SBERT), effective dimensionality reduction (UMAP), and density-based clustering (HDBSCAN), we can automatically discover meaningful patterns in unstructured text at scale.

Key Takeaways

  • LLM embeddings capture semantics that traditional bag-of-words methods miss

  • BERTopic provides a modular framework that's both powerful and customizable

  • Three-stage pipeline (embed → reduce → cluster) is the foundation

  • c-TF-IDF extracts distinctive keywords by analyzing clusters, not documents

  • Production deployment requires careful attention to scalability and monitoring

  • No one-size-fits-all solution - tune parameters for your specific use case

When to Use Text Clustering

Perfect for:

  • Organizing large document collections (research papers, support tickets)

  • Discovering emerging themes (social media, news monitoring)

  • Data exploration before building classifiers

  • Identifying outliers and data quality issues

  • Creating semantic navigation systems

Not ideal for:

  • Real-time classification (use trained classifiers instead; see the sketch after this list)

  • When you need specific predefined categories (use classification)

  • Tiny datasets (<100 documents)

  • When interpretability isn't important
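A common pattern is to cluster offline and then serve a lightweight classifier trained on the cluster assignments. Here is a minimal sketch of that hand-off, assuming `docs`, `topics`, `embeddings` (a NumPy array), and `embedding_model` come from an earlier BERTopic run; the names are illustrative:

code
import numpy as np
from sklearn.linear_model import LogisticRegression

# Use cluster assignments as (weak) labels; drop outliers (-1) for training
labels = np.array(topics)
mask = labels != -1
clf = LogisticRegression(max_iter=1000)
clf.fit(embeddings[mask], labels[mask])

# At serving time: embed the incoming text and predict its topic in milliseconds
# (embedding_model is the same SentenceTransformer used for clustering)
new_embedding = embedding_model.encode(["New support ticket about billing errors"])
predicted_topic = clf.predict(new_embedding)[0]
print(predicted_topic)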

Implementation Checklist

Phase 1: Prototype (1-2 days)

  • [ ] Install BERTopic and dependencies

  • [ ] Load and explore your dataset

  • [ ] Run basic clustering with defaults

  • [ ] Inspect top 10 topics manually

  • [ ] Assess if approach is viable

Phase 2: Optimization (3-5 days)

  • [ ] Test different embedding models

  • [ ] Tune UMAP parameters (n_neighbors, n_components)

  • [ ] Tune HDBSCAN parameters (min_cluster_size, min_samples)

  • [ ] Experiment with representation models

  • [ ] Calculate quality metrics (silhouette, coherence)

Phase 3: Production (1-2 weeks)

  • [ ] Implement batch processing for large datasets

  • [ ] Add error handling and retries

  • [ ] Set up monitoring and logging

  • [ ] Create visualization dashboards

  • [ ] Document model parameters and decisions

  • [ ] Plan update strategy for new documents

Next Steps

Immediate actions:

  1. Download the code

  2. Try clustering on your own dataset

  3. Share results and questions in comments

Get Help

Questions? Drop a comment below - I respond within 24 hours
Found value? Share this guide with your team

References

  1. Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.

  2. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

  3. McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 205.

  4. McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.

  5. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.

© 2025 Ranjan Kumar. All rights reserved.