Master semantic document clustering using embeddings, UMAP, HDBSCAN, and BERTopic
Quick Start: 5-Minute Text Clustering
Want to see text clustering in action immediately? Here’s a minimal working example:
# Install: pip install bertopic
from bertopic import BERTopic
# Your documents
docs = [
"Machine learning is transforming how we process data",
"Deep learning uses neural networks with multiple layers",
"Natural language processing enables computers to understand text",
"Computer vision allows machines to interpret visual information",
"Reinforcement learning trains agents through rewards and penalties",
"Neural networks are inspired by biological brain structures",
"Text mining extracts valuable insights from unstructured data",
"Image recognition has applications in healthcare and security"
]
# Cluster in 3 lines
topic_model = BERTopic(min_topic_size=2, verbose=True)
topics, probs = topic_model.fit_transform(docs)
# View results
print(topic_model.get_topic_info())
Output:
Topic Count Name
-1 1 -1_outliers
0 3 0_neural_networks_deep_learning
1 2 1_language_processing_text
2 2 2_visual_computer_vision
That’s it! Now let’s dive into how this works and how to customize it for production use.
1. What is Text Clustering with LLMs?
Have you ever needed to organize thousands of customer feedback responses? Or group millions of research papers by topic? Or identify emerging themes in social media conversations?
Text clustering is the unsupervised machine learning technique of grouping similar documents based on their semantic content, meaning, and relationships. Unlike classification, which requires labeled data, clustering automatically discovers patterns in unstructured text.
The Problem We’re Solving
With recent advancements in Large Language Models (LLMs), we can now obtain extremely precise contextual and semantic representations of text. This has revolutionized text clustering’s effectiveness compared to traditional methods.
Traditional approach limitations:
- ❌ Bag-of-words models ignore context (“bank” always means the same thing)
- ❌ TF-IDF misses semantic relationships (“car” and “automobile” treated as different)
- ❌ Fixed vocabulary can’t handle synonyms or related concepts
- ❌ No understanding of actual word meanings
LLM-based clustering advantages:
- ✅ Contextual embeddings capture meaning (“river bank” vs “financial bank”)
- ✅ Semantic similarity across different words (“ML” ≈ “machine learning”)
- ✅ Handles synonyms and related concepts automatically
- ✅ Works across multiple languages with same model
Key Use Cases for Text Clustering
1. Customer Feedback Analysis
- Group support tickets by issue type automatically
- Identify recurring problems in product reviews
- Discover emerging customer concerns before they become critical
- Prioritize feature requests based on cluster size
2. Research Organization
- Cluster academic papers by research topic
- Discover emerging research trends in your field
- Find related work for literature reviews
- Organize large document repositories semantically
3. Content Management
- Automatically categorize blog posts and articles
- Group similar documents for easier navigation
- Improve search relevance with semantic grouping
- Enable topic-based content discovery
4. Data Quality & Labeling
- Identify outliers and anomalies in datasets
- Detect mislabeled data by finding odd cluster assignments
- Find duplicate or near-duplicate content
- Accelerate manual labeling by clustering first
5. Market Intelligence
- Analyze competitor mentions and sentiment
- Track brand perception across clusters
- Identify market segments in customer data
- Monitor emerging trends in social media
2. Why Use LLMs for Text Clustering?
Traditional Clustering Methods and Their Limitations
Before LLMs, text clustering relied on simpler, less effective representations:
TF-IDF + K-Means: The Classic Approach
# Traditional approach
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# Convert text to numbers (loses meaning!)
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(documents)
# Must specify number of clusters upfront
kmeans = KMeans(n_clusters=10, random_state=42)
clusters = kmeans.fit_predict(tfidf_matrix)
Critical limitations:
- Treats “king” and “queen” as completely unrelated words
- Misses that “car” and “automobile” mean the same thing
- Can’t understand negation: “not good” vs “good” look similar
- Requires knowing number of clusters in advance
- No contextual understanding whatsoever
LDA (Latent Dirichlet Allocation): Probabilistic Topic Modeling
from sklearn.decomposition import LatentDirichletAllocation
# Also requires specifying number of topics
lda = LatentDirichletAllocation(n_components=10, random_state=42)
topic_distributions = lda.fit_transform(tfidf_matrix)
Key problems:
- Assumes bag-of-words (word order doesn’t matter)
- No semantic understanding of relationships
- Fixed number of topics must be specified
- Poor performance on short texts (tweets, reviews)
- Topics are just probability distributions over words
LSA (Latent Semantic Analysis): Dimensionality Reduction
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=100, random_state=42)
lsa_features = svd.fit_transform(tfidf_matrix)
Limitations:
- Only captures linear relationships between words
- Loses word order information completely
- Requires careful dimensionality tuning
- Still no contextual awareness
How LLMs Transform Text Clustering
Modern LLM embeddings bring contextual understanding to clustering:
1. Contextual Understanding
Traditional (same embedding everywhere):
"bank" → [0.2, 0.5, 0.1, 0.8, ...] # Always identicalLLM-based (context-aware):
"river bank" → [0.8, 0.1, 0.3, 0.2, ...] # Geographic context
"bank account" → [0.1, 0.9, 0.2, 0.7, ...] # Financial contextThe same word gets different embeddings based on context!
2. Semantic Similarity
LLMs understand that these are similar concepts:
- “automobile” ≈ “car” ≈ “vehicle” ≈ “auto”
- “happy” ≈ “joyful” ≈ “pleased” ≈ “delighted”
- “ML” ≈ “machine learning” ≈ “artificial intelligence”
- “NLP” ≈ “natural language processing” ≈ “text analysis”
3. Multilingual Magic
Same model handles multiple languages in unified semantic space:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
docs = [
"Machine learning is powerful", # English
"El aprendizaje automático es poderoso", # Spanish
"機械学習は強力です", # Japanese
"机器学习很强大", # Chinese
]
# All embedded in the same semantic space!
embeddings = model.encode(docs)
# Similar concepts cluster together regardless of language
4. Handles Real-World Text Variations
LLMs naturally understand:
- Synonyms: “start” = “begin” = “commence” = “initiate”
- Acronyms: “ML” = “Machine Learning”
- Misspellings: “recieve” ≈ “receive” (close embeddings)
- Abbreviations: “AI” = “Artificial Intelligence”
- Slang: “LOL” = “laughing” = “funny”
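A quick way to sanity-check these claims on your own model is to compare a few pairs directly; the pairs and model below are only illustrative:

```python
# Sketch: pairs that "mean the same" should score clearly higher than an
# unrelated control pair. Exact values depend on the embedding model.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("start the process", "begin the process"),                   # synonyms
    ("ML model", "machine learning model"),                       # acronym
    ("I will recieve the parcel", "I will receive the parcel"),   # misspelling
    ("start the process", "the stock market crashed"),            # unrelated control
]
for a, b in pairs:
    emb_a, emb_b = model.encode([a, b], normalize_embeddings=True)
    score = cosine_similarity([emb_a], [emb_b])[0][0]
    print(f"{a!r} vs {b!r}: {score:.3f}")
```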
Comparison: Traditional vs LLM-Based Methods
| Feature | TF-IDF + K-Means | LDA | LLM + BERTopic |
|---|---|---|---|
| Context awareness | ❌ None | ❌ None | ✅ Full contextual |
| Semantic understanding | ❌ Keyword only | ⚠️ Limited | ✅ Deep semantic |
| Handles synonyms | ❌ No | ❌ No | ✅ Yes |
| Multilingual | ❌ Separate models | ❌ Separate models | ✅ Single model |
| # clusters needed | ⚠️ Must specify K | ⚠️ Must specify K | ✅ Auto-discovers |
| Outlier detection | ❌ Forces all docs | ❌ Poor | ✅ Explicit -1 cluster |
| Short text (tweets) | ⚠️ Moderate | ❌ Poor | ✅ Excellent |
| Topic coherence | ⚠️ Moderate | ⚠️ Moderate | ✅ High |
| Interpretability | ✅ Clear keywords | ✅ Probabilities | ✅ Keywords + context |
| Speed | ✅ Very fast | ✅ Fast | ⚠️ Slower |
| Memory usage | ✅ Low | ✅ Low | ⚠️ Higher |
| Training needed | ✅ None | ✅ None | ✅ None (pre-trained) |
| Best for | Simple, clean text | Academic research | Production systems |
Real-World Example: Why Context Matters
Let’s see the difference in action:
# Traditional TF-IDF scores these as highly similar (most words overlap)
doc1 = "The movie was not good at all"
doc2 = "The movie was very good overall"
# LLM embeddings correctly identify opposite sentiments
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
emb1 = model.encode([doc1])[0]
emb2 = model.encode([doc2])[0]
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity([emb1], [emb2])[0][0]
print(f"Similarity: {similarity:.3f}") # Low similarity (0.234)
# LLM correctly understands "not good" ≠ "good"3. The Text Clustering Pipeline: 3-Stage Approach
Text clustering with LLMs follows a systematic three-stage pipeline. Each stage is crucial for high-quality results.

Figure 1: Complete text clustering pipeline
Pipeline Overview
Let’s explore each stage in detail.
Stage 1: Document Embedding
Goal: Transform text documents into dense numerical vectors that preserve semantic meaning.
Why Embeddings?
Computers can’t process text directly. We need numbers that capture meaning:
# ❌ Bad: Simple encoding loses all meaning
"I love this movie" → [1, 2, 3, 4]
"I hate this movie" → [1, 5, 3, 4]
# These look similar (3 of 4 numbers match) but mean opposite things!
# ✅ Good: Embeddings capture semantic relationships
"I love this movie" → [0.8, 0.2, 0.9, -0.1, ...] # Positive sentiment
"I hate this movie" → [-0.7, 0.1, -0.8, 0.2, ...] # Negative sentiment
# These are far apart in embedding space (correct!)
Choosing the Right Embedding Model
This is the most important decision for your clustering system!
Popular Embedding Models:
| Model Name | Dimensions | Size | Speed | Quality | Best For |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80MB | ⚡⚡⚡ Fast | ⭐⭐⭐⭐ Good | General purpose, prototyping |
| all-mpnet-base-v2 | 768 | 420MB | ⚡⚡ Medium | ⭐⭐⭐⭐⭐ Excellent | Production systems |
| stella-en-400M-v5 | 1024 | 1.6GB | ⚡ Slow | ⭐⭐⭐⭐⭐ Best | Maximum accuracy |
| e5-large-v2 | 1024 | 1.3GB | ⚡ Slow | ⭐⭐⭐⭐⭐ Best | Research |
| paraphrase-multilingual | 768 | 970MB | ⚡⚡ Medium | ⭐⭐⭐⭐ Good | 50+ languages |
How to choose:
# For speed and efficiency (good starting point)
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# ✅ Fast: ~1000 docs/second
# ✅ Small: 80MB download
# ⚠️ Quality: Good but not best
# For best quality (recommended for production)
embedding_model = SentenceTransformer("all-mpnet-base-v2")
# ✅ Quality: Excellent accuracy
# ⚠️ Speed: ~400 docs/second
# ⚠️ Size: 420MB
# For domain-specific text
embedding_model = SentenceTransformer("allenai/specter") # Scientific papers
embedding_model = SentenceTransformer("finbert") # Financial documentsGenerating Embeddings: Complete Implementation
from sentence_transformers import SentenceTransformer
import numpy as np
# Step 1: Initialize model
print("Loading embedding model...")
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
print(f"✓ Model loaded")
print(f"✓ Embedding dimensions: {embedding_model.get_sentence_embedding_dimension()}")
# Step 2: Prepare your documents
documents = [
"Deep learning revolutionizes computer vision applications",
"Neural networks learn hierarchical feature representations",
"Natural language processing enables human-computer interaction",
"Reinforcement learning optimizes decision-making agents",
# ... add thousands more documents
]
print(f"✓ Prepared {len(documents)} documents for embedding")
# Step 3: Generate embeddings (batch processing for efficiency)
print("Generating embeddings...")
embeddings = embedding_model.encode(
documents,
batch_size=32, # Process 32 documents at once
show_progress_bar=True, # Visual progress feedback
convert_to_numpy=True, # Return as NumPy array
normalize_embeddings=True # L2 normalize for faster cosine similarity
)
print(f"✓ Generated embeddings: {embeddings.shape}")
# Output: (5000, 384) means 5000 documents, 384 dimensions each
# Step 4: Quality check - verify semantic similarity works
from sklearn.metrics.pairwise import cosine_similarity
doc1 = "Machine learning models learn from data"
doc2 = "ML algorithms are trained on datasets"
doc3 = "I enjoy eating chocolate cake"
emb1 = embedding_model.encode([doc1])[0]
emb2 = embedding_model.encode([doc2])[0]
emb3 = embedding_model.encode([doc3])[0]
sim_related = cosine_similarity([emb1], [emb2])[0][0]
sim_unrelated = cosine_similarity([emb1], [emb3])[0][0]
print(f"\n✓ Quality check:")
print(f" Similar docs similarity: {sim_related:.3f}") # Should be high (>0.7)
print(f" Different docs similarity: {sim_unrelated:.3f}") # Should be low (<0.3)
if sim_related > 0.7 and sim_unrelated < 0.3:
    print(" ✅ Embeddings are working correctly!")
else:
    print(" ⚠️ Warning: Embeddings may need adjustment")
Expected output:
Loading embedding model...
✓ Model loaded
✓ Embedding dimensions: 384
✓ Prepared 5000 documents for embedding
Generating embeddings...
100%|██████████| 157/157 [00:45<00:00, 3.47it/s]
✓ Generated embeddings: (5000, 384)
✓ Quality check:
Similar docs similarity: 0.847
Different docs similarity: 0.156
✅ Embeddings are working correctly!
Pro Tips for Embeddings
# Tip 1: Save embeddings to avoid recomputing
np.save("my_embeddings.npy", embeddings)
# Later: embeddings = np.load("my_embeddings.npy")
# Tip 2: Use GPU for faster processing
embedding_model = SentenceTransformer("all-mpnet-base-v2", device="cuda")
# Tip 3: Handle very long documents (>512 tokens)
long_embeddings = embedding_model.encode(
long_documents,
batch_size=16, # Reduce batch size for long docs
show_progress_bar=True
)
# Tip 4: Process in chunks for massive datasets
def embed_large_dataset(docs, chunk_size=10000):
    all_embeddings = []
    for i in range(0, len(docs), chunk_size):
        chunk = docs[i:i+chunk_size]
        chunk_emb = embedding_model.encode(chunk, show_progress_bar=True)
        all_embeddings.append(chunk_emb)
    return np.vstack(all_embeddings)
Stage 2: Dimensionality Reduction with UMAP
Goal: Reduce embedding dimensions while preserving cluster structure.
Why Reduce Dimensions?
The problem: Embeddings are high-dimensional (384-1024 dimensions)
- 📊 Impossible to visualize
- 🐌 Clustering algorithms slow on high dimensions
- 📏 Distance metrics less meaningful (“curse of dimensionality”)
- 💾 Memory intensive for large datasets
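The “curse of dimensionality” mentioned above is easy to see in a quick simulation (illustrative only, not part of the clustering pipeline): as dimensionality grows, the nearest and farthest neighbors of a random query become almost equally far away, so raw distances carry less information.

```python
# Sketch: distance concentration in high dimensions.
import numpy as np

rng = np.random.default_rng(42)
for dim in (5, 50, 500):
    points = rng.normal(size=(1000, dim))    # 1000 random "documents"
    query = rng.normal(size=dim)             # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    print(f"dim={dim:4d}  nearest/farthest ratio: {dists.min() / dists.max():.3f}")
```

The ratio creeps toward 1.0 as dimensions increase, which is exactly why clustering directly on 384-1024 dimensional embeddings is harder than on a 5-dimensional UMAP projection.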
The solution: UMAP (Uniform Manifold Approximation and Projection)
UMAP reduces dimensions while preserving both local neighborhoods and global structure.
UMAP Implementation
from umap import UMAP
import numpy as np
# Configure UMAP
print("Configuring UMAP for dimensionality reduction...")
umap_model = UMAP(
n_neighbors=15, # Balance local vs global structure
n_components=5, # Reduce to 5 dimensions for clustering
min_dist=0.0, # Allow tight clusters (0.0 = tightest)
metric='cosine', # Best for normalized embeddings
random_state=42, # Reproducibility
verbose=True # Show progress
)
# Apply dimensionality reduction
print(f"Reducing {embeddings.shape} embeddings to 5 dimensions...")
print("(This takes 5-15 minutes for large datasets)")
reduced_embeddings = umap_model.fit_transform(embeddings)
print(f"✓ Reduced to: {reduced_embeddings.shape}")
# Output: (5000, 5) - 5000 documents, 5 dimensions
# Verify dimensionality reduction quality
from sklearn.metrics import pairwise_distances
# Sample 1000 points for speed
sample_idx = np.random.choice(len(embeddings), 1000, replace=False)
orig_distances = pairwise_distances(embeddings[sample_idx], metric='cosine')
reduced_distances = pairwise_distances(reduced_embeddings[sample_idx], metric='euclidean')
# Calculate correlation (should be high)
correlation = np.corrcoef(orig_distances.flatten(), reduced_distances.flatten())[0,1]
print(f"✓ Distance preservation: {correlation:.3f}")
# Good: >0.7, Excellent: >0.8
Understanding UMAP Parameters
1. n_neighbors (default: 15)
Controls balance between local and global structure:
# Small values: Focus on local structure, more small clusters
umap_local = UMAP(n_neighbors=5, n_components=5)
# Use when: Want to find fine-grained clusters
# Medium values: Balanced (recommended)
umap_balanced = UMAP(n_neighbors=15, n_components=5) # ✅ Start here
# Use when: General purpose clustering
# Large values: Focus on global structure, fewer large clusters
umap_global = UMAP(n_neighbors=50, n_components=5)
# Use when: Want broad topic categories
2. n_components (default: 5)
Number of dimensions to reduce to:
# For clustering: 5-10 dimensions
umap_clustering = UMAP(n_components=5) # ✅ Recommended for clustering
# For 2D visualization
umap_viz = UMAP(n_components=2) # For scatter plots only
# For 3D visualization
umap_3d = UMAP(n_components=3) # For interactive 3D plots
3. min_dist (default: 0.0)
Controls cluster tightness:
# Tight clusters (recommended for clustering)
umap_tight = UMAP(min_dist=0.0) # ✅ For clustering
# Spread out (better for visualization)
umap_spread = UMAP(min_dist=0.3) # For visual exploration
4. metric (default: ‘euclidean’)
Distance metric to use:
# For normalized embeddings (recommended)
umap_cosine = UMAP(metric='cosine') # ✅ Best for text embeddings
# For non-normalized
umap_euclidean = UMAP(metric='euclidean')
# Other options
umap_manhattan = UMAP(metric='manhattan')
UMAP vs PCA Comparison
from sklearn.decomposition import PCA
# PCA: Linear, fast, but loses non-linear structure
pca_model = PCA(n_components=5, random_state=42)
pca_reduced = pca_model.fit_transform(embeddings)
# ⚠️ Only captures linear relationships
# ✅ Very fast (seconds vs minutes)
# ⚠️ May lose important cluster structure
# UMAP: Non-linear, slower, preserves structure
umap_model = UMAP(n_components=5, random_state=42)
umap_reduced = umap_model.fit_transform(embeddings)
# ✅ Preserves non-linear cluster topology
# ⚠️ Slower (minutes vs seconds)
# ✅ Better for clustering quality
# Recommendation: Use UMAP for production, PCA for quick prototyping
Stage 3: Clustering with HDBSCAN
Why HDBSCAN stands out:
1. Discovers topics automatically (no K to specify)
2. Handles both dense and sparse clusters well
3. Explicitly identifies outliers instead of forcing noise into clusters
4. Provides a hierarchical structure, so topic relationships are visible
5. Has robust parameters (less tuning, more stable results)
This is why BERTopic, the leading topic modeling framework, chose HDBSCAN as its default clustering algorithm.
Goal: Group similar documents into clusters and identify outliers.
Why HDBSCAN?
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is perfect for text clustering because it:
✅ No K required – Automatically finds optimal number of clusters
✅ Finds outliers – Explicitly identifies documents that don’t fit (cluster -1)
✅ Varying densities – Can find both large and small clusters
✅ Hierarchical – Shows topic relationships
✅ Deterministic – Same data gives the same clusters (HDBSCAN itself has no randomness; fix UMAP’s random_state for end-to-end reproducibility)
HDBSCAN Implementation
from hdbscan import HDBSCAN
import numpy as np
import pandas as pd
# Configure HDBSCAN
print("Configuring HDBSCAN for clustering...")
hdbscan_model = HDBSCAN(
min_cluster_size=15, # Minimum 15 documents per cluster
min_samples=10, # Conservative outlier detection
metric='euclidean', # Standard for reduced embeddings
cluster_selection_method='eom', # Excess of Mass (recommended)
prediction_data=True, # Enable soft clustering predictions
core_dist_n_jobs=-1 # Use all CPU cores
)
# Fit and predict clusters
print(f"Clustering {len(reduced_embeddings)} documents...")
print("(This takes 2-5 minutes)")
clusters = hdbscan_model.fit_predict(reduced_embeddings)
# Analyze results
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_outliers = list(clusters).count(-1)
outlier_pct = n_outliers / len(clusters) * 100
print(f"\n{'='*60}")
print("CLUSTERING RESULTS")
print(f"{'='*60}")
print(f"✓ Total documents: {len(clusters):,}")
print(f"✓ Clusters discovered: {n_clusters}")
print(f"✓ Outliers identified: {n_outliers:,} ({outlier_pct:.1f}%)")
# Cluster size distribution
cluster_sizes = pd.Series(clusters[clusters != -1]).value_counts().sort_values(ascending=False)
print(f"\n✓ Cluster size statistics:")
print(f" • Largest cluster: {cluster_sizes.max()} documents")
print(f" • Smallest cluster: {cluster_sizes.min()} documents")
print(f" • Average cluster size: {cluster_sizes.mean():.1f} documents")
print(f" • Median cluster size: {cluster_sizes.median():.0f} documents")
# Show size distribution
print(f"\n✓ Cluster size distribution:")
bins = [(15, 50), (50, 100), (100, 500), (500, float('inf'))]
for min_size, max_size in bins:
    count = sum((cluster_sizes >= min_size) & (cluster_sizes < max_size))
    label = f"{min_size}-{int(max_size)}" if max_size != float('inf') else f"{min_size}+"
    print(f" • {label} docs: {count} clusters")
Expected output:
Configuring HDBSCAN for clustering...
Clustering 5,000 documents...
(This takes 2-5 minutes)
============================================================
CLUSTERING RESULTS
============================================================
✓ Total documents: 5,000
✓ Clusters discovered: 47
✓ Outliers identified: 234 (4.7%)
✓ Cluster size statistics:
• Largest cluster: 456 documents
• Smallest cluster: 15 documents
• Average cluster size: 101.4 documents
• Median cluster size: 87 documents
✓ Cluster size distribution:
• 15-50 docs: 12 clusters
• 50-100 docs: 18 clusters
• 100-500 docs: 16 clusters
• 500+ docs: 1 clusters
Understanding HDBSCAN Parameters
1. min_cluster_size (Most important!)
Minimum number of documents to form a cluster:
# Small clusters (fine-grained topics)
hdbscan_fine = HDBSCAN(min_cluster_size=10)
# Result: Many small, specific clusters
# Use when: Want detailed topic granularity
# Medium clusters (balanced - recommended)
hdbscan_balanced = HDBSCAN(min_cluster_size=20) # ✅ Start here
# Result: Moderate number of interpretable clusters
# Use when: General purpose clustering
# Large clusters (broad topics)
hdbscan_broad = HDBSCAN(min_cluster_size=50)
# Result: Few large, general clusters
# Use when: Want high-level categories
2. min_samples (Controls outlier sensitivity)
How conservative to be about outliers:
# Lenient (fewer outliers, more inclusive)
hdbscan_inclusive = HDBSCAN(
    min_cluster_size=15,
    min_samples=1  # Lower = less conservative = fewer outliers
)
# Result: roughly ~5% outliers
# Use when: Want to cluster most documents
# Moderate (balanced)
hdbscan_balanced = HDBSCAN(
    min_cluster_size=15,
    min_samples=5  # ✅ Good default
)
# Result: roughly ~10% outliers
# Strict (more outliers, stricter)
hdbscan_strict = HDBSCAN(
    min_cluster_size=15,
    min_samples=10  # Higher = more conservative = more outliers
)
# Result: roughly ~20% outliers
# Use when: Want only very coherent clusters
3. cluster_selection_method
How to select clusters from hierarchy:
# Excess of Mass (recommended, default)
hdbscan_eom = HDBSCAN(cluster_selection_method='eom') # ✅
# Selects most stable clusters across hierarchy
# Result: Balanced cluster sizes
# Leaf (more clusters)
hdbscan_leaf = HDBSCAN(cluster_selection_method='leaf')
# Selects all leaf clusters in hierarchy
# Result: Many fine-grained clusters
Understanding Outliers (-1 Cluster)
The -1 cluster is special – it contains documents that don’t fit any cluster:
# Inspect outliers
outlier_mask = clusters == -1
outlier_docs = [documents[i] for i, is_outlier in enumerate(outlier_mask) if is_outlier]
print(f"Outlier examples ({len(outlier_docs)} total):")
for i, doc in enumerate(outlier_docs[:5]):
    print(f"{i+1}. {doc[:100]}...")
What outliers might be:
- 🔍 Legitimate edge cases – Rare but valid topics
- 🗑️ Noise – Spam, gibberish, low-quality text
- 📝 Multi-topic documents – Blending multiple themes
- ⚠️ Too short – Not enough content for meaningful embedding
- 🌐 Different language – If using single-language model
How to handle outliers:
# Strategy 1: Keep separate for manual review
outlier_docs = [doc for doc, c in zip(documents, clusters) if c == -1]
# Review these manually - may contain insights
# Strategy 2: Assign to nearest cluster
from scipy.spatial.distance import cdist
def assign_outliers(embeddings, clusters):
outlier_indices = np.where(clusters == -1)[0]
cluster_ids = set(clusters) - {-1}
# Calculate cluster centroids
centroids = {}
for cid in cluster_ids:
cluster_points = embeddings[clusters == cid]
centroids[cid] = cluster_points.mean(axis=0)
# Assign each outlier to nearest centroid
for idx in outlier_indices:
distances = {
cid: np.linalg.norm(embeddings[idx] - centroid)
for cid, centroid in centroids.items()
}
nearest = min(distances, key=distances.get)
# Only assign if within reasonable distance
if distances[nearest] < 0.5: # Threshold
clusters[idx] = nearest
return clusters
# Apply
clusters_with_assigned = assign_outliers(reduced_embeddings, clusters.copy())
# Strategy 3: Create "Miscellaneous" topic
# Keep as -1 but give it a descriptive label later
# Strategy 4: Adjust parameters to reduce outliers
hdbscan_less_outliers = HDBSCAN(
min_cluster_size=15,
min_samples=3 # More lenient
)
Complete 3-Stage Pipeline Example
"""
Complete text clustering pipeline in one place
From raw documents to cluster assignments
"""
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
import numpy as np
def cluster_documents(documents, save_path=None):
"""
Complete clustering pipeline
Args:
documents: List of text strings
save_path: Optional path to save results
Returns:
clusters: Array of cluster assignments
embeddings: Document embeddings
reduced: Reduced embeddings
"""
print(f"Starting clustering pipeline for {len(documents)} documents...\n")
# STAGE 1: EMBEDDING
print("[1/3] Generating embeddings...")
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(
documents,
batch_size=32,
show_progress_bar=True,
normalize_embeddings=True
)
print(f"✓ Embeddings: {embeddings.shape}\n")
# STAGE 2: DIMENSIONALITY REDUCTION
print("[2/3] Reducing dimensions with UMAP...")
umap_model = UMAP(
n_neighbors=15,
n_components=5,
min_dist=0.0,
metric='cosine',
random_state=42
)
reduced_embeddings = umap_model.fit_transform(embeddings)
print(f"✓ Reduced: {reduced_embeddings.shape}\n")
# STAGE 3: CLUSTERING
print("[3/3] Clustering with HDBSCAN...")
hdbscan_model = HDBSCAN(
min_cluster_size=15,
min_samples=10,
metric='euclidean',
cluster_selection_method='eom',
prediction_data=True
)
clusters = hdbscan_model.fit_predict(reduced_embeddings)
# Results
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_outliers = list(clusters).count(-1)
print(f"✓ Clustering complete!\n")
print(f"Results:")
print(f" • Clusters: {n_clusters}")
print(f" • Outliers: {n_outliers} ({n_outliers/len(clusters)*100:.1f}%)")
# Save if requested
if save_path:
np.save(f"{save_path}_embeddings.npy", embeddings)
np.save(f"{save_path}_reduced.npy", reduced_embeddings)
np.save(f"{save_path}_clusters.npy", clusters)
print(f"\n✓ Saved results to {save_path}_*.npy")
return clusters, embeddings, reduced_embeddings
# Usage
documents = [...] # Your documents
clusters, embeddings, reduced = cluster_documents(
documents,
save_path="my_clustering_results"
)
4. BERTopic Framework: Complete Guide
Now that we understand the 3-stage clustering pipeline, let’s explore BERTopic – the modular framework that extends clustering with powerful topic modeling capabilities.

Figure 2: BERTopic’s 6-component modular architecture
What is BERTopic?
BERTopic is a topic modeling technique that leverages transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
The key innovation: BERTopic takes our 3-stage clustering pipeline and adds three more components for topic extraction and refinement.
BERTopic’s 6-Component Architecture
Input: Documents
↓
1. Embedding Model (SBERT) → Dense vectors
↓
2. Dimensionality Reduction (UMAP) → 5D vectors
↓
3. Clustering (HDBSCAN) → Cluster IDs
↓
4. Tokenization (CountVectorizer) → Word frequencies per cluster
↓
5. Weighting (c-TF-IDF) → Important words per cluster
↓
6. Representation Model (Optional) → Refined topic labels
↓
Output: Topics with keywords
Components 1-3: Same as our pipeline (Embedding → UMAP → HDBSCAN)
Components 4-6: NEW! These extract meaningful topic keywords
Component 4: CountVectorizer (Per-Cluster Tokenization)
Instead of analyzing individual documents, BERTopic concatenates all documents in a cluster and treats each cluster as one “mega-document”:
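To make this concrete, here is a small sketch of the “mega-document” step (simplified and illustrative, not BERTopic’s internal code):

```python
# Sketch: merge each cluster's documents into one string, then count words
# at the cluster level rather than the document level.
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer

def cluster_word_counts(documents, clusters):
    merged = defaultdict(list)
    for doc, cluster_id in zip(documents, clusters):
        if cluster_id != -1:               # skip outliers
            merged[cluster_id].append(doc)
    mega_docs = {cid: " ".join(docs) for cid, docs in merged.items()}

    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(mega_docs.values())  # one row per cluster
    return counts, vectorizer.get_feature_names_out(), list(mega_docs.keys())
```

In the real pipeline, the CountVectorizer configured below is the component that performs this cluster-level tokenization.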
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(
ngram_range=(1, 2), # Both single words and two-word phrases
stop_words="english", # Remove common words (the, is, and, etc.)
min_df=5, # Ignore words appearing in < 5 documents
max_df=0.7 # Ignore words appearing in > 70% of documents
)
# Traditional: Each document analyzed separately
# BERTopic: All documents in cluster 0 → one big document
# Result: Cluster-level patterns emerge
Component 5: c-TF-IDF (The Secret Sauce!)
c-TF-IDF (class-based Term Frequency-Inverse Document Frequency) is what makes BERTopic topics so coherent.
Traditional TF-IDF:
# Finds important words in a DOCUMENT compared to corpus
TF-IDF(word, document) = TF(word, doc) × log(N / df(word))
c-TF-IDF:
# Finds important words in a CLUSTER compared to other clusters
c-TF-IDF(word, cluster) = TF(word, cluster) × log(n_clusters / cf(word))
Example in action:
Cluster 0 (Machine Learning papers):
Words: learning(500×), neural(300×), model(250×), network(200×)
Cluster 1 (NLP papers):
Words: language(400×), text(350×), nlp(200×), processing(180×)
c-TF-IDF identifies distinctive words:
Cluster 0: "neural", "deep", "training", "architecture"
Cluster 1: "language", "syntax", "semantic", "parsing"
Without c-TF-IDF, generic words like "learning" would dominate both!
Implementation:
from bertopic.vectorizers import ClassTfidfTransformer
ctfidf_model = ClassTfidfTransformer(
reduce_frequent_words=True # Further reduce weight of very common words
)
Component 6: Representation Models (Optional Refinement)
Refine topics beyond c-TF-IDF keywords:
Option A: KeyBERTInspired – Extract keyphrases
from bertopic.representation import KeyBERTInspired
representation_model = KeyBERTInspired()
# Before (c-TF-IDF): ["neural", "network", "deep", "learning"]
# After (KeyBERT): ["neural networks", "deep learning", "network architecture"]
Option B: Maximal Marginal Relevance – Balance relevance and diversity
from bertopic.representation import MaximalMarginalRelevance
representation_model = MaximalMarginalRelevance(diversity=0.3)
# Ensures keywords are relevant AND different from each other
# Bad: ["learning", "learner", "learned", "learns"] (too similar)
# Good: ["learning", "optimization", "regularization", "validation"] (diverse)Option C: LLM-based – Natural language labels
from bertopic.representation import OpenAI
prompt = """
I have a topic with keywords: [KEYWORDS]
Generate a 2-5 word topic label.
Only return the label.
"""
representation_model = OpenAI(
model="gpt-4",
prompt=prompt
)
# Keywords: ["neural", "deep", "learning", "network"]
# LLM Label: "Deep Neural Network Training"
Installing BERTopic
# Basic installation
pip install bertopic
# With all optional dependencies
pip install bertopic[all]
# Or install components individually
pip install bertopic sentence-transformers umap-learn hdbscan scikit-learn
Basic BERTopic Implementation
Simplest possible usage (3 lines):
from bertopic import BERTopic
# Initialize with defaults
topic_model = BERTopic(language="english", verbose=True)
# Fit and get topics
topics, probabilities = topic_model.fit_transform(documents)
# View results
print(topic_model.get_topic_info())
Output:
Topic Count Name
-1 234 -1_outlier_documents
0 1247 0_machine_learning_neural_deep
1 982 1_natural_language_processing_text
2 756 2_computer_vision_image_detection
3 543 3_reinforcement_learning_agent_reward
...
Advanced BERTopic Configuration
Full customization of all components:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
# Component 1: Embedding
embedding_model = SentenceTransformer("all-mpnet-base-v2")
# Component 2: UMAP
umap_model = UMAP(
n_neighbors=15,
n_components=5,
min_dist=0.0,
metric='cosine',
random_state=42
)
# Component 3: HDBSCAN
hdbscan_model = HDBSCAN(
min_cluster_size=15,
min_samples=10,
metric='euclidean',
cluster_selection_method='eom',
prediction_data=True
)
# Component 4: Vectorizer
vectorizer_model = CountVectorizer(
ngram_range=(1, 2),
stop_words="english",
min_df=5,
max_df=0.7
)
# Component 5: c-TF-IDF
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
# Component 6: Representation
representation_model = KeyBERTInspired()
# Create BERTopic with all custom components
topic_model = BERTopic(
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=hdbscan_model,
vectorizer_model=vectorizer_model,
ctfidf_model=ctfidf_model,
representation_model=representation_model,
top_n_words=10, # Keywords per topic
nr_topics="auto", # Auto topic reduction
calculate_probabilities=True, # Enable soft clustering
verbose=True
)
# Fit model
topics, probabilities = topic_model.fit_transform(documents)
print(f"✓ Discovered {len(set(topics)) - 1} topics")Working with BERTopic Results
# Get all topic information
topic_info = topic_model.get_topic_info()
print(topic_info.head())
# Get specific topic keywords
topic_0_keywords = topic_model.get_topic(0)
print("\nTopic 0 keywords:")
for word, score in topic_0_keywords[:10]:
print(f" {word:20s} {score:.4f}")
# Get documents in a topic
topic_0_docs = [documents[i] for i, t in enumerate(topics) if t == 0]
print(f"\nTopic 0 has {len(topic_0_docs)} documents")
# Search for topics
similar_topics, similarity = topic_model.find_topics("deep learning", top_n=5)
print(f"\nTopics related to 'deep learning': {similar_topics}")
# Get topic distribution for new document
new_doc = ["Latest advances in transformer models"]
new_topic, new_prob = topic_model.transform(new_doc)
print(f"\nNew document assigned to topic: {new_topic[0]}")5. Topic Modeling with LLMs
Topic modeling goes beyond clustering by assigning meaningful labels and descriptions to each cluster.

Figure 3: From keywords to topic labels using LLMs
What is Topic Modeling?
Topic modeling identifies abstract “topics” that occur in a collection of documents. While clustering groups documents, topic modeling names and describes those groups.
Clustering: “These 100 documents are similar”
Topic Modeling: “These documents are about ‘Neural Machine Translation'”
Traditional vs LLM-Based Topic Modeling
Traditional (LDA):
- Topics are probability distributions over words
- Output:
[("neural", 0.05), ("network", 0.04), ("deep", 0.03), ...] - Hard to interpret
- No semantic understanding
LLM-Based (BERTopic):
- Topics are coherent keyword clusters
- Output:
["neural networks", "deep learning", "training"] - Plus optional LLM-generated label: “Deep Neural Network Training”
- Semantically meaningful
Generating Topic Labels with LLMs
The most powerful feature: using GPT-4 or Claude to create human-readable topic labels.
Method 1: BERTopic Built-in LLM Support
from bertopic.representation import OpenAI
from bertopic import BERTopic
# Configure LLM labeling
prompt = """
I have a topic described by these keywords: [KEYWORDS]
Based on these keywords, generate a concise, descriptive topic label (2-6 words).
The label should capture the main theme.
Only return the label, nothing else.
Keywords: [KEYWORDS]
Label:"""
llm_model = OpenAI(
model="gpt-4",
chat=True,
prompt=prompt,
exponential_backoff=True
)
# Create BERTopic with LLM labeling
topic_model = BERTopic(
representation_model=llm_model,
verbose=True
)
topics, probs = topic_model.fit_transform(documents)
# View generated labels
topic_info = topic_model.get_topic_info()
print(topic_info[['Topic', 'Count', 'Name']])
Output:
Topic Count Name
-1 234 -1_outlier_documents
0 1247 Deep Neural Network Training
1 982 Natural Language Processing
2 756 Computer Vision and Object Detection
3 543 Reinforcement Learning Algorithms
Method 2: Post-hoc Labeling with Claude
import anthropic
def generate_topic_label_claude(keywords, sample_docs):
"""Generate topic label using Claude"""
client = anthropic.Anthropic(api_key="your-api-key")
keywords_str = ", ".join([word for word, _ in keywords[:10]])
docs_str = "\n".join([f"• {doc[:150]}" for doc in sample_docs[:3]])
prompt = f"""<documents>
Keywords that characterize this topic: {keywords_str}
Sample documents from this cluster:
{docs_str}
</documents>
Based on the keywords and sample documents, generate a concise, descriptive label (2-6 words) that captures the main theme of this topic.
Respond with ONLY the label."""
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=50,
temperature=0.3,
messages=[{"role": "user", "content": prompt}]
)
return message.content[0].text.strip()
# Generate labels for all topics
for topic_id in range(len(set(topics)) - 1):
    keywords = topic_model.get_topic(topic_id)
    topic_docs = [documents[i] for i, t in enumerate(topics) if t == topic_id][:5]
    label = generate_topic_label_claude(keywords, topic_docs)
    print(f"Topic {topic_id}: {label}")
6. Real-World Implementation: ArXiv Dataset Example
Let’s implement a complete, production-ready clustering system using the ArXiv NLP papers dataset.
Dataset Overview
ArXiv NLP Dataset:
- Documents: 44,949 research paper abstracts
- Domain: Computation and Language (cs.CL)
- Years: 1991-2024
- Source: Hugging Face (maartengr/arxiv_nlp)
Complete Implementation
"""
ArXiv NLP Paper Clustering
Complete pipeline from data loading to visualization
"""
# Step 1: Load Data
from datasets import load_dataset
print("Loading ArXiv NLP dataset...")
dataset = load_dataset("maartengr/arxiv_nlp")["train"]
abstracts = dataset["Abstracts"]
titles = dataset["Titles"]
print(f"✓ Loaded {len(abstracts)} papers")
# Step 2: Generate Embeddings
from sentence_transformers import SentenceTransformer
print("\n[1/3] Generating embeddings...")
embedding_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedding_model.encode(
abstracts,
batch_size=32,
show_progress_bar=True,
normalize_embeddings=True
)
print(f"✓ Generated {embeddings.shape} embeddings")
# Step 3: Create BERTopic Model
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
print("\n[2/3] Configuring BERTopic...")
umap_model = UMAP(
n_neighbors=15,
n_components=5,
min_dist=0.0,
metric='cosine',
random_state=42
)
hdbscan_model = HDBSCAN(
min_cluster_size=15,
min_samples=10,
metric='euclidean',
cluster_selection_method='eom'
)
vectorizer_model = CountVectorizer(
ngram_range=(1, 2),
stop_words="english",
min_df=5
)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model = BERTopic(
embedding_model=embedding_model,
umap_model=umap_model,
hdbscan_model=hdbscan_model,
vectorizer_model=vectorizer_model,
ctfidf_model=ctfidf_model,
top_n_words=10,
verbose=True
)
# Step 4: Fit Model
print("\n[3/3] Clustering and extracting topics...")
topics, probabilities = topic_model.fit_transform(abstracts, embeddings)
n_topics = len(set(topics)) - 1
n_outliers = list(topics).count(-1)
print(f"\n{'='*60}")
print("RESULTS")
print(f"{'='*60}")
print(f"✓ Discovered {n_topics} topics")
print(f"✓ Outliers: {n_outliers} ({n_outliers/len(topics)*100:.1f}%)")
# Step 5: Inspect Topics
topic_info = topic_model.get_topic_info()
print(f"\nTop 10 Topics:")
print(topic_info.head(11)[['Topic', 'Count', 'Name']])
# Step 6: Detailed Topic Inspection
print(f"\n{'='*60}")
print("TOPIC DETAILS")
print(f"{'='*60}")
for i in range(min(5, n_topics)):
    topic_id = topic_info.iloc[i+1]['Topic']   # Skip -1
    topic_size = topic_info.iloc[i+1]['Count']
    print(f"\nTOPIC {topic_id} ({topic_size} papers)")
    print("-" * 60)
    # Keywords
    topic_words = topic_model.get_topic(topic_id)
    print("Keywords:")
    for word, score in topic_words[:8]:
        print(f" {word:25s} {score:.4f}")
    # Sample papers
    topic_papers = [(idx, titles[idx]) for idx, t in enumerate(topics) if t == topic_id]
    print(f"\nSample papers:")
    for j, (idx, title) in enumerate(topic_papers[:3]):
        print(f" {j+1}. {title}")
# Step 7: Save Model
print(f"\n{'='*60}")
print("SAVING")
print(f"{'='*60}")
topic_model.save("arxiv_bertopic_model")
print("✓ Model saved to arxiv_bertopic_model/")
# Step 8: Visualizations
print("\nGenerating visualizations...")
# Topic map
fig1 = topic_model.visualize_topics()
fig1.write_html("arxiv_topics_map.html")
print("✓ Saved: arxiv_topics_map.html")
# Bar chart
fig2 = topic_model.visualize_barchart(top_n_topics=20, n_words=8)
fig2.write_html("arxiv_barchart.html")
print("✓ Saved: arxiv_barchart.html")
# Hierarchy
fig3 = topic_model.visualize_hierarchy()
fig3.write_html("arxiv_hierarchy.html")
print("✓ Saved: arxiv_hierarchy.html")
print(f"\n{'='*60}")
print("✓ COMPLETE!")
print(f"{'='*60}")Expected Output:
Loading ArXiv NLP dataset...
✓ Loaded 44,949 papers
[1/3] Generating embeddings...
100%|██████████| 1405/1405 [08:32<00:00, 2.74it/s]
✓ Generated (44949, 768) embeddings
[2/3] Configuring BERTopic...
[3/3] Clustering and extracting topics...
============================================================
RESULTS
============================================================
✓ Discovered 127 topics
✓ Outliers: 2,341 (5.2%)
Top 10 Topics:
Topic Count Name
-1 2341 -1_outliers
0 3456 0_neural_machine_translation
1 2234 1_question_answering_reading
2 1876 2_sentiment_analysis_opinion
3 1654 3_named_entity_recognition
4 1432 4_speech_recognition_acoustic
...
============================================================
TOPIC DETAILS
============================================================
TOPIC 0 (3456 papers)
------------------------------------------------------------
Keywords:
neural 0.0234
machine 0.0198
translation 0.0187
nmt 0.0156
encoder 0.0142
decoder 0.0138
attention 0.0121
transformer 0.0109
Sample papers:
1. Attention Is All You Need
2. Neural Machine Translation by Jointly Learning to Align and Translate
3. Effective Approaches to Attention-based Neural Machine Translation
[... continues for topics 1-4 ...]
============================================================
SAVING
============================================================
✓ Model saved to arxiv_bertopic_model/
Generating visualizations...
✓ Saved: arxiv_topics_map.html
✓ Saved: arxiv_barchart.html
✓ Saved: arxiv_hierarchy.html
============================================================
✓ COMPLETE!
============================================================
Results Analysis
# Calculate quality metrics
import numpy as np
from sklearn.metrics import silhouette_score
topics_array = np.array(topics)   # fit_transform returns a list
mask = topics_array != -1
silhouette = silhouette_score(
    embeddings[mask][:10000],     # Sample for speed
    topics_array[mask][:10000],
    metric='cosine'
)
print(f"Clustering Quality:")
print(f" • Silhouette Score: {silhouette:.4f}")
print(f" {'Excellent' if silhouette > 0.7 else 'Good' if silhouette > 0.5 else 'Moderate'}")
# Find most interesting topics
import pandas as pd
topic_sizes = pd.Series(topics_array[mask]).value_counts()
print(f"\n • Largest cluster: {topic_sizes.max()} papers")
print(f" • Smallest cluster: {topic_sizes.min()} papers")
print(f" • Average size: {topic_sizes.mean():.1f} papers")
# Niche but significant topics
niche_topics = topic_info[
(topic_info['Count'] > 50) &
(topic_info['Count'] < 200)
]
print(f"\nNiche Topics (50-200 papers):")
for idx, row in niche_topics.head(5).iterrows():
    if row['Topic'] == -1:
        continue
    print(f" • Topic {row['Topic']}: {row['Name']} ({row['Count']} papers)")
8. Production Deployment Best Practices
Model Selection Guidelines
Decision Matrix:
| Dataset Size | Speed Priority | Quality Priority | Recommended Setup |
|---|---|---|---|
| < 1K docs | High | Medium | MiniLM + UMAP + K-Means |
| 1K-10K | Medium | High | mpnet + UMAP + HDBSCAN |
| 10K-100K | High | High | mpnet + UMAP + HDBSCAN + batching |
| 100K+ | High | Medium | MiniLM + PCA + MiniBatchKMeans |
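As an example of the last row, a lighter stack for very large corpora might look like the sketch below (the 50 PCA components and fixed cluster count are illustrative choices, and unlike HDBSCAN, MiniBatchKMeans requires you to pick K):

```python
# Sketch: MiniLM + PCA + MiniBatchKMeans for 100K+ documents.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

def cluster_large_corpus(documents, n_clusters=100):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(
        documents, batch_size=128,
        show_progress_bar=True, normalize_embeddings=True
    )
    reduced = PCA(n_components=50, random_state=42).fit_transform(embeddings)
    labels = MiniBatchKMeans(
        n_clusters=n_clusters, batch_size=4096, random_state=42
    ).fit_predict(reduced)
    return labels
```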
Performance Optimization
# 1. Use GPU acceleration
embedding_model = SentenceTransformer("all-mpnet-base-v2", device="cuda")
# 2. Encode in larger batches with normalized embeddings
embeddings = embedding_model.encode(
documents,
batch_size=64,
convert_to_numpy=True,
show_progress_bar=True,
normalize_embeddings=True # Faster cosine similarity
)
# 3. Cache embeddings
import joblib
# Save
joblib.dump(embeddings, "embeddings.pkl")
# Load
embeddings = joblib.load("embeddings.pkl")
# 4. Use approximate nearest neighbors for large datasets
from annoy import AnnoyIndex
def build_annoy_index(embeddings, n_trees=10):
    dimension = embeddings.shape[1]
    index = AnnoyIndex(dimension, 'angular')
    for i, emb in enumerate(embeddings):
        index.add_item(i, emb)
    index.build(n_trees)
    return index
Monitoring and Evaluation
import logging
from datetime import datetime
class ClusteringMonitor:
"""Monitor clustering performance in production"""
def __init__(self, log_file="clustering_metrics.log"):
logging.basicConfig(
filename=log_file,
level=logging.INFO,
format='%(asctime)s - %(message)s'
)
self.logger = logging.getLogger(__name__)
def log_clustering_run(self, n_docs, n_clusters, n_outliers,
silhouette, runtime):
"""Log clustering metrics"""
metrics = {
'timestamp': datetime.now().isoformat(),
'n_documents': n_docs,
'n_clusters': n_clusters,
'n_outliers': n_outliers,
'outlier_pct': n_outliers / n_docs * 100,
'silhouette_score': silhouette,
'runtime_seconds': runtime
}
self.logger.info(f"Clustering run: {metrics}")
# Alert if quality drops
if silhouette < 0.3:
self.logger.warning(f"Low silhouette score: {silhouette}")
if n_outliers / n_docs > 0.2:
self.logger.warning(f"High outlier rate: {n_outliers/n_docs*100:.1f}%")
def log_topic_update(self, topic_id, old_keywords, new_keywords):
"""Log topic changes"""
added = set(new_keywords) - set(old_keywords)
removed = set(old_keywords) - set(new_keywords)
if added or removed:
self.logger.info(
f"Topic {topic_id} changed - Added: {added}, Removed: {removed}"
)
# Usage
monitor = ClusteringMonitor()
import time
start = time.time()
# Run clustering
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(documents)
runtime = time.time() - start
# Calculate metrics
from sklearn.metrics import silhouette_score
n_outliers = list(topics).count(-1)
topics_array = np.array(topics)   # fit_transform returns a list
mask = topics_array != -1
# embeddings must be computed separately (e.g., with the embedding model used above)
silhouette = silhouette_score(embeddings[mask], topics_array[mask], metric='cosine')
# Log
monitor.log_clustering_run(
n_docs=len(documents),
n_clusters=len(set(topics)) - 1,
n_outliers=n_outliers,
silhouette=silhouette,
runtime=runtime
)
9. Common Challenges and Solutions
Challenge 1: Too Many Small Clusters
Problem: HDBSCAN creates 100+ tiny clusters instead of meaningful groups
Symptoms:
- Many clusters with 10-20 documents
- Fragmented topics
- Hard to interpret
Solutions:
# Solution 1: Increase min_cluster_size
hdbscan_model = HDBSCAN(
min_cluster_size=50, # Increase from 15
min_samples=10
)
# Solution 2: Use topic reduction in BERTopic
topic_model = BERTopic(
hdbscan_model=hdbscan_model,
nr_topics=20 # Reduce to 20 topics
)
# Solution 3: Post-hoc merging
topic_model.reduce_topics(documents, topics, nr_topics=20)
# Solution 4: Adjust UMAP parameters for less granularity
umap_model = UMAP(
n_neighbors=30, # Increase (was 15)
n_components=5,
min_dist=0.1 # Increase (was 0.0)
)
Challenge 2: Poor Topic Coherence
Problem: Topics contain unrelated or nonsensical keywords
Solutions:
# Solution 1: Better preprocessing
import re

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove emails
    text = re.sub(r'\S+@\S+', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text.lower()

documents_clean = [preprocess_text(doc) for doc in documents]
# Solution 2: Better stopword handling
vectorizer_model = CountVectorizer(
ngram_range=(1, 2),
stop_words="english",
min_df=10, # Increase (ignore very rare words)
max_df=0.5 # Decrease (ignore very common words)
)
# Solution 3: Use representation models
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
representation_models = [
KeyBERTInspired(),
MaximalMarginalRelevance(diversity=0.3)
]
topic_model = BERTopic(
representation_model=representation_models
)
# Solution 4: Manual topic refinement
# Merge similar topics
topics_to_merge = [[1, 5], [3, 7], [9, 12]]
topic_model.merge_topics(documents, topics, topics_to_merge)
Challenge 3: High Outlier Rate (>20%)
Problem: Too many documents classified as outliers (-1 cluster)
Solutions:
# Solution 1: Reduce min_samples
hdbscan_model = HDBSCAN(
min_cluster_size=15,
min_samples=5, # Decrease (was 10)
metric='euclidean'
)
# Solution 2: Try different distance metric
hdbscan_model = HDBSCAN(
min_cluster_size=15,
metric='manhattan', # Try instead of euclidean
cluster_selection_method='eom'
)
# Solution 3: Reduce dimensionality less aggressively
umap_model = UMAP(
n_neighbors=15,
n_components=10, # Increase (was 5)
min_dist=0.0
)
# Solution 4: Assign outliers to nearest cluster
def assign_outliers_to_nearest_cluster(embeddings, clusters):
    from scipy.spatial.distance import cdist
    outlier_mask = clusters == -1
    outlier_indices = np.where(outlier_mask)[0]
    # Get cluster centroids
    cluster_ids = set(clusters) - {-1}
    centroids = {}
    for cluster_id in cluster_ids:
        cluster_points = embeddings[clusters == cluster_id]
        centroids[cluster_id] = cluster_points.mean(axis=0)
    # Assign each outlier to nearest centroid
    for idx in outlier_indices:
        distances = {
            cid: np.linalg.norm(embeddings[idx] - centroid)
            for cid, centroid in centroids.items()
        }
        nearest_cluster = min(distances, key=distances.get)
        clusters[idx] = nearest_cluster
    return clusters
# Apply
clusters_fixed = assign_outliers_to_nearest_cluster(embeddings, clusters.copy())
Challenge 4: Slow Performance
Problem: Clustering takes too long on large datasets
Solutions:
# Solution 1: Use smaller embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2") # Fast, 80MB
# Solution 2: Sample large datasets
sample_size = 10000
sample_indices = np.random.choice(len(documents), sample_size, replace=False)
sample_docs = [documents[i] for i in sample_indices]
# Cluster sample
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(sample_docs)
# Predict on full dataset
all_topics, all_probs = topic_model.transform(documents)
# Solution 3: Use approximate UMAP
umap_model = UMAP(
n_neighbors=15,
n_components=5,
metric='cosine',
low_memory=True, # Use less memory, slightly slower
random_state=42
)
# Solution 4: Parallel processing
from multiprocessing import Pool
def process_batch(batch):
    return embedding_model.encode(batch)
# Split into batches
batch_size = 1000
batches = [documents[i:i+batch_size] for i in range(0, len(documents), batch_size)]
# Process in parallel
with Pool(processes=4) as pool:
    batch_embeddings = pool.map(process_batch, batches)
embeddings = np.vstack(batch_embeddings)
10. Comparison: BERTopic vs Alternatives
vs Traditional LDA
| Aspect | LDA | BERTopic |
|---|---|---|
| Input | Bag-of-words | Embeddings |
| Context | ❌ None | ✅ Contextual |
| # Topics | ⚠️ Must specify | ✅ Auto-discovers |
| Outliers | ❌ Forces assignment | ✅ Explicit -1 cluster |
| Short text | ❌ Poor | ✅ Excellent |
| Speed | ✅ Fast | ⚠️ Slower |
| Interpretability | ⚠️ Probability distributions | ✅ Clear keywords |
| Reproducibility | ⚠️ Varies | ✅ Deterministic (with seed) |
vs Top2Vec
| Aspect | Top2Vec | BERTopic |
|---|---|---|
| Architecture | Doc2Vec + UMAP + HDBSCAN | SBERT + UMAP + HDBSCAN |
| Modularity | ❌ Fixed pipeline | ✅ Fully modular |
| Customization | ⚠️ Limited | ✅ Extensive |
| Topic refinement | ❌ Basic | ✅ Multiple representation models |
| Online learning | ✅ Yes | ✅ Yes |
| Hierarchical | ❌ No | ✅ Yes |
| Dynamic modeling | ❌ No | ✅ Yes |
vs CTM (Contextualized Topic Models)
| Aspect | CTM | BERTopic |
|---|---|---|
| Base model | BERT + Neural Variational | SBERT + HDBSCAN |
| Complexity | ⚠️ High | ✅ Medium |
| Training time | ⚠️ Slow | ✅ Fast |
| Stability | ⚠️ Requires tuning | ✅ Stable defaults |
| Zero-shot | ❌ Needs training | ✅ Immediate |
| Documentation | ⚠️ Limited | ✅ Extensive |
| Community | ⚠️ Small | ✅ Large |
When to Use What
Use LDA when:
- You need very fast processing
- Working with large, clean corpora
- Interpretable probability distributions are important
- Limited computational resources
Use Top2Vec when:
- You want Doc2Vec embeddings specifically
- Simpler API preferred
- Don’t need customization
Use CTM when:
- Academic research context
- Need probabilistic framework
- Have computational resources for training
Use BERTopic when:
- Need production-ready solution ✅
- Want modular, customizable pipeline ✅
- Working with diverse text types ✅
- Need hierarchical topics ✅
- Want dynamic/online modeling ✅
- Require extensive documentation ✅
11. Tools and Resources
Python Libraries
Core Libraries:
# Essential
pip install bertopic sentence-transformers umap-learn hdbscan
# Visualization
pip install plotly datamapplot
# Optional enhancements
pip install spacy
python -m spacy download en_core_web_sm
# For LLM labeling
pip install openai anthropic
Alternative Libraries:
# Traditional topic modeling
pip install gensim # For LDA
pip install scikit-learn # For NMF, LSA
# Other embedding models
pip install transformers torch
# Approximate nearest neighbors
pip install annoy faiss-cpu
Embedding Models
General Purpose (Recommended):
- all-MiniLM-L6-v2 – Fast, 384-dim, 80MB
- all-mpnet-base-v2 – High quality, 768-dim, 420MB
- stella-en-400M-v5 – State-of-the-art, 1024-dim, 1.6GB
Domain-Specific:
- allenai/specter – Scientific papers
- biobert-base-cased – Biomedical text
- finbert – Financial documents
- legal-bert-base-uncased – Legal documents
Multilingual:
- paraphrase-multilingual-MiniLM-L12-v2 – 50+ languages
- distiluse-base-multilingual-cased-v2 – Fast multilingual
Find more: MTEB Leaderboard
Datasets for Practice
- 20 Newsgroups – Classic text classification
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all')['data']
- ArXiv Papers – Academic abstracts
from datasets import load_dataset
dataset = load_dataset("maartengr/arxiv_nlp")- BBC News – News articles
# Download from: http://mlg.ucd.ie/datasets/bbc.html
- Amazon Reviews – Product reviews
from datasets import load_dataset
dataset = load_dataset("amazon_polarity")- Twitter Sentiment – Short texts
from datasets import load_dataset
dataset = load_dataset("tweet_eval", "sentiment")Documentation & Tutorials
Books:
- Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst
12. Frequently Asked Questions
Q1: What’s the difference between text clustering and topic modeling?
A: Text clustering groups similar documents together based on semantic meaning, while topic modeling labels and describes those groups with keywords or phrases.
Think of it this way:
- Clustering: “These 100 documents are similar” (grouping)
- Topic modeling: “These documents are about ‘neural networks and deep learning'” (labeling)
In practice, topic modeling often follows clustering. BERTopic combines both: it clusters documents (Stage 1-3) then extracts topics (Stage 4-6).
Q2: Why use BERTopic instead of traditional LDA for topic modeling?
A: BERTopic offers several advantages:
- Contextual understanding: Uses BERT embeddings that understand “bank” (river) vs “bank” (financial)
- No K specification: Discovers optimal number of topics automatically
- Better outlier handling: HDBSCAN explicitly identifies outliers (-1 cluster)
- Short text performance: Works well with tweets, reviews (LDA struggles)
- Modularity: Swap components (embedding model, clustering algorithm, etc.)
- Topic coherence: Generally produces more interpretable topics
When to use LDA:
- Very large datasets (millions of documents)
- Extremely limited compute resources
- Need probabilistic topic distributions
- Academic research requiring traditional methods
Q3: What does the -1 cluster in HDBSCAN represent?
A: The -1 cluster represents outliers – documents that don’t fit well into any cluster. These are data points too far from dense regions to be assigned to a cluster.
Common outlier types:
- 🔍 Legitimate edge cases (rare, unique topics)
- 🗑️ Noise, spam, or low-quality text
- 📝 Multi-topic documents blending themes
- ⚠️ Very short documents lacking context
Handling strategies:
- Keep separate: Review manually, may contain insights
- Assign to nearest: Use distance to cluster centroids
- Adjust parameters: Reduce min_samples to be less strict
- Accept as normal: 5-10% outliers is typical and healthy
Q4: Which embedding model should I use for text clustering?
A: Choose based on your priorities:
Speed priority:
- all-MiniLM-L6-v2 (384-dim, 80MB, ~1000 docs/sec)
- Best for: Prototyping, large datasets, real-time systems
Quality priority:
all-mpnet-base-v2(768-dim, 420MB, ~400 docs/sec)- Best for: Production systems, critical applications
Bleeding edge:
stella-en-400M-v5(1024-dim, 1.6GB, ~200 docs/sec)- Best for: Research, maximum accuracy
Domain-specific:
- Scientific papers:
allenai/specter - Medical:
biobert-base-cased - Financial:
finbert - Legal:
legal-bert-base-uncased
Always test on a sample of your data! Embeddings that work well for news articles may not be optimal for tweets.
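One quick way to run that test is to cluster a sample with each candidate model and compare a quality metric such as the silhouette score. A rough sketch, assuming sample_docs holds a few hundred documents from your corpus and that silhouette on the reduced embeddings is an acceptable proxy for cluster quality:
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.metrics import silhouette_score

candidates = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]

for name in candidates:
    # Embed, reduce, and cluster the same sample with each candidate model
    embeddings = SentenceTransformer(name).encode(sample_docs, show_progress_bar=False)
    reduced = UMAP(n_components=5, random_state=42).fit_transform(embeddings)
    labels = HDBSCAN(min_cluster_size=15).fit_predict(reduced)

    mask = labels != -1  # silhouette is undefined for outliers
    if mask.sum() > 0 and len(set(labels[mask])) > 1:
        score = silhouette_score(reduced[mask], labels[mask])
        print(f"{name}: silhouette={score:.3f}, clusters={len(set(labels[mask]))}")
    else:
        print(f"{name}: too few clusters to score")
Combine the metric with a manual look at the top topics per model – numbers alone rarely tell the whole story.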
Q5: Can BERTopic handle documents in multiple languages?
A: Yes! Use a multilingual embedding model:
from sentence_transformers import SentenceTransformer
# Supports 50+ languages
multilingual_model = SentenceTransformer(
"paraphrase-multilingual-MiniLM-L12-v2"
)
topic_model = BERTopic(
embedding_model=multilingual_model,
language="multilingual"
)
# Works with mixed-language documents
docs = [
"Machine learning is powerful", # English
"El aprendizaje automático es poderoso", # Spanish
"機械学習は強力です" # Japanese
]
topics, probs = topic_model.fit_transform(docs)
Important notes:
- All documents embedded in same semantic space
- Similar concepts cluster together regardless of language
- Topic keywords may include multiple languages
- For best single-language results, use language-specific models
Q6: How do I choose the right number of clusters?
A: Great news – you don’t have to! HDBSCAN automatically discovers the optimal number based on data density.
What HDBSCAN does:
- Finds dense regions in embedding space
- Groups documents in dense regions into clusters
- Marks isolated documents as outliers (-1)
- Number of clusters emerges naturally from data
If you want control:
# More clusters (smaller groups)
hdbscan_model = HDBSCAN(
min_cluster_size=10, # Smaller minimum
min_samples=5
)
# Fewer clusters (larger groups)
hdbscan_model = HDBSCAN(
min_cluster_size=50, # Larger minimum
min_samples=10
)
# Or reduce topics post-hoc
topic_model.reduce_topics(documents, nr_topics=20)
Rule of thumb:
- 1,000 docs → expect 10-30 clusters
- 10,000 docs → expect 50-150 clusters
- 100,000 docs → expect 100-300 clusters
Q7: How does c-TF-IDF differ from regular TF-IDF?
A: The key difference is the level of analysis:
Traditional TF-IDF:
- Analyzes individual documents
- Finds important words per document
- Formula: TF(word, doc) × IDF(word, corpus)
c-TF-IDF (class-based):
- Analyzes clusters (treats each cluster as one “mega-document”)
- Finds important words per cluster
- Formula: TF(word, cluster) × IDF(word, all_clusters)
Example:
# Cluster 0: ML papers
# Contains words: learning (500×), neural (300×), model (250×)
# Cluster 1: NLP papers
# Contains words: language (400×), text (350×), nlp (200×)
# c-TF-IDF identifies:
# Cluster 0 distinctive words: "neural", "deep", "network"
# Cluster 1 distinctive words: "language", "syntax", "semantic"
# Regular TF-IDF would miss cluster-level patterns!
Why it matters: c-TF-IDF produces more coherent, interpretable topics because it considers the cluster context.
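To make the “cluster as one mega-document” idea concrete, here is a simplified sketch of class-based TF-IDF built on scikit-learn. BERTopic’s actual c-TF-IDF additionally normalizes term frequencies and smooths the IDF term, so treat this as an illustration of the idea rather than the exact formula:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def class_tf_idf(documents, labels, top_n=5):
    """Simplified c-TF-IDF: concatenate each cluster into one 'mega-document'."""
    clusters = sorted(set(labels) - {-1})
    mega_docs = [" ".join(d for d, l in zip(documents, labels) if l == c) for c in clusters]

    # Term frequencies per cluster
    vectorizer = CountVectorizer(stop_words="english")
    tf = vectorizer.fit_transform(mega_docs).toarray().astype(float)
    words = vectorizer.get_feature_names_out()

    # IDF across clusters: words appearing in many clusters are down-weighted
    idf = np.log(len(clusters) / (1 + (tf > 0).sum(axis=0)))
    scores = tf * idf

    # Top distinctive words per cluster
    return {c: [words[i] for i in scores[k].argsort()[::-1][:top_n]]
            for k, c in enumerate(clusters)}
In practice you would read the per-topic keywords from topic_model.get_topic() rather than computing this yourself; the sketch only shows where the cluster-level IDF comes from.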
Q8: Is BERTopic suitable for short texts like tweets?
A: Yes! BERTopic handles short texts much better than traditional methods like LDA.
Why it works:
- Embeddings capture semantics even from few words
- Contextual understanding helps with abbreviations
- Handles informal language, emojis, hashtags
Best practices for short texts:
# 1. Use smaller min_cluster_size
hdbscan_model = HDBSCAN(
min_cluster_size=10, # Lower for short texts
min_samples=5
)
# 2. Keep bigrams
vectorizer_model = CountVectorizer(
ngram_range=(1, 2), # Unigrams and bigrams
stop_words="english"
)
# 3. Consider shorter keywords
topic_model = BERTopic(
top_n_words=5, # Fewer keywords for short texts
hdbscan_model=hdbscan_model,
vectorizer_model=vectorizer_model
)
Minimum text length: Generally works well with 5-10 words. Below that, consider:
- Combining related short texts
- Using bigrams/trigrams
- Specialized short-text embeddings
Q9: How do I handle imbalanced datasets where some topics dominate?
A: Several strategies:
# Strategy 1: Cap the number of topics with nr_topics (merges smaller topics)
topic_model = BERTopic(nr_topics=50)  # Reduce to at most 50 topics after fitting
# Strategy 2: Adjust min_cluster_size dynamically
# Smaller clusters for minority topics
hdbscan_model = HDBSCAN(
min_cluster_size=15,
cluster_selection_epsilon=0.5 # Merge similar clusters
)
# Strategy 3: Oversample minority topics
import pandas as pd
from sklearn.utils import resample
# Identify small clusters (skip the -1 outlier cluster)
cluster_counts = pd.Series(topics).value_counts()
small_clusters = [c for c in cluster_counts[cluster_counts < 100].index if c != -1]
# Start from the original corpus, then add extra copies of minority-cluster documents
oversampled_docs = list(documents)
oversampled_topics = list(topics)
for cluster_id in small_clusters:
    cluster_docs = [documents[i] for i, t in enumerate(topics) if t == cluster_id]
    # Oversample this cluster up to 100 documents
    resampled = resample(
        cluster_docs,
        n_samples=100,
        replace=True,
        random_state=42
    )
    oversampled_docs.extend(resampled)
    oversampled_topics.extend([cluster_id] * 100)
# Recompute topic representations using the rebalanced corpus
topic_model.update_topics(oversampled_docs, topics=oversampled_topics)
Q10: Can I update topics with new documents without retraining?
A: Yes! BERTopic supports incremental updates:
# Initial training
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(initial_documents)
# New documents arrive
new_documents = ["Latest AI research paper", "Novel NLP technique", ...]
# Option 1: Assign to existing topics (fast)
new_topics, new_probs = topic_model.transform(new_documents)
# Option 2: Update model with new documents (slower, more accurate)
all_documents = initial_documents + new_documents
all_topics = list(topics) + list(new_topics)
# Assumes updated_vectorizer is a (re)configured CountVectorizer for the combined corpus
topic_model.update_topics(
docs=all_documents,
topics=all_topics,
vectorizer_model=updated_vectorizer
)
# Topics now reflect both old and new documents
When to use each:
- transform(): Daily/weekly updates, streaming data
- update_topics(): Monthly/quarterly, significant new content
13. Conclusion
Text clustering with LLMs represents a paradigm shift from keyword-matching to semantic understanding. By combining powerful embedding models (SBERT), effective dimensionality reduction (UMAP), and density-based clustering (HDBSCAN), we can automatically discover meaningful patterns in unstructured text at scale.
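As a compact recap, here is a minimal sketch wiring those three stages together by hand (model names and parameters are illustrative defaults, not tuned values; docs is your list of documents):
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# 1. Embed: turn documents into semantic vectors
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# 2. Reduce: compress high-dimensional embeddings for density-based clustering
reduced = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42).fit_transform(embeddings)

# 3. Cluster: let HDBSCAN find dense groups and mark outliers as -1
labels = HDBSCAN(min_cluster_size=15, metric="euclidean").fit_predict(reduced)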
Key Takeaways
✅ LLM embeddings capture semantics that traditional bag-of-words methods miss
✅ BERTopic provides a modular framework that’s both powerful and customizable
✅ Three-stage pipeline (embed → reduce → cluster) is the foundation
✅ c-TF-IDF extracts distinctive keywords by analyzing clusters, not documents
✅ Production deployment requires careful attention to scalability and monitoring
✅ No one-size-fits-all solution – tune parameters for your specific use case
When to Use Text Clustering
Perfect for:
- Organizing large document collections (research papers, support tickets)
- Discovering emerging themes (social media, news monitoring)
- Data exploration before building classifiers
- Identifying outliers and data quality issues
- Creating semantic navigation systems
Not ideal for:
- Real-time classification (use trained classifiers instead)
- When you need specific predefined categories (use classification)
- Tiny datasets (<100 documents)
- When interpretability isn’t important
Implementation Checklist
Phase 1: Prototype (1-2 days)
- [ ] Install BERTopic and dependencies
- [ ] Load and explore your dataset
- [ ] Run basic clustering with defaults
- [ ] Inspect top 10 topics manually
- [ ] Assess if approach is viable
Phase 2: Optimization (3-5 days)
- [ ] Test different embedding models
- [ ] Tune UMAP parameters (n_neighbors, n_components)
- [ ] Tune HDBSCAN parameters (min_cluster_size, min_samples)
- [ ] Experiment with representation models
- [ ] Calculate quality metrics (silhouette, coherence)
Phase 3: Production (1-2 weeks)
- [ ] Implement batch processing for large datasets
- [ ] Add error handling and retries
- [ ] Set up monitoring and logging
- [ ] Create visualization dashboards
- [ ] Document model parameters and decisions
- [ ] Plan update strategy for new documents
Next Steps
Immediate actions:
- Download the code
- Try clustering on your own dataset
- Share results and questions in comments
Get Help
Questions? Drop a comment below – I respond within 24 hours
Found value? Share this guide with your team
References
- Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
- McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 205.
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
© 2025 Ranjan Kumar. All rights reserved.