Introduction
Clustering, as defined by Brian Everitt et al. [1]:
Given a number of objects or individuals, each of which is described by a set of numerical measures, devise a classification scheme for grouping the objects into a number of classes such that objects within classes are similar in some respect and unlike those from other classes. The number of classes and the characteristics of each class are to be determined.
Definition as an optimization problem [2]:
Suppose we have a dataset:
X = \{ x_1, x_2, ..., x_n \}
the desired number of clusters k is given, and we have a function f that evaluates the quality of a clustering. We want to compute a mapping:
\gamma : \{ 1, 2, ..., n\} \to \{1, 2, ..., k\}
that minimizes the function f subject to some constraints. The similarity measure is the key input to a clustering algorithm.
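To make the optimization view concrete, here is a minimal sketch in Python. It assumes one particular choice of f, the within-cluster sum of squared distances, which the definition above does not prescribe. It also assumes no extra constraints and uses 0-based cluster labels. The names `within_cluster_sse` and `brute_force_clustering` are hypothetical. The search tries every possible mapping, so it is only feasible for a handful of points.

```python
from itertools import product


def within_cluster_sse(points, gamma, k):
    """Assumed quality function f: sum of squared distances to each cluster mean."""
    total = 0.0
    for c in range(k):
        members = [p for p, g in zip(points, gamma) if g == c]
        if not members:
            continue  # an empty cluster contributes nothing
        dim = len(members[0])
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(
            sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in members
        )
    return total


def brute_force_clustering(points, k):
    """Enumerate every mapping gamma: {0..n-1} -> {0..k-1} and keep the best one."""
    best = min(
        product(range(k), repeat=len(points)),
        key=lambda gamma: within_cluster_sse(points, gamma, k),
    )
    return list(best), within_cluster_sse(points, best, k)


X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
gamma, cost = brute_force_clustering(X, k=2)
print(gamma, cost)
```

There are k^n candidate mappings, which is why practical algorithms such as k-means search heuristically instead of exhaustively.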
Use cases
- Identifying fake news
- Spam filtering
- Marketing and sales
- Classifying network traffic
- Identifying fraudulent or criminal activity
- Document analysis
Types of clustering methods
- Hard clustering
- Soft clustering
- Hierarchical clustering
  - Agglomerative (“bottom-up”)
  - Divisive (“top-down”)
- Flat clustering (non-hierarchical, partitional)
Clustering Algorithms
k-means
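A vector-based clustering algorithm: each object is a point in a vector space, and each cluster is represented by the mean (centroid) of its members.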
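A minimal sketch of the standard k-means iteration (often called Lloyd's algorithm) is shown below. It assumes squared Euclidean distance and random initialization from the data points. The function names and the `max_iter` and `seed` parameters are illustrative choices, not part of any particular library.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct data points as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

The result depends on the initial centroids, so in practice the algorithm is usually run several times with different seeds and the best run is kept.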
DBSCAN
Hierarchical Clustering
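As a rough illustration of the agglomerative ("bottom-up") variant, the sketch below starts with every point in its own cluster and repeatedly merges the two closest clusters. It assumes single linkage (cluster distance is the distance between the closest pair of members) and stops at a requested number of clusters. Both are choices made for this example, and the function name `single_linkage` is hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def single_linkage(points, num_clusters):
    # Start bottom-up: each point is its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    squared_distance(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the closest pair of clusters
        del clusters[j]
    return clusters


points = [(0.0,), (1.0,), (10.0,), (11.0,), (50.0,)]
clusters = single_linkage(points, num_clusters=3)
print(clusters)
```

Recording the sequence of merges instead of stopping early would give the full hierarchy (dendrogram). This naive version recomputes all pairwise distances on every pass and is meant only for small inputs.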
Clustering Documents
To cluster documents, we need to represent each document as a vector.
How do we vectorize a document? Common techniques include:
Bag of words
tf-idf
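The two representations can be sketched in a few lines of Python. This example assumes whitespace tokenization, term frequency normalized by document length, and the unsmoothed weighting idf = log(N / df). Libraries often use smoothed or otherwise modified variants, so treat the exact numbers as illustrative. The function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real pipelines do more.
    return text.lower().split()


def bag_of_words(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


def tf_idf(count_vectors):
    """Reweight count vectors: tf = count / document length, idf = log(N / df)."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # df[j]: number of documents that contain term j
    df = [sum(1 for v in count_vectors if v[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df[j]) for j in range(n_terms)]
    weighted = []
    for v in count_vectors:
        total = sum(v)  # assumes no document is empty
        weighted.append([(v[j] / total) * idf[j] for j in range(n_terms)])
    return weighted


docs = ["the cat sat", "the dog sat", "the dog barked"]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
print(vocab)
print(counts[0])
print(weights[0])
```

A term that appears in every document, such as "the" here, gets an idf of zero and therefore no weight. Rare terms are weighted up, which is what makes tf-idf vectors more useful than raw counts for comparing documents.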



Latent Dirichlet allocation (LDA)
Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time.
David M. Blei[6]
References
- Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Oxford University Press, fourth edition, 2001.
- Document Clustering, Pankaj Jajoo
- A Friendly Introduction to Text Clustering, Korbinian Koch
- 7 Innovative Uses of Clustering Algorithms in the Real World, Claire Whittaker
- Cluster Analysis, 5th Edition, Brian S. Everitt et al.
- Probabilistic Topic Models, David M. Blei
- jsLDA: In-browser topic modeling