## Introduction

Clustering, as defined by Brian Everitt et al. [1]:

> Given a number of objects or individuals, each of which is described by a set of numerical measures, devise a classification scheme for grouping the objects into a number of classes such that objects within classes are similar in some respect and unlike those from other classes. The number of classes and the characteristics of each class are to be determined.

Definition as an optimization problem [2]:

Suppose we have a dataset

$$X = \{ x_1, x_2, \dots, x_n \},$$

a desired number of clusters *k*, and an objective function *f* that evaluates the quality of a clustering. We want to compute a mapping

$$\gamma : \{ 1, 2, \dots, n \} \to \{ 1, 2, \dots, k \}$$

that minimizes *f* subject to some constraints. The similarity measure is the key input to a clustering algorithm.
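
To make the definition concrete, the sketch below evaluates one possible choice of *f* for a candidate mapping γ. The text leaves *f* unspecified, so the within-cluster sum of squared Euclidean distances used here is an assumption, and the names `objective`, `good`, and `bad` are purely illustrative.

```python
from collections import defaultdict


def objective(X, gamma):
    """Within-cluster sum of squared Euclidean distances to the cluster mean.

    X     -- list of points (tuples of floats), x_1 .. x_n
    gamma -- list of cluster labels; gamma[i] is the cluster of X[i]
    """
    # Group the points by the cluster label that gamma assigns to them.
    clusters = defaultdict(list)
    for point, label in zip(X, gamma):
        clusters[label].append(point)

    total = 0.0
    for points in clusters.values():
        dim = len(points[0])
        mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum(
            sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in points
        )
    return total


X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = [1, 1, 2, 2]  # nearby points share a cluster
bad = [1, 2, 1, 2]   # nearby points are split apart
print(objective(X, good), objective(X, bad))  # prints: 1.0 50.0
```

A lower value means tighter clusters under this particular *f*, so a clustering algorithm that minimizes it would prefer `good` over `bad`.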

## Use cases

- Identifying fake news
- Spam filtering
- Marketing and sales
- Classifying network traffic
- Identifying fraudulent or criminal activity
- Document analysis

## Types of clustering methods

- Hard clustering
- Soft clustering
- Hierarchical clustering
  - Agglomerative (“bottom-up”)
  - Divisive (“top-down”)
- Flat clustering / non-hierarchical / partitional

## Clustering Algorithms

### k-means

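A vector-based, flat clustering algorithm: each object is a point in a vector space and is assigned to the cluster whose centroid (mean) is nearest.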
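
A minimal sketch of k-means in plain Python, assuming Lloyd's algorithm (alternating assignment and centroid-update steps), squared Euclidean distance, and initial centroids drawn at random from the data. The function names and the `iterations` and `seed` parameters are illustrative choices, not part of any particular library.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k data points chosen at random
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary and depend on the random initialisation. What matters is which points share a label.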

### DBSCAN

### Hierarchical Clustering
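
A minimal sketch of the agglomerative (“bottom-up”) variant in plain Python. No linkage criterion is named here, so single linkage with Euclidean distance is an assumption, and `single_linkage` and `num_clusters` are illustrative names. The brute-force search is meant for readability, not speed.

```python
def single_linkage(points, num_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    # Start with every point in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist(a, b) for a in c1 for b in c2)

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters (i < j) and merge j into i.
        i, j = min(
            ((i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
result = single_linkage(points, num_clusters=2)
print(result)  # prints: [[0, 1], [2, 3]]
```

Recording the sequence of merges instead of stopping at `num_clusters` would give the full hierarchy (dendrogram).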

## Clustering Documents

To cluster documents, we first need to represent each document as a *vector*.

### How do we vectorize a document?

#### Bag of words
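
A minimal sketch of a bag-of-words representation in plain Python: each document becomes a vector of term counts over a shared vocabulary, and word order is discarded. The whitespace tokenizer and the function names are simplifying assumptions.

```python
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```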

#### tf-idf
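
A minimal sketch of tf-idf weighting in plain Python. It assumes one common textbook variant: term frequency is the count divided by the document length, and inverse document frequency is `log(N / df)`. Libraries often use smoothed or normalised variants, so the exact numbers may differ. The function name and the whitespace tokenizer are illustrative, and documents are assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, vectors) with one tf-idf vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            [counts[term] / len(tokens) * idf[term] for term in vocabulary]
        )
    return vocabulary, vectors


docs = ["the cat sat", "the dog sat", "the bird flew"]
vocabulary, vectors = tf_idf(docs)
for term, weight in zip(vocabulary, vectors[0]):
    print(term, round(weight, 3))
```

In this toy corpus "the" occurs in every document, so its idf is `log(1) = 0` and it contributes nothing to any vector, while "cat" occurs in only one document and receives the largest weight in the first vector.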

#### Latent Dirichlet allocation (LDA)

> Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time.

David M. Blei [6]

## References

1. Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Oxford University Press, fourth edition, 2001.
2. Pankaj Jajoo. Document Clustering.
3. Korbinian Koch. A Friendly Introduction to Text Clustering.
4. Claire Whittaker. 7 Innovative Uses of Clustering Algorithms in the Real World.
5. Brian S. Everitt et al. Cluster Analysis, 5th Edition.
6. David M. Blei. Probabilistic Topic Models.
7. jsLDA: In-browser topic modeling.