Clustering

Posted on September 16, 2022 by Ranjan Kumar

Introduction

Clustering, as defined by Brian Everitt et al. [1]:

Given a number of objects or individuals, each of which is described by a set of numerical measures, devise a classification scheme for grouping the objects into a number of classes such that objects within classes are similar in some respect and unlike those from other classes. The number of classes and the characteristics of each class are to be determined.

Definition as an optimization problem[2]:

Suppose we have a dataset:

X = \{ x_1, x_2, ..., x_n \}

Given the desired number of clusters k and a function f that evaluates the quality of a clustering, we want to compute a mapping:

\gamma : \{ 1, 2, ..., n\} \to \{1, 2, ..., k\}

that minimizes the function f subject to some constraints. The similarity measure is the key input to a clustering algorithm.
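As a rough illustration of the definition above, the sketch below evaluates one possible choice of f for a given mapping. The choice of f (within-cluster sum of squared distances), the function name `within_cluster_ss`, and the toy data are assumptions made for this example, not part of the cited definition. Labels are 0-based here, whereas the mapping above uses 1..k.

```python
def within_cluster_ss(points, gamma, k):
    """One possible f: sum of squared distances from each point to its cluster mean."""
    dim = len(points[0])
    total = 0.0
    for c in range(k):
        # Collect the points that gamma maps to cluster c.
        members = [p for p, label in zip(points, gamma) if label == c]
        if not members:
            continue
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


# Toy dataset: two tight pairs that are far apart.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # groups distant points together

print(within_cluster_ss(X, good, 2))  # 4.0
print(within_cluster_ss(X, bad, 2))   # 100.0
```

The mapping with the lower value of f is the better clustering under this particular quality function.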

Use cases

  • Identifying Fake News
  • Spam filter
  • Marketing and Sales
  • Classifying network traffic
  • Identifying fraudulent or criminal activity
  • Document analysis

Types of clustering methods

  • Hard Clustering
  • Soft Clustering
  • Hierarchical clustering
    • Agglomerative (“bottom up”)
    • Divisive (“top-down”)
  • Flat clustering / non-hierarchical / partitional

Clustering Algorithms

k-means

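A vector-based clustering algorithm: each object is a point in a vector space, and each point is assigned to the cluster whose centroid (mean) is nearest.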
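A minimal sketch of k-means in the style of Lloyd's algorithm, using only the Python standard library. The function name `kmeans`, the explicit initial centroids, and the toy points are assumptions for illustration; practical implementations usually pick initial centroids randomly or with a seeding heuristic.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.6), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the starting centroids, so it is common to run the algorithm several times and keep the best clustering.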

StatQuest: K-means clustering

DBSCAN

Clustering with DBSCAN, Clearly Explained!!!
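For readers who prefer code to video, here is a small sketch of the density-based idea behind DBSCAN. The function name `dbscan`, the parameter names `eps` and `min_pts`, and the sample data are assumptions for this example; the neighbor search is a brute-force scan, which is fine for toy data only.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE if it lies in a sparse region."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # core point: keep expanding the cluster
    return labels


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike k-means, the number of clusters is not given in advance, and isolated points are reported as noise.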

Hierarchical Clustering

StatQuest: Hierarchical Clustering
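The agglomerative ("bottom up") variant can be sketched in a few lines: start with every point in its own cluster and repeatedly merge the two closest clusters. The function name `agglomerative`, the use of single linkage (distance between the closest pair of members), and the sample data are assumptions for this example; other linkage criteria are equally valid.

```python
import math


def agglomerative(points, k):
    """Merge the two closest clusters until only k remain (single linkage)."""
    clusters = [[i] for i in range(len(points))]  # each cluster holds point indices

    def linkage(a, b):
        # Single linkage: distance between the closest pair across two clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a].extend(clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerative(pts, 3))  # [[0, 1], [2, 3], [4]]
print(agglomerative(pts, 2))  # [[0, 1, 2, 3], [4]]
```

Recording the order of merges, rather than stopping at k, would give the full hierarchy that a dendrogram displays.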

Clustering Documents

To cluster documents, each document must first be represented as a vector.

How can a document be vectorized? Common techniques include:

Bag of words
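A minimal sketch of a bag-of-words representation: each document becomes a vector of word counts over a shared vocabulary, and word order is discarded. The function name `bag_of_words`, the whitespace tokenizer, and the sample sentences are assumptions for illustration.

```python
from collections import Counter


def bag_of_words(docs):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] for t in vocab])  # missing words count as 0
    return vocab, vectors


docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [0, 1, 1, 0, 1], [1, 1, 0, 1, 2]]
```

These vectors can be passed directly to a vector-based clustering algorithm such as k-means.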

tf-idf

Image Courtesy[3]
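The tf-idf weighting can be sketched as follows. Several variants exist; this example assumes term frequency as count divided by document length and inverse document frequency as log(N / df), which may differ from the variant used in [3]. The function name `tf_idf` and the sample sentences are also assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
weights = tf_idf(docs)
print(weights[0]["the"])  # 0.0, because "the" occurs in every document
```

Terms that appear in every document get weight zero, while terms concentrated in a few documents are weighted up, which tends to make the vectors more discriminative than raw counts.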

Latent Dirichlet allocation (LDA)

Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time.

David M. Blei[6]

References

  1. Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Oxford University Press, fourth edition, 2001.
  2. Pankaj Jajoo. Document Clustering.
  3. Korbinian Koch. A Friendly Introduction to Text Clustering.
  4. Claire Whittaker. 7 Innovative Uses of Clustering Algorithms in the Real World.
  5. Brian S. Everitt et al. Cluster Analysis, 5th Edition.
  6. David M. Blei. Probabilistic Topic Models.
  7. jsLDA: In-browser topic modeling.
