Notes@HKU by Jax

Clustering

Introduction

Clustering segments data points and assigns them to groups based on their similarity. The goal is to find a natural grouping of the data.

Euclidean distance

The distance between points $i$ and $j$ with $K$ attributes is:

$$d_{ij} = \sqrt{\sum_{k=1}^{K}(x_{ik} - x_{jk})^2}$$
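
As a quick illustration, here is a minimal Python sketch of this formula (NumPy is an assumed dependency; the points are made up):

```python
import numpy as np

def euclidean_distance(x_i, x_j):
    # Sum the squared differences over all K attributes, then take the root
    return np.sqrt(np.sum((x_i - x_j) ** 2))

# Hypothetical points with K = 3 attributes
print(euclidean_distance(np.array([1.0, 2.0, 3.0]),
                         np.array([4.0, 6.0, 3.0])))  # 5.0
```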

Centroid distance

A centroid is a point whose value in each attribute is the average of that attribute over all points in the cluster, given by:

$$c_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i$$

where $n_k$ is the number of points in cluster $k$ and $x_i$ is the $i$-th point in cluster $k$.

Centroid distance is the distance between the centroids of two clusters.
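
A small sketch of both definitions, assuming NumPy and two made-up clusters:

```python
import numpy as np

def centroid(points):
    # Average each attribute over all points in the cluster
    return points.mean(axis=0)

def centroid_distance(cluster_a, cluster_b):
    # Euclidean distance between the two cluster centroids
    return np.linalg.norm(centroid(cluster_a) - centroid(cluster_b))

a = np.array([[0.0, 0.0], [2.0, 0.0]])  # centroid (1, 0)
b = np.array([[4.0, 3.0], [4.0, 5.0]])  # centroid (4, 4)
print(centroid_distance(a, b))  # 5.0
```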

K-means clustering

Algorithm:

  1. Pick $k$ as desired, then randomly assign each point to one of the $k$ clusters
  2. Repeat the following steps until no improvement is made:
    1. Compute the centroid of each cluster
    2. Reassign each point to the closest centroid (see the sketch after this list)
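
A minimal NumPy sketch of the steps above, assuming random-assignment initialization as in step 1 (it ignores the edge case where a cluster becomes empty):

```python
import numpy as np

def k_means(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each point to one of the k clusters
    labels = rng.integers(0, k, size=len(points))
    while True:
        # Step 2a: compute the centroid of each cluster
        centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        # Step 2b: reassign each point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # no reassignment: converged
            return labels, centroids
        labels = new_labels

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
labels, cents = k_means(pts, k=2)
```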

Hierarchical clustering

Algorithm:

  1. Start with each point as a separate cluster.
  2. Merge the two closest clusters, until all points are in one cluster.
  3. Draw a dendrogram to visualize the clusters.
  4. Draw a horizontal line to cut the dendrogram. The number of vertical lines the cut intersects is the number of clusters.
    • A cut is a good choice if the line has lots of vertical wiggle room (i.e. the two clusters it separates are far from each other)
    • Balance the clustering objective with the wiggle room

A dendrogram is a tree-like diagram that shows the arrangement of the clusters. The height of each branch should be proportional to the centroid distance between the clusters it merges.
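
A sketch using SciPy's agglomerative implementation (SciPy and Matplotlib are assumed dependencies, and the data is made up); centroid linkage matches the centroid distance defined earlier:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 0]], dtype=float)

# Repeatedly merge the two closest clusters, using centroid distance
Z = linkage(points, method="centroid")

# Branch heights in the dendrogram reflect the merge distances
dendrogram(Z)
plt.show()

# Cutting the tree at height 3 yields the cluster label for each point
print(fcluster(Z, t=3.0, criterion="distance"))
```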

Comparing clustering methods

  • K-means: faster, but $k$ must be specified in advance.
  • Hierarchical: can help with choosing $k$ (by inspecting the dendrogram), but slower due to the pairwise distance computations, so it might not be feasible for large datasets.
