Clustering
Introduction
Clustering segments the data and assigns each point to a group based on similarity. The goal is to find a natural grouping of the data.
A centroid is a point whose value in each attribute is the average of that attribute over all points in the cluster, given by:
$$c_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^{(j)}$$
where $n_j$ is the number of points in cluster $j$ and $x_i^{(j)}$ is the $i$-th point in cluster $j$.
Centroid distance is the distance between centroids of two clusters.
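As a small illustration of the two definitions above, here is a minimal sketch in plain Python. The function names `centroid` and `centroid_distance` are made up for this example, and Euclidean distance is an assumption, since the notes do not say which distance is used.

```python
import math

def centroid(points):
    """Attribute-wise average of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def centroid_distance(cluster_a, cluster_b):
    """Distance between the centroids of two clusters (Euclidean, by assumption)."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

cluster_a = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
cluster_b = [(4.0, 5.0), (6.0, 5.0)]
print(centroid(cluster_a))  # (1.0, 1.0)
print(centroid(cluster_b))  # (5.0, 5.0)
print(centroid_distance(cluster_a, cluster_b))  # about 5.657, i.e. sqrt(32)
```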
Algorithm (k-means):
- Pick the number of clusters $k$ as wanted, then randomly assign each point to one of the $k$ clusters.
- Repeat the following steps until no improvement is made (no point changes cluster):
  - Compute the centroid of each cluster.
  - Reassign each point to the closest centroid.
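The steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: Euclidean distance, a fixed random seed, an iteration cap `max_iter`, and re-seeding an empty cluster with a random data point are all choices made for this example, not part of the notes.

```python
import math
import random

def centroid(points):
    """Attribute-wise average of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    # Step 1: randomly assign each point to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2a: compute the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random data point.
            centroids.append(centroid(members) if members else rng.choice(points))
        # Step 2b: reassign each point to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # no point moved, so no further improvement
            break
        labels = new_labels
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(points, k=2)
print(labels)     # the first three points share one label, the last three the other
print(centroids)
```

Which group ends up labelled 0 or 1 depends on the random initial assignment, so only the grouping itself is meaningful.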
Algorithm (agglomerative hierarchical clustering):
- Start with each point as a separate cluster.
- Repeatedly merge the two closest clusters (smallest centroid distance) until all points are in one cluster.
- Draw a dendrogram to visualize the clusters.
- Draw a horizontal line to cut the dendrogram. The number of vertical lines the cut intersects is the number of clusters.
  - A cut is a good choice if the line has a lot of vertical wiggle room, i.e. it can move up or down without crossing a merge, which means the clusters it separates are far from each other.
  - Balance the objective of the clustering against the wiggle room.
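The merging steps can be sketched as below. This is a deliberately naive version for small inputs, not an efficient implementation. It assumes Euclidean distance between centroids, and the function name `agglomerate` and the returned `(height, cluster_a, cluster_b)` format are invented for this example. The recorded heights are what a dendrogram would plot.

```python
import math

def centroid(points):
    """Attribute-wise average of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def agglomerate(points):
    """Merge the two clusters with the closest centroids until one cluster remains.

    Returns one (height, cluster_a, cluster_b) tuple per merge, in merge order,
    where height is the centroid distance at which the two clusters were joined.
    """
    clusters = [[p] for p in points]  # start with each point as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
merges = agglomerate(points)
print([round(h, 6) for h, _, _ in merges])  # [1.0, 1.0, 5.0, 17.0]
```

One caveat worth knowing: with centroid distance, a later merge can occasionally happen at a lower height than an earlier one, so the heights are not guaranteed to increase as they do in this example.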
A dendrogram is a tree-like diagram that shows the arrangement of the clusters. The height of each branch should be proportional to the centroid distance between the clusters it joins.
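Because branch heights are merge distances, the "wiggle room" of a cut is the gap between two consecutive merge heights. The sketch below picks the cut with the widest gap. The function name `widest_gap_cut` and the sample heights are hypothetical, and the sketch looks only at wiggle room, so it ignores the separate question of how many clusters the analysis actually needs.

```python
def widest_gap_cut(merge_heights):
    """Return (cut_height, n_clusters) for the widest gap between merge heights.

    merge_heights holds one height per merge, so there are len(merge_heights) + 1
    points. At least two merge heights are required.
    """
    heights = sorted(merge_heights)
    n_points = len(heights) + 1
    # Gap i is the vertical room between merge i and merge i + 1.
    gap, idx = max((heights[i + 1] - heights[i], i) for i in range(len(heights) - 1))
    cut = (heights[idx] + heights[idx + 1]) / 2  # place the line mid-gap
    # Every merge at or below the cut has happened: idx + 1 merges in total.
    n_clusters = n_points - (idx + 1)
    return cut, n_clusters

# Hypothetical merge heights for five points.
cut, n_clusters = widest_gap_cut([1.0, 1.0, 5.0, 17.0])
print(cut, n_clusters)  # 11.0 2
```

Here the line at height 11.0 can move anywhere between 5.0 and 17.0 without changing the result, and it crosses two vertical lines of the dendrogram, giving two clusters.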