Cluster Analysis

Cluster analysis stayed inside academic circles for a long time, but the recent “big data” wave made it relevant to BI, Data Visualization, and Data Mining users, because big data sets are in many cases just an artificial union of big data subsets that are almost unrelated to each other.

Cluster analysis is usually defined as a method for finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. The main reason to use such a method is to reduce the size of large data sets. Some people confuse clustering with classification, segmentation, partitioning, or the results of queries; that is a mistake.

Clustering can be ambiguous, as in the picture below, and the result depends on the type of clustering (e.g. partitional, separated, center-based, contiguous, density-based, hierarchical) and on the algorithm (e.g. K-Means).

The most popular approach is partitional K-Means clustering, where each cluster is associated with a centroid (center point), each point is assigned to the cluster with the closest centroid, and the number of clusters (K) must be specified in advance. The basic algorithm is very simple:

  1. Select K points as the initial Centroids
  2. REPEAT
  3. Form K clusters by assigning all points to the closest Centroid
  4. Recompute the Centroid for each cluster
  5. UNTIL “The Centroids don’t change or all changes are below a predefined threshold”
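
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production implementation: the function name `kmeans` and its parameters (`initial`, `max_iter`, `tol`, `seed`) are hypothetical, Euclidean distance is assumed, and an empty cluster simply keeps its previous centroid.

```python
import math
import random


def kmeans(points, k, initial=None, max_iter=100, tol=1e-9, seed=0):
    """Basic K-Means on a list of equal-length numeric tuples (illustrative sketch)."""
    # Step 1: select K points as the initial centroids
    # (random sample unless the caller supplies them explicitly).
    if initial is None:
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(initial)

    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):  # Repeat ...
        # Step 3: form K clusters by assigning each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Step 4: recompute the centroid of each cluster as the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # assumption: keep the old centroid

        # Step 5: ... until no centroid moved by more than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break

    return centroids, clusters


# Two small, well-separated groups of 2-D points (made-up sample data).
sample_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sample_centroids, sample_clusters = kmeans(
    sample_points, k=2, initial=[(0, 0), (10, 10)]
)
```

Passing `initial` explicitly makes it easy to rerun the same data with different starting centroids and compare the resulting clusters.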

The image below demonstrates the importance of choosing the initial centroids and shows six iterations leading to a successful K-Means based Clustering:

The K-Means algorithm is sensitive to differences in cluster size, differences in the density of data points, non-globular cluster shapes, and of course outliers, but in combination with proper Data Visualization those problems can be solved in most cases.

Clustering optimizes the cohesion within each cluster while maximizing the separation between the cluster and the data points outside of it:
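
One common way to put numbers on these two ideas, assuming squared Euclidean distance, is the within-cluster sum of squares (lower means more cohesive) and the between-cluster sum of squares (higher means better separated). The sketch below is illustrative only; the function names `cohesion` and `separation` are hypothetical, and every cluster is assumed to be non-empty.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cohesion(clusters):
    """Within-cluster sum of squares: distance of each point to its own centroid."""
    total = 0.0
    for members in clusters:
        centroid = mean(members)
        total += sum(sq_dist(p, centroid) for p in members)
    return total


def separation(clusters):
    """Between-cluster sum of squares: size-weighted distance of each
    cluster centroid to the overall mean of all points."""
    overall = mean([p for members in clusters for p in members])
    return sum(len(members) * sq_dist(mean(members), overall) for members in clusters)


# Made-up example: two tight clusters that are far apart.
example_clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
example_cohesion = cohesion(example_clusters)
example_separation = separation(example_clusters)
```

For a fixed data set the two quantities add up to the total sum of squares around the overall mean, so lowering the first necessarily raises the second.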
