Back to Glossary
Machine Learning

Clustering

Clustering is an unsupervised machine learning technique used to group similar data points together into clusters. The goal is to identify inherent groupings in the data without any prior knowledge of the class labels.

Explanation

Clustering algorithms work by defining a measure of similarity or distance between data points. Common distance metrics include Euclidean distance, Manhattan distance, and cosine similarity. The algorithm then iteratively assigns data points to clusters based on these distance measures, aiming to minimize the within-cluster variance and maximize the between-cluster variance. Popular clustering algorithms include K-means (partitioning data into K clusters based on distance to centroids), hierarchical clustering (building a hierarchy of clusters), DBSCAN (Density-Based Spatial Clustering of Applications with Noise - identifying clusters based on density of points), and Gaussian Mixture Models (GMMs) which assume data is generated from a mixture of Gaussian distributions. Clustering is valuable for exploratory data analysis, anomaly detection, customer segmentation, and image segmentation. It allows for the identification of patterns and structures that might not be apparent through simple observation.

Related Terms