Clustering is the task of dividing data into sub-groups (clusters). The number of clusters is assumed in advance; the algorithm then determines where the clusters lie and how large they are. It locates the cluster centroids by minimizing the sum of squared distances between each observation and the centroid of its cluster. The k-means algorithm is a heuristic method: it returns a valid clustering, but not always the optimal one.
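The quantity being minimized can be written out explicitly. The notation below is an assumption introduced for illustration and does not come from the text: k is the number of clusters, C_j is the set of observations assigned to cluster j, and \mu_j is the centroid of that cluster.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

A smaller J means the observations lie closer to their centroids. A run of the algorithm stops at a clustering where J can no longer be reduced by its two steps, which need not be the smallest J possible.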
The k-means clustering algorithm is an iterative process that operates over a fixed number of clusters while attempting to satisfy the following requirements:
The centroid of each cluster is the mean of its members.
Each observation belongs to exactly one cluster, the one whose centroid is nearest to it.
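The two requirements above suggest a simple alternating procedure: assign every observation to its nearest centroid, then move each centroid to the mean of its members, and repeat until the assignment stops changing. The following is a minimal sketch of that iteration in plain Python, not the implementation of any particular library. The names `kmeans`, `squared_distance`, and `mean_point`, the `max_iter` and `seed` parameters, the choice of random observations as initial centroids, and the handling of empty clusters (the old centroid is kept) are all assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=None):
    """Return (centroids, labels, sse) for one run of the k-means iteration."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct observations chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignment is stable, so the centroids would not move
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    # Sum of squared distances from each observation to its centroid.
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, sse = kmeans(points, k=2, seed=1)
print(centroids, labels, sse)
```

When the loop stops because the assignment is stable, both requirements hold at once: every non-empty cluster's centroid is the mean of its members, and every observation sits in the cluster with the nearest centroid.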
Because the result of the k-means method depends strongly on how the centroids are initialized, it is a good idea to run the algorithm more than once on the same data set and keep the best result. With randomly selected initial centroids, the classic k-means algorithm has approximately a 72% chance of obtaining the optimal result (measured for evenly distributed observations). A modified variant, in which the initial cluster centroids are placed at the center of the considered space, can also be used. It yields better results but is slower.
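Running the algorithm several times can be sketched as follows: start each run from a different random set of initial centroids and keep the run with the smallest sum of squared distances. This is a hedged illustration in plain Python, not a specific library's API. The names `run_kmeans` and `kmeans_restarts`, the `n_restarts` and `seed` parameters, and the use of randomly chosen observations as initial centroids are assumptions. The centered-initialization variant mentioned above is not shown.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def run_kmeans(points, centroids, max_iter=100):
    """One k-means run from the given initial centroids.

    Returns (centroids, labels, sse). Assumes max_iter >= 1.
    """
    centroids = list(centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assign each observation to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse

def kmeans_restarts(points, k, n_restarts=10, seed=None):
    """Run k-means n_restarts times; return (best_run, sse_of_every_run)."""
    rng = random.Random(seed)
    runs = [run_kmeans(points, rng.sample(points, k)) for _ in range(n_restarts)]
    best = min(runs, key=lambda run: run[2])  # lowest sum of squared distances
    return best, [run[2] for run in runs]

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
(best_centroids, best_labels, best_sse), all_sse = kmeans_restarts(points, k=2, seed=0)
print(best_sse, all_sse)
```

Comparing the entries of `all_sse` shows how much the outcome varies with the initialization. If every run reaches the same value, the data set is probably easy for the chosen number of clusters.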
The k-means algorithm is an unsupervised method: it needs no class labels for the observations.