Chapter 35. K-Means Clustering


Table of Contents

Introduction

Method description

Usage

Data requirements
Model building
Model statistics
Model application

Example of K-Means Clustering

References

Introduction

Clustering is the task of dividing data into sub-groups. It is first assumed that the data can be divided into a given number of sub-groups (clusters); the location and size of the clusters are then determined. The k-Means algorithm locates the cluster centroids by minimizing the sum of squared distances between the observations and the centroid of the cluster to which each is assigned. It is a heuristic method: it always returns a valid partition of the data, but not necessarily the optimal one.
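Written out, the quantity being minimized takes the following form. The notation is introduced here for illustration only and does not come from the module: k is the number of clusters, C_j is the set of observations assigned to cluster j, and \mu_j is the centroid of that cluster.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller value of J means the observations lie closer to their centroids. Because the method is heuristic, the J it reaches may be a local rather than a global minimum.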

The k-Means clustering algorithm is an iterative process that operates on a fixed number of clusters while attempting to satisfy the following requirements:

The centroid of each cluster is the mean of its members.
Each observation belongs to one of the clusters.
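The two requirements above suggest a simple alternating procedure: assign each observation to its nearest centroid, then move each centroid to the mean of its members, and repeat until the assignment stops changing. The following is a minimal sketch in plain Python, not the module's actual implementation. The function names (`assign`, `update`, `k_means`), the choice of initial centroids by random sampling from the data, and the handling of empty clusters are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        mean = tuple(
            sum(m[d] for m in members) / len(members)
            for d in range(len(old))
        )
        new_centroids.append(mean)
    return new_centroids


def k_means(points, k, seed=0, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct observations chosen at random.
    centroids = rng.sample(points, k)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # converged: every centroid is the mean of its members
        labels = new_labels
    return centroids, labels
```

For two well-separated groups of points, calling `k_means(points, 2)` returns one centroid per group and a label list in which points of the same group share a label.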

Because the results of the k-Means method depend strongly on how the centroids are initialized, it is a good idea to run the algorithm more than once on the same data set. With randomly selected initial centroids, the classic k-Means algorithm has approximately a 72% chance of reaching the optimal result (measured for evenly distributed observations). A modified variant, in which the initial cluster centroids are placed at the center of the considered space, can also be used. It yields better results but is slower.
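One common way to act on this advice is to run the algorithm several times from different random initializations and keep the run with the smallest sum of squared distances. The sketch below illustrates the idea on one-dimensional data in plain Python. It is an illustration under stated assumptions, not the module's behaviour: the names `k_means_1d`, `sse` and `best_of_restarts`, the default of 10 restarts, and the random sampling of initial centroids are all hypothetical.

```python
import random


def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum((p - centroids[lab]) ** 2 for p, lab in zip(points, labels))


def k_means_1d(points, k, rng, max_iter=100):
    """One k-means run on 1-D data, starting from k randomly chosen points."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: empty clusters keep their centroid
                centroids[j] = sum(members) / len(members)
    return centroids, labels


def best_of_restarts(points, k, restarts=10, seed=0):
    """Repeat k-means and return (cost, centroids, labels) of the best run."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centroids, labels = k_means_1d(points, k, rng)
        cost = sse(points, centroids, labels)
        if best is None or cost < best[0]:
            best = (cost, centroids, labels)
    return best
```

A call such as `best_of_restarts(points, 3)` returns the lowest-cost result among the restarts. More restarts raise the chance of reaching the optimal partition at the price of proportionally more computation.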

Figure 35.1. The idea of clustering

The k-Means algorithm is an unsupervised method.
