A New Method for Initializing Cluster Centers Based on Data Distribution Characteristics

Abstract: This paper proposes a new initialization method for K-Means based on the local and global distribution of the data. The data space is first partitioned into a grid and the number of data points in each cell is counted; cells whose count is a local maximum, that is, cells containing more points than their neighboring cells, are selected. A distance optimization method is then applied globally to these local-maximum cells to estimate the K initial cluster centers. The proposed method is compared with five other typical initialization methods on both synthetic and real-life data sets. The results show that the centers estimated from local-maximum cells and distance optimization maintain or improve clustering quality while markedly reducing the number of iterations, and therefore speed up convergence.
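The pipeline outlined in the abstract (grid the space, keep locally densest cells, then spread K seeds by distance) can be illustrated with a minimal sketch. The abstract does not specify the grid resolution, the neighborhood definition, or the distance optimization rule, so everything concrete here is an assumption: the function name `grid_seeds`, the `bins` parameter, the use of all adjacent cells as the neighborhood, the cell centroid as the candidate center, and a greedy farthest-point rule standing in for the distance optimization step.

```python
import itertools
import math
from collections import defaultdict


def grid_seeds(points, k, bins=8):
    """Pick up to k initial centers from locally densest grid cells.

    Illustrative sketch only; `bins`, the neighborhood, and the
    farthest-point rule are assumptions, not the paper's definitions.
    """
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]

    def cell_of(p):
        # Map a point to integer cell coordinates; clamp the upper edge
        # of the data range into the last bin.
        idx = []
        for d in range(dim):
            span = hi[d] - lo[d]
            i = int((p[d] - lo[d]) / span * bins) if span > 0 else 0
            idx.append(min(i, bins - 1))
        return tuple(idx)

    # Step 1: group points by cell (only occupied cells are stored).
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p)].append(p)

    # Step 2: keep cells whose count is not exceeded by any adjacent cell.
    # Ties between adjacent cells keep both cells as candidates.
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=dim) if any(o)]
    candidates = []
    for cell, members in cells.items():
        neighbours = (tuple(c + o for c, o in zip(cell, off)) for off in offsets)
        if all(len(members) >= len(cells.get(n, ())) for n in neighbours):
            centroid = tuple(
                sum(m[d] for m in members) / len(members) for d in range(dim)
            )
            candidates.append((len(members), centroid))

    # Step 3 (assumed rule): start from the densest candidate, then greedily
    # add the candidate farthest from the seeds chosen so far.
    candidates.sort(key=lambda c: c[0], reverse=True)
    pool = [c[1] for c in candidates]
    seeds = [pool.pop(0)]
    while len(seeds) < k and pool:
        best = max(pool, key=lambda c: min(math.dist(c, s) for s in seeds))
        pool.remove(best)
        seeds.append(best)
    return seeds
```

The returned seeds would then be passed to an ordinary K-Means run as its starting centers. Two limits of this sketch are worth noting: it returns fewer than k seeds when the grid yields fewer than k local-maximum cells, and the neighborhood scan grows as 3^d with the dimension d, so it is only practical for low-dimensional data.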