An Improved Subtractive Clustering Algorithm Based on Hadoop

Abstract: The traditional subtractive clustering algorithm has high time complexity and is not distributed, so it cannot meet the requirements of big data processing. This paper proposes an improved subtractive clustering algorithm based on Hadoop. It uses multiple MapReduce processes to restructure the execution of subtractive clustering, parallelizing four steps: solving the neighborhood radius, initializing the density index, updating the density index, and partitioning the data records. Experimental results show that, compared with the traditional serial algorithm, the proposed algorithm clusters big data quickly and exhibits good stability and scalability.
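To make the steps named in the abstract concrete, the following is a minimal single-machine sketch of subtractive clustering in its commonly described form, with the density initialization written as a map step and a reduce step. It is an illustration under stated assumptions, not the implementation proposed in this paper: the function names, the parameters `ra`, `rb_factor` and `stop_ratio`, and their default values are all hypothetical, the neighborhood radius is passed in rather than solved for, and the map and reduce functions are plain Python stand-ins rather than Hadoop APIs.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_density(i, points, ra):
    """Map step (stand-in): emit (i, contribution) for every point near or far from i."""
    alpha = 4.0 / (ra * ra)
    return [(i, math.exp(-alpha * sq_dist(points[i], p))) for p in points]


def reduce_density(pairs):
    """Reduce step (stand-in): sum the contributions for each key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0.0) + value
    return totals


def subtractive_clustering(points, ra=1.0, rb_factor=1.5, stop_ratio=0.15):
    """Return cluster centers; ra, rb_factor and stop_ratio are assumed values."""
    if not points:
        return []
    beta = 4.0 / ((rb_factor * ra) ** 2)
    # Initialize the density index of every point.
    pairs = [kv for i in range(len(points)) for kv in map_density(i, points, ra)]
    density = reduce_density(pairs)
    centers = []
    first_peak = None
    while len(centers) < len(points):
        c = max(density, key=density.get)
        peak = density[c]
        if first_peak is None:
            first_peak = peak
        elif peak < stop_ratio * first_peak:
            break  # remaining peaks are too weak to be centers
        centers.append(points[c])
        # Update the density index: subtract the influence of the new center.
        for i in density:
            density[i] -= peak * math.exp(-beta * sq_dist(points[i], points[c]))
    return centers


def assign(points, centers):
    """Partition the records: label each point with its nearest center."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]


if __name__ == "__main__":
    data = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
    found = subtractive_clustering(data, ra=1.0)
    print(found, assign(data, found))
```

The density initialization compares every point with every other point, which is the quadratic cost the abstract refers to. Because each point's contributions can be computed independently and then summed by key, this step maps naturally onto a map phase followed by a reduce phase.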