Research on a Parallel K-means Algorithm Based on Spark

许明杰; 蔚承建; 沈航

Research on K-means Algorithm of Spark Parallelization

Abstract: To address the memory shortage that arises when K-means processes massive data sets, caused by the growing number of iterative computations, this paper proposes a Spark-parallelized K-means algorithm. Particle swarm optimization (PSO) is combined with K-means: PSO improves the global search ability of K-means and supplies the initial cluster centers. The K-means algorithm is then combined with the Spark parallel framework, exploiting Spark's iterative computing capability to speed up data processing and shorten the overall running time. Experiments on disease detection data show that the Spark-parallelized PSOK-means algorithm greatly improves efficiency while maintaining accuracy, making it well suited to clustering massive data sets.
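The two stages named in the abstract, PSO to choose initial cluster centers and K-means to refine them, can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not give the fitness function, the PSO parameters, or the Spark job layout, so the sum-of-squared-errors fitness, the values of `w`, `c1`, `c2`, and all function names below are assumptions made for illustration. The K-means step is written as an assign phase followed by a per-cluster sum, which is the shape that would typically be expressed on Spark as a map followed by a reduce-by-key over an RDD.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def pso_init_centers(points, k, n_particles=10, n_iters=30,
                     w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for k initial centers with a basic PSO (assumed SSE fitness)."""
    rng = random.Random(seed)
    dim = len(points[0])
    size = k * dim

    def unflatten(pos):
        # A particle is a flat vector holding k candidate centers.
        return [pos[i * dim:(i + 1) * dim] for i in range(k)]

    # Start each particle from k randomly sampled data points.
    particles = [[float(x) for p in rng.sample(points, k) for x in p]
                 for _ in range(n_particles)]
    velocities = [[0.0] * size for _ in range(n_particles)]
    pbest = [list(p) for p in particles]
    pbest_fit = [sse(points, unflatten(p)) for p in particles]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = list(pbest[g]), pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            pos, vel = particles[i], velocities[i]
            for d in range(size):
                r1, r2 = rng.random(), rng.random()
                # Standard velocity update: inertia + personal + global pull.
                vel[d] = (w * vel[d]
                          + c1 * r1 * (pbest[i][d] - pos[d])
                          + c2 * r2 * (gbest[d] - pos[d]))
                pos[d] += vel[d]
            fit = sse(points, unflatten(pos))
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = list(pos), fit
                if fit < gbest_fit:
                    gbest, gbest_fit = list(pos), fit
    return unflatten(gbest)


def kmeans(points, centers, max_iters=50):
    """Lloyd's K-means, structured as assign (map) then sum per cluster (reduce)."""
    centers = [list(c) for c in centers]
    for _ in range(max_iters):
        sums = {}  # cluster index -> (coordinate sums, point count)
        for p in points:
            j = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            s, n = sums.get(j, ([0.0] * len(p), 0))
            sums[j] = ([a + b for a, b in zip(s, p)], n + 1)
        # Empty clusters keep their previous center.
        new_centers = [
            [x / sums[j][1] for x in sums[j][0]] if j in sums else centers[j]
            for j in range(len(centers))
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

A call such as `kmeans(points, pso_init_centers(points, k))` chains the two stages. In a distributed setting only the assign-and-sum loop needs to touch the full data set, which is why that step is the natural candidate for parallelization.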
