Citation: CHAI Bian-fang, LI You-yi. Active overlapping K-means clustering algorithm based on Spark[J]. Microelectronics & Computer, 2021, 38(1): 70-76.

Active overlapping K-means clustering algorithm based on Spark

  • Abstract: The Spark-based parallel overlapping K-means clustering algorithm (POKM) can effectively identify latent patterns in large-scale data, but it has two problems. First, data is exchanged between the Master and Worker nodes in every iteration, which makes the algorithm inefficient. Second, it is sensitive to the initial cluster centers, which makes the clustering results unstable and convergence slow. To improve running efficiency and result stability, an active overlapping K-means clustering algorithm is proposed. It runs the overlapping K-means algorithm on each partition to obtain local cluster centers, collects these centers on the Master node, and then runs overlapping K-means on the Master node to aggregate them into the final cluster centers. In addition, a parallel active selection strategy is used to obtain better initial cluster centers, which improves accuracy and convergence speed. Experimental results show that the improved active overlapping clustering algorithm achieves higher accuracy and a shorter running time.
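
The two-stage structure described in the abstract (cluster each partition locally, then cluster the collected local centers on the Master) can be illustrated with a minimal single-process sketch. It rests on several assumptions that are not taken from the paper: partitions are plain Python lists rather than Spark RDD partitions, the center update is a plain mean of the assigned points (a simplification of the overlapping K-means update), initial centers are simply the first k points (the paper's active selection strategy is not reproduced), and the names `assign`, `overlapping_kmeans`, and `two_stage` are hypothetical.

```python
from math import dist  # Python 3.8+


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def assign(x, centers):
    """Greedy overlapping assignment (hypothetical helper).

    Start with the nearest center, then keep adding the next-nearest
    center while the mean of the chosen centers (the point's "image")
    moves strictly closer to x. Returns the chosen center indices.
    """
    order = sorted(range(len(centers)), key=lambda j: dist(x, centers[j]))
    chosen = [order[0]]
    best = dist(x, centers[order[0]])
    for j in order[1:]:
        image = mean([centers[i] for i in chosen + [j]])
        d = dist(x, image)
        if d < best:
            chosen.append(j)
            best = d
        else:
            break
    return chosen


def overlapping_kmeans(points, k, iters=20):
    """Simplified overlapping K-means (assumption: plain mean update)."""
    centers = list(points[:k])  # naive init, not the paper's active selection
    for _ in range(iters):
        members = [[] for _ in range(k)]
        for x in points:
            for j in assign(x, centers):
                members[j].append(x)  # a point may join several clusters
        new_centers = [mean(m) if m else centers[j]
                       for j, m in enumerate(members)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def two_stage(partitions, k):
    """Stage 1: cluster each partition; stage 2: cluster the local centers."""
    local_centers = [c for part in partitions
                     for c in overlapping_kmeans(part, k)]
    return overlapping_kmeans(local_centers, k)


partitions = [
    [(0, 0), (10, 10), (0, 1), (10, 11)],
    [(1, 0), (11, 10), (1, 1), (11, 11)],
]
final_centers = two_stage(partitions, k=2)
```

In this sketch only k centers per partition reach the second stage, which mirrors the abstract's point that collecting local centers once avoids exchanging data between Master and Workers on every iteration. On a real Spark cluster the first stage would presumably run as a per-partition operation with the results gathered on the driver; the abstract does not say which API the authors used.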

     
