An Improved Big Data Clustering Method Based on Sampling Fusion

Liu Yan (刘岩); Wang Cunrui (王存睿)

Abstract

Effectively mining large campus network data sets to increase the value of the information they hold has a far-reaching effect on campus network optimization. To this end, this paper proposes Leaders-k-means, an improved clustering algorithm for campus network big data based on the Leaders algorithm. The algorithm first applies the Leaders algorithm to the full data set to obtain initial cluster centers, and then, on the basis of those centers, draws several random samples from the data to form a number of small sample sets. Each small sample set is clustered with k-means, using the initial cluster centers as the starting values. This both gives k-means reasonable initial values and lets it run on a much smaller sample set, which improves efficiency. Finally, the clustered sample sets are merged, and a bottom-up hierarchical clustering method re-clusters them to obtain the cluster centers of the original data set. The algorithm combines the advantages of hierarchical, partitioning, and density-based methods. Comparative experiments show that it achieves good clustering results.
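
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation, and it rests on several assumptions that the abstract does not settle: samples are drawn uniformly at random, the merge step pools the per-sample k-means centers rather than the raw sample points, the hierarchical step uses size-weighted centroid linkage, and the final number of clusters equals the number of leaders. The function names and the parameters `threshold`, `n_samples`, and `sample_size` are hypothetical.

```python
import math
import random


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def leaders(points, threshold):
    """Single-pass Leaders clustering: a point farther than `threshold`
    from every existing leader becomes a new leader."""
    found = []
    for p in points:
        if all(math.dist(p, leader) > threshold for leader in found):
            found.append(p)
    return found


def kmeans(points, init_centers, iters=20):
    """Lloyd's k-means started from the given initial centers."""
    centers = list(init_centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        updated = [mean(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


def agglomerate(centers, k):
    """Bottom-up merging (assumed centroid linkage): repeatedly merge the
    two closest centroids, weighted by merge count, until k remain."""
    clusters = [(c, 1) for c in centers]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: math.dist(clusters[ij[0]][0],
                                                   clusters[ij[1]][0]))
        (a, na), (b, nb) = clusters[i], clusters[j]
        merged = tuple((x * na + y * nb) / (na + nb) for x, y in zip(a, b))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((merged, na + nb))
    return [c for c, _ in clusters]


def leaders_kmeans(points, threshold, n_samples=5, sample_size=100, seed=0):
    """Leaders for initial centers, k-means on each random sample,
    then hierarchical merging of the pooled sample centers."""
    rng = random.Random(seed)
    init = leaders(points, threshold)
    pooled = []
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        pooled.extend(kmeans(sample, init))
    return agglomerate(pooled, len(init))
```

A call such as `leaders_kmeans(data, threshold=5.0)` takes `data` as a list of coordinate tuples and returns one center per leader found in the first pass. Only the Leaders pass touches the full data set; the k-means iterations run on the small samples, which is where the abstract locates the efficiency gain.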
