Citation: LI Xiao-lin, DU Tuo, XIE Yong. Algorithm for Mining Frequent Patterns in Big Data Based on Hadoop[J]. Microelectronics & Computer, 2018, 35(9): 14-19.

Algorithm for Mining Frequent Patterns in Big Data Based on Hadoop

Abstract: Traditional frequent pattern mining algorithms cannot meet the demands of mining in big data environments. To address this, a parallel algorithm, H_PrePost, is proposed for efficiently mining frequent patterns in large databases. First, the PrePost algorithm is improved by compressing the database, simplifying the data representation, and adopting efficient join and pruning strategies, which raises mining efficiency in stand-alone mode. The improved algorithm is then migrated to the Hadoop platform, where the MapReduce model is used for parallel computation, and a load balancing strategy is proposed to keep the cluster running efficiently. Finally, the mined frequent patterns are evaluated with the Kulczynski measure and the imbalance ratio to ensure that they have practical value. Experimental results show that H_PrePost can effectively mine frequent patterns in large data sets.
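The abstract names the Kulczynski measure and the imbalance ratio but gives no formulas. The sketch below uses the definitions commonly found in the data mining literature, which may differ in detail from the paper's own formulation: Kulc(A, B) = (P(A|B) + P(B|A)) / 2 and IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B)). The function names and the sample support counts are hypothetical, chosen only for illustration.

```python
def kulczynski(sup_a, sup_b, sup_ab):
    """Average of the two conditional probabilities P(B|A) and P(A|B).

    sup_a, sup_b: support counts of itemsets A and B.
    sup_ab: support count of transactions containing both A and B.
    (Standard textbook definition; assumed, not taken from the paper.)
    """
    if sup_a <= 0 or sup_b <= 0:
        raise ValueError("supports of A and B must be positive")
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)


def imbalance_ratio(sup_a, sup_b, sup_ab):
    """|sup(A) - sup(B)| divided by the number of transactions with A or B.

    0 means A and B occur equally often; values near 1 mean one of them
    is far more frequent than the other.
    """
    denom = sup_a + sup_b - sup_ab
    if denom <= 0:
        raise ValueError("A or B must occur in at least one transaction")
    return abs(sup_a - sup_b) / denom


# Hypothetical support counts, for illustration only.
balanced = (kulczynski(200, 200, 100), imbalance_ratio(200, 200, 100))
skewed = (kulczynski(1000, 100, 100), imbalance_ratio(1000, 100, 100))
```

Read together, a Kulczynski value near 0.5 combined with a high imbalance ratio (as in the `skewed` case) suggests that the apparent association is driven mostly by one very frequent itemset, so the pattern deserves a closer look before it is treated as useful.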

     
