A Parallel Webpage Text Deduplication Algorithm Based on Feature Strings

Abstract: To address the low efficiency of deduplicating massive volumes of webpage text, this paper proposes an efficient parallel webpage deduplication algorithm. The algorithm uses the Map/Reduce mechanism of the Hadoop framework: it extracts feature strings from the webpage text, maps the extracted feature strings to hash codes with Google's Simhash algorithm, and then compares the Hamming distances between the resulting hash codes to identify duplicate webpages. Experiments show that, compared with related deduplication algorithms, the proposed algorithm effectively improves the computational efficiency of text deduplication.
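
The pipeline summarized in the abstract (feature strings, Simhash codes, Hamming-distance comparison) can be sketched on a single machine as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not say how feature strings are extracted, so word 3-grams are used as a stand-in; the 64-bit fingerprint width, the MD5-based token hash, and the threshold of 3 are hypothetical choices; and the Hadoop Map/Reduce distribution is not reproduced. All function names are invented for the example.

```python
import hashlib
from itertools import combinations

HASH_BITS = 64  # assumed fingerprint width; the abstract does not state one


def token_hash(token):
    """Map a feature string to a 64-bit integer (MD5 prefix, an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def extract_features(text, n=3):
    """Stand-in feature extraction: overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def simhash(features):
    """Unweighted Simhash: each feature votes +1 or -1 on every bit position."""
    votes = [0] * HASH_BITS
    for feature in features:
        h = token_hash(feature)
        for bit in range(HASH_BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_duplicates(pages, threshold=3):
    """Return pairs of page ids whose fingerprints differ in <= threshold bits."""
    fingerprints = {pid: simhash(extract_features(text))
                    for pid, text in pages.items()}
    return [(a, b) for a, b in combinations(sorted(fingerprints), 2)
            if hamming_distance(fingerprints[a], fingerprints[b]) <= threshold]
```

Calling `find_duplicates({"p1": text1, "p2": text2})` returns the id pairs judged to be duplicates. The all-pairs comparison here is quadratic in the number of pages; in a Map/Reduce setting one would presumably compute fingerprints in the map phase and group candidate pairs in the reduce phase, but that design is not described in the abstract and is left out of this sketch.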