A Parallel Webpage Duplicate Removal Algorithm Based on Character String

XIE Yao-bing. A Parallel Webpage Duplicate Removal Algorithm Based on Character String[J]. Microelectronics & Computer, 2015, 32(2): 69-72.

Abstract

To address the inefficiency of existing duplicate removal methods on very large collections of webpages, this paper proposes a parallel webpage duplicate removal algorithm based on character strings. The algorithm uses the MapReduce model in Hadoop to extract character strings from webpage content and converts them into hash codes with the Simhash algorithm. It then compares the Hamming distances between all hash codes to find duplicate webpages. Experimental results show that the algorithm is more efficient than related algorithms.
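
The Simhash and Hamming distance steps named in the abstract can be illustrated with a minimal single-process sketch. It is not the paper's Hadoop implementation: the use of character 4-grams as the "character strings", the MD5-based 64-bit feature hash, the threshold of 3 bits, and the function names `simhash`, `hamming_distance`, and `find_duplicates` are all assumptions made for illustration. In a MapReduce setting, the fingerprinting step would roughly correspond to the map phase and the comparison to the reduce phase.

```python
import hashlib
from itertools import combinations


def simhash(text, ngram=4, bits=64):
    """Return a `bits`-bit Simhash fingerprint of `text` (bits <= 128).

    Assumption: features are overlapping character n-grams, each with weight 1.
    """
    features = [text[i:i + ngram] for i in range(max(1, len(text) - ngram + 1))]
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_duplicates(pages, threshold=3):
    """Return pairs of page ids whose fingerprints differ in at most `threshold` bits.

    `pages` maps a page id to its extracted text; the threshold is an assumed value.
    """
    fingerprints = {pid: simhash(text) for pid, text in pages.items()}
    return [
        (p, q)
        for p, q in combinations(sorted(fingerprints), 2)
        if hamming_distance(fingerprints[p], fingerprints[q]) <= threshold
    ]
```

Calling `find_duplicates({"a": text_a, "b": text_b})` returns `[("a", "b")]` whenever the two texts are identical, since identical input always yields the same fingerprint. The all-pairs comparison shown here is quadratic in the number of pages, which is the cost a parallel implementation would aim to distribute.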
