A Parallel Webpage Duplicate Removal Algorithm Based on Character String
-
Abstract
To address the inefficiency of existing duplicate removal methods on very large collections of webpages, this paper proposes a parallel webpage duplicate removal algorithm based on character strings. The algorithm uses the MapReduce model in Hadoop to extract character strings from webpage content and applies the Simhash algorithm to convert them into hash codes. It then compares the Hamming distances between all hash codes to find duplicate webpages. Experimental results show that the algorithm is more efficient than related algorithms.
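The pipeline summarized in the abstract (extract character strings, compute a Simhash fingerprint, compare Hamming distances) can be illustrated with a minimal single-machine sketch. It is not the Hadoop implementation described in the paper. The 4-character shingles, the use of MD5 as the per-feature hash, the 64-bit fingerprint width, the threshold of 3 bits, and all function names are assumptions made for illustration.

```python
import hashlib
from itertools import combinations

BITS = 64  # fingerprint width (assumed; not specified in the abstract)


def shingles(text, k=4):
    """Split text into overlapping k-character strings (k is an assumption)."""
    text = " ".join(text.split())  # normalize whitespace
    return [text[i:i + k] for i in range(max(1, len(text) - k + 1))]


def simhash(features):
    """Compute a 64-bit Simhash fingerprint with unit feature weights."""
    weights = [0] * BITS
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_duplicates(pages, threshold=3):
    """Return URL pairs whose fingerprints differ by at most `threshold` bits."""
    # In a MapReduce setting, this per-page step would be the map phase.
    fingerprints = {url: simhash(shingles(body)) for url, body in pages.items()}
    return [
        (u, v)
        for (u, fu), (v, fv) in combinations(fingerprints.items(), 2)
        if hamming(fu, fv) <= threshold
    ]
```

Because each fingerprint depends only on its own page, the fingerprinting step parallelizes naturally. The all-pairs comparison shown here is quadratic in the number of pages and is the part that a parallel implementation would need to distribute.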