XIE Yao-bing. A Parallel Webpage Duplicate Removal Algorithm Based on Character String[J]. Microelectronics & Computer, 2015, 32(2): 69-72.

A Parallel Webpage Duplicate Removal Algorithm Based on Character String

  • To address the inefficiency of existing duplicate removal methods on very large collections of webpages, this paper proposes a parallel webpage duplicate removal algorithm based on character strings. The algorithm uses the MapReduce model in Hadoop to extract character strings from webpage content and converts them into hash codes with the Simhash algorithm. It then compares the Hamming distances between the hash codes to find duplicate webpages. Experimental results show that the algorithm is more efficient than related algorithms.
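
The Simhash and Hamming distance steps named in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the abstract gives no details of the string extraction, fingerprint width, or distance threshold, so the character 4-grams, the 64-bit MD5-derived hashes, the threshold of 3, and all function names below are assumptions chosen for illustration. In a Hadoop job, the per-page fingerprinting would plausibly run in the map phase, but that mapping is also an assumption.

```python
import hashlib
from itertools import combinations

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one


def extract_features(text, n=4):
    """Split text into overlapping character n-grams (an assumed feature choice)."""
    normalized = " ".join(text.split()).lower()
    if len(normalized) <= n:
        return [normalized] if normalized else []
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def feature_hash(feature):
    """Map one feature to a 64-bit integer using the first 8 bytes of its MD5 digest."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text):
    """Compute a Simhash fingerprint: each feature votes +1 or -1 on every bit."""
    counts = [0] * FINGERPRINT_BITS
    for feature in extract_features(text):
        h = feature_hash(feature)
        for bit in range(FINGERPRINT_BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:  # a bit is set when the positive votes win
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_duplicates(pages, max_distance=3):
    """Return page-key pairs whose fingerprints differ in at most max_distance bits."""
    fingerprints = {key: simhash(body) for key, body in pages.items()}
    return [
        (u, v)
        for (u, fu), (v, fv) in combinations(fingerprints.items(), 2)
        if hamming_distance(fu, fv) <= max_distance
    ]
```

Calling `find_duplicates({"a": text_a, "b": text_b})` returns `[("a", "b")]` when the two texts are identical, since identical inputs always produce identical fingerprints. The all-pairs comparison here is quadratic in the number of pages and is only meant to show the distance test itself.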
