Research on Parallel Text Classification System Based on Non-Balanced LSH
-
Abstract
To address the low efficiency of the K-Nearest Neighbors (KNN) classification algorithm on massive text collections, this paper proposes a non-balanced locality-sensitive hashing (LSH) classification algorithm based on hyperplanes. Compared with the traditional LSH algorithm, it yields a more significant improvement in both accuracy and real-time performance. To further reduce the execution time of the classification algorithm and improve classification efficiency, an efficient parallel text classification system based on Hadoop is designed, which combines the classification algorithm with the Spark parallel computing model. Experimental results show that the system achieves both high classification speed and high classification accuracy.
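To illustrate the general idea of hyperplane-based LSH as a pre-filter for KNN, the following is a minimal single-machine sketch. It shows standard random-hyperplane hashing, not the non-balanced variant proposed here, and it does not involve Hadoop or Spark. All names (`make_hyperplanes`, `signature`, `HyperplaneLSHClassifier`) and parameter values are assumptions chosen for illustration, and documents are assumed to be already converted to fixed-length numeric vectors.

```python
import random
from collections import Counter, defaultdict


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals with Gaussian components (illustrative choice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def signature(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on the non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class HyperplaneLSHClassifier:
    """Hypothetical KNN classifier that searches only the query's hash bucket."""

    def __init__(self, dim, num_planes=8, k=3, seed=0):
        self.planes = make_hyperplanes(num_planes, dim, seed)
        self.k = k
        self.buckets = defaultdict(list)  # signature -> [(vector, label), ...]
        self.items = []                   # all training pairs, used as a fallback

    def fit(self, vectors, labels):
        for vec, label in zip(vectors, labels):
            self.items.append((vec, label))
            self.buckets[signature(vec, self.planes)].append((vec, label))
        return self

    def predict(self, vec):
        # Candidates are the training vectors that hash to the same bucket.
        candidates = self.buckets.get(signature(vec, self.planes))
        if not candidates:
            candidates = self.items  # empty bucket: fall back to exhaustive KNN
        nearest = sorted(candidates, key=lambda item: cosine(vec, item[0]),
                         reverse=True)[:self.k]
        # Majority vote among the k most similar candidates.
        return Counter(label for _, label in nearest).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 2-dimensional "document vectors" with made-up labels.
    train = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1]]
    labels = ["sports", "sports", "finance", "finance"]
    clf = HyperplaneLSHClassifier(dim=2, num_planes=8, k=3).fit(train, labels)
    print(clf.predict([2.0, 0.0]))
```

In this sketch the speed-up over plain KNN comes from comparing the query only against vectors in its own bucket. More hyperplanes give smaller buckets and faster lookups, at the cost of more true neighbors landing in other buckets.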
-