Abstract:
In the classic similar-duplicate record detection algorithm SNM, as the record dimension increases, the projection step not only loses data but also markedly raises the algorithm's error rate. To address this deficiency of the SNM algorithm, an R-tree is used to build an index that preserves the high-dimensional spatial characteristics of the records. Clustering reduces the number of record comparisons and thereby improves efficiency. To avoid the effects of sparsity in high-dimensional data, an improved distance algorithm for measuring record similarity is proposed. Finally, the validity of the algorithm is verified by comparing it with the SNM algorithm on real data in different dimensions.
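
For context, the SNM baseline referred to in the abstract can be sketched roughly as follows. This is a minimal illustration and not the implementation evaluated in the paper: the key function `make_key`, the field-wise `similarity` measure, the window size, and the threshold are all assumptions chosen for the example. It shows where the projection happens: every multi-field record is reduced to a single sort key, and only records that land close together in the sorted order are compared.

```python
from difflib import SequenceMatcher


def make_key(record):
    # Hypothetical projection: the first three characters of each field,
    # concatenated into one string. Information beyond the prefix is lost.
    return "".join(str(field)[:3].lower() for field in record)


def similarity(a, b):
    # Assumed similarity measure: mean of per-field string ratios.
    # This is a placeholder, not the improved distance proposed in the paper.
    scores = [
        SequenceMatcher(None, str(x).lower(), str(y).lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(scores) / len(scores)


def snm(records, window=3, threshold=0.85):
    # 1. Sort record indices by the projected key.
    order = sorted(range(len(records)), key=lambda i: make_key(records[i]))
    # 2. Slide a window over the sorted order and compare each record
    #    with the (window - 1) records that follow it.
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


records = [
    ("John Smith", "New York"),
    ("Jon Smith", "New York"),
    ("Alice Brown", "Boston"),
]
duplicates = snm(records, window=2)
```

In this sketch, two duplicate records whose keys sort far apart are never compared, which is one way the projection can cause missed matches as the number of fields grows.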