An Improved High-Dimensional Similar Duplicate Record Detection Algorithm Based on the R-tree Index

Abstract: The classic similar duplicate record detection algorithm, SNM, degrades as record dimensionality increases: its projection step loses information, and its error rate rises markedly. To address these shortcomings, the DRR algorithm is proposed. DRR builds an R-tree index to preserve the high-dimensional spatial characteristics of the records, and it clusters records to reduce the number of comparisons within leaf nodes, which improves efficiency. It also uses an improved distance measure for record similarity that avoids the effects of high-dimensional data sparsity. Finally, experiments on real data compare DRR with SNM at different dimensionalities and verify the effectiveness of the proposed algorithm.