LIN Jing-huai, LIU Zhi-yu, LI Jun-liang, GAO Xin, LI Ze-ke, TANG Zhi-jun, YU Si-hang, XU Jian-hang. High-dimensional hypersphere oversampling method for imbalance classification[J]. Microelectronics & Computer, 2021, 38(5): 65-72.

High-dimensional hypersphere oversampling method for imbalance classification

  • Abstract: In imbalanced classification, the large gap between the numbers of majority-class and minority-class samples tends to lower the accuracy of the classifier. Oversampling methods typified by SMOTE are an effective way to address this problem. These methods rebalance the data set by randomly generating new minority points on selected line segments, but they ignore the diversity of the minority-class distribution in high-dimensional space. This paper proposes a high-dimensional hypersphere oversampling method, Hypersphere-SMOTE (HS-SMOTE), for imbalanced data classification. The number of samples needed for rebalancing is drawn from the minority sample set by random sampling. For each drawn sample in turn, its nearest neighbor in the minority class is selected by Euclidean distance, and a sampling hypersphere is constructed in the high-dimensional space with the midpoint of the segment joining the two points as its center. The required new minority points are then generated at random inside this region through a dimension-wise distance iteration, which increases the spatial diversity of the minority samples while rebalancing the classes. Extensive experiments were carried out on 15 KEEL imbalanced data sets with a Random Forest (RF) classifier. Compared with 6 typical oversampling methods, the proposed method performs well on both the G-mean and F1-score metrics, and its effectiveness is confirmed by two statistical hypothesis tests.
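The sampling procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not state the hypersphere radius or the details of the dimension-wise distance iteration, so this sketch assumes a radius of half the distance between the two points and substitutes plain uniform sampling inside the ball. The function name `hypersphere_oversample` and its parameters are hypothetical.

```python
import numpy as np


def hypersphere_oversample(X_min, n_new, rng=None):
    """Generate n_new synthetic minority points (illustrative sketch only).

    Assumptions not taken from the abstract:
      * the hypersphere radius is half the seed-to-neighbor distance;
      * points are drawn uniformly inside the ball, standing in for the
        paper's dimension-wise distance iteration.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n, d = X_min.shape
    if n < 2:
        raise ValueError("need at least two minority samples")

    # Pairwise Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = dist.argmin(axis=1)

    # Randomly draw as many seed samples as new points are needed.
    seeds = rng.integers(0, n, size=n_new)
    neighbors = nearest[seeds]

    # Sphere center: midpoint of the segment joining seed and neighbor.
    centers = (X_min[seeds] + X_min[neighbors]) / 2.0
    radii = dist[seeds, neighbors] / 2.0  # assumed radius

    # Uniform sample in a d-ball: random direction times r * u**(1/d).
    directions = rng.standard_normal((n_new, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    scale = radii * rng.random(n_new) ** (1.0 / d)
    return centers + directions * scale[:, None]


if __name__ == "__main__":
    # Two well-separated minority pairs, each pair 2 units apart.
    X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]])
    synthetic = hypersphere_oversample(X, n_new=5, rng=0)
    print(synthetic.shape)
```

Unlike SMOTE, which places each new point on the segment between a sample and its neighbor, this sketch lets new points fall anywhere in a ball around the segment's midpoint, which is the source of the added spatial diversity the abstract refers to.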

     
