Multi-Target Regression via Low-Rank Matrix Factorization for Data with Label Noise

刘志刚; 刘森泽

doi:10.19304/J.ISSN1000-7180.2021.0787


Abstract

Multi-target regression estimates multiple continuous target values from a single set of input variables and is widely used in data mining. Existing work on multi-target regression assumes that the label values are accurate. In practice, however, some labels in a dataset may be inaccurate, that is, corrupted by noise, and traditional multi-target regression methods perform poorly in this setting. To address this problem, the proposed method exploits the large sample sizes available in big data to extract the correlations among labels, and then uses the label correlation matrix to reconstruct the labels. Because multi-target problems usually involve many labels, this reconstruction can dilute the noise in individual labels to some extent. The idea is formulated as a mathematical optimization problem using low-rank matrix factorization, and the kernel trick is then introduced to improve the model's nonlinear fitting ability. Finally, a non-convex approximation is used to solve the optimization problem, which safeguards the prediction performance of the multi-target regression model. In experiments on 18 datasets against 6 existing multi-target regression methods, the proposed method shows a clear performance advantage when the sample size is large.
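The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that the label reconstruction can be approximated by a truncated SVD of the label matrix, and that the kernelized regressor can be approximated by ordinary kernel ridge regression with an RBF kernel. The function names, the synthetic data, and the parameters `rank`, `gamma`, and `lam` are all hypothetical choices made for illustration; the non-convex approximation mentioned in the abstract is not implemented here.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def low_rank_labels(Y, rank):
    """Rebuild every label from the top-`rank` directions shared across labels."""
    mean = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    V = Vt[:rank].T                      # (n_targets, rank) basis
    return (Y - mean) @ V @ V.T + mean   # V @ V.T mixes correlated labels

def fit_predict(X_train, Y_train, X_test, rank=2, gamma=0.5, lam=0.1):
    """Kernel ridge regression fitted to the low-rank-reconstructed labels."""
    Y_rec = low_rank_labels(Y_train, rank)
    K = rbf_kernel(X_train, X_train, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(K.shape[0]), Y_rec)
    return rbf_kernel(X_test, X_train, gamma) @ alpha

# Synthetic data (assumed for illustration): 8 targets driven by 2 latent factors.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(600, 2))
Z = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 1])])
B = np.array([[1.0, 2.0, -1.0, 0.5, 1.5, -2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, -1.0, 2.0, 0.5, -1.5, 1.0]])
Y_clean = Z @ B
Y_noisy = Y_clean + 0.3 * rng.normal(size=Y_clean.shape)  # simulated label noise

Y_denoised = low_rank_labels(Y_noisy[:500], rank=2)
pred = fit_predict(X[:500], Y_noisy[:500], X[500:], rank=2)

print("label error before reconstruction:", np.linalg.norm(Y_noisy[:500] - Y_clean[:500]))
print("label error after reconstruction: ", np.linalg.norm(Y_denoised - Y_clean[:500]))
```

In this toy setting the targets are strongly correlated because they share two latent factors, so projecting the noisy labels onto a rank-2 subspace discards most of the noise that is independent across targets. That is the "dilution" effect the abstract describes, in its simplest form.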
