李新鹏, 高欣, 何杨, 阎博, 孙汉旭, 李军良, 徐建航, 刘震宇, 庞博. 不平衡数据集下基于自适应加权Bagging-GBDT算法的磁盘故障预测模型[J]. 微电子学与计算机, 2020, 37(3): 14-19.
LI Xin-peng, GAO Xin, HE Yang, YAN Bo, SUN Han-xu, LI Jun-liang, XU Jian-hang, LIU Zhen-yu, PANG Bo. Prediction model of disk failure based on adaptive weighted bagging-GBDT algorithm under imbalanced dataset[J]. Microelectronics & Computer, 2020, 37(3): 14-19.

不平衡数据集下基于自适应加权Bagging-GBDT算法的磁盘故障预测模型

Prediction model of disk failure based on adaptive weighted bagging-GBDT algorithm under imbalanced dataset

  • Abstract: The numbers of positive and negative samples in disk datasets are severely imbalanced, which tends to lower the failure prediction accuracy of machine-learning classifiers. To address this problem, this paper proposes a disk failure prediction model based on an adaptive weighted Bagging-GBDT algorithm. First, a clustering-based stratified undersampling method draws samples from the healthy-disk data several times, which avoids the tendency of random undersampling to discard potentially useful samples. Second, the healthy-disk samples from each draw are combined with all failed-disk samples to form multiple training subsets, and a GBDT sub-classifier with high prediction accuracy is trained on each subset. Finally, the weight of each sub-model is determined adaptively from the class labels of the samples in the neighborhood of the test point, and the sub-models are integrated into the final disk failure prediction model by weighted hard voting. Experiments on 8 KEEL imbalanced datasets show that, compared with existing typical imbalanced learning algorithms, the recall of the minority class improves by 9.46% on average. Comparative experiments on public disk datasets and on disk data from a scheduling system further verify the method's advantage in failure prediction rate.
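The sampling and voting steps described in the abstract can be illustrated with a small sketch. The abstract does not specify the clustering algorithm, the per-cluster sampling quotas, or the weight formula, so everything below is an assumption made for illustration: cluster labels are taken as already computed, each cluster contributes samples in proportion to its size, and each sub-model is weighted by its accuracy on the k training points nearest to the test point. The function names and the toy sub-models (plain callables standing in for trained GBDT classifiers) are hypothetical.

```python
import math
import random
from collections import defaultdict


def stratified_undersample(cluster_ids, n_draw, rng):
    """Pick roughly n_draw majority-sample indices, proportionally per cluster.

    cluster_ids[i] is the precomputed cluster of majority sample i (assumed input).
    Every cluster contributes at least one sample, so no region is dropped entirely.
    """
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)
    total = len(cluster_ids)
    chosen = []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        quota = max(1, round(n_draw * len(members) / total))
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return sorted(chosen)


def neighborhood_weights(models, train_X, train_y, x, k):
    """Assumed weight rule: a sub-model's accuracy on the k training points nearest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    nearest = order[:k]
    return [
        sum(model(train_X[i]) == train_y[i] for i in nearest) / len(nearest)
        for model in models
    ]


def weighted_hard_vote(models, weights, x):
    """Each sub-model votes for one class label; votes are scaled by the weights."""
    tally = defaultdict(float)
    for model, w in zip(models, weights):
        tally[model(x)] += w
    # Ties go to the larger label; here 1 = failed disk (the minority class).
    return max(tally, key=lambda label: (tally[label], label))


# Toy data: one feature per disk, label 1 = failed, 0 = healthy.
train_X = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,)]
train_y = [0, 0, 0, 1, 1, 1]

# Stand-ins for trained sub-classifiers: one sensible, two that always predict failure.
models = [lambda x: 1 if x[0] > 5 else 0, lambda x: 1, lambda x: 1]

x_test = (0.5,)
weights = neighborhood_weights(models, train_X, train_y, x_test, k=3)
print("adaptive weights:", weights)
print("weighted vote:", weighted_hard_vote(models, weights, x_test))
print("unweighted vote:", weighted_hard_vote(models, [1.0, 1.0, 1.0], x_test))

# Undersampling demo: 10 healthy samples in three clusters of size 6, 3 and 1.
picked = stratified_undersample([0] * 6 + [1] * 3 + [2], n_draw=5, rng=random.Random(0))
print("picked healthy indices:", picked)
```

In this toy case the two always-failure models are wrong on every neighbor of the test point, so their weights drop to zero and the weighted vote follows the one locally accurate model, whereas an unweighted majority vote would not. In a fuller version, each callable would be replaced by a GBDT classifier (for example scikit-learn's `GradientBoostingClassifier`) trained on one undersampled subset plus all failed-disk samples.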

     
