HUANG Zhen, QIAN Yu-rong, YU Jiong, YING Chang-tian, ZHAO Jing-xia. A Parallel Acceleration Strategy for Distributed DBN in Spark[J]. Microelectronics & Computer, 2018, 35(11): 100-105.

A Parallel Acceleration Strategy for Distributed DBN in Spark

  • Abstract: The Distributed Deep Belief Network (DDBN) in Spark suffers from data skew, a lack of fine-grained data replacement, and an inability to automatically cache highly reused data, which leads to high computational complexity and poor runtime efficiency. To improve the timeliness of DDBN, a data-parallel acceleration strategy for DDBN in Spark is proposed, comprising a Label Set based on Range Partition (LSRP) algorithm and a Cache Replacement based on Weight (CRW) algorithm. The LSRP algorithm addresses the data-skew problem, while the CRW algorithm handles RDD (Resilient Distributed Datasets) reuse and the memory shortage caused by caching too much data. Experimental results show that, compared with the traditional DBN, DDBN training is about 2.3 times faster, and LSRP and CRW greatly improve the degree of distributed parallelism of DDBN.
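The abstract does not give implementation details for LSRP, so the following is only a minimal sketch of the general idea of label-aware partitioning against data skew: group training samples by label, then deal each label's samples round-robin across partitions so every partition sees a near-uniform label distribution before parallel DBN training. All names here (`lsrp_partition`, the toy data) are hypothetical.

```python
from collections import defaultdict

def lsrp_partition(samples, labels, num_partitions):
    """Label-aware partitioning sketch: group sample indices by label,
    then deal each label's samples round-robin across partitions so
    every partition gets a near-uniform label mix (mitigating skew)."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    partitions = [[] for _ in range(num_partitions)]
    cursor = 0
    for label in sorted(by_label):          # deterministic label order
        for idx in by_label[label]:
            partitions[cursor].append(samples[idx])
            cursor = (cursor + 1) % num_partitions
    return partitions

# Skewed toy data: six samples of class 0, two of class 1.
samples = list(range(8))
labels = [0, 0, 0, 0, 0, 0, 1, 1]
parts = lsrp_partition(samples, labels, 2)
# Both partitions end up the same size, each with one class-1 sample.
```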

     
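Likewise, the CRW policy is only named in the abstract, not specified. A plain-Python sketch of weight-based cache replacement, under the assumption that an entry's weight grows with its reuse count and recomputation cost and shrinks with its size (the class name, weight formula, and fields are all hypothetical):

```python
class CRWCache:
    """Sketch of weight-based cache replacement (CRW): each cached
    RDD-like entry gets a weight from its reuse count, recomputation
    cost, and size; when the size budget is exceeded, the lowest-weight
    entry is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity   # total size budget
        self.entries = {}          # key -> (value, size, cost, uses)

    def _weight(self, size, cost, uses):
        # Assumed weighting: frequently reused, costly-to-recompute,
        # small entries are worth keeping in memory.
        return uses * cost / size

    def get(self, key):
        if key in self.entries:
            value, size, cost, uses = self.entries[key]
            self.entries[key] = (value, size, cost, uses + 1)
            return value
        return None                # cache miss: caller recomputes

    def put(self, key, value, size, cost):
        self.entries[key] = (value, size, cost, 1)
        self._evict()

    def _evict(self):
        while sum(s for _, s, _, _ in self.entries.values()) > self.capacity:
            victim = min(self.entries,
                         key=lambda k: self._weight(*self.entries[k][1:]))
            del self.entries[victim]

cache = CRWCache(capacity=10)
cache.put("rdd_a", "A", size=4, cost=2)
cache.put("rdd_b", "B", size=4, cost=2)
cache.get("rdd_a")
cache.get("rdd_a")                       # rdd_a is reused: weight rises
cache.put("rdd_c", "C", size=4, cost=3)  # over budget: evict lowest weight
# rdd_b (never reused, cheap to recompute) is evicted; rdd_a and rdd_c stay.
```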
