Kai Zheng created HADOOP-13944:
----------------------------------

             Summary: [HDL] Support Deep Learning on Hadoop
                 Key: HADOOP-13944
                 URL: https://issues.apache.org/jira/browse/HADOOP-13944
             Project: Hadoop Common
          Issue Type: New Feature
            Reporter: Kai Zheng


Big data empowers Deep Learning (DL), and Hadoop is a natural platform to 
support this new kind of computation, given its enormous data (HDFS) and vast 
CPU resources (YARN). Supporting Deep Learning in the Hadoop platform layer has 
particular advantages: it is much easier to achieve the desired data affinity 
and hardware-specific scheduling, and it stays flexible enough to support the 
computing and user-facing frameworks above it, such as Spark, Hive, Flink and 
Streams.

We’d like to propose evolving Hadoop to further embrace Deep Learning and to 
provide the fundamental infrastructure to support this new computing. Briefly, 
the goals would be:
* A new layer in Hadoop for launching, distributing and executing Deep Learning 
workloads, as is done for MapReduce;
* A framework in the new layer to leverage and support existing Deep Learning 
engines such as TensorFlow, Caffe/Intel-Caffe, MXNet, Nervana, etc.;
* Extend and enhance YARN to support the desired scheduling capabilities, as 
already raised in the community, for FPGA, GPU, etc. (see the container request 
sketch after this list);
* Optimize HDFS storage and provide desired data formats for Deep Learning;
* Tools and libraries to submit and manage DL jobs, plus the necessary web UIs 
for monitoring and troubleshooting;
* Optionally, for the long term, a common Deep Learning domain representation 
that lets users define DL jobs independently of concrete DL engines.
Out of scope: a new Deep Learning engine. We leverage and support existing DL 
engines, while also allowing users to hook in their own.
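
To make the YARN scheduling goal concrete, here is a minimal sketch of how a DL 
framework layer might ask for a GPU-capable worker container, assuming the 
extensible resource-types support discussed in the community lands; the 
"yarn.io/gpu" resource name, the memory/vcore sizes and the priority are 
illustrative assumptions, not an agreed API:

    // Sketch only: request a GPU-capable container for a DL worker through the
    // AM-RM client, assuming extensible resource types (e.g. "yarn.io/gpu").
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class DlWorkerRequestSketch {
      public static ContainerRequest buildGpuWorkerRequest() {
        // 8 GB of memory and 4 vcores for one DL worker (illustrative sizes)
        Resource capability = Resource.newInstance(8192, 4);
        // Hypothetical: ask for 2 GPUs via an extensible resource type
        capability.setResourceValue("yarn.io/gpu", 2);
        // No node/rack constraints here; locality hints could come from HDFS
        return new ContainerRequest(capability, null, null, Priority.newInstance(1));
      }
    }

The exact API would be whatever the YARN resource-types work settles on; the 
point is that a DL job only declares the hardware it needs and YARN handles the 
placement.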

The rationale:
* Deep Learning is data and I/O heavy; the related advantages in HDFS and 
Hadoop: vast data to learn from, already in place or easy to load; data 
locality, still desired in DL (see the block-location sketch after this list); 
tiered storage support, to use faster devices like NVMe SSD, 3D XPoint and 
persistent memory; cache support, to use large memory for hot or repeatedly 
accessed data; even Ozone, the key-value store for large numbers of small 
objects with the desired API; and cloud support.
* Deep Learning is compute heavy; the related advantages in YARN: flexibility, 
to support complex computing frameworks and applications; hardware-capability 
awareness, to schedule and distribute work accordingly, thinking of FPGA, GPU 
and RDMA; large scale, with proven scalability to thousands of nodes; and nice 
facilities such as the timeline service and rich interfaces (commands, REST and 
web).
* As a common, low-level facility layer, it is easier to optimize at the 
bottom, yet powerful enough to support the frameworks above it, such as Spark, 
Flink, Hive and Streams. There is no need to hack everywhere; the work happens 
in one central place and common layer.
* Security, enterprise readiness and distribution: a mature ecosystem for Deep 
Learning to build upon.
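
As an illustration of the data-locality point above, here is a minimal sketch 
of how a DL job launcher could discover where a training shard's blocks live 
and feed those hosts into its container requests; the dataset path is 
hypothetical:

    // Sketch only: list the DataNode hosts holding each block of a training
    // shard, so a DL framework layer can request containers near the data.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TrainingDataLocalitySketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path shard = new Path("/data/imagenet/shard-00000"); // hypothetical path
        FileStatus status = fs.getFileStatus(shard);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          // Each block reports the hosts holding a replica; these hosts can be
          // passed as locality hints in the YARN container requests.
          for (String host : block.getHosts()) {
            System.out.println(block.getOffset() + " -> " + host);
          }
        }
      }
    }

Combined with the GPU request sketch above, this is the kind of data affinity 
plus hardware-specific scheduling the proposal aims to provide in one common 
layer.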

This is based on our survey and some preliminary work such as TensorFlow on 
YARN (we will document and discuss it separately under this umbrella). We 
welcome your feedback and valuable thoughts. Once aligned, we’d like to 
contribute our work to the Hadoop project space (perhaps a new module like 
hadoop-deeplearning, similar to the cloud support modules, in a separate 
branch), since from our point of view the work can benefit more Hadoop users 
there than in just a GitHub repo.

Filing this unassigned, as it is a team effort for now and, hopefully, will 
become a community effort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
