Hi,

You misunderstood me. What I meant is exactly that Spark should be aware of 
them, so I agree with you. My point is to also offer YARN's GPU/FPGA 
scheduling as an option alongside a potential Spark GPU/FPGA scheduler.

For the other proposal - yes, the interfaces are slow, but one has to think 
about where they need to be improved for optimal performance: in the ML 
framework, in Spark, or in both. My gut feeling is in both.
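
To make the Spark side concrete, here is a minimal sketch of the Arrow-backed 
scalar Pandas UDF path that Spark already ships; `load_my_model` and the 
DataFrame `df` are hypothetical stand-ins for any Python ML/DL framework and 
input data:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("double", PandasUDFType.SCALAR)
    def predict(features):
        # `features` arrives as a pandas Series backed by an Arrow batch, so
        # the JVM/Python boundary is crossed per batch rather than per row.
        model = load_my_model()  # hypothetical loader; cache per worker in practice
        return pd.Series(model.predict(features.values))

    scored = df.withColumn("score", predict(df["features"]))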

Best regards

> On 8. May 2018, at 07:11, Reynold Xin <r...@databricks.com> wrote:
> 
> I don't think it's sufficient to have them in YARN (or any other services) 
> without Spark aware of them. If Spark is not aware of them, then there is no 
> way to really efficiently utilize these accelerators when you run anything 
> that require non-accelerators (which is almost 100% of the cases in real 
> world workloads).
> 
> For the other two, the point is not to implement all the ML/DL algorithms in 
> Spark, but make Spark integrate well with ML/DL frameworks. Otherwise you 
> will have the problems I described (super low performance when exchanging 
> data between Spark and ML/DL frameworks, and hanging issues with MPI-based 
> programs).
> 
> 
>> On Mon, May 7, 2018 at 10:05 PM Jörn Franke <jornfra...@gmail.com> wrote:
>> Hadoop/YARN 3.1 added GPU scheduling, and 3.2 is planned to add FPGA 
>> scheduling, so it might be worth making the last point generic so that not 
>> only the Spark scheduler but all supported schedulers can use GPUs.
>> 
>> For the other 2 points I just wonder if it makes sense to address this in 
>> the ml frameworks themselves or in Spark.
>> 
>>> On 8. May 2018, at 06:59, Xiangrui Meng <m...@databricks.com> wrote:
>>> 
>>> Thanks Reynold for summarizing the offline discussion! I added a few 
>>> comments inline. -Xiangrui
>>> 
>>>> On Mon, May 7, 2018 at 5:37 PM Reynold Xin <r...@databricks.com> wrote:
>>>> Hi all,
>>>> 
>>>> Xiangrui and I were discussing with a heavy Apache Spark user last week on 
>>>> their experiences integrating machine learning (and deep learning) 
>>>> frameworks with Spark and some of their pain points. Couple things were 
>>>> obvious and I wanted to share our learnings with the list.
>>>> 
>>>> (1) Most organizations already use Spark for data plumbing and want to be 
>>>> able to run their ML part of the stack on Spark as well (not necessarily 
>>>> re-implementing all the algorithms but by integrating various frameworks 
>>>> like tensorflow, mxnet with Spark).
>>>> 
>>>> (2) The integration is however painful, from the systems perspective:
>>>> 
>>>> Performance: data exchange between Spark and other frameworks is slow, 
>>>> because UDFs across process boundaries (with native code) are slow. This 
>>>> works much better now with Pandas UDFs (given that a lot of the ML/DL 
>>>> frameworks are in Python). However, there might still be some low-hanging 
>>>> fruit here.
>>> The Arrow support behind Pandas UDFs can be reused to exchange data with 
>>> other frameworks. One possible performance improvement is to support 
>>> pipelining when supplying data to other frameworks: for example, while 
>>> Spark is pumping data from external sources into TensorFlow, TensorFlow 
>>> starts the computation on GPUs. This would significantly improve speed and 
>>> resource utilization.
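
That pipelining could look roughly like the sketch below on the framework 
side: per-partition batches are fed into tf.data with prefetching so the GPU 
works on one batch while Spark supplies the next. This assumes TensorFlow 
with eager execution; `to_numpy_batches`, `run_training_step` and the feature 
width are hypothetical:

    import tensorflow as tf

    def process_partition(rows):
        # `rows` is the iterator Spark hands to each partition; turning it
        # into fixed-size numpy batches (e.g. via Arrow) is assumed here.
        def batches():
            for batch in to_numpy_batches(rows):      # hypothetical helper
                yield batch

        ds = (tf.data.Dataset
              .from_generator(batches, output_types=tf.float32,
                              output_shapes=(None, 128))  # assumed feature width
              .prefetch(2))  # overlap data feeding with GPU compute

        for batch in ds:
            run_training_step(batch)                  # hypothetical training step

    df.rdd.foreachPartition(process_partition)
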
>>>> Fault tolerance and execution model: Spark assumes fine-grained task 
>>>> recovery, i.e. if something fails, only that task is rerun. This doesn’t 
>>>> match the execution model of distributed ML/DL frameworks that are 
>>>> typically MPI-based, and rerunning a single task would lead to the entire 
>>>> system hanging. A whole stage needs to be re-run.
>>> This is useful not only for integrating with 3rd-party frameworks, but 
>>> also for scaling MLlib algorithms. One of my earliest attempts in Spark 
>>> MLlib was to implement an all-reduce primitive (SPARK-1485), but we ended 
>>> up with compromise solutions. With the new execution model, we can set up 
>>> a hybrid cluster and do all-reduce properly.
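
To make the gang-scheduling idea concrete, here is a minimal sketch of what 
whole-stage execution could look like from the user side, using the 
barrier-style API that newer Spark releases expose; it is only an 
illustration, and `run_mpi_job` is a hypothetical distributed training step:

    from pyspark import BarrierTaskContext

    def train(iterator):
        # All tasks of a barrier stage are launched together; if one fails,
        # the whole stage is retried, matching the MPI-style execution model.
        ctx = BarrierTaskContext.get()
        peers = [info.address for info in ctx.getTaskInfos()]
        ctx.barrier()                     # global synchronization point
        run_mpi_job(peers, iterator)      # hypothetical MPI/all-reduce job
        return iter([])

    # barrier() marks the stage for gang scheduling and whole-stage retry.
    df.rdd.barrier().mapPartitions(train).collect()
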
>>>  
>>>> Accelerator-aware scheduling: The DL frameworks leverage GPUs and 
>>>> sometimes FPGAs as accelerators for speedup, and Spark’s scheduler isn’t 
>>>> aware of those resources, leading to either over-utilizing the 
>>>> accelerators or under-utilizing the CPUs.
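
Such awareness could surface to users as declarative resource requests, 
roughly like the sketch below; the configuration keys and the resources() 
call follow the accelerator-aware scheduling added in newer Spark releases 
and are shown here only as an illustration:

    from pyspark.sql import SparkSession

    # Ask the cluster manager for GPUs per executor and tell the scheduler
    # how many GPUs each task needs, so tasks are packed without over- or
    # under-subscribing accelerators. A GPU discovery script is also needed
    # on real clusters.
    spark = (SparkSession.builder
             .config("spark.executor.resource.gpu.amount", "2")
             .config("spark.task.resource.gpu.amount", "1")
             .getOrCreate())

    def use_gpu(rows):
        from pyspark import TaskContext
        # resources() exposes the GPU addresses assigned to this task.
        gpus = TaskContext.get().resources()["gpu"].addresses
        return iter([gpus])

    spark.sparkContext.parallelize(range(2), 2).mapPartitions(use_gpu).collect()
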
>>>> 
>>>> The good thing is that none of these seem very difficult to address (and 
>>>> we have already made progress on one of them). Xiangrui has graciously 
>>>> accepted the challenge to come up with solutions and an SPIP for these.
>>>> 
>>> 
>>> I will do more homework, exploring existing JIRAs or creating new JIRAs 
>>> for the proposal. We'd like to hear your feedback and about past efforts 
>>> along those directions if they were not fully captured by our JIRAs.
>>>  
>>>> Xiangrui - please also chime in if I didn’t capture everything. 
>>>> 
>>>> 
>>> -- 
>>> Xiangrui Meng
>>> Software Engineer
>>> Databricks Inc. 
