Hi Peter,

Thanks for the additional information - this is really helpful (I definitely got more than I was looking for :-)
Cheers,
Peter

On Fri, Oct 19, 2018 at 12:53 PM Peter Rudenko <petro.rude...@gmail.com> wrote:

> Hi Peter, we're using a part of Crail - its core library, called disni
> (https://github.com/zrlio/disni/). We couldn't reproduce the results from
> that blog post. In any case, Crail is a more platform-style approach (it
> comes with its own file system), while SparkRDMA is a pluggable approach -
> it's just a plugin that you can enable/disable for a particular workload,
> and you can use any Hadoop vendor, etc.
>
> The best optimization for shuffle between local JVMs could be using
> something like short-circuit local reads
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html)
> to use a unix socket for local communication, or to directly read a part
> of the other JVM's shuffle file. But yes, that's not available in Spark
> out of the box.
>
> Thanks,
> Peter Rudenko
>
> On Fri, Oct 19, 2018 at 16:54 Peter Liu <peter.p...@gmail.com> wrote:
>
>> Hi Peter,
>>
>> thank you for the reply and detailed information! Would this be
>> something comparable with Crail?
>> (http://crail.incubator.apache.org/blog/2017/11/rdmashuffle.html)
>> I was looking more for something simple/quick that makes the shuffle
>> between the local JVMs faster (like the idea of using a local ram disk)
>> for my simple use case.
>>
>> Of course, a general and thorough implementation should cover the
>> shuffle between nodes as its major focus. Hmm, looks like there is no
>> such implementation within Spark itself yet.
>>
>> Very much appreciated!
>>
>> Peter
>>
>> On Fri, Oct 19, 2018 at 9:38 AM Peter Rudenko <petro.rude...@gmail.com> wrote:
>>
>>> Hey Peter, in the SparkRDMA shuffle plugin
>>> (https://github.com/Mellanox/SparkRDMA) we're using mmap of the
>>> shuffle file to do Remote Direct Memory Access. If the shuffle data is
>>> bigger than RAM, Mellanox NICs support On-Demand Paging, where the OS
>>> invalidates translations which are no longer valid due to either
>>> non-present pages or mapping changes. So if you have an RDMA-capable
>>> NIC (or you can try one in the Azure cloud:
>>> https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/),
>>> have a try. For network-intensive apps you should get better
>>> performance.
>>>
>>> Thanks,
>>> Peter Rudenko
>>>
>>> On Thu, Oct 18, 2018 at 18:07 Peter Liu <peter.p...@gmail.com> wrote:
>>>
>>>> I would be very interested in the initial question here:
>>>>
>>>> is there a production-level implementation of a memory-only,
>>>> configurable shuffle (similar to the MEMORY_ONLY and MEMORY_AND_DISK
>>>> storage levels) as mentioned in this ticket,
>>>> https://github.com/apache/spark/pull/5403 ?
>>>>
>>>> It would be a quite practical and useful option/feature. Not sure
>>>> what the status of this ticket's implementation is?
>>>>
>>>> Thanks!
>>>>
>>>> Peter
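As a reference for the short-circuit local reads mentioned above, the
client-side settings could look roughly like this from Spark (a sketch:
the property names come from the linked HDFS docs, the socket path is a
placeholder, and the DataNodes must be configured with the same path for
this to take effect):

    import org.apache.spark.SparkConf

    // The spark.hadoop.* prefix forwards these settings into the Hadoop
    // configuration on both the driver and the executors.
    val conf = new SparkConf()
      .setAppName("short-circuit-read-sketch")
      // Read local HDFS blocks directly instead of going through the
      // DataNode's TCP port.
      .set("spark.hadoop.dfs.client.read.shortcircuit", "true")
      // Placeholder path; must match the DataNode's dfs.domain.socket.path.
      .set("spark.hadoop.dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket")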
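And since SparkRDMA is described above as a plugin you can enable/disable
per workload, turning it on should be a matter of a few configuration
lines, along these lines (a sketch: the jar path is hypothetical, and the
shuffle-manager class name should be verified against the
Mellanox/SparkRDMA README for your version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("rdma-shuffle-sketch")
      // Swap in the RDMA shuffle manager; remove this line to fall back
      // to Spark's default sort-based shuffle.
      .set("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
      // Hypothetical jar location: the plugin must be on both the driver
      // and executor classpaths.
      .set("spark.driver.extraClassPath", "/path/to/spark-rdma.jar")
      .set("spark.executor.extraClassPath", "/path/to/spark-rdma.jar")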
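For contrast with the question above: the storage levels it refers to are
already configurable for cached data, just not for shuffle output. For
example:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("storage-level-demo").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000000)

    // Keep cached partitions in memory only; evicted partitions are
    // recomputed from lineage rather than spilled.
    rdd.persist(StorageLevel.MEMORY_ONLY)
    // Alternatively, StorageLevel.MEMORY_AND_DISK spills evicted
    // partitions to disk instead of recomputing them.
    rdd.count()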
>>>> On Thu, Oct 18, 2018 at 6:51 AM ☼ R Nair <ravishankar.n...@gmail.com> wrote:
>>>>
>>>>> Thanks.. great info. Will try and let all know.
>>>>>
>>>>> Best
>>>>>
>>>>> On Thu, Oct 18, 2018, 3:12 AM onmstester onmstester <onmstes...@zoho.com> wrote:
>>>>>
>>>>>> Create the ramdisk:
>>>>>> mount tmpfs /mnt/spark -t tmpfs -o size=2G
>>>>>>
>>>>>> Then point spark.local.dir to the ramdisk. How to set it depends on
>>>>>> your deployment strategy; for me it was through the SparkConf object
>>>>>> before passing it to SparkContext:
>>>>>> conf.set("spark.local.dir","/mnt/spark")
>>>>>>
>>>>>> To validate that Spark is actually using your ramdisk (by default it
>>>>>> uses /tmp), ls the ramdisk after running some jobs and you should
>>>>>> see Spark directories (with the date in the directory name) on your
>>>>>> ramdisk.
>>>>>>
>>>>>> On Wed, 17 Oct 2018 18:57:14 +0330 ☼ R Nair <ravishankar.n...@gmail.com> wrote:
>>>>>>
>>>>>> What are the steps to configure this? Thanks
>>>>>>
>>>>>> On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester <onmstes...@zoho.com.invalid> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> I failed to configure Spark for in-memory shuffle, so currently I'm
>>>>>> just using a Linux memory-mapped directory (tmpfs) as Spark's
>>>>>> working directory, so everything is fast.
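Putting the ramdisk recipe above together, a minimal end-to-end sketch
(assuming the tmpfs is already mounted at /mnt/spark on every worker, and
noting that some deployments override spark.local.dir via
SPARK_LOCAL_DIRS or the cluster manager's own settings):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("tmpfs-shuffle-demo")
      // Point scratch space (shuffle files, spills) at the ramdisk
      // instead of the default /tmp.
      .set("spark.local.dir", "/mnt/spark")

    val sc = new SparkContext(conf)

    // Run a job with a shuffle stage; afterwards, `ls /mnt/spark` on each
    // worker should show spark-* scratch directories, confirming the
    // ramdisk is being used.
    sc.parallelize(1 to 1000000).map(i => (i % 100, i)).reduceByKey(_ + _).count()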