Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
Hi Peter, thanks for the additional information - this is really helpful (I definitely got more than I was looking for :-) Cheers, Peter On Fri, Oct 19, 2018 at 12:53 PM Peter Rudenko wrote: > Hi Peter, we're using a part of Crail - its core library, called disni ( > https://github.com/zr

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Hi Peter, we're using a part of Crail - its core library, called disni ( https://github.com/zrlio/disni/). We couldn't reproduce the results from that blog post. In any case, Crail is more of a platform approach (it comes with its own file system), while SparkRDMA is a pluggable approach - it's just a plugi

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
Hi Peter, thank you for the reply and detailed information! Would this be something comparable to Crail? ( http://crail.incubator.apache.org/blog/2017/11/rdmashuffle.html) I was looking more for something simple/quick to make the shuffle between the local JVMs quicker (like the idea of using local

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Hey Peter, in the SparkRDMA shuffle plugin ( https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file to do Remote Direct Memory Access. If the shuffle data is bigger than RAM, Mellanox NICs support On Demand Paging, where the OS invalidates translations which are no longer valid due to eith

Re: Spark In Memory Shuffle / 5403

2018-10-18 Thread Peter Liu
I would be very interested in the initial question here: is there a production-level implementation of a configurable memory-only shuffle (similar to the MEMORY_ONLY and MEMORY_AND_DISK storage levels), as mentioned in this ticket, https://github.com/apache/spark/pull/5403 ? It would be

Re: Spark In Memory Shuffle

2018-10-18 Thread ☼ R Nair
Thanks, great info. Will try it and let everyone know. Best On Thu, Oct 18, 2018, 3:12 AM onmstester onmstester wrote: > create the ramdisk: > mount tmpfs /mnt/spark -t tmpfs -o size=2G > > then point spark.local.dir to the ramdisk, which depends on your > deployment strategy, for me it was through Spa

Re: Spark In Memory Shuffle

2018-10-18 Thread onmstester onmstester
Create the ramdisk:
mount tmpfs /mnt/spark -t tmpfs -o size=2G
Then point spark.local.dir to the ramdisk. How you do this depends on your deployment strategy; for me it was through the SparkConf object before passing it to SparkContext:
conf.set("spark.local.dir","/mnt/spark")
To validate that spark is actual
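One way to sanity-check the setup above is to confirm that the directory given to spark.local.dir really sits on a tmpfs mount. The following is only a minimal sketch under stated assumptions: it is not the validation step from the (truncated) message, the helper name `fs_type_for` and the sample mount table are hypothetical, and it assumes a Linux host where /proc/mounts lists one mount per line as "device mount_point fs_type options ...".

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is expected to hold the contents of /proc/mounts.
    The longest matching mount point wins; on a tie, the later entry wins.
    Returns None if no mount point matches.
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Match the mount point itself or anything strictly below it,
        # so that /mnt/sparkle does not match /mnt/spark.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


# Hypothetical mount table, shaped like /proc/mounts after the mount command above.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /mnt/spark tmpfs rw,relatime,size=2097152k 0 0
"""

print(fs_type_for("/mnt/spark/blockmgr-0", SAMPLE_MOUNTS))  # tmpfs
```

On a real machine, the second argument would be the text read from /proc/mounts, and the first would be whatever path was set as spark.local.dir.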

Re: Spark In Memory Shuffle

2018-10-17 Thread ☼ R Nair
What are the steps to configure this? Thanks On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester wrote: > Hi, > I failed to configure Spark for in-memory shuffle, so currently I'm just > using a Linux memory-backed directory (tmpfs) as Spark's working directory, > so everything is fast > > Sent using Zoh

Re: Spark In Memory Shuffle

2018-10-17 Thread Gourav Sengupta
Super duper, I also need to try this out. On Wed, Oct 17, 2018 at 2:39 PM onmstester onmstester wrote: > Hi, > I failed to configure Spark for in-memory shuffle, so currently I'm just > using a Linux memory-backed directory (tmpfs) as Spark's working directory, > so everything is fast > > Sent using Zoho

Re: Spark In Memory Shuffle

2018-10-17 Thread onmstester onmstester
Hi, I failed to configure Spark for in-memory shuffle, so currently I'm just using a Linux memory-backed directory (tmpfs) as Spark's working directory, so everything is fast Sent using Zoho Mail On Wed, 17 Oct 2018 16:41:32 +0330 thomas lavocat wrote Hi everyone, The possibility to have in m

Spark In Memory Shuffle

2018-10-17 Thread thomas lavocat
Hi everyone, The possibility of in-memory shuffling is discussed in this issue https://github.com/apache/spark/pull/5403. That was in 2015. In 2016 the paper "Scaling Spark on HPC Systems" says that Spark still shuffles using disks. I would like to know: What is the current state of in