Hi Peter,

Thanks for the additional information - this is really helpful (I definitely got more than I was looking for :-)
Cheers,
Peter

On Fri, Oct 19, 2018 at 12:53 PM Peter Rudenko <petro.rude...@gmail.com> wrote:

> Hi Peter, we're using a part of Crail - its core library, called disni
> (https://github.com/zrlio/disni/). We couldn't reproduce the results from
> that blog post. In any case, Crail is a more platform-style approach (it
> comes with its own file system), while SparkRDMA is a pluggable approach -
> it's just a plugin that you can enable/disable for a particular workload,
> and you can use any Hadoop vendor, etc.
>
> The best optimization for shuffle between local JVMs could be using
> something like short-circuit local reads
> (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html)
> to use a unix socket for local communication, or to directly read a part
> of the other JVM's shuffle file. But yes, that's not available in Spark
> out of the box.
>
> Thanks,
> Peter Rudenko
>
> On Fri, Oct 19, 2018 at 16:54 Peter Liu <peter.p...@gmail.com> wrote:
>
>> Hi Peter,
>>
>> thank you for the reply and detailed information! Would this be
>> something comparable with Crail?
>> (http://crail.incubator.apache.org/blog/2017/11/rdmashuffle.html)
>> I was looking more for something simple/quick that makes the shuffle
>> between the local JVMs faster (like the idea of using a local ram disk)
>> for my simple use case.
>>
>> Of course, a general and thorough implementation should cover the
>> shuffle between nodes as its major focus. Hmm, looks like there is no
>> such implementation within Spark itself yet.
>>
>> Very much appreciated!
>>
>> Peter
>>
>> On Fri, Oct 19, 2018 at 9:38 AM Peter Rudenko <petro.rude...@gmail.com> wrote:
>>
>>> Hey Peter, in the SparkRDMA shuffle plugin
>>> (https://github.com/Mellanox/SparkRDMA) we're using mmap of the
>>> shuffle file to do Remote Direct Memory Access. If the shuffle data is
>>> bigger than RAM, Mellanox NICs support On-Demand Paging, where the OS
>>> invalidates translations which are no longer valid due to either
>>> non-present pages or mapping changes. So if you have an RDMA-capable
>>> NIC (or you can try one in the Azure cloud:
>>> https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/),
>>> have a try. For network-intensive apps you should get better
>>> performance.
>>>
>>> Thanks,
>>> Peter Rudenko
>>>
>>> On Thu, Oct 18, 2018 at 18:07 Peter Liu <peter.p...@gmail.com> wrote:
>>>
>>>> I would be very interested in the initial question here:
>>>>
>>>> is there a production-level implementation of a memory-only,
>>>> configurable shuffle (similar to the MEMORY_ONLY and MEMORY_AND_DISK
>>>> storage levels) as mentioned in this ticket,
>>>> https://github.com/apache/spark/pull/5403 ?
>>>>
>>>> It would be a quite practical and useful option/feature. Not sure
>>>> what the status of this ticket's implementation is?
>>>>
>>>> Thanks!
>>>>
>>>> Peter
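As a reference for the short-circuit local reads mentioned above, the
client-side settings could look roughly like this from Spark (a sketch:
the property names come from the linked HDFS docs, the socket path is a
placeholder, and the DataNodes must be configured with the same path for
this to take effect):

    import org.apache.spark.SparkConf

    // The spark.hadoop.* prefix forwards these settings into the Hadoop
    // configuration on both the driver and the executors.
    val conf = new SparkConf()
      .setAppName("short-circuit-read-sketch")
      // Read local HDFS blocks directly instead of going through the
      // DataNode's TCP port.
      .set("spark.hadoop.dfs.client.read.shortcircuit", "true")
      // Placeholder path; must match the DataNode's dfs.domain.socket.path.
      .set("spark.hadoop.dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket")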
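And since SparkRDMA is described above as a plugin you can enable/disable
per workload, turning it on should be a matter of a few configuration
lines, along these lines (a sketch: the jar path is hypothetical, and the
shuffle-manager class name should be verified against the
Mellanox/SparkRDMA README for your version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("rdma-shuffle-sketch")
      // Swap in the RDMA shuffle manager; remove this line to fall back
      // to Spark's default sort-based shuffle.
      .set("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
      // Hypothetical jar location: the plugin must be on both the driver
      // and executor classpaths.
      .set("spark.driver.extraClassPath", "/path/to/spark-rdma.jar")
      .set("spark.executor.extraClassPath", "/path/to/spark-rdma.jar")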
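For contrast with the question above: the storage levels it refers to are
already configurable for cached data, just not for shuffle output. For
example:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("storage-level-demo").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000000)

    // Keep cached partitions in memory only; evicted partitions are
    // recomputed from lineage rather than spilled.
    rdd.persist(StorageLevel.MEMORY_ONLY)
    // Alternatively, StorageLevel.MEMORY_AND_DISK spills evicted
    // partitions to disk instead of recomputing them.
    rdd.count()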
>>>> On Thu, Oct 18, 2018 at 6:51 AM ☼ R Nair <ravishankar.n...@gmail.com> wrote:
>>>>
>>>>> Thanks.. great info. Will try and let all know.
>>>>>
>>>>> Best
>>>>>
>>>>> On Thu, Oct 18, 2018, 3:12 AM onmstester onmstester <onmstes...@zoho.com> wrote:
>>>>>
>>>>>> Create the ramdisk:
>>>>>> mount tmpfs /mnt/spark -t tmpfs -o size=2G
>>>>>>
>>>>>> Then point spark.local.dir to the ramdisk. How to set it depends on
>>>>>> your deployment strategy; for me it was through the SparkConf object
>>>>>> before passing it to SparkContext:
>>>>>> conf.set("spark.local.dir","/mnt/spark")
>>>>>>
>>>>>> To validate that Spark is actually using your ramdisk (by default it
>>>>>> uses /tmp), ls the ramdisk after running some jobs and you should
>>>>>> see Spark directories (with the date in the directory name) on your
>>>>>> ramdisk.
>>>>>>
>>>>>> On Wed, 17 Oct 2018 18:57:14 +0330 ☼ R Nair <ravishankar.n...@gmail.com> wrote:
>>>>>>
>>>>>> What are the steps to configure this? Thanks
>>>>>>
>>>>>> On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester <onmstes...@zoho.com.invalid> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> I failed to configure Spark for in-memory shuffle, so currently I'm
>>>>>> just using a Linux memory-mapped directory (tmpfs) as Spark's
>>>>>> working directory, so everything is fast.
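Putting the ramdisk recipe above together, a minimal end-to-end sketch
(assuming the tmpfs is already mounted at /mnt/spark on every worker, and
noting that some deployments override spark.local.dir via
SPARK_LOCAL_DIRS or the cluster manager's own settings):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("tmpfs-shuffle-demo")
      // Point scratch space (shuffle files, spills) at the ramdisk
      // instead of the default /tmp.
      .set("spark.local.dir", "/mnt/spark")

    val sc = new SparkContext(conf)

    // Run a job with a shuffle stage; afterwards, `ls /mnt/spark` on each
    // worker should show spark-* scratch directories, confirming the
    // ramdisk is being used.
    sc.parallelize(1 to 1000000).map(i => (i % 100, i)).reduceByKey(_ + _).count()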