As I mentioned earlier, this flag is now ignored.
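For reference, a minimal sketch of the setting in question, assuming the Spark 1.6 Scala API (the app name is just a placeholder); it is still accepted but no longer changes behavior:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.shuffle.spill used to request that sorts/aggregations not spill;
    // under the tungsten-sort shuffle manager it is ignored, and shuffle
    // output is still written to spark.local.dir.
    val conf = new SparkConf()
      .setAppName("shuffle-spill-demo")     // placeholder
      .set("spark.shuffle.spill", "false")  // accepted, but a no-op here
    val sc = new SparkContext(conf)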
On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch <slavi...@gmail.com> wrote:

Shuffling a 1 TB set of keys and values (i.e., a sort by key) results in about 500 GB of disk I/O if compression is enabled. Is there any way to eliminate the I/O caused by shuffling?

On Fri, Apr 1, 2016, 6:32 PM Reynold Xin <r...@databricks.com> wrote:

Michael - I'm not sure you actually read my email, but spill has nothing to do with the shuffle files on disk. It was for the partitioning (i.e. sorting) process. If that flag is off, Spark will simply run out of memory when the data doesn't fit in memory.

On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <slavi...@gmail.com> wrote:

A RAM disk is a fine interim step, but a lot of layers are eliminated by keeping things in memory unless there is a need for spillover. At one time there was support for turning off spilling. That was eliminated. Why?

On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mri...@gmail.com> wrote:

I think Reynold's suggestion of using a ram disk would be a good way to test whether these are the bottlenecks or something else is. For most practical purposes, pointing the local dir to a ramdisk should effectively give you 'similar' performance to shuffling from memory.

Are there concerns with taking that approach to test? (I don't see any, but I am not sure if I missed something.)

Regards,
Mridul

On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <slavi...@gmail.com> wrote:

I totally disagree that it's not a problem.

- Network fetch throughput on 40G Ethernet exceeds the throughput of NVMe drives.
- What Spark is depending on is Linux's I/O cache as an effective buffer pool. This is fine for small jobs, but not for jobs with datasets in the TB-per-node range.
- On larger jobs, flushing the cache causes Linux to block.
- On a modern 56-hyperthread, 2-socket host, the latency caused by multiple executors writing out to disk increases greatly.

I thought the whole point of Spark was in-memory computing? It is in fact in-memory for some things, but it uses spark.local.dir as a buffer pool for others. Hence, the performance of Spark is gated by the performance of spark.local.dir, even on large-memory systems.

"Currently it is not possible to not write shuffle files to disk."

What changes *would* make it possible? The only one that seems feasible is to clone the shuffle service and make it in-memory.

On Apr 1, 2016, at 4:57 PM, Reynold Xin <r...@databricks.com> wrote:

spark.shuffle.spill actually has nothing to do with whether we write shuffle files to disk. Currently it is not possible to avoid writing shuffle files to disk, and typically that is not a problem because network fetch throughput is lower than what disks can sustain. In most cases, especially with SSDs, there is little difference between keeping all of it in memory and keeping it on disk.

However, it is becoming more common to run Spark on a small number of beefy nodes (e.g. 2 nodes, each with 1 TB of RAM). We do want to look into improving performance for those. In the meantime, you can set up local ramdisks on each node for shuffle writes.
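For example, a rough sketch of that setup (the mount point, size, and app name are illustrative placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // On each node, first create the ramdisk (path and size are illustrative):
    //   sudo mkdir -p /mnt/spark-ramdisk
    //   sudo mount -t tmpfs -o size=200g tmpfs /mnt/spark-ramdisk
    // Then point Spark's shuffle scratch space at it. Note that some cluster
    // managers override this setting with their own local-dir configuration
    // (e.g. via SPARK_LOCAL_DIRS), in which case set it there instead.
    val conf = new SparkConf()
      .setAppName("ramdisk-shuffle-test")            // placeholder
      .set("spark.local.dir", "/mnt/spark-ramdisk")  // shuffle files land here
    val sc = new SparkContext(conf)

Shuffle and spill files then live in RAM-backed tmpfs rather than on the drives, which is roughly the behavior being asked for, at the cost of RAM.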
On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <slavi...@gmail.com> wrote:

Hello;

I'm working on Spark with very large memory systems (2 TB+) and notice that Spark spills to disk during shuffles. Is there a way to force Spark to stay in memory when doing shuffle operations? The goal is to keep the shuffle data either in the heap or in off-heap memory (in 1.6.x) and never touch the I/O subsystem. I am willing to have the job fail if it runs out of RAM.

spark.shuffle.spill is deprecated in 1.6 and does not work with the tungsten-sort shuffle in 1.5.x:

"WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but this is ignored by the tungsten-sort shuffle manager; its optimized shuffles will continue to spill to disk when necessary."

If this is impossible via configuration changes, what code changes would be needed to accomplish this?

--
Michael Slavitch
62 Renfrew Ave.
Ottawa Ontario
K1S 1Z5