Shuffle spill (memory) is the size of the deserialized form of the data in memory at the time we spill it, whereas shuffle spill (disk) is the size of the serialized form of the data on disk after we spill it. So nothing is spilled "to memory": both metrics describe the same data that was spilled to disk, measured before and after serialization. This is why the latter tends to be much smaller than the former. Note that both metrics are aggregated over the entire duration of the task (i.e. within each task you can spill multiple times, and each spill adds to both totals).
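To make the two numbers concrete, here is a toy Python sketch. It is not Spark's code: the ToyTask class, the record-count threshold, the deep_size estimate, and the use of compressed pickle as the "on disk" form are all assumptions made for illustration. It only shows how one spill contributes to both counters, and how the counters add up across several spills in one task.

```python
import pickle
import sys
import zlib


def deep_size(records):
    """Rough in-memory size: the list plus each tuple and its fields."""
    total = sys.getsizeof(records)
    for rec in records:
        total += sys.getsizeof(rec) + sum(sys.getsizeof(f) for f in rec)
    return total


class ToyTask:
    """Hypothetical stand-in for a task that buffers records and spills."""

    def __init__(self, max_records_in_memory):
        self.max_records_in_memory = max_records_in_memory
        self.buffer = []
        self.spill_count = 0
        self.memory_bytes_spilled = 0  # analogue of "shuffle spill (memory)"
        self.disk_bytes_spilled = 0    # analogue of "shuffle spill (disk)"

    def insert(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_records_in_memory:
            self.spill()

    def spill(self):
        # Size of the deserialized data at the moment we spill it.
        self.memory_bytes_spilled += deep_size(self.buffer)
        # Size of the same data once serialized (here: compressed pickle).
        on_disk = zlib.compress(pickle.dumps(self.buffer))
        self.disk_bytes_spilled += len(on_disk)
        self.spill_count += 1
        self.buffer = []


task = ToyTask(max_records_in_memory=1000)
for i in range(3500):
    task.insert((i % 10, "value-%d" % (i % 10)))

print("spills:", task.spill_count)
print("spill (memory):", task.memory_bytes_spilled, "bytes")
print("spill (disk):", task.disk_bytes_spilled, "bytes")
```

With 3500 records and a threshold of 1000, the toy task spills three times, and both counters are sums over those three spills. The exact byte counts depend on the Python build, but the serialized total comes out far smaller than the in-memory total.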
Andrew

2014-07-18 4:09 GMT-07:00 Sébastien Rainville <[email protected]>:

> Hi,
>
> in the Spark UI, one of the metrics is "shuffle spill (memory)". What is
> it exactly? Spilling to disk when the shuffle data doesn't fit in memory I
> get it, but what does it mean to spill to memory?
>
> Thanks,
>
> - Sebastien
