It looks like this is caused by a different snappy version. If you disable compression or switch to lz4, the sizes are no different.
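For reference, this is the kind of configuration change I mean - a minimal sketch (the app name is made up; run your existing job unchanged under each setting and compare the "Shuffle Write" column in the web UI):

    import org.apache.spark.{SparkConf, SparkContext}

    // Run the same job once per setting and compare shuffle write sizes.
    val conf = new SparkConf()
      .setAppName("shuffle-write-size-check") // hypothetical name
      // Option 1: disable shuffle compression entirely:
      .set("spark.shuffle.compress", "false")
      // Option 2: keep compression but switch the codec away from snappy:
      // .set("spark.io.compression.codec", "lz4")
    val sc = new SparkContext(conf)

If the 1.1-vs-1.2 gap disappears under either setting, the difference comes from the snappy codec, not from the amount of shuffle data itself.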
Yours,
Xuefeng Wu 吴雪峰

> On Feb 10, 2015, at 6:13 PM, chris <christian.b...@performance-media.de> wrote:
>
> Hello,
>
> as the original message from Kevin Jung never got accepted to the
> mailing list, I quote it here completely:
>
> Kevin Jung wrote
>> Hi all,
>> The size of shuffle write shown in the Spark web UI is very different when I
>> execute the same Spark job on the same input data (100 GB) in Spark 1.1 and
>> Spark 1.2.
>> At the same sortBy stage, the size of shuffle write is 39.7 GB in Spark 1.1
>> but 91.0 GB in Spark 1.2.
>> I set the spark.shuffle.manager option to hash because its default value
>> changed, but Spark 1.2 still writes larger files than Spark 1.1.
>> Can anyone tell me why this happens?
>>
>> Thanks
>> Kevin
>
> I'm experiencing the same thing with my job, and this is what I tested:
>
> * Spark 1.2.0 with sort-based shuffle
> * Spark 1.2.0 with hash-based shuffle
> * Spark 1.2.1 with sort-based shuffle
>
> All three combinations show the same behaviour, which contrasts with Spark
> 1.1.0.
>
> In Spark 1.1.0 my job runs for about an hour; in Spark 1.2.x it runs for
> almost four hours. The configuration is otherwise identical - I only added
> org.apache.spark.scheduler.CompressedMapStatus to the Kryo registrator for
> Spark 1.2.0 to cope with https://issues.apache.org/jira/browse/SPARK-5102.
>
> As a consequence (I think, but the causality might run the other way) I see
> lots and lots of disk spills.
>
> I cannot provide a small test case, but maybe the log entries for a single
> worker thread can help someone investigate this. (See below.)
>
> I also opened an issue on this, see
> https://issues.apache.org/jira/browse/SPARK-5715
>
> Any help will be greatly appreciated, because otherwise I'm stuck with Spark
> 1.1.0, as a quadrupled runtime is not an option.
>
> Sincerely,
>
> Chris
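(For anyone else hitting SPARK-5102: the registrator Chris mentions looks roughly like this - a sketch, and "MyRegistrator" is a made-up name. CompressedMapStatus is private[spark], so Class.forName is one way to get at the class:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    // Registers CompressedMapStatus with Kryo as a workaround for
    // SPARK-5102 on Spark 1.2.0. Point spark.kryo.registrator at this class.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(Class.forName("org.apache.spark.scheduler.CompressedMapStatus"))
      }
    }

It is enabled with spark.serializer=org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator=MyRegistrator.)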
> 2015-02-09T14:06:06.328+01:00 INFO org.apache.spark.executor.Executor Running task 9.0 in stage 18.0 (TID 300) Executor task launch worker-18
> 2015-02-09T14:06:06.351+01:00 INFO org.apache.spark.CacheManager Partition rdd_35_9 not found, computing it Executor task launch worker-18
> 2015-02-09T14:06:06.351+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 10 non-empty blocks out of 10 blocks Executor task launch worker-18
> 2015-02-09T14:06:06.351+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 0 ms Executor task launch worker-18
> 2015-02-09T14:06:07.396+01:00 INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(2582904) called with curMem=300174944, maxMe... Executor task launch worker-18
> 2015-02-09T14:06:07.397+01:00 INFO org.apache.spark.storage.MemoryStore Block rdd_35_9 stored as bytes in memory (estimated size 2.5... Executor task launch worker-18
> 2015-02-09T14:06:07.398+01:00 INFO org.apache.spark.storage.BlockManagerMaster Updated info of block rdd_35_9 Executor task launch worker-18
> 2015-02-09T14:06:07.399+01:00 INFO org.apache.spark.CacheManager Partition rdd_38_9 not found, computing it Executor task launch worker-18
> 2015-02-09T14:06:07.399+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 10 non-empty blocks out of 10 blocks Executor task launch worker-18
> 2015-02-09T14:06:07.400+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 0 ms Executor task launch worker-18
> 2015-02-09T14:06:07.567+01:00 INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(944848) called with curMem=302757848, maxMem... Executor task launch worker-18
> 2015-02-09T14:06:07.568+01:00 INFO org.apache.spark.storage.MemoryStore Block rdd_38_9 stored as values in memory (estimated size 92... Executor task launch worker-18
> 2015-02-09T14:06:07.569+01:00 INFO org.apache.spark.storage.BlockManagerMaster Updated info of block rdd_38_9 Executor task launch worker-18
> 2015-02-09T14:06:07.573+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 34 non-empty blocks out of 50 blocks Executor task launch worker-18
> 2015-02-09T14:06:07.573+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 1 ms Executor task launch worker-18
> 2015-02-09T14:06:38.931+01:00 INFO org.apache.spark.CacheManager Partition rdd_41_9 not found, computing it Executor task launch worker-18
> 2015-02-09T14:06:38.931+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 3 non-empty blocks out of 10 blocks Executor task launch worker-18
> 2015-02-09T14:06:38.931+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 0 ms Executor task launch worker-18
> 2015-02-09T14:06:38.945+01:00 INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(0) called with curMem=307529127, maxMem=9261... Executor task launch worker-18
> 2015-02-09T14:06:38.945+01:00 INFO org.apache.spark.storage.MemoryStore Block rdd_41_9 stored as bytes in memory (estimated size 0.0... Executor task launch worker-18
> 2015-02-09T14:06:38.946+01:00 INFO org.apache.spark.storage.BlockManagerMaster Updated info of block rdd_41_9 Executor task launch worker-18
> 2015-02-09T14:06:38.946+01:00 WARN org.apache.spark.storage.BlockManager Block rdd_41_9 replicated to only 0 peer(s) instead of 1 pee... Executor task launch worker-18
> 2015-02-09T14:06:39.088+01:00 INFO org.apache.spark.storage.BlockManager Found block rdd_3_9 locally Executor task launch worker-18
> 2015-02-09T14:06:41.389+01:00 INFO org.apache.spark.CacheManager Partition rdd_7_9 not found, computing it Executor task launch worker-18
> 2015-02-09T14:06:41.389+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 1 non-empty blocks out of 1 blocks Executor task launch worker-18
> 2015-02-09T14:06:41.389+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 0 ms Executor task launch worker-18
> 2015-02-09T14:06:41.402+01:00 INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(38144) called with curMem=307529151, maxMem=... Executor task launch worker-18
> 2015-02-09T14:06:41.402+01:00 INFO org.apache.spark.storage.MemoryStore Block rdd_7_9 stored as values in memory (estimated size 37.... Executor task launch worker-18
> 2015-02-09T14:06:41.404+01:00 INFO org.apache.spark.storage.BlockManagerMaster Updated info of block rdd_7_9 Executor task launch worker-18
> 2015-02-09T14:07:00.019+01:00 INFO org.apache.spark.CacheManager Partition rdd_73_9 not found, computing it Executor task launch worker-18
> 2015-02-09T14:07:00.019+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 1 non-empty blocks out of 1 blocks Executor task launch worker-18
> 2015-02-09T14:07:00.019+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 0 ms Executor task launch worker-18
> 2015-02-09T14:07:00.185+01:00 INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(826250) called with curMem=307567295, maxMem... Executor task launch worker-18
> 2015-02-09T14:07:00.185+01:00 INFO org.apache.spark.storage.MemoryStore Block rdd_73_9 stored as values in memory (estimated size 80... Executor task launch worker-18
> 2015-02-09T14:07:00.186+01:00 INFO org.apache.spark.storage.BlockManagerMaster Updated info of block rdd_73_9 Executor task launch worker-18
> 2015-02-09T14:07:00.190+01:00 INFO org.apache.spark.storage.BlockManager Found block rdd_35_9 locally Executor task launch worker-18
> 2015-02-09T14:07:00.190+01:00 INFO org.apache.spark.storage.BlockManager Found block rdd_38_9 locally Executor task launch worker-18
> 2015-02-09T14:07:00.194+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 34 non-empty blocks out of 50 blocks Executor task launch worker-18
> 2015-02-09T14:07:00.194+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 1 ms Executor task launch worker-18
> 2015-02-09T14:07:17.967+01:00 INFO org.apache.spark.util.collection.ExternalAppendOnlyMap Thread 117 spilling in-memory map of 670.2 MB to disk (1 tim... Executor task launch worker-18
> 2015-02-09T14:07:46.716+01:00 INFO org.apache.spark.storage.BlockManager Found block rdd_41_9 locally Executor task launch worker-18
> 2015-02-09T14:07:47.603+01:00 INFO org.apache.spark.storage.BlockManager Found block rdd_3_9 locally Executor task launch worker-18
> 2015-02-09T14:07:47.897+01:00 INFO org.apache.spark.util.collection.ExternalAppendOnlyMap Thread 117 spilling in-memory map of 5.0 MB to disk (1 time ... Executor task launch worker-18
> 2015-02-09T14:07:48.270+01:00 INFO org.apache.spark.util.collection.ExternalAppendOnlyMap Thread 117 spilling in-memory map of 5.0 MB to disk (2 times... Executor task launch worker-18
> 2015-02-09T14:07:48.727+01:00 INFO org.apache.spark.util.collection.ExternalAppendOnlyMap Thread 117 spilling in-memory map of 5.0 MB to disk (3 times... Executor task launch worker-18
> 2015-02-09T14:07:49.021+01:00 INFO org.apache.spark.util.collection.ExternalAppendOnlyMap Thread 117 spilling in-memory map of 5.0 MB to disk (4 times... Executor task launch worker-18
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-write-increases-in-spark-1-2-tp20894p21572.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
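One more thing worth pinning down when comparing 1.1 against 1.2: the shuffle manager default changed from hash to sort in 1.2, so set it explicitly on both versions - a sketch, reusing the conf value from the snippet above:

    // Pin the shuffle manager so only the Spark version varies.
    conf.set("spark.shuffle.manager", "hash")   // the 1.1 default
    // conf.set("spark.shuffle.manager", "sort") // the 1.2 default

Kevin already did this, but pinning it in every run rules the manager itself in or out of any remaining difference.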