It looks like it is caused by a different snappy version: if you disable 
compression or switch to lz4, there is no size difference. 
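
For example, a minimal sketch of the two experiments (spark.io.compression.codec 
and spark.shuffle.compress are standard Spark configuration keys; the app name 
is only a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-size-test")          // placeholder name
      .set("spark.io.compression.codec", "lz4") // instead of the default snappy
      // to disable shuffle compression instead, uncomment:
      // .set("spark.shuffle.compress", "false")
    val sc = new SparkContext(conf)

If the shuffle write sizes match between 1.1 and 1.2 with either setting, the 
snappy version is the likely culprit.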

Yours respectfully, Xuefeng Wu (吴雪峰)

> On Feb 10, 2015, at 6:13 PM, chris <christian.b...@performance-media.de> wrote:
> 
> Hello,
> 
> as the original message from Kevin Jung never got accepted to the
> mailinglist, I quote it here completely:
> 
> 
> Kevin Jung wrote
>> Hi all,
>> The size of the shuffle write shown in the Spark web UI differs greatly when
>> I execute the same Spark job on the same input data (100 GB) in Spark 1.1 and
>> Spark 1.2.
>> At the same sortBy stage, the size of shuffle write is 39.7GB in spark 1.1
>> but 91.0GB in spark 1.2.
>> I set the spark.shuffle.manager option to hash because its default value
>> changed, but Spark 1.2 still writes larger files than Spark 1.1.
>> Can anyone tell me why this happens?
>> 
>> Thanks
>> Kevin
> 
> I'm experiencing the same thing with my job and that's what I tested:
> 
> * Spark 1.2.0 with Sort-based Shuffle
> * Spark 1.2.0 with Hash-based Shuffle
> * Spark 1.2.1 with Sort-based Shuffle
> 
> All three combinations show the same behaviour, which contrasts with Spark
> 1.1.0.
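> 
> For reference, a minimal sketch of how the shuffle implementation is selected
> (spark.shuffle.manager is the standard key; "sort" is the 1.2 default, "hash"
> the 1.1 default):
> 
>     val conf = new org.apache.spark.SparkConf()
>       .set("spark.shuffle.manager", "hash") // or "sort"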
> 
> In Spark 1.1.0, my job runs for about an hour; in Spark 1.2.x, it runs for
> almost four hours. The configuration is otherwise identical - I only added
> org.apache.spark.scheduler.CompressedMapStatus to the Kryo registrator for
> Spark 1.2.0 to cope with https://issues.apache.org/jira/browse/SPARK-5102.
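> 
> In case it helps anyone else hitting SPARK-5102, the registrator looks
> roughly like this (class name and wiring are my own; CompressedMapStatus is
> private[spark], hence the Class.forName workaround):
> 
>     import com.esotericsoftware.kryo.Kryo
>     import org.apache.spark.serializer.KryoRegistrator
> 
>     class ShuffleRegistrator extends KryoRegistrator {
>       override def registerClasses(kryo: Kryo): Unit = {
>         // register the package-private class by name
>         kryo.register(Class.forName(
>           "org.apache.spark.scheduler.CompressedMapStatus"))
>       }
>     }
> 
> registered via conf.set("spark.kryo.registrator", "ShuffleRegistrator").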
> 
> 
> As a consequence (I think, though the causality might run the other way), I
> see lots and lots of disk spills.
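> 
> For completeness: the spill-related settings are at their Spark 1.x defaults
> on my side, i.e. effectively
> 
>     conf.set("spark.shuffle.spill", "true")         // allow spilling to disk
>     conf.set("spark.shuffle.memoryFraction", "0.2") // in-memory map budget
> 
> so spilling as such is expected behaviour; only its volume changed.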
> 
> I cannot provide a small test case, but maybe the log entries for a single
> worker thread can help someone investigate on this. (See below.)
> 
> 
> I also opened an issue on this, see
> https://issues.apache.org/jira/browse/SPARK-5715
> 
> Any help would be greatly appreciated, because otherwise I'm stuck with Spark
> 1.1.0, as quadrupling the runtime is not an option.
> 
> Sincerely,
> 
> Chris
> 
> 
> 
> 2015-02-09T14:06:06.328+01:00    INFO    org.apache.spark.executor.Executor
> Running task 9.0 in stage 18.0 (TID 300)    Executor task launch worker-18
> 2015-02-09T14:06:06.351+01:00    INFO    org.apache.spark.CacheManager    
> Partition
> rdd_35_9 not found, computing it    Executor task launch worker-18
> 2015-02-09T14:06:06.351+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Getting 10 non-empty
> blocks out of 10 blocks    Executor task launch worker-18
> 2015-02-09T14:06:06.351+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Started 0 remote
> fetches in 0 ms    Executor task launch worker-18
> 2015-02-09T14:06:07.396+01:00    INFO    org.apache.spark.storage.MemoryStore
> ensureFreeSpace(2582904) called with curMem=300174944, maxMe...    Executor
> task launch worker-18
> 2015-02-09T14:06:07.397+01:00    INFO    org.apache.spark.storage.MemoryStore
> Block rdd_35_9 stored as bytes in memory (estimated size 2.5...    Executor
> task launch worker-18
> 2015-02-09T14:06:07.398+01:00    INFO
> org.apache.spark.storage.BlockManagerMaster    Updated info of block rdd_35_9
> Executor task launch worker-18
> 2015-02-09T14:06:07.399+01:00    INFO    org.apache.spark.CacheManager    
> Partition
> rdd_38_9 not found, computing it    Executor task launch worker-18
> 2015-02-09T14:06:07.399+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Getting 10 non-empty
> blocks out of 10 blocks    Executor task launch worker-18
> 2015-02-09T14:06:07.400+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Started 0 remote
> fetches in 0 ms    Executor task launch worker-18
> 2015-02-09T14:06:07.567+01:00    INFO    org.apache.spark.storage.MemoryStore
> ensureFreeSpace(944848) called with curMem=302757848, maxMem...    Executor
> task launch worker-18
> 2015-02-09T14:06:07.568+01:00    INFO    org.apache.spark.storage.MemoryStore
> Block rdd_38_9 stored as values in memory (estimated size 92...    Executor
> task launch worker-18
> 2015-02-09T14:06:07.569+01:00    INFO
> org.apache.spark.storage.BlockManagerMaster    Updated info of block rdd_38_9
> Executor task launch worker-18
> 2015-02-09T14:06:07.573+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Getting 34 non-empty
> blocks out of 50 blocks    Executor task launch worker-18
> 2015-02-09T14:06:07.573+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Started 0 remote
> fetches in 1 ms    Executor task launch worker-18
> 2015-02-09T14:06:38.931+01:00    INFO    org.apache.spark.CacheManager    
> Partition
> rdd_41_9 not found, computing it    Executor task launch worker-18
> 2015-02-09T14:06:38.931+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Getting 3 non-empty
> blocks out of 10 blocks    Executor task launch worker-18
> 2015-02-09T14:06:38.931+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Started 0 remote
> fetches in 0 ms    Executor task launch worker-18
> 2015-02-09T14:06:38.945+01:00    INFO    org.apache.spark.storage.MemoryStore
> ensureFreeSpace(0) called with curMem=307529127, maxMem=9261...    Executor
> task launch worker-18
> 2015-02-09T14:06:38.945+01:00    INFO    org.apache.spark.storage.MemoryStore
> Block rdd_41_9 stored as bytes in memory (estimated size 0.0...    Executor
> task launch worker-18
> 2015-02-09T14:06:38.946+01:00    INFO
> org.apache.spark.storage.BlockManagerMaster    Updated info of block rdd_41_9
> Executor task launch worker-18
> 2015-02-09T14:06:38.946+01:00    WARN    org.apache.spark.storage.BlockManager
> Block rdd_41_9 replicated to only 0 peer(s) instead of 1 pee...    Executor
> task launch worker-18
> 2015-02-09T14:06:39.088+01:00    INFO    org.apache.spark.storage.BlockManager
> Found block rdd_3_9 locally    Executor task launch worker-18
> 2015-02-09T14:06:41.389+01:00    INFO    org.apache.spark.CacheManager    
> Partition
> rdd_7_9 not found, computing it    Executor task launch worker-18
> 2015-02-09T14:06:41.389+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Getting 1 non-empty
> blocks out of 1 blocks    Executor task launch worker-18
> 2015-02-09T14:06:41.389+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Started 0 remote
> fetches in 0 ms    Executor task launch worker-18
> 2015-02-09T14:06:41.402+01:00    INFO    org.apache.spark.storage.MemoryStore
> ensureFreeSpace(38144) called with curMem=307529151, maxMem=...    Executor
> task launch worker-18
> 2015-02-09T14:06:41.402+01:00    INFO    org.apache.spark.storage.MemoryStore
> Block rdd_7_9 stored as values in memory (estimated size 37....    Executor
> task launch worker-18
> 2015-02-09T14:06:41.404+01:00    INFO
> org.apache.spark.storage.BlockManagerMaster    Updated info of block rdd_7_9
> Executor task launch worker-18
> 2015-02-09T14:07:00.019+01:00    INFO    org.apache.spark.CacheManager    
> Partition
> rdd_73_9 not found, computing it    Executor task launch worker-18
> 2015-02-09T14:07:00.019+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Getting 1 non-empty
> blocks out of 1 blocks    Executor task launch worker-18
> 2015-02-09T14:07:00.019+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Started 0 remote
> fetches in 0 ms    Executor task launch worker-18
> 2015-02-09T14:07:00.185+01:00    INFO    org.apache.spark.storage.MemoryStore
> ensureFreeSpace(826250) called with curMem=307567295, maxMem...    Executor
> task launch worker-18
> 2015-02-09T14:07:00.185+01:00    INFO    org.apache.spark.storage.MemoryStore
> Block rdd_73_9 stored as values in memory (estimated size 80...    Executor
> task launch worker-18
> 2015-02-09T14:07:00.186+01:00    INFO
> org.apache.spark.storage.BlockManagerMaster    Updated info of block rdd_73_9
> Executor task launch worker-18
> 2015-02-09T14:07:00.190+01:00    INFO    org.apache.spark.storage.BlockManager
> Found block rdd_35_9 locally    Executor task launch worker-18
> 2015-02-09T14:07:00.190+01:00    INFO    org.apache.spark.storage.BlockManager
> Found block rdd_38_9 locally    Executor task launch worker-18
> 2015-02-09T14:07:00.194+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Getting 34 non-empty
> blocks out of 50 blocks    Executor task launch worker-18
> 2015-02-09T14:07:00.194+01:00    INFO
> org.apache.spark.storage.ShuffleBlockFetcherIterator    Started 0 remote
> fetches in 1 ms    Executor task launch worker-18
> 2015-02-09T14:07:17.967+01:00    INFO
> org.apache.spark.util.collection.ExternalAppendOnlyMap    Thread 117 spilling
> in-memory map of 670.2 MB to disk (1 tim...    Executor task launch worker-18
> 2015-02-09T14:07:46.716+01:00    INFO    org.apache.spark.storage.BlockManager
> Found block rdd_41_9 locally    Executor task launch worker-18
> 2015-02-09T14:07:47.603+01:00    INFO    org.apache.spark.storage.BlockManager
> Found block rdd_3_9 locally    Executor task launch worker-18
> 2015-02-09T14:07:47.897+01:00    INFO
> org.apache.spark.util.collection.ExternalAppendOnlyMap    Thread 117 spilling
> in-memory map of 5.0 MB to disk (1 time ...    Executor task launch worker-18
> 2015-02-09T14:07:48.270+01:00    INFO
> org.apache.spark.util.collection.ExternalAppendOnlyMap    Thread 117 spilling
> in-memory map of 5.0 MB to disk (2 times...    Executor task launch worker-18
> 2015-02-09T14:07:48.727+01:00    INFO
> org.apache.spark.util.collection.ExternalAppendOnlyMap    Thread 117 spilling
> in-memory map of 5.0 MB to disk (3 times...    Executor task launch worker-18
> 2015-02-09T14:07:49.021+01:00    INFO
> org.apache.spark.util.collection.ExternalAppendOnlyMap    Thread 117 spilling
> in-memory map of 5.0 MB to disk (4 times...    Executor task launch worker-18
> 
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
