The distinct call causes a shuffle, which always results in data being
written to disk.
-Sven

On Tue, Jan 13, 2015 at 12:21 PM, Shuai Zheng <szheng.c...@gmail.com> wrote:

> Hi All,
>
>
>
> I am trying this with a small data set. It is only about 200 MB, and all I
> am doing is a distinct count on it.
>
> But there is a lot of spilling happening in the log (attached at the end of
> the email).
>
>
>
> Basically I use 10 GB of executor memory, running on a one-node EMR cluster
> with an r3.8xlarge instance (which has 244 GB memory and 32 vCPUs).
>
>
>
> My code is simple and runs in the spark-shell (~/spark/bin/spark-shell
> --executor-cores 4 --executor-memory 10G):
>
>
>
> val llg = sc.textFile("s3://…/part-r-00000") // File is around 210.5 MB, 4.7M rows inside
>
> // val llg = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", "-240990|161324,9051480,0,2,30.48,75"))
>
> val ids = llg.flatMap(line => line.split(",").slice(0, 1)) // Take the first column as the key
>
> val counts = ids.distinct.count
>
>
>
> I think I should have enough memory, so there should not be any spilling.
> Can anyone give me an idea of why it spills, or where I can tune the system
> to reduce the spilling? (It is not an issue on this dataset, but I want to
> learn how to tune it.)
>
> The Spark UI shows only 24.2 MB of shuffle write. And if the executor has
> 10 GB of memory, why does it need to spill?
>
>
>
> 2015-01-13 20:01:53,010 INFO
> [sparkDriver-akka.actor.default-dispatcher-2] storage.BlockManagerMaster
> (Logging.scala:logInfo(59)) - Updated info of block broadcast_2_piece0
>
> 2015-01-13 20:01:53,011 INFO  [Spark Context Cleaner] spark.ContextCleaner
> (Logging.scala:logInfo(59)) - Cleaned broadcast 2
>
> 2015-01-13 20:01:53,399 INFO  [Executor task launch worker-5]
> collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 149
> spilling in-memory map of 23.4 MB to disk (3 times so far)
>
> 2015-01-13 20:01:53,516 INFO  [Executor task launch worker-7]
> collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 151
> spilling in-memory map of 23.4 MB to disk (3 times so far)
>
> 2015-01-13 20:01:53,531 INFO  [Executor task launch worker-6]
> collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 150
> spilling in-memory map of 23.2 MB to disk (3 times so far)
>
> 2015-01-13 20:01:53,793 INFO  [Executor task launch worker-4]
> collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 148
> spilling in-memory map of 23.4 MB to disk (3 times so far)
>
> 2015-01-13 20:01:54,460 INFO  [Executor task launch worker-5]
> collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 149
> spilling in-memory map of 23.2 MB to disk (4 times so far)
>
> 2015-01-13 20:01:54,469 INFO  [Executor task launch worker-7]
> collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 151
> spilling in-memory map of 23.2 MB to disk (4 times so far)
>
> 2015-01-13 20:01:55,144 INFO  [Executor task launch worker-6]
> collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 150
> spilling in-memory map of 24.2 MB to disk (4 times so far)
>
> 2015-01-13 20:01:55,192 INFO  [Executor task launch worker-4]
> collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 148
> spilling in-memory map of 23.2 MB to disk (4 times so far)
>
>
>
> I am trying to collect more benchmarks before moving on to a bigger dataset
> and more complex logic.
>
>
>
> Regards,
>
>
>
> Shuai
>



-- 
http://sites.google.com/site/krasser/?utm_source=sig
