Hi all,

I have a job that, for every row, creates about 20 new objects (e.g. an RDD of 100 rows in becomes an RDD of 2,000 rows out). The reason is that each row is tagged with the list of 'buckets' or 'windows' it belongs to.
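To illustrate the shape of the expansion, something like this (Row, TaggedRow and windowsFor are just stand-ins for my real row type and tagging logic):

    // placeholder row type: a timestamped event
    case class Row(ts: Long, value: Double)
    // one output object per (window, row) pair
    case class TaggedRow(window: Long, row: Row)

    // stand-in for my real tagging logic: each row falls into ~20 sliding windows
    def windowsFor(r: Row): Seq[Long] =
      (0L until 20L).map(i => r.ts / 60000 - i)

    // conceptually: 1 row in -> ~20 TaggedRows out
    def tagRow(r: Row): Seq[TaggedRow] = windowsFor(r).map(w => TaggedRow(w, r))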
The actual data is about 10 billion rows, and each executor has 60GB of memory. Currently I have a mapPartitions task that does this object creation in a mutable Scala HashMap and then returns the map as an iterator via .toIterator. Is there a more efficient way to do this (assuming I can't use something like flatMap)?

The job runs (as long as each task is small enough), but GC time is understandably off the charts. I've reduced the Spark cache memory fraction to 0.05 (I just need space for a few broadcasts, and this is a data-churn task) and left the shuffle memory fraction unchanged. What kinds of settings should I be tuning with regard to GC? Slide 125 of https://spark-summit.org/2014/wp-content/uploads/2015/03/SparkSummitEast2015-AdvDevOps-StudentSlides.pdf recommends some settings, but I'm not sure which would be best here. I tried -XX:+UseG1GC, but it pretty much causes the job to fail (all the executors die). Are there any tips on the ratio of new-gen to old-gen space when creating lots of objects that live in a data structure until the entire partition has been processed?

Any tips for tuning these kinds of jobs would be helpful!

Thanks,
~N
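P.S. In case it helps, here is a stripped-down sketch of the task and the settings I'm running with, reusing the placeholder types from the snippet above. The real job does more per row, and the conf key names are just my best match for the "cache memory fraction" / GC options I mentioned:

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    // settings I'm currently running with (Spark 1.x style fractions)
    val conf = new SparkConf()
      .set("spark.storage.memoryFraction", "0.05")               // just room for a few broadcasts
      // spark.shuffle.memoryFraction left at its default
      // .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC") // the attempt that killed the executors

    // the churn: build every output object for a partition up front, then hand back an iterator
    def expand(rows: RDD[Row]): RDD[((Long, Long), TaggedRow)] =
      rows.mapPartitions { iter =>
        val out = mutable.HashMap.empty[(Long, Long), TaggedRow]
        iter.foreach { r =>
          windowsFor(r).foreach { w =>
            out((w, r.ts)) = TaggedRow(w, r)                     // ~20 entries per input row
          }
        }
        // nothing in the map can be collected until the whole partition has been consumed
        out.toIterator
      }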