Hi all,

I have a job that, for every row, creates about 20 new objects (e.g. an RDD of 100 rows in becomes an RDD of 2,000 rows out). The reason is that each row is tagged with the list of 'buckets' or 'windows' it belongs to.
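To illustrate the shape of the expansion, something like this (Row, TaggedRow and windowsFor are just stand-ins for my real row type and tagging logic):

    // placeholder row type: a timestamped event
    case class Row(ts: Long, value: Double)
    // one output object per (window, row) pair
    case class TaggedRow(window: Long, row: Row)

    // stand-in for my real tagging logic: each row falls into ~20 sliding windows
    def windowsFor(r: Row): Seq[Long] =
      (0L until 20L).map(i => r.ts / 60000 - i)

    // conceptually: 1 row in -> ~20 TaggedRows out
    def tagRow(r: Row): Seq[TaggedRow] = windowsFor(r).map(w => TaggedRow(w, r))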
The actual data is about 10 billion rows, and each executor has 60GB of memory. Currently I have a mapPartitions task that does this object creation in a mutable Scala HashMap and then returns the map as an iterator via .toIterator. Is there a more efficient way to do this (assuming I can't use something like flatMap)?

The job runs (as long as each task is small enough), but GC time is understandably off the charts. I've reduced the Spark cache memory fraction to 0.05 (I just need space for a few broadcasts, and this is a data-churn task) and left the shuffle memory fraction unchanged. What kinds of settings should I be tuning with regard to GC? Slide 125 of https://spark-summit.org/2014/wp-content/uploads/2015/03/SparkSummitEast2015-AdvDevOps-StudentSlides.pdf recommends some settings, but I'm not sure which would be best here. I tried -XX:+UseG1GC, but it pretty much causes the job to fail (all the executors die). Are there any tips on the ratio of new-gen to old-gen space when creating lots of objects that live in a data structure until the entire partition has been processed?

Any tips for tuning these kinds of jobs would be helpful!

Thanks,
~N
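P.S. In case it helps, here is a stripped-down sketch of the task and the settings I'm running with, reusing the placeholder types from the snippet above. The real job does more per row, and the conf key names are just my best match for the "cache memory fraction" / GC options I mentioned:

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    // settings I'm currently running with (Spark 1.x style fractions)
    val conf = new SparkConf()
      .set("spark.storage.memoryFraction", "0.05")               // just room for a few broadcasts
      // spark.shuffle.memoryFraction left at its default
      // .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC") // the attempt that killed the executors

    // the churn: build every output object for a partition up front, then hand back an iterator
    def expand(rows: RDD[Row]): RDD[((Long, Long), TaggedRow)] =
      rows.mapPartitions { iter =>
        val out = mutable.HashMap.empty[(Long, Long), TaggedRow]
        iter.foreach { r =>
          windowsFor(r).foreach { w =>
            out((w, r.ts)) = TaggedRow(w, r)                     // ~20 entries per input row
          }
        }
        // nothing in the map can be collected until the whole partition has been consumed
        out.toIterator
      }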