I have a fairly large Spark job where I'm essentially creating quite a few 
RDDs, performing several types of joins on these RDDs, and ending up with a 
final RDD that I write back to S3.


Along the way, I would like to capture record counts for some of these RDDs. My 
initial approach was to call the count action on some of these intermediate 
RDDs (and cache them, since the count forces materialization of the RDD and 
the RDD is needed again later).  This seemed to work 'ok' when my RDDs were 
fairly small/modest, but as they grew in size I started to experience 
problems.
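
For reference, a minimal sketch of the count-and-cache pattern I've been using
(the RDD contents, names, and S3 path below are just placeholders for
illustration):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD implicits on older Spark versions
import org.apache.spark.rdd.RDD

def countAndCache(sc: SparkContext): Unit = {
  // Placeholder inputs standing in for the real RDDs being joined.
  val left: RDD[(String, Int)]  = sc.parallelize(Seq(("a", 1), ("b", 2)))
  val right: RDD[(String, Int)] = sc.parallelize(Seq(("a", 10), ("c", 30)))

  // Cache the joined RDD because count() materializes it and it is reused below.
  val joined = left.join(right).cache()

  // count() is an extra action, i.e. an extra job, which breaks pipelining.
  println(s"joined record count: ${joined.count()}")

  // The cached RDD is reused for the final save.
  joined.map { case (k, (l, r)) => s"$k,$l,$r" }
        .saveAsTextFile("s3n://my-bucket/output")  // placeholder path
}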

After watching a recent (very good) screencast on performance, I don't think 
this is the correct approach, as I believe I'm really breaking (or hindering) 
Spark's pipelining.  If I remove all of my counts, I'm left with only the one 
job/action (the save-as-Hadoop-file at the end).  Spark then seems to run more 
smoothly (and quite a bit faster), and I really don't need (or want) to cache 
any of my intermediate RDDs.

So, the approach I've been kicking around is to use accumulators instead.  I 
was already using them to count 'bad' records, so why not count 'good' records 
as well? I realize that if a partition is lost and recomputed I might 
over-count, but perhaps that is an acceptable trade-off.
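
Roughly what I have in mind (just a sketch; Record, isValid, and the
accumulator names are hypothetical, and I'm using the classic sc.accumulator
API):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // Long AccumulatorParam implicit on older Spark versions
import org.apache.spark.rdd.RDD

// Hypothetical record type and validity check.
case class Record(id: String, value: Int)
def isValid(r: Record): Boolean = r.value >= 0

def countWithAccumulators(sc: SparkContext, records: RDD[Record]): RDD[Record] = {
  val goodRecords = sc.accumulator(0L, "goodRecords")
  val badRecords  = sc.accumulator(0L, "badRecords")

  // Counting happens as a side effect of a transformation I already run,
  // so no extra action/job is needed.  The counts are only meaningful after
  // the final action (the save) has run, and retried/recomputed partitions
  // can inflate them -- the over-count trade-off mentioned above.
  records.filter { r =>
    if (isValid(r)) { goodRecords += 1L; true }
    else            { badRecords += 1L; false }
  }
}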

I'm guessing that others have run into this before, so I would like to learn 
from their experience and how they have addressed it.

Thanks.

Darin.
