I have a fairly large Spark job where I'm essentially creating quite a few RDDs, doing several types of joins with these RDDs, and ending up with a final RDD that I write back to S3.
Along the way, I would like to capture record counts for some of these RDDs. My initial approach was to call the count action on some of these intermediate RDDs (and cache them, since the count forces materialization of the RDD and the RDD is needed again later). This seemed to work okay while my RDDs were fairly small or modest in size, but as they grew I started running into problems. After watching a recent (very good) screencast on performance, this doesn't seem like the right approach: I believe I'm breaking (or at least hindering) Spark's pipelining.

If I remove all of my counts, I'm left with a single job/action (the save-as-Hadoop-file at the end). Spark then runs more smoothly (and quite a bit faster), and I don't need (or want) to cache any of my intermediate RDDs.

So the approach I've been kicking around is to use accumulators instead. I was already using them to count 'bad' records, so why not 'good' records as well? (I've sketched roughly what I mean in the P.S. below.) I realize that if I lose a partition I might over-count, but perhaps that's an acceptable trade-off.

I'm guessing that others have run into this before, so I'd like to learn from their experience and how they addressed it.

Thanks.

Darin.
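P.S. Here's roughly the shape of what I have in mind, in case it helps the discussion. It's a simplified sketch, not my actual job: the paths, field layout, and accumulator names are made up, I'm using the Spark 2.x longAccumulator API (older releases would use sc.accumulator(0L)), and I've used saveAsTextFile just to keep it short where the real job does a save-as-Hadoop-file.

import org.apache.spark.{SparkConf, SparkContext}

object JoinWithCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-with-counts"))

    // Hypothetical inputs; the real job builds quite a few RDDs and joins them.
    val left  = sc.textFile("s3a://my-bucket/input/left/")
    val right = sc.textFile("s3a://my-bucket/input/right/")

    // One accumulator per count I care about.
    val badLeft   = sc.longAccumulator("badLeftRecords")
    val goodLeft  = sc.longAccumulator("goodLeftRecords")
    val joinedCnt = sc.longAccumulator("joinedRecords")

    // Count records as a side effect of transformations the job already performs,
    // instead of calling count() (an extra action that pushes me toward caching).
    val leftKeyed = left.flatMap { line =>
      val fields = line.split('\t')
      if (fields.length < 2) { badLeft.add(1); None }
      else { goodLeft.add(1); Some(fields(0) -> fields(1)) }
    }

    val rightKeyed = right.map { line =>
      val fields = line.split('\t')
      fields(0) -> fields(1)
    }

    val joined = leftKeyed.join(rightKeyed).map { case (k, (l, r)) =>
      joinedCnt.add(1)
      s"$k\t$l\t$r"
    }

    // The only action in the job: everything above stays pipelined, nothing is cached.
    joined.saveAsTextFile("s3a://my-bucket/output/")

    // Values are only meaningful after the action finishes; because the accumulators
    // are updated inside transformations, retried or recomputed tasks can over-count.
    println(s"good left: ${goodLeft.value}, bad left: ${badLeft.value}, joined: ${joinedCnt.value}")

    sc.stop()
  }
}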