Thanks for your quick reply. Yes, that would be fine. I would rather wait and use the optimal approach than hack together some one-off solution.
Darin.

________________________________
From: Kostas Sakellis <kos...@cloudera.com>
To: Darin McBeath <ddmcbe...@yahoo.com>
Cc: User <user@spark.apache.org>
Sent: Friday, February 27, 2015 12:19 PM
Subject: Re: Question about Spark best practice when counting records.

Hey Darin,

Record count metrics are coming in Spark 1.3. Can you wait until it is released, or do you need a solution in older versions of Spark?

Kostas

On Friday, February 27, 2015, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:

> I have a fairly large Spark job where I'm essentially creating quite a few
> RDDs, doing several types of joins with these RDDs, and producing a final
> RDD which I write back to S3.
>
> Along the way, I would like to capture record counts for some of these RDDs.
> My initial approach was to use the count action on some of these intermediate
> RDDs (and cache them, since the count would force the materialization of the
> RDD and the RDD would be needed again later). This seemed to work okay when
> my RDDs were fairly small/modest, but as they grew in size I started to
> experience problems.
>
> After watching a recent, very good screencast on performance, this doesn't
> seem to be the correct approach, as I believe I'm really breaking (or
> hindering) the pipelining concept in Spark. If I remove all of my counts,
> I'm left with only the one job/action (the save-as-Hadoop-file at the end).
> Spark then seems to run more smoothly (and quite a bit faster), and I really
> don't need (or want) to cache any of my intermediate RDDs.
>
> So, the approach I've been kicking around is to use accumulators instead. I
> was already using them to count 'bad' records, so why not 'good' records as
> well? I realize that if I lose a partition I might over-count, but perhaps
> that is an acceptable trade-off.
>
> I'm guessing that others have run into this before, so I would like to learn
> from their experience and how they have addressed this.
>
> Thanks.
>
> Darin.
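
For readers following the thread, here is a minimal sketch of the count-and-cache pattern Darin describes, where each count() is an extra action that forces materialization and breaks pipelining. The S3 paths, join keys, and tab-delimited parsing are hypothetical placeholders, not details from the original job.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions such as join (pre-1.3 Spark)
import org.apache.spark.storage.StorageLevel

object CountAndCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("count-and-cache"))

    // Hypothetical inputs: key each record by its first tab-delimited field.
    val left  = sc.textFile("s3n://some-bucket/left/*").map(l => (l.split("\t")(0), l))
    val right = sc.textFile("s3n://some-bucket/right/*").map(l => (l.split("\t")(0), l))

    // Cache the intermediate RDD because count() materializes it
    // and it is needed again for the save below.
    val joined = left.join(right).persist(StorageLevel.MEMORY_AND_DISK)

    // Extra action just to get a record count: this runs its own job
    // and prevents the whole lineage from pipelining into one pass.
    println("joined records: " + joined.count())

    joined.map { case (k, (l, r)) => k + "\t" + l + "\t" + r }
      .saveAsTextFile("s3n://some-bucket/output")

    sc.stop()
  }
}

Every count() here launches a separate job over the cached RDD, which is the overhead that becomes painful as the data grows.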
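
And a sketch of the accumulator alternative Darin is considering, assuming the Spark 1.x SparkContext.accumulator API; the input path and the 'good'/'bad' record test are again invented for illustration. The counts are bumped inside the transformation, so the job keeps a single action and nothing has to be cached just for counting.

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("accumulator-counts"))

    // Named accumulators (Spark 1.x API) also show up in the web UI.
    val goodRecords = sc.accumulator(0L, "good records")
    val badRecords  = sc.accumulator(0L, "bad records")

    // Hypothetical validity test: a record is 'good' if it has at least two fields.
    val parsed = sc.textFile("s3n://some-bucket/input/*").flatMap { line =>
      val fields = line.split("\t")
      if (fields.length >= 2) {
        goodRecords += 1L
        Some(fields(0) + "\t" + fields(1))
      } else {
        badRecords += 1L
        None
      }
    }

    // Single action: the whole lineage pipelines into this one save job.
    parsed.saveAsTextFile("s3n://some-bucket/output")

    // Accumulator values are only dependable after the action completes,
    // and can over-count if tasks are retried or partitions are recomputed,
    // which is the trade-off Darin mentions.
    println("good=" + goodRecords.value + " bad=" + badRecords.value)

    sc.stop()
  }
}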