Hey Darin,

Record count metrics are coming in Spark 1.3. Can you wait until it is released, or do you need a solution for older versions of Spark?
Kostas

On Friday, February 27, 2015, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:

> I have a fairly large Spark job where I'm essentially creating quite a few
> RDDs, doing several types of joins using these RDDs, resulting in a final RDD
> which I write back to S3.
>
> Along the way, I would like to capture record counts for some of these
> RDDs. My initial approach was to use the count action on some of these
> intermediate RDDs (and cache them, since the count would force the
> materialization of the RDD and the RDD would be needed again later). This
> seemed to work 'ok' when my RDDs were fairly small/modest, but as they grew
> in size I started to experience problems.
>
> After watching a recent, very good screencast on performance, this doesn't
> seem to be the correct approach, as I believe I'm really breaking (or
> hindering) the pipelining concept in Spark. If I remove all of my counts,
> I'm only left with the one job/action (save as Hadoop file at the end).
> Spark then seems to run smoother (and quite a bit faster), and I really
> don't need (or want) to even cache any of my intermediate RDDs.
>
> So, the approach I've been kicking around is to use accumulators instead.
> I was already using them to count 'bad' records, so why not 'good' records
> as well? I realize that if I lose a partition I might over-count, but
> perhaps that is an acceptable trade-off.
>
> I'm guessing that others have run into this before, so I would like to
> learn from the experience of others and how they have addressed this.
>
> Thanks.
>
> Darin.