Hey Darin,

Record count metrics are coming in Spark 1.3. Can you wait until it is released, or do you need a solution for older versions of Spark?
Kostas

On Friday, February 27, 2015, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:

> I have a fairly large Spark job where I'm essentially creating quite a few
> RDDs, doing several types of joins using these RDDs, resulting in a final RDD
> which I write back to S3.
>
> Along the way, I would like to capture record counts for some of these
> RDDs. My initial approach was to use the count action on some of these
> intermediate RDDs (and cache them, since the count would force the
> materialization of the RDD and the RDD would be needed again later). This
> seemed to work 'ok' when my RDDs were fairly small/modest, but as they grew
> in size I started to experience problems.
>
> After watching a recent, very good screencast on performance, this doesn't
> seem to be the correct approach, as I believe I'm really breaking (or
> hindering) the pipelining concept in Spark. If I remove all of my counts,
> I'm only left with the one job/action (save as Hadoop file at the end).
> Spark then seems to run smoother (and quite a bit faster), and I really
> don't need (or want) to even cache any of my intermediate RDDs.
>
> So, the approach I've been kicking around is to use accumulators instead.
> I was already using them to count 'bad' records, so why not 'good' records
> as well? I realize that if I lose a partition I might over-count, but
> perhaps that is an acceptable trade-off.
>
> I'm guessing that others have run into this before, so I would like to
> learn from the experience of others and how they have addressed this.
>
> Thanks.
>
> Darin.