Thanks for your quick reply. Yes, that would be fine. I would rather wait and use the optimal approach than hack together some one-off solution.
Darin.

________________________________
From: Kostas Sakellis <kos...@cloudera.com>
To: Darin McBeath <ddmcbe...@yahoo.com>
Cc: User <user@spark.apache.org>
Sent: Friday, February 27, 2015 12:19 PM
Subject: Re: Question about Spark best practice when counting records.

Hey Darin,

Record count metrics are coming in Spark 1.3. Can you wait until it is released, or do you need a solution in older versions of Spark?

Kostas

On Friday, February 27, 2015, Darin McBeath <ddmcbe...@yahoo.com.invalid> wrote:

> I have a fairly large Spark job where I'm essentially creating quite a few
> RDDs, doing several types of joins with these RDDs, and producing a final
> RDD which I write back to S3.
>
> Along the way, I would like to capture record counts for some of these RDDs.
> My initial approach was to use the count action on some of these intermediate
> RDDs (and cache them, since the count would force the materialization of the
> RDD and the RDD would be needed again later). This seemed to work okay when
> my RDDs were fairly small/modest, but as they grew in size I started to
> experience problems.
>
> After watching a recent, very good screencast on performance, this doesn't
> seem to be the correct approach, as I believe I'm really breaking (or
> hindering) the pipelining concept in Spark. If I remove all of my counts,
> I'm left with only the one job/action (the save-as-Hadoop-file at the end).
> Spark then seems to run more smoothly (and quite a bit faster), and I really
> don't need (or want) to cache any of my intermediate RDDs.
>
> So, the approach I've been kicking around is to use accumulators instead. I
> was already using them to count 'bad' records, so why not 'good' records as
> well? I realize that if I lose a partition I might over-count, but perhaps
> that is an acceptable trade-off.
>
> I'm guessing that others have run into this before, so I would like to learn
> from their experience and how they have addressed this.
>
> Thanks.
>
> Darin.
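
For readers following the thread, here is a minimal sketch of the count-and-cache pattern Darin describes, where each count() is an extra action that forces materialization and breaks pipelining. The S3 paths, join keys, and tab-delimited parsing are hypothetical placeholders, not details from the original job.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions such as join (pre-1.3 Spark)
import org.apache.spark.storage.StorageLevel

object CountAndCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("count-and-cache"))

    // Hypothetical inputs: key each record by its first tab-delimited field.
    val left  = sc.textFile("s3n://some-bucket/left/*").map(l => (l.split("\t")(0), l))
    val right = sc.textFile("s3n://some-bucket/right/*").map(l => (l.split("\t")(0), l))

    // Cache the intermediate RDD because count() materializes it
    // and it is needed again for the save below.
    val joined = left.join(right).persist(StorageLevel.MEMORY_AND_DISK)

    // Extra action just to get a record count: this runs its own job
    // and prevents the whole lineage from pipelining into one pass.
    println("joined records: " + joined.count())

    joined.map { case (k, (l, r)) => k + "\t" + l + "\t" + r }
      .saveAsTextFile("s3n://some-bucket/output")

    sc.stop()
  }
}

Every count() here launches a separate job over the cached RDD, which is the overhead that becomes painful as the data grows.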
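
And a sketch of the accumulator alternative Darin is considering, assuming the Spark 1.x SparkContext.accumulator API; the input path and the 'good'/'bad' record test are again invented for illustration. The counts are bumped inside the transformation, so the job keeps a single action and nothing has to be cached just for counting.

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("accumulator-counts"))

    // Named accumulators (Spark 1.x API) also show up in the web UI.
    val goodRecords = sc.accumulator(0L, "good records")
    val badRecords  = sc.accumulator(0L, "bad records")

    // Hypothetical validity test: a record is 'good' if it has at least two fields.
    val parsed = sc.textFile("s3n://some-bucket/input/*").flatMap { line =>
      val fields = line.split("\t")
      if (fields.length >= 2) {
        goodRecords += 1L
        Some(fields(0) + "\t" + fields(1))
      } else {
        badRecords += 1L
        None
      }
    }

    // Single action: the whole lineage pipelines into this one save job.
    parsed.saveAsTextFile("s3n://some-bucket/output")

    // Accumulator values are only dependable after the action completes,
    // and can over-count if tasks are retried or partitions are recomputed,
    // which is the trade-off Darin mentions.
    println("good=" + goodRecords.value + " bad=" + badRecords.value)

    sc.stop()
  }
}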