The number of records is not stored anywhere. You either need to save it at creation time or step through the RDD. On Wed, Jan 14, 2015 at 1:46 PM Michael Segel <msegel_had...@hotmail.com> wrote:
> Sorry, but the accumulator is still going to require you to walk through > the RDD to get an accurate count, right? > Its not being persisted? > > On Jan 14, 2015, at 5:17 AM, Ganelin, Ilya <ilya.gane...@capitalone.com> > wrote: > > Alternative to doing a naive toArray is to declare an accumulator per > partition and use that. It's specifically what they were designed to do. > See the programming guide. > > > > Sent with Good (www.good.com) > > > -----Original Message----- > *From: *Tobias Pfeiffer [t...@preferred.jp] > *Sent: *Tuesday, January 13, 2015 08:06 PM Eastern Standard Time > *To: *Kevin Burton > *Cc: *Ganelin, Ilya; user@spark.apache.org > *Subject: *Re: quickly counting the number of rows in a partition? > > Hi, > > On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya < > ilya.gane...@capitalone.com> wrote: > >> Use the mapPartitions function. It returns an iterator to each partition. >> Then just get that length by converting to an array. >> > > On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton <bur...@spinn3r.com> wrote: > >> Doesn’t that just read in all the values? The count isn’t pre-computed? >> It’s not the end of the world if it’s not but would be faster. >> > > Well, "converting to an array" may not work due to memory constraints, > counting the items in the iterator may be better. However, there is no > "pre-computed" value. For counting, you need to compute all values in the > RDD, in general. If you think of > > items.map(x => /* throw exception */).count() > > then even though the count you want to get does not necessarily require > the evaluation of the function in map() (i.e., the number is the same), you > may not want to get the count if that code actually fails. > > Tobias > > ------------------------------ > The information contained in this e-mail is confidential and/or > proprietary to Capital One and/or its affiliates. The information > transmitted herewith is intended only for use by the individual or entity > to which it is addressed. If the reader of this message is not the > intended recipient, you are hereby notified that any review, > retransmission, dissemination, distribution, copying or other use of, or > taking of any action in reliance upon this information is strictly > prohibited. If you have received this communication in error, please > contact the sender and delete the material from your computer. > > >