I definitely see the value in this. However, I think at this point it would be an incompatible behavioral change. People often use count in Spark to exercise their DAG. Omitting processing steps that were previously included would likely mislead many users into thinking their pipeline was running faster.
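As a minimal sketch of what that means in practice (a local SparkContext is assumed here, and the println stands in for whatever side-effecting work a real map() might do):

    import org.apache.spark.{SparkConf, SparkContext}

    object CountExercisesDag {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("count-demo").setMaster("local[*]"))

        // The println is a stand-in for a side effect the user relies on firing.
        val processed = sc.parallelize(1 to 10).map { x =>
          println(s"processing $x")
          x * 2
        }

        // count() runs the whole lineage, so the printlns fire. A count() that
        // skipped the map() would still return 10, but the side effects would
        // silently disappear.
        println(processed.count())

        sc.stop()
      }
    }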
It's possible there might be room for something like a new smartCount API, or a new argument to count that allows it to avoid unnecessary transformations.

-Sandy

On Sat, Mar 28, 2015 at 6:10 AM, Sean Owen <so...@cloudera.com> wrote:
> No, I'm not saying side effects change the count. But not executing the map()
> function at all certainly has an effect on the side effects of that function:
> the side effects which should take place never do. I am not sure that is
> something to be 'fixed'; it's a legitimate question.
>
> You can persist an RDD if you do not want to compute it twice.
>
> On Sat, Mar 28, 2015 at 1:05 PM, jimfcarroll <jimfcarr...@gmail.com> wrote:
> > Hi Sean,
> >
> > Thanks for the response.
> >
> > I can't imagine a case (though my imagination may be somewhat limited) where
> > even map side effects could change the number of elements in the resulting
> > map.
> >
> > I guess "count" wouldn't officially be an 'action' if it were implemented
> > this way. At least it wouldn't ALWAYS be one.
> >
> > My example was contrived. We're passing RDDs to functions. If that RDD is an
> > instance of my class, then its count() may take a shortcut. If I
> > map/zip/zipWithIndex/mapPartitions/etc. first, then I'm stuck with a call
> > that literally takes 100s to 1000s of times longer (seconds vs. hours on
> > some of our datasets), and since my custom RDDs are immutable, they cache
> > the count call, so a second invocation is the cost of a method call's
> > overhead.
> >
> > I could fix this in Spark if there's any interest in that change. Otherwise
> > I'll need to overload more RDD methods for my own purposes (like all of the
> > transformations). Of course, that will be more difficult because those
> > intermediate classes (like MappedRDD) are private, so I can't extend them.
> >
> > Jim
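For reference, a minimal sketch of the persist() workaround Sean mentions above (a SparkContext named sc is assumed; the Thread.sleep is just a stand-in for an expensive lineage):

    import org.apache.spark.storage.StorageLevel

    // Cache after the costly transformations so later actions reuse the
    // materialized partitions instead of recomputing the whole lineage.
    val expensive = sc.parallelize(1 to 10).map { x => Thread.sleep(10); x * 2 }
    val cached = expensive.persist(StorageLevel.MEMORY_AND_DISK)

    cached.count()   // first action: computes and caches the partitions
    cached.count()   // second action: served from the cache; the map() does not rerun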