distinct is lazy because the map and reduceByKey functions it calls are
lazy as well.  When they're called, the only thing that happens is that
state is built up on the client side.  distinct will return an RDD for the
map operation that points to the RDD that it depends on, that in turn point
to the RDDs that they depend on. These collectively form a transformation
graph that the scheduler can use to create tasks when an action operation
is called.


On Tue, Mar 11, 2014 at 10:15 PM, David Thomas <dt5434...@gmail.com> wrote:

> I think you misunderstood my question - I should have stated it better.
> I'm not saying it should be applied immediately, but I'm trying to
> understand how Spark achieves this lazy computation transformations. May be
> this is due to my ignorance of how Scala works, but when I see the code, I
> see that the function is applied to the elements of RDD when I call
> distinct - or is it not applied immediately? How does the returned RDD
> 'keep track of the operation'?
>
>
> On Tue, Mar 11, 2014 at 10:06 PM, Ewen Cheslack-Postava 
> <m...@ewencp.org>wrote:
>
>> You should probably be asking the opposite question: why do you think it
>> *should* be applied immediately? Since the driver program hasn't requested
>> any data back (distinct generates a new RDD, it doesn't return any data),
>> there's no need to actually compute anything yet.
>>
>> As the documentation describes, if the call returns an RDD, it's
>> transforming the data and will just keep track of the operation it
>> eventually needs to perform. Only methods that return data back to the
>> driver should trigger any computation.
>>
>> (The one known exception is sortByKey, which really should be lazy, but
>> apparently uses an RDD.count call in its implementation:
>> https://spark-project.atlassian.net/browse/SPARK-1021).
>>
>>   David Thomas <dt5434...@gmail.com>
>>  March 11, 2014 at 9:49 PM
>> For example, is distinct() transformation lazy?
>>
>> when I see the Spark source code, distinct applies a map-> reduceByKey ->
>> map function to the RDD elements. Why is this lazy? Won't the function be
>> applied immediately to the elements of RDD when I call someRDD.distinct?
>>
>>   /**
>>    * Return a new RDD containing the distinct elements in this RDD.
>>    */
>>   def distinct(numPartitions: Int): RDD[T] =
>>     map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
>>
>>   /**
>>    * Return a new RDD containing the distinct elements in this RDD.
>>    */
>>   def distinct(): RDD[T] = distinct(partitions.size)
>>
>>
>

<<inline: compose-unknown-contact.jpg>>

Reply via email to