Ted. That one I know. It was the dependency part I was curious about.
On Feb 26, 2015 7:12 PM, "Ted Yu" <[email protected]> wrote:
> bq. whether or not rdd1 is a cached rdd
>
> RDD has getStorageLevel method which would return the RDD's current
> storage level.
>
> SparkContext has this method:
>
> /**
>  * Return information about what RDDs are cached, if they are in mem or
>  * on disk, how much space they take, etc.
>  */
> @DeveloperApi
> def getRDDStorageInfo: Array[RDDInfo] = {
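>
> A minimal sketch of using these from the driver (the app name, master,
> and sample RDD below are placeholders of mine):
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.storage.StorageLevel
>
> val sc = new SparkContext(
>   new SparkConf().setAppName("storage-check").setMaster("local[*]"))
> val rdd = sc.parallelize(1 to 100).cache()
> rdd.count()  // run an action so the cache is actually populated
>
> // Per-RDD check: StorageLevel.NONE means "not marked for persistence".
> val isPersisted = rdd.getStorageLevel != StorageLevel.NONE
>
> // Cluster-wide view: which RDDs are cached, where, and how much space.
> sc.getRDDStorageInfo.foreach(println)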
>
> Cheers
>
> On Thu, Feb 26, 2015 at 4:00 PM, Corey Nolet <[email protected]> wrote:
>
>> Zhan,
>>
>> This is exactly what I'm trying to do except, as I mentioned in my first
>> message, I am being given rdd1 and rdd2 only and I don't necessarily know
>> at that point whether or not rdd1 is a cached rdd. Further, I don't know at
>> that point whether or not rdd2 depends on rdd1.
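>>
>> (One way to check the dependency part by hand, for what it's worth, is
>> to walk the public dependency graph. A sketch, with a helper name of my
>> own:)
>>
>> import org.apache.spark.rdd.RDD
>>
>> // True if `target` appears anywhere in `rdd`'s lineage.
>> def dependsOn(rdd: RDD[_], target: RDD[_]): Boolean =
>>   rdd.dependencies.exists { dep =>
>>     (dep.rdd eq target) || dependsOn(dep.rdd, target)
>>   }
>>
>> // e.g. dependsOn(rdd2, rdd1) asks whether rdd2 was derived from rdd1.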
>>
>> On Thu, Feb 26, 2015 at 6:54 PM, Zhan Zhang <[email protected]>
>> wrote:
>>
>>> In this case, it is slow to wait for rdd1.saveAsHadoopFile(...) to
>>> finish, probably because of the write to HDFS. A workaround for this
>>> particular case may be as follows.
>>>
>>> import scala.concurrent._
>>> import ExecutionContext.Implicits.global
>>>
>>> val rdd1 = ......cache()
>>> val rdd2 = rdd1.map().....()
>>> rdd1.count  // materialize rdd1 into the cache before forking the saves
>>> future { rdd1.saveAsHadoopFile(...) }
>>> future { rdd2.saveAsHadoopFile(...) }
>>>
>>> In this way, rdd1 will be calculated only once, and the two
>>> saveAsHadoopFile calls will run concurrently.
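>>>
>>> A fuller sketch of the same idea (paths and transformations are
>>> placeholders, and saveAsTextFile stands in for saveAsHadoopFile for
>>> brevity; the Awaits keep the driver alive until both saves finish):
>>>
>>> import scala.concurrent.{Await, Future}
>>> import scala.concurrent.ExecutionContext.Implicits.global
>>> import scala.concurrent.duration.Duration
>>> import org.apache.spark.{SparkConf, SparkContext}
>>>
>>> val sc = new SparkContext(
>>>   new SparkConf().setAppName("concurrent-saves").setMaster("local[*]"))
>>> val rdd1 = sc.textFile("/tmp/input").map(_.toUpperCase).cache()
>>> val rdd2 = rdd1.map(line => line.length.toString)
>>>
>>> rdd1.count()  // materialize rdd1 into the cache before forking
>>>
>>> val saves = Seq(
>>>   Future { rdd1.saveAsTextFile("/tmp/out1") },
>>>   Future { rdd2.saveAsTextFile("/tmp/out2") })
>>> saves.foreach(Await.result(_, Duration.Inf))  // block until both finish
>>> sc.stop()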
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>>
>>>
>>> On Feb 26, 2015, at 3:28 PM, Corey Nolet <[email protected]> wrote:
>>>
>>> > What confused me is the statement "The final result is that
>>> rdd1 is calculated twice." Is it the expected behavior?
>>>
>>> To be perfectly honest, performing an action on a cached RDD in two
>>> different threads, and having them block (at the partition level) until
>>> the parent's partitions are cached, is exactly the behavior my coworkers
>>> and I expected.
>>>
>>> On Thu, Feb 26, 2015 at 6:26 PM, Corey Nolet <[email protected]> wrote:
>>>
>>>> I should probably mention that my example case is much
>>>> oversimplified. Let's say I've got a tree, a fairly complex one, where I
>>>> begin a series of jobs at the root which calculate a bunch of really
>>>> complex joins, and as I move down the tree I'm creating reports from the
>>>> data that's already been joined (I've implemented logic to determine when
>>>> cached items can be cleaned up, e.g. when the last report in a subtree
>>>> has been run).
>>>>
>>>> My issue is that the 'actions' on the RDDs are currently being
>>>> run in a single thread. Even if I have to wait for a cache to complete
>>>> fully before I run the "children" jobs, I'd still be in a better place
>>>> than I am now, because I'd be able to run those jobs concurrently; right
>>>> now that is not the case.
>>>>
>>>> > What you want is for a request for partition X to wait if partition X
>>>> is already being calculated in a persisted RDD.
>>>>
>>>> I totally agree, and if I could get it to wait at the granularity of
>>>> the partition, I'd be in a much better place. I feel like I'm going down
>>>> a rabbit hole and working against the Spark API.
>>>>
>>>>
>>>> On Thu, Feb 26, 2015 at 6:03 PM, Sean Owen <[email protected]> wrote:
>>>>
>>>>> To distill this a bit further, I don't think you actually want rdd2 to
>>>>> wait on rdd1 in this case. What you want is for a request for
>>>>> partition X to wait if partition X is already being calculated in a
>>>>> persisted RDD. Otherwise the first partition of rdd2 waits on the
>>>>> final partition of rdd1 even when the rest is ready.
>>>>>
>>>>> That is probably a good idea in almost all cases, though I don't
>>>>> know how hard it would be to implement. I suspect it's easier to
>>>>> handle at that level than as a function of the dependency graph.
>>>>>
>>>>> On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet <[email protected]>
>>>>> wrote:
>>>>> > I'm trying to do the scheduling myself now, to determine that rdd2
>>>>> > depends on rdd1 and rdd1 is a persistent RDD (storage level != None)
>>>>> > so that I can do the no-op on rdd1 before I run rdd2. I would much
>>>>> > rather the DAG figure this out so I don't need to think about all
>>>>> > this.
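>>>>>
>>>>> A sketch of that manual step, for reference (the helper name is mine,
>>>>> and it assumes the dependency check, e.g. a walk of rdd.dependencies
>>>>> like the one sketched earlier in the thread, has already passed):
>>>>>
>>>>> import org.apache.spark.rdd.RDD
>>>>> import org.apache.spark.storage.StorageLevel
>>>>>
>>>>> // Materialize a persisted parent before launching a child's action.
>>>>> def noOpThenRun(parent: RDD[_], runChild: () => Unit): Unit = {
>>>>>   if (parent.getStorageLevel != StorageLevel.NONE) {
>>>>>     parent.count()  // the "no-op" action: fills the cache, result discarded
>>>>>   }
>>>>>   runChild()
>>>>> }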
>>>>>
>>>>
>>>>
>>>
>>>
>>
>