Re: Distributed dictionary building

Sean Owen Sun, 21 Sep 2014 01:17:02 -0700

Reference - https://issues.apache.org/jira/browse/SPARK-3098
I imagine zipWithUniqueId is also affected, but the problem may just not have
shown up in your test.
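
To make the point about both methods concrete, here is a minimal sketch in
plain Scala (no Spark required) that models how the two methods number
elements, following the scheme described in the RDD API docs. The object name
ZipIds, the helper functions, and the sample data are hypothetical; each inner
Seq stands in for one partition. It is an illustration of the numbering only,
not Spark's actual implementation.

```scala
// Plain-Scala model of the ID schemes; each inner Seq plays the role of one partition.
object ZipIds {
  // Models RDD.zipWithIndex: IDs run consecutively across partitions, so each
  // partition needs an offset equal to the total size of the partitions before it.
  def zipWithIndex[T](parts: Seq[Seq[T]]): Seq[(T, Long)] = {
    val offsets = parts.map(_.size.toLong).scanLeft(0L)(_ + _)
    parts.zip(offsets).flatMap { case (part, offset) =>
      part.zipWithIndex.map { case (x, i) => (x, offset + i) }
    }
  }

  // Models RDD.zipWithUniqueId: item i of partition k gets i * n + k,
  // where n is the number of partitions. No sizes are needed up front.
  def zipWithUniqueId[T](parts: Seq[Seq[T]]): Seq[(T, Long)] = {
    val n = parts.size.toLong
    parts.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (x, i) => (x, i * n + k) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Same elements, but the first partition comes back in a different order
    // on the second evaluation (hypothetical data).
    val firstEval  = Seq(Seq("almonds", "cardigans"), Seq("socks"))
    val secondEval = Seq(Seq("cardigans", "almonds"), Seq("socks"))

    println(zipWithIndex(firstEval).toMap)
    println(zipWithIndex(secondEval).toMap)
    println(zipWithUniqueId(firstEval).toMap)
    println(zipWithUniqueId(secondEval).toMap)
  }
}
```

In this model "almonds" and "cardigans" trade IDs between the two evaluations
under either scheme, because both number elements by their position within
the partition. That is why persisting the zipped RDD, so that it is computed
only once, is the workaround.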


On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das <[email protected]> wrote:
> Some more debugging revealed that, as Sean said, I have to keep the
> dictionaries persisted until I am done with the RDD manipulation...
>
> Thanks Sean for the pointer... would it be possible to point me to the JIRA
> as well?
>
> Are there plans to make this more transparent for users?
>
> Is it possible for the DAG to speculate about such things, similar to branch
> prediction ideas from computer architecture?
>
>
>
> On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <[email protected]>
> wrote:
>>
>> I changed zipWithIndex to zipWithUniqueId and that seems to be working...
>>
>> What's the difference between zipWithIndex and zipWithUniqueId?
>>
>> Does zipWithUniqueId avoid the count that zipWithIndex has to run to
>> compute its per-partition offsets, and is it more efficient for that reason?
>> It's not very clear from the docs...
>>
>>
>> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <[email protected]>
>> wrote:
>>>
>>> I did not persist / cache it as I assumed zipWithIndex would preserve
>>> order...
>>>
>>> There is also zipWithUniqueId... I am trying that. If that also shows the
>>> same issue, we should make it clear in the docs...
>>>
>>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <[email protected]> wrote:
>>>>
>>>> From an offline question: zipWithIndex is being used to assign IDs. From a
>>>> recent JIRA discussion I understand that the ordering is not deterministic
>>>> within a partition, so the index can be different when the RDD is
>>>> reevaluated. If you need the IDs to stay fixed, persist the zipped RDD on
>>>> disk or in memory.
>>>>
>>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am building a dictionary of RDD[(String, Long)] and after the
>>>>> dictionary is built and cached, I find key "almonds" at value 5187 using:
>>>>>
>>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
>>>>>
>>>>> Output:
>>>>>
>>>>> Debug product almonds index 5187
>>>>>
>>>>> Now I take the same dictionary and write it out as:
>>>>>
>>>>> dictionary.map{case(product, index) => product + "," + index}
>>>>> .saveAsTextFile(outputPath)
>>>>>
>>>>> Inside the map I also print the product at index 5187, and I get
>>>>> a different product:
>>>>>
>>>>> Debug Index 5187 userOrProduct cardigans
>>>>>
>>>>> Is this expected behavior from map?
>>>>>
>>>>> By the way, "almonds" and "apparel-cardigans" are just one apart in the
>>>>> index...
>>>>>
>>>>> I am using spark-1.1, but it's a snapshot build.
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>>
>>>
>>
>
