I did not persist / cache it, as I assumed zipWithIndex would preserve order...
There is also zipWithUniqueId... I am trying that... If that also shows the same issue, we should make it clear in the docs...

On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <so...@cloudera.com> wrote:

> From offline question - zipWithIndex is being used to assign IDs. From a
> recent JIRA discussion I understand this is not deterministic within a
> partition, so the index can be different when the RDD is reevaluated. If you
> need it fixed, persist the zipped RDD on disk or in memory.
>
> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.da...@gmail.com> wrote:
>
>> Hi,
>>
>> I am building a dictionary of RDD[(String, Long)], and after the
>> dictionary is built and cached, I find key "almonds" at value 5187 using:
>>
>> rdd.filter { case (product, index) => product == "almonds" }.collect
>>
>> Output:
>>
>> Debug product almonds index 5187
>>
>> Now I take the same dictionary and write it out as:
>>
>> dictionary.map { case (product, index) => product + "," + index }
>>   .saveAsTextFile(outputPath)
>>
>> Inside the map I also print what the product at index 5187 is, and I get a
>> different product:
>>
>> Debug Index 5187 userOrProduct cardigans
>>
>> Is this expected behavior from map?
>>
>> By the way, "almonds" and "apparel-cardigans" are just one off in the
>> index...
>>
>> I am using spark-1.1, but it's a snapshot...
>>
>> Thanks,
>> Deb
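To make the thread's suggested fix concrete, here is a minimal sketch of pinning the indices by persisting the zipped RDD before it is used twice. This is illustrative, not the original poster's code: the dictionary contents and the `outputPath` value are made-up placeholders, and it assumes a local SparkContext with spark-core on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StableZipWithIndex {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stable-zip").setMaster("local[2]"))

    // Hypothetical product list standing in for the real dictionary source.
    val products = sc.parallelize(Seq("almonds", "apparel-cardigans", "bananas"))

    // Persist immediately after zipWithIndex. Without this, each action
    // (the filter/collect lookup and the later saveAsTextFile) may
    // recompute the lineage and assign different indices.
    val dictionary = products
      .zipWithIndex()
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Both downstream uses now see the same (product, index) pairs.
    val lookup = dictionary.filter { case (product, _) => product == "almonds" }.collect()
    dictionary.map { case (product, index) => product + "," + index }
      .saveAsTextFile("/tmp/dictionary-out") // placeholder path

    sc.stop()
  }
}
```

`MEMORY_AND_DISK` is one reasonable choice here; checkpointing or writing the dictionary out once and reading it back are stronger guarantees if the cache could be evicted.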