I changed zipWithIndex to zipWithUniqueId and that seems to be working...

What's the difference between zipWithIndex and zipWithUniqueId?

For zipWithUniqueId we don't need to run a count to compute the per-partition
offsets that zipWithIndex requires, so zipWithUniqueId is the more efficient
of the two? It's not very clear from the docs...
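The difference between the two can be sketched on plain Scala collections. This is a simulation of the documented ID schemes, not the Spark API; the partition layout below is a made-up example:

```scala
// Simulated partitioned data: 3 partitions holding 2, 1, and 2 elements.
val partitions = Vector(Vector("a", "b"), Vector("c"), Vector("d", "e"))
val numPartitions = partitions.length

// zipWithIndex: contiguous global indices. Spark must first count the
// elements in every earlier partition to compute each partition's start
// offset, which triggers an extra job.
val offsets = partitions.map(_.size).scanLeft(0L)(_ + _)
val withIndex = partitions.zipWithIndex.flatMap { case (part, p) =>
  part.zipWithIndex.map { case (elem, i) => (elem, offsets(p) + i) }
}
println(withIndex)    // Vector((a,0), (b,1), (c,2), (d,3), (e,4))

// zipWithUniqueId: id = indexInPartition * numPartitions + partitionIndex.
// No counting pass is needed, but the ids are only unique, not contiguous.
val withUniqueId = partitions.zipWithIndex.flatMap { case (part, p) =>
  part.zipWithIndex.map { case (elem, i) =>
    (elem, i.toLong * numPartitions + p)
  }
}
println(withUniqueId) // Vector((a,0), (b,3), (c,1), (d,2), (e,5))
```

Note that neither scheme is stable across re-evaluations if the partition contents can come back in a different order, which is why persisting the zipped RDD is recommended below.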


On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.da...@gmail.com>
wrote:

> I did not persist / cache it, as I assumed zipWithIndex would preserve
> order...
>
> There is also zipWithUniqueId...I am trying that...If that also shows the
> same issue, we should make it clear in the docs...
>
> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> From offline question - zipWithIndex is being used to assign IDs. From a
>> recent JIRA discussion I understand this is not deterministic within a
>> partition so the index can be different when the RDD is reevaluated. If you
>> need it fixed, persist the zipped RDD on disk or in memory.
>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.da...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am building a dictionary of RDD[(String, Long)] and after the
>>> dictionary is built and cached, I find key "almonds" at value 5187 using:
>>>
>>> rdd.filter{case(product, index) => product == "almonds"}.collect
>>>
>>> Output:
>>>
>>> Debug product almonds index 5187
>>>
>>> Now I take the same dictionary and write it out as:
>>>
>>> dictionary.map{case(product, index) => product + "," + index}
>>> .saveAsTextFile(outputPath)
>>>
>>> Inside the map I also print what's the product at index 5187 and I get a
>>> different product:
>>>
>>> Debug Index 5187 userOrProduct cardigans
>>>
>>> Is this expected behavior from map?
>>>
>>> By the way "almonds" and "apparel-cardigans" are just one off in the
>>> index...
>>>
>>> I am using spark-1.1 but it's a snapshot..
>>>
>>> Thanks.
>>> Deb
>>>
>>>
>>>
>
