I changed zipWithIndex to zipWithUniqueId and that seems to be working... What's the difference between zipWithIndex vs zipWithUniqueId ?
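My understanding from the RDD API docs (worth double-checking): zipWithIndex assigns contiguous indices 0..N-1, which requires running a Spark job first to count the sizes of the earlier partitions; zipWithUniqueId assigns element i of partition k the id k + i*n (n = number of partitions), which needs no counting job but gives unique, non-contiguous ids. A plain-Scala simulation of the two schemes (the partition contents here are made up):

```scala
// Hypothetical data pre-split into partitions (assumption: 2 partitions)
val partitions = Seq(Seq("a", "b", "c"), Seq("d", "e"))
val n = partitions.length

// zipWithIndex-style: contiguous global indices; needs the sizes of all
// earlier partitions first, which is why Spark runs a job to count them.
val offsets = partitions.map(_.size).scanLeft(0L)(_ + _)
val withIndex = partitions.zipWithIndex.flatMap { case (part, p) =>
  part.zipWithIndex.map { case (x, i) => (x, offsets(p) + i) }
}

// zipWithUniqueId-style: element i of partition p gets id p + i * n;
// ids are unique but not contiguous, and no counting job is needed.
val withUniqueId = partitions.zipWithIndex.flatMap { case (part, p) =>
  part.zipWithIndex.map { case (x, i) => (x, p.toLong + i.toLong * n) }
}
```

So "d" (first element of partition 1) gets index 3 under the first scheme but id 1 under the second. Neither scheme makes the ids stable if the upstream RDD's ordering changes on re-evaluation.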
For zipWithIndex we don't need to run a count to compute the offset, which is needed for zipWithUniqueId, and so zipWithIndex is more efficient? It's not very clear from the docs...

On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> I did not persist / cache it as I assumed zipWithIndex will preserve
> order...
>
> There is also zipWithUniqueId... I am trying that... If that also shows the
> same issue, we should make it clear in the docs...
>
> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> From an offline question - zipWithIndex is being used to assign IDs. From a
>> recent JIRA discussion I understand this is not deterministic within a
>> partition, so the index can be different when the RDD is reevaluated. If you
>> need it fixed, persist the zipped RDD on disk or in memory.
>>
>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.da...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am building a dictionary of RDD[(String, Long)], and after the
>>> dictionary is built and cached, I find the key "almonds" at value 5187 using:
>>>
>>> rdd.filter { case (product, index) => product == "almonds" }.collect
>>>
>>> Output:
>>>
>>> Debug product almonds index 5187
>>>
>>> Now I take the same dictionary and write it out as:
>>>
>>> dictionary.map { case (product, index) => product + "," + index }
>>>   .saveAsTextFile(outputPath)
>>>
>>> Inside the map I also print what the product is at index 5187, and I get a
>>> different product:
>>>
>>> Debug Index 5187 userOrProduct cardigans
>>>
>>> Is this expected behavior from map?
>>>
>>> By the way, "almonds" and "apparel-cardigans" are just one off in the
>>> index...
>>>
>>> I am using spark-1.1, but it's a snapshot...
>>>
>>> Thanks.
>>> Deb
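Sean's suggested fix can be sketched like this (a minimal sketch assuming a local SparkContext; the data and names are illustrative): persist the zipped RDD and materialize it once, so every later action reuses the cached ids instead of recomputing the zip over a possibly reordered upstream RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal local sketch; in a real job the SparkContext and data
// would already exist.
val sc = new SparkContext(
  new SparkConf().setAppName("dict-demo").setMaster("local[2]"))
val products = sc.parallelize(
  Seq("almonds", "apparel-cardigans", "walnuts"), 2)

// Without cache(), each action may re-evaluate the zip and, if the
// upstream ordering is not deterministic, assign different ids each time.
val dictionary = products.zipWithUniqueId().cache()
dictionary.count()  // materialize once so later actions see the same ids

// Both of these now observe a consistent (product, id) mapping, so a
// filter-based lookup and a saveAsTextFile would agree.
val byProduct = dictionary.collect().toMap
val byId = dictionary.map(_.swap).collect().toMap
sc.stop()
```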