Some more debugging revealed that, as Sean said, I have to keep the dictionaries persisted until I am done with the RDD manipulation.
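
A minimal sketch of what keeping the dictionary persisted looks like, in case it helps others. The local master, the app name and the tiny `products` RDD are placeholders made up for illustration, not the real job. The idea is to persist the zipped RDD, force it once, and only unpersist after every downstream use has run:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistDictionarySketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup, for illustration only.
    val conf = new SparkConf().setAppName("persist-dictionary-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical input standing in for the real product names.
    val products: RDD[String] =
      sc.parallelize(Seq("almonds", "apparel-cardigans", "walnuts", "almonds"))

    // Persist the zipped RDD so later actions reuse the same indices
    // instead of re-running zipWithIndex on a recomputed parent.
    val dictionary: RDD[(String, Long)] =
      products.distinct().zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK)

    // Force evaluation once so the cached partitions exist.
    dictionary.count()

    // First use: look up one key.
    val almonds = dictionary.filter { case (product, _) => product == "almonds" }.collect()

    // Second use: format every entry, reading from the same persisted data.
    val lines = dictionary.map { case (product, index) => product + "," + index }.collect()

    almonds.foreach { case (product, index) => println(s"lookup: $product -> $index") }
    lines.foreach(println)

    // Release the cache only after all uses are done.
    dictionary.unpersist()
    sc.stop()
  }
}
```

Persisting avoids recomputation only while the cached partitions stay available. If they are lost (for example when an executor dies), Spark recomputes them and the indices could change again.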
Thanks Sean for the pointer. Would it be possible to point me to the JIRA as well? Are there plans to make this more transparent for users? Is it possible for the DAG to speculate about such things, similar to branch prediction ideas from computer architecture?

On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> I changed zipWithIndex to zipWithUniqueId and that seems to be working...
>
> What's the difference between zipWithIndex vs zipWithUniqueId ?
>
> For zipWithIndex we don't need to run the count to compute the offset
> which is needed for zipWithUniqueId and so zipWithIndex is efficient ? It's
> not very clear from docs...
>
>
> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> I did not persist / cache it as I assumed zipWithIndex will preserve
>> order...
>>
>> There is also zipWithUniqueId...I am trying that...If that also shows the
>> same issue, we should make it clear in the docs...
>>
>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> From offline question - zipWithIndex is being used to assign IDs. From a
>>> recent JIRA discussion I understand this is not deterministic within a
>>> partition so the index can be different when the RDD is reevaluated. If you
>>> need it fixed, persist the zipped RDD on disk or in memory.
>>>
>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am building a dictionary of RDD[(String, Long)] and after the
>>>> dictionary is built and cached, I find key "almonds" at value 5187 using:
>>>>
>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
>>>>
>>>> Output:
>>>>
>>>> Debug product almonds index 5187
>>>>
>>>> Now I take the same dictionary and write it out as:
>>>>
>>>> dictionary.map{case(product, index) => product + "," + index}
>>>> .saveAsTextFile(outputPath)
>>>>
>>>> Inside the map I also print what's the product at index 5187 and I get
>>>> a different product:
>>>>
>>>> Debug Index 5187 userOrProduct cardigans
>>>>
>>>> Is this an expected behavior from map ?
>>>>
>>>> By the way "almonds" and "apparel-cardigans" are just one off in the
>>>> index...
>>>>
>>>> I am using spark-1.1 but it's a snapshot..
>>>>
>>>> Thanks.
>>>> Deb
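
On the zipWithIndex vs zipWithUniqueId question quoted above: as far as the RDD API docs describe it, the cost is the other way around. zipWithIndex has to run a job to learn the sizes of the earlier partitions when the RDD has more than one partition. zipWithUniqueId needs no such job and gives the items in partition k the IDs k, n+k, 2n+k, ..., where n is the number of partitions. The sketch below only simulates those two numbering schemes on plain Scala collections. `ZipIdSketch` and its helpers are made-up names, and this is not Spark's actual implementation:

```scala
object ZipIdSketch {
  // Simulates the documented zipWithUniqueId scheme: the item at position i
  // of partition k gets ID i * n + k, where n is the number of partitions.
  // No partition sizes are needed, but the IDs can have gaps.
  def uniqueIds[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    val n = partitions.length
    partitions.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, i.toLong * n + k) }
    }
  }

  // Simulates the zipWithIndex scheme: contiguous indices ordered by
  // partition, then by position. Each partition needs the total size of all
  // earlier partitions as its offset, which is why a count is required.
  def contiguousIndices[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    val offsets = partitions.map(_.length.toLong).scanLeft(0L)(_ + _)
    partitions.zip(offsets).flatMap { case (part, offset) =>
      part.zipWithIndex.map { case (item, i) => (item, offset + i) }
    }
  }
}
```

Both schemes derive the ID from an item's position inside its partition, so under this reading neither one is stable if that order changes when the RDD is recomputed. Persisting the zipped RDD still looks like the safer fix.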