From an offline question: zipWithIndex is being used to assign IDs. From a recent JIRA discussion I understand the index assignment is not deterministic within a partition, so the indices can differ when the RDD is re-evaluated. If you need the IDs to stay fixed, persist the zipped RDD to disk or memory.

On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.da...@gmail.com> wrote:
> Hi,
>
> I am building a dictionary of RDD[(String, Long)] and after the dictionary
> is built and cached, I find key "almonds" at value 5187 using:
>
>     rdd.filter{case(product, index) => product == "almonds"}.collect
>
> Output:
>
>     Debug product almonds index 5187
>
> Now I take the same dictionary and write it out as:
>
>     dictionary.map{case(product, index) => product + "," + index}
>       .saveAsTextFile(outputPath)
>
> Inside the map I also print what's the product at index 5187 and I get a
> different product:
>
>     Debug Index 5187 userOrProduct cardigans
>
> Is this an expected behavior from map?
>
> By the way, "almonds" and "apparel-cardigans" are just one off in the
> index...
>
> I am using spark-1.1, but it's a snapshot.
>
> Thanks.
> Deb
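To make the suggested fix concrete, here is a minimal Scala sketch of persisting and materializing the zipped RDD before reusing it. The `products` RDD and the storage level are illustrative assumptions; `dictionary` and `outputPath` follow the names in the original post:

```scala
import org.apache.spark.storage.StorageLevel

// products: RDD[String] of distinct product names (assumed to exist)
val dictionary = products
  .zipWithIndex()                         // RDD[(String, Long)]
  .persist(StorageLevel.MEMORY_AND_DISK)  // pin the (product, index) pairs

// Force materialization so the indices are computed once and cached
// before any downstream action re-evaluates the lineage.
dictionary.count()

// Both actions now see the same index for each product.
dictionary.filter { case (product, _) => product == "almonds" }.collect()
dictionary
  .map { case (product, index) => product + "," + index }
  .saveAsTextFile(outputPath)
```

Without the persist-then-materialize step, each action can re-run zipWithIndex over a recomputed lineage, which is how the same index can map to different products across actions.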