Reference: https://issues.apache.org/jira/browse/SPARK-3098

I imagine zipWithUniqueId is also affected, but the problem may simply not have shown up in your test.
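
In case it helps, here is a minimal sketch of the workaround discussed in this thread: zip once, persist the zipped RDD, and reuse only that persisted RDD. The object name PersistedDictionary, the helper buildDictionary, the sample product names and the local[2] master are all made up for illustration; this is not taken from your job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistedDictionary {

  // Hypothetical helper: assign an index to every distinct product and
  // pin the result, so later actions reuse the same (product, index) pairs
  // instead of re-running distinct + zipWithIndex.
  def buildDictionary(products: RDD[String]): RDD[(String, Long)] = {
    val dictionary = products
      .distinct()
      .zipWithIndex()
      .persist(StorageLevel.MEMORY_AND_DISK)
    dictionary.count() // run one action so the pairs are computed and stored now
    dictionary
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, only so the sketch runs standalone.
    val conf = new SparkConf().setAppName("PersistedDictionary").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data standing in for the real product list.
      val products = sc.parallelize(
        Seq("almonds", "cardigans", "apparel-cardigans", "almonds"), 2)
      val dictionary = buildDictionary(products)

      // Both actions below read the persisted pairs.
      val almonds = dictionary.filter { case (product, _) => product == "almonds" }.collect()
      val lines = dictionary.map { case (product, index) => product + "," + index }.collect()

      almonds.foreach { case (product, index) => println(s"Debug product $product index $index") }
      lines.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that persisting is not a hard guarantee: if a persisted partition is lost (for example when an executor dies), Spark recomputes it and the indices in that partition may change again. On the zipWithIndex vs zipWithUniqueId question, as far as I read the RDD API docs it is zipWithIndex that has to run an extra Spark job when the RDD has more than one partition, while zipWithUniqueId does not (items in the kth of n partitions get k, n+k, 2*n+k, ...).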

On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> Some more debug revealed that as Sean said I have to keep the dictionaries
> persisted till I am done with the RDD manipulation.....
>
> Thanks Sean for the pointer...would it be possible to point me to the JIRA
> as well ?
>
> Are there plans to make it more transparent for the users ?
>
> Is it possible for the DAG to speculate such things...similar to branch
> prediction ideas from comp arch...
>
>
> On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>>
>> I changed zipWithIndex to zipWithUniqueId and that seems to be working...
>>
>> What's the difference between zipWithIndex vs zipWithUniqueId ?
>>
>> For zipWithIndex we don't need to run the count to compute the offset
>> which is needed for zipWithUniqueId and so zipWithIndex is efficient ? It's
>> not very clear from docs...
>>
>>
>> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>>
>>> I did not persist / cache it as I assumed zipWithIndex will preserve
>>> order...
>>>
>>> There is also zipWithUniqueId...I am trying that...If that also shows the
>>> same issue, we should make it clear in the docs...
>>>
>>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>> From offline question - zipWithIndex is being used to assign IDs. From a
>>>> recent JIRA discussion I understand this is not deterministic within a
>>>> partition so the index can be different when the RDD is reevaluated. If you
>>>> need it fixed, persist the zipped RDD on disk or in memory.
>>>>
>>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.da...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am building a dictionary of RDD[(String, Long)] and after the
>>>>> dictionary is built and cached, I find key "almonds" at value 5187 using:
>>>>>
>>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
>>>>>
>>>>> Output:
>>>>>
>>>>> Debug product almonds index 5187
>>>>>
>>>>> Now I take the same dictionary and write it out as:
>>>>>
>>>>> dictionary.map{case(product, index) => product + "," + index}
>>>>> .saveAsTextFile(outputPath)
>>>>>
>>>>> Inside the map I also print what's the product at index 5187 and I get
>>>>> a different product:
>>>>>
>>>>> Debug Index 5187 userOrProduct cardigans
>>>>>
>>>>> Is this an expected behavior from map ?
>>>>>
>>>>> By the way "almonds" and "apparel-cardigans" are just one off in the
>>>>> index...
>>>>>
>>>>> I am using spark-1.1 but it's a snapshot..
>>>>>
>>>>> Thanks.
>>>>> Deb