zipWithUniqueId is also affected. I had to persist the dictionaries before I could rely on the indices further down in the flow.
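For anyone finding this thread later, here is a minimal sketch of the persist-then-reuse pattern described above. It is not the actual job from this thread: the local master, the app name, and the sample products are assumptions for illustration. Note also that persisting only avoids recomputation while the cached partitions survive; a lost partition can still be recomputed with different ids.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistedDictionary {
  def main(args: Array[String]): Unit = {
    // Assumption: local master and toy data, only for illustration.
    val conf = new SparkConf().setAppName("persisted-dictionary").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val products = sc.parallelize(Seq("almonds", "cardigans", "almonds", "walnuts"))

      // distinct() shuffles, so element order inside a partition may differ
      // between evaluations. Persist right after zipping so later actions
      // reuse the same (product, index) pairs instead of recomputing them.
      val dictionary = products.distinct()
        .zipWithIndex()
        .persist(StorageLevel.MEMORY_AND_DISK)

      dictionary.count() // materialize the persisted partitions once

      // Both actions below read the persisted pairs.
      val almonds = dictionary.filter { case (product, _) => product == "almonds" }.collect()
      val lines = dictionary.map { case (product, index) => product + "," + index }.collect()

      almonds.foreach { case (product, index) => println(s"Debug product $product index $index") }
      lines.foreach(println)

      dictionary.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

The unpersist call comes only after every consumer of the indices has run, which matches the "keep it persisted till I am done" advice quoted below.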
On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen <so...@cloudera.com> wrote:

> Reference - https://issues.apache.org/jira/browse/SPARK-3098
> I imagine zipWithUniqueID is also affected, but may not happen to have
> exhibited in your test.
>
> On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> > Some more debug revealed that as Sean said I have to keep the
> > dictionaries persisted till I am done with the RDD manipulation.....
> >
> > Thanks Sean for the pointer...would it be possible to point me to the
> > JIRA as well ?
> >
> > Are there plans to make it more transparent for the users ?
> >
> > Is it possible for the DAG to speculate such things...similar to branch
> > prediction ideas from comp arch...
> >
> > On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.da...@gmail.com>
> > wrote:
> >>
> >> I changed zipWithIndex to zipWithUniqueId and that seems to be
> >> working...
> >>
> >> What's the difference between zipWithIndex vs zipWithUniqueId ?
> >>
> >> For zipWithIndex we don't need to run the count to compute the offset
> >> which is needed for zipWithUniqueId and so zipWithIndex is efficient ?
> >> It's not very clear from docs...
> >>
> >> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.da...@gmail.com>
> >> wrote:
> >>>
> >>> I did not persist / cache it as I assumed zipWithIndex will preserve
> >>> order...
> >>>
> >>> There is also zipWithUniqueId...I am trying that...If that also shows
> >>> the same issue, we should make it clear in the docs...
> >>>
> >>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <so...@cloudera.com> wrote:
> >>>>
> >>>> From offline question - zipWithIndex is being used to assign IDs.
> >>>> From a recent JIRA discussion I understand this is not deterministic
> >>>> within a partition so the index can be different when the RDD is
> >>>> reevaluated. If you need it fixed, persist the zipped RDD on disk or
> >>>> in memory.
> >>>>
> >>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.da...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am building a dictionary of RDD[(String, Long)] and after the
> >>>>> dictionary is built and cached, I find key "almonds" at value 5187
> >>>>> using:
> >>>>>
> >>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
> >>>>>
> >>>>> Output:
> >>>>>
> >>>>> Debug product almonds index 5187
> >>>>>
> >>>>> Now I take the same dictionary and write it out as:
> >>>>>
> >>>>> dictionary.map{case(product, index) => product + "," + index}
> >>>>>   .saveAsTextFile(outputPath)
> >>>>>
> >>>>> Inside the map I also print what's the product at index 5187 and I
> >>>>> get a different product:
> >>>>>
> >>>>> Debug Index 5187 userOrProduct cardigans
> >>>>>
> >>>>> Is this an expected behavior from map ?
> >>>>>
> >>>>> By the way "almonds" and "apparel-cardigans" are just one off in the
> >>>>> index...
> >>>>>
> >>>>> I am using spark-1.1 but it's a snapshot..
> >>>>>
> >>>>> Thanks.
> >>>>> Deb
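On the zipWithIndex vs zipWithUniqueId question quoted in this thread: as far as the Spark API docs describe it, the cost is the other way around. zipWithIndex has to run an extra job to learn partition sizes when the RDD has more than one partition, while zipWithUniqueId does not, at the price of non-consecutive ids. The sketch below imitates the two numbering schemes in plain Scala, with no Spark involved. The object and method names are hypothetical, and nested Seqs stand in for partitions.

```scala
object ZipIdSchemes {
  // Imitates zipWithIndex: ids are consecutive across partitions, so each
  // partition needs the combined size of all partitions before it.
  def withIndex[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
    val offsets = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zip(offsets).map { case (part, offset) =>
      part.zipWithIndex.map { case (item, i) => (item, offset + i) }
    }
  }

  // Imitates zipWithUniqueId: item i of partition k gets id i * n + k, where
  // n is the number of partitions. No sizes are needed, but ids can have gaps.
  def withUniqueId[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
    val n = partitions.size.toLong
    partitions.zipWithIndex.map { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, i * n + k) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Two toy "partitions" of sizes 3 and 1.
    val partitions = Seq(Seq("almonds", "cardigans", "walnuts"), Seq("socks"))
    println(withIndex(partitions))
    println(withUniqueId(partitions))
  }
}
```

In both schemes the id depends on an item's position inside its partition, so if that order changes when the RDD is reevaluated, either method can hand out different ids. That is consistent with zipWithUniqueId also being affected.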