great, thanks

--
Nan Zhu

On Tuesday, September 23, 2014 at 9:58 AM, Sean Owen wrote:
> Yes, Matei made a JIRA last week and I just suggested a PR:
> https://github.com/apache/spark/pull/2508
> On Sep 23, 2014 2:55 PM, "Nan Zhu" <zhunanmcg...@gmail.com> wrote:
> > shall we document this in the API doc?
> >
> > Best,
> >
> > --
> > Nan Zhu
> >
> > On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:
> >
> > > zipWithUniqueId is also affected...
> > >
> > > I had to persist the dictionaries to make use of the indices lower down
> > > in the flow...
> > >
> > > On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen <so...@cloudera.com> wrote:
> > > > Reference - https://issues.apache.org/jira/browse/SPARK-3098
> > > > I imagine zipWithUniqueID is also affected, but may not happen to have
> > > > exhibited in your test.
> > > >
> > > > On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> > > > > Some more debug revealed that as Sean said I have to keep the
> > > > > dictionaries persisted till I am done with the RDD manipulation.....
> > > > >
> > > > > Thanks Sean for the pointer...would it be possible to point me to the
> > > > > JIRA as well ?
> > > > >
> > > > > Are there plans to make it more transparent for the users ?
> > > > >
> > > > > Is it possible for the DAG to speculate such things...similar to
> > > > > branch prediction ideas from comp arch...
> > > > >
> > > > > On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> > > > >>
> > > > >> I changed zipWithIndex to zipWithUniqueId and that seems to be
> > > > >> working...
> > > > >>
> > > > >> What's the difference between zipWithIndex vs zipWithUniqueId ?
> > > > >>
> > > > >> For zipWithIndex we don't need to run the count to compute the offset
> > > > >> which is needed for zipWithUniqueId and so zipWithIndex is efficient ?
> > > > >> It's not very clear from docs...
> > > > >>
> > > > >> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> > > > >>>
> > > > >>> I did not persist / cache it as I assumed zipWithIndex will preserve
> > > > >>> order...
> > > > >>>
> > > > >>> There is also zipWithUniqueId...I am trying that...If that also
> > > > >>> shows the same issue, we should make it clear in the docs...
> > > > >>>
> > > > >>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <so...@cloudera.com> wrote:
> > > > >>>>
> > > > >>>> From offline question - zipWithIndex is being used to assign IDs.
> > > > >>>> From a recent JIRA discussion I understand this is not deterministic
> > > > >>>> within a partition so the index can be different when the RDD is
> > > > >>>> reevaluated. If you need it fixed, persist the zipped RDD on disk
> > > > >>>> or in memory.
> > > > >>>>
> > > > >>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.da...@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>> Hi,
> > > > >>>>>
> > > > >>>>> I am building a dictionary of RDD[(String, Long)] and after the
> > > > >>>>> dictionary is built and cached, I find key "almonds" at value
> > > > >>>>> 5187 using:
> > > > >>>>>
> > > > >>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
> > > > >>>>>
> > > > >>>>> Output:
> > > > >>>>>
> > > > >>>>> Debug product almonds index 5187
> > > > >>>>>
> > > > >>>>> Now I take the same dictionary and write it out as:
> > > > >>>>>
> > > > >>>>> dictionary.map{case(product, index) => product + "," + index}
> > > > >>>>> .saveAsTextFile(outputPath)
> > > > >>>>>
> > > > >>>>> Inside the map I also print what's the product at index 5187 and
> > > > >>>>> I get a different product:
> > > > >>>>>
> > > > >>>>> Debug Index 5187 userOrProduct cardigans
> > > > >>>>>
> > > > >>>>> Is this an expected behavior from map ?
> > > > >>>>>
> > > > >>>>> By the way "almonds" and "apparel-cardigans" are just one off in
> > > > >>>>> the index...
> > > > >>>>>
> > > > >>>>> I am using spark-1.1 but it's a snapshot..
> > > > >>>>>
> > > > >>>>> Thanks.
> > > > >>>>> Deb
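
For anyone reading this thread later, here is a rough sketch of the workaround discussed above: persist the zipped RDD so that later actions reuse the indices assigned on the first evaluation. The names StableDictionary and buildDictionary and the sample data are made up for illustration and are not from the original code. The sortBy step is an extra assumption on my part, not something suggested in the thread; it fixes a total order before zipWithIndex, so a recomputed partition should still get the same indices.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object StableDictionary {
  // Hypothetical helper: assigns a Long index to each distinct product name.
  def buildDictionary(products: RDD[String]): RDD[(String, Long)] = {
    products
      .distinct()
      // Assumption (not from the thread): sorting fixes the element order,
      // so the index no longer depends on how partitions were evaluated.
      .sortBy(name => name)
      .zipWithIndex()
      // The workaround from the thread: keep the zipped RDD in memory or on
      // disk. Spark may still recompute a partition if cached blocks are lost.
      .persist(StorageLevel.MEMORY_AND_DISK)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stable-dictionary").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val products = sc.parallelize(Seq("almonds", "cardigans", "almonds", "boots"))
      val dictionary = buildDictionary(products)

      // Two separate actions over the same persisted dictionary.
      val lookup = dictionary.filter { case (product, _) => product == "almonds" }.collect()
      val lines = dictionary.map { case (product, index) => product + "," + index }.collect()

      lookup.foreach { case (product, index) => println(s"Debug product $product index $index") }
      lines.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

On the zipWithIndex vs zipWithUniqueId question: as far as I can tell from the RDD API docs, it is the other way around from what was guessed above. zipWithIndex has to run a job to count the elements of each partition when the RDD has more than one partition. zipWithUniqueId does not run a job; it gives items in the kth partition the ids k, n+k, 2*n+k, ... (n being the number of partitions), so the ids can have gaps.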