Re: Distributed dictionary building

Nan Zhu Tue, 23 Sep 2014 07:45:17 -0700

great, thanks 

-- 
Nan Zhu



On Tuesday, September 23, 2014 at 9:58 AM, Sean Owen wrote:

> Yes, Matei made a JIRA last week and I just suggested a PR:
> https://github.com/apache/spark/pull/2508 
> On Sep 23, 2014 2:55 PM, "Nan Zhu" <zhunanmcg...@gmail.com> wrote:
> > Shall we document this in the API docs?
> > 
> > Best, 
> > 
> > -- 
> > Nan Zhu
> > 
> > 
> > On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:
> > 
> > > zipWithUniqueId is also affected.
> > > 
> > > I had to persist the dictionaries to make use of the indices further
> > > down in the flow.
> > > 
> > > On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen <so...@cloudera.com> wrote:
> > > > Reference: https://issues.apache.org/jira/browse/SPARK-3098
> > > > I imagine zipWithUniqueId is also affected, but the problem may just not
> > > > have shown up in your test.
> > > > 
> > > > On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> > > > > Some more debugging revealed that, as Sean said, I have to keep the
> > > > > dictionaries persisted until I am done with the RDD manipulation.
> > > > >
> > > > > Thanks, Sean, for the pointer. Would it be possible to point me to the
> > > > > JIRA as well?
> > > > >
> > > > > Are there plans to make this more transparent for users?
> > > > >
> > > > > Is it possible for the DAG to speculate about such things, similar to
> > > > > branch prediction ideas from computer architecture?
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> > > > >>
> > > > > >> I changed zipWithIndex to zipWithUniqueId and that seems to be working.
> > > > > >>
> > > > > >> What's the difference between zipWithIndex and zipWithUniqueId?
> > > > > >>
> > > > > >> Is it that zipWithIndex does not need to run the count to compute the
> > > > > >> offset, which zipWithUniqueId does need, and so zipWithIndex is more
> > > > > >> efficient? It's not very clear from the docs.
> > > > >>
> > > > >>
> > > > > >> On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> > > > >>>
> > > > > >>> I did not persist or cache it, as I assumed zipWithIndex would
> > > > > >>> preserve order.
> > > > > >>>
> > > > > >>> There is also zipWithUniqueId; I am trying that. If it shows the same
> > > > > >>> issue, we should make it clear in the docs.
> > > > >>>
> > > > > >>> On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen <so...@cloudera.com> wrote:
> > > > >>>>
> > > > > >>>> From an offline question: zipWithIndex is being used to assign IDs.
> > > > > >>>> From a recent JIRA discussion, I understand that this is not
> > > > > >>>> deterministic within a partition, so the index can be different when
> > > > > >>>> the RDD is reevaluated. If you need it fixed, persist the zipped RDD
> > > > > >>>> on disk or in memory.
> > > > >>>>
> > > > > >>>> On Sep 20, 2014 8:10 PM, "Debasish Das" <debasish.da...@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>> Hi,
> > > > >>>>>
> > > > > >>>>> I am building a dictionary of type RDD[(String, Long)]. After the
> > > > > >>>>> dictionary is built and cached, I find the key "almonds" at value
> > > > > >>>>> 5187 using:
> > > > >>>>>
> > > > >>>>> rdd.filter{case(product, index) => product == "almonds"}.collect
> > > > >>>>>
> > > > >>>>> Output:
> > > > >>>>>
> > > > >>>>> Debug product almonds index 5187
> > > > >>>>>
> > > > >>>>> Now I take the same dictionary and write it out as:
> > > > >>>>>
> > > > >>>>> dictionary.map{case(product, index) => product + "," + index}
> > > > >>>>> .saveAsTextFile(outputPath)
> > > > >>>>>
> > > > > >>>>> Inside the map I also print the product at index 5187, and I get
> > > > > >>>>> a different product:
> > > > >>>>>
> > > > >>>>> Debug Index 5187 userOrProduct cardigans
> > > > >>>>>
> > > > > >>>>> Is this expected behavior from map?
> > > > > >>>>>
> > > > > >>>>> By the way, "almonds" and "apparel-cardigans" are just one off in
> > > > > >>>>> the index.
> > > > > >>>>>
> > > > > >>>>> I am using spark-1.1, but it's a snapshot.
> > > > >>>>>
> > > > >>>>> Thanks.
> > > > >>>>> Deb
> > > > >>>>>
> > > > >>>>>
> > > > >>>
> > > > >>
> > > > >
> > > 
> >
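
A note on the persistence point made in the thread: the index can change because an unpersisted RDD may be reevaluated, and zipWithIndex numbers whatever order it happens to see. The following is a minimal plain-Scala sketch of that effect, not Spark code. The names RecomputeSketch, upstream, and buildDictionary are hypothetical, and the random shuffle is only an assumed stand-in for an upstream stage with no guaranteed ordering.

```scala
import scala.util.Random

object RecomputeSketch {
  // Hypothetical stand-in for an upstream stage whose output order is not
  // guaranteed. The shuffle is an assumption made for illustration only.
  def upstream(rng: Random): Vector[String] =
    rng.shuffle(Vector("almonds", "cardigans", "walnuts", "socks"))

  // Assigns indices by position, in the spirit of zipWithIndex.
  def buildDictionary(rng: Random): Map[String, Long] =
    upstream(rng).zipWithIndex.map { case (product, i) => product -> i.toLong }.toMap

  def main(args: Array[String]): Unit = {
    val rng = new Random()

    // Not persisted: each use rebuilds the dictionary, so the index read in
    // one evaluation may point at another product in the next evaluation.
    val indexSeen = buildDictionary(rng)("almonds")
    val productSeen =
      buildDictionary(rng).collectFirst { case (p, i) if i == indexSeen => p }
    println(productSeen) // may be a product other than almonds

    // "Persisted": built once and reused, so both lookups agree.
    val persisted = buildDictionary(rng)
    val index = persisted("almonds")
    val productAtIndex =
      persisted.collectFirst { case (p, i) if i == index => p }
    println(productAtIndex) // prints "Some(almonds)"
  }
}
```

In Spark itself the corresponding step, as Sean describes, is to persist the zipped RDD before the first action that reads it.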

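On the zipWithIndex versus zipWithUniqueId question raised in the thread: as I understand the RDD API documentation, zipWithIndex assigns consecutive indices ordered by partition and then by position, so it has to learn the partition sizes first (which triggers a job when the RDD has more than one partition). zipWithUniqueId gives item i of partition k the id i * n + k for n partitions and needs no such pass, at the cost of ids that are unique but not consecutive. Below is a plain-Scala sketch of the two numbering schemes, modelling partitions as nested Vectors. It is not Spark code, and ZipIdSketch, the Partitions alias, and both function names are hypothetical.

```scala
object ZipIdSketch {
  // Models an RDD as a sequence of partitions (an assumption for illustration).
  type Partitions[A] = Vector[Vector[A]]

  // Consecutive indices: each partition starts at the total size of all
  // partitions before it, so the sizes must be known up front.
  def zipWithIndexModel[A](parts: Partitions[A]): Partitions[(A, Long)] = {
    val offsets = parts.map(_.size.toLong).scanLeft(0L)(_ + _)
    parts.zip(offsets).map { case (part, offset) =>
      part.zipWithIndex.map { case (item, i) => (item, offset + i) }
    }
  }

  // Unique ids: item i of partition k gets i * n + k, where n is the number
  // of partitions. No partition sizes are needed.
  def zipWithUniqueIdModel[A](parts: Partitions[A]): Partitions[(A, Long)] = {
    val n = parts.size.toLong
    parts.zipWithIndex.map { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, i * n + k) }
    }
  }

  def main(args: Array[String]): Unit = {
    val parts: Partitions[String] = Vector(
      Vector("almonds", "cardigans"),
      Vector("walnuts"),
      Vector("socks", "scarves", "hats")
    )
    // ids 0, 1 | 2 | 3, 4, 5
    println(zipWithIndexModel(parts).flatten)
    // ids 0, 3 | 1 | 2, 5, 8
    println(zipWithUniqueIdModel(parts).flatten)
  }
}
```

Both schemes number items by their position within a partition, which is consistent with Sean's expectation that zipWithUniqueId is affected by reordering as well.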