Re: Distributed dictionary building

2014-09-23 Thread Nan Zhu
shall we document this in the API doc? Best, -- Nan Zhu On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote: > zipWithUniqueId is also affected... > > I had to persist the dictionaries to make use of the indices lower down in > the flow... > > On Sun, Sep 21, 2014 at 1:15 AM, S

Re: Distributed dictionary building

2014-09-23 Thread Sean Owen
Yes, Matei made a JIRA last week and I just suggested a PR: https://github.com/apache/spark/pull/2508 On Sep 23, 2014 2:55 PM, "Nan Zhu" wrote: > shall we document this in the API doc? > > Best, > > -- > Nan Zhu > > On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote: > > zipWithUnique

Re: Distributed dictionary building

2014-09-23 Thread Nan Zhu
great, thanks -- Nan Zhu On Tuesday, September 23, 2014 at 9:58 AM, Sean Owen wrote: > Yes, Matei made a JIRA last week and I just suggested a PR: > https://github.com/apache/spark/pull/2508 > On Sep 23, 2014 2:55 PM, "Nan Zhu" wrote: > > shall we document t

Re: Distributed dictionary building

2014-09-21 Thread Debasish Das
zipWithUniqueId is also affected... I had to persist the dictionaries to make use of the indices lower down in the flow... On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen wrote: > Reference - https://issues.apache.org/jira/browse/SPARK-3098 > I imagine zipWithUniqueId is also affected, but may not h

Re: Distributed dictionary building

2014-09-21 Thread Sean Owen
Reference - https://issues.apache.org/jira/browse/SPARK-3098 I imagine zipWithUniqueId is also affected, but may not happen to have exhibited in your test. On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das wrote: > Some more debug revealed that as Sean said I have to keep the dictionaries > persisted

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
Some more debugging revealed that, as Sean said, I have to keep the dictionaries persisted until I am done with the RDD manipulation. Thanks Sean for the pointer... would it be possible to point me to the JIRA as well? Are there plans to make it more transparent for the users? Is it possible for t

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
I changed zipWithIndex to zipWithUniqueId and that seems to be working... What's the difference between zipWithIndex and zipWithUniqueId? For zipWithIndex we don't need to run the count to compute the offset which is needed for zipWithUniqueId and so zipWithIndex is efficient? It's not very clea
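On the question of how the two methods differ, the following is a minimal plain-Scala sketch, not Spark code. It models partitions as in-memory vectors and mimics the ID schemes described in the Spark RDD API docs: zipWithIndex assigns consecutive indices and needs the sizes of earlier partitions as offsets (which is why Spark runs a job for it when the RDD has more than one partition), while zipWithUniqueId gives items in the k-th of n partitions the IDs k, n+k, 2n+k, ... and needs no count. The names `ZipIdSketch`, `zipWithIndexLike`, and `zipWithUniqueIdLike` are hypothetical and exist only for this illustration.

```scala
object ZipIdSketch {
  // Partitions are modelled as plain in-memory sequences; no Spark is involved.
  type Partitions[T] = Vector[Vector[T]]

  // zipWithIndex-style: consecutive indices. Each partition starts at the
  // total size of all earlier partitions, so the sizes must be known first.
  def zipWithIndexLike[T](parts: Partitions[T]): Vector[(T, Long)] = {
    val offsets = parts.map(_.size.toLong).scanLeft(0L)(_ + _)
    parts.zip(offsets).flatMap { case (part, offset) =>
      part.zipWithIndex.map { case (item, i) => (item, offset + i) }
    }
  }

  // zipWithUniqueId-style: the i-th item of partition k gets i * n + k,
  // where n is the number of partitions. No sizes are needed, but the
  // resulting IDs can have gaps.
  def zipWithUniqueIdLike[T](parts: Partitions[T]): Vector[(T, Long)] = {
    val n = parts.size.toLong
    parts.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (item, i) => (item, i * n + k) }
    }
  }
}
```

For the partitions `Vector(Vector("a", "b"), Vector("c"), Vector("d", "e", "f"))`, the first scheme yields the indices 0 through 5 in order, and the second yields a=0, b=3, c=1, d=2, e=5, f=8. In both schemes an item's ID depends on its position within its partition, so if that order changes when the data is recomputed, the IDs change as well.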

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
I did not persist / cache it as I assumed zipWithIndex will preserve order... There is also zipWithUniqueId...I am trying that...If that also shows the same issue, we should make it clear in the docs... On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen wrote: > From offline question - zipWithIndex is

Re: Distributed dictionary building

2014-09-20 Thread Sean Owen
From offline question - zipWithIndex is being used to assign IDs. From a recent JIRA discussion I understand this is not deterministic within a partition so the index can be different when the RDD is reevaluated. If you need it fixed, persist the zipped RDD on disk or in memory. On Sep 20, 2014 8:
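The suggestion to persist the zipped RDD could look roughly like the sketch below. It assumes Spark is on the classpath and that the dictionary is built from an RDD of product names; the object name `DictionarySketch`, the helper `buildDictionary`, and the use of `distinct()` are assumptions for illustration, not code from the thread.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object DictionarySketch {
  // Hypothetical helper: assigns an index to every distinct product name and
  // persists the result so later stages reuse the same (product, index) pairs
  // instead of re-running zipWithIndex.
  def buildDictionary(products: RDD[String]): RDD[(String, Long)] = {
    val dictionary = products.distinct().zipWithIndex()
    dictionary.persist(StorageLevel.MEMORY_AND_DISK)
    dictionary.count() // force evaluation so the indices are computed now
    dictionary
  }
}
```

Every later lookup or join should then go through the RDD returned by `buildDictionary`, and it should stay persisted until all downstream work that relies on the indices has finished.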

Distributed dictionary building

2014-09-20 Thread Debasish Das
Hi, I am building a dictionary of RDD[(String, Long)] and after the dictionary is built and cached, I find key "almonds" at value 5187 using: rdd.filter{case(product, index) => product == "almonds"}.collect Output: Debug product almonds index 5187 Now I take the same dictionary and write it out