Re: A couple questions about shared variables

2014-09-24 Thread Nan Zhu
I proposed a fix: https://github.com/apache/spark/pull/2524. Feedback is welcome. -- Nan Zhu On Tuesday, September 23, 2014 at 9:06 PM, Sandy Ryza wrote: > Filed https://issues.apache.org/jira/browse/SPARK-3642 for documenting these nuances. > -Sandy > On Mon, Sep 22, 2014

Re: A couple questions about shared variables

2014-09-23 Thread Sandy Ryza
Filed https://issues.apache.org/jira/browse/SPARK-3642 for documenting these nuances. -Sandy On Mon, Sep 22, 2014 at 10:36 AM, Nan Zhu wrote: > I see, thanks for pointing this out -- Nan Zhu > On Monday, September 22, 2014 at 12:08 PM, Sandy Ryza wrote: > > MapReduce counters do not

Re: A couple questions about shared variables

2014-09-22 Thread Nan Zhu
I see, thanks for pointing this out -- Nan Zhu On Monday, September 22, 2014 at 12:08 PM, Sandy Ryza wrote: > MapReduce counters do not count duplications. In MapReduce, if a task needs to be re-run, the value of the counter from the second task overwrites the value from the first t

Re: A couple questions about shared variables

2014-09-22 Thread Sandy Ryza
MapReduce counters do not count duplications. In MapReduce, if a task needs to be re-run, the value of the counter from the second task overwrites the value from the first task. -Sandy On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu wrote: > If you think it is necessary to fix, I would like to resub
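
The overwrite behaviour described in this message can be sketched in a few lines of Python. This is only an illustration of the semantics, not MapReduce's actual implementation; the class `TaskCounters` and its methods are hypothetical names introduced here. Counter values are kept per task, so a re-run replaces the earlier attempt's value instead of adding to it.

```python
class TaskCounters:
    """Hypothetical per-task counter store (illustrative, not a Hadoop API)."""

    def __init__(self):
        # task_id -> counter value reported by the most recent attempt
        self._by_task = {}

    def report(self, task_id, value):
        # A later attempt of the same task overwrites the earlier value.
        self._by_task[task_id] = value

    def total(self):
        # The job-level counter is the sum over tasks, not over attempts.
        return sum(self._by_task.values())


counters = TaskCounters()
counters.report("task-0", 10)  # first attempt of task-0
counters.report("task-1", 7)
counters.report("task-0", 10)  # task-0 re-run: replaces, does not add
print(counters.total())
```

Because values are keyed by task rather than by attempt, the re-run of `task-0` leaves the total unchanged.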

Re: A couple questions about shared variables

2014-09-22 Thread Nan Zhu
If you think it is necessary to fix, I would like to resubmit that PR (it seems to have some conflicts with the current DAGScheduler). My suggestion is to make it an option in the accumulator, e.g. some algorithms that use an accumulator for result calculation need a deterministic accumulator,

Re: A couple questions about shared variables

2014-09-21 Thread Matei Zaharia
Hmm, good point, this seems to have been broken by refactorings of the scheduler, but it worked in the past. Basically the solution is simple -- in a result stage, we should not apply the update for each task ID more than once -- the same way we don't call job.listener.taskSucceeded more than on
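
The idea in this message, applying an accumulator update at most once per task ID, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Spark's scheduler code: `OnceOnlyAccumulator` and `apply_update` are hypothetical names, and a real implementation would live in the scheduler's task-completion handling.

```python
class OnceOnlyAccumulator:
    """Hypothetical accumulator that ignores repeated updates from one task ID."""

    def __init__(self, initial=0):
        self.value = initial
        # Task IDs whose update has already been merged into `value`.
        self._applied_task_ids = set()

    def apply_update(self, task_id, delta):
        # Return True if the update was applied, False if this task ID
        # had already contributed (e.g. a re-run or duplicate completion).
        if task_id in self._applied_task_ids:
            return False
        self._applied_task_ids.add(task_id)
        self.value += delta
        return True


acc = OnceOnlyAccumulator()
acc.apply_update(0, 5)
acc.apply_update(1, 3)
acc.apply_update(0, 5)  # duplicate completion of task 0 is ignored
print(acc.value)
```

Unlike the overwrite approach, this variant keeps the first successful update for each task and drops later ones.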

Re: A couple questions about shared variables

2014-09-21 Thread Nan Zhu
Hi, Matei, Can you give some hints on how the current implementation guarantees that an accumulator update is applied only once? There is a pending PR trying to achieve this (https://github.com/apache/spark/pull/228/files), but in the current implementation I don't see that this has been done. (may

Re: A couple questions about shared variables

2014-09-20 Thread Matei Zaharia
Hey Sandy, On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote: Hey All, A couple questions came up about shared variables recently, and I wanted to confirm my understanding and update the doc to be a little more clear. *Broadcast variables* Now that task data i

A couple questions about shared variables

2014-09-20 Thread Sandy Ryza
Hey All, A couple questions came up about shared variables recently, and I wanted to confirm my understanding and update the doc to be a little more clear. *Broadcast variables* Now that task data is automatically broadcast, the only occasions where it makes sense to explicitly broadcast are: *