Re: RDD.broadcast

2016-04-28 Thread Reynold Xin
> value.get(u.getLocationId());
> Object result = method(f,u1,u2,l); // method implementation not important, but requires all 3 objects
> return result;
> });
>
> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
> *Sent:* 28 April

RE: RDD.broadcast

2016-04-28 Thread Ioannis.Deligiannis
From: Marcin Tustin [mailto:mtus...@handybook.com]
Sent: 28 April 2016 12:27
To: Deligiannis, Ioannis (UK)
Cc: dev@spark.apache.org
Subject: Re: RDD.broadcast

I don't know what your notation really means. I'm very much unclear on why you can't use the filter method for 1. If you'

Re: RDD.broadcast

2016-04-28 Thread Marcin Tustin
> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
> *Sent:* 28 April 2016 12:08
> *To:* Deligiannis, Ioannis (UK)
> *Cc:* dev@spark.apache.org
> *Subject:* Re: RDD.broadcast
>
> Why would you ever need to do this? I'm genuinely cu

Re: RDD.broadcast

2016-04-28 Thread Mike Hynes
I second knowing the use case for interest. I can imagine a case where knowledge of the RDD key distribution would help local computations, for relatively few keys, but would be interested to hear your motive. Essentially, are you trying to achieve what would be an all-reduce type operation in MPI

RE: RDD.broadcast

2016-04-28 Thread Ioannis.Deligiannis
small (reference) RDD is quite common and much faster than using “join” method.

From: Marcin Tustin [mailto:mtus...@handybook.com]
Sent: 28 April 2016 12:08
To: Deligiannis, Ioannis (UK)
Cc: dev@spark.apache.org
Subject: Re: RDD.broadcast

Why would you ever need to do this? I'm genuinely curiou

Re: RDD.broadcast

2016-04-28 Thread Marcin Tustin
Why would you ever need to do this? I'm genuinely curious. I view collects as being solely for interactive work.

On Thursday, April 28, 2016, wrote:
> Hi,
>
> It is a common pattern to process an RDD, collect (typically a subset) to the driver and then broadcast back.
>
> Adding an RDD

RDD.broadcast

2016-04-28 Thread Ioannis.Deligiannis
Hi,

It is a common pattern to process an RDD, collect (typically a subset) to the driver and then broadcast back.

Adding an RDD method that can do that using the torrent broadcast mechanics would be much more efficient. In addition, it would not require the Driver to also utilize its Heap hold