Re: recent join/iterator fix

2014-12-29 Thread Stephen Haberman
Hi Shixiong, > The Iterable from cogroup is CompactBuffer, which is already > materialized. It's not a lazy Iterable. So now Spark cannot handle > skewed data that some key has too many values that cannot be fit into > the memory.​ Cool, thanks for the confirmation. - Stephen -

Re: recent join/iterator fix

2014-12-29 Thread Shixiong Zhu
The Iterable from cogroup is CompactBuffer, which is already materialized. It's not a lazy Iterable. So now Spark cannot handle skewed data that some key has too many values that cannot be fit into the memory.​

Re: recent join/iterator fix

2014-12-29 Thread Sean Owen
On Mon, Dec 29, 2014 at 2:11 PM, Stephen Haberman wrote: > Yeah...I was trying to poke around, are the Iterables that Spark passes > into cogroup already materialized (e.g. the bug was making a copy of > an already-in-memory list) or are the Iterables streaming? The result of cogroup has values t

Re: recent join/iterator fix

2014-12-29 Thread Stephen Haberman
> It wasn't so much the cogroup that was optimized here, but what is > done to the result of cogroup. Right. > Yes, it was a matter of not materializing the entire result of a > flatMap-like function after the cogroup, since this will accept just > an Iterator (actually, TraversableOnce). Yeah.

Re: recent join/iterator fix

2014-12-29 Thread Sean Owen
It wasn't so much the cogroup that was optimized here, but what is done to the result of cogroup. Yes, it was a matter of not materializing the entire result of a flatMap-like function after the cogroup, since this will accept just an Iterator (actually, TraversableOnce). I'd say that wherever you

recent join/iterator fix

2014-12-28 Thread Stephen Haberman
Hey, I saw this commit go by, and find it fairly fascinating: https://github.com/apache/spark/commit/c233ab3d8d75a33495298964fe73dbf7dd8fe305 For two reasons: 1) we have a report that is bogging down exactly in a .join with lots of elements, so, glad to see the fix, but, more interesting I think