Hi Shixiong,
> The Iterable from cogroup is a CompactBuffer, which is already
> materialized; it's not a lazy Iterable. So Spark currently cannot
> handle skewed data where a single key has too many values to fit in
> memory.
Cool, thanks for the confirmation.
- Stephen
The Iterable from cogroup is a CompactBuffer, which is already materialized;
it's not a lazy Iterable. So Spark currently cannot handle skewed data where
a single key has too many values to fit in memory.
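To make the "already materialized" point concrete, here's a tiny
self-contained sketch (toy data, local master; the object and variable
names are mine, not from the commit):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 1.2)

    object CogroupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("cogroup-sketch").setMaster("local[2]"))

        val left  = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
        val right = sc.parallelize(Seq(("a", "x"), ("b", "y")))

        // cogroup hands back an (Iterable[Int], Iterable[String]) per key.
        // Those Iterables are CompactBuffers: fully built in memory before
        // user code sees them, so one heavily skewed key (millions of
        // values) has to fit in a single task's heap.
        left.cogroup(right).collect().foreach { case (k, (is, ss)) =>
          println(s"$k -> [${is.mkString(",")}] / [${ss.mkString(",")}]")
        }

        sc.stop()
      }
    }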
On Mon, Dec 29, 2014 at 2:11 PM, Stephen Haberman wrote:
> Yeah... I was trying to poke around: are the Iterables that cogroup
> hands back already materialized (e.g., was the bug making a copy of an
> already-in-memory list), or are they streaming?
> It wasn't so much the cogroup that was optimized here, but what is
> done to the result of cogroup.
Right.
> Yes, it was a matter of not materializing the entire result of a
> flatMap-like function after the cogroup, since this will accept just
> an Iterator (actually, TraversableOnce).
Yeah.
It wasn't so much the cogroup that was optimized here, but what is
done to the result of cogroup. Yes, it was a matter of not
materializing the entire result of a flatMap-like function after the
cogroup, since this will accept just an Iterator (actually,
TraversableOnce).
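Reusing the toy left/right RDDs from the sketch earlier in the thread,
this is my reconstruction of the shape of that fix (not the commit's
literal code): because flatMapValues accepts any TraversableOnce,
returning an Iterator means each key's output is consumed lazily rather
than built up in memory first:

    // Pairs are produced on demand as the downstream pulls the iterator;
    // ss is an Iterable, so a fresh inner iterator is created per i.
    val joined = left.cogroup(right).flatMapValues { case (is, ss) =>
      for (i <- is.iterator; s <- ss.iterator) yield (i, s)
    }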
Hey,
I saw this commit go by and found it fairly fascinating:
https://github.com/apache/spark/commit/c233ab3d8d75a33495298964fe73dbf7dd8fe305
For two reasons: 1) we have a report that is bogging down in exactly this
sort of .join with lots of elements, so I'm glad to see the fix, but, more
interesting I think