Java's default serialization is not the most efficient way of handling
ser/deser. Did you try switching to Kryo serialization?
c.f.
https://ogirardot.wordpress.com/2015/01/09/changing-sparks-default-java-serialization-to-kryo/
if you need a tutorial.
This should help in terms of both CPU and memory.
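A minimal sketch of what the switch looks like (assuming Spark 1.x-era config names; the app name and buffer value are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Use Kryo instead of Java serialization for shuffle and broadcast data.
    val conf = new SparkConf()
      .setAppName("broadcast-test") // placeholder app name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Optional: raise the serializer buffer ceiling for large objects
      // (the default max is 64m).
      .set("spark.kryoserializer.buffer.max", "512m")
    val sc = new SparkContext(conf)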
The driver memory is set at 40G and OOM seems to be happening on the
executors. I might try a different broadcast block size (vs 4m) as Takeshi
suggested to see if it makes a difference.
On Mon, Mar 7, 2016 at 6:54 PM, Tristan Nixon wrote:
> Yeah, the spark engine is pretty clever and it's best not to prematurely
> optimize. [...]
Yeah, the spark engine is pretty clever and it's best not to prematurely
optimize. It would be interesting to profile your join vs. the collect on the
smaller dataset. I suspect that the join is faster (even before you broadcast
it back out).
I'm also curious about the broadcast OOM - did you [...]
So I just implemented the logic through a standard join (without collect
and broadcast) and it's working great.
The idea behind trying the broadcast was that since the other side of the
join is a much larger dataset, the process might be faster through collect
and broadcast, since it avoids the shuffle.
I’m not sure I understand - if it was already distributed over the cluster in
an RDD, why would you want to collect and then re-send it as a broadcast
variable? Why not simply use the RDD that is already distributed on the worker
nodes?
> On Mar 7, 2016, at 7:44 PM, Arash wrote:
>
> Hi Tristan, [...]
Oh, how about increasing the broadcast block size via
spark.broadcast.blockSize?
The default size is `4m`, and it is too small against ~1GB, I think.
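Something like this, for example (just a sketch; 32m is only a value to experiment with, not a recommendation):

    import org.apache.spark.SparkConf

    // With the default 4m block size, a ~1GB broadcast is split into ~256 blocks;
    // raising it, e.g. to 32m, cuts that to roughly 32 blocks.
    val conf = new SparkConf()
      .set("spark.broadcast.blockSize", "32m")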
On Tue, Mar 8, 2016 at 10:44 AM, Arash wrote:
> Hi Tristan,
>
> This is not static, I actually collect it from an RDD to the driver.
>
> On Mon, Mar 7, 2016 at 5:42 PM, Tristan Nixon wrote: [...]
Hi Tristan,
This is not static, I actually collect it from an RDD to the driver.
On Mon, Mar 7, 2016 at 5:42 PM, Tristan Nixon wrote:
> Hi Arash,
>
> Is this static data? Have you considered including it in your jars and
> deserializing it from the jar on each worker node?
> It's not pretty, but it's a workaround for serialization troubles.
Hi Arash,
Is this static data? Have you considered including it in your jars and
deserializing it from the jar on each worker node?
It's not pretty, but it's a workaround for serialization troubles.
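Roughly like this, for example (a sketch only; "/lookup-table.ser" and the map type are made up for illustration):

    import java.io.ObjectInputStream
    import org.apache.spark.{SparkConf, SparkContext}

    // Deserialize a data file that was bundled into the application jar as a
    // classpath resource, instead of shipping it from the driver.
    def loadLookup(): Map[String, String] = {
      val in = new ObjectInputStream(
        getClass.getResourceAsStream("/lookup-table.ser")) // hypothetical resource
      try in.readObject().asInstanceOf[Map[String, String]]
      finally in.close()
    }

    val sc = new SparkContext(new SparkConf().setAppName("jar-lookup"))
    val keys = sc.parallelize(Seq("a", "b", "c"))
    val resolved = keys.mapPartitions { iter =>
      val lookup = loadLookup() // read locally on the executor, once per partition
      iter.map(k => (k, lookup.getOrElse(k, "missing")))
    }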
> On Mar 7, 2016, at 5:29 PM, Arash wrote:
>
> Hello all,
>
> I'm trying to broadcast a variable of size ~1G to a cluster of 20 nodes
> but haven't been able to make it work so far. [...]
Hi Takeshi, we have the driver memory set to 40G. I don't think the driver is
having memory issues. It looks like the executors start to fail due to
memory issues and then the broadcast p2p algorithm starts to fail.
On Mon, Mar 7, 2016 at 5:36 PM, Takeshi Yamamuro wrote:
> Hi,
>
> I think the broadcast logic itself works regardless of input size [...]
Hi,
I think the broadcast logic itself works regardless of input size (no idea
about its efficiency).
How about the memory size in your driver?
When you broadcast a large variable, the driver uses a lot of memory to split
the variable into blocks and serialize them.
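One way to see how much the driver is actually holding before the broadcast is SizeEstimator (a sketch; the map below is a placeholder for your collected data):

    import org.apache.spark.util.SizeEstimator

    // Estimate the in-memory footprint of the collected object on the driver.
    // A ~1GB object is then split into 4m blocks (~256 of them), all of which
    // are serialized on the driver before the torrent broadcast starts.
    val collected: Map[String, Array[Double]] = Map() // placeholder for the real data
    println(s"Estimated size: ${SizeEstimator.estimate(collected)} bytes")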
Thanks,
maropu
On Tue, [...]
Hi Ankur,
For this specific test, I'm only running the few lines of code that are
pasted. Nothing else is cached in the cluster.
Thanks,
Arash
On Mon, Mar 7, 2016 at 4:07 PM, Ankur Srivastava wrote:
> Hi,
>
> We have a use case where we broadcast ~4GB of data and we are on
> m3.2xlarge, so your object size is not an issue. [...]
Hi,
We have a use case where we broadcast ~4GB of data and we are on m3.2xlarge,
so your object size is not an issue. Also, based on your explanation, it does
not look like a broadcast issue, as it works when your partition size is
small.
Are you caching any other data? Because broadcast variables use the [...]
Well, I'm trying to avoid a big shuffle/join. From what I could find
online, my understanding was that a 1G broadcast should be doable; is that
not accurate?
On Mon, Mar 7, 2016 at 3:34 PM, Jeff Zhang wrote:
> Any reason why you broadcast such a large variable? It doesn't make
> sense to me.
>
>
Any reason why you broadcast such a large variable? It doesn't make sense
to me.
On Tue, Mar 8, 2016 at 7:29 AM, Arash wrote:
> Hello all,
>
> I'm trying to broadcast a variable of size ~1G to a cluster of 20 nodes
> but haven't been able to make it work so far.
>
> It looks like the executors [...]