Better way to share large data across task managers

Dongwon Kim Sun, 20 Sep 2020 09:05:57 -0700

Hi,

I'm using Flink broadcast state similar to what Fabian explained in [1].
One difference might be the size of the broadcasted data; the size is
around 150MB.


I've launched 32 TMs by setting
- taskmanager.numberOfTaskSlots : 6
- parallelism of the non-broadcast side : 192

Here's some questions:
1) AFAIK, the broadcasted data (150MB) is sent to all 192 tasks. Is it
right?
2) Any recommended way to broadcast data only to 32 TMs so that 6 tasks in
each TM can read the broadcasted data? I'm considering implementing a
static class for the non-broadcast side to directly load data only once on
each TaskManager instead of the broadcast state (FYI, I'm using per-job
clusters on YARN, so each TM is only for a single job). However, I'd like
to use Flink native facilities if possible.

The type of broadcasted data is Map<Long, Int> with around 600K entries, so
every time the data is broadcasted a lot of GC is inevitable on each TM due
to the (de)serialization cost.

Any advice would be much appreciated.

Best,

Dongwon

[1] https://flink.apache.org/2019/06/26/broadcast-state.html

Better way to share large data across task managers

Reply via email to