Hi Greg!

Interesting... When you say the TaskManagers are dropping, are the
TaskManager processes crashing, or are they losing connection to the
JobManager?

Greetings,
Stephan


On Thu, Oct 29, 2015 at 9:56 AM, Greg Hogan <c...@greghogan.com> wrote:

> I recently discovered that AWS uses NUMA for its largest nodes. An example
> c4.8xlarge:
>
> $ numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
> node 0 size: 29813 MB
> node 0 free: 24537 MB
> node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35
> node 1 size: 30574 MB
> node 1 free: 22757 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> I discovered yesterday that Flink performed ~20-30% faster on large
> datasets by running two NUMA-constrained TaskManagers per node. The
> JobManager node ran a single TaskManager. Resources were divided in half
> relative to running a single TaskManager.
>
> The changes from the tail of /bin/taskmanager.sh:
>
> -"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> +numactl --membind=0 --cpunodebind=0 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> +numactl --membind=1 --cpunodebind=1 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
>
> After reverting this change, the system is again stable. I had not
> experienced issues using numactl when running 16 nodes.
>
> Greg
>
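The per-node launch Greg describes could be generalized to any NUMA node count. A minimal sketch (a hypothetical helper, not the actual patch; assumes numactl and flink-daemon.sh are on the paths shown, and the node count would normally come from `numactl --hardware`):

```shell
#!/bin/sh
# Sketch: start one NUMA-bound TaskManager per node, as in Greg's change.
# Setting DRY_RUN=echo prints the commands instead of executing them;
# otherwise numactl must be installed and the Flink bin path must be valid.
launch_taskmanagers() {
    num_nodes=$1      # number of NUMA nodes, e.g. parsed from `numactl --hardware`
    flink_bin=$2      # Flink bin directory (hypothetical path below)
    node=0
    while [ "$node" -lt "$num_nodes" ]; do
        # Bind this daemon's memory and CPUs to a single NUMA node.
        ${DRY_RUN:-} numactl --membind="$node" --cpunodebind="$node" \
            "$flink_bin"/flink-daemon.sh start taskmanager
        node=$((node + 1))
    done
}

# Dry run: print the two launch commands for a 2-node machine (c4.8xlarge).
DRY_RUN=echo launch_taskmanagers 2 /opt/flink/bin
```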
