Hi Greg! Interesting... When you say the TaskManagers are dropping, are the TaskManager processes crashing, or are they losing their connection to the JobManager?
Greetings,
Stephan

On Thu, Oct 29, 2015 at 9:56 AM, Greg Hogan <c...@greghogan.com> wrote:

> I recently discovered that AWS uses NUMA for its largest nodes. An example
> c4.8xlarge:
>
> $ numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
> node 0 size: 29813 MB
> node 0 free: 24537 MB
> node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35
> node 1 size: 30574 MB
> node 1 free: 22757 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> I discovered yesterday that Flink performed ~20-30% faster on large
> datasets by running two NUMA-constrained TaskManagers per node. The
> JobManager node ran a single TaskManager. Resources were divided in half
> relative to running a single TaskManager.
>
> The changes from the tail of /bin/taskmanager.sh:
>
> -"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> +numactl --membind=0 --cpunodebind=0 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> +numactl --membind=1 --cpunodebind=1 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
>
> After reverting this change the system is again stable. I had not
> experienced issues using numactl when running 16 nodes.
>
> Greg
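As a side note for anyone who wants to try this: the hard-coded two-node split above could be generalized into a loop over however many NUMA nodes the machine reports. A minimal sketch, assuming numactl's usual `--hardware` output format; the `launch` function here is a hypothetical stand-in for the actual `flink-daemon.sh` invocation, not part of the Flink scripts:

```shell
#!/usr/bin/env bash
# Sketch: start one TaskManager per NUMA node, each bound to that node's
# CPUs and memory. Falls back to a single unbound launch when numactl is
# not installed or the machine has a single node.

# Parse the node count from lines like "available: 2 nodes (0-1)".
if command -v numactl >/dev/null 2>&1; then
  num_nodes=$(numactl --hardware | awk '/^available:/ {print $2}')
else
  num_nodes=1
fi

launch() {
  # Hypothetical stand-in for:
  #   "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
  # (echoes instead of executing, so the sketch is runnable anywhere)
  echo "launch: $*"
}

if [ "$num_nodes" -gt 1 ]; then
  for ((node = 0; node < num_nodes; node++)); do
    # Bind both memory allocation and CPU scheduling to one node.
    launch numactl --membind=$node --cpunodebind=$node
  done
else
  launch
fi
```

Memory settings would still need to be halved (or divided by the node count) relative to a single-TaskManager setup, as in the experiment above.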