I recently discovered that AWS uses NUMA for its largest nodes. For
example, a c4.8xlarge:

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
node 0 size: 29813 MB
node 0 free: 24537 MB
node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35
node 1 size: 30574 MB
node 1 free: 22757 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
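Note the CPU numbering: each node gets a contiguous run of physical
cores plus their hyperthread siblings (0-8 and 18-26 on node 0). A small
sketch for recovering the per-node CPU lists programmatically; the sample
string is the output pasted above, and in practice you would capture it
with subprocess from `numactl --hardware`:

```python
import re

# Sample output of `numactl --hardware` (as pasted above).
sample = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
node 0 size: 29813 MB
node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35
node 1 size: 30574 MB
"""

def node_cpus(text):
    """Return {node_id: [cpu, ...]} parsed from numactl --hardware output."""
    cpus = {}
    for node, ids in re.findall(r"node (\d+) cpus: ([\d ]+)", text):
        cpus[int(node)] = [int(c) for c in ids.split()]
    return cpus

print(node_cpus(sample)[0])  # CPUs belonging to node 0
```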

I discovered yesterday that Flink performed ~20-30% faster on large
datasets when running two NUMA-constrained TaskManagers per node instead
of one. The JobManager node ran a single TaskManager. Each TaskManager
was given half the resources it would have had in the
single-TaskManager configuration.
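Halving resources presumably means halving heap and slots per
TaskManager in flink-conf.yaml. A hypothetical sketch (the values are
illustrative, not from my actual configuration):

```
# flink-conf.yaml for each of the two NUMA-bound TaskManagers
# (assumption: half the heap and half the slots of the single-TM setup)
taskmanager.heap.mb: 14336        # roughly half a single TM's heap
taskmanager.numberOfTaskSlots: 9  # one slot per physical core on a node
```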

The change to the tail of /bin/taskmanager.sh:

-"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
+numactl --membind=0 --cpunodebind=0 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
+numactl --membind=1 --cpunodebind=1 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
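One way to confirm the bindings take effect on a running TaskManager (a
hedged sketch; finding the PID via jps is an assumption, and numastat
may not be installed everywhere):

```shell
# Hypothetical: substitute the real TaskManager PID (e.g. from jps).
TM_PID=$$

# CPUs the process may run on -- should match one node's CPU list
# when --cpunodebind is in effect.
grep Cpus_allowed_list "/proc/$TM_PID/status"

# Per-node memory usage for the process -- --membind should keep
# nearly all pages on a single node (skipped if numastat is absent).
command -v numastat >/dev/null && numastat -p "$TM_PID"
```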

After reverting this change the system is stable again. I had not
experienced issues using numactl when running 16 nodes.

Greg
