I recently discovered that AWS uses NUMA on its largest nodes. For example, a c4.8xlarge:
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
node 0 size: 29813 MB
node 0 free: 24537 MB
node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35
node 1 size: 30574 MB
node 1 free: 22757 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

I discovered yesterday that Flink performed ~20-30% faster on large
datasets when running two NUMA-constrained TaskManagers per node. The
JobManager node ran a single TaskManager. Per-TaskManager resources
were halved relative to running a single TaskManager. The changes to
the tail of bin/taskmanager.sh:

-"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
+numactl --membind=0 --cpunodebind=0 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
+numactl --membind=1 --cpunodebind=1 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"

After reverting this change the system is again stable. I had not
experienced issues using numactl when running 16 nodes.

Greg
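As a side note, the diff above hardcodes node IDs 0 and 1 for the two-node
c4.8xlarge case. A minimal sketch of one way to generalize it, assuming the
"available: N nodes (...)" line format of `numactl --hardware` and the
`FLINK_BIN_DIR` and `args` variables already present in taskmanager.sh:

```shell
#!/bin/sh
# Hypothetical generalization: start one NUMA-pinned TaskManager per
# node instead of hardcoding --membind=0/--membind=1.
# Parse the node count from the "available: N nodes (...)" line.
NODE_COUNT=$(numactl --hardware | awk '/^available:/ {print $2}')

for node in $(seq 0 $((NODE_COUNT - 1))); do
  # Bind both memory allocation and CPU scheduling to the same node,
  # so each TaskManager's heap stays local to the cores it runs on.
  numactl --membind="$node" --cpunodebind="$node" \
    "${FLINK_BIN_DIR}"/flink-daemon.sh start taskmanager "${args[@]}"
done
```

Whether a binding actually took effect can be checked by running
`numactl --show` under the same binding, which prints the nodebind and
membind policy in effect for the child process.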