Hi Greg! Interesting... When you say the TaskManagers are dropping, are the TaskManager processes crashing, or are they losing their connection to the JobManager?
Greetings,
Stephan

On Thu, Oct 29, 2015 at 9:56 AM, Greg Hogan <c...@greghogan.com> wrote:

> I recently discovered that AWS uses NUMA for its largest nodes. An example
> c4.8xlarge:
>
> $ numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
> node 0 size: 29813 MB
> node 0 free: 24537 MB
> node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35
> node 1 size: 30574 MB
> node 1 free: 22757 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> I discovered yesterday that Flink performed ~20-30% faster on large
> datasets by running two NUMA-constrained TaskManagers per node. The
> JobManager node ran a single TaskManager. Resources were divided in half
> relative to running a single TaskManager.
>
> The changes from the tail of /bin/taskmanager.sh:
>
> -"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> +numactl --membind=0 --cpunodebind=0 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> +numactl --membind=1 --cpunodebind=1 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
>
> After reverting this change the system is again stable. I had not
> experienced issues using numactl when running 16 nodes.
>
> Greg
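As a side note for anyone who wants to try this: the hard-coded two-node split above could be generalized into a loop over however many NUMA nodes the machine reports. A minimal sketch, assuming numactl's usual `--hardware` output format; the `launch` function here is a hypothetical stand-in for the actual `flink-daemon.sh` invocation, not part of the Flink scripts:

```shell
#!/usr/bin/env bash
# Sketch: start one TaskManager per NUMA node, each bound to that node's
# CPUs and memory. Falls back to a single unbound launch when numactl is
# not installed or the machine has a single node.

# Parse the node count from lines like "available: 2 nodes (0-1)".
if command -v numactl >/dev/null 2>&1; then
  num_nodes=$(numactl --hardware | awk '/^available:/ {print $2}')
else
  num_nodes=1
fi

launch() {
  # Hypothetical stand-in for:
  #   "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
  # (echoes instead of executing, so the sketch is runnable anywhere)
  echo "launch: $*"
}

if [ "$num_nodes" -gt 1 ]; then
  for ((node = 0; node < num_nodes; node++)); do
    # Bind both memory allocation and CPU scheduling to one node.
    launch numactl --membind=$node --cpunodebind=$node
  done
else
  launch
fi
```

Memory settings would still need to be halved (or divided by the node count) relative to a single-TaskManager setup, as in the experiment above.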