Hi all,

We have a Flink 1.6 streaming application running on Amazon EMR. It uses a YARN session in detached mode, configured with 20 GB for the Task Manager, 2 GB for the Job Manager, and 4 slots (matching the number of vCPUs). Each Core Node has 4 vCores, 32 GB of memory, and 32 GB of disk; each Task Node has 4 vCores, 8 GB of memory, and 32 GB of disk. Core Nodes auto-scale based on HDFS Utilization and Capacity Remaining GB, and Task Nodes auto-scale based on YARN Available Memory and the number of Pending Containers. Log Aggregation is turned on as well.

This runs well under normal load for about a week, at which point YARN can no longer fulfill the resource requests from Flink, causing container requests to build up. Even after the cluster scales up, the container requests don't seem to be fulfilled. I've seen that it seems to start.

Does anyone have a good guide to setting up a streaming application on EMR with YARN?
Thank you,
Austin Cawley-Edwards