I just wanted to leave an update on this issue for anyone else who comes across it. The problem was indeed memory-related, but it was disk space rather than heap/off-heap memory. YARN was killing my containers because they exceeded the disk utilization threshold, and this manifested either as "Task manager was lost/killed" or as "JobClientActorConnectionTimeoutException: Lost connection to the JobManager". Digging into the NodeManager logs on the individual instances provided the hints that this was a disk issue.
Some fixes for this problem:
- yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage can be increased to alleviate the problem temporarily (see the snippet below).
- Increasing the disk capacity on each task manager is a more long-term fix.
- Increasing the number of task managers adds aggregate disk space across the cluster, so it is also a fix.
Thanks!
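
For reference, a minimal sketch of the corresponding yarn-site.xml entry on each NodeManager host; the 95.0 value here is only an illustrative placeholder (the Hadoop default is 90.0), and raising it postpones the problem rather than fixing it:

  <!-- Raise the per-disk utilization threshold at which YARN marks a
       local dir unhealthy and starts killing containers on that node.
       95.0 is an illustrative value, not a recommendation. -->
  <property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>95.0</value>
  </property>

The NodeManagers need a restart for the change to take effect.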