To add one more data point: the recovery directory seems to be the bottleneck somehow. If we delete the recovery directory and restart the JobManager, it comes back up and is responsive.
Of course, we then lose all jobs, since none can be recovered, and that is not ideal. So the question seems to be why the recovery directory grows explosively in the first place. I can't imagine we're the only ones seeing this, so perhaps we are configuring something wrong while testing Flink 1.3.1.

Thanks in advance for your help,
Prashant

--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/S3-recovery-and-checkpoint-directories-exhibit-explosive-growth-tp14270p14271.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.