We've run into an issue that limits the max parallelism of jobs we can run and what it seems to boil down to is that the JobManager becomes unresponsive while essentially spending all of it's time discarding checkpoints from S3. This results in sluggish UI, sporadic AkkaAskTimeouts, heartbeat misses, etc.
Since S3 (and I assume HDFS) have policy that can be used to discard old objects without Flink actively deleting them I think it would be a useful feature to add the option to Flink to not ever discard checkpoints. I believe this will solve the problem. Any objections or other known solutions to this problem? -Jamie