JobManager scale limitation - Slow S3 checkpoint deletes

Jamie Grier Wed, 06 Mar 2019 06:41:03 -0800

We've run into an issue that limits the max parallelism of jobs we can run
and what it seems to boil down to is that the JobManager becomes
unresponsive while essentially spending all of it's time discarding
checkpoints from S3.  This results in sluggish UI, sporadic
AkkaAskTimeouts, heartbeat misses, etc.


Since S3 (and I assume HDFS) have policy that can be used to discard old
objects without Flink actively deleting them I think it would be a useful
feature to add the option to Flink to not ever discard checkpoints.  I
believe this will solve the problem.

Any objections or other known solutions to this problem?

-Jamie

JobManager scale limitation - Slow S3 checkpoint deletes

Reply via email to