Re: JobManager scale limitation - Slow S3 checkpoint deletes

Stephan Ewen Wed, 06 Mar 2019 07:42:18 -0800

I think having an option to not actively delete checkpoints (but rather
have the TTL feature of the file system take care of it) sounds like a good
idea.


I am curious why you get heartbeat misses and akka timeouts during deletes.
Are some parts of the deletes happening sychronously in the actor thread?

On Wed, Mar 6, 2019 at 3:40 PM Jamie Grier <jgr...@lyft.com.invalid> wrote:

> We've run into an issue that limits the max parallelism of jobs we can run
> and what it seems to boil down to is that the JobManager becomes
> unresponsive while essentially spending all of it's time discarding
> checkpoints from S3.  This results in sluggish UI, sporadic
> AkkaAskTimeouts, heartbeat misses, etc.
>
> Since S3 (and I assume HDFS) have policy that can be used to discard old
> objects without Flink actively deleting them I think it would be a useful
> feature to add the option to Flink to not ever discard checkpoints.  I
> believe this will solve the problem.
>
> Any objections or other known solutions to this problem?
>
> -Jamie
>

Re: JobManager scale limitation - Slow S3 checkpoint deletes

Reply via email to