> ... on its side, which would be really helpful for state backends that
> disaggregate computation and storage.
>
> Best
> Yun Tang
>
> From: Thomas Weise
> Sent: Thursday, March 7, 2019 12:06
> To: dev@flink.apache.org
> Subject: Re: JobManager scale limitation - Slow S3 checkpoint deletes
Nice!
Perhaps for file systems without TTL/expiration support (AFAIK includes
HDFS), cleanup could be performed in the task managers?
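(As an aside, the TTL/expiration support referred to here is, on S3, a bucket
lifecycle rule. Purely for illustration, a rough sketch of setting one up with
the AWS SDK for Java v1 -- the bucket name, checkpoint prefix, and 7-day
retention below are made-up placeholders, not a recommendation:)

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
    import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;
    import com.amazonaws.services.s3.model.lifecycle.LifecyclePrefixPredicate;

    public class CheckpointExpirationRule {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // Expire objects under the checkpoint prefix after 7 days
            // instead of deleting them one by one from the JobManager.
            BucketLifecycleConfiguration.Rule rule =
                new BucketLifecycleConfiguration.Rule()
                    .withId("expire-old-checkpoints")
                    .withFilter(new LifecycleFilter(
                        new LifecyclePrefixPredicate("flink/checkpoints/")))
                    .withExpirationInDays(7)
                    .withStatus(BucketLifecycleConfiguration.ENABLED);

            s3.setBucketLifecycleConfiguration("my-checkpoint-bucket",
                new BucketLifecycleConfiguration().withRules(rule));
        }
    }

(Note that a blanket age-based expiration would also remove objects that
retained checkpoints still reference, which is part of why relying on it
needs care.)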
On Wed, Mar 6, 2019 at 6:01 PM Jamie Grier wrote:
Yup, it looks like the actor threads are spending all of their time
communicating with S3. I've attached a picture of a typical stack trace
for one of the actor threads [1]. At the end of that call stack what
you'll see is the thread blocking on synchronous communication with the S3
service.
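(For the record, a rough sketch of the general direction being discussed --
not Flink's actual code; the StateHandle type and discard() method below are
hypothetical stand-ins. The idea is that the actor/RPC thread only submits the
discard work, and the blocking S3 DELETE calls run on a dedicated I/O pool:)

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class AsyncCheckpointDiscard {

        // Hypothetical stand-in for whatever object knows the S3 location
        // of a checkpoint's state and how to delete it.
        interface StateHandle {
            void discard() throws Exception;
        }

        private final ExecutorService ioExecutor =
            Executors.newFixedThreadPool(4);

        // Caller returns immediately; the blocking deletes run on ioExecutor.
        public CompletableFuture<Void> discardAsync(List<StateHandle> handles) {
            return CompletableFuture.runAsync(() -> {
                for (StateHandle handle : handles) {
                    try {
                        handle.discard(); // blocking S3 call, off the actor thread
                    } catch (Exception e) {
                        // best effort: log and continue rather than stall anything
                        System.err.println("Failed to discard " + handle + ": " + e);
                    }
                }
            }, ioExecutor);
        }
    }

(Whether the deletes currently run on the actor thread rather than a pool like
this is exactly the question raised below.)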
I think having an option to not actively delete checkpoints (but rather
have the TTL feature of the file system take care of it) sounds like a good
idea.
I am curious why you get heartbeat misses and akka timeouts during deletes.
Are some parts of the deletes happening synchronously in the actor thread?
We've run into an issue that limits the max parallelism of jobs we can run
and what it seems to boil down to is that the JobManager becomes
unresponsive while essentially spending all of its time discarding
checkpoints from S3. This results in sluggish UI, sporadic
AkkaAskTimeouts, and heartbeat misses.