Re: JobManager scale limitation - Slow S3 checkpoint deletes

2019-03-07 Thread Jamie Grier
ng on its
> > side, which would be really helpful for specific state backend
> > disaggregating computation and storage.
> >
> > Best
> > Yun Tang
> > ________
> > From: Thomas Weise
> > Sent: Thursday, March 7, 2019 12:06
> > To: dev@fli

Re: JobManager scale limitation - Slow S3 checkpoint deletes

2019-03-07 Thread Till Rohrmann
would be really helpful for specific state backend
> disaggregating computation and storage.
>
> Best
> Yun Tang
>
> From: Thomas Weise
> Sent: Thursday, March 7, 2019 12:06
> To: dev@flink.apache.org
> Subject: Re: JobManager scale limitatio

Re: JobManager scale limitation - Slow S3 checkpoint deletes

2019-03-06 Thread Yun Tang
12:06
To: dev@flink.apache.org
Subject: Re: JobManager scale limitation - Slow S3 checkpoint deletes

Nice! Perhaps for file systems without TTL/expiration support (AFAIK includes HDFS), cleanup could be performed in the task managers?

On Wed, Mar 6, 2019 at 6:01 PM Jamie Grier wrote:
> Yup,

Re: JobManager scale limitation - Slow S3 checkpoint deletes

2019-03-06 Thread Thomas Weise
Nice! Perhaps for file systems without TTL/expiration support (AFAIK includes HDFS), cleanup could be performed in the task managers?

On Wed, Mar 6, 2019 at 6:01 PM Jamie Grier wrote:
> Yup, it looks like the actor threads are spending all of their time
> communicating with S3. I've attached

Re: JobManager scale limitation - Slow S3 checkpoint deletes

2019-03-06 Thread Jamie Grier
Yup, it looks like the actor threads are spending all of their time communicating with S3. I've attached a picture of a typical stack trace for one of the actor threads [1]. At the end of that call stack what you'll see is the thread blocking on synchronous communication with the S3 service. Thi
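
The blocking pattern described in this message can be illustrated with a small, generic sketch. It is not Flink code and not the fix adopted in the thread: `Coordinator`, `slow_delete`, and the path names are hypothetical stand-ins. It only shows the general idea of handing blocking object-store deletes to a dedicated I/O pool so the coordinating thread stays free for heartbeats and RPCs.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def slow_delete(path, deleted, delay=0.05):
    """Hypothetical stand-in for a blocking object-store delete call."""
    time.sleep(delay)      # simulates a synchronous network round trip
    deleted.append(path)   # records that the "delete" finished


class Coordinator:
    """Hypothetical coordinator that must stay responsive while deleting."""

    def __init__(self, io_workers=4):
        # Dedicated pool so blocking I/O never runs on the caller's thread.
        self._io_pool = ThreadPoolExecutor(max_workers=io_workers)
        self.deleted = []
        self.pending = []

    def discard_checkpoint(self, paths):
        """Schedule the deletes and return at once with their futures."""
        futures = [
            self._io_pool.submit(slow_delete, p, self.deleted) for p in paths
        ]
        self.pending.extend(futures)
        return futures

    def shutdown(self):
        """Wait for outstanding deletes, then release the pool."""
        wait(self.pending)
        self._io_pool.shutdown()


if __name__ == "__main__":
    coordinator = Coordinator()
    coordinator.discard_checkpoint([f"chk-1/part-{i}" for i in range(8)])
    print("caller thread is free while deletes run in the background")
    coordinator.shutdown()
    print(f"deleted {len(coordinator.deleted)} objects")
```

With this shape, the time spent in `discard_checkpoint` is independent of how long each delete takes; the cost moves to the pool, whose size bounds the number of concurrent requests to the store.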

Re: JobManager scale limitation - Slow S3 checkpoint deletes

2019-03-06 Thread Stephan Ewen
I think having an option to not actively delete checkpoints (but rather have the TTL feature of the file system take care of it) sounds like a good idea. I am curious why you get heartbeat misses and akka timeouts during deletes. Are some parts of the deletes happening synchronously in the actor th

JobManager scale limitation - Slow S3 checkpoint deletes

2019-03-06 Thread Jamie Grier
We've run into an issue that limits the max parallelism of jobs we can run, and what it seems to boil down to is that the JobManager becomes unresponsive while essentially spending all of its time discarding checkpoints from S3. This results in sluggish UI, sporadic AkkaAskTimeouts, heartbeat miss