Hi,

I had a look at the logs from the restore job and couldn't find anything suspicious in them. Everything looks as expected, and the state files are properly found and transferred from S3. We are now including rescaling in some end-to-end tests; let's see what happens.

When you say you can reproduce the problem, does that mean reproducing it from the single existing checkpoint, or can you also create other problematic checkpoints? I am asking because a log from the job that produces the problematic checkpoint might be more helpful. You can create a ticket if you want.
Best,
Stefan

> On 18.05.2018, at 09:02, Juho Autio <juho.au...@rovio.com> wrote:
>
> I see. I appreciate keeping this option available even if it's "beta". The
> current situation could be documented better, though.
>
> As long as rescaling from a checkpoint is not officially supported, I would put
> it behind a flag similar to --allowNonRestoredState. The flag could be called
> --allowRescalingRestoredCheckpointState, for example. This would make sure
> that users are aware that what they're using is experimental and might have
> unexpected effects.
>
> As for the bug I faced, I was indeed able to reproduce it consistently, and I
> have provided TRACE-level logs personally to Stefan. If there is no Jira
> ticket for this yet, would you like me to create one?
>
> On Thu, May 17, 2018 at 1:00 PM, Stefan Richter <s.rich...@data-artisans.com> wrote:
> Hi,
>
> > This raises a couple of questions:
> > - Is it a bug though, that the state restoring goes wrong like it does for
> > my job? Based on my experience it seems like rescaling sometimes works, but
> > then you can have these random errors.
>
> If there is a problem, I would still consider it a bug, because it should work
> correctly.
>
> > - If it's not supported properly, why not refuse to restore a checkpoint if
> > it would require rescaling?
>
> It should work properly, but I would prefer to keep this at the level of a
> "hidden feature" until it has had some more exposure and some open questions
> about the future differences between savepoints and checkpoints are resolved.
>
> > - We have sometimes had Flink jobs where the state has become so heavy that
> > cancelling with a savepoint times out & fails. Incremental checkpoints are
> > still working because they don't time out as long as the state is growing
> > linearly. In that case, if we want to scale up (for example to enable
> > successful savepoint creation ;) ), the only thing we can do is to restore
> > from the latest checkpoint. But then we have no way to scale up by
> > increasing the cluster size, because we can't create a savepoint with the
> > smaller cluster, but on the other hand can't restore a checkpoint to a
> > bigger cluster if rescaling from a checkpoint is not supposed to be relied
> > on. So in this case we're stuck and forced to start from an empty state?
>
> IMO there is a very good chance that this will simply become a normal feature
> in the near future.
>
> Best,
> Stefan
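
For reference, the restore discussed above — resuming a retained (externalized) checkpoint at a higher parallelism — is issued with the same -s option that is used for savepoint restores; the existing --allowNonRestoredState flag, which Juho's proposed flag is modeled on, is a separate option of the same command and only matters when stateful operators were removed. A minimal sketch, where the bucket, job ID, checkpoint number, parallelism, and jar name are placeholders rather than values from the job in question:

    # Resume from the retained checkpoint's metadata and request a higher parallelism.
    bin/flink run \
        -s s3://<bucket>/checkpoints/<job-id>/chk-<n>/_metadata \
        -p 8 \
        my-job.jar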