I see. I appreciate that this option is kept available even if it's "beta". The current situation could be documented better, though.
As long as rescaling from a checkpoint is not officially supported, I would put it behind a flag similar to --allowNonRestoredState. The flag could be called --allowRescalingRestoredCheckpointState, for example. This would make sure that users are aware that what they're using is experimental and might have unexpected effects.

As for the bug I faced: indeed, I was able to reproduce it consistently, and I have provided TRACE-level logs personally to Stefan. If there is no Jira ticket for this yet, would you like me to create one?

On Thu, May 17, 2018 at 1:00 PM, Stefan Richter <s.rich...@data-artisans.com> wrote:

> Hi,
>
> > This raises a couple of questions:
> >
> > - Is it a bug though, that the state restoring goes wrong like it does
> > for my job? Based on my experience it seems like rescaling sometimes
> > works, but then you can have these random errors.
>
> If there is a problem, I would still consider it a bug because it should
> work correctly.
>
> > - If it's not supported properly, why not refuse to restore a checkpoint
> > if it would require rescaling?
>
> It should work properly, but I would prefer to keep this at the level
> of a "hidden feature" until it got some more exposure and also some
> questions about the future of differences between savepoints and
> checkpoints are solved.
>
> > - We have sometimes had Flink jobs where the state has become so heavy
> > that cancelling with a savepoint times out & fails. Incremental
> > checkpoints are still working because they don't time out as long as the
> > state is growing linearly. In that case, if we want to scale up (for
> > example to enable successful savepoint creation ;) ), the only thing we
> > can do is to restore from the latest checkpoint. But then we have no way
> > to scale up by increasing the cluster size, because we can't create a
> > savepoint with a smaller cluster, but on the other hand can't restore a
> > checkpoint to a bigger cluster, if rescaling from a checkpoint is not
> > supposed to be relied on. So in this case we're stuck and forced to
> > start from an empty state?
>
> IMO there is a very good chance that this will simply become a normal
> feature in the near future.
>
> Best,
> Stefan
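To make the proposal concrete, here is a sketch of how the flag could look on the CLI. The paths, jar name, and parallelism values are placeholders; --allowNonRestoredState is the flag that exists today in `flink run`, while --allowRescalingRestoredCheckpointState is only the hypothetical option suggested above, not something Flink currently accepts:

```sh
# Existing behavior: restore from a savepoint, explicitly acknowledging
# that some restored state may not map to the new job graph.
flink run -s hdfs:///savepoints/savepoint-abc123 \
    --allowNonRestoredState \
    -p 8 \
    my-job.jar

# Proposed (hypothetical) behavior: restore from a retained checkpoint at a
# different parallelism, explicitly acknowledging that rescaling restored
# checkpoint state is experimental.
flink run -s hdfs:///checkpoints/chk-42 \
    --allowRescalingRestoredCheckpointState \
    -p 16 \
    my-job.jar
```

The point of the opt-in flag is the same as with --allowNonRestoredState: without it, the job submission would be refused, so nobody rescales from a checkpoint by accident.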