I see. I appreciate keeping this option available even if it's "beta". The
current situation could be documented better, though.

As long as rescaling from checkpoint is not officially supported, I would
put it behind a flag similar to --allowNonRestoredState. The flag could be
called --allowRescalingRestoredCheckpointState, for example. This would
make sure that users are aware that what they're using is experimental and
might have unexpected effects.
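To illustrate, a hypothetical invocation could look like the sketch below. The -s and -p options are the existing Flink CLI options for the restore path and parallelism; the last flag is only the name suggested above and does not exist today, and the checkpoint path is made up for the example:

```shell
# Restore an externalized checkpoint at a different parallelism.
# -s and -p already exist in the Flink CLI; the long flag is the
# proposed (hypothetical) opt-in for rescaling checkpoint state.
flink run \
  -s hdfs:///checkpoints/<job-id>/chk-42 \
  -p 8 \
  --allowRescalingRestoredCheckpointState \
  my-job.jar
```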

As for the bug I hit: yes, I was able to reproduce it consistently, and I
have shared TRACE-level logs privately with Stefan. If there is no Jira
ticket for this yet, would you like me to create one?

On Thu, May 17, 2018 at 1:00 PM, Stefan Richter <s.rich...@data-artisans.com
> wrote:

> Hi,
>
> >
> > This raises a couple of questions:
> > - Is it a bug though, that the state restoring goes wrong like it does
> for my job? Based on my experience it seems like rescaling sometimes works,
> but then you can have these random errors.
>
> If there is a problem, I would still consider it a bug because it should
> work correctly.
>
> > - If it's not supported properly, why not refuse to restore a checkpoint
> if it would require rescaling?
>
> It should work properly, but I would prefer to keep this at the level
> of a "hidden feature" until it has had some more exposure and some open
> questions about the future differences between savepoints and
> checkpoints are resolved.
>
> > We have sometimes had Flink jobs where the state has become so heavy
> that cancelling with a savepoint times out & fails. Incremental checkpoints
> still work because they don't time out as long as the state is growing
> linearly. In that case, if we want to scale up (for example to enable
> successful savepoint creation ;) ), the only thing we can do is restore
> from the latest checkpoint. But then we have no way to scale up, because
> we can't create a savepoint with the smaller cluster and, on the other
> hand, can't restore a checkpoint to a bigger cluster if rescaling from a
> checkpoint is not supposed to be relied on. So in this case we're stuck
> and forced to start from an empty state?
>
> IMO there is a very good chance that this will simply become a normal
> feature in the near future.
>
> Best,
> Stefan
>
>
