I am able to reproduce this failure by loading the production savepoint into a
locally running Flink 1.11 job using the State Processor API. The same
sequence of events occurs: the Kryo snapshot deserializer stores a null for the
refactored Savepoint interface, which causes subsequent failures to restore
operator state. The state backend is RocksDB.

Bodily copying the 1.9.0 source code for
org.apache.flink.runtime.checkpoint.savepoint.Savepoint into my test job allows
it to load the savepoint and restore the operator states. But that is a
terrible workaround, and I am looking for a good solution.



From: Robert Metzger <rmetz...@apache.org>
Date: Wednesday, August 4, 2021 at 10:21 AM
To: Weston Woods <wwo...@spireon.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>, Timo Walther 
<twal...@apache.org>
Subject: Re: Savepoint class refactor in 1.11 causing restore from 1.9 
savepoint to fail

Hi Weston,

Oh indeed, you are right! I quickly tried restoring a 1.9 savepoint on a 1.11 
runtime and it worked. So in principle this seems to be supported.

I'm including Timo in this thread; he has a lot of experience with the
serializers.

On Tue, Aug 3, 2021 at 6:59 PM Weston Woods <wwo...@spireon.com> wrote:
Robert,

Thanks for your reply. How should I interpret the savepoint compatibility
table here
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/#compatibility-table
if a 1.9 savepoint cannot be restored into a 1.11 runtime?



From: Robert Metzger <rmetz...@apache.org>
Date: Tuesday, August 3, 2021 at 11:52 AM
To: Weston Woods <wwo...@spireon.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: Savepoint class refactor in 1.11 causing restore from 1.9 
savepoint to fail

Hi Weston,
I have never looked into the savepoint migration code paths myself, but I know
that savepoint migration across multiple versions is not supported (1.9 can 
only migrate to 1.10, not 1.11). We have test coverage for these migrations, 
and I would be surprised if this "Savepoint" class migration is not covered in 
these tests.

Have you tried upgrading from 1.9 to 1.10, and then from 1.10 to 1.11?

On Fri, Jul 30, 2021 at 11:53 PM Weston Woods <wwo...@spireon.com> wrote:
I am unable to restore a 1.9 savepoint into a 1.11 runtime for the very
interesting reason that the Savepoint class was renamed and repackaged between
those two releases. Apparently a Kryo serializer has that class registered in
the 1.9 runtime. I can’t think of a good reason for that class to be
registered with Kryo; none of the job operators reference any such thing. Yet
there it is, causing the following exception and preventing an upgrade to a new
runtime.
