Thanks a lot for starting this thread. How many users are affected by the problem? Is anybody affected besides the initial issue reporter? If it is just one person, I would rather suggest helping to push the 1.11.3 release over the line or working on more StateFun features ;)
On Tue, Nov 3, 2020 at 11:58 AM Igal Shilman <i...@ververica.com> wrote:

> Hi Gordon,
>
> Thanks for driving this discussion!
>
> I would go with the second suggestion - having two consecutive StateFun
> releases 2.2.1 and 2.2.2, since the Flink-1.11.3 release might take a
> while, and this hot-fix release is important enough to get out as early
> as possible.
>
> Cheers,
> Igal.
>
> On Mon, Nov 2, 2020 at 11:43 AM Tzu-Li (Gordon) Tai <tzuli...@apache.org>
> wrote:
>
> > Hi,
> >
> > We’re currently thinking about releasing StateFun 2.2.1, to address a
> > critical bug that causes restores from checkpoints / savepoints to fail
> > under certain circumstances [1].
> >
> > To provide a bit more context, the full fix for this issue is two-fold:
> >
> > 1. *Fix restoring from checkpoints / savepoints taken with the same
> >    StateFun version:* this has already been fixed in StateFun, with
> >    changes backported to `flink-statefun/release-2.2`.
> > 2. *Allow restoring from older savepoints taken with StateFun <=
> >    2.2.0:* this requires a few fixes to Flink around restoring heap-based
> >    timers [2] and iterating through key groups in restored raw keyed
> >    state streams [3]. These fixes will be included in Flink 1.11.3 [4],
> >    meaning that to fix this, StateFun will need to wait until Flink
> >    1.11.3 is out and upgrade its Flink dependency.
> >
> > The main discussion point here is whether or not it makes sense for
> > StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the
> > problems 1) and 2) can be solved together in a single hotfix release.
> >
> > The other option is to release StateFun 2.2.1 already with fixes for
> > problem 1) only, and have another follow-up hotfix release 2.2.2 after
> > Flink 1.11.3 is available.
> >
> > I propose to keep a close eye on the progress of Flink 1.11.3 (you can
> > track progress on the 1.11.3 discussion thread [4]), and *make a decision
> > here mid-week on Wednesday, Nov. 4th*.
> >
> > If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
> > because it could take a while, we can start with a StateFun 2.2.1 RC
> > right away; otherwise, if Flink 1.11.3 seems to be just around the
> > corner, we can wait for a few more days.
> >
> > What do you think?
> >
> > Cheers,
> > Gordon
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-19692
> > [2] https://github.com/apache/flink/pull/13761
> > [3] https://github.com/apache/flink/pull/13772
> > [4] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html