I think that's a good idea. (Opt-in for existing users, until the backward compatibility issues are resolved.)

On Wed, Feb 5, 2020 at 11:57 AM Arvid Heise <ar...@ververica.com> wrote:

> Couldn't we treat a missing option as legacy, but set the new scheduler as
> the default value in all newly shipped flink-conf.yaml?
>
> In this way, old users get the old behavior (either implicitly or
> explicitly) unless they explicitly upgrade.
> New users benefit from the new scheduler.
>
> On Wed, Feb 5, 2020 at 8:13 PM Gary Yao <g...@apache.org> wrote:
>
>> It is indeed unfortunate that these issues are discovered only now. I think
>> Thomas has a valid point, and we might be risking the trust of our users
>> here.
>>
>> What are our options?
>>
>> 1. Document this behavior and how to work around it copiously in the
>>    release notes [1]
>> 2. Try to restore the previous behavior
>> 3. Change default value of jobmanager.scheduler to "legacy" and rollout
>>    the feature in 1.11
>> 4. Change default value of jobmanager.scheduler to "legacy" and rollout
>>    the feature earliest in 1.10.1
>>
>> [1]
>> https://github.com/apache/flink/pull/10997/files#diff-b84c5611825842e8f74301ca70d94d23R86
>>
>> On Wed, Feb 5, 2020 at 7:24 PM Stephan Ewen <se...@apache.org> wrote:
>>
>>> Should we make these a blocker? I am not sure - we could also clearly
>>> state in the release notes how to restore the old behavior, if your setup
>>> assumes that behavior.
>>>
>>> Release candidates for this release have been out since mid December, it
>>> is a bit unfortunate that these things have been raised so late.
>>> Having these rather open ended tickets (how to re-define the existing
>>> metrics in the new scheduler/failover handling) now as release blockers
>>> would mean that 1.10 is indefinitely delayed.
>>>
>>> Are we sure we want to do that?
>>>
>>> On Wed, Feb 5, 2020 at 6:53 PM Thomas Weise <t...@apache.org> wrote:
>>>
>>>> Hi Gary,
>>>>
>>>> Thanks for the clarification!
>>>>
>>>> When we upgrade to a new Flink release, we don't start with a default
>>>> flink-conf.yaml but upgrade our existing tooling and configuration.
>>>> Therefore we notice this issue as part of the upgrade to 1.10, and not
>>>> when we upgraded to 1.9.
>>>>
>>>> I would expect many other users to be in the same camp, and therefore
>>>> consider making these regressions a blocker for 1.10?
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> On Wed, Feb 5, 2020 at 4:53 AM Gary Yao <g...@apache.org> wrote:
>>>>
>>>>>> also notice that the exception causing a restart is no longer
>>>>>> displayed in the UI, which is probably related?
>>>>>
>>>>> Yes, this is also related to the new scheduler. I created FLINK-15917 [1]
>>>>> to track this. Moreover, I created a ticket about the uptime metric not
>>>>> resetting [2]. Both issues already exist in 1.9 if
>>>>> "jobmanager.execution.failover-strategy" is set to "region", which is
>>>>> the case in the default flink-conf.yaml.
>>>>>
>>>>> In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough to
>>>>> fall back to the previous behavior.
>>>>>
>>>>> In 1.10, you can still fall back to the previous behavior by setting
>>>>> "jobmanager.scheduler: legacy" and unsetting
>>>>> "jobmanager.execution.failover-strategy" in your flink-conf.yaml
>>>>>
>>>>> I would not consider these issues blockers since there is a workaround
>>>>> for them, but of course we would like to see the new scheduler getting
>>>>> some production exposure. More detailed release notes about the caveats
>>>>> of the new scheduler will be added to the user documentation.
>>>>>
>>>>>
>>>>>> The watermark issue was
>>>>>> https://issues.apache.org/jira/browse/FLINK-14470
>>>>>
>>>>> This should be fixed now [3].
>>>>>
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-15917
>>>>> [2] https://issues.apache.org/jira/browse/FLINK-15918
>>>>> [3] https://issues.apache.org/jira/browse/FLINK-8949
>>>>>
>>>>> On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise <t...@apache.org> wrote:
>>>>>
>>>>>> Hi Gary,
>>>>>>
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> -->
>>>>>>
>>>>>> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao <g...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Thomas,
>>>>>>>
>>>>>>>> 2) Was there a change in how job recovery reflects in the uptime
>>>>>>>> metric? Didn't uptime previously reset to 0 on recovery (now it just
>>>>>>>> keeps increasing)?
>>>>>>>
>>>>>>> The uptime is the difference between the current time and the time
>>>>>>> when the job transitioned to RUNNING state. By default we no longer
>>>>>>> transition the job out of the RUNNING state when restarting. This has
>>>>>>> something to do with the new scheduler which enables pipelined region
>>>>>>> failover by default [1]. Actually we enabled pipelined region failover
>>>>>>> already in the binary distribution of Flink 1.9 by setting:
>>>>>>>
>>>>>>>     jobmanager.execution.failover-strategy: region
>>>>>>>
>>>>>>> in the default flink-conf.yaml. Unless you have removed this config
>>>>>>> option or you are using a custom yaml, you should be seeing this
>>>>>>> behavior in Flink 1.9. If you do not want region failover, set
>>>>>>>
>>>>>>>     jobmanager.execution.failover-strategy: full
>>>>>>>
>>>>>>
>>>>>> We are using the default (the jobmanager.execution.failover-strategy
>>>>>> setting is not present in our flink config).
>>>>>>
>>>>>> The change in behavior I see is between the 1.9 based deployment and
>>>>>> the 1.10 RC.
>>>>>>
>>>>>> Our 1.9 branch is here:
>>>>>> https://github.com/lyft/flink/tree/release-1.9-lyft
>>>>>>
>>>>>> I also notice that the exception causing a restart is no longer
>>>>>> displayed in the UI, which is probably related?
>>>>>>
>>>>>>>
>>>>>>>> 1) Is the low watermark display in the UI still broken?
>>>>>>>
>>>>>>> I was not aware that this is broken. Is there an issue tracking this
>>>>>>> bug?
>>>>>>>
>>>>>>
>>>>>> The watermark issue was
>>>>>> https://issues.apache.org/jira/browse/FLINK-14470
>>>>>>
>>>>>> (I don't have a good way to verify it is fixed at the moment.)
>>>>>>
>>>>>> Another problem with this 1.10 RC is that the checkpointAlignmentTime
>>>>>> metric is missing. (I have not been able to investigate this further
>>>>>> yet.)
>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Gary
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-14651
>>>>>>>
>>>>>>> On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise <t...@apache.org> wrote:
>>>>>>>
>>>>>>>> I opened a PR for FLINK-15868
>>>>>>>> <https://issues.apache.org/jira/browse/FLINK-15868>:
>>>>>>>> https://github.com/apache/flink/pull/11006
>>>>>>>>
>>>>>>>> With that change, I was able to run an application that consumes from
>>>>>>>> Kinesis.
>>>>>>>>
>>>>>>>> I should have data tomorrow regarding the performance.
>>>>>>>>
>>>>>>>> Two questions/observations:
>>>>>>>>
>>>>>>>> 1) Is the low watermark display in the UI still broken?
>>>>>>>> 2) Was there a change in how job recovery reflects in the uptime
>>>>>>>> metric? Didn't uptime previously reset to 0 on recovery (now it just
>>>>>>>> keeps increasing)?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Thomas
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise <t...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> I found another issue with the Kinesis connector:
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-15868
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 3, 2020 at 3:35 AM Gary Yao <g...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I am hereby canceling the vote due to:
>>>>>>>>>>
>>>>>>>>>> FLINK-15837
>>>>>>>>>> FLINK-15840
>>>>>>>>>>
>>>>>>>>>> Another RC will be created later today.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Gary
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao <g...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>> Please review and vote on the release candidate #1 for the version
>>>>>>>>>>> 1.10.0, as follows:
>>>>>>>>>>> [ ] +1, Approve the release
>>>>>>>>>>> [ ] -1, Do not approve the release (please provide specific
>>>>>>>>>>> comments)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The complete staging area is available for your review, which
>>>>>>>>>>> includes:
>>>>>>>>>>> * JIRA release notes [1],
>>>>>>>>>>> * the official Apache source release and binary convenience
>>>>>>>>>>> releases to be deployed to dist.apache.org [2], which are signed
>>>>>>>>>>> with the key with fingerprint
>>>>>>>>>>> BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
>>>>>>>>>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>>>>>>>>>> * source code tag "release-1.10.0-rc1" [5],
>>>>>>>>>>>
>>>>>>>>>>> The announcement blog post is in the works.
>>>>>>>>>>> I will update this voting thread with a link to the pull request
>>>>>>>>>>> soon.
>>>>>>>>>>>
>>>>>>>>>>> The vote will be open for at least 72 hours. It is adopted by
>>>>>>>>>>> majority approval, with at least 3 PMC affirmative votes.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yu & Gary
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
>>>>>>>>>>> [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
>>>>>>>>>>> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>>>>>>>>>>> [4]
>>>>>>>>>>> https://repository.apache.org/content/repositories/orgapacheflink-1325
>>>>>>>>>>> [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1