I think that's a good idea.

(Opt-in for existing users, until the backward compatibility issues are
resolved.)
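
A minimal sketch of what this opt-in could look like, in Python for brevity. It is not Flink's configuration code: `resolve_scheduler` is a hypothetical helper, and "ng" is an assumed value for the new scheduler (the thread below only names "legacy"). The idea is that a missing option means a configuration carried over from an older release, so it keeps the old behavior, while a freshly shipped flink-conf.yaml sets the option explicitly.

```python
# Hypothetical helper for illustration; not Flink's actual config handling.
LEGACY = "legacy"
NEW = "ng"  # assumed value for the new scheduler; the thread only names "legacy"


def resolve_scheduler(conf: dict) -> str:
    """Pick the scheduler for a parsed flink-conf.yaml.

    A missing "jobmanager.scheduler" key is treated as legacy, so existing
    setups keep the old behavior until they opt in explicitly.
    """
    return conf.get("jobmanager.scheduler", LEGACY)


# Existing user: configuration carried over from 1.9, option absent.
upgraded_conf = {"jobmanager.execution.failover-strategy": "region"}

# New user: the shipped flink-conf.yaml sets the option explicitly.
shipped_conf = {"jobmanager.scheduler": NEW}

print(resolve_scheduler(upgraded_conf))  # legacy
print(resolve_scheduler(shipped_conf))   # ng
```

With this split, the default in code and the default in the shipped file differ on purpose: only users who start from the new file get the new scheduler.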


On Wed, Feb 5, 2020 at 11:57 AM Arvid Heise <ar...@ververica.com> wrote:

> Couldn't we treat a missing option as legacy, but set the new scheduler as
> the default value in all newly shipped flink-conf.yaml?
>
> In this way, old users get the old behavior (either implicitly or
> explicitly) unless they explicitly upgrade.
> New users benefit from the new scheduler.
>
> On Wed, Feb 5, 2020 at 8:13 PM Gary Yao <g...@apache.org> wrote:
>
> > It is indeed unfortunate that these issues are discovered only now. I think
> > Thomas has a valid point, and we might be risking the trust of our users
> > here.
> >
> > What are our options?
> >
> >     1. Document this behavior and how to work around it copiously in the
> > release notes [1]
> >     2. Try to restore the previous behavior
> >     3. Change default value of jobmanager.scheduler to "legacy" and roll out
> > the feature in 1.11
> >     4. Change default value of jobmanager.scheduler to "legacy" and roll out
> > the feature in 1.10.1 at the earliest
> >
> > [1] https://github.com/apache/flink/pull/10997/files#diff-b84c5611825842e8f74301ca70d94d23R86
> >
> > On Wed, Feb 5, 2020 at 7:24 PM Stephan Ewen <se...@apache.org> wrote:
> >
> > > Should we make these a blocker? I am not sure. We could also clearly
> > > state in the release notes how to restore the old behavior, if your setup
> > > assumes that behavior.
> > >
> > > Release candidates for this release have been out since mid-December, so
> > > it is a bit unfortunate that these issues have been raised so late.
> > > Making these rather open-ended tickets (how to redefine the existing
> > > metrics in the new scheduler/failover handling) release blockers now
> > > would mean that 1.10 is indefinitely delayed.
> > >
> > > Are we sure we want to do that?
> > >
> > > On Wed, Feb 5, 2020 at 6:53 PM Thomas Weise <t...@apache.org> wrote:
> > >
> > >> Hi Gary,
> > >>
> > >> Thanks for the clarification!
> > >>
> > >> When we upgrade to a new Flink release, we don't start with a default
> > >> flink-conf.yaml but upgrade our existing tooling and configuration.
> > >> Therefore we noticed this issue as part of the upgrade to 1.10, and not
> > >> when we upgraded to 1.9.
> > >>
> > >> I would expect many other users to be in the same camp. Should we
> > >> therefore consider making these regressions a blocker for 1.10?
> > >>
> > >> Thanks,
> > >> Thomas
> > >>
> > >>
> > >> On Wed, Feb 5, 2020 at 4:53 AM Gary Yao <g...@apache.org> wrote:
> > >>
> > >> > > also notice that the exception causing a restart is no longer displayed
> > >> > > in the UI, which is probably related?
> > >> >
> > >> > Yes, this is also related to the new scheduler. I created FLINK-15917 [1]
> > >> > to track this. Moreover, I created a ticket about the uptime metric not
> > >> > resetting [2]. Both issues already exist in 1.9 if
> > >> > "jobmanager.execution.failover-strategy" is set to "region", which is the
> > >> > case in the default flink-conf.yaml.
> > >> >
> > >> > In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough to
> > >> > fall back to the previous behavior.
> > >> >
> > >> > In 1.10, you can still fall back to the previous behavior by setting
> > >> > "jobmanager.scheduler: legacy" and unsetting
> > >> > "jobmanager.execution.failover-strategy" in your flink-conf.yaml.
> > >> >
> > >> > I would not consider these issues blockers since there is a workaround for
> > >> > them, but of course we would like to see the new scheduler getting some
> > >> > production exposure. More detailed release notes about the caveats of the
> > >> > new scheduler will be added to the user documentation.
> > >> >
> > >> >
> > >> > > The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470
> > >> >
> > >> > This should be fixed now [3].
> > >> >
> > >> >
> > >> > [1] https://issues.apache.org/jira/browse/FLINK-15917
> > >> > [2] https://issues.apache.org/jira/browse/FLINK-15918
> > >> > [3] https://issues.apache.org/jira/browse/FLINK-8949
> > >> >
> > >> > On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise <t...@apache.org> wrote:
> > >> >
> > >> >> Hi Gary,
> > >> >>
> > >> >> Thanks for the reply.
> > >> >>
> > >> >> -->
> > >> >>
> > >> >> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao <g...@apache.org> wrote:
> > >> >>
> > >> >> > Hi Thomas,
> > >> >> >
> > >> >> > > 2) Was there a change in how job recovery reflects in the uptime
> > >> >> > > metric? Didn't uptime previously reset to 0 on recovery (now it just
> > >> >> > > keeps increasing)?
> > >> >> >
> > >> >> > The uptime is the difference between the current time and the time when
> > >> >> > the job transitioned to RUNNING state. By default we no longer transition
> > >> >> > the job out of the RUNNING state when restarting. This is related to the
> > >> >> > new scheduler, which enables pipelined region failover by default [1].
> > >> >> > Actually we enabled pipelined region failover already in the binary
> > >> >> > distribution of Flink 1.9 by setting:
> > >> >> >
> > >> >> >     jobmanager.execution.failover-strategy: region
> > >> >> >
> > >> >> > in the default flink-conf.yaml. Unless you have removed this config option
> > >> >> > or you are using a custom yaml, you should be seeing this behavior in
> > >> >> > Flink 1.9. If you do not want region failover, set
> > >> >> >
> > >> >> >     jobmanager.execution.failover-strategy: full
> > >> >> >
> > >> >> >
> > >> >> We are using the default (the jobmanager.execution.failover-strategy
> > >> >> setting is not present in our flink config).
> > >> >>
> > >> >> The change in behavior I see is between the 1.9 based deployment and the
> > >> >> 1.10 RC.
> > >> >>
> > >> >> Our 1.9 branch is here:
> > >> >> https://github.com/lyft/flink/tree/release-1.9-lyft
> > >> >>
> > >> >> I also notice that the exception causing a restart is no longer displayed
> > >> >> in the UI, which is probably related?
> > >> >>
> > >> >>
> > >> >> >
> > >> >> > > 1) Is the low watermark display in the UI still broken?
> > >> >> >
> > >> >> > I was not aware that this is broken. Is there an issue tracking this
> > >> >> > bug?
> > >> >> >
> > >> >>
> > >> >> The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470
> > >> >>
> > >> >> (I don't have a good way to verify it is fixed at the moment.)
> > >> >>
> > >> >> Another problem with this 1.10 RC is that the checkpointAlignmentTime
> > >> >> metric is missing. (I have not been able to investigate this further yet.)
> > >> >>
> > >> >>
> > >> >> >
> > >> >> > Best,
> > >> >> > Gary
> > >> >> >
> > >> >> > [1] https://issues.apache.org/jira/browse/FLINK-14651
> > >> >> >
> > >> >> > On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise <t...@apache.org> wrote:
> > >> >> >
> > >> >> >> I opened a PR for FLINK-15868
> > >> >> >> <https://issues.apache.org/jira/browse/FLINK-15868>:
> > >> >> >> https://github.com/apache/flink/pull/11006
> > >> >> >>
> > >> >> >> With that change, I was able to run an application that consumes from
> > >> >> >> Kinesis.
> > >> >> >>
> > >> >> >> I should have data tomorrow regarding the performance.
> > >> >> >>
> > >> >> >> Two questions/observations:
> > >> >> >>
> > >> >> >> 1) Is the low watermark display in the UI still broken?
> > >> >> >> 2) Was there a change in how job recovery reflects in the uptime metric?
> > >> >> >> Didn't uptime previously reset to 0 on recovery (now it just keeps
> > >> >> >> increasing)?
> > >> >> >>
> > >> >> >> Thanks,
> > >> >> >> Thomas
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise <t...@apache.org> wrote:
> > >> >> >>
> > >> >> >> > I found another issue with the Kinesis connector:
> > >> >> >> >
> > >> >> >> > https://issues.apache.org/jira/browse/FLINK-15868
> > >> >> >> >
> > >> >> >> >
> > >> >> >> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao <g...@apache.org> wrote:
> > >> >> >> >
> > >> >> >> >> Hi everyone,
> > >> >> >> >>
> > >> >> >> >> I am hereby canceling the vote due to:
> > >> >> >> >>
> > >> >> >> >>     FLINK-15837
> > >> >> >> >>     FLINK-15840
> > >> >> >> >>
> > >> >> >> >> Another RC will be created later today.
> > >> >> >> >>
> > >> >> >> >> Best,
> > >> >> >> >> Gary
> > >> >> >> >>
> > >> >> >> >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao <g...@apache.org> wrote:
> > >> >> >> >>
> > >> >> >> >> > Hi everyone,
> > >> >> >> >> > Please review and vote on the release candidate #1 for the version
> > >> >> >> >> > 1.10.0, as follows:
> > >> >> >> >> > [ ] +1, Approve the release
> > >> >> >> >> > [ ] -1, Do not approve the release (please provide specific comments)
> > >> >> >> >> >
> > >> >> >> >> >
> > >> >> >> >> > The complete staging area is available for your review, which includes:
> > >> >> >> >> > * JIRA release notes [1],
> > >> >> >> >> > * the official Apache source release and binary convenience releases to
> > >> >> >> >> > be deployed to dist.apache.org [2], which are signed with the key with
> > >> >> >> >> > fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
> > >> >> >> >> > * all artifacts to be deployed to the Maven Central Repository [4],
> > >> >> >> >> > * source code tag "release-1.10.0-rc1" [5],
> > >> >> >> >> >
> > >> >> >> >> > The announcement blog post is in the works. I will update this voting
> > >> >> >> >> > thread with a link to the pull request soon.
> > >> >> >> >> >
> > >> >> >> >> > The vote will be open for at least 72 hours. It is adopted by majority
> > >> >> >> >> > approval, with at least 3 PMC affirmative votes.
> > >> >> >> >> >
> > >> >> >> >> > Thanks,
> > >> >> >> >> > Yu & Gary
> > >> >> >> >> >
> > >> >> >> >> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
> > >> >> >> >> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
> > >> >> >> >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > >> >> >> >> > [4] https://repository.apache.org/content/repositories/orgapacheflink-1325
> > >> >> >> >> > [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
> > >> >> >> >> >
> > >> >> >> >>
> > >> >> >> >
> > >> >> >>
> > >> >> >
> > >> >>
> > >> >
> > >>
> > >
> >
>
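
For reference, the uptime behavior Gary describes in the quoted thread can be modeled roughly as follows. This is a toy Python sketch, not Flink code: `JobUptime`, `FakeClock`, and `simulate` are hypothetical names, and modeling a restart under the old behavior as a fresh transition to RUNNING is an assumption based on his explanation.

```python
class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class JobUptime:
    """Uptime = current time minus the time of the last transition to RUNNING."""

    def __init__(self, clock):
        self._clock = clock
        self._running_since = None

    def transition_to_running(self):
        self._running_since = self._clock()

    def value(self):
        if self._running_since is None:
            return 0.0
        return self._clock() - self._running_since


def simulate(restart_resets_running: bool) -> float:
    """Run a job, recover from a failure at t=60, and read uptime at t=90."""
    clock = FakeClock()
    job = JobUptime(clock)
    job.transition_to_running()
    clock.now = 60.0
    if restart_resets_running:
        # Assumed old behavior: the job leaves RUNNING and enters it again.
        job.transition_to_running()
    # Otherwise the job stays in RUNNING during recovery (region failover).
    clock.now = 90.0
    return job.value()


print(simulate(restart_resets_running=True))   # 30.0
print(simulate(restart_resets_running=False))  # 90.0
```

In the second case the metric keeps increasing across the recovery, which matches what Thomas observed with the 1.10 RC.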
