Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-02 Thread Chesnay Schepler
People really have to stop thinking that just because something works 
for us it is also a good solution.
Also, please remember that our builds run for 2h from start to finish, 
and not the 14 _minutes_ it takes for zeppelin.
We are dealing with an entirely different scale here, both in terms of 
build times and number of builds.


In this very thread people have been complaining about long queue times 
for their builds. Surprise, other Apache projects have been suffering 
the very same thing due to us not controlling our build times. While 
switching services (be it Jenkins, CircleCI or whatever) will possibly 
work for us (and these options are actually attractive, like CircleCI's 
proper support for build artifacts), it will also result in us likely 
negatively affecting other projects in significant ways.


Sure, the Jenkins setup has a good user experience for us, at the cost 
of blocking Jenkins workers for a _lot_ of time. Right now we have 25 
PRs in our queue; that's possibly 50h we'd consume of Jenkins 
resources, and the European contributors haven't even really started yet.


FYI, the latest INFRA response from INFRA-18533:

"Our rough metrics shows that Flink used over 5800 hours of build time 
last month. That is equal to EIGHT servers running 24/7 for the ENTIRE 
MONTH. EIGHT. nonstop.
When we discovered this last night, we discussed it some and are going 
to tune down Flink to allow only five executors maximum. We cannot allow 
Flink to consume so much of a Foundation shared resource."


So yes, we either
a) have to heavily reduce our CI usage or
b) fund our own, either maintaining it ourselves or donating to Apache.

On 02/07/2019 05:11, Bowen Li wrote:
By looking at the git history of the Jenkins script, its core part was 
finished in March 2017 (with only two minor updates in 2017/2018), so 
it's been running for over two years now, and it feels like the Zeppelin 
community has been quite happy with it. @Jeff Zhang 
<mailto:zjf...@gmail.com> can you share your insights and user 
experience with the Jenkins+Travis approach?


Things like:

- has the approach completely solved the resource capacity problem for 
the Zeppelin community? Is the Zeppelin community happy with the result?

- is the whole configuration chain stable (e.g. uptime) enough?
- how often do you need to maintain the Jenkins infra? How many people 
are usually involved in maintenance and bug fixes?


The downside of this approach seems to me to be mostly the maintenance 
- maintaining the script and the Jenkins infra.


** Having Our Own Travis-CI.com Account **

Another alternative I've been thinking of is to have our own 
travis-ci.com account with paid, dedicated resources. Note that 
travis-ci.org is the free version and travis-ci.com is the commercial 
version. We currently use a shared resource pool managed by the ASF 
INFRA team on travis-ci.org, but we have no control over it - we can't 
see how it's configured, how much resources are available, how 
resources are allocated among Apache projects, etc. The nice things 
about having an account on travis-ci.com are:


- relatively low cost with a much better resource guarantee than what we 
currently have [1]: $249/month with 5 dedicated concurrent builds, 
$489/month with 10

- low maintenance work compared to using Jenkins
- (potentially) no migration cost, according to Travis's docs [2] 
(pending verification)
- full control over the build capacity/configuration compared to using 
ASF INFRA's pool


I'd be surprised if a community as vibrant as ours cannot find and 
fund $249*12 = $2,988 a year in exchange for a much better developer 
experience and much higher productivity.


[1] https://travis-ci.com/plans
[2] 
https://docs.travis-ci.com/user/migrate/open-source-repository-migration


On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <ches...@apache.org> wrote:


So yes, the Jenkins job keeps pulling the state from Travis until it
finishes.

Not sure I'm comfortable with the idea of using Jenkins workers
just to idle for several hours.

On 29/06/2019 14:56, Jeff Zhang wrote:
> Here's what the zeppelin community did, we made a python script to check the
> build status of the pull request.
> Here's the script:
> https://github.com/apache/zeppelin/blob/master/travis_check.py
>
> And this is the script we used in the Jenkins build job.
>
> if [ -f "travis_check.py" ]; then
>    git log -n 1
>    STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" | sed 's/.*GitHub pull request  href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
>    AUTHOR=$(echo $STATUS | sed 's/.*[/]
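For context, the polling loop Jeff describes can be sketched roughly as follows. This is a minimal Python sketch against the Travis API v3; the endpoint, build id, token, and polling interval are illustrative assumptions, not what Zeppelin's travis_check.py actually does:

```python
import json
import time
import urllib.request

# Travis API v3 build states that mean a build has finished.
FINISHED_STATES = {"passed", "failed", "errored", "canceled"}

def is_finished(state):
    """Return True once a Travis build has reached a terminal state."""
    return state in FINISHED_STATES

def poll_build(build_id, token, interval=60):
    """Poll the Travis API until the given build finishes; return its final state.

    `build_id` and `token` are placeholders -- in a real setup they would
    come from the pull request metadata and the CI job's credentials.
    """
    req = urllib.request.Request(
        "https://api.travis-ci.com/build/%d" % build_id,
        headers={"Travis-API-Version": "3", "Authorization": "token " + token},
    )
    while True:
        with urllib.request.urlopen(req) as resp:
            state = json.load(resp)["state"]
        if is_finished(state):
            return state
        time.sleep(interval)  # this is exactly the "idling worker" concern above
```

The sketch makes the trade-off Chesnay points out visible: the worker running `poll_build` does nothing but sleep between requests for the whole build duration.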

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-02 Thread Chesnay Schepler
As a short-term stopgap, since we can assume this issue will become much 
worse in the following days/weeks, we could disable IT cases in PRs and 
only run them on master.


On 02/07/2019 12:03, Chesnay Schepler wrote:

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-03 Thread Chesnay Schepler
Are they using their own Travis CI pool, or did they switch to an 
entirely different CI service?


If we can just switch to our own Travis pool, just for our project, then 
this might be something we can do fairly quickly?


On 03/07/2019 05:55, Bowen Li wrote:

I responded in the INFRA ticket [1] that I believe they are using the wrong
metric against Flink, and that total build time is a completely different
thing than guaranteed build capacity.

My response:

"As mentioned above, since I started to pay attention to Flink's build
queue a few tens of days ago, I'm in Seattle and I saw no build kicking
off in PST daytime on weekdays for Flink. Our teammates in China and Europe
have also reported similar observations. So we need to evaluate where the
large total build time came from - if 1) your number and 2) our
observations from three locations that cover pretty much a full day are
all true, I **guess** one reason can be that the extra build time came
from weekends, when other Apache projects may be idle and Flink just
drains its congested queue hard.

Please be aware that we're not complaining about the lack of resources
in general; I'm complaining about the lack of **stable, dedicated**
resources. An example of the latter is: currently, even if no build is
in Flink's queue and I submit a request to be the queue head in the PST
morning, my build won't even start for 6-8+ hours. That is an absurd
amount of waiting time.

That said, if ASF INFRA decides to adopt a quota system and grants
Flink five DEDICATED servers that run all the time only for Flink, that'll
be PERFECT and can totally solve our problem now.


I feel what's missing in the ASF INFRA Travis resource pool is some level
of build capacity SLAs and certainty."


Again, I believe these are two different problems in nature: long build
times vs. a lack of dedicated build resources. That is, shortening build
times may or may not relieve the situation. I'm slightly negative on
disabling IT cases for PRs, since the downside is that we risk letting
through bugs in PRs that UTs don't catch, which may cost a lot more to fix
and may slow down or even block others, but I am open to others' opinions
on it.

AFAICT from the INFRA ticket [1], donating to ASF INFRA won't be a feasible
way to solve our problem, since INFRA's pool is fully shared and they have
no control over, or finer insight into, resource allocation to a specific
Apache project. As mentioned in [1], Apache Arrow is moving away from the
ASF INFRA Travis pool (they were actually surprised Flink hasn't planned to
do so). I know that Spark runs its own build infra. If we all agree on
funding our own build infra, I'd be glad to help investigate potential
options after releasing 1.9, since I'm super busy with 1.9 now.

[1] https://issues.apache.org/jira/browse/INFRA-18533



On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler  wrote:



Re: [DISCUSS] Publish the PyFlink into PyPI

2019-07-03 Thread Chesnay Schepler
The existing artifact in the pyflink project was neither released by the 
Flink project / anyone affiliated with it, nor approved by the Flink PMC.


As such, if we were to use this account, I believe we should delete it 
so as not to mislead users into thinking this is in any way an 
Apache-provided distribution. Since this goes against the user's wishes, 
I would be in favor of creating a separate account and giving back 
control over the pyflink account.


My take on the raised points:
1.1) "apache-flink"
1.2)  option 2
2) Given that we only distribute Python code, there should be no reason 
to differentiate between Scala versions. We should not be distributing 
any Java/Scala code and/or modules to PyPI. Currently, I'm a bit 
confused about this question and wonder what exactly we are trying to 
publish here.
3) This should be treated as any other source release; i.e., it needs a 
LICENSE and NOTICE file, signatures, and a PMC vote. My suggestion would 
be to make this part of our normal release process. There would be _one_ 
source release on dist.apache.org encompassing everything, and a 
separate Python-focused source release that we push to PyPI. The 
LICENSE and NOTICE contained in the Python source release must also be 
present in the source release of Flink; so basically the Python source 
release is just the contents of the flink-python module plus the maven 
pom.xml, with no other special sauce added during the release process.


On 02/07/2019 05:42, jincheng sun wrote:

Hi all,

With the effort of FLIP-38 [1], the Python Table API (without UDF support
for now) will be supported in the coming release 1.9.
As described in "Build PyFlink" [2], if users want to use the Python Table
API, they can manually install it using the command:
"cd flink-python && python3 setup.py sdist && pip install dist/*.tar.gz".

This is non-trivial for users, and it would be better if we could follow the
Python way and publish PyFlink to PyPI, which is a repository of software
for the Python programming language. Then users can use the standard Python
package manager "pip" to install PyFlink: "pip install pyflink". So, there
are some topics that need to be discussed:

1. How to publish PyFlink to PyPI

1.1 Project Name
  We need to decide the PyPI project name to use, for example
apache-flink, pyflink, etc.

 Regarding the name "pyflink", it has already been registered by
@ueqt, and there is already a package '1.0' released under this project
which is based on flink-libraries/flink-python.

 @ueqt has kindly agreed to give this project back to the community. And
he has requested that the released package '1.0' not be removed, as it
is already used in their company.

 So we need to decide whether to use the name 'pyflink'. If yes, we
need to figure out how to handle the package '1.0' under this project.

 From my point of view, "pyflink" is the better name for our project,
and we can keep the 1.0 release, as more people may want to use it.

1.2 PyPI account for release
 We also need to decide which account to use to publish packages to PyPI.

 There are two permission levels in PyPI: owner and maintainer:

 1) The owner can upload releases and delete files, releases, or the entire
project.
 2) The maintainer can also upload releases. However, they cannot delete
files, releases, or the project.

 So there are two options in my mind:

 1) Create an account such as 'pyflink' as the owner, share it with all
the release managers, and then release managers publish the package to
PyPI using this account.
 2) Create an account such as 'pyflink' as the owner (only the PMC can
manage it) and add the release managers' accounts as maintainers of the
project. Release managers publish the package to PyPI using their own
accounts.

 As far as I know, PySpark takes option 1) and Apache Beam takes option 2).

 From my point of view, I prefer option 2) as it's safer: it eliminates
the risk of accidentally deleting old releases and at the same time keeps
a trace of who is operating.

2. How to handle Scala 2.11 and Scala 2.12

The PyFlink package bundles the jars in the package. As we know, there are
two versions of the jars for each module: one for Scala 2.11 and the other
for Scala 2.12. So theoretically there will be two PyFlink packages. We
need to decide which one to publish to PyPI, or both. If both packages are
published to PyPI, we may need two separate projects, such as pyflink_211
and pyflink_212. Maybe more in the future, such as pyflink_213.

 (BTW, I think we should bring up a discussion about dropping Scala 2.11
in the Flink 1.10 release, as Scala 2.13 has been available since early June.)

 From my point of view, for now we can release only the Scala 2.11
version, since Scala 2.11 is our default version in Flink.

3. Legal probl

Re: [DISCUSS] Publish the PyFlink into PyPI

2019-07-03 Thread Chesnay Schepler

So this would not be a source release then, but a full-blown binary release.

Maybe it is just me, but I find it a bit suspect to ship an entire Java 
application via PyPI just because there's a Python API for it.


We definitely need input from more people here.

On 03/07/2019 14:09, Dian Fu wrote:

Hi Chesnay,

Thanks a lot for the suggestions.

Regarding “distributing java/scala code to PyPI”:
The Python Table API is just a wrapper around the Java Table API, and without the 
Java/Scala code, two steps are needed to set up an environment to execute a 
Python Table API program:
1) Install pyflink using "pip install apache-flink"
2) Download the Flink distribution and set FLINK_HOME to point to it.
Besides, users have to make sure that the manually installed Flink is 
compatible with the pip-installed pyflink.

Bundling the Java/Scala code inside the Python package would eliminate step 2) and make 
it simpler for users to install pyflink. There was a short discussion 
<https://issues.apache.org/jira/browse/SPARK-1267> on this in the Spark community, 
and they finally decided to package the Java/Scala code in the Python package. (BTW, 
PySpark only bundles the Scala 2.11 jars.)

Regards,
Dian


On Jul 3, 2019, at 7:13 PM, Chesnay Schepler wrote:


[VOTE] Migrate to sponsored Travis account

2019-07-04 Thread Chesnay Schepler
I've raised a JIRA ticket 
<https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to inquire 
whether it would be possible to switch to a different Travis account, 
and if so, what steps would need to be taken.
We need a proper confirmation from INFRA, since we are not in full 
control of the flink repository (for example, we cannot access the 
settings page).


If this is indeed possible, Ververica is willing to sponsor a Travis 
account for the Flink project.

This would provide us with more than enough resources.

Since this makes the project more reliant on resources provided by 
external companies, I would like to vote on this.


Please vote on this proposal, as follows:
[ ] +1, Approve the migration to a Ververica-sponsored Travis account, 
provided that INFRA approves
[ ] -1, Do not approve the migration to a Ververica-sponsored Travis 
account


The vote will be open for at least 24h, and until we have confirmation 
from INFRA. The voting period may be shorter than the usual 3 days since 
our current CI is effectively not working.


On 04/07/2019 06:51, Bowen Li wrote:
Re: > Are they using their own Travis CI pool, or did they switch to an 
entirely different CI service?


I reached out to Wes and Krisztián from the Apache Arrow PMC. They are 
currently moving away from ASF's Travis to their own in-house bare-metal 
machines at [1] with a custom CI application at [2]. They've seen 
significant improvements w.r.t. both much higher performance and 
basically no resource waiting time - a "night-and-day" difference, 
quoting Wes.


Re: > If we can just switch to our own Travis pool, just for our 
project, then this might be something we can do fairly quickly?


I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/
[2] https://github.com/ursa-labs/ursabot
[3] 
https://docs.travis-ci.com/user/migrate/open-source-repository-migration

[4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com



On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <ches...@apache.org> wrote:

Re: [DISCUSS] A more restrictive JIRA workflow

2019-07-04 Thread Chesnay Schepler
o offer to add a functionality to the Flinkbot to automatically close
pull requests which have been opened against an unassigned JIRA ticket.
Being rejected by an automated system, which just applies a rule, is
nicer than being rejected by a person.


On Wed, Feb 27, 2019 at 1:45 PM Stephan Ewen <se...@apache.org> wrote:

@Chesnay - yes, this is possible, according to infra.

On Wed, Feb 27, 2019 at 11:09 AM ZiLi Chen <wander4...@gmail.com> wrote:

Hi,

@Hequn
It might be hard to separate JIRAs into conditional and unconditional
ones. Even if INFRA supports such a separation, we meet the problem of
whether a contributor is granted the right to decide the type of a JIRA.
If so, contributors might tend to create JIRAs as unconditional; and if
not, we fall back to a contributor asking a committer to set the JIRA as
unconditional, which is no better than asking a committer to assign the
JIRA to the contributor.

@Timo
"More discussion before opening a PR" sounds good. However, it requires
more effort/participation from the committers' side. From my own side,
it's exciting to see our committers become more active :-)

Best,
tison.


Chesnay Schepler wrote on Wed, Feb 27, 2019 at 5:06 PM:

We currently cannot change the JIRA permissions. Have you asked INFRA
whether it is possible to set up a Flink-specific permission scheme?

On 25.02.2019 14:23, Timo Walther wrote:

Hi everyone,

as some of you might have noticed during the last weeks, the Flink
community grew quite a bit. A lot of people have applied for contributor
permissions and started working on issues, which is great for the growth
of Flink!

However, we've also observed that managing JIRA and coordinating work
and responsibilities becomes more complex as more people are joining.
Here are some observations to exemplify the current challenges:

- There is a high number of concurrent discussions about new features
or important refactorings.
- JIRA issues are being created for components to:
   - represent an implementation plan (e.g. of a FLIP)
   - track progress of a feature by splitting it into a finer granularity
   - coordinate work between contributors/contributor teams
- Lack of guidance for new contributors: contributors don't know which
issues to pick but are motivated to work on something.
- Contributors pick issues that:
   - require very good (historical) knowledge of a component
   - need to be implemented in a timely fashion as they block other
contributors or a Flink release
   - have implicit dependencies on other changes
- Contributors open pull requests with a bad description, without
consensus, or with an unsatisfactory architecture. Shortcomings that
could have been solved in JIRA before.
- Committers don't open issues because they fear that some "random"
contributor picks them up, or they assign many issues to themselves to
"protect" them, even though they don't have the capacity to solve all
of them.

I propose to make our JIRA a bit more restrictive:

- Don't allow contributors to assign issues to themselves. This forces
them to find supporters first. As mentioned in the contribution
guidelines [1]: "reach consensus with the community". Only committers
can assign people to issues.
- Don't allow contributors to set a fixed version or release notes.
Only committers should do that after merging the contribution.
- Don't allow contributors to set a blocker priority. The release
manager should decide about that.

As a nice side-effect, it might also impact the number of stale pull
requests by moving the consensus and design discussion to an earlier
phase in the process.

What do you think? Feel free to propose more workflow improvements. Of
course we need to check with INFRA if this can be represented in our
JIRA.

Thanks,
Timo

[1] https://flink.apache.org/contribute-code.html



--
Feng (Sent from my phone)







Re: [VOTE] Migrate to sponsored Travis account

2019-07-04 Thread Chesnay Schepler

Small update with mostly bad news:

INFRA doesn't know whether it is possible, and referred me to Travis 
support.
They did point out that it could be problematic in regards to read/write 
permissions for the repository.


From my own findings /so far/ with a test repo/organization, it does 
not appear possible to configure the Travis account used for a specific 
repository.


So yeah, if we go down this route we may have to pimp the Flinkbot to 
trigger builds through the Travis REST API.


On 04/07/2019 10:46, Chesnay Schepler wrote:
I've raised a JIRA 
<https://issues.apache.org/jira/browse/INFRA-18703>with INFRA to 
inquire whether it would be possible to switch to a different Travis 
account, and if so what steps would need to be taken.
We need a proper confirmation from INFRA since we are not in full 
control of the flink repository (for example, we cannot access the 
settings page).


If this is indeed possible, Ververica is willing to sponsor a Travis 
account for the Flink project.

This would provide us with more than enough resources for our needs.

Since this makes the project more reliant on resources provided by 
external companies I would like to vote on this.


Please vote on this proposal, as follows:
[ ] +1, Approve the migration to a Ververica-sponsored Travis account, 
provided that INFRA approves
[ ] -1, Do not approve the migration to a Ververica-sponsored Travis 
account


The vote will be open for at least 24h, and until we have confirmation 
from INFRA. The voting period may be shorter than the usual 3 days 
since our current CI is effectively not working.


On 04/07/2019 06:51, Bowen Li wrote:
Re: > Are they using their own Travis CI pool, or did the switch to 
an entirely different CI service?


I reached out to Wes and Krisztián from Apache Arrow PMC. They are 
currently moving away from ASF's Travis to their own in-house metal 
machines at [1] with custom CI application at [2]. They've seen 
significant improvement w.r.t both much higher performance and 
basically no resource waiting time, "night-and-day" difference 
quoting Wes.


Re: > If we can just switch to our own Travis pool, just for our 
project, then this might be something we can do fairly quickly?


I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/ <https://ci.ursalabs.org/#/>
[2] https://github.com/ursa-labs/ursabot
[3] 
https://docs.travis-ci.com/user/migrate/open-source-repository-migration

[4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com



On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <mailto:ches...@apache.org>> wrote:


    Are they using their own Travis CI pool, or did the switch to an
    entirely different CI service?

    If we can just switch to our own Travis pool, just for our
    project, then
    this might be something we can do fairly quickly?

    On 03/07/2019 05:55, Bowen Li wrote:
    > I responded in the INFRA ticket [1] that I believe they are
    using a wrong
    > metric against Flink and the total build time is a completely
    different
    > thing than guaranteed build capacity.
    >
    > My response:
    >
    > "As mentioned above, since I started to pay attention to Flink's
    build
    > queue a few tens of days ago, I'm in Seattle and I saw no build
    was kicking
    > off in PST daytime in weekdays for Flink. Our teammates in China
    and Europe
    > have also reported similar observations. So we need to evaluate
    how the
    > large total build time came from - if 1) your number and 2) our
    > observations from three locations that cover pretty much a full
    day, are
    > all true, I **guess** one reason can be that - highly likely the
    extra
    > build time came from weekends when other Apache projects may be
    idle and
    > Flink just drains hard its congested queue.
    >
    > Please be aware of that we're not complaining about the lack of
    resources
    > in general, I'm complaining about the lack of **stable, 
dedicated**

    > resources. An example for the latter one is, currently even if
    no build is
    > in Flink's queue and I submit a request to be the queue head in 
PST

    > morning, my build won't even start in 6-8+h. That is an absurd
    amount of
    > waiting time.
    >
    > That's saying, if ASF INFRA decides to adopt a quota system and
    grants
    > Flink five DEDICATED servers that runs all the time only for
    Flink, that'll
    > be PERFECT and can totally solve our problem now.
    >
    > Please be aware of that we're not complaining about the lack of
    resources
    > in general, I'm complaining about the lack of **stable, 
dedicated**

    > resources. An example for the latter one is, currently even if
    no build is
    > in Flink's queue and I submit a request 

Re: [DISCUSS] Flink framework and user log separation

2019-07-04 Thread Chesnay Schepler
From what I understand this isn't about logging Flink/user messages to 
different files, but log everything relevant to a specific job to a 
separate file (including what is being logged in runtime classes, i.e. 
Tasks, Operators etc.)


On 04/07/2019 12:37, Stephan Ewen wrote:

Is that something that can just be done by the right logging framework and
configuration?

Like having a log framework with two targets, one filtered on
"org.apache.flink" and the other one filtered on "my.company.project" or so?
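The two-target idea Stephan describes is straightforward in most logging frameworks. Flink itself configures log4j/logback, but purely as an illustration, here is the same prefix-filtered routing in Python's stdlib `logging` (the logger names and in-memory targets are made up; real deployments would use file appenders):

```python
import io
import logging

def prefix_handler(prefix: str, stream) -> logging.Handler:
    """A handler that accepts only records whose logger name starts with `prefix`."""
    handler = logging.StreamHandler(stream)
    handler.addFilter(lambda record: record.name.startswith(prefix))
    return handler

# Two targets: one for framework logs, one for user/business logs.
framework_out = io.StringIO()
business_out = io.StringIO()

root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(prefix_handler("org.apache.flink", framework_out))
root.addHandler(prefix_handler("my.company.project", business_out))

logging.getLogger("org.apache.flink.runtime.taskmanager.Task").info("task started")
logging.getLogger("my.company.project.Pipeline").info("record processed")
# framework_out now holds only the framework line, business_out only the user line.
```

The catch, as discussed below, is that this separates by *logger name*, which distinguishes framework code from user code but not one job from another.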

On Fri, Mar 1, 2019 at 3:44 AM vino yang  wrote:


Hi Jamie Grier,

Thank you for your reply, let me add some explanations to this design.

First of all, as stated in "Goal", it is mainly for the "Standalone"
cluster model, although we have implemented it for Flink on YARN, this does
not mean that we can't turn off this feature by means of options. It should
be noted that the separation is basically based on the "log configuration
file", it is very scalable and even allows users to define the log pattern
of the configuration file (of course this is an extension feature, not
mentioned in the design documentation). In fact, "multiple files are a
special case of a single file", we can provide an option to keep it still
the default behavior, it should be the scene you expect in the container.

According to Flink's official 2016 user survey report [1], users using the
standalone mode are quite close to the yarn mode (unfortunately there is no
data support in 2017). Although we mainly use Flink on Yarn now, we have
used standalone in depth (close to the daily processing volume of 20
trillion messages). In this scenario, the user logs generated by different
job's tasks are mixed together, and it is very difficult to locate the
issue. Moreover, as we configure the log file scrolling policy, we have to
log in to the server to view it. Therefore, we expect that for the same
task manager, the user logs generated by the tasks from the same job can be
distinguished.

In addition, I have tried MDC technology, but it can not achieve the goal.
The underlying Flink is log4j 1.x and logback. We need to be compatible
with both frameworks at the same time, and we don't allow large-scale
changes to the active code, and no sense to the user.

Some other points:

1) Many of our users have experience using Storm and Spark, and they are
more accustomed to that style in standalone mode;
2) We split the user log by Job, which will help to implement the "business
log aggregation" feature based on the Job.

Best,
Vino

[1]: https://www.ververica.com/blog/flink-user-survey-2016-part-1

Jamie Grier  于2019年3月1日周五 上午7:32写道:


I think maybe if I understood this correctly this design is going in the
wrong direction.  The problem with Flink logging, when you are running
multiple jobs in the same TMs, is not just about separating out the
business level logging into separate files.  The Flink framework itself
logs many things where there is clearly a single job in context but that
all ends up in the same log file and with no clear separation amongst the
log lines.

Also, I don't think shooting to have multiple log files is a very good
idea either.  It's common, especially on container-based deployments, that the
expectation is that a process (like Flink) logs everything to stdout and
the surrounding tooling takes care of routing that log data somewhere.  I
think we should stick with that model and expect that there will be a
single log stream coming out of each Flink process.

Instead, I think it would be better to enhance Flink's logging capability
such that the appropriate context can be added to each log line with the
exact format controlled by the end user.  It might make sense to take a
look at MDC, for example, as a way to approach this.
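For the curious: the MDC idea boils down to stamping each record with ambient context (such as a job id) that the pattern layout can then print. A toy analogue in Python's `logging` — not SLF4J's actual MDC API, just the shape of the technique, with invented names:

```python
import contextvars
import io
import logging

# Ambient "diagnostic context", analogous to MDC.put("job_id", ...).
current_job = contextvars.ContextVar("current_job", default="none")

class JobContextFilter(logging.Filter):
    """Stamp every record with the job id from the ambient context."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.job_id = current_job.get()
        return True

mdc_out = io.StringIO()
handler = logging.StreamHandler(mdc_out)
# The layout can reference the stamped field, like %X{job_id} in log4j patterns.
handler.setFormatter(logging.Formatter("[job=%(job_id)s] %(message)s"))
handler.addFilter(JobContextFilter())

task_log = logging.getLogger("org.apache.flink.runtime.taskmanager.Task")
task_log.setLevel(logging.INFO)
task_log.addHandler(handler)

current_job.set("wordcount-42")       # set once, e.g. when the task starts
task_log.info("checkpoint completed")
# mdc_out now contains: [job=wordcount-42] checkpoint completed
```

The appeal of this approach is that every log line from runtime classes carries the job context without the classes themselves knowing about it, so downstream tooling can split one stdout stream per job.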


On Thu, Feb 28, 2019 at 4:24 AM vino yang  wrote:


Dear devs,

Currently, for log output, Flink does not explicitly distinguish
between framework logs and user logs. In the Task Manager, logs from
the framework are intermixed with the user's business logs. In some
deployment models, such as Standalone or YARN session, there are task
instances of different jobs deployed in the same Task Manager. This
makes the log event flow more confusing unless the users explicitly
use tags to distinguish them, and it makes locating problems more
difficult and inefficient. For the YARN job cluster deployment model,
this problem is not as serious, but we still need to artificially
distinguish between the framework and the business log. Overall, we
found that Flink's existing log model has the following problems:

- Framework log and business log are mixed in the same log file. There
is no way to make a clear distinction, which is not conducive to
problem location and analysis;
- It is not conducive to the independent collection of business logs.

Therefore, we propose a mechanism to separate the framework and
business log. It can split existing log files for the Task Manager.
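The per-job split proposed here could, in principle, be a routing handler keyed on a job id, with one file per job and a default file for framework output. A toy sketch in Python's stdlib `logging` (buffers stand in for files; all names are invented, and Flink's real implementation would live in log4j/logback configuration):

```python
import io
import logging

class PerJobRouter(logging.Handler):
    """Route each record to a per-job buffer, keyed by a `job_id` attribute.

    Records without a job id fall back to the shared "taskmanager" target,
    i.e. the framework log.
    """
    def __init__(self):
        super().__init__()
        self.streams = {}   # job_id -> io.StringIO (one file per job in real life)

    def emit(self, record: logging.LogRecord) -> None:
        job = getattr(record, "job_id", None) or "taskmanager"
        self.streams.setdefault(job, io.StringIO()).write(self.format(record) + "\n")

tm_log = logging.getLogger("taskmanager-demo")
tm_log.setLevel(logging.INFO)
router = PerJobRouter()
router.setFormatter(logging.Formatter("%(message)s"))
tm_log.addHandler(router)

tm_log.info("slot allocated", extra={"job_id": "jobA"})
tm_log.info("slot allocated", extra={"job_id": "jobB"})
tm_log.info("heartbeat sent")   # no job context -> framework log
```

Keeping "multiple files as a generalization of a single file", as the design argues, then amounts to configuring the router with a single shared target.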

Re: [VOTE] Migrate to sponsored Travis account

2019-07-04 Thread Chesnay Schepler
Note that the Flinkbot approach isn't that trivial either; we can't 
_just_ trigger builds for a branch in the apache repo, but would first 
have to clone the branch/pr into a separate repository (that is owned by 
the github account that the travis account would be tied to).


One roadblock after the next showing up...

On 04/07/2019 11:59, Chesnay Schepler wrote:

Small update with mostly bad news:

INFRA doesn't know whether it is possible, and referred me to Travis 
support.
They did point out that it could be problematic in regards to 
read/write permissions for the repository.


From my own findings /so far/ with a test repo/organization, it does 
not appear possible to configure the Travis account used for a 
specific repository.


So yeah, if we go down this route we may have to pimp the Flinkbot to 
trigger builds through the Travis REST API.


On 04/07/2019 10:46, Chesnay Schepler wrote:
I've raised a JIRA 
<https://issues.apache.org/jira/browse/INFRA-18703>with INFRA to 
inquire whether it would be possible to switch to a different Travis 
account, and if so what steps would need to be taken.
We need a proper confirmation from INFRA since we are not in full 
control of the flink repository (for example, we cannot access the 
settings page).


If this is indeed possible, Ververica is willing to sponsor a Travis 
account for the Flink project.

This would provide us with more than enough resources for our needs.

Since this makes the project more reliant on resources provided by 
external companies I would like to vote on this.


Please vote on this proposal, as follows:
[ ] +1, Approve the migration to a Ververica-sponsored Travis 
account, provided that INFRA approves
[ ] -1, Do not approve the migration to a Ververica-sponsored Travis 
account


The vote will be open for at least 24h, and until we have 
confirmation from INFRA. The voting period may be shorter than the 
usual 3 days since our current CI is effectively not working.


On 04/07/2019 06:51, Bowen Li wrote:
Re: > Are they using their own Travis CI pool, or did the switch to 
an entirely different CI service?


I reached out to Wes and Krisztián from Apache Arrow PMC. They are 
currently moving away from ASF's Travis to their own in-house metal 
machines at [1] with custom CI application at [2]. They've seen 
significant improvement w.r.t both much higher performance and 
basically no resource waiting time, "night-and-day" difference 
quoting Wes.


Re: > If we can just switch to our own Travis pool, just for our 
project, then this might be something we can do fairly quickly?


I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/ <https://ci.ursalabs.org/#/>
[2] https://github.com/ursa-labs/ursabot
[3] 
https://docs.travis-ci.com/user/migrate/open-source-repository-migration 

[4] 
https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com




On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <mailto:ches...@apache.org>> wrote:


    Are they using their own Travis CI pool, or did the switch to an
    entirely different CI service?

    If we can just switch to our own Travis pool, just for our
    project, then
    this might be something we can do fairly quickly?

    On 03/07/2019 05:55, Bowen Li wrote:
    > I responded in the INFRA ticket [1] that I believe they are
    using a wrong
    > metric against Flink and the total build time is a completely
    different
    > thing than guaranteed build capacity.
    >
    > My response:
    >
    > "As mentioned above, since I started to pay attention to Flink's
    build
    > queue a few tens of days ago, I'm in Seattle and I saw no build
    was kicking
    > off in PST daytime in weekdays for Flink. Our teammates in China
    and Europe
    > have also reported similar observations. So we need to evaluate
    how the
    > large total build time came from - if 1) your number and 2) our
    > observations from three locations that cover pretty much a full
    day, are
    > all true, I **guess** one reason can be that - highly likely the
    extra
    > build time came from weekends when other Apache projects may be
    idle and
    > Flink just drains hard its congested queue.
    >
    > Please be aware of that we're not complaining about the lack of
    resources
    > in general, I'm complaining about the lack of **stable, 
dedicated**

    > resources. An example for the latter one is, currently even if
    no build is
    > in Flink's queue and I submit a request to be the queue head 
in PST

    > morning, my build won't even start in 6-8+h. That is an absurd
    amount of
    > waiting time.
    >
    > That's saying, if ASF INFRA decides to adopt a quota system and
    grants
    > Flink five DEDICATED servers that runs all the time only for
    Flink, that'll
  

Re: [VOTE] Migrate to sponsored Travis account

2019-07-05 Thread Chesnay Schepler
I have a prototype ready and will now commence a real world test. I will 
point it apache/flink and mirror it into a ververica controlled repo to 
start Travis runs.


Once the run is finished the bot will comment on the PR with the results.

This runs in addition to our existing CI.

On 04/07/2019 14:06, Chesnay Schepler wrote:
Note that the Flinkbot approach isn't that trivial either; we can't 
_just_ trigger builds for a branch in the apache repo, but would first 
have to clone the branch/pr into a separate repository (that is owned 
by the github account that the travis account would be tied to).


One roadblock after the next showing up...

On 04/07/2019 11:59, Chesnay Schepler wrote:

Small update with mostly bad news:

INFRA doesn't know whether it is possible, and referred me to Travis 
support.
They did point out that it could be problematic in regards to 
read/write permissions for the repository.


From my own findings /so far/ with a test repo/organization, it does 
not appear possible to configure the Travis account used for a 
specific repository.


So yeah, if we go down this route we may have to pimp the Flinkbot to 
trigger builds through the Travis REST API.


On 04/07/2019 10:46, Chesnay Schepler wrote:
I've raised a JIRA 
<https://issues.apache.org/jira/browse/INFRA-18703>with INFRA to 
inquire whether it would be possible to switch to a different Travis 
account, and if so what steps would need to be taken.
We need a proper confirmation from INFRA since we are not in full 
control of the flink repository (for example, we cannot access the 
settings page).


If this is indeed possible, Ververica is willing to sponsor a Travis 
account for the Flink project.

This would provide us with more than enough resources for our needs.

Since this makes the project more reliant on resources provided by 
external companies I would like to vote on this.


Please vote on this proposal, as follows:
[ ] +1, Approve the migration to a Ververica-sponsored Travis 
account, provided that INFRA approves
[ ] -1, Do not approve the migration to a Ververica-sponsored 
Travis account


The vote will be open for at least 24h, and until we have 
confirmation from INFRA. The voting period may be shorter than the 
usual 3 days since our current CI is effectively not working.


On 04/07/2019 06:51, Bowen Li wrote:
Re: > Are they using their own Travis CI pool, or did the switch to 
an entirely different CI service?


I reached out to Wes and Krisztián from Apache Arrow PMC. They are 
currently moving away from ASF's Travis to their own in-house metal 
machines at [1] with custom CI application at [2]. They've seen 
significant improvement w.r.t both much higher performance and 
basically no resource waiting time, "night-and-day" difference 
quoting Wes.


Re: > If we can just switch to our own Travis pool, just for our 
project, then this might be something we can do fairly quickly?


I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/ <https://ci.ursalabs.org/#/>
[2] https://github.com/ursa-labs/ursabot
[3] 
https://docs.travis-ci.com/user/migrate/open-source-repository-migration 

[4] 
https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com




On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler 
mailto:ches...@apache.org>> wrote:


    Are they using their own Travis CI pool, or did the switch to an
    entirely different CI service?

    If we can just switch to our own Travis pool, just for our
    project, then
    this might be something we can do fairly quickly?

    On 03/07/2019 05:55, Bowen Li wrote:
    > I responded in the INFRA ticket [1] that I believe they are
    using a wrong
    > metric against Flink and the total build time is a completely
    different
    > thing than guaranteed build capacity.
    >
    > My response:
    >
    > "As mentioned above, since I started to pay attention to Flink's
    build
    > queue a few tens of days ago, I'm in Seattle and I saw no build
    was kicking
    > off in PST daytime in weekdays for Flink. Our teammates in China
    and Europe
    > have also reported similar observations. So we need to evaluate
    how the
    > large total build time came from - if 1) your number and 2) our
    > observations from three locations that cover pretty much a full
    day, are
    > all true, I **guess** one reason can be that - highly likely the
    extra
    > build time came from weekends when other Apache projects may be
    idle and
    > Flink just drains hard its congested queue.
    >
    > Please be aware of that we're not complaining about the lack of
    resources
    > in general, I'm complaining about the lack of **stable, 
dedicated**

    > resources. An example for the latter one is, currently even if
    no build is
    > in Flink's queue and I submit a request to be the qu

[RESULT][VOTE] Migrate to sponsored Travis account

2019-07-07 Thread Chesnay Schepler
The vote has passed unanimously in favor of migrating to a separate 
Travis account.


I will now set things up such that PullRequests are no longer run on 
the ASF servers.

This is a major step in reducing our usage of ASF resources.
For the time being we'll use the free Travis plan for flink-ci (i.e. 5 
workers, which is the same as the ASF gives us). Over the course of the 
next week we'll set up the Ververica subscription to increase this limit.


From now on, a bot will mirror all new and updated PullRequests to a 
mirror repository (https://github.com/flink-ci/flink-ci) and write an 
update into the PR once the build is complete.
I have run the bots for the past 3 days in parallel to our existing 
Travis and it was working without major issues.


The biggest change that contributors will see is that there's no longer 
an icon next to each commit. We may revisit this in the future.


I'll set up a repo with the source of the bot later.

On 04/07/2019 10:46, Chesnay Schepler wrote:
I've raised a JIRA 
<https://issues.apache.org/jira/browse/INFRA-18703>with INFRA to 
inquire whether it would be possible to switch to a different Travis 
account, and if so what steps would need to be taken.
We need a proper confirmation from INFRA since we are not in full 
control of the flink repository (for example, we cannot access the 
settings page).


If this is indeed possible, Ververica is willing to sponsor a Travis 
account for the Flink project.

This would provide us with more than enough resources for our needs.

Since this makes the project more reliant on resources provided by 
external companies I would like to vote on this.


Please vote on this proposal, as follows:
[ ] +1, Approve the migration to a Ververica-sponsored Travis account, 
provided that INFRA approves
[ ] -1, Do not approve the migration to a Ververica-sponsored Travis 
account


The vote will be open for at least 24h, and until we have confirmation 
from INFRA. The voting period may be shorter than the usual 3 days 
since our current CI is effectively not working.


On 04/07/2019 06:51, Bowen Li wrote:
Re: > Are they using their own Travis CI pool, or did the switch to 
an entirely different CI service?


I reached out to Wes and Krisztián from Apache Arrow PMC. They are 
currently moving away from ASF's Travis to their own in-house metal 
machines at [1] with custom CI application at [2]. They've seen 
significant improvement w.r.t both much higher performance and 
basically no resource waiting time, "night-and-day" difference 
quoting Wes.


Re: > If we can just switch to our own Travis pool, just for our 
project, then this might be something we can do fairly quickly?


I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/ <https://ci.ursalabs.org/#/>
[2] https://github.com/ursa-labs/ursabot
[3] 
https://docs.travis-ci.com/user/migrate/open-source-repository-migration

[4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com



On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <mailto:ches...@apache.org>> wrote:


    Are they using their own Travis CI pool, or did the switch to an
    entirely different CI service?

    If we can just switch to our own Travis pool, just for our
    project, then
    this might be something we can do fairly quickly?

    On 03/07/2019 05:55, Bowen Li wrote:
    > I responded in the INFRA ticket [1] that I believe they are
    using a wrong
    > metric against Flink and the total build time is a completely
    different
    > thing than guaranteed build capacity.
    >
    > My response:
    >
    > "As mentioned above, since I started to pay attention to Flink's
    build
    > queue a few tens of days ago, I'm in Seattle and I saw no build
    was kicking
    > off in PST daytime in weekdays for Flink. Our teammates in China
    and Europe
    > have also reported similar observations. So we need to evaluate
    how the
    > large total build time came from - if 1) your number and 2) our
    > observations from three locations that cover pretty much a full
    day, are
    > all true, I **guess** one reason can be that - highly likely the
    extra
    > build time came from weekends when other Apache projects may be
    idle and
    > Flink just drains hard its congested queue.
    >
    > Please be aware of that we're not complaining about the lack of
    resources
    > in general, I'm complaining about the lack of **stable, 
dedicated**

    > resources. An example for the latter one is, currently even if
    no build is
    > in Flink's queue and I submit a request to be the queue head in 
PST

    > morning, my build won't even start in 6-8+h. That is an absurd
    amount of
    > waiting time.
    >
    > That's saying, if ASF INFRA decides to adopt a quota system and
    grants
   

Re: [RESULT][VOTE] Migrate to sponsored Travis account

2019-07-08 Thread Chesnay Schepler
Yes we can do that; for the time being you can add an empty commit to 
re-trigger the CI.
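In concrete terms, `git commit --allow-empty` creates a commit with no file changes, and pushing it gives CI a new head commit to build. Demoed here in a throwaway local repository (on a real PR you would run only the second commit, in your PR branch, followed by a push):

```shell
set -e
# Throwaway demo: an empty commit advances HEAD without touching any file,
# which is all Travis needs to see a "new" push and start a fresh build.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "Empty commit to re-trigger CI"
git -C "$repo" rev-list --count HEAD
# On a real PR branch you would follow with: git push origin <branch>
```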



On 08/07/2019 03:49, Congxian Qiu wrote:

As we used flink bot to trigger the CI test, could we add a command for
flink bot to retrigger the CI(sometimes we may encounter some flaky tests)

Best,
Congxian


Chesnay Schepler  于2019年7月8日周一 上午5:01写道:


The vote has passed unanimously in favor of migrating to a separate
Travis account.

I will now set things up such that PullRequests are no longer run on
the ASF servers.
This is a major step in reducing our usage of ASF resources.
For the time being we'll use the free Travis plan for flink-ci (i.e. 5
workers, which is the same as the ASF gives us). Over the course of the
next week we'll set up the Ververica subscription to increase this limit.

  From now on, a bot will mirror all new and updated PullRequests to a
mirror repository (https://github.com/flink-ci/flink-ci) and write an
update into the PR once the build is complete.
I have run the bots for the past 3 days in parallel to our existing
Travis and it was working without major issues.

The biggest change that contributors will see is that there's no longer
an icon next to each commit. We may revisit this in the future.

I'll set up a repo with the source of the bot later.

On 04/07/2019 10:46, Chesnay Schepler wrote:

I've raised a JIRA
<https://issues.apache.org/jira/browse/INFRA-18703>with INFRA to
inquire whether it would be possible to switch to a different Travis
account, and if so what steps would need to be taken.
We need a proper confirmation from INFRA since we are not in full
control of the flink repository (for example, we cannot access the
settings page).

If this is indeed possible, Ververica is willing to sponsor a Travis
account for the Flink project.
This would provide us with more than enough resources for our needs.

Since this makes the project more reliant on resources provided by
external companies I would like to vote on this.

Please vote on this proposal, as follows:
[ ] +1, Approve the migration to a Ververica-sponsored Travis account,
provided that INFRA approves
[ ] -1, Do not approve the migration to a Ververica-sponsored Travis
account

The vote will be open for at least 24h, and until we have confirmation
from INFRA. The voting period may be shorter than the usual 3 days
since our current CI is effectively not working.

On 04/07/2019 06:51, Bowen Li wrote:

Re: > Are they using their own Travis CI pool, or did the switch to
an entirely different CI service?

I reached out to Wes and Krisztián from Apache Arrow PMC. They are
currently moving away from ASF's Travis to their own in-house metal
machines at [1] with custom CI application at [2]. They've seen
significant improvement w.r.t both much higher performance and
basically no resource waiting time, "night-and-day" difference
quoting Wes.

Re: > If we can just switch to our own Travis pool, just for our
project, then this might be something we can do fairly quickly?

I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/ <https://ci.ursalabs.org/#/>
[2] https://github.com/ursa-labs/ursabot
[3]


https://docs.travis-ci.com/user/migrate/open-source-repository-migration

[4]

https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com



On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler mailto:ches...@apache.org>> wrote:

 Are they using their own Travis CI pool, or did the switch to an
 entirely different CI service?

 If we can just switch to our own Travis pool, just for our
 project, then
 this might be something we can do fairly quickly?

 On 03/07/2019 05:55, Bowen Li wrote:
 > I responded in the INFRA ticket [1] that I believe they are
 using a wrong
 > metric against Flink and the total build time is a completely
 different
 > thing than guaranteed build capacity.
 >
 > My response:
 >
 > "As mentioned above, since I started to pay attention to Flink's
 build
 > queue a few tens of days ago, I'm in Seattle and I saw no build
 was kicking
 > off in PST daytime in weekdays for Flink. Our teammates in China
 and Europe
 > have also reported similar observations. So we need to evaluate
 how the
 > large total build time came from - if 1) your number and 2) our
 > observations from three locations that cover pretty much a full
 day, are
 > all true, I **guess** one reason can be that - highly likely the
 extra
 > build time came from weekends when other Apache projects may be
 idle and
 > Flink just drains hard its congested queue.
 >
 > Please be aware of that we're not complaining about the lack of
 resources
 > in general, I'm complaining about the lack of **stable,
dedicated**
 > resources. An example for the latter one is, curre

Re: [RESULT][VOTE] Migrate to sponsored Travis account

2019-07-08 Thread Chesnay Schepler
I have temporarily re-enabled running PR builds on the ASF account; 
migrating to the Travis subscription caused some issues in the bot that 
I have to fix first.


On 07/07/2019 23:01, Chesnay Schepler wrote:
The vote has passed unanimously in favor of migrating to a separate 
Travis account.


I will now set things up such that PullRequests are no longer run on 
the ASF servers.

This is a major step in reducing our usage of ASF resources.
For the time being we'll use the free Travis plan for flink-ci (i.e. 5 
workers, which is the same as the ASF gives us). Over the course of the 
next week we'll set up the Ververica subscription to increase this limit.


From now on, a bot will mirror all new and updated PullRequests to a 
mirror repository (https://github.com/flink-ci/flink-ci) and write an 
update into the PR once the build is complete.
I have run the bots for the past 3 days in parallel to our existing 
Travis and it was working without major issues.


The biggest change that contributors will see is that there's no 
longer an icon next to each commit. We may revisit this in the future.


I'll set up a repo with the source of the bot later.

On 04/07/2019 10:46, Chesnay Schepler wrote:
I've raised a JIRA 
<https://issues.apache.org/jira/browse/INFRA-18703>with INFRA to 
inquire whether it would be possible to switch to a different Travis 
account, and if so what steps would need to be taken.
We need a proper confirmation from INFRA since we are not in full 
control of the flink repository (for example, we cannot access the 
settings page).


If this is indeed possible, Ververica is willing to sponsor a Travis 
account for the Flink project.

This would provide us with more than enough resources for our needs.

Since this makes the project more reliant on resources provided by 
external companies I would like to vote on this.


Please vote on this proposal, as follows:
[ ] +1, Approve the migration to a Ververica-sponsored Travis 
account, provided that INFRA approves
[ ] -1, Do not approve the migration to a Ververica-sponsored Travis 
account


The vote will be open for at least 24h, and until we have 
confirmation from INFRA. The voting period may be shorter than the 
usual 3 days since our current setup is effectively not working.


On 04/07/2019 06:51, Bowen Li wrote:
Re: > Are they using their own Travis CI pool, or did they switch to 
an entirely different CI service?


I reached out to Wes and Krisztián from Apache Arrow PMC. They are 
currently moving away from ASF's Travis to their own in-house metal 
machines at [1] with custom CI application at [2]. They've seen 
significant improvements w.r.t. both much higher performance and 
basically no resource waiting time, a "night-and-day" difference, 
quoting Wes.


Re: > If we can just switch to our own Travis pool, just for our 
project, then this might be something we can do fairly quickly?


I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/ <https://ci.ursalabs.org/#/>
[2] https://github.com/ursa-labs/ursabot
[3] 
https://docs.travis-ci.com/user/migrate/open-source-repository-migration 

[4] 
https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com




On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <ches...@apache.org> wrote:


    Are they using their own Travis CI pool, or did they switch to an
    entirely different CI service?

    If we can just switch to our own Travis pool, just for our
    project, then
    this might be something we can do fairly quickly?

    On 03/07/2019 05:55, Bowen Li wrote:
    > I responded in the INFRA ticket [1] that I believe they are
    using a wrong
    > metric against Flink and the total build time is a completely
    different
    > thing than guaranteed build capacity.
    >
    > My response:
    >
    > "As mentioned above, since I started to pay attention to Flink's
    build
    > queue a few tens of days ago, I'm in Seattle and I saw no build
    was kicking
    > off in PST daytime in weekdays for Flink. Our teammates in China
    and Europe
    > have also reported similar observations. So we need to evaluate
    where the
    > large total build time came from - if 1) your number and 2) our
    > observations from three locations that cover pretty much a full
    day, are
    > all true, I **guess** one reason can be that - highly likely the
    extra
    > build time came from weekends when other Apache projects may be
    idle and
    > Flink just drains hard its congested queue.
    >
    > Please be aware that we're not complaining about the lack of
    resources
    > in general, I'm complaining about the lack of **stable, 
dedicated**

    > resources. An example for the latter one is, currently even if
    no build is
    > in Flink's queue and I submit a request to be the queue head 
in PST

    > mor

Re: [RESULT][VOTE] Migrate to sponsored Travis account

2019-07-08 Thread Chesnay Schepler
The kinks have been worked out; the bot is running again and PR builds 
are once again not running on ASF resources.


PRs are mirrored to: https://github.com/flink-ci/flink
Bot source: https://github.com/flink-ci/ci-bot

On 08/07/2019 17:14, Chesnay Schepler wrote:
I have temporarily re-enabled running PR builds on the ASF account; 
migrating to the Travis subscription caused some issues in the bot 
that I have to fix first.


On 07/07/2019 23:01, Chesnay Schepler wrote:
The vote has passed unanimously in favor of migrating to a separate 
Travis account.


I will now set things up such that PullRequests are no longer run on 
the ASF servers.

This is a major step in reducing our usage of ASF resources.
For the time being we'll use the free Travis plan for flink-ci (i.e. 5 
workers, which is the same number the ASF gives us). Over the course of the 
next week we'll set up the Ververica subscription to increase this limit.


From now on, a bot will mirror all new and updated PullRequests to a 
mirror repository (https://github.com/flink-ci/flink-ci) and write an 
update into the PR once the build is complete.
I have run the bot for the past 3 days in parallel to our existing 
Travis setup and it has been working without major issues.


The biggest change that contributors will see is that there's no 
longer an icon next to each commit. We may revisit this in the future.


I'll set up a repo with the source of the bot later.

On 04/07/2019 10:46, Chesnay Schepler wrote:
I've raised a JIRA 
<https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to 
inquire whether it would be possible to switch to a different Travis 
account, and if so what steps would need to be taken.
We need a proper confirmation from INFRA since we are not in full 
control of the flink repository (for example, we cannot access the 
settings page).


If this is indeed possible, Ververica is willing to sponsor a Travis 
account for the Flink project.

This would provide us with more than enough resources for our needs.

Since this makes the project more reliant on resources provided by 
external companies I would like to vote on this.


Please vote on this proposal, as follows:
[ ] +1, Approve the migration to a Ververica-sponsored Travis 
account, provided that INFRA approves
[ ] -1, Do not approve the migration to a Ververica-sponsored 
Travis account


The vote will be open for at least 24h, and until we have 
confirmation from INFRA. The voting period may be shorter than the 
usual 3 days since our current setup is effectively not working.


On 04/07/2019 06:51, Bowen Li wrote:
Re: > Are they using their own Travis CI pool, or did they switch to 
an entirely different CI service?


I reached out to Wes and Krisztián from Apache Arrow PMC. They are 
currently moving away from ASF's Travis to their own in-house metal 
machines at [1] with custom CI application at [2]. They've seen 
significant improvements w.r.t. both much higher performance and 
basically no resource waiting time, a "night-and-day" difference, 
quoting Wes.


Re: > If we can just switch to our own Travis pool, just for our 
project, then this might be something we can do fairly quickly?


I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/ <https://ci.ursalabs.org/#/>
[2] https://github.com/ursa-labs/ursabot
[3] 
https://docs.travis-ci.com/user/migrate/open-source-repository-migration 

[4] 
https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com




On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler 
<ches...@apache.org> wrote:


    Are they using their own Travis CI pool, or did they switch to an
    entirely different CI service?

    If we can just switch to our own Travis pool, just for our
    project, then
    this might be something we can do fairly quickly?

    On 03/07/2019 05:55, Bowen Li wrote:
    > I responded in the INFRA ticket [1] that I believe they are
    using a wrong
    > metric against Flink and the total build time is a completely
    different
    > thing than guaranteed build capacity.
    >
    > My response:
    >
    > "As mentioned above, since I started to pay attention to Flink's
    build
    > queue a few tens of days ago, I'm in Seattle and I saw no build
    was kicking
    > off in PST daytime in weekdays for Flink. Our teammates in China
    and Europe
    > have also reported similar observations. So we need to evaluate
    where the
    > large total build time came from - if 1) your number and 2) our
    > observations from three locations that cover pretty much a full
    day, are
    > all true, I **guess** one reason can be that - highly likely the
    extra
    > build time came from weekends when other Apache projects may be
    idle and
    > Flink just drains hard its congested queue.
    >
    > Please be aware that we're not complaining about the lack of
    resour

Re: Flink 1.8.1 release tag missing?

2019-07-09 Thread Chesnay Schepler
Yes, it appears the 1.8.1 tag is missing. The 1.8.1-rc1 tag is an 
equivalent that you can use for the time being.


Alternatively you could of course also use the source release to build 
the connector.


@jincheng Could you add the proper 1.8.1 tag?

On 09/07/2019 16:38, Bekir Oguz wrote:

Hi,
I would like to build the 1.8.1 version of the flink-connector-kinesis module 
but cannot find the release tag in GitHub repo.
I see release candidate 1 (release-1.8.1-rc1) tag, but not sure whether this 
consists of all the 40 bug fixes in 1.8.1 or not.

Which hash or tag should I use to release the flink-connector-kinesis module?

Regards,
Bekir Oguz





Re: [RESULT][VOTE] Migrate to sponsored Travis account

2019-07-10 Thread Chesnay Schepler
Your best bet would be to check the first commit in the PR and look at 
its parent commit.


To re-run things, you will have to rebase the PR on the latest master.
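In git terms, the check described above can be sketched as follows. This is a simulation on a throwaway local repository (the branch names are illustrative), not project tooling:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the initial branch name

commit() { git -c user.name=a -c user.email=a@example.com \
               commit -q --allow-empty -m "$1"; }

commit "master-1"
git checkout -qb pr-branch                # simulate a PR branch
commit "pr-commit"
git checkout -q master
commit "master-2"                         # master has moved on since

# First commit unique to the PR branch...
first=$(git rev-list master..pr-branch | tail -n 1)
# ...its parent is the master commit the CI run was actually based on.
git rev-parse "${first}^"

# To re-run against the newest master, rebase the PR branch on it:
git checkout -q pr-branch
git rebase -q master
```

Comparing the printed parent hash against `git rev-parse master` tells you whether the CI run was based on the current master or an older one.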

On 10/07/2019 03:32, Kurt Young wrote:

Thanks for all your efforts Chesnay, it indeed improves a lot for our
develop experience. BTW, do you know how to find the master branch
information which the CI runs with?

For example, like this one:
https://travis-ci.com/flink-ci/flink/jobs/214542568
It shows a pass for the commits, which are rebased on master when the CI
is triggered. But the master branch the CI ran on could either be the
same as or different from the current master. If it's the same, I can simply rely
on the passed information to push commits, but if it's not, I think I should find
another
way to re-trigger tests based on the newest master.

Do you know where can I get such information?

Best,
Kurt


On Tue, Jul 9, 2019 at 3:27 AM Chesnay Schepler  wrote:


The kinks have been worked out; the bot is running again and PR builds
are once again not running on ASF resources.

PRs are mirrored to: https://github.com/flink-ci/flink
Bot source: https://github.com/flink-ci/ci-bot

On 08/07/2019 17:14, Chesnay Schepler wrote:

I have temporarily re-enabled running PR builds on the ASF account;
migrating to the Travis subscription caused some issues in the bot
that I have to fix first.

On 07/07/2019 23:01, Chesnay Schepler wrote:

The vote has passed unanimously in favor of migrating to a separate
Travis account.

I will now set things up such that PullRequests are no longer run on
the ASF servers.
This is a major step in reducing our usage of ASF resources.
For the time being we'll use the free Travis plan for flink-ci (i.e. 5
workers, which is the same number the ASF gives us). Over the course of the
next week we'll set up the Ververica subscription to increase this limit.

 From now on, a bot will mirror all new and updated PullRequests to a
mirror repository (https://github.com/flink-ci/flink-ci) and write an
update into the PR once the build is complete.
I have run the bot for the past 3 days in parallel to our existing
Travis setup and it has been working without major issues.

The biggest change that contributors will see is that there's no
longer an icon next to each commit. We may revisit this in the future.

I'll set up a repo with the source of the bot later.

On 04/07/2019 10:46, Chesnay Schepler wrote:

I've raised a JIRA
<https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to
inquire whether it would be possible to switch to a different Travis
account, and if so what steps would need to be taken.
We need a proper confirmation from INFRA since we are not in full
control of the flink repository (for example, we cannot access the
settings page).

If this is indeed possible, Ververica is willing to sponsor a Travis
account for the Flink project.
This would provide us with more than enough resources for our needs.

Since this makes the project more reliant on resources provided by
external companies I would like to vote on this.

Please vote on this proposal, as follows:
[ ] +1, Approve the migration to a Ververica-sponsored Travis
account, provided that INFRA approves
[ ] -1, Do not approve the migration to a Ververica-sponsored
Travis account

The vote will be open for at least 24h, and until we have
confirmation from INFRA. The voting period may be shorter than the
usual 3 days since our current setup is effectively not working.

On 04/07/2019 06:51, Bowen Li wrote:

Re: > Are they using their own Travis CI pool, or did they switch to
an entirely different CI service?

I reached out to Wes and Krisztián from Apache Arrow PMC. They are
currently moving away from ASF's Travis to their own in-house metal
machines at [1] with custom CI application at [2]. They've seen
significant improvements w.r.t. both much higher performance and
basically no resource waiting time, a "night-and-day" difference,
quoting Wes.

Re: > If we can just switch to our own Travis pool, just for our
project, then this might be something we can do fairly quickly?

I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/ <https://ci.ursalabs.org/#/>
[2] https://github.com/ursa-labs/ursabot
[3]


https://docs.travis-ci.com/user/migrate/open-source-repository-migration

[4]
https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com



On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler
<ches...@apache.org> wrote:

 Are they using their own Travis CI pool, or did they switch to an
 entirely different CI service?

 If we can just switch to our own Travis pool, just for our
 project, then
 this might be something we can do fairly quickly?

 On 03/07/2019 05:55, Bowen Li wrote:
 > I responded in the INFRA ticket [1] that I believe they are
 using a wrong
 > metric against Flink and the total build time is a completely
 different

Re: [ANNOUNCE] Feature freeze for Apache Flink 1.9.0 release

2019-07-11 Thread Chesnay Schepler
Can we get JIRAs attached to these items so people out of the loop can 
track the progress?


On 05/07/2019 16:06, Kurt Young wrote:

Here are the features I collected which are under active development and
close
to merge:

1. Bridge blink planner to unified table environment and remove TableConfig
from blink planner
2. Support timestamp with local time zone and partition pruning in blink
planner
3. Support JDBC & HBase lookup function and upsert sink
4. StreamExecutionEnvironment supports executing job with StreamGraph, and
blink planner should set proper properties to StreamGraph
5. Set resource profiles to task and enable managed memory as resource
profile

Best,
Kurt


On Fri, Jul 5, 2019 at 9:37 PM Kurt Young  wrote:


Hi devs,

It's July 5 now and we should announce feature freeze and cut the branch
as planned. However, some components seems still not ready yet and
various features are still under development or review.

But we also cannot extend the freeze date again, which would further delay
the
release date. I think freezing new features today and having another couple
of buffer days, letting features which are almost ready have a chance to
get in, is a reasonable solution.

I hereby announce that the features of Flink 1.9.0 are frozen; *July 11* will be
the
day for cutting the branch. Since the feature freeze has effectively taken
place,
I kindly ask committers to refrain from merging features that are planned
for
future releases into the master branch for the time being before the 1.9
branch
is cut. We understand this might be a bit inconvenient, thanks for the
cooperation here.

Best,
Kurt


On Fri, Jul 5, 2019 at 5:19 PM 罗齐  wrote:


Hi Gordon,

Will branch 1.9 be cut out today? We're really looking forward to the
blink features in 1.9.

Thanks,
Qi

On Wed, Jun 26, 2019 at 7:18 PM Tzu-Li (Gordon) Tai 
wrote:


Thanks for the updates so far everyone!

Since support for the new Blink-based Table / SQL runner and fine-grained
recovery are quite prominent features for 1.9.0,
and developers involved in these topics have already expressed that these
could make good use for another week,
I think it definitely makes sense to postpone the feature freeze.

The new date for feature freeze and feature branch cut for 1.9.0 will be
*July
5*.

Please update on this thread if there are any further concerns!

Cheers,
Gordon

On Tue, Jun 25, 2019 at 9:05 PM Chesnay Schepler 
wrote:


On the fine-grained recovery / batch scheduling side we could make good
use of another week.
Currently we are on track to have the _feature_ merged, but without
having done a great deal of end-to-end testing.

On 25/06/2019 15:01, Kurt Young wrote:

Hi Aljoscha,

I also feel an additional week can make the remaining work

easier. At

least
we don't have to check in lots of commits in both branches (master &
release-1.9).

Best,
Kurt


On Tue, Jun 25, 2019 at 8:27 PM Aljoscha Krettek <

aljos...@apache.org>

wrote:


A few threads are converging around supporting the new Blink-based

Table

API Runner/Planner. I think hitting the currently proposed feature

freeze

date is hard, if not impossible, and that the work would benefit

from an

additional week to get everything in with good quality.

What do the others involved in the topic think?

Aljoscha


On 24. Jun 2019, at 19:42, Bowen Li  wrote:

Hi Gordon,

Thanks for driving this effort.

Xuefu responded to the discussion thread [1] and I want to bring

that

to

our attention here:

Hive integration depends on a few features that are actively

developed.

If

the completion of those features don't leave enough time for us to
integrate, then our work can potentially go beyond the proposed

date.

Just wanted to point out such a dependency adds uncertainty.

[1]


http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Features-for-Apache-Flink-1-9-0-td28701.html

On Thu, Jun 20, 2019 at 1:05 AM Tzu-Li (Gordon) Tai <

tzuli...@apache.org

wrote:


Hi devs,

Per the feature discussions for 1.9.0 [1], I hereby announce the

official

feature freeze for Flink 1.9.0 to be on June 28. A release feature

branch

for 1.9 will be cut following that date.

We’re roughly one week away from this date, but please keep in

mind

that we

still shouldn’t rush things. If you feel that there may be

problems

with

this schedule for the things you are working on, please let us

know

here.

Cheers,
Gordon

[1]



http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Features-for-Apache-Flink-1-9-0-td28701.html






Re: [DISCUSS] Flink project bylaws

2019-07-11 Thread Chesnay Schepler

The emeritus stuff seems like unnecessary noise.

There's a bunch of subtle changes in the draft compared to existing 
"conventions"; we should find a way to highlight these and discuss them 
one by one.


On 11/07/2019 14:29, Robert Metzger wrote:

Thank you Becket for kicking off this discussion and creating a draft in
the Wiki.

I left some comments in the wiki.

In my understanding this means, that a committer always needs a review and

+1 from another committer. As far as I know this is currently not always
the case (often committer authors, contributor reviews & +1s).


I would agree to add such a bylaw, if we had cases in the past where code
was not sufficiently reviewed AND we believe that we have enough capacity
to ensure a separate committer's approval.





On Thu, Jul 11, 2019 at 9:49 AM Konstantin Knauf 
wrote:


Hi all,

thanks a lot for driving this, Becket. I have two remarks regarding the
"Actions" section:

* In addition to a simple "Code Change" we could also add a row for "Code
Change requiring a FLIP" with a reference to the FLIP process page. A FLIP
will have/does have different rules for approvals, etc.
* For "Code Change" the draft currently requires "one +1 from a committer
who has not authored the patch followed by a Lazy approval (not counting
the vote of the contributor), moving to lazy majority if a -1 is received".
In my understanding this means, that a committer always needs a review and
+1 from another committer. As far as I know this is currently not always
the case (often committer authors, contributor reviews & +1s).

I think it is worth thinking about how we can make it easy to follow the
bylaws e.g. by having more Flink-specific Jira workflows and ticket types +
corresponding permissions. While this is certainly "Step 2", I believe, we
really need to make it as easy & transparent as possible, otherwise they
will be unintentionally broken.

Cheers and thanks,

Konstantin



On Thu, Jul 11, 2019 at 9:10 AM Becket Qin  wrote:


Hi all,

As it was raised in the FLIP process discussion thread [1], currently

Flink

does not have official bylaws to govern the operation of the project.

Such

bylaws are critical for the community to coordinate and contribute
together. It is also the basis of other processes such as FLIP.

I have drafted a Flink bylaws page and would like to start a discussion
thread on this.


https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026

The bylaws will affect everyone in the community. It'll be great to hear
your thoughts.

Thanks,

Jiangjie (Becket) Qin

[1]



http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-META-FLIP-Sticking-or-not-to-a-strict-FLIP-voting-process-td29978.html#none

--

Konstantin Knauf | Solutions Architect

+49 160 91394525


Planned Absences: 10.08.2019 - 31.08.2019, 05.09. - 06.09.2010


--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen





Re: [DISCUSS] Flink project bylaws

2019-07-12 Thread Chesnay Schepler
 here.

Ha, you got me on this. The first version of the draft was almost identical
to Kafka's. But Robert has already caught a few inconsistent places. So it
might still worth going through it to make sure we truly agree on them.
Otherwise we may end up modifying them shortly after adoption.


Thanks again folks, for all the valuable feedback. These are great
discussion.

Jiangjie (Becket) Qin

On Thu, Jul 11, 2019 at 9:55 PM Aljoscha Krettek 
wrote:


Big +1

How different is this from the Kafka bylaws? I’m asking because I quite
like them and wouldn’t mind essentially adopting the Kafka bylaws. I

mean,

it’s open source, and we don’t have to try to re-invent the wheel here.

I think it’s worthwhile to discuss the “committer +1” requirement. We
don’t usually have that now but I would actually be in favour of

requiring

it, although it might make stuff more complicated.

Aljoscha


On 11. Jul 2019, at 15:31, Till Rohrmann  wrote:

Thanks a lot for creating this draft Becket.

I think without the notion of emeritus (or active vs. inactive), it

won't

be possible to have a 2/3 majority vote because we already have too

many

inactive PMCs/committers.

For the case of a committer being the author and getting a +1 from a
non-committer: I think a committer should know when to ask another
committer for feedback or not. Hence, I would not enforce that we

strictly

need a +1 from a committer if the author is a committer but of course
encourage it if capacities exist.

Cheers,
Till

On Thu, Jul 11, 2019 at 3:08 PM Chesnay Schepler 

wrote:

The emeritus stuff seems like unnecessary noise.

There's a bunch of subtle changes in the draft compared to existing
"conventions"; we should find a way to highlight these and discuss

them

one by one.

On 11/07/2019 14:29, Robert Metzger wrote:

Thank you Becket for kicking off this discussion and creating a draft

in

the Wiki.

I left some comments in the wiki.

In my understanding this means, that a committer always needs a

review

and

+1 from another committer. As far as I know this is currently not

always

the case (often committer authors, contributor reviews & +1s).

I would agree to add such a bylaw, if we had cases in the past where

code

was not sufficiently reviewed AND we believe that we have enough

capacity

to ensure a separate committer's approval.





On Thu, Jul 11, 2019 at 9:49 AM Konstantin Knauf <

konstan...@ververica.com>

wrote:


Hi all,

thanks a lot for driving this, Becket. I have two remarks regarding

the

"Actions" section:

* In addition to a simple "Code Change" we could also add a row for

"Code

Change requiring a FLIP" with a reference to the FLIP process page.

A

FLIP

will have/does have different rules for approvals, etc.
* For "Code Change" the draft currently requires "one +1 from a

committer

who has not authored the patch followed by a Lazy approval (not

counting

the vote of the contributor), moving to lazy majority if a -1 is

received".

In my understanding this means, that a committer always needs a

review

and

+1 from another committer. As far as I know this is currently not

always

the case (often committer authors, contributor reviews & +1s).

I think it is worth thinking about how we can make it easy to follow

the

bylaws e.g. by having more Flink-specific Jira workflows and ticket

types +

corresponding permissions. While this is certainly "Step 2", I

believe,

we

really need to make it as easy & transparent as possible, otherwise

they

will be unintentionally broken.

Cheers and thanks,

Konstantin



On Thu, Jul 11, 2019 at 9:10 AM Becket Qin 

wrote:

Hi all,

As it was raised in the FLIP process discussion thread [1],

currently

Flink

does not have official bylaws to govern the operation of the

project.

Such

bylaws are critical for the community to coordinate and contribute
together. It is also the basis of other processes such as FLIP.

I have drafted a Flink bylaws page and would like to start a

discussion

thread on this.


https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026

The bylaws will affect everyone in the community. It'll be great to

hear

your thoughts.

Thanks,

Jiangjie (Becket) Qin

[1]



http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-META-FLIP-Sticking-or-not-to-a-strict-FLIP-voting-process-td29978.html#none

--

Konstantin Knauf | Solutions Architect

+49 160 91394525


Planned Absences: 10.08.2019 - 31.08.2019, 05.09. - 06.09.2010


--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen









Re: [DISCUSS] Create "flink-playgrounds" repository

2019-07-12 Thread Chesnay Schepler
Wouldn't this qualify for releasing snapshot artifacts to users? (Which, 
you know, shouldn't be done?)


On 12/07/2019 11:55, Robert Metzger wrote:

I will request the repo now, so that you can continue working on the
documentation (thanks for that again :) )


I actually like Xuefu's idea of making an archive available.
The good thing is that we can get this from any GitHub hosted repository.
For example for Flink, this link lets you download an archive of Flink's
latest master: https://github.com/apache/flink/archive/master.zip
We would not need to set up any additional automation for this.



On Fri, Jul 12, 2019 at 11:51 AM Konstantin Knauf 
wrote:


Hi everyone,

thanks everyone for your remarks and questions! We have three +1s, so I
think we can proceed with this.

@Robert: Could you create the request to the INFRA?

Thanks,

Konstantin

On Fri, Jul 12, 2019 at 10:16 AM Stephan Ewen  wrote:


I am fine with a separate repository, was just raising the other option

as

a question.

+1 to go ahead

On Fri, Jul 12, 2019 at 9:49 AM Konstantin Knauf <

konstan...@ververica.com

wrote:


Hi Xuefu,

thanks for having a look at this. I am sure this playground setup will

need

to be maintained and will go through revisions, too. So, we would still
need to keep the content of the archive in some repository + the

additional

piece of automation to update the archive, when the documentation is

built.

To me this seems to be more overhead than a repository.

Best,

Konstantin


On Thu, Jul 11, 2019 at 9:00 PM Xuefu Z  wrote:


The idea seems interesting, but I'm wondering if we have considered
publishing a .tar.gz file hosted somewhere on the Flink site with a link in the

doc.

This might avoid the "overkill" of introducing a repo, which is mainly

used

for version control in development cycles. On the other hand, a

docker

setup, once published, will seldom (if ever) go through revisions.

Thanks,
Xuefu



On Thu, Jul 11, 2019 at 6:58 AM Konstantin Knauf <

konstan...@ververica.com

wrote:


Hi Stephan,

putting it under "flink-quickstarts" alone would not help. The user

would

still need to check out the whole `apache/flink` repository, which

is a

bit

overwhelming. The Java/Scala quickstarts use Maven archetypes. Is

this

what

you are suggesting? I think, this would be an option, but it seems

strange

to manage a pure Docker setup (eventually maybe only one file) in a

Maven

project.

Best,

Konstantin

On Thu, Jul 11, 2019 at 3:52 PM Stephan Ewen 

wrote:

Hi all!

I am fine with a separate repository.

Quick question. though: Have you considered putting the setup not

under

"docs" but under "flink-quickstart" or so?
Would that be equally cumbersome for users?

Best,
Stephan


On Thu, Jul 11, 2019 at 12:19 PM Fabian Hueske <

fhue...@gmail.com>

wrote:

Hi,

I think Quickstart should be as lightweight as possible and

follow

common

practices.
A Git repository for a few configuration files might feel like

overkill,

but IMO it makes sense because this ensures users can get

started

with

3

commands:

$ git clone .../flink-playground
$ cd flink-playground
$ docker-compose up -d
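For a sense of scale, the playground configuration being discussed can be as small as a single compose file. The sketch below is illustrative only — the image tag, port, and the JOB_MANAGER_RPC_ADDRESS variable follow the conventions of the official flink Docker images of that era, and it is not the actual playground content:

```yaml
version: "2.1"
services:
  jobmanager:
    image: flink:1.9            # assumed release tag
    command: jobmanager
    ports:
      - "8081:8081"             # Flink web UI
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  taskmanager:
    image: flink:1.9
    command: taskmanager
    depends_on:
      - jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
```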

So +1 to create a repository.

Thanks, Fabian

Am Do., 11. Juli 2019 um 12:07 Uhr schrieb Robert Metzger <
rmetz...@apache.org>:


+1 to create a repo.

On Thu, Jul 11, 2019 at 11:10 AM Konstantin Knauf <
konstan...@ververica.com>
wrote:


Hi everyone,

in the course of implementing FLIP-42 we are currently

reworking

the

Getting Started section of our documentation. As part of

this,

we

are

adding docker-compose-based playgrounds to get started with

Flink

operations and Flink SQL quickly.

To reduce as much friction as possible for new users, we

would

like

to

maintain the required configuration files

(docker-compose.yaml,

flink-conf.yaml) in a separate new repository,

`apache/flink-playgrounds`.

You can find more details and a brief discussion on this in

the

corresponding Jira ticket [2].

What do you think?

I am not sure what kind of approval is required for such a

change.

So,

my

suggestion would be that if we have lazy majority within the

next

24

hours

to

create the repository, we proceed. Please let me know, if

this

requires a

more formal approval.

Best and thanks,

Konstantin

[1]



https://cwiki.apache.org/confluence/display/FLINK/FLIP-42%3A+Rework+Flink+Documentation

[2] https://issues.apache.org/jira/browse/FLINK-12749


--

Konstantin Knauf | Solutions Architect

+49 160 91394525


Planned Absences: 10.08.2019 - 31.08.2019, 05.09. -

06.09.2010


--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin,

Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen



--

Konstantin Knauf | Solutions Architect

+49 160 91394525


Planned Absences: 10.08.2019 - 31.08.2019, 05.09. - 06.09.2010


--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg

Re: [DISCUSS] Create "flink-playgrounds" repository

2019-07-12 Thread Chesnay Schepler
The last time this came up was about our download page which contained 
snapshot links, with a big warning that these are for dev purposes, and 
we had to take that down. Back then the conclusion was that snapshot 
artifacts must only be linked on pages intended for developers, and must 
not be visible on any page that one would direct users to.


So I'm not quite convinced that this would fly.

Given that we're intending to offer files that assemble docker images (I 
guess?) I personally believe that these should go through a formal vote 
process; for licensing alone we have to check that users aren't being 
given dependencies with surprising restrictions.


On a side note, any extra link is kinda unnecessary since you can get a 
zip of that through the GitHub UI. (go to repo main page -> Clone or 
download -> Download Zip)
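As a side note on the mechanism: GitHub serves a source archive of any branch or tag at a fixed URL pattern (`/archive/<ref>.zip`), which is the same link Robert mentions further down for Flink's master branch. A minimal sketch of building such links — the helper name is invented for illustration:

```python
def github_archive_url(owner: str, repo: str, ref: str = "master") -> str:
    """Build the GitHub source-archive URL for a repository ref.

    GitHub serves a zip of any branch or tag at /archive/<ref>.zip,
    so no extra automation is needed to offer a downloadable snapshot.
    """
    return f"https://github.com/{owner}/{repo}/archive/{ref}.zip"


# The archive of Flink's latest master, as referenced in this thread:
print(github_archive_url("apache", "flink"))
# -> https://github.com/apache/flink/archive/master.zip
```

The same pattern would cover a hypothetical `apache/flink-playgrounds` repository without any release tooling.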


On 12/07/2019 12:21, Robert Metzger wrote:
That's a good point. We should point readers in the documentation to 
the repository first, and then write "for convenience, you can also 
download a snapshot of the repository here" AND put a disclaimer on 
the page that this archive is not an official product released by the 
Flink PMC.


Since this is not on the official download page, and clearly in the 
context of a "playground" or "demonstration", people will not assume a 
proper release.


Do you think that is okay, or should we reach out to somebody at the 
foundation?




On Fri, Jul 12, 2019 at 12:09 PM Chesnay Schepler <mailto:ches...@apache.org>> wrote:


Wouldn't this qualify for releasing snapshot artifacts to users?
(Which,
you know, shouldn't be done?)

On 12/07/2019 11:55, Robert Metzger wrote:
> I will request the repo now, so that you can continue working on the
> documentation (thanks for that again :) )
>
>
> I actually like Xuefu's idea of making an archive available.
> The good thing is that we can get this from any GitHub hosted
repository.
> For example for Flink, this link lets you download an archive
of Flink's
> latest master: https://github.com/apache/flink/archive/master.zip
> We would not need to set up any additional automation for this.
>
>
>
> On Fri, Jul 12, 2019 at 11:51 AM Konstantin Knauf
mailto:konstan...@ververica.com>>
> wrote:
>
>> Hi everyone,
>>
>> thanks everyone for your remarks and questions! We have three
+1s, so I
>> think, we can proceed with this.
>>
>> @Robert: Could you create the request to the INFRA?
>>
>> Thanks,
>>
>> Konstantin
>>
>> On Fri, Jul 12, 2019 at 10:16 AM Stephan Ewen mailto:se...@apache.org>> wrote:
>>
>>> I am fine with a separate repository, was just raising the
other option
>> as
>>> a question.
>>>
>>> +1 to go ahead
>>>
>>> On Fri, Jul 12, 2019 at 9:49 AM Konstantin Knauf <
>> konstan...@ververica.com <mailto:konstan...@ververica.com>
>>> wrote:
>>>
>>>> Hi Xuefu,
>>>>
>>>> thanks for having a look at this. I am sure this playground
setup will
>>> need
>>>> to be maintained and will go through revisions, too. So, we
would still
>>>> need to keep the content of the archive in some repository + the
>>> additional
>>>> piece of automation to update the archive, when the
documentation is
>>> build.
>>>> To me this seems to be more overhead than a repository.
>>>>
>>>> Best,
>>>>
>>>> Konstantin
>>>>
>>>>
>>>> On Thu, Jul 11, 2019 at 9:00 PM Xuefu Z mailto:usxu...@gmail.com>> wrote:
>>>>
>>>>> The idea seems interesting, but I'm wondering if we have
considered
>>>>> publishing .tz file hosted somewhere in Flink site with a
link in the
>>>> doc.
>>>>> This might avoid the "overkill" of introducing a repo, which
is main
>>> used
>>>>> for version control in development cycles. On the other hand, a
>> docker
>>>>> setup, once published, will seldom (if ever) go thru revisions.
>>>>>
>>>>> Thanks,
>>>>> Xuefu
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 11, 2019 at 6:58 AM Konstantin Knauf <
>>>> konstan...@ververica.com <mailto:konst

CiBot Update

2019-07-12 Thread Chesnay Schepler

Hello all,

on Thursday i pushed an update to the CiBot so that it

 * only maintains a single comment, updating it for each new build
 * also links in-progress/queued builds, instead of just finished ones.

The update also included a bug that caused the bot to not recognize 
which commits had been verified before, which led to a sharp increase 
in queue times as it repeatedly scheduled builds for the same commit. 
This issue has been fixed, all redundant builds have been removed from 
the queue, and all comments have been updated to point to previously 
completed builds.


I apologize for the inconvenience.
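For illustration, the fixed behaviour — scheduling at most one build per commit — can be modelled as a small sketch. This is a hypothetical simplification, not the actual CiBot code; the class and method names are invented:

```python
class CiBot:
    """Toy model of the scheduling fix: at most one build per commit SHA."""

    def __init__(self):
        self.verified = set()  # SHAs a build has already been scheduled for
        self.queue = []        # pending build requests, oldest first

    def on_pull_request_update(self, head_sha):
        """Schedule a build for head_sha unless it was seen before.

        Returns True if a new build was queued. The bug described above
        was effectively this membership check being broken, so the same
        SHA was re-queued on every poll, flooding the build queue.
        """
        if head_sha in self.verified:
            return False
        self.verified.add(head_sha)
        self.queue.append(head_sha)
        return True


bot = CiBot()
bot.on_pull_request_update("d1aa3f2")  # queued
bot.on_pull_request_update("d1aa3f2")  # ignored: already verified
```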



Re: CiBot Update

2019-07-12 Thread Chesnay Schepler

Yes.

On 13/07/2019 01:56, Bowen Li wrote:

   * only maintains a single comment, updating it for each new build
   * also links in-progress/queued builds, instead of just finished ones.

Want to clarify that the above changes still hold?



On Fri, Jul 12, 2019 at 3:56 PM Chesnay Schepler  wrote:


Hello all,

on Thursday i pushed an update to the CiBot so that it

   * only maintains a single comment, updating it for each new build
   * also links in-progress/queued builds, instead of just finished ones.

The update also included a bug that causes the bot to not recognize
which commits have been verified before, which lead to a sharp increase
in queue times as it repeatedly scheduled builds for the same commit.
This issue has been fixed, all redundant builds have been removed from
the queue and all comments have been updated to point to previously
completed builds.

I apologize for the inconvenience.






Re: CiBot Update

2019-07-15 Thread Chesnay Schepler

The comment was updated in the mean-time.

On 13/07/2019 19:47, Bowen Li wrote:

Thanks Chesnay for the update.

A new issue I found is that our bot doesn't seem to update the final CI
status back to GitHub.

E.g. in [1], the CI Report shows "d1aa3f2 : PENDING Build" at the moment,
but the Travis build actually passed successfully 14 hours ago [2].

[1] https://github.com/apache/flink/pull/8920#issuecomment-510405859
[2] https://travis-ci.com/flink-ci/flink/builds/119001147
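The stale entry quoted here ("d1aa3f2 : PENDING Build") hints at how the single-comment report is structured: one status line per commit, which must be rewritten once the final build state arrives. A hypothetical sketch of that rendering — function names invented, not the real CiBot implementation:

```python
def status_line(sha, state):
    """Render one CI report entry, e.g. 'd1aa3f2 : PENDING Build'."""
    return "{} : {} Build".format(sha[:7], state.upper())


def render_report(results):
    """Render the whole single-comment CI report, one line per commit."""
    return "\n".join(status_line(sha, state) for sha, state in results.items())


print(render_report({"d1aa3f2abc123": "pending"}))
# -> d1aa3f2 : PENDING Build
```

The bug Bowen describes would correspond to `results` never being refreshed with the final Travis state before re-rendering.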



On Fri, Jul 12, 2019 at 11:00 PM Chesnay Schepler 
wrote:


Yes.

On 13/07/2019 01:56, Bowen Li wrote:

* only maintains a single comment, updating it for each new build
* also links in-progress/queued builds, instead of just finished ones.

Want to clarify that the above changes still hold?



On Fri, Jul 12, 2019 at 3:56 PM Chesnay Schepler 

wrote:

Hello all,

on Thursday i pushed an update to the CiBot so that it

* only maintains a single comment, updating it for each new build
* also links in-progress/queued builds, instead of just finished

ones.

The update also included a bug that causes the bot to not recognize
which commits have been verified before, which lead to a sharp increase
in queue times as it repeatedly scheduled builds for the same commit.
This issue has been fixed, all redundant builds have been removed from
the queue and all comments have been updated to point to previously
completed builds.

I apologize for the inconvenience.








Re: Flink benchmark Jenkins is broken due to missing 1.10 snapshots

2019-07-15 Thread Chesnay Schepler
It is documented in the release guide that a new Jenkins deployment must 
be set up when creating a new release branch; the guide even contains 
step-by-step instructions for doing so.


@Kurt @Gordon please fix this

On 12/07/2019 20:10, Yu Li wrote:

Hi All,

I just found our flink benchmark Jenkins build [1] is broken with below
error:

*[ERROR] Failed to execute goal on project flink-hackathon-benchmarks:
Could not resolve dependencies for project
org.apache.flink.benchmark:flink-hackathon-benchmarks:jar:0.1: Failure to
find org.apache.flink:flink-tests_2.11:jar:tests:1.10-SNAPSHOT in
https://repository.apache.org/content/repositories/snapshots/
*

Which is due to the branching of 1.9 has updated our flink project version
to 1.10 while still no 1.10 snapshot deployed yet. I tried to help deploy
but it turned out with no access. Could anyone with the privilege help
deploy the snapshot for the new version? Thanks.

What's more, no blame intended, but to prevent such an issue from happening
again, should we document somewhere that a snapshot deploy is necessary when
branching new releases and updating the snapshot version?

I've also opened an issue in our flink-benchmarks project [2].

Thanks.

[1] http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/
[2] https://github.com/dataArtisans/flink-benchmarks/issues/28

Best Regards,
Yu





Re: Flink benchmark Jenkins is broken due to missing 1.10 snapshots

2019-07-15 Thread Chesnay Schepler

Please also setup the 1.9 travis cron branch.

On 15/07/2019 09:46, Chesnay Schepler wrote:
It is documented in the release guide that a new jenkins deployment 
must be setup when creating a new release branch, and even contains 
step-by-step instructions for doing so.


@Kurt @Gordon please fix this

On 12/07/2019 20:10, Yu Li wrote:

Hi All,

I just found our flink benchmark Jenkins build [1] is broken with below
error:

*[ERROR] Failed to execute goal on project flink-hackathon-benchmarks:
Could not resolve dependencies for project
org.apache.flink.benchmark:flink-hackathon-benchmarks:jar:0.1: 
Failure to

find org.apache.flink:flink-tests_2.11:jar:tests:1.10-SNAPSHOT in
https://repository.apache.org/content/repositories/snapshots/
<https://repository.apache.org/content/repositories/snapshots/>*

Which is due to the branching of 1.9 has updated our flink project 
version
to 1.10 while still no 1.10 snapshot deployed yet. I tried to help 
deploy

but it turned out with no access. Could anyone with the privilege help
deploy the snapshot for the new version? Thanks.

What's more, no blame but to prevent such issue happen again, should we
document it somewhere that a deploy of snapshot is necessary when 
branching

new releases and update the snapshot version?

I've also opened an issue in our flink-benchmarks project [2].

Thanks.

[1] http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/
[2] https://github.com/dataArtisans/flink-benchmarks/issues/28

Best Regards,
Yu








Re: Flink benchmark Jenkins is broken due to missing 1.10 snapshots

2019-07-15 Thread Chesnay Schepler

It will likely take about a day for the 1.10 snapshots to be released.

On 15/07/2019 12:41, Yu Li wrote:

Thanks for the follow up Chesnay, Gordon and Kurt. It seems the
flink-tests_2.11 snapshot [1] is not deployed yet thus flink-benchmark
build [2] hasn't recovered, will watch for the next round and report back
if fixed.

[1]
https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-tests_2.11/
[2] http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/

Best Regards,
Yu


On Mon, 15 Jul 2019 at 18:25, Kurt Young  wrote:


Sorry about that and thanks Gordon for fixing this!

Best,
Kurt


On Mon, Jul 15, 2019 at 5:43 PM Tzu-Li (Gordon) Tai 
wrote:


Done.

Thanks for the reminder and help with the Jenkins deployment setup!

Cheers,
Gordon

On Mon, Jul 15, 2019 at 3:54 PM Chesnay Schepler 
wrote:


Please also setup the 1.9 travis cron branch.

On 15/07/2019 09:46, Chesnay Schepler wrote:

It is documented in the release guide that a new jenkins deployment
must be setup when creating a new release branch, and even contains
step-by-step instructions for doing so.

@Kurt @Gordon please fix this

On 12/07/2019 20:10, Yu Li wrote:

Hi All,

I just found our flink benchmark Jenkins build [1] is broken with

below

error:

*[ERROR] Failed to execute goal on project flink-hackathon-benchmarks:
Could not resolve dependencies for project
org.apache.flink.benchmark:flink-hackathon-benchmarks:jar:0.1:
Failure to
find org.apache.flink:flink-tests_2.11:jar:tests:1.10-SNAPSHOT in
https://repository.apache.org/content/repositories/snapshots/
<https://repository.apache.org/content/repositories/snapshots/>*

Which is due to the branching of 1.9 has updated our flink project
version
to 1.10 while still no 1.10 snapshot deployed yet. I tried to help
deploy
but it turned out with no access. Could anyone with the privilege help
deploy the snapshot for the new version? Thanks.

What's more, no blame but to prevent such issue happen again, should

we

document it somewhere that a deploy of snapshot is necessary when
branching
new releases and update the snapshot version?

I've also opened an issue in our flink-benchmarks project [2].

Thanks.

[1] http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/
[2] https://github.com/dataArtisans/flink-benchmarks/issues/28

Best Regards,
Yu









Re: Flink benchmark Jenkins is broken due to missing 1.10 snapshots

2019-07-16 Thread Chesnay Schepler

Done.

On 16/07/2019 12:47, Haibo Sun wrote:


 `flink-tests_2.11:jar:tests:1.10-SNAPSHOT` was not deployed because 
the JIRA (https://issues.apache.org/jira/browse/FLINK-12602) changed 
the `artifactId` of `flink-tests` from 
`flink-tests_${scala.binary.version}` to `flink-tests`.


I have created a PR 
(https://github.com/dataArtisans/flink-benchmarks/pull/29) to revise 
pom.xml of `flink-benchmarks` accordingly, and need someone to help 
merge it. @Chesnay Schepler


Best,
Haibo
At 2019-07-15 19:01:57, "Yu Li"  wrote:
>Thanks for the note Chesnay, will wait and report back then.
>
>Best Regards,
>Yu
>
>
>On Mon, 15 Jul 2019 at 18:51, Chesnay Schepler  wrote:
>
>> It will likely take about a day for the 1.10 snapshots to be released.
>>
>> On 15/07/2019 12:41, Yu Li wrote:
>> > Thanks for the follow up Chesnay, Gordon and Kurt. It seems the
>> > flink-tests_2.11 snapshot [1] is not deployed yet thus flink-benchmark
>> > build [2] hasn't recovered, will watch for the next round and report back
>> > if fixed.
>> >
>> > [1]
>> >
>> 
https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-tests_2.11/
>> > [2] http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/
>> >
>> > Best Regards,
>> > Yu
>> >
>> >
>> > On Mon, 15 Jul 2019 at 18:25, Kurt Young  wrote:
>> >
>> >> Sorry about that and thanks Gordon for fixing this!
>> >>
>> >> Best,
>> >> Kurt
>> >>
>> >>
>> >> On Mon, Jul 15, 2019 at 5:43 PM Tzu-Li (Gordon) Tai <
>> tzuli...@apache.org>
>> >> wrote:
>> >>
>> >>> Done.
>> >>>
>> >>> Thanks for the reminder and help with the Jenkins deployment setup!
>> >>>
>> >>> Cheers,
>> >>> Gordon
>> >>>
>> >>> On Mon, Jul 15, 2019 at 3:54 PM Chesnay Schepler 
>> >>> wrote:
>> >>>
>> >>>> Please also setup the 1.9 travis cron branch.
>> >>>>
>> >>>> On 15/07/2019 09:46, Chesnay Schepler wrote:
>> >>>>> It is documented in the release guide that a new jenkins deployment
>> >>>>> must be setup when creating a new release branch, and even contains
>> >>>>> step-by-step instructions for doing so.
>> >>>>>
>> >>>>> @Kurt @Gordon please fix this
>> >>>>>
>> >>>>> On 12/07/2019 20:10, Yu Li wrote:
>> >>>>>> Hi All,
>> >>>>>>
>> >>>>>> I just found our flink benchmark Jenkins build [1] is broken with
>> >>>> below
>> >>>>>> error:
>> >>>>>>
>> >>>>>> *[ERROR] Failed to execute goal on project
>> flink-hackathon-benchmarks:
>> >>>>>> Could not resolve dependencies for project
>> >>>>>> org.apache.flink.benchmark:flink-hackathon-benchmarks:jar:0.1:
>> >>>>>> Failure to
>> >>>>>> find org.apache.flink:flink-tests_2.11:jar:tests:1.10-SNAPSHOT in
>> >>>>>> https://repository.apache.org/content/repositories/snapshots/
>> >>>>>> <https://repository.apache.org/content/repositories/snapshots/>*
>> >>>>>>
>> >>>>>> Which is due to the branching of 1.9 has updated our flink project
>> >>>>>> version
>> >>>>>> to 1.10 while still no 1.10 snapshot deployed yet. I tried to help
>> >>>>>> deploy
>> >>>>>> but it turned out with no access. Could anyone with the privilege
>> help
>> >>>>>> deploy the snapshot for the new version? Thanks.
>> >>>>>>
>> >>>>>> What's more, no blame but to prevent such issue happen again, should
>> >>>> we
>> >>>>>> document it somewhere that a deploy of snapshot is necessary when
>> >>>>>> branching
>> >>>>>> new releases and update the snapshot version?
>> >>>>>>
>> >>>>>> I've also opened an issue in our flink-benchmarks project [2].
>> >>>>>>
>> >>>>>> Thanks.
>> >>>>>>
>> >>>>>> [1] http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/
>> >>>>>> [2] https://github.com/dataArtisans/flink-benchmarks/issues/28
>> >>>>>>
>> >>>>>> Best Regards,
>> >>>>>> Yu
>> >>>>>>
>> >>>>>
>> >>>>
>>
>>





Re: [DISCUSS] A more restrictive JIRA workflow

2019-07-18 Thread Chesnay Schepler

Sounds good to me.

On 18/07/2019 12:07, Robert Metzger wrote:

Infra has finally changed the permissions. I just announced the change in a
separate email [1].

One thing I wanted to discuss here is, how do we want to handle all the
"contributor permissions" requests?

My proposal is to basically reject them with a nice message, pointing them
to my announcement.

What do you think?



[1]
https://lists.apache.org/thread.html/4ed570c7110b7b55b5c3bd52bb61ff35d5bda88f47939d8e7f1844c4@%3Cdev.flink.apache.org%3E


On Thu, Jul 4, 2019 at 1:21 PM Robert Metzger  wrote:


This is the Jira ticket I opened
https://issues.apache.org/jira/browse/INFRA-18644 a long time ago :)

On Thu, Jul 4, 2019 at 10:47 AM Chesnay Schepler 
wrote:


@Robert what's the state here?

On 24/06/2019 16:16, Robert Metzger wrote:

Hey all,

I would like to drive this discussion to an end soon.
I've just merged the updated contribution guide to the Flink website:
https://flink.apache.org/contributing/contribute-code.html

I will now ask Apache INFRA to change the permissions in our Jira.

Here's the updated TODO list:

1. I update the contribution guide DONE
2. Update Flinkbot to close invalid PRs, and show warnings on PRs with
unassigned JIRAs IN PROGRESS
3. We ask Infra to change the permissions of our JIRA so that: IN

PROGRESS

a) only committers can assign users to tickets
b) contributors can't assign users to tickets
c) Every registered JIRA user is an assignable user in FLINK





On Fri, May 24, 2019 at 9:18 AM Robert Metzger 

wrote:

Hey,

I started working on step 1 and proposed some changes to the Flink
website: https://github.com/apache/flink-web/pull/217



On Tue, Apr 30, 2019 at 4:08 PM Robert Metzger 
wrote:


Hi Fabian,
You are right, I made a mistake. I don't think it makes sense to
introduce a new permission class. This will make the life of JIRA

admins

unnecessarily complicated.
I updated the task list:

1. I update the contribution guide
2. Update Flinkbot to close invalid PRs, and show warnings on PRs with
unassigned JIRAs
3. We ask Infra to change the permissions of our JIRA so that:
a) only committers can assign users to tickets
b) contributors can't assign users to tickets
c) Every registered JIRA user is an assignable user in FLINK
4. We remove all existing contributors


On Tue, Apr 30, 2019 at 12:00 PM Fabian Hueske 

wrote:

Hi Robert,

If I understood the decision correctly, we also need to ask Infra to

make

everybody an assignable user, right?
Or do we want to add a new permission class "Assignable User" such

that

everyone still needs to ask for the right Jira permissions?

Fabian


Am Di., 30. Apr. 2019 um 10:46 Uhr schrieb Timo Walther <
twal...@apache.org

:
Hi Robert,

thanks for taking care of this. +1 to your suggested steps.

Regards,
Timo


Am 30.04.19 um 10:42 schrieb Robert Metzger:

@Stephan: I agree. Auto-closing PRs is quite aggressive.
I will only do that for PRs without JIRA ID or "[hotfix]" in the

title.

We can always revisit this at a later stage.


I'm proposing the following steps:

1. I update the contribution guide
2. Update Flinkbot to close invalid PRs, and show warnings on PRs

with

unassigned JIRAs
3. We ask Infra to change the permissions of our JIRA so that:
 a) only committers can assign users to tickets
 b) contributors can't assign users to tickets
4. We remove all existing contributors




On Wed, Apr 24, 2019 at 11:17 AM vino yang 

wrote:

also +1 for option 2.

I think auto-closing a PR is sometimes a bit impertinent.
The reasons are just as Stephan mentioned.

Stephan Ewen  于2019年4月24日周三 下午4:08写道:


About auto-closing PRs from unassigned issues, consider the

following

case

that has happened quite a bit.

 - a user reports a small bug and immediately wants to

provide a

fix

for

it
 - it makes sense to not stall the user for a few days until a

committer

assigned the issue
 - not auto-closing the PR would at least allow the user to

provide

their

patch.

On Wed, Apr 24, 2019 at 10:00 AM Stephan Ewen 

wrote:

+1 for option #2

Seems to me that this does not contradict option #1, it only

extends

this

a bit. I think there is a good case for that, to help frequent

contributors

on a way to committership.

@Konstantin: Trivial fixes (typos, docs, javadocs, ...) should

still

be

possible as "hotfixes".

On Mon, Apr 15, 2019 at 3:14 PM Timo Walther <

twal...@apache.org>

wrote:

I think this really depends on the contribution.

Sometimes "triviality" means that people just want to fix a

typo

in

some

docs. For this, a hotfix PR is sufficient and does not need a

JIRA

issue.

However, sometimes "triviality" is only trivial at first glance

but

introduces side effects. In any case, any contribution needs to

be

reviewed and merged by a committer so follow-up responses and

follow-up

work might always b

Re: [DISCUSS] A more restrictive JIRA workflow

2019-07-18 Thread Chesnay Schepler

Do our contribution guidelines contain anything that should be updated?

On 18/07/2019 12:24, Chesnay Schepler wrote:

Sounds good to me.

On 18/07/2019 12:07, Robert Metzger wrote:
Infra has finally changed the permissions. I just announced the 
change in a

separate email [1].

One thing I wanted to discuss here is, how do we want to handle all the
"contributor permissions" requests?

My proposal is to basically reject them with a nice message, pointing 
them

to my announcement.

What do you think?



[1]
https://lists.apache.org/thread.html/4ed570c7110b7b55b5c3bd52bb61ff35d5bda88f47939d8e7f1844c4@%3Cdev.flink.apache.org%3E 




On Thu, Jul 4, 2019 at 1:21 PM Robert Metzger  
wrote:



This is the Jira ticket I opened
https://issues.apache.org/jira/browse/INFRA-18644 a long time ago :)

On Thu, Jul 4, 2019 at 10:47 AM Chesnay Schepler 
wrote:


@Robert what's the state here?

On 24/06/2019 16:16, Robert Metzger wrote:

Hey all,

I would like to drive this discussion to an end soon.
I've just merged the updated contribution guide to the Flink website:
https://flink.apache.org/contributing/contribute-code.html

I will now ask Apache INFRA to change the permissions in our Jira.

Here's the updated TODO list:

1. I update the contribution guide DONE
2. Update Flinkbot to close invalid PRs, and show warnings on PRs 
with

unassigned JIRAs IN PROGRESS
3. We ask Infra to change the permissions of our JIRA so that: IN

PROGRESS

    a) only committers can assign users to tickets
    b) contributors can't assign users to tickets
    c) Every registered JIRA user is an assignable user in FLINK





On Fri, May 24, 2019 at 9:18 AM Robert Metzger 

wrote:

Hey,

I started working on step 1 and proposed some changes to the Flink
website: https://github.com/apache/flink-web/pull/217



On Tue, Apr 30, 2019 at 4:08 PM Robert Metzger 
wrote:


Hi Fabian,
You are right, I made a mistake. I don't think it makes sense to
introduce a new permission class. This will make the life of JIRA

admins

unnecessarily complicated.
I updated the task list:

1. I update the contribution guide
2. Update Flinkbot to close invalid PRs, and show warnings on 
PRs with

unassigned JIRAs
3. We ask Infra to change the permissions of our JIRA so that:
    a) only committers can assign users to tickets
    b) contributors can't assign users to tickets
    c) Every registered JIRA user is an assignable user in FLINK
4. We remove all existing contributors


On Tue, Apr 30, 2019 at 12:00 PM Fabian Hueske 

wrote:

Hi Robert,

If I understood the decision correctly, we also need to ask 
Infra to

make

everybody an assignable user, right?
Or do we want to add a new permission class "Assignable User" such

that

everyone still needs to ask for the right Jira permissions?

Fabian


Am Di., 30. Apr. 2019 um 10:46 Uhr schrieb Timo Walther <
twal...@apache.org

:
Hi Robert,

thanks for taking care of this. +1 to your suggested steps.

Regards,
Timo


Am 30.04.19 um 10:42 schrieb Robert Metzger:

@Stephan: I agree. Auto-closing PRs is quite aggressive.
I will only do that for PRs without JIRA ID or "[hotfix]" in the

title.

We can always revisit this at a later stage.


I'm proposing the following steps:

1. I update the contribution guide
2. Update Flinkbot to close invalid PRs, and show warnings on 
PRs

with

unassigned JIRAs
3. We ask Infra to change the permissions of our JIRA so that:
 a) only committers can assign users to tickets
 b) contributors can't assign users to tickets
4. We remove all existing contributors




On Wed, Apr 24, 2019 at 11:17 AM vino yang 


wrote:

also +1 for option 2.

I think auto-closing a PR is sometimes a bit impertinent.
The reasons are just as Stephan mentioned.

Stephan Ewen  于2019年4月24日周三 
下午4:08写道:



About auto-closing PRs from unassigned issues, consider the

following

case

that has happened quite a bit.

 - a user reports a small bug and immediately wants to

provide a

fix

for

it
 - it makes sense to not stall the user for a few days 
until a

committer

assigned the issue
 - not auto-closing the PR would at least allow the 
user to

provide

their

patch.

On Wed, Apr 24, 2019 at 10:00 AM Stephan Ewen 


wrote:

+1 for option #2

Seems to me that this does not contradict option #1, it only

extends

this
a bit. I think there is a good case for that, to help 
frequent

contributors

on a way to committership.

@Konstantin: Trivial fixes (typos, docs, javadocs, ...) 
should

still

be

possible as "hotfixes".

On Mon, Apr 15, 2019 at 3:14 PM Timo Walther <

twal...@apache.org>

wrote:

I think this really depends on the contribution.

Sometimes "triviality" means that people just want to fix a

typo

in

some
docs. For this, a hotfix PR is sufficient and does not 
need a

JIRA

issue.
However, sometimes "triviality" is only trivial at first 
glance

but
introduces side effect

Re: [DISCUSS] A more restrictive JIRA workflow

2019-07-18 Thread Chesnay Schepler
We haven't wiped the set of contributors yet. Not sure if there's an 
easy way to remove the permissions for all of them; someone from the PMC 
may have to bite the bullet and click 600 times in a row :)


On 18/07/2019 12:32, Zili Chen wrote:

Robert,

Thanks for your effort. Rejecting contributor permission request
with a nice message and pointing them to the announcement sounds
reasonable. Just to be clear, we now have no person with contributor
role, right?

Chesnay,

https://flink.apache.org/contributing/contribute-code.html has been
updated and mentions that "Only committers can assign a Jira ticket."

I think the corresponding update has been done.

Best,
tison.


Chesnay Schepler  于2019年7月18日周四 下午6:25写道:


Do our contribution guidelines contain anything that should be updated?

On 18/07/2019 12:24, Chesnay Schepler wrote:

Sounds good to me.

On 18/07/2019 12:07, Robert Metzger wrote:

Infra has finally changed the permissions. I just announced the
change in a
separate email [1].

One thing I wanted to discuss here is, how do we want to handle all the
"contributor permissions" requests?

My proposal is to basically reject them with a nice message, pointing
them
to my announcement.

What do you think?



[1]


https://lists.apache.org/thread.html/4ed570c7110b7b55b5c3bd52bb61ff35d5bda88f47939d8e7f1844c4@%3Cdev.flink.apache.org%3E



On Thu, Jul 4, 2019 at 1:21 PM Robert Metzger 
wrote:


This is the Jira ticket I opened
https://issues.apache.org/jira/browse/INFRA-18644 a long time ago :)

On Thu, Jul 4, 2019 at 10:47 AM Chesnay Schepler 
wrote:


@Robert what's the state here?

On 24/06/2019 16:16, Robert Metzger wrote:

Hey all,

I would like to drive this discussion to an end soon.
I've just merged the updated contribution guide to the Flink website:
https://flink.apache.org/contributing/contribute-code.html

I will now ask Apache INFRA to change the permissions in our Jira.

Here's the updated TODO list:

1. I update the contribution guide DONE
2. Update Flinkbot to close invalid PRs, and show warnings on PRs
with
unassigned JIRAs IN PROGRESS
3. We ask Infra to change the permissions of our JIRA so that: IN

PROGRESS

 a) only committers can assign users to tickets
 b) contributors can't assign users to tickets
 c) Every registered JIRA user is an assignable user in FLINK





On Fri, May 24, 2019 at 9:18 AM Robert Metzger 

wrote:

Hey,

I started working on step 1 and proposed some changes to the Flink
website: https://github.com/apache/flink-web/pull/217



On Tue, Apr 30, 2019 at 4:08 PM Robert Metzger 
Hi Fabian,
You are right, I made a mistake. I don't think it makes sense to
introduce a new permission class. This will make the life of JIRA

admins

unnecessarily complicated.
I updated the task list:

1. I update the contribution guide
2. Update Flinkbot to close invalid PRs, and show warnings on
PRs with
unassigned JIRAs
3. We ask Infra to change the permissions of our JIRA so that:
 a) only committers can assign users to tickets
 b) contributors can't assign users to tickets
 c) Every registered JIRA user is an assignable user in FLINK
4. We remove all existing contributors


On Tue, Apr 30, 2019 at 12:00 PM Fabian Hueske 

wrote:

Hi Robert,

If I understood the decision correctly, we also need to ask
Infra to

make

everybody an assignable user, right?
Or do we want to add a new permission class "Assignable User" such

that

everyone still needs to ask for the right Jira permissions?

Fabian


Am Di., 30. Apr. 2019 um 10:46 Uhr schrieb Timo Walther <
twal...@apache.org

:
Hi Robert,

thanks for taking care of this. +1 to your suggested steps.

Regards,
Timo


Am 30.04.19 um 10:42 schrieb Robert Metzger:

@Stephan: I agree. Auto-closing PRs is quite aggressive.
I will only do that for PRs without JIRA ID or "[hotfix]" in the

title.

We can always revisit this at a later stage.


I'm proposing the following steps:

1. I update the contribution guide
2. Update Flinkbot to close invalid PRs, and show warnings on
PRs

with

unassigned JIRAs
3. We ask Infra to change the permissions of our JIRA so that:
  a) only committers can assign users to tickets
  b) contributors can't assign users to tickets
4. We remove all existing contributors




On Wed, Apr 24, 2019 at 11:17 AM vino yang


wrote:

also +1 for option 2.

I think auto-closing a PR is sometimes a bit impertinent.
The reasons are just as Stephan mentioned.

Stephan Ewen  于2019年4月24日周三
下午4:08写道:


About auto-closing PRs from unassigned issues, consider the

following

case

that has happened quite a bit.

  - a user reports a small bug and immediately wants to

provide a

fix

for

it
  - it makes sense to not stall the user for a few days
until a

committer

assigned the issue
  - not auto-closing the PR would at least allow the
user to

provide

their

patch.

On Wed, Apr 24, 2019 at 10:00 AM Stephan 

Re: flink-mapr-fs failed in travis

2019-07-19 Thread Chesnay Schepler
I did modify the .travis.yml to activate the unsafe-mapr-repo profile; 
did I modify the wrong profile?...



On 19/07/2019 07:57, Jark Wu wrote:

It seems that it is introduced by this commit:
https://github.com/apache/flink/commit/5c36c650e6520d92191ce2da33f7dcae774319f6
Hi @Chesnay Schepler  , do we need to add
"-Punsafe-mapr-repo" to the ".travis.yml"?

Best,
Jark

On Fri, 19 Jul 2019 at 10:58, JingsongLee 
wrote:


Hi everyone:

flink-mapr-fs failed in travis, and I retried many times, and also failed.
Anyone has idea about this?

01:32:54.755 [ERROR] Failed to execute goal on project flink-mapr-fs:
Could not resolve dependencies for project
org.apache.flink:flink-mapr-fs:jar:1.10-SNAPSHOT: Failed to collect
dependencies at com.mapr.hadoop:maprfs:jar:5.2.1-mapr: Failed to read
artifact descriptor for com.mapr.hadoop:maprfs:jar:5.2.1-mapr: Could not
transfer artifact com.mapr.hadoop:maprfs:pom:5.2.1-mapr from/to
mapr-releases (https://repository.mapr.com/maven/):
sun.security.validator.ValidatorException: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find
valid certification path to requested target -> [Help 1]

https://api.travis-ci.org/v3/job/560790299/log.txt

Best, Jingsong Lee





Re: flink-mapr-fs failed in travis

2019-07-19 Thread Chesnay Schepler

Ah, I added it to the common options in travis_mvn_watchdog.sh.

On 19/07/2019 09:58, Chesnay Schepler wrote:
I did modify the .travis.yml to activate the unsafe-mapr-repo profile;
did I modify the wrong profile?...






Re: flink-mapr-fs failed in travis

2019-07-19 Thread Chesnay Schepler

I think I found the issue; I forgot to update travis_controller.sh .

On 19/07/2019 10:02, Chesnay Schepler wrote:

Ah, I added it to the common options in travis_mvn_watchdog.sh.












[NOTICE] SSL issue when building flink-mapr-fs

2019-07-19 Thread Chesnay Schepler

Hello,

a while ago, the Flink PMC was informed about a security risk in our 
build process: we were accessing various Maven repositories without 
HTTPS. This issue was resolved in FLINK-12578 for 1.7 onwards.


However, there is an ongoing issue with the MapR repository where you 
may run into an SSLException when building the flink-mapr-fs module. 
MapR has been made aware of this issue, but we're still waiting for a 
solution.


If you run into this, you can use the "unsafe-mapr-repo" profile to 
revert to the previous behavior. Please also comment either here or in 
FLINK-12578 so that we can gauge how many devs are affected; in case 
this is widespread we may have to look at other alternatives (like 
excluding it from the build process by default).
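
For anyone unsure how to enable the profile, a command-line sketch (the module path is an assumption on my part and may differ between Flink versions):

```shell
# Build only the MapR filesystem module with the workaround profile enabled.
mvn clean install -pl flink-filesystems/flink-mapr-fs -Punsafe-mapr-repo
```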


Regards,

Chesnay



Re: [DISCUSS] Publish the PyFlink into PyPI

2019-07-24 Thread Chesnay Schepler
If we ship a binary, we should ship the binary we usually ship, not some 
highly customized version.


On 24/07/2019 05:19, Dian Fu wrote:

Hi Stephan & Jeff,

Thanks a lot for sharing your thoughts!

Regarding the bundled jars, currently only the jars in the Flink binary 
distribution are packaged in the pyflink package. It may be a good idea to also 
bundle other jars such as flink-hadoop-compatibility. We may also need to 
consider whether to bundle the format jars (such as flink-avro, flink-json, 
flink-csv) and the connector jars (such as flink-connector-kafka), etc.

If FLINK_HOME is set, the binary distribution specified by FLINK_HOME will be 
used instead.

Regards,
Dian
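
The FLINK_HOME precedence described above can be sketched in a few lines (a minimal sketch; the function name and paths are illustrative assumptions, not the actual pyflink implementation):

```python
import os

def resolve_flink_dist(bundled_dist):
    """Pick the Flink distribution to run against: an explicitly set
    FLINK_HOME takes precedence over the jars bundled in the pip package."""
    return os.environ.get("FLINK_HOME") or bundled_dist

# With FLINK_HOME set, the user-provided distribution wins.
os.environ["FLINK_HOME"] = "/opt/flink"
assert resolve_flink_dist("/site-packages/pyflink/deps") == "/opt/flink"

# Without it, fall back to the jars bundled in the wheel.
del os.environ["FLINK_HOME"]
assert resolve_flink_dist("/site-packages/pyflink/deps") == "/site-packages/pyflink/deps"
```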


On Jul 24, 2019, at 9:47 AM, Jeff Zhang wrote:

+1 for publishing pyflink to pypi.

Regarding including the jars, I just want to make sure which Flink binary
distribution we would ship with pyflink, since we have multiple Flink binary
distributions (with/without Hadoop).
Personally, I prefer to use the Hadoop-included binary distribution.

And I just want to confirm whether it is possible for users to use a
different Flink binary distribution as long as they set the FLINK_HOME env variable.

Besides that, I hope there will be bi-directional links between
the Flink docs and the PyPI docs.



Stephan Ewen wrote on Wed, Jul 24, 2019 at 12:07 AM:


Hi!

Sorry for the late involvement. Here are some thoughts from my side:

Definitely +1 to publishing to PyPI, even if it is a binary release.
Community growth into other communities is great, and if this is the
natural way to reach developers in the Python community, let's do it. This
is not about our convenience, but reaching users.

I think the way to look at this is that this is a convenience distribution
channel, courtesy of the Flink community. It is not an Apache release, we
make this clear in the Readme.
Of course, this doesn't mean we don't try to uphold similar standards as
for our official release (like proper license information).

Concerning credentials sharing, I would be fine with whatever option. The
PMC doesn't own it (it is an initiative by some community members), but the
PMC needs to ensure trademark compliance, so slight preference for option
#1 (PMC would have means to correct problems).

I believe there is no need to differentiate between Scala versions, because
this is merely a convenience thing for pure Python users. Users that mix
python and scala (and thus depend on specific scala versions) can still
download from Apache or build themselves.

Best,
Stephan



On Thu, Jul 4, 2019 at 9:51 AM jincheng sun 
wrote:


Hi All,

Thanks for the feedback @Chesnay Schepler  @Dian!

I think using `apache-flink` for the project name also makes sense to me,
since we should always keep in mind that Flink is owned by Apache. (Beam
also uses this pattern, `apache-beam`, for its Python API.)

Regarding releasing the Python API with the Java JARs, I think the guiding
principle should be the convenience of the user. So, thanks for the
explanation @Dian!

And you're right @Chesnay Schepler, we can't make a
hasty decision and we need more people's opinions!

So, I appreciate it if anyone can give us feedback and suggestions!

Best,
Jincheng




Chesnay Schepler wrote on Wed, Jul 3, 2019 at 8:46 PM:


So this would not be a source release then, but a full-blown binary
release.

Maybe it is just me, but I find it a bit suspect to ship an entire java
application via PyPI, just because there's a Python API for it.

We definitely need input from more people here.

On 03/07/2019 14:09, Dian Fu wrote:

Hi Chesnay,

Thanks a lot for the suggestions.

Regarding “distributing java/scala code to PyPI”:
The Python Table API is just a wrapper of the Java Table API, and without
the java/scala code, two steps are needed to set up an environment to
execute a Python Table API program:

1) Install pyflink using "pip install apache-flink"
2) Download the flink distribution and set FLINK_HOME to it.
Besides, users have to make sure that the manually installed Flink is
compatible with the pip-installed pyflink.

Bundling the java/scala code inside the Python package will eliminate step
2) and make it simpler for users to install pyflink. There was a short
discussion <https://issues.apache.org/jira/browse/SPARK-1267> on this in
the Spark community and they finally decided to package the java/scala code
in the python package. (BTW, PySpark only bundles the jars of Scala 2.11.)

Regards,
Dian


On Jul 3, 2019, at 7:13 PM, Chesnay Schepler wrote:

The existing artifact in the pyflink project was neither released by
the Flink project / anyone affiliated with it nor approved by the Flink
PMC.

As such, if we were to use this account I believe we should delete it
to not mislead users into thinking that this is in any way an Apache-provided
distribution. Since this goes against the user's wishes, I would be in favor
of creating a separate account, and giving back control over the pyflink
account.

My take on the raised poin

Re: [Requirement] CI report

2019-07-26 Thread Chesnay Schepler

Noted, I'll see what I can do.

On 23/07/2019 10:15, Zili Chen wrote:

Hi,

Currently, our flinkbot updates the CI report when the status changes.

However, it updates by editing a GitHub comment, which does not send
a notification to the PR creator when the status is updated.

Since the "PENDING" status is not very useful, would it be possible for
flinkbot to post the final status (FAILURE/SUCCESS) as a new
comment? That would be similar to how the Hadoop bot posts updates on JIRA.


Best,
tison.
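
The proposal above amounts to a small change in the bot's update policy; a rough sketch (hypothetical function and status names, not the actual flinkbot code):

```python
TERMINAL_STATUSES = {"SUCCESS", "FAILURE"}

def choose_update_action(new_status):
    """Edit the existing tracking comment for intermediate states (a silent
    update), but post a brand-new comment for terminal states so the PR
    creator receives a GitHub notification."""
    if new_status in TERMINAL_STATUSES:
        return "post_new_comment"   # new comments trigger notifications
    return "edit_existing_comment"  # edits do not notify anyone

assert choose_update_action("PENDING") == "edit_existing_comment"
assert choose_update_action("SUCCESS") == "post_new_comment"
```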





Re: Something wrong with travis?

2019-07-30 Thread Chesnay Schepler
There is nothing to report; we already know what the problem is but it 
cannot be fixed.


On 30/07/2019 08:46, Yun Tang wrote:

I met this problem again at https://api.travis-ci.com/v3/job/220732163/log.txt 
. Is there anywhere we could ask for help, a way to contact Travis, or any 
clues we could use to figure this out?

Best
Yun Tang

From: Yun Tang 
Sent: Monday, June 24, 2019 14:22
To: dev@flink.apache.org ; Kurt Young 
Subject: Re: Something wrong with travis?

Unfortunately, I met this problem again just now: 
https://api.travis-ci.org/v3/job/549534496/log.txt (build overview: 
https://travis-ci.org/apache/flink/builds/549534489). Non-committers, 
including me, have to close and reopen the PR or push another commit to 
re-trigger the PR check🙁

Best
Yun Tang

From: Chesnay Schepler 
Sent: Wednesday, June 19, 2019 16:59
To: dev@flink.apache.org; Kurt Young
Subject: Re: Something wrong with travis?

Recent builds are passing again.

On 18/06/2019 08:34, Kurt Young wrote:

Hi dev,

I noticed that all the travis tests triggered by pull request are failed
with the same error:

"Cached flink dir /home/travis/flink_cache/x/flink does not exist.
Exiting build."

Anyone have a clue on what happened and how to fix this?

Best,
Kurt





Re: REST API / JarRunHandler: More flexibility for launching jobs

2019-07-31 Thread Chesnay Schepler
Couldn't the beam job server use the same work-around we're using in the 
JarRunHandler to get access to the JobGraph?


On 26/07/2019 17:38, Thomas Weise wrote:

Hi Till,

Thanks for taking a look!

The Beam job server does not currently have the ability to just output the
job graph (and related artifacts) that could then be used with the
JobSubmitHandler. It is itself using StreamExecutionEnvironment, which in
turn will lead to a REST API submission.

Here I'm looking at what happens before the Beam job server gets involved:
the interaction of the k8s operator with the Flink deployment. The jar run
endpoint (ignoring the current handler implementation) is generic and
pretty much exactly matches what we would need for a uniform entry point.
It's just that in the Beam case the jar file would itself be a "launcher"
that doesn't provide the job graph itself, but the dependencies and
mechanism to invoke the actual client.

I could accomplish what I'm looking for by creating a separate REST
endpoint that looks almost the same. But I would prefer to reuse the Flink
REST API interaction that is already implemented for the Flink Java jobs to
reduce the complexity of the deployment.

Thomas




On Fri, Jul 26, 2019 at 2:29 AM Till Rohrmann  wrote:


Hi Thomas,

quick question: Why do you wanna use the JarRunHandler? If another process
is building the JobGraph, then one could use the JobSubmitHandler which
expects a JobGraph and then starts executing it.

Cheers,
Till

On Thu, Jul 25, 2019 at 7:45 PM Thomas Weise  wrote:


Hi,

While considering different options to launch Beam jobs through the Flink
REST API, I noticed that the implementation of JarRunHandler places quite a
few restrictions on how the entry point shall construct a Flink job, by
extracting and manipulating the job graph.

That's normally not a problem for Flink Java programs, but in the scenario
I'm looking at, the job graph would be constructed by a different process
and isn't available to the REST handler. Instead, I would like to be able
to just respond with the job ID of the already launched job.

For context, please see:




https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.fh2f571kms4d

The current JarRunHandler code is here:




https://github.com/apache/flink/blob/f3c5dd960ff81a022ece2391ed3aee86080a352a/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/handlers/JarRunHandler.java#L82

It would be nice if there was an option to delegate the responsibility for
job submission to the user code / entry point. That would be useful for
Beam and other frameworks built on top of Flink that dynamically create a
job graph from a different representation.

Possible ways to get there:

* an interface that the main class can implement and which, when present, the
jar run handler calls instead of main()

* an annotated method

Either way query parameters like savepoint path and parallelism would be
forwarded to the user code and the result would be the ID of the launched
job.

Thoughts?

Thanks,
Thomas





Re: [DISCUSS] ARM support for Flink

2019-07-31 Thread Chesnay Schepler
We (as in the Flink PMC) cannot add apps to the GitHub repo; please 
check first with INFRA whether these CI systems are allowed.


On 30/07/2019 03:44, Xiyuan Wang wrote:

Hi Stephan,
   Maybe I misled you in the previous email. We don't need to migrate CI
completely; Travis CI will still be there for the x86 arch. What we need to
do is add another CI tool for the ARM arch.

   There are some ways to do it. As I wrote on
https://issues.apache.org/jira/browse/FLINK-13199 to @Chesnay:

1. Add the OpenLab CI system for ARM arch testing. OpenLab is very similar to
Travis CI. What Flink needs to do is add the OpenLab GitHub app to the
repo, then add the job definition files inside the Flink repo. Here is a POC by me:
https://github.com/theopenlab/flink/pull/1
2. OpenLab will donate ARM resources to the Apache Infra team as well. Then
Flink can use the official Apache Jenkins system for Flink ARM tests in the
future. https://builds.apache.org/
3. Use Drone CI, which supports the ARM arch as well. https://drone.io/

Since I'm from the OpenLab community, if Flink chooses OpenLab CI, my OpenLab
colleagues and I can keep helping and maintaining the ARM CI job. If we choose
the 2nd way, the CI maintenance work may be handled by the Apache Infra team, I
guess. If we choose the 3rd option, Drone CI, what we can help with is very limited.
AFAIK, Drone uses containers for CI tests, which may not satisfy some
requirements, while OpenLab uses VMs.

We need the Flink core team's decision and reply.

Thanks.


Stephan Ewen wrote on Mon, Jul 29, 2019 at 6:05 PM:


I don't think it is feasible for Flink to migrate CI completely.

Is there a way to add ARM tests on an external CI in addition?
@Chesnay what do you think?


On Fri, Jul 12, 2019 at 4:45 AM Xiyuan Wang 
wrote:


Hi Stephan,
   yeah, we should add an ARM CI first. But Travis CI doesn't support the ARM
arch itself; the OpenLab community supports it. As I mentioned before, OpenLab
is an open-source CI system like Travis CI [1]; it uses the open-source CI
project `zuul` [2] for its deployment. Some open-source projects have
already integrated with it, for example the `containerd` project from the
CNCF community [3]. And I have a POC for the Flink ARM build and test using
OpenLab. The build now passes [4], and I'm working on debugging the
`test` part [5]. Is it fine for Flink to use this?

[1]: https://openlabtesting.org
[2]: https://zuul-ci.org/docs/zuul/
[3]: https://status.openlabtesting.org/projects
[4]:
https://status.openlabtesting.org/build/2aa33f1a87854679b70f36bd6f75a890
[5]: https://github.com/theopenlab/flink/pull/1


Stephan Ewen wrote on Thu, Jul 11, 2019 at 9:56 PM:


I think an ARM release would be cool.

To actually support that properly, we would need something like an ARM
profile for the CI builds (at least in the nightly tests); otherwise ARM
support would probably be broken frequently.
Maybe that could be a way to start? Create a Travis CI ARM build (if
possible) and see what tests pass and which parts of the system would need
to be adjusted?

On Thu, Jul 11, 2019 at 9:24 AM Xiyuan Wang 
wrote:


Hi Yun:
   I didn't try to build rocksdb with Vagrant, but just ran `make -j8
rocksdbjava` directly on an ARM machine. We hit some issues as well. My
colleague has created an issue in rocksdb [1]. RocksDB doesn't contain an
ARM .so file in its official jar package. If you have the same request,
let's work together there.

Thanks.

[1]: https://github.com/facebook/rocksdb/issues/5559

Thanks.

[1]: https://github.com/facebook/rocksdb/issues/5559

Yun Tang wrote on Thu, Jul 11, 2019 at 12:01 PM:


Hi Xiyuan

Have you ever tried to release RocksDB on ARM as the official doc [1]
suggests? From our experience, cross-building for ARM did not work well
with Vagrant and we had to build RocksDB's binary file on ARM separately.
As frocksdb [2] might not always be maintained in Flink, I think we'd better
support releasing RocksDB-java for ARM officially.


[1]

https://github.com/facebook/rocksdb/blob/master/java/RELEASE.md

[2] https://github.com/dataArtisans/frocksdb

Best
Yun Tang



From: Xiyuan Wang 
Sent: Tuesday, July 9, 2019 10:52
To: dev@flink.apache.org
Subject: Re: [DISCUSS] ARM support for Flink

Thanks for your help. I built frocksdb locally on ARM and all the related
tests pass now. Apart from some tests which can be fixed easily, it seems
that both building and testing run well on ARM.

Based on my tests, is it possible to support Flink on ARM officially? The
worklist seems not too long, and I can help with the CI testing part.

We need the Flink team's input.

Thanks.

Dian Fu wrote on Mon, Jul 8, 2019 at 10:23 AM:


Hi Xiyuan,

Thanks for bring the discussion.

WRT the exception, it's because the native library bundled in the rocksdb
jar file isn't compiled with cross-platform support. You can refer to [1]
for how to build a rocksdb that supports the ARM platform.

WRT ARM support, the rocksdb currently used in Flink is hosted in the
Ververica git [2], so it won't be difficult to make it support ARM.
However, I guess this git exists just temporarily [3], not because we
want to add many features to rocksdb.

Re: [DISCUSS] ARM support for Flink

2019-08-01 Thread Chesnay Schepler
Please open a JIRA with INFRA and ask whether OpenLab/Drone are 
supported by INFRA.


On 01/08/2019 04:16, Xiyuan Wang wrote:

Thanks for your reply.

We are continuing to investigate and debug Flink on ARM. It's hard for us
to say how many kinds of tests are enough for ARM support at this moment,
but `core` and `tests` are necessary, of course, I think. What we do now
follows Travis CI: we added all the modules that Travis CI covers.

During our local tests, only a few tests failed [1]. We have solutions for
some of them; others are still under debugging. Ideas from the Flink team
are welcome. And many thanks for your JIRA issue [2], we will keep
updating it.

It'll be great if the Infra team could add the OpenLab app [3] (or another
CI, if Flink so chooses) to the Flink repo. I'm not clear on how to talk
with the Infra team. Should the Flink team start the discussion, or should
I send a mail to the Infra list? Need your help.

Then once the app is added, perhaps we can add the `core` and `tests` jobs
as a first step, make them run stably and successfully, and then add more
modules if needed.

[1]: https://etherpad.net/p/flink_arm64_support
[2]: https://issues.apache.org/jira/browse/FLINK-13448
[3]: https://github.com/apps/theopenlab-ci

Regards
wangxiyuan

Stephan Ewen wrote on Wed, Jul 31, 2019 at 9:46 PM:


Wow, that is pretty nice work, thanks a lot!

We need some support from Apache Infra to see if we can connect the Flink
Github Repo with the OpenLab CI.
We would also need a discussion on the developer mailing list, to get
community agreement.

Have you looked at whether we need to run all tests with ARM, or whether
maybe only the "core" and "tests" profile would be enough to get confidence
that Flink runs on ARM?
Just asking because Flink has a lot of long running tests by now that can
easily eat up a lot of CI capacity.

Best,
Stephan




Re: [DISCUSS][CODE STYLE] Breaking long function argument lists and chained method calls

2019-08-02 Thread Chesnay Schepler

Just so everyone remembers:

Any suggested code-style should be
a) configurable in the IDE (otherwise we'll never be able to auto-format)
b) be verifiable via checkstyle (otherwise we'll end up manually 
checking for code-style again)


On 02/08/2019 03:20, SHI Xiaogang wrote:

Hi Andrey,

Thanks for bringing this up. Personally, I prefer the following style, which
(1) puts the right parenthesis on the next line
(2) uses a new line for each exception if the exceptions cannot be put on the
same line

That way, parentheses are aligned in a similar way to braces, and exceptions
can be well aligned.

public void func(
    int arg1,
    int arg2,
    ...
) throws E1, E2, E3 {
    ...
}

or

public void func(
    int arg1,
    int arg2,
    ...
) throws
    E1,
    E2,
    E3 {
    ...
}

Regards,
Xiaogang

Andrey Zagrebin wrote on Thu, Aug 1, 2019 at 11:19 PM:


Hi all,

This is one more small suggestion for the recent thread about code style
guide in Flink [1].

We already have a note about using a new line for each chained call in
Scala, e.g. either:

values.stream().map(...).collect(...);

or

values
    .stream()
    .map(...)
    .collect(...)

if keeping all chained calls in one line would result in a too long line.

The suggestion is to have it for Java as well and add the same rule for a
long list of function arguments. So it is either:

public void func(int arg1, int arg2, ...) throws E1, E2, E3 {
    ...
}

or

public void func(
    int arg1,
    int arg2,
    ...) throws E1, E2, E3 {
    ...
}

but thrown exceptions stay on the same last line.

Please, feel free to share you thoughts.

Best,
Andrey

[1]

http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3ced91df4b-7cab-4547-a430-85bc710fd...@apache.org%3E





Re: [RESULT][VOTE] Migrate to sponsored Travis account

2019-08-02 Thread Chesnay Schepler
I'm currently modifying the cibot to do this automatically; it should be 
finished by Monday.


On 02/08/2019 07:41, Jark Wu wrote:

Hi Chesnay,

Can we give Flink committers permissions on the flink-ci/flink repo?
Several times, when I pushed new commits, the old build jobs were still
pending and not canceled.
Until we fix that, we could manually cancel old jobs to save build
resources.

Best,
Jark
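
The manual cancellation Jark describes (and the automation Chesnay mentions) boils down to selecting superseded builds; a rough sketch with an assumed data shape (the real cibot logic and the Travis build model may differ):

```python
def builds_to_cancel(active_builds, branch):
    """Given the active builds for a mirrored PR branch, select the older
    (superseded) ones to cancel, keeping only the most recent build."""
    same_branch = sorted(
        (b for b in active_builds if b["branch"] == branch),
        key=lambda b: b["number"],
    )
    return same_branch[:-1]  # everything but the newest build

builds = [
    {"branch": "pr-9000", "number": 1},
    {"branch": "pr-9000", "number": 3},
    {"branch": "pr-1234", "number": 2},  # another PR's build stays untouched
]
assert [b["number"] for b in builds_to_cancel(builds, "pr-9000")] == [1]
```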


On Wed, 10 Jul 2019 at 16:17, Chesnay Schepler  wrote:


Your best bet would be to check the first commit in the PR and check the
parent commit.

To re-run things, you will have to rebase the PR on the latest master.

On 10/07/2019 03:32, Kurt Young wrote:

Thanks for all your efforts, Chesnay; it really improves our development
experience. BTW, do you know how to find out which master branch the CI
runs against?

For example, take this one:
https://travis-ci.com/flink-ci/flink/jobs/214542568
It shows a pass for the commits, which were rebased on master when the CI
was triggered. But the master branch the CI ran on could be either the
same as or different from the current master. If it's the same, I can simply
rely on the passing build to push the commits; if it's not, I should find
another way to re-trigger the tests based on the newest master.

Do you know where I can get this information?

Best,
Kurt


On Tue, Jul 9, 2019 at 3:27 AM Chesnay Schepler 

wrote:

The kinks have been worked out; the bot is running again and PR builds
are once again no longer running on ASF resources.

PRs are mirrored to: https://github.com/flink-ci/flink
Bot source: https://github.com/flink-ci/ci-bot

On 08/07/2019 17:14, Chesnay Schepler wrote:

I have temporarily re-enabled running PR builds on the ASF account;
migrating to the Travis subscription caused some issues in the bot
that I have to fix first.

On 07/07/2019 23:01, Chesnay Schepler wrote:

The vote has passed unanimously in favor of migrating to a separate
Travis account.

I will now set things up such that PullRequests are no longer run on
the ASF servers.
This is a major step in reducing our usage of ASF resources.
For the time being we'll use the free Travis plan for flink-ci (i.e. 5
workers, which is the same the ASF gives us). Over the course of the
next week we'll set up the Ververica subscription to increase this limit.

From now on, a bot will mirror all new and updated PullRequests to a
mirror repository (https://github.com/flink-ci/flink-ci) and write an
update into the PR once the build is complete.
I have run the bot for the past 3 days in parallel to our existing
Travis and it was working without major issues.

The biggest change that contributors will see is that there's no
longer an icon next to each commit. We may revisit this in the future.

I'll setup a repo with the source of the bot later.

On 04/07/2019 10:46, Chesnay Schepler wrote:

I've raised a JIRA
<https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to
inquire whether it would be possible to switch to a different Travis
account, and if so what steps would need to be taken.
We need a proper confirmation from INFRA since we are not in full
control of the flink repository (for example, we cannot access the
settings page).

If this is indeed possible, Ververica is willing to sponsor a Travis
account for the Flink project.
This would provide us with more than enough resources than we need.

Since this makes the project more reliant on resources provided by
external companies I would like to vote on this.

Please vote on this proposal, as follows:
[ ] +1, Approve the migration to a Ververica-sponsored Travis
account, provided that INFRA approves
[ ] -1, Do not approve the migration to a Ververica-sponsored
Travis account

The vote will be open for at least 24h, and until we have
confirmation from INFRA. The voting period may be shorter than the
usual 3 days since our current setup is effectively not working.

On 04/07/2019 06:51, Bowen Li wrote:

Re: > Are they using their own Travis CI pool, or did they switch to
an entirely different CI service?

I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
currently moving away from ASF's Travis to their own in-house bare-metal
machines at [1] with a custom CI application at [2]. They've seen
significant improvements w.r.t. both much higher performance and
basically no resource waiting time, a "night-and-day" difference,
quoting Wes.

Re: > If we can just switch to our own Travis pool, just for our
project, then this might be something we can do fairly quickly?

I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/
[2] https://github.com/ursa-labs/ursabot
[3]


https://docs.travis-ci.com/user/migrate/open-source-repository-migration

[4]


https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com



On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler

Re: [RESULT][VOTE] Migrate to sponsored Travis account

2019-08-02 Thread Chesnay Schepler

Update: Implemented and deployed.

On 02/08/2019 12:11, Jark Wu wrote:

Wow. That's great! Thanks Chesnay.

On Fri, 2 Aug 2019 at 17:50, Chesnay Schepler wrote:


I'm currently modifying the cibot to do this automatically; it should be
finished by Monday.

On 02/08/2019 07:41, Jark Wu wrote:
> Hi Chesnay,
>
> Can we assign Flink Committers the permission of flink-ci/flink
repo?
> Several times, when I pushed some new commits, the old build
jobs are still
> in pending and not canceled.
> Before we fix that, we can manually cancel some old jobs to save
build
> resource.
>
> Best,
> Jark
>
>
> On Wed, 10 Jul 2019 at 16:17, Chesnay Schepler
mailto:ches...@apache.org>> wrote:
>
>> Your best bet would be to check the first commit in the PR and
check the
>> parent commit.
>>
>> To re-run things, you will have to rebase the PR on the latest
master.
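The rebase-and-push cycle described above can be sketched end-to-end. The following uses a throwaway scratch repository so the commands are self-contained and the branch name is illustrative; in practice you would run the rebase and force-push in your own Flink clone.

```shell
# Sketch: re-trigger CI by rebasing a PR branch onto the latest master.
set -eu
dir=$(mktemp -d); cd "$dir"
git init -q
git config user.email dev@example.com
git config user.name dev
echo base > base.txt
git add base.txt && git commit -q -m "base"
main=$(git symbolic-ref --short HEAD)   # "master" or "main", depending on git version
git checkout -q -b my-pr
echo change > pr.txt
git add pr.txt && git commit -q -m "pr: my change"
git checkout -q "$main"
echo more > main.txt
git add main.txt && git commit -q -m "new commit on master"
git checkout -q my-pr
git rebase -q "$main"                   # replay the PR commit on top of latest master
git log --format=%s -n 3
# afterwards: git push --force-with-lease origin my-pr   (re-triggers the CI bot)
```

The `--force-with-lease` push is what makes the bot see a new head commit and schedule a fresh build.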
>>
>> On 10/07/2019 03:32, Kurt Young wrote:
>>> Thanks for all your efforts Chesnay, it indeed improves a lot for our
>>> development experience. BTW, do you know how to find out which master
>>> branch the CI runs against?
>>>
>>> For example, like this one:
>>> https://travis-ci.com/flink-ci/flink/jobs/214542568
>>> It shows a pass with the commits, which were rebased on master when the CI
>>> was triggered. But the master branch the CI ran against may be either the
>>> same as or different from the current master. If it's the same, I can simply
>>> rely on the passed information to push commits, but if it's not, I think I
>>> should find another way to re-trigger tests based on the newest master.
>>>
>>> Do you know where can I get such information?
>>>
>>> Best,
>>> Kurt
>>>
>>>
>>> On Tue, Jul 9, 2019 at 3:27 AM Chesnay Schepler wrote:
>>>> The kinks have been worked out; the bot is running again and
pr builds
>>>> are yet again no longer running on ASF resources.
>>>>
>>>> PRs are mirrored to: https://github.com/flink-ci/flink
>>>> Bot source: https://github.com/flink-ci/ci-bot
>>>>
>>>> On 08/07/2019 17:14, Chesnay Schepler wrote:
>>>>> I have temporarily re-enabled running PR builds on the ASF
account;
>>>>> migrating to the Travis subscription caused some issues in
the bot
>>>>> that I have to fix first.
>>>>>
>>>>> On 07/07/2019 23:01, Chesnay Schepler wrote:
>>>>>> The vote has passed unanimously in favor of migrating to a
separate
>>>>>> Travis account.
>>>>>>
>>>>>> I will now set things up such that PullRequests are no longer run on
>>>>>> the ASF servers.
>>>>>> This is a major step in reducing our usage of ASF resources.
>>>>>> For the time being we'll use the free Travis plan for flink-ci (i.e. 5
>>>>>> workers, which is the same as the ASF gives us). Over the course of the
>>>>>> next week we'll set up the Ververica subscription to increase this limit.
>>>>>>   From now on, a bot will mirror all new and updated PullRequests to a
>>>>>> mirror repository (https://github.com/flink-ci/flink-ci) and write an
>>>>>> update into the PR once the build is complete.
>>>>>> I have run the bots for the past 3 days in parallel to our existing
>>>>>> Travis and they were working without major issues.
>>>>>>
>>>>>> The biggest change that contributors will see is that there's no longer
>>>>>> an icon next to each commit. We may revisit this in the future.
>>>>>>
>>>>>> I'll setup a repo with the source of the bot later.
>>>>>>
>>>>>> On 04/07/2019 10:46, Chesnay Schepler wrote:
>>>>>>> I've raised a JIRA
>>>>>>> <https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to
>>>>>>> inquire whether

Re: flink-mapr-fs failed in travis

2019-08-06 Thread Chesnay Schepler

Done.

On 06/08/2019 13:17, Nico Kruber wrote:

Hi Chesnay,
can you backport these changes to the 1.7 release branch as well? Since
this is still a supported version (and may eventually receive a further
release), it would be nice to have the tests running again.
It is currently failing, e.g.
https://travis-ci.org/apache/flink/builds/566447083


Thanks
Nico

On 19/07/2019 10:40, Chesnay Schepler wrote:

I think I found the issue; I forgot to update travis_controller.sh .

On 19/07/2019 10:02, Chesnay Schepler wrote:

Ah, I added it to the common options in travis_mvn_watchdog.sh.

On 19/07/2019 09:58, Chesnay Schepler wrote:

I did modify the .travis.yml to activate the unsafe-mapr-repo
profile; did I modify the wrong profile?...


On 19/07/2019 07:57, Jark Wu wrote:

It seems that it is introduced by this commit:
https://github.com/apache/flink/commit/5c36c650e6520d92191ce2da33f7dcae774319f6

Hi @Chesnay Schepler  , do we need to add
"-Punsafe-mapr-repo" to the ".travis.yml"?

Best,
Jark

On Fri, 19 Jul 2019 at 10:58, JingsongLee wrote:


Hi everyone:

flink-mapr-fs fails in travis; I retried many times and it still failed.
Does anyone have an idea about this?

01:32:54.755 [ERROR] Failed to execute goal on project flink-mapr-fs:
Could not resolve dependencies for project
org.apache.flink:flink-mapr-fs:jar:1.10-SNAPSHOT: Failed to collect
dependencies at com.mapr.hadoop:maprfs:jar:5.2.1-mapr: Failed to read
artifact descriptor for com.mapr.hadoop:maprfs:jar:5.2.1-mapr:
Could not
transfer artifact com.mapr.hadoop:maprfs:pom:5.2.1-mapr from/to
mapr-releases (https://repository.mapr.com/maven/):
sun.security.validator.ValidatorException: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable
to find
valid certification path to requested target -> [Help 1]

https://api.travis-ci.org/v3/job/560790299/log.txt

Best, Jingsong Lee









[DISCUSS] Repository split

2019-08-07 Thread Chesnay Schepler

Hello everyone,

The Flink project sees an ever-increasing amount of dev activity, both 
in terms of reworked and new features.


This is of course an excellent situation to be in, but we are getting to 
a point where the associated downsides are becoming increasingly troublesome.


The ever-increasing build times, in addition to unstable tests, 
significantly slow down the development process.
Additionally, pull requests for smaller features frequently slip through 
the cracks as they are being buried under a mountain of other pull requests.


As a result I'd like to start a discussion on splitting the Flink 
repository.


In this mail I will outline the core idea, and what problems I currently 
envision.


I'd specifically like to encourage those who were part of similar 
initiatives in other projects to share the experiences and ideas.



   General Idea

For starters, the idea is to create a new repository for "flink-connectors".
For the remainder of this mail, the current Flink repository is referred 
to as "flink-main".


There are also other candidates that we could discuss in the future, 
like flink-libraries (the next top-priority repo to ease flink-ml 
development), metric reporters, filesystems and flink-formats.


Moving out flink-connectors provides the most benefits, as we straight 
away save at least an hour of testing time, and not being included in 
the binary distribution simplifies a few things.



   Problems to solve

To make this a reality there's a number of questions we have to discuss; 
some in the short-term, others in the long-term.


1) Git history

   We have to decide whether we want to rewrite the history of sub
   repositories to only contain diffs/commits related to this part of
   Flink, or whether we just fork from some commit in flink-main and
   add a commit to the connector repo that "transforms" it from
   flink-main to flink-connectors (i.e., remove everything unrelated to
   connectors + update module structure etc.).

   The latter option would have the advantage that our commit book
   keeping in JIRA would still be correct, but it would create a
   significant divide between the current and past state of the repository.

2) Maven

   We should look into whether there's a way to share dependency/plugin
   configurations and similar, so we don't have to keep them in sync
   manually across multiple repositories.

   A new parent Flink pom that all repositories define as their parent
   could work; this would imply splicing out part of the current root
   pom.xml.
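As a sketch of what that inheritance could look like, assuming a hypothetical, separately released "flink-parent" artifact (the name and version here are illustrative, not decided):

```xml
<!-- Hedged sketch: a sub-repository's root pom inheriting shared
     dependency/plugin management from a hypothetical "flink-parent"
     artifact resolved from the repository rather than a local path. -->
<parent>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-parent</artifactId>
  <version>1.10-SNAPSHOT</version>
  <relativePath/> <!-- empty: resolve from the repository, not the checkout -->
</parent>
```

The empty `relativePath` is the key detail: it lets each repository build standalone while still sharing one set of managed versions.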

3) Documentation

   Splitting the repository realistically also implies splitting the
   documentation source files (At the beginning we can get by with
   having it still in flink-main).
   We could just move the relevant files to the respective repository
   (while maintaining the directory structure), and merge them when
   building the docs.
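A minimal sketch of such a docs merge, assuming each repository keeps its pages under a top-level docs/ directory (paths and repository names are illustrative):

```shell
# Hedged sketch: merge per-repo docs/ trees into one build directory
# before running the docs build.
set -eu
work=$(mktemp -d)
mkdir -p "$work/flink-main/docs" "$work/flink-connectors/docs" "$work/docs-build"
echo "# Main docs"      > "$work/flink-main/docs/index.md"
echo "# Connector docs" > "$work/flink-connectors/docs/connectors.md"
for repo in flink-main flink-connectors; do
  # keep the directory structure; later repos overlay earlier ones
  cp -R "$work/$repo/docs/." "$work/docs-build/"
done
ls "$work/docs-build"
```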

   We also have to look at how we can handle java-/scaladocs; e.g.
   whether it is possible to aggregate them across projects.

4) CI (end-to-end tests)

   The very basic question we have to answer is whether we want E2E
   tests in the sub repositories. If so, we need to find a way to share
   e2e-tooling.

5) Releases

   We have to discuss what our release process will look like. This may
   also have repercussions on how repositories may depend on each other
   (SNAPSHOT vs LATEST). Note that this should be discussed for each
   repo separately.

   The current options I see are the following:

   a) Single release

   Release all repositories at once as a single product.

   The source release would be a collection of repositories, like
   flink/
   |--flink-main/
   |--flink-core/
   |--flink-runtime/
   ...
   |--flink-connectors/
   ...
   |--flink-.../
   ...

   This option requires a SNAPSHOT dependency between Flink
   repositories, but it is pretty much how things work at the moment.

   b) Synced releases

   Similar to a), except that each repository gets their own source
   release that they may released independent of other repositories.
   For a given release cycle each repo would produce exactly one
   release.

   This option requires a SNAPSHOT dependency between Flink
   repositories. Once any repositories has created an RC or
   finished its release, release-branches in other repos can
   switch to that version.

   This approach is a tad more flexible than a), but requires more
   coordination between the repos.

   c) Separate releases

   Just like we handle flink-shaded; entirely separate release
   cycles; some repositories may have more releases in a given time
   period than others.

   This option implies a LATEST dependency between Flink repositories.

   Note that hybrid approaches would also make sense, like doing b) for
   major versions and c) for bugfix releases.

   For something like flink-lib

Re: [DISCUSS] Repository split

2019-08-08 Thread Chesnay Schepler
>  I would like to also raise an additional issue: currently quite some 
bugs (like release blockers [1]) are being discovered by ITCases of the 
connectors. It means that at least initially, the main repository will 
lose some test coverage.


True, but I think this is more a symptom of us not properly testing the 
contracts that are exposed to connectors.
That we lose test coverage is already a big red flag, as it implies 
that issues were fixed and are now verified by a connector test, and not 
by a test in the Flink core.
We could also look into tooling surrounding the CI bot for running the 
connectors tests on-demand, although this is very much long-term.


On 08/08/2019 13:14, Piotr Nowojski wrote:

Hi,

Thanks for proposing and writing this down Chesney.

Generally speaking +1 from my side for the idea. It will create additional pain 
for cross repository development, like some new feature in connectors that need 
some change in the main repository. I’ve worked in such setup before and the 
teams then regretted having such a split. But I agree that we should try this 
to try to solve the stability/build time issues.

I have no experience in making such kind of splits so I can not help here.

I would like to also raise an additional issue: currently quite some bugs (like 
release blockers [1]) are being discovered by ITCases of the connectors. It 
means that at least initially, the main repository will lose some test coverage.

Piotrek

[1] https://issues.apache.org/jira/browse/FLINK-13593 
<https://issues.apache.org/jira/browse/FLINK-13593>


On 7 Aug 2019, at 13:14, Chesnay Schepler  wrote:

Hello everyone,

The Flink project sees an ever-increasing amount of dev activity, both in terms 
of reworked and new features.

This is of course an excellent situation to be in, but we are getting to a 
point where the associated downsides are becoming increasingly troublesome.

The ever-increasing build times, in addition to unstable tests, significantly 
slow down the development process.
Additionally, pull requests for smaller features frequently slip through the 
cracks as they are being buried under a mountain of other pull requests.

As a result I'd like to start a discussion on splitting the Flink repository.

In this mail I will outline the core idea, and what problems I currently 
envision.

I'd specifically like to encourage those who were part of similar initiatives 
in other projects to share the experiences and ideas.


   General Idea

For starters, the idea is to create a new repository for "flink-connectors".
For the remainder of this mail, the current Flink repository is referred to as 
"flink-main".

There are also other candidates that we could discuss in the future, like 
flink-libraries (the next top-priority repo to ease flink-ml development), 
metric reporters, filesystems and flink-formats.

Moving out flink-connectors provides the most benefits, as we straight away 
save at least an hour of testing time, and not being included in the binary 
distribution simplifies a few things.


   Problems to solve

To make this a reality there's a number of questions we have to discuss; some 
in the short-term, others in the long-term.

1) Git history

   We have to decide whether we want to rewrite the history of sub
   repositories to only contain diffs/commits related to this part of
   Flink, or whether we just fork from some commit in flink-main and
   add a commit to the connector repo that "transforms" it from
   flink-main to flink-connectors (i.e., remove everything unrelated to
   connectors + update module structure etc.).

   The latter option would have the advantage that our commit book
   keeping in JIRA would still be correct, but it would create a
   significant divide between the current and past state of the repository.

2) Maven

   We should look into whether there's a way to share dependency/plugin
   configurations and similar, so we don't have to keep them in sync
   manually across multiple repositories.

   A new parent Flink pom that all repositories define as their parent
   could work; this would imply splicing out part of the current root
   pom.xml.

3) Documentation

   Splitting the repository realistically also implies splitting the
   documentation source files (At the beginning we can get by with
   having it still in flink-main).
   We could just move the relevant files to the respective repository
   (while maintaining the directory structure), and merge them when
   building the docs.

   We also have to look at how we can handle java-/scaladocs; e.g.
   whether it is possible to aggregate them across projects.

4) CI (end-to-end tests)

   The very basic question we have to answer is whether we want E2E
   tests in the sub repositories. If so, we need to find a way to share
   e2e-tooling.

5) Releases

   We have to discuss what our release process will look like. This may
   also

Re: [DISCUSS] Repository split

2019-08-14 Thread Chesnay Schepler

Let's recap a bit:

Several people have raised the argument that build times can be kept in 
check by other means (mostly differential builds, be it via custom 
scripts or by switching to Gradle). I will start a separate 
discussion thread on this topic, since it is a useful discussion in any 
case.
I agree with this, and believe it is feasible to update the CI process 
to behave as if the repository was split.



The suggestion of a "project split" within a single repository was 
brought up.
This approach is a mixed bag; it avoids the downsides to the development 
process that multiple repositories would incur, but also only has few 
upsides. It seems primarily relevant for local development, where one 
might want to skip certain modules when running tests.


There's no benefit from the CI side: since we're still limited to 1 
.travis.yml, whatever rules we want to set up (e.g., "do not test core 
if only connectors are modified") have to be handled by the CI scripts 
regardless of whether the project is split or not.
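Whatever form those CI scripts take, the core of such a rule is a small, pessimistic mapping from changed paths to modules to test. A minimal sketch (the module names and the "core affects everything" rule are illustrative, not Flink's actual build layout):

```python
# Hedged sketch of a differential-build rule a CI script could apply:
# map changed files to modules, and be pessimistic about core changes.
CORE_MODULES = {"flink-core", "flink-runtime"}
ALL_MODULES = {"flink-core", "flink-runtime", "flink-connectors", "flink-libraries"}

def modules_to_test(changed_files):
    """Map changed file paths to the set of modules whose tests must run."""
    changed = {path.split("/", 1)[0] for path in changed_files}
    if changed & CORE_MODULES:
        # Pessimistic rule: a change in core _must_ re-test all modules.
        return set(ALL_MODULES)
    return changed & ALL_MODULES

print(sorted(modules_to_test(["flink-connectors/kafka/Foo.java"])))  # ['flink-connectors']
```

The pessimism is the limiting factor mentioned later in this thread: any change touching a core module forces a full run, so the savings only materialize for leaf-module changes.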


Overall, I'd like to put this item on ice for the time being; the 
subsequent item is related, vastly more impactful and may also render 
this item obsolete.



A major topic of discussion is that of the development process. It was 
pointed out that having a split repository makes the dev process more 
complicated, since certain changes turn into a 2-step process (merge to 
core, then merge to connectors). Others have pointed out that this may 
actually be an advantage, as it (to some extent) enforces that changes 
to core are also tested in core.


I find myself more in the latter camp; it is all too easy for people to 
make a change to the core while making whatever adjustments to 
connectors to make things fit. A recent change to the ClosureCleaner in 
1.8.0 <https://issues.apache.org/jira/browse/FLINK-13586> comes to mind, 
which, with a split repo, may have resulted in build failures in the 
connectors project. (provided that the time-frame between the 2 merges 
is sufficiently large...) As Arvid pointed out, having to feel the pain 
that users have to go through may not be such a bad thing.


This is a fundamental discussion as to whether we want to continue with 
a centralized development of all components.


Robert also pointed out that such a split could result in us 
establishing entirely separate projects. We've had times in the past 
(like the first flink-ml library) where such a setup may have simplified 
things (back then we had lot's of contributors but no committer to 
shepherd the effort; a separate project could be more lenient when it 
comes to appointing new committers).



@Robert We should have a SNAPSHOT dependency /somewhere/ in the 
connector repo, to detect issues (like the ClosureCleaner one) in a 
timely manner and to prepare for new features so that we can have a 
timely release after core, but not necessarily on the master branch.


@Bowen I have implemented and deployed your suggestion to cancel Travis 
builds if the associated PR has been closed.
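For reference, a hedged sketch of how such a cancellation could be issued against the Travis API v3 cancel endpoint; the build id and token below are placeholders, and the actual ci-bot implementation may well differ:

```python
# Hedged sketch: cancel the Travis build of a closed PR via the
# Travis API v3 "cancel" endpoint (build id and token are placeholders).
import urllib.request

def cancel_request(build_id: int, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that cancels a Travis build."""
    return urllib.request.Request(
        url=f"https://api.travis-ci.com/build/{build_id}/cancel",
        method="POST",
        headers={
            "Travis-API-Version": "3",
            "Authorization": f"token {token}",
        },
    )

req = cancel_request(214542568, "XXXX")
print(req.full_url)  # https://api.travis-ci.com/build/214542568/cancel
# sending it would be: urllib.request.urlopen(req)
```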



On 07/08/2019 13:14, Chesnay Schepler wrote:

Hello everyone,

The Flink project sees an ever-increasing amount of dev activity, both 
in terms of reworked and new features.


This is of course an excellent situation to be in, but we are getting 
to a point where the associated downsides are becoming increasingly 
troublesome.


The ever-increasing build times, in addition to unstable tests, 
significantly slow down the development process.
Additionally, pull requests for smaller features frequently slip 
through the cracks as they are being buried under a mountain of other 
pull requests.


As a result I'd like to start a discussion on splitting the Flink 
repository.


In this mail I will outline the core idea, and what problems I 
currently envision.


I'd specifically like to encourage those who were part of similar 
initiatives in other projects to share the experiences and ideas.



   General Idea

For starters, the idea is to create a new repository for 
"flink-connectors".
For the remainder of this mail, the current Flink repository is 
referred to as "flink-main".


There are also other candidates that we could discuss in the future, 
like flink-libraries (the next top-priority repo to ease flink-ml 
development), metric reporters, filesystems and flink-formats.


Moving out flink-connectors provides the most benefits, as we straight 
away save at least an hour of testing time, and not being included in 
the binary distribution simplifies a few things.



   Problems to solve

To make this a reality there's a number of questions we have to 
discuss; some in the short-term, others in the long-term.


1) Git history

   We have to decide whether we want to rewrite the history of sub
   repositories to only contain diffs/commits related to this part of
   Flink, or whether we just fork from so

[DISCUSS] Reducing build times

2019-08-15 Thread Chesnay Schepler

Hello everyone,

improving our build times is a hot topic at the moment so let's discuss 
the different ways how they could be reduced.



   Current state:

First up, let's look at some numbers:

1 full build currently consumes 5h of build time total ("total time"), 
and in the ideal case takes about 1h20m ("run time") to complete from 
start to finish. The run time may fluctuate of course depending on the 
current Travis load. This applies both to builds on the Apache and 
flink-ci Travis.
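As a quick sanity check on those numbers, 5h of total build time finishing in about 1h20m of wall-clock time implies that, on average, between 3 and 4 build profiles run in parallel:

```python
# Back-of-the-envelope check of the figures above.
total_minutes = 5 * 60      # "total time": CPU-minutes consumed by one full build
run_minutes = 1 * 60 + 20   # "run time": best-case wall clock, 1h20m
avg_parallel_profiles = total_minutes / run_minutes
print(avg_parallel_profiles)  # 3.75
```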


At the time of writing, the current queue time for PR jobs (reminder: 
running on flink-ci) is about 30 minutes (which basically means that we 
are processing builds at the rate that they come in), however we are in 
an admittedly quiet period right now.
2 weeks ago the queue times on flink-ci peaked at around 5-6h as 
everyone was scrambling to get their changes merged in time for the 
feature freeze.


(Note: Recently, optimizations were added to the ci-bot where pending builds 
are canceled if a new commit was pushed to the PR or the PR was closed, 
which should prove especially useful during the rush hours we see before 
feature-freezes.)



   Past approaches

Over the years we have done rather few things to improve this situation 
(hence our current predicament).


Beyond the sporadic speedup of some tests, the only notable reduction in 
total build times was the introduction of cron jobs, which consolidated 
the per-commit matrix from 4 configurations (different scala/hadoop 
versions) to 1.


The separation into multiple build profiles was only a work-around for 
the 50m limit on Travis. Running tests in parallel has the obvious 
potential of reducing run time, but we're currently hitting a hard limit 
since a few modules (flink-tests, flink-runtime, 
flink-table-planner-blink) are so loaded with tests that they nearly 
consume an entire profile by themselves (and thus no further splitting 
is possible).


The rework that introduced stages, at the time of introduction, did also 
not provide a speed up, although this changed slightly once more 
profiles were added and some optimizations to the caching have been made.


Very recently we modified the surefire-plugin configuration for 
flink-table-planner-blink to reuse JVM forks for IT cases, providing a 
significant speedup (18 minutes!). So far we have not seen any negative 
consequences.
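For reference, a sketch of what that surefire tweak looks like in a module pom; the exact values in flink-table-planner-blink's pom may differ:

```xml
<!-- Hedged sketch: reuse one JVM fork across test classes instead of
     forking per class; exact values in the actual pom may differ. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <forkCount>1</forkCount>
    <reuseForks>true</reuseForks> <!-- keep the JVM alive between test classes -->
  </configuration>
</plugin>
```

The saving comes from skipping repeated JVM startup and class loading; the risk is state leaking between tests, which is why it needs per-module vetting.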



   Suggestions

This is a list of /all /suggestions for reducing run/total times that I 
have seen recently (in other words, they aren't necessarily mine, nor do I 
necessarily agree with all of them).


1. Enable JVM reuse for IT cases in more modules.
 * We've seen significant speedups in the blink planner, and this
   should be applicable for all modules. However, I presume there's
   a reason why we disabled JVM reuse (information on this would be
   appreciated)
2. Custom differential build scripts
 * Setup custom scripts for determining which modules might be
   affected by change, and manipulate the splits accordingly. This
   approach is conceptually quite straight-forward, but has limits
   since it has to be pessimistic; i.e. a change in flink-core
   _must_ result in testing all modules.
3. Only run smoke tests when PR is opened, run heavy tests on demand.
 * With the introduction of the ci-bot we now have significantly
   more options on how to handle PR builds. One option could be to
   only run basic tests when the PR is created (which may be only
   modified modules, or all unit tests, or another low-cost
   scheme), and then have a committer trigger other builds (full
   test run, e2e tests, etc...) on demand.
4. Move more tests into cron builds
 * The budget version of 3); move certain tests that are either
   expensive (like some runtime tests that take minutes) or in
   rarely modified modules (like gelly) into cron jobs.
5. Gradle
  * Gradle was brought up a few times for its built-in support for
   differential builds; basically providing 2) without the overhead
   of maintaining additional scripts.
 * To date no PoC was provided that shows it working in our CI
   environment (i.e., handling splits & caching etc).
 * This is the most disruptive change by a fair margin, as it would
    affect the entire project, developers and potentially users (if
    they build from source).
6. CI service
 * Our current artifact caching setup on Travis is basically a
   hack; we're basically abusing the Travis cache, which is meant
   for long-term caching, to ship build artifacts across jobs. It's
   brittle at times due to timing/visibility issues and on branches
   the cleanup processes can interfere with running builds. It is
   also not as effective as it could be.
 * There are CI services that provide build artifact caching out of
   the box, which could be useful for us.
 * To date, no PoC for using another CI service has been pr

Re: Watermarks not propagated to WebUI?

2019-08-15 Thread Chesnay Schepler
I remember an issue regarding the watermark fetch request from the WebUI 
exceeding some HTTP size limit, since it tries to fetch all watermarks 
at once, and the format of this request isn't exactly efficient.


Querying metrics for individual operators still works since the request 
is small enough.


Not sure whether we ever fixed that.

On 15/08/2019 12:01, Jan Lukavský wrote:

Hi,

Thomas, thanks for confirming this. I have noticed, that in 1.9 the 
WebUI has been reworked a lot, does anyone know if this is still an 
issue? I currently cannot easily try 1.9, so I cannot confirm or 
disprove that.


Jan

On 8/14/19 6:25 PM, Thomas Weise wrote:
I have also noticed this issue (Flink 1.5, Flink 1.8), and it appears with
higher parallelism.

This can be confusing to the user when watermarks actually work and can be
observed using the metrics.

On Wed, Aug 14, 2019 at 7:36 AM Jan Lukavský  wrote:


Hi,

is it possible, that watermarks are sometimes not propagated to WebUI,
although they are internally moving as normal? I see in WebUI every
operator showing "No Watermark", but outputs seem to be propagated to
sink (and there are watermark sensitive operations involved - e.g.
reductions on fixed windows without early emitting). More strangely,
this happens when I increase parallelism above some threshold. If I use
parallelism of N, watermarks are shown, when I increase it above some
number (seems not to be exactly deterministic), watermarks seems to
disappear.

I'm using Flink 1.8.1.

Did anyone experience something like this before?

Jan

Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-15 Thread Chesnay Schepler
The licensing items aren't a problem; we don't care about Flink modules 
in NOTICE files, and we don't have to update the source-release 
licensing since we don't have a pre-built version of the WebUI in the 
source.


On 15/08/2019 15:22, Kurt Young wrote:

After going through the licenses, I found 2 suspicious items, but I'm not sure
whether they are valid.

1. flink-state-processing-api is packaged into the flink-dist jar, but not
included in the NOTICE-binary file (the one under the root directory) like
other modules.
2. flink-runtime-web distributes some JavaScript dependencies through its
source code; the licenses and NOTICE file were only updated inside the
flink-runtime-web module, but not in the NOTICE file and licenses directory
under the root directory.

Another minor issue I just found is:
FLINK-13558 tries to include table examples in flink-dist, but I cannot find
them in the binary distribution of RC2.

Best,
Kurt


On Thu, Aug 15, 2019 at 6:19 PM Kurt Young  wrote:


Hi Gordon & Timo,

Thanks for the feedback, and I agree with it. I will document this in the
release notes.

Best,
Kurt


On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai 
wrote:


Hi Kurt,

With the same argument as before, given that it is mentioned in the
release
announcement that it is a preview feature, I would not block this release
because of it.
Nevertheless, it would be important to mention this explicitly in the
release notes [1].

Regards,
Gordon

[1] https://github.com/apache/flink/pull/9438

On Thu, Aug 15, 2019 at 11:29 AM Timo Walther  wrote:


Hi Kurt,

I agree that this is a serious bug. However, I would not block the
release because of this. As you said, there is a workaround and the
`execute()` works in the most common case of a single execution. We can
fix this in a minor release shortly after.

What do others think?

Regards,
Timo


Am 15.08.19 um 11:23 schrieb Kurt Young:

HI,

We just find a serious bug around blink planner:
https://issues.apache.org/jira/browse/FLINK-13708
When a user reuses the table environment instance and calls the `execute`
method multiple times for different SQL statements, the later call will
trigger the earlier ones to be re-executed.

It's a serious bug, but it seems we also have a workaround, which is to never
reuse the table environment object. I'm not sure if we should treat this one
as a blocker issue of 1.9.0.

What's your opinion?

Best,
Kurt


On Thu, Aug 15, 2019 at 2:01 PM Gary Yao  wrote:


+1 (non-binding)

Jepsen test suite passed 10 times consecutively

On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek <

aljos...@apache.org>

wrote:


+1

I did some testing on a Google Cloud Dataproc cluster (it gives you a
managed YARN and Google Cloud Storage (GCS)):
- tried both YARN session mode and YARN per-job mode, also using
bin/flink list/cancel/etc. against a YARN session cluster
- ran examples that write to GCS, both with the native Hadoop FileSystem
and a custom “plugin” FileSystem
- ran stateful streaming jobs that use GCS as a checkpoint backend
- tried running SQL programs on YARN using the SQL Cli: this worked for
YARN session mode but not for YARN per-job mode. Looking at the code I
don’t think per-job mode would work from seeing how it is implemented.
But I think it’s an OK restriction to have for now
- in all the testing I had fine-grained recovery (region failover)
enabled but I didn’t simulate any failures


On 14. Aug 2019, at 15:20, Kurt Young  wrote:

Hi,

Thanks for preparing this release candidate. I have verified the following:

- verified the checksums and GPG files match the corresponding release files
- verified that the source archives do not contain any binaries
- built the source release with Scala 2.11 successfully
- ran `mvn verify` locally, met 2 issues [FLINK-13687] and [FLINK-13688],
but both are not release blockers. Other than that, all tests passed.
- ran all e2e tests which don't need to download external packages (it's
very unstable in China and almost impossible to download them), all passed
- started a local cluster, ran some examples. Met a small website display
issue [FLINK-13591], which is also not a release blocker.

Although we have pushed some fixes around the blink planner and hive
integration after RC2, considering these are both preview features, I'm
leaning towards being OK to release without these fixes.

+1 from my side. (binding)

Best,
Kurt


On Wed, Aug 14, 2019 at 5:13 PM Jark Wu  wrote:


Hi Gordon,

I have verified the following things:

- built the source release with Scala 2.12 and Scala 2.11 successfully
- checked/verified signatures and hashes
- checked that all POM files point to the same version
- ran some flink table related end-to-end tests locally and succeeded
(except the TPC-H e2e test failed, which is reported in FLINK-13704)
- started clusters for both Scala 2.11 and 2.12, ran examples, verified
web ui and log output, nothing unexpected
- started a cluster, ran a SQL query to temporal join with kafka so

[DISCUSS] Release flink-shaded 8.0

2019-08-16 Thread Chesnay Schepler

Hello,

I would like to kick off the next flink-shaded release next week. There 
are 2 ongoing efforts that are blocked on this release:


 * [FLINK-13467] Java 11 support requires a bump to ASM to correctly
   handle Java 11 bytecode
 * [FLINK-11767] Reworking the typeSerializerSnapshotMigrationTestBase
   requires asm-commons to be added to flink-shaded-asm

Are there any other changes on anyone's radar that we will have to make 
for 1.10? (will bumping calcite require anything, for example)





Re: [DISCUSS] Reducing build times

2019-08-16 Thread Chesnay Schepler
There appears to be a general agreement that 1) should be looked into; 
I've setup a branch with fork reuse being enabled for all tests; will 
report back the results.


On 15/08/2019 09:38, Chesnay Schepler wrote:

Hello everyone,

improving our build times is a hot topic at the moment so let's 
discuss the different ways how they could be reduced.



   Current state:

First up, let's look at some numbers:

1 full build currently consumes 5h of build time total ("total time"), 
and in the ideal case takes about 1h20m ("run time") to complete from 
start to finish. The run time may fluctuate of course depending on the 
current Travis load. This applies both to builds on the Apache and 
flink-ci Travis.
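For intuition, the two figures above imply roughly four-fold parallelism across build profiles; a quick back-of-the-envelope check (the numbers come from the paragraph above, everything else is illustrative):

```python
# Back-of-the-envelope check of the build-time figures quoted above.
total_minutes = 5 * 60  # "total time": 5h of accumulated build time per full build
run_minutes = 80        # "run time": ~1h20m wall clock in the ideal case

# Effective parallelism implied by the two figures: how many profiles
# must run concurrently for 5h of work to finish in ~1h20m.
parallelism = total_minutes / run_minutes
print(f"effective parallelism: {parallelism:.2f} profiles")  # 3.75
```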


At the time of writing, the current queue time for PR jobs (reminder: 
running on flink-ci) is about 30 minutes (which basically means that 
we are processing builds at the rate that they come in), however we 
are in an admittedly quiet period right now.
2 weeks ago the queue times on flink-ci peaked at around 5-6h as 
everyone was scrambling to get their changes merged in time for the 
feature freeze.


(Note: Recently optimizations were added to ci-bot where pending 
builds are canceled if a new commit was pushed to the PR or the PR was 
closed, which should prove especially useful during the rush hours we 
see before feature-freezes.)



   Past approaches

Over the years we have done rather few things to improve this 
situation (hence our current predicament).


Beyond the sporadic speedup of some tests, the only notable reduction 
in total build times was the introduction of cron jobs, which 
consolidated the per-commit matrix from 4 configurations (different 
scala/hadoop versions) to 1.


The separation into multiple build profiles was only a work-around for 
the 50m limit on Travis. Running tests in parallel has the obvious 
potential of reducing run time, but we're currently hitting a hard 
limit since a few modules (flink-tests, flink-runtime, 
flink-table-planner-blink) are so loaded with tests that they nearly 
consume an entire profile by themselves (and thus no further splitting 
is possible).


The rework that introduced stages, at the time of introduction, did 
also not provide a speed up, although this changed slightly once more 
profiles were added and some optimizations to the caching have been made.


Very recently we modified the surefire-plugin configuration for 
flink-table-planner-blink to reuse JVM forks for IT cases, providing a 
significant speedup (18 minutes!). So far we have not seen any 
negative consequences.
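For reference, the kind of surefire configuration this refers to looks roughly like the following; this is a sketch of the relevant parameters, not the exact flink-table-planner-blink pom:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- run tests in a fixed number of forked JVMs -->
    <forkCount>1</forkCount>
    <!-- reuse each forked JVM across test classes instead of
         spawning a fresh JVM per class -->
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>
```

The trade-off is that reused JVMs can leak state (threads, statics) between test classes, which is likely why reuse was disabled in the first place.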



   Suggestions

This is a list of /all /suggestions for reducing run/total times that 
I have seen recently (in other words, they aren't necessarily mine nor 
may I agree with all of them).


1. Enable JVM reuse for IT cases in more modules.
 * We've seen significant speedups in the blink planner, and this
   should be applicable for all modules. However, I presume there's
   a reason why we disabled JVM reuse (information on this would be
   appreciated)
2. Custom differential build scripts
 * Setup custom scripts for determining which modules might be
   affected by change, and manipulate the splits accordingly. This
   approach is conceptually quite straight-forward, but has limits
   since it has to be pessimistic; i.e. a change in flink-core
   _must_ result in testing all modules.
3. Only run smoke tests when PR is opened, run heavy tests on demand.
 * With the introduction of the ci-bot we now have significantly
   more options on how to handle PR builds. One option could be to
   only run basic tests when the PR is created (which may be only
   modified modules, or all unit tests, or another low-cost
   scheme), and then have a committer trigger other builds (full
   test run, e2e tests, etc...) on demand.
4. Move more tests into cron builds
 * The budget version of 3); move certain tests that are either
   expensive (like some runtime tests that take minutes) or in
   rarely modified modules (like gelly) into cron jobs.
5. Gradle
 * Gradle was brought up a few times for its built-in support for
   differential builds; basically providing 2) without the overhead
   of maintaining additional scripts.
 * To date no PoC was provided that shows it working in our CI
   environment (i.e., handling splits & caching etc).
 * This is the most disruptive change by a fair margin, as it would
   affect the entire project, developers and potentially users (if
   they build from source).
6. CI service
 * Our current artifact caching setup on Travis is basically a
   hack; we're basically abusing the Travis cache, which is meant
   for long-term caching, to ship build artifacts across jobs. It's
   brittle at times due to timing/visibility issues and on bran

Re: [VOTE] Flink Project Bylaws

2019-08-16 Thread Chesnay Schepler
I'm very late to the party, but isn't it a bit weird that we're using a 
voting scheme that isn't laid out in the bylaws?


Additionally, I would heavily suggest to CC priv...@flink.apache.org, as 
we want as many PMC as possible to look at this.
(I would regard this point as a reason for delaying the vote 
conclusion)


On 11/08/2019 10:07, Becket Qin wrote:

Hi all,

I would like to start a voting thread on the project bylaws of Flink. It
aims to help the community coordinate more smoothly. Please see the bylaws
wiki page below for details.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026

The discussion thread is following:

http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-project-bylaws-td30409.html

The vote will be open for at least 6 days. PMC members' votes are
considered as binding. The vote requires 2/3 majority of the binding +1s to
pass.

Thanks,

Jiangjie (Becket) Qin





Re: [VOTE] Flink Project Bylaws

2019-08-16 Thread Chesnay Schepler

The wording of the original mail is ambiguous imo.

"The vote requires 2/3 majority of the binding +1s to pass."

This to me reads very much "This vote passes if 2/3 of all votes after 
the voting period are +1."


Maybe it's just a wording thing, but it was not clear to me that this 
follows the 2/3 majority scheme laid out in the bylaws.
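To make the two readings concrete, here is a small sketch (my own illustration, not part of the bylaws) of the "2/3 majority" scheme as commonly defined for Apache votes, where +1s must be at least twice the -1s among cast binding votes:

```python
def passes_two_thirds(plus_ones: int, minus_ones: int) -> bool:
    """2/3 majority over cast binding votes (abstentions ignored):
    passes when +1s make up at least two thirds of all cast votes."""
    cast = plus_ones + minus_ones
    return cast > 0 and plus_ones * 3 >= cast * 2

# 5 binding +1s, 2 binding -1s: 5/7 >= 2/3, so the vote passes.
print(passes_two_thirds(5, 2))  # True
# 3 binding +1s, 2 binding -1s: 3/5 < 2/3, so it fails.
print(passes_two_thirds(3, 2))  # False
```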


On 16/08/2019 12:51, Dawid Wysakowicz wrote:

AFAIK this voting scheme is described in the "Modifying Bylaws" section,
in the end introducing bylaws is a modify operation ;) . I think it is a
valid point to CC priv...@flink.apache.org in the future. I wouldn't say
it is a must though. The voting scheme requires that every PMC member
has to be reached out to directly, via a private address, if he/she did not
vote in a thread. So every PMC member should be aware of the voting thread.

Best,

Dawid

On 16/08/2019 12:38, Chesnay Schepler wrote:

I'm very late to the party, but isn't it a bit weird that we're using
a voting scheme that isn't laid out in the bylaws?

Additionally, I would heavily suggest to CC priv...@flink.apache.org,
as we want as many PMC as possible to look at this.
(I would regard this point as a reason for delaying the vote
conclusion)

On 11/08/2019 10:07, Becket Qin wrote:

Hi all,

I would like to start a voting thread on the project bylaws of Flink. It
aims to help the community coordinate more smoothly. Please see the
bylaws
wiki page below for details.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026


The discussion thread is following:

http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-project-bylaws-td30409.html


The vote will be open for at least 6 days. PMC members' votes are
considered as binding. The vote requires 2/3 majority of the binding
+1s to
pass.

Thanks,

Jiangjie (Becket) Qin





Re: [DISCUSS] Reducing build times

2019-08-16 Thread Chesnay Schepler

Update:

TL;DR: table-planner is a good candidate for enabling fork reuse right 
away, while flink-tests has the potential for huge savings, but we have 
to figure out some issues first.



Build link: https://travis-ci.org/zentol/flink/builds/572659220

4/8 profiles failed.

No speedup in libraries, python, blink_planner, 7 minutes saved in 
libraries (table-planner).


The kafka and connectors profiles both fail in kafka tests due to 
producer leaks, and no speed up could be confirmed so far:


java.lang.AssertionError: Detected producer leak. Thread name: 
kafka-producer-network-thread | producer-239
at org.junit.Assert.fail(Assert.java:88)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)


The tests profile failed due to various errors in migration tests:

junit.framework.AssertionFailedError: Did not see the expected accumulator 
results within time limit.
at 
org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)

*However*, a normal tests run takes 40 minutes, while this one above 
failed after 19 minutes and is only missing the migration tests (which 
currently need 6-7 minutes). So we could save somewhere between 15 to 20 
minutes here.



Finally, the misc profile fails in YARN:

java.lang.AssertionError
at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)

No significant speedup could be observed in other modules; for 
flink-yarn-tests we can maybe get a minute or 2 out of it.


On 16/08/2019 10:43, Chesnay Schepler wrote:
There appears to be a general agreement that 1) should be looked into; 
I've setup a branch with fork reuse being enabled for all tests; will 
report back the results.


On 15/08/2019 09:38, Chesnay Schepler wrote:

Hello everyone,

improving our build times is a hot topic at the moment so let's 
discuss the different ways how they could be reduced.



   Current state:

First up, let's look at some numbers:

1 full build currently consumes 5h of build time total ("total 
time"), and in the ideal case takes about 1h20m ("run time") to 
complete from start to finish. The run time may fluctuate of course 
depending on the current Travis load. This applies both to builds on 
the Apache and flink-ci Travis.


At the time of writing, the current queue time for PR jobs (reminder: 
running on flink-ci) is about 30 minutes (which basically means that 
we are processing builds at the rate that they come in), however we 
are in an admittedly quiet period right now.
2 weeks ago the queue times on flink-ci peaked at around 5-6h as 
everyone was scrambling to get their changes merged in time for the 
feature freeze.


(Note: Recently optimizations were added to ci-bot where pending 
builds are canceled if a new commit was pushed to the PR or the PR 
was closed, which should prove especially useful during the rush 
hours we see before feature-freezes.)



   Past approaches

Over the years we have done rather few things to improve this 
situation (hence our current predicament).


Beyond the sporadic speedup of some tests, the only notable reduction 
in total build times was the introduction of cron jobs, which 
consolidated the per-commit matrix from 4 configurations (different 
scala/hadoop versions) to 1.


The separation into multiple build profiles was only a work-around 
for the 50m limit on Travis. Running tests in parallel has the 
obvious potential of reducing run time, but we're currently hitting a 
hard limit since a few modules (flink-tests, flink-runtime, 
flink-table-planner-blink) are so loaded with tests that they nearly 
consume an entire profile by themselves (and thus no further 
splitting is possible).


The rework that introduced stages, at the time of introduction, did 
also not provide a speed up, although this changed slightly once more 
profiles were added and some optimizations to the caching have been 
made.


Very recently we modified the surefire-plugin configuration for 
flink-table-planner-blink to reuse JVM forks for IT cases, providing 
a significant speedup (18 minutes!). So far we have not seen any 
negative consequences.



   Suggestions

This is a list of /all /suggestions for reducing run/total times that 
I have seen recently (in other words, they aren't necessarily mine 
nor may I agree with all of them).


1. Enable JVM reuse for IT cases in more modules.
 * We've seen significant speedups in the blink planner, and this
   should be applicable for all modules. However, I presume there's
   a reason why we disabled JVM reuse (information on this would be
   appreciated)
2. Custom differential build sc

Re: [VOTE] Flink Project Bylaws

2019-08-16 Thread Chesnay Schepler

+1 (binding)

Although I think it would be a good idea to always cc 
priv...@flink.apache.org when modifying bylaws, if anything to speed up 
the voting process.


On 16/08/2019 11:26, Ufuk Celebi wrote:

+1 (binding)

– Ufuk


On Wed, Aug 14, 2019 at 4:50 AM Biao Liu  wrote:


+1 (non-binding)

Thanks for pushing this!

Thanks,
Biao /'bɪ.aʊ/



On Wed, 14 Aug 2019 at 09:37, Jark Wu  wrote:


+1 (non-binding)

Best,
Jark

On Wed, 14 Aug 2019 at 09:22, Kurt Young  wrote:


+1 (binding)

Best,
Kurt


On Wed, Aug 14, 2019 at 1:34 AM Yun Tang  wrote:


+1 (non-binding)

But I have a minor question about the "code change" action: for those
"[hotfix]" github pull requests [1], the dev mailing list would not be
notified currently. I think we should change the description of this
action.

[1]
https://flink.apache.org/contributing/contribute-code.html#code-contribution-process

Best
Yun Tang

From: JingsongLee 
Sent: Tuesday, August 13, 2019 23:56
To: dev 
Subject: Re: [VOTE] Flink Project Bylaws

+1 (non-binding)
Thanks Becket.
I've learned a lot from current bylaws.

Best,
Jingsong Lee


--
From:Yu Li 
Send Time: Tuesday, Aug 13, 2019, 17:48
To:dev 
Subject:Re: [VOTE] Flink Project Bylaws

+1 (non-binding)

Thanks for the efforts Becket!

Best Regards,
Yu


On Tue, 13 Aug 2019 at 16:09, Xintong Song  wrote:

+1 (non-binding)

Thank you~

Xintong Song



On Tue, Aug 13, 2019 at 1:48 PM Robert Metzger <rmetz...@apache.org> wrote:


+1 (binding)

On Tue, Aug 13, 2019 at 1:47 PM Becket Qin 
wrote:

Thanks everyone for voting.

For those who have already voted, just want to bring this up to your
attention that there is a minor clarification to the bylaws wiki this
morning. The change is in bold format below:

one +1 from a committer followed by a Lazy approval (not counting the vote
of the contributor), moving to lazy majority if a -1 is received.

Note that this implies that committers can +1 their own commits and merge
right away. *However, the committers should use their best judgement to
respect the component's expertise and ongoing development plan.*

This addition does not really change anything the bylaws meant to set. It
is simply a clarification. If anyone who has cast the vote objects,
please feel free to withdraw the vote.

Thanks,

Jiangjie (Becket) Qin


On Tue, Aug 13, 2019 at 1:29 PM Piotr Nowojski <pi...@ververica.com> wrote:

+1

On 13 Aug 2019, at 13:22, vino yang  wrote:

+1

Tzu-Li (Gordon) Tai wrote on Tue, Aug 13, 2019 at 6:32 PM:

+1

On Tue, Aug 13, 2019, 12:31 PM Hequn Cheng <chenghe...@gmail.com> wrote:

+1 (non-binding)

Thanks a lot for driving this! Good job. @Becket Qin <becket@gmail.com>

Best, Hequn

On Tue, Aug 13, 2019 at 6:26 PM Stephan Ewen <se...@apache.org> wrote:

+1

On Tue, Aug 13, 2019 at 12:22 PM Maximilian Michels <m...@apache.org> wrote:

+1 It's good that we formalize this.

On 13.08.19 10:41, Fabian Hueske wrote:

+1 for the proposed bylaws.
Thanks for pushing this Becket!

Cheers, Fabian

On Mon, Aug 12, 2019 at 4:31 PM, Robert Metzger <rmetz...@apache.org> wrote:

I changed the permissions of the page.

On Mon, Aug 12, 2019 at 4:21 PM Till Rohrmann <trohrm...@apache.org> wrote:

+1 for the proposal. Thanks a lot for driving this discussion Becket!

Cheers,
Till

On Mon, Aug 12, 2019 at 3:02 PM Becket Qin <becket@gmail.com> wrote:

Hi Robert,

That's a good suggestion. Will you help to change the permission on that
page?

Thanks,

Jiangjie (Becket) Qin

On Mon, Aug 12, 2019 at 2:41 PM Robert Metzger <rmetz...@apache.org> wrote:

Thanks for starting the vote.
How about putting a specific version in the wiki up for voting, or
restricting edit access to the page to the PMC?
There were already two changes (very minor) to the page since the vote
has started:

https://cwiki.apache.org/confluence/pages/viewpreviousversions.action?pageId=120731026

I suggest to restrict edit access to the page.



On Mon, Aug 12, 2019 at 11:43 AM Timo Walther <twal...@apache.org> wrote:

+1

Thanks for all the efforts you put into this for documenting how the
project operates.

Regards,
Timo

On 12.08.19 at 10:44, Aljoscha Krettek wrote:

+1

On 11. Aug 2019, at 10:07, Becket Qin <becket@gmail.com> wrote:

Hi all,

I would like to start a voting thread on the project bylaws of Flink. It
aims to help the community coordinate more smoothly. Please see the bylaws
wiki page below for details.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026

The discussion thread is following:

http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-project-bylaws-td30409.html

The vote will be open for at least 6 days. PMC members' votes are
considered as binding. The vote requires 2/3 majority of the binding +1s to
pass.

Thanks,

Jiangjie (Becket) Qin





Re: [DISCUSS] Reducing build times

2019-08-16 Thread Chesnay Schepler
@Aljoscha Shading takes a few minutes for a full build; you can see this 
quite easily by looking at the compile step in the misc profile 
<https://api.travis-ci.org/v3/job/572560060/log.txt>; all modules that 
take longer than a fraction of a second are usually shading lots 
of classes. Note that I cannot tell you how much of this is spent on 
relocations, and how much on writing the jar.


Personally, I'd very much like us to move all shading to flink-shaded; 
this would finally allow us to use newer maven versions without needing 
cumbersome workarounds for flink-dist. However, this isn't a trivial 
affair in some cases; IIRC calcite could be difficult to handle.
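For context, the relocations in question are maven-shade-plugin configuration of roughly this shape; the pattern values below are illustrative, not copied from an actual Flink pom:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- rewrite the dependency's packages (and all references to
           them in the bundled bytecode) into a Flink-owned namespace,
           so user code can use a different version of the dependency -->
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.apache.flink.shaded.com.google.common</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

Moving such relocations into flink-shaded means the main build only has to bundle the already-relocated artifacts, without rewriting any bytecode itself.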


On another note, this would also simplify switching the main repo to 
another build system, since you would no longer have to deal with 
relocations, just packaging + merging NOTICE files.


@BowenLi I disagree, flink-shaded does not include any tests,  API 
compatibility checks, checkstyle, layered shading (e.g., flink-runtime 
and flink-dist, where both relocate dependencies and one is bundled by 
the other), and, most importantly, CI (and really, without CI being 
covered in a PoC there's nothing to discuss).


On 16/08/2019 15:13, Aljoscha Krettek wrote:

Speaking of flink-shaded, do we have any idea what the impact of shading is on 
the build time? We could get rid of shading completely in the Flink main 
repository by moving everything that we shade to flink-shaded.

Aljoscha


On 16. Aug 2019, at 14:58, Bowen Li  wrote:

+1 to Till's points on #2 and #5, especially the potential non-disruptive,
gradual migration approach if we decide to go that route.

To add on, I want to point out that we can actually start with the
flink-shaded project [1], which is a perfect candidate for a PoC. It's of much
smaller size, totally isolated from and not interfering with the flink project
[2], and it actually covers most of our practical feature requirements for
a build tool - all making it an ideal experimental field.

[1] https://github.com/apache/flink-shaded
[2] https://github.com/apache/flink


On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann  wrote:


For the sake of keeping the discussion focused and not cluttering the
discussion thread I would suggest to split the detailed reporting for
reusing JVMs to a separate thread and cross linking it from here.

Cheers,
Till

On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler  wrote:


Update:

TL;DR: table-planner is a good candidate for enabling fork reuse right
away, while flink-tests has the potential for huge savings, but we have
to figure out some issues first.


Build link: https://travis-ci.org/zentol/flink/builds/572659220

4/8 profiles failed.

No speedup in libraries, python, blink_planner, 7 minutes saved in
libraries (table-planner).

The kafka and connectors profiles both fail in kafka tests due to
producer leaks, and no speed up could be confirmed so far:

java.lang.AssertionError: Detected producer leak. Thread name:
kafka-producer-network-thread | producer-239
at org.junit.Assert.fail(Assert.java:88)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)


The tests profile failed due to various errors in migration tests:

junit.framework.AssertionFailedError: Did not see the expected accumulator
results within time limit.
at
org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)

*However*, a normal tests run takes 40 minutes, while this one above
failed after 19 minutes and is only missing the migration tests (which
currently need 6-7 minutes). So we could save somewhere between 15 to 20
minutes here.


Finally, the misc profile fails in YARN:

java.lang.AssertionError
at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)

No significant speedup could be observed in other modules; for
flink-yarn-tests we can maybe get a minute or 2 out of it.

On 16/08/2019 10:43, Chesnay Schepler wrote:

There appears to be a general agreement that 1) should be looked into;
I've setup a branch with fork reuse being enabled for all tests; will
report back the results.

On 15/08/2019 09:38, Chesnay Schepler wrote:

Hello everyone,

improving our build times is a hot topic at the moment so let's
discuss the different ways how they could be reduced.


   Current state:

First up, let's look at some numbers:

1 full build currently consumes 5h of build time total ("total
time"), and in the ideal case takes about 1h20m ("run time") to
complete from start to finish. The run time may fluctuate of course
depending on the current Travis load. This applies both to builds on
the Apache an

Re: [VOTE] FLIP-52: Remove legacy Program interface.

2019-08-21 Thread Chesnay Schepler

+1

On 21/08/2019 13:23, Timo Walther wrote:

+1

Am 21.08.19 um 13:21 schrieb Stephan Ewen:

+1

On Wed, Aug 21, 2019 at 1:07 PM Kostas Kloudas  
wrote:



Hi all,

Following the FLIP process, this is a voting thread dedicated to the
FLIP-52.
As shown from the corresponding discussion thread [1], we seem to agree
that
the Program interface can be removed, so let's make it also official
with a vote.

Cheers,
Kostas


[1]
https://lists.apache.org/thread.html/0dbd0a4adf9ad00d6ad869dffc8820f6ce4c1969e1ea4aafb1dd0aa4@%3Cdev.flink.apache.org%3E 










Re: [VOTE] Apache Flink 1.9.0, release candidate #3

2019-08-21 Thread Chesnay Schepler

+1 (binding)

On 21/08/2019 08:09, Bowen Li wrote:

+1 non-binding

- built from source with default profile
- manually ran SQL and Table API tests for Flink's metadata integration
with Hive Metastore in local cluster
- manually ran SQL tests for batch capability with Blink planner and Hive
integration (source/sink/udf) in local cluster
 - file formats include: csv, orc, parquet


On Tue, Aug 20, 2019 at 10:23 PM Gary Yao  wrote:


+1 (non-binding)

Reran Jepsen tests 10 times.

On Wed, Aug 21, 2019 at 5:35 AM vino yang  wrote:


+1 (non-binding)

- checkout source code and build successfully
- started a local cluster and ran some example jobs successfully
- verified signatures and hashes
- checked release notes and post

Best,
Vino

Stephan Ewen wrote on Wed, Aug 21, 2019 at 4:20 AM:


+1 (binding)

  - Downloaded the binary release tarball
  - started a standalone cluster with four nodes
  - ran some examples through the Web UI
  - checked the logs
  - created a project from the Java quickstarts maven archetype
  - ran a multi-stage DataSet job in batch mode
  - killed a TaskManager and verified correct restart behavior, including
failover region backtracking


I found a few issues, and a common theme here is confusing error
reporting and logging.

(1) When testing batch failover and killing a TaskManager, the job reports
as the failure cause "org.apache.flink.util.FlinkException: The assigned
slot 6d0e469d55a2630871f43ad0f89c786c_0 was removed."
 I think that is a pretty bad error message, as a user I don't know what
that means. Some internal book keeping thing?
 You need to know a lot about Flink to understand that this means
"TaskManager failure".
 https://issues.apache.org/jira/browse/FLINK-13805
 I would not block the release on this, but think this should get pretty
urgent attention.

(2) The Metric Fetcher floods the log with error messages when a
TaskManager is lost.
  There are many exceptions being logged by the Metrics Fetcher due to
not reaching the TM any more.
  This pollutes the log and drowns out the original exception and the
meaningful logs from the scheduler/execution graph.
  https://issues.apache.org/jira/browse/FLINK-13806
  Again, I would not block the release on this, but think this should
get pretty urgent attention.

(3) If you put "web.submit.enable: false" into the configuration, the web
UI will still display the "SubmitJob" page, but errors will
 continuously pop up, stating "Unable to load requested file /jars."
 https://issues.apache.org/jira/browse/FLINK-13799

(4) REST endpoint logs ERROR level messages when selecting the
"Checkpoints" tab for batch jobs. That does not seem correct.
  https://issues.apache.org/jira/browse/FLINK-13795

Best,
Stephan




On Tue, Aug 20, 2019 at 11:32 AM Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote:


+1

Legal checks:
- verified signatures and hashes
- New bundled Javascript dependencies for flink-runtime-web are

correctly

reflected under licenses-binary and NOTICE file.
- locally built from source (Scala 2.12, without Hadoop)
- No missing artifacts in staging repo
- No binaries in source release

Functional checks:
- Quickstart working (both in IDE + job submission)
- Simple State Processor API program that performs offline key schema
migration (RocksDB backend). Generated savepoint is valid to restore

from.

- All E2E tests pass locally
- Didn’t notice any issues with the new WebUI

Cheers,
Gordon

On Tue, Aug 20, 2019 at 3:53 AM Zili Chen  wrote:

+1 (non-binding)

- build from source: OK(8u212)
- check local setup tutorial works as expected

Best,
tison.


Yu Li wrote on Tue, Aug 20, 2019 at 8:24 AM:


+1 (non-binding)

- checked release notes: OK
- checked sums and signatures: OK
- repository appears to contain all expected artifacts
- source release
  - contains no binaries: OK
  - contains no 1.9-SNAPSHOT references: OK
  - build from source: OK (8u102)
- binary release
  - no examples appear to be missing
  - started a cluster; WebUI reachable, example ran

successfully

- checked README.md file and found nothing unexpected

Best Regards,
Yu


On Tue, 20 Aug 2019 at 01:16, Tzu-Li (Gordon) Tai <tzuli...@apache.org>
wrote:

Hi all,

Release candidate #3 for Apache Flink 1.9.0 is now ready for your review.

Please review and vote on release candidate #3 for version 1.9.0, as
follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint 1C1E2394D3194E1944613488F320986D35C33D6A [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag “release-1.9.0-rc3” [5].
* pull requests for the release note documentation [6] and anno

Re: [DISCUSS] Release flink-shaded 8.0

2019-08-21 Thread Chesnay Schepler
Nico has opened a PR for bumping netty; we plan to have this merged by 
tomorrow.


Unless anyone has concerns I will kick off the release on Friday.

On 19/08/2019 12:11, Nico Kruber wrote:

I quickly went through all the changelogs for Netty 4.1.32 (which we
currently use) to the latest Netty 4.1.39.Final. Below, you will find a
list of bug fixes and performance improvements that may affect us. Nice
changes we could benefit from, also for the Java > 8 efforts. The most
important ones fixing leaks etc are #8921, #9167, #9274, #9394, and the
various CompositeByteBuf fixes. The rest are mostly performance
improvements.

Since we are still early in the dev cycle for Flink 1.10, it would maybe
nice to update and verify that the new version works correctly. I'll
create a ticket and PR.


FYI (1): My own patches to bring dynamically-linked openSSL to more
distributions, namely SUSE and Arch, have not made it into a release yet.

FYI (2): We are currently using the latest version of netty-tcnative,
i.e. 2.0.25.


Nico

--
Netty 4.1.33.Final
- Fix ClassCastException and native crash when using kqueue transport
(#8665)
- Provide a way to cache the internal nioBuffer of the PooledByteBuffer
to reduce GC (#8603)

Netty 4.1.34.Final
- Do not use GetPrimitiveArrayCritical(...) due multiple not-fixed bugs
related to GCLocker (#8921)
- Correctly monkey-patch id also when os / arch is used within library
name (#8913)
- Further reduce ensureAccessible() overhead (#8895)
- Support using an Executor to offload blocking / long-running tasks
when processing TLS / SSL via the SslHandler (#8847)
- Minimize memory footprint for AbstractChannelHandlerContext for
handlers that execute in the EventExecutor (#8786)
- Fix three bugs in CompositeByteBuf (#8773)

Netty 4.1.35.Final
- Fix possible ByteBuf leak when CompositeByteBuf is resized (#8946)
- Correctly produce ssl alert when certificate validation fails on the
client-side when using native SSL implementation (#8949)

Netty 4.1.37.Final
- Don't filter out TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (#9274)
- Try to mark child channel writable again once the parent channel
becomes writable (#9254)
- Properly debounce wakeups (#9191)
- Don't read from timerfd and eventfd on each EventLoop tick (#9192)
- Correctly detect that KeyManagerFactory is not supported when using
OpenSSL 1.1.0+ (#9170)
- Fix possible unsafe sharing of internal NIO buffer in CompositeByteBuf
(#9169)
- KQueueEventLoop won't unregister active channels reusing a file
descriptor (#9149)
- Prefer direct io buffers if direct buffers pooled (#9167)

Netty 4.1.38.Final
- Prevent ByteToMessageDecoder from overreading when !isAutoRead (#9252)
- Correctly take length of ByteBufInputStream into account for
readLine() / readByte() (#9310)
- availableSharedCapacity will be slowly exhausted (#9394)
--

On 18/08/2019 16:47, Stephan Ewen wrote:

Are we fine with the current Netty version, or would be want to bump it?

On Fri, Aug 16, 2019 at 10:30 AM Chesnay Schepler <ches...@apache.org> wrote:

 Hello,

 I would like to kick off the next flink-shaded release next week. There
 are 2 ongoing efforts that are blocked on this release:

   * [FLINK-13467] Java 11 support requires a bump to ASM to correctly
     handle Java 11 bytecode
   * [FLINK-11767] Reworking the typeSerializerSnapshotMigrationTestBase
     requires asm-commons to be added to flink-shaded-asm

 Are there any other changes on anyone's radar that we will have to make
 for 1.10? (will bumping calcite require anything, for example)






CiBot Update

2019-08-21 Thread Chesnay Schepler

Hi everyone,

this is an update on recent changes to the CI bot.


The bot now cancels builds if a new commit was added to a PR, and 
cancels all builds if the PR was closed.
(This was implemented a while ago; I'm just mentioning it again for 
discoverability)



Additionally, starting today you can now re-trigger a Travis run by 
writing a comment "@flinkbot run travis"; this means you no longer have 
to commit an empty commit or do other shenanigans to get another build 
running.
Note that this will /not/ work if the PR was re-opened, until at least 1 
new build was triggered by a push.


Re: [RESULT] [VOTE] Apache Flink 1.9.0, release candidate #3

2019-08-22 Thread Chesnay Schepler

Are we also releasing python artifacts for 1.9?

On 21/08/2019 19:23, Tzu-Li (Gordon) Tai wrote:

I'm happy to announce that we have unanimously approved this candidate as
the 1.9.0 release.

There are 12 approving votes, 5 of which are binding:
- Yu Li
- Zili Chen
- Gordon Tai
- Stephan Ewen
- Jark Wu
- Vino Yang
- Gary Yao
- Bowen Li
- Chesnay Schepler
- Till Rohrmann
- Aljoscha Krettek
- David Anderson

There are no disapproving votes.

Thanks everyone who has contributed to this release!

I will wait until tomorrow morning for the artifacts to be available in
Maven central before announcing the release in a separate thread.

The release blog post will also be merged tomorrow along with the official
announcement.

Cheers,
Gordon

On Wed, Aug 21, 2019, 5:37 PM David Anderson  wrote:


+1 (non-binding)

I upgraded the flink-training-exercises project.

I encountered a few rough edges, including problems in the docs, but
nothing serious.

I had to make some modifications to deal with changes in the Table API:

ExternalCatalogTable.builder became new ExternalCatalogTableBuilder
TableEnvironment.getTableEnvironment became StreamTableEnvironment.create
StreamTableDescriptorValidator.UPDATE_MODE() became
StreamTableDescriptorValidator.UPDATE_MODE
org.apache.flink.table.api.java.Slide moved to
org.apache.flink.table.api.Slide

I also found myself forced to change a CoProcessFunction to a
KeyedCoProcessFunction (which it should have been).

I also tried a few complex queries in the SQL console, and wrote a
simple job using the State Processor API. Everything worked.

David


David Anderson | Training Coordinator

Follow us @VervericaData

--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time


On Wed, Aug 21, 2019 at 1:45 PM Aljoscha Krettek wrote:

+1

I checked the last RC on a GCE cluster and was satisfied with the testing. The cherry-picked commits didn't change anything related, so I'm forwarding my vote from there.

Aljoscha


On 21. Aug 2019, at 13:34, Chesnay Schepler wrote:

+1 (binding)

On 21/08/2019 08:09, Bowen Li wrote:

+1 non-binding

- built from source with default profile
- manually ran SQL and Table API tests for Flink's metadata integration with Hive Metastore in local cluster
- manually ran SQL tests for batch capability with Blink planner and Hive integration (source/sink/udf) in local cluster
  - file formats include: csv, orc, parquet


On Tue, Aug 20, 2019 at 10:23 PM Gary Yao  wrote:


+1 (non-binding)

Reran Jepsen tests 10 times.

On Wed, Aug 21, 2019 at 5:35 AM vino yang wrote:

+1 (non-binding)

- checkout source code and build successfully
- started a local cluster and ran some example jobs successfully
- verified signatures and hashes
- checked release notes and post

Best,
Vino

Stephan Ewen wrote on Wed, Aug 21, 2019 at 4:20 AM:


+1 (binding)

  - Downloaded the binary release tarball
  - started a standalone cluster with four nodes
  - ran some examples through the Web UI
  - checked the logs
  - created a project from the Java quickstarts maven archetype
  - ran a multi-stage DataSet job in batch mode
  - killed a TaskManager and verified correct restart behavior, including failover region backtracking


I found a few issues, and a common theme here is confusing error reporting and logging.

(1) When testing batch failover and killing a TaskManager, the job reports as the failure cause "org.apache.flink.util.FlinkException: The assigned slot 6d0e469d55a2630871f43ad0f89c786c_0 was removed."
  I think that is a pretty bad error message, as a user I don't know what that means. Some internal book keeping thing?
  You need to know a lot about Flink to understand that this means "TaskManager failure".
  https://issues.apache.org/jira/browse/FLINK-13805
  I would not block the release on this, but think this should get pretty urgent attention.

(2) The Metric Fetcher floods the log with error messages when a TaskManager is lost.
  There are many exceptions being logged by the Metrics Fetcher due to not reaching the TM any more.
  This pollutes the log and drowns out the original exception and the meaningful logs from the scheduler/execution graph.
  https://issues.apache.org/jira/browse/FLINK-13806
  Again, I would not block the release on this, but think this should get pretty urgent attention.

(3) If you put "web.submit.enable: false" into the configuration, the web UI will still display the "SubmitJob" page, but errors will continuously pop up, stating "Unable to load requested file /jars."
  https://issues.apache.org/jira/browse/FLINK-13799

(4) REST endpoint logs ERROR level messages when selecting the
"Checkpoints" tab for batch jobs. That does not seem correct.
  https://issues.apache.org/jira/browse/FLINK-13795

Best,
Stephan




On Tue, Aug 20, 2019 at 11:32 AM Tzu-Li (Gordon) Tai &l

[NOTICE] GitHub service interruption

2019-08-22 Thread Chesnay Schepler

Hello,

GitHub is currently experiencing problems; so far the one issue we saw ourselves 
is that Travis builds aren't triggered if a commit is pushed. This 
affects builds both for branches and pull requests; cron jobs may be fine.


@Committers: Please keep this in mind when merging things, as any issues 
on master will likely be detected later than usual.




Re: CiBot Update

2019-08-23 Thread Chesnay Schepler
@Ethan Li The source for the CiBot is available here 
<https://github.com/flink-ci/ci-bot/>. The implementation of this 
command is tightly connected to how the CiBot works; but conceptually it 
looks at a PR, finds the most recent build that ran, and uses the Travis 
REST API to restart the build.
Additionally, it keeps track of which comments have been processed by 
storing the comment ID in the CI report.
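Conceptually, the retrigger step can be sketched in a few lines. The Travis v3 restart endpoint (`POST /build/{id}/restart`) is real, but the data shapes and helper names below are illustrative assumptions, not the actual ci-bot code; the comment-ID bookkeeping is omitted:

```python
import urllib.request

TRAVIS_API = "https://api.travis-ci.org"

def latest_build_id(builds):
    """Pick the most recent build for a PR. `builds` is assumed to be a list
    of dicts with 'id' and 'number' keys, mirroring the Travis v3 builds
    listing; this is the 'find the most recent build that ran' step."""
    return max(builds, key=lambda b: int(b["number"]))["id"]

def restart_request(build_id, token):
    """Build (but do not send) the POST that restarts a Travis build."""
    return urllib.request.Request(
        f"{TRAVIS_API}/build/{build_id}/restart",
        method="POST",
        headers={"Travis-API-Version": "3",
                 "Authorization": f"token {token}"})

# Example: three builds for one PR; the bot would restart build 42.
builds = [{"id": 40, "number": "7"}, {"id": 42, "number": "9"}, {"id": 41, "number": "8"}]
req = restart_request(latest_build_id(builds), "SECRET")
print(req.full_url)  # https://api.travis-ci.org/build/42/restart
```

A real bot would send the request with `urllib.request.urlopen(req)` and record the triggering comment's ID so the same comment is not processed twice.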

If you have further questions, feel free to ping me directly.

@Dianfu I agree, we should include it somewhere in either the flinkbot 
template or the CI report.


On 23/08/2019 03:35, Dian Fu wrote:

Thanks Chesnay for your great work! A very useful feature!

Just one minor suggestion: It will be better if we could add this command to the section 
"Bot commands" in the flinkbot template.

Regards,
Dian


On Aug 23, 2019, at 2:06 AM, Ethan Li wrote:

My question is specifically about implementation of "@flinkbot run travis"


On Aug 22, 2019, at 1:06 PM, Ethan Li  wrote:

Hi Chesnay,

This is really nice feature!

Can I ask how is this implemented? Do you have the related Jira/PR/docs that I 
can take a look? I’d like to introduce it to another project if applicable. 
Thank you very much!

Best,
Ethan


On Aug 22, 2019, at 8:34 AM, Biao Liu <mmyy1...@gmail.com> wrote:

Thanks Chesnay a lot,

I love this feature!

Thanks,
Biao /'bɪ.aʊ/



On Thu, 22 Aug 2019 at 20:55, Hequn Cheng <chenghe...@gmail.com> wrote:


Cool, thanks Chesnay a lot for the improvement!

Best, Hequn

On Thu, Aug 22, 2019 at 5:02 PM Zhu Zhu <reed...@gmail.com> wrote:


Thanks Chesnay for the CI improvement!
It is very helpful.

Thanks,
Zhu Zhu

zhijiang <wangzhijiang...@aliyun.com.invalid> wrote on Thu, Aug 22, 2019 at 4:18 PM:


It is really very convenient now. Valuable work, Chesnay!

Best,
Zhijiang
--
From: Till Rohrmann <trohrm...@apache.org>
Send Time: Thu, Aug 22, 2019, 10:13
To: dev <dev@flink.apache.org>
Subject: Re: CiBot Update

Thanks for the continuous work on the CiBot Chesnay!

Cheers,
Till

On Thu, Aug 22, 2019 at 9:47 AM Jark Wu <imj...@gmail.com> wrote:


Great work! Thanks Chesnay!



On Thu, 22 Aug 2019 at 15:42, Xintong Song <tonysong...@gmail.com> wrote:

The re-triggering travis feature is so convenient. Thanks Chesnay~!

Thank you~

Xintong Song



On Thu, Aug 22, 2019 at 9:26 AM Stephan Ewen <se...@apache.org> wrote:

Nice, thanks!

On Thu, Aug 22, 2019 at 3:59 AM Zili Chen <wander4...@gmail.com> wrote:

Thanks for your announcement. Nice work!

Best,
tison.


vino yang <yanghua1...@gmail.com> wrote on Thu, Aug 22, 2019 at 8:14 AM:


+1 for "@flinkbot run travis", it is very convenient.

Chesnay Schepler <ches...@apache.org> wrote on Wed, Aug 21, 2019 at 9:12 PM:

Hi everyone,

this is an update on recent changes to the CI bot.


The bot now cancels builds if a new commit was added to a PR, and cancels all builds if the PR was closed.
(This was implemented a while ago; I'm just mentioning it again for discoverability)


Additionally, starting today you can now re-trigger a Travis run by writing a comment "@flinkbot run travis"; this means you no longer have to commit an empty commit or do other shenanigans to get another build running.
Note that this will /not/ work if the PR was re-opened, until at least 1 new build was triggered by a push.









Re: [DISCUSS] Add ARM CI build to Flink (information-only)

2019-08-23 Thread Chesnay Schepler

I'm wondering what we are supposed to do if the build fails?
We aren't providing any guides on setting up an ARM dev environment; so 
reproducing it locally isn't possible.


On 23/08/2019 17:55, Stephan Ewen wrote:

Hi all!

As part of the Flink on ARM effort, there is a pull request that triggers a
build on OpenLabs CI for each push and runs tests on ARM machines.

Currently that build is roughly equivalent to what the "core" and "tests"
profiles do on Travis.
The result will be posted to the PR comments, similar to the Flink Bot's
Travis build result.
The build currently passes :-) so Flink seems to be okay on ARM.

My suggestion would be to try and add this and gather some experience with
it.
The Travis build results should be our "ground truth" and the ARM CI
(openlabs CI) would be "informational only" at the beginning, but helping
us understand when we break ARM support.

You can see this in the PR that adds the openlabs CI config:
https://github.com/apache/flink/pull/9416

Any objections?

Best,
Stephan





[VOTE] Release flink-shaded 8.0, release candidate #1

2019-08-23 Thread Chesnay Schepler

Hi everyone,
Please review and vote on the release candidate #1 for the version 8.0, 
as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org 
[2], which are signed with the key with fingerprint 11d464BA [3],

* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-8.0-rc1" [5],
* website pull request listing the new release [6].

The vote will be open for at least 72 hours. It is adopted by majority 
approval, with at least 3 PMC affirmative votes.


Thanks,
Chesnay

[1] 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345488

[2] https://dist.apache.org/repos/dist/dev/flink/flink-shaded-8.0-rc1/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1237
[5] https://github.com/apache/flink-shaded/tree/release-8.0-rc1
[6] https://github.com/apache/flink-web/pull/255



Re: [DISCUSS] Add ARM CI build to Flink (information-only)

2019-08-25 Thread Chesnay Schepler
I'm sorry, but if these issues are only fixed later anyway I see no 
reason to run these tests on each PR. We're just adding noise to each PR 
that everyone will just ignore.


I'm curious as to the benefit of having this directly in Flink; why 
aren't the ARM builds run outside of the Flink project, and fixes for it 
provided?


It seems to me like nothing about these arm builds is actually handled 
by the Flink project.


On 26/08/2019 03:43, Xiyuan Wang wrote:

Thanks to Stephan for bringing up this topic.

The package build jobs work well now. I have a simple online demo which was
built and runs on an ARM VM. Feel free to have a try[1].

As the first step for ARM support, maybe it's good to add them now.

For the next step, the test part is still broken. It relates to some
points we found:

1. Some unit tests fail[2] due to Java coding issues. These kinds of failures
can be fixed easily.
2. Some tests fail because they depend on third-party libraries[3]. These
include frocksdb, MapR Client and Netty, which don't have ARM releases.
 a. Frocksdb: I'm testing it locally now via `make check_some` and `make
jtest`, similar to its travis job. There are 3 tests failing under `make
check_some`; please see the ticket for more details. Once the tests pass,
frocksdb can release an ARM package.
 b. MapR Client: this belongs to the MapR company. At this moment, maybe we
should skip MapR support for Flink on ARM.
 c. Netty: Netty actually runs well on our ARM machines. We will ask the
Netty community to release ARM support; if they do not want to, OpenLab will
maintain a Maven repository for some common libraries on ARM.


For Chesnay's concern:

Firstly, the OpenLab team will keep maintaining and fixing the ARM CI, meaning
that once a build or test fails, we'll fix it at once.
Secondly, OpenLab can provide ARM VMs to everyone for reproducing and
testing. You just need to create a Test Request issue in OpenLab[4]. Then
we'll create ARM VMs for you, and you can log in and do what you need.

Does it make sense?

[1]: http://114.115.168.52:8081/#/overview
[2]: https://issues.apache.org/jira/browse/FLINK-13449
     https://issues.apache.org/jira/browse/FLINK-13450
[3]: https://issues.apache.org/jira/browse/FLINK-13598
[4]: https://github.com/theopenlab/openlab/issues/new/choose




Chesnay Schepler wrote on Sat, Aug 24, 2019 at 12:10 AM:


I'm wondering what we are supposed to do if the build fails?
We aren't providing any guides on setting up an ARM dev environment; so
reproducing it locally isn't possible.

On 23/08/2019 17:55, Stephan Ewen wrote:

Hi all!

As part of the Flink on ARM effort, there is a pull request that

triggers a

build on OpenLabs CI for each push and runs tests on ARM machines.

Currently that build is roughly equivalent to what the "core" and "tests"
profiles do on Travis.
The result will be posted to the PR comments, similar to the Flink Bot's
Travis build result.
The build currently passes :-) so Flink seems to be okay on ARM.

My suggestion would be to try and add this and gather some experience

with

it.
The Travis build results should be our "ground truth" and the ARM CI
(openlabs CI) would be "informational only" at the beginning, but helping
us understand when we break ARM support.

You can see this in the PR that adds the openlabs CI config:
https://github.com/apache/flink/pull/9416

Any objections?

Best,
Stephan







Re: [DISCUSS] Flink project bylaws

2019-08-28 Thread Chesnay Schepler
ight be able to revisit this discussion in 3 - 6 months.


On Thu, Jul 18, 2019 at 4:30 AM jincheng sun <sunjincheng...@gmail.com> wrote:


Hi Becket,

Thanks for the proposal.

Regarding the vote on a FLIP, it should preferably include at least a PMC vote, because a FLIP is usually a big change or affects the user's interfaces. What do you think? (I left the comment in the wiki.)

Best,
Jincheng

Dawid Wysakowicz wrote on Wed, Jul 17, 2019 at 9:12 PM:

Hi all,

Sorry for joining late. I just wanted to say that I really like the proposed bylaws!

One comment: I also have the same concerns as expressed by a few people before, that the "committer +1" on code changes might be hard to achieve currently. On the other hand I would say this would be beneficial for the quality/uniformity of our codebase and knowledge sharing.

I was just wondering what should be the next step for this? I think it would make sense to already use those bylaws and put them to a PMC vote.

Best,

Dawid

On 12/07/2019 13:35, Piotr Nowojski wrote:

Hi Aljoscha and Becket

Right, 3 days for FLIP voting is fine I think.

"I'm missing this stated somewhere clearly. If we are stating that a single committer's +1 is good enough for code review, with 0 hours delay (de facto the current state), we should also write down that this is subject to the best judgement of the committer to respect the components expertise and ongoing development plans (also the de facto current state)."

"Adding the statement would help clarify the intention, but it may be a little difficult to enforce and follow.."

I would be fine with that, it's a soft/vague rule anyway, intended to be used with your "best judgement". I would like to just avoid a situation when someone violates the current uncodified state and refers to the bylaws, which say that merging with any committer's +1 is always fine (like my +1 for flink-python or flink-ml).

Piotrek


On 12 Jul 2019, at 11:29, Aljoscha Krettek <aljos...@apache.org> wrote:

@Piotr regarding the 3 days voting on the FLIP. This is just about the voting; before that there needs to be the discussion thread. If three days have passed on a vote and there is consensus (i.e. 3 committers/PMCs have voted +1) that seems a high enough bar for me. So far, we have rarely seen any FLIPs pass that formal bar.

According to the recent META-FLIP thread, we want to use "lazy majority" for the FLIP voting process. I think that should be changed from "consensus" in the proposed bylaws.

Regarding the current state: do we have a current state that we all agree on? I have the feeling that if we try to come up with something that reflects the common state, according to PMCs/committers, that might take a very long time. In that case I think it's better to adopt something that we all like, rather than trying to capture how we do it now.

Aljoscha


On 12. Jul 2019, at 11:07, Piotr Nowojski <pi...@ververica.com> wrote:

Hi,

Thanks for the proposal. Generally speaking, +1 from my side to the general idea and most of the content. I also see merit in Chesnay's proposal to start from the current state. I think either would be fine for me.

Couple of comments:

1. I also think that requiring a +1 from another committer would slow down and put even more strain on the current reviewing bottleneck that we are having. Even if the change is clear and simple, the context-switch cost is quite high, and that's just one less PR that the second "cross" committer could have reviewed somewhere else in that time. Besides, the current setup that we have (with no cross +1 from another committer required) works quite well, and I do not feel that it's causing trouble. On the other hand, the reviewing bottleneck is.

2. "I think a committer should know when to ask another committer for feedback or not."

I'm missing this stated somewhere clearly. If we are stating that a single committer's +1 is good enough for code review, with 0 hours delay (de facto the current state), we should also write down that this is subject to the best judgement of the committer to respect the components expertise and ongoing development plans (also the de facto current state).

3. A minimum length of 3 days for a FLIP I think currently might be problematic/too quick and can lead to problems if respected to the letter. Again, I think it depends highly on whether the committers with the highest expertise in the affected components managed to respond or not.

Piotrek


On 12 Jul 2019, at 09:42, Chesnay Schepler <ches...@apache.org> wrote:

I'm wondering whether we shouldn't first write down Bylaws that reflect the current state, and then have separate discussions for individual amendments. My gut f

Re: [VOTE] Release flink-shaded 8.0, release candidate #1

2019-08-28 Thread Chesnay Schepler

+1 (binding)

- asm/netty jars do not contain anything suspicious
- ran Flink tests (including e2e tests) with new netty/asm versions
- license files were updated accordingly
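The first check above (no unshaded classes leaking into the jars) is easy to automate by scanning each jar's entries for unexpected package roots. This is a generic sketch, not the script actually used for the release; the allowed prefixes are an assumption:

```python
import zipfile

def unrelocated_classes(jar_path, allowed_prefixes=("org/apache/flink/", "META-INF/")):
    """Return class entries that live outside the expected (relocated)
    package roots - anything listed here is 'suspicious'."""
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist()
                if name.endswith(".class")
                and not name.startswith(allowed_prefixes)]

# Example: build a toy jar with one relocated and one leaked class.
with zipfile.ZipFile("toy.jar", "w") as jar:
    jar.writestr("org/apache/flink/shaded/netty4/io/netty/buffer/ByteBuf.class", b"")
    jar.writestr("io/netty/buffer/ByteBuf.class", b"")  # leaked, unshaded

print(unrelocated_classes("toy.jar"))  # ['io/netty/buffer/ByteBuf.class']
```

An empty result for each flink-shaded artifact is the expected outcome of this check.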

On 28/08/2019 14:28, Till Rohrmann wrote:

+1 (binding)

Cheers,
Till

On Wed, Aug 28, 2019 at 11:53 AM Aljoscha Krettek 
wrote:


+1 (binding)

  - I verified the signature and checksum
  - I eyeballed the list of resolved issues
  - I checked the Maven Central artifacts

Aljoscha


On 23. Aug 2019, at 21:05, Chesnay Schepler  wrote:

Hi everyone,
Please review and vote on the release candidate #1 for the version 8.0,

as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org

[2], which are signed with the key with fingerprint 11d464BA [3],

* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-8.0-rc1" [5],
* website pull request listing the new release [6].

The vote will be open for at least 72 hours. It is adopted by majority

approval, with at least 3 PMC affirmative votes.

Thanks,
Chesnay

[1]

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345488

[2] https://dist.apache.org/repos/dist/dev/flink/flink-shaded-8.0-rc1/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4]

https://repository.apache.org/content/repositories/orgapacheflink-1237

[5] https://github.com/apache/flink-shaded/tree/release-8.0-rc1
[6] https://github.com/apache/flink-web/pull/255







[RESULT] [VOTE] flink-shaded 8.0, release candidate #1

2019-08-28 Thread Chesnay Schepler

I'm happy to announce that we have unanimously approved this release.
There are 3 approving votes, 3 of which are binding:
* Aljoscha
* Till
* Chesnay
There are no disapproving votes.
Thanks everyone!

On 23/08/2019 21:05, Chesnay Schepler wrote:

Hi everyone,
Please review and vote on the release candidate #1 for the version 
8.0, as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org 
[2], which are signed with the key with fingerprint 11d464BA [3],

* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-8.0-rc1" [5],
* website pull request listing the new release [6].

The vote will be open for at least 72 hours. It is adopted by majority 
approval, with at least 3 PMC affirmative votes.


Thanks,
Chesnay

[1] 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345488

[2] https://dist.apache.org/repos/dist/dev/flink/flink-shaded-8.0-rc1/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] 
https://repository.apache.org/content/repositories/orgapacheflink-1237

[5] https://github.com/apache/flink-shaded/tree/release-8.0-rc1
[6] https://github.com/apache/flink-web/pull/255






Re: [PROPOSAL] Force rebase on master before merge

2019-08-30 Thread Chesnay Schepler
I think this is a non-issue; every committer I know checks beforehand if 
the build passes.


Piotr has provided good arguments for why this approach isn't practical.
Additionally, there are simply technical limitations that prevent this 
from working as expected.


a) we cannot attach Travis checks via CiBot due to lack of permissions
b) It is not possible AFAIK to force a PR to be up-to-date with current 
master when Travis runs. In other words, I can open a PR, travis passes, 
and so long as no new merge conflicts arise I could _still_ merge it 2 
months later.
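For readers unfamiliar with the GitHub feature under discussion: the "require branches to be up to date before merging" toggle is the `strict` flag of required status checks in GitHub's branch-protection REST API (`PUT /repos/{owner}/{repo}/branches/{branch}/protection`). A minimal sketch of the payload such a call would carry — the status-check context is an illustrative assumption:

```python
import json

def protection_payload(required_contexts):
    """JSON body for PUT /repos/{owner}/{repo}/branches/{branch}/protection.
    'strict': True is the 'require branches to be up to date before merging'
    switch; the surrounding fields are mandatory in the API and are set to
    permissive defaults here."""
    return {
        "required_status_checks": {"strict": True, "contexts": required_contexts},
        "enforce_admins": False,
        "required_pull_request_reviews": None,
        "restrictions": None,
    }

payload = protection_payload(["continuous-integration/travis-ci/pr"])
print(json.dumps(payload["required_status_checks"]))
# {"strict": true, "contexts": ["continuous-integration/travis-ci/pr"]}
```

Note that with `strict` enabled GitHub only blocks the merge button for stale branches; it does not re-run CI, which is exactly the gap discussed in this thread.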


On 30/08/2019 10:34, Piotr Nowojski wrote:

Hi,

Thanks for the proposal. I have similar concerns as Kurt.

If we enforced such a rule I would be afraid that everybody would be waiting for 
tests on his PR to complete, racing other committers to be "the first guy that 
clicks the merge button", then forcing all of the others to rebase manually and 
race again. For example it wouldn't be possible to push a final version of the 
PR, wait for the tests to complete overnight and merge it the next day. Unless we 
would allow merging without a green travis after a final rebase, but that for 
me would be almost exactly what we have now.

Is this a big issue in the first place? I don’t feel it that way, but maybe I’m 
working in not very contested parts of the code?

If it’s an issue, I would suggest to go for the merging bot, that would have a 
queue of PRs to be:
1. Automatically rebased on the latest master
2. If no conflicts in 1., run the tests
3. If no test failures merge
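The three bot steps above could be sketched as a serial merge queue; `rebase_onto_master`, `run_tests`, and `merge` are placeholder hooks for whatever git/CI plumbing a real bot would use:

```python
def process_queue(queue, rebase_onto_master, run_tests, merge):
    """Serially apply the three steps to each queued PR:
    1. rebase onto the latest master (skip on conflicts),
    2. run the tests against the rebased branch,
    3. merge only if the tests pass."""
    merged, skipped = [], []
    for pr in queue:
        if not rebase_onto_master(pr):   # step 1: conflicts -> back to author
            skipped.append(pr)
        elif not run_tests(pr):          # step 2: failures -> back to author
            skipped.append(pr)
        else:
            merge(pr)                    # step 3
            merged.append(pr)
    return merged, skipped

# Toy run: PR 2 has rebase conflicts, PR 3 fails its tests.
merged, skipped = process_queue(
    [1, 2, 3],
    rebase_onto_master=lambda pr: pr != 2,
    run_tests=lambda pr: pr != 3,
    merge=lambda pr: None)
print(merged, skipped)  # [1] [2, 3]
```

The serial loop is what guarantees each merged PR was tested against the master it actually lands on — at the cost of queue latency, which is the trade-off raised in this thread.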

Piotrek


On 30 Aug 2019, at 09:38, Till Rohrmann  wrote:

Hi Tison,

thanks for starting this discussion. In general, I'm in favour of
automation that takes human mistakes out of the equation.

Do you know how these status checks work concretely? Will Github reject
commits for which there is no passing Travis run? How would hotfix commits
be distinguished from PR commits for which a Travis run should exist? So
I guess my question is how enabling the status checks would change how
committers interact with the Github repository?

Cheers,
Till

On Fri, Aug 30, 2019 at 4:46 AM Zili Chen  wrote:


Hi Kurt,

Thanks for your reply!

I find two concerns about the downside in your email. Correct
me if I'm misunderstanding.

1. Rebase times. Typically commits are independent of one another, so a rebase
just fast-forwards changes and contributors rarely have to resolve conflicts
themselves. Reviews don't get blocked by this forced rebase if there ever was
a green travis report -- we just require the contributor to rebase and test
again, which generally doesn't involve changes (unless resolving conflicts).
A contributor rebases his pull request when he has spare time, or when required
by a reviewer/before getting merged. This should not inflict too much work.

2. Testing time. It is a separate topic that is discussed in this thread[1].
I don't think we will ultimately live with a long testing time, so it won't be
a problem when we trigger multiple tests.

Simply summed up: for trivial cases, the work is trivial and it
prevents accidental failures; for complicated cases, it already requires
a rebase and full tests.

Best,
tison.

[1]

https://lists.apache.org/x/thread.html/b90aa518fcabce94f8e1de4132f46120fae613db6e95a2705f1bd1ea@%3Cdev.flink.apache.org%3E


Kurt Young wrote on Fri, Aug 30, 2019 at 9:15 AM:


Hi Zili,

Thanks for the proposal, I had similar confusion in the past with your
point #2.
Force rebase to master before merging can solve some problems, but it also
introduces a new problem. Given that the CI testing time is quite long (a
couple of hours) now, it's highly possible that before the test triggered by
your rebase finishes, master will get some more new commits. This situation
will get worse if more people are doing this. One possible solution is to let
the committer decide what to do before he/she merges. If it's a trivial
issue, just merging once travis passes is fine. But if it's a rather big one,
and some related code just got merged into master, I will choose to rebase
onto master and push it to my own repo to trigger my personal CI test on it,
because this can guarantee the testing time.

To summarize: I think this should be decided by the committer who is
merging the PR, not be forced.

Best,
Kurt


On Thu, Aug 29, 2019 at 11:07 PM Zili Chen  wrote:


Hi devs,

GitHub provides a mechanism to require branches to be up to date before
they are merged[1] (point 6). I can see several advantages to enabling it,
and thus propose that our project turn on this switch. Below are
my concerns; looking forward to your insights.

1. Avoid CI failures in a PR that are fixed by another commit. We currently
merge a pull request even if CI fails, as long as the failures are known
flaky tests. Turning on the switch doesn't resolve this, but it helps to find
any other potentially valid failures.

2. Avoid CI failures on master after a pull request is merged. Actually, CI
tests exactly the branch that the pull request is bound to. Even if it gave
green it is still

[ANNOUNCE] Apache Flink-shaded 8.0 released

2019-08-30 Thread Chesnay Schepler
The Apache Flink community is very happy to announce the release of 
Apache Flink-shaded 8.0.


The flink-shaded project contains a number of shaded dependencies for 
Apache Flink.


Apache Flink® is an open-source stream processing framework for 
distributed, high-performing, always-available, and accurate data 
streaming applications.


The release is available for download at:
https://flink.apache.org/downloads.html

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345488

We would like to thank all contributors of the Apache Flink community 
who made this release possible!


Regards,
Chesnay



State of FLIPs

2019-08-30 Thread Chesnay Schepler
The following FLIPs are marked as "Under discussion" in the wiki 
, 
but actually seem to be in progress (i.e. have open pull requests) and 
some even  have code merged to master:


 * FLIP-36 (Interactive Programming)
 * FLIP-38 (Python Table API)
 * FLIP-44 (Support Local Aggregation)
 * FLIP-50 (Spill-able Heap Keyed State Backend)

I would like to find out what the _actual_ state is, and then discuss 
how we handle these FLIPs from now on (e.g., retcon history and mark 
them as accepted, freeze further development until a vote, ...).


I've cc'd all people who create the wiki pages for said FLIPs.




Re: [DISCUSS] Reducing build times

2019-09-04 Thread Chesnay Schepler
rate this, as well as sponsors for such an
infrastructure.

[1] https://docs.travis-ci.com/user/reference/overview/


On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler 

wrote:

@Aljoscha Shading takes a few minutes for a full build; you can see this
quite easily by looking at the compile step in the misc profile
<https://api.travis-ci.org/v3/job/572560060/log.txt>; all modules that take
longer than a fraction of a second are usually affected by shading lots
of classes. Note that I cannot tell you how much of this is spent on
relocations, and how much on writing the jar.

Personally, I'd very much like us to move all shading to flink-shaded;
this would finally allow us to use newer maven versions without needing
cumbersome workarounds for flink-dist. However, this isn't a trivial
affair in some cases; IIRC calcite could be difficult to handle.

On another note, this would also simplify switching the main repo to
another build system, since you would no longer have to deal with
relocations, just packaging + merging NOTICE files.

@BowenLi I disagree, flink-shaded does not include any tests,  API
compatibility checks, checkstyle, layered shading (e.g., flink-runtime
and flink-dist, where both relocate dependencies and one is bundled by
the other), and, most importantly, CI (and really, without CI being
covered in a PoC there's nothing to discuss).

On 16/08/2019 15:13, Aljoscha Krettek wrote:

Speaking of flink-shaded, do we have any idea what the impact of shading is on the build time? We could get rid of shading completely in the Flink main repository by moving everything that we shade to flink-shaded.

Aljoscha


On 16. Aug 2019, at 14:58, Bowen Li  wrote:

+1 to Till's points on #2 and #5, especially the potential non-disruptive,
gradual migration approach if we decide to go that route.

To add on, I want to point out that we can actually start with the
flink-shaded project [1], which is a perfect candidate for a PoC. It's of
much smaller size, totally isolated from and not interfered with by the flink
project [2], and it actually covers most of our practical feature
requirements for a build tool - all making it an ideal experimental field.

[1] https://github.com/apache/flink-shaded
[2] https://github.com/apache/flink


On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann wrote:

For the sake of keeping the discussion focused and not cluttering the
discussion thread, I would suggest splitting the detailed reporting for
reusing JVMs into a separate thread and cross-linking it from here.

Cheers,
Till

On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <ches...@apache.org> wrote:


Update:

TL;DR: table-planner is a good candidate for enabling fork reuse right
away, while flink-tests has the potential for huge savings, but we have
to figure out some issues first.
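For context, "fork reuse" refers to Maven Surefire's `reuseForks` setting: enabling it keeps forked test JVMs alive across test classes instead of spawning a fresh JVM per class. A hedged sketch of the relevant configuration (`forkCount`/`reuseForks` are the real Surefire parameter names; the values are assumptions, not the exact Flink pom):

```xml
<!-- Illustrative surefire excerpt: one forked JVM per CPU core,
     reused across test classes instead of restarted per class. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <forkCount>1C</forkCount>
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>
```

The trade-off is that reused JVMs can leak state (threads, static fields) between test classes, which is exactly what the producer-leak failures in the kafka profiles suggest.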


Build link: https://travis-ci.org/zentol/flink/builds/572659220

4/8 profiles failed.

No speedup in libraries, python, blink_planner, 7 minutes saved in
libraries (table-planner).

The kafka and connectors profiles both fail in kafka tests due to
producer leaks, and no speed up could be confirmed so far:

java.lang.AssertionError: Detected producer leak. Thread name:
kafka-producer-network-thread | producer-239
    at org.junit.Assert.fail(Assert.java:88)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)

The tests profile failed due to various errors in migration tests:

junit.framework.AssertionFailedError: Did not see the expected accumulator
results within time limit.
    at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)

*However*, a normal tests run takes 40 minutes, while this one above
failed after 19 minutes and is only missing the migration tests (which
currently need 6-7 minutes). So we could save somewhere between 15 to 20
minutes here.


Finally, the misc profile fails in YARN:

java.lang.AssertionError
    at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)

No significant speedup could be observed in other modules; for
flink-yarn-tests we can maybe get a minute or 2 out of it.

On 16/08/2019 10:43, Chesnay Schepler wrote:

There appears to be a general agreement that 1) should be looked into;
I've set up a branch with fork reuse being enabled for all tests; will
report back the results.

On 15/08/2019 09:38, Chesnay Schepler wrote:

Hello everyone,

improving our build times is a hot topic at the moment, so let's
discuss the different ways they could be reduced.


   Current state:

First up, let's look at some numbers:

1 full build currently consumes 5h of build time total ("total
time"), a

Re: [DISCUSS] Contribute Pulsar Flink connector back to Flink

2019-09-04 Thread Chesnay Schepler

I'm quite worried that we may end up repeating history.

There were already 2 attempts at contributing a pulsar connector, both 
of which failed because no committer was getting involved, despite the 
contributor opening a dedicated discussion thread about the contribution 
beforehand and getting several +1's from committers.


We should really make sure that if we welcome/approve such a 
contribution it will actually get the attention it deserves.


As such, I'm inclined to recommend maintaining the connector outside of 
Flink. We could link to it from the documentation to give it more exposure.
With the upcoming page for sharing artifacts among the community (what's 
the state of that anyway?), this may be a better option.


On 04/09/2019 10:16, Till Rohrmann wrote:

Hi everyone,

thanks a lot for starting this discussion Yijie. I think the Pulsar
connector would be a very valuable addition since Pulsar becomes more and
more popular and it would further expand Flink's interoperability. Also
from a project perspective it makes sense for me to place the connector in
the downstream project.

My main concern/question is how can the Flink community maintain the
connector? We have seen in the past that connectors are some of the most
actively developed components because they need to be kept in sync with the
external system and with Flink. Given that the Pulsar community is willing
to help with maintaining, improving and evolving the connector, I'm
optimistic that we can achieve this. Hence, +1 for contributing it back to
Flink.

Cheers,
Till



On Wed, Sep 4, 2019 at 2:03 AM Sijie Guo  wrote:


Hi Yun,

Since I was the main driver behind FLINK-9641 and FLINK-9168, let me try to
add more context on this.

FLINK-9641 and FLINK-9168 were created for bringing Pulsar in as a source
and sink for Flink. The integration was done with Flink 1.6.0. We sent out
pull requests about a year ago and we ended up maintaining those connectors
in Pulsar for Pulsar users to use Flink to process event streams in Pulsar.
(See https://github.com/apache/pulsar/tree/master/pulsar-flink). The Flink
1.6 integration is pretty simple and there are no schema considerations.

In the past year, we have made a lot of changes in Pulsar and brought
Pulsar schema as the first-class citizen in Pulsar. We also integrated with
other computing engines for processing Pulsar event streams with Pulsar
schema.

It led us to rethink how to integrate with Flink in the best way. We then
reimplemented the pulsar-flink connectors from the ground up with schema,
and brought the Table API and Catalog API in as first-class citizens in the
integration. With that being said, in the new pulsar-flink implementation,
you can register Pulsar as a Flink catalog and query / process the event
streams using Flink SQL.

This is an example of how to use Pulsar as a Flink catalog:

https://github.com/streamnative/pulsar-flink/blob/3eeddec5625fc7dddc3f8a3ec69f72e1614ca9c9/README.md#use-pulsar-catalog

Yijie has also written a blog post explaining why we re-implemented the
flink connector with Flink 1.9 and what changes we made in the new
connector:

https://medium.com/streamnative/use-apache-pulsar-as-streaming-table-with-8-lines-of-code-39033a93947f

We believe Pulsar is not just a simple data sink or source for Flink. It
actually can be a fully integrated streaming data storage for Flink in many
areas (sink, source, schema/catalog and state). The combination of Flink
and Pulsar can create a great streaming warehouse architecture for
streaming-first, unified data processing. Since we are talking to
contribute Pulsar integration to Flink here, we are also dedicated to
maintain, improve and evolve the integration with Flink to help the users
who use both Flink and Pulsar.

Hope this gives you a bit more background about the Pulsar Flink
integration. Let me know your thoughts.

Thanks,
Sijie


On Tue, Sep 3, 2019 at 11:54 AM Yun Tang  wrote:


Hi Yijie

I can see that Pulsar has become more and more popular recently, and I'm
very glad to see more people willing to contribute to the Flink ecosystem.

Before any further discussion, could you please give some explanation of
the relationship between this thread and the existing JIRAs for the Pulsar
source [1] and sink [2] connectors? Will the contribution contain parts of
those PRs or a totally different implementation?

[1] https://issues.apache.org/jira/browse/FLINK-9641
[2] https://issues.apache.org/jira/browse/FLINK-9168

Best
Yun Tang

From: Yijie Shen 
Sent: Tuesday, September 3, 2019 13:57
To: dev@flink.apache.org 
Subject: [DISCUSS] Contribute Pulsar Flink connector back to Flink

Dear Flink Community!

I would like to open the discussion of contributing Pulsar Flink
connector [0] back to Flink.

## A brief introduction to Apache Pulsar

Apache Pulsar[1] is a multi-tenant, high-performance distributed
pub-sub messaging system. Pulsar includes multiple features such as
native support for multiple cluster

Re: [VOTE] FLIP-61 Simplify Flink's cluster level RestartStrategy configuration

2019-09-04 Thread Chesnay Schepler

+1 (binding)

On 04/09/2019 11:13, Zhu Zhu wrote:

+1 (non-binding)

Thanks,
Zhu Zhu

Till Rohrmann wrote on Wednesday, September 4, 2019 at 5:05 PM:


Hi everyone,

I would like to start the voting process for FLIP-61 [1], which is
discussed and reached consensus in this thread [2].

Since the change is rather small I'd like to shorten the voting period to
48 hours. Hence, I'll try to close it September 6th, 11:00 am CET, unless
there is an objection or not enough votes.

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-61+Simplify+Flink%27s+cluster+level+RestartStrategy+configuration
[2]

https://lists.apache.org/thread.html/e206390127bcbd9b24d9c41a838faa75157e468e01552ad241e3e24b@%3Cdev.flink.apache.org%3E

Cheers,
Till





Re: [VOTE] FLIP-62: Set default restart delay for FixedDelay- and FailureRateRestartStrategy to 1s

2019-09-04 Thread Chesnay Schepler

+1 (binding)

On 04/09/2019 11:18, JingsongLee wrote:

+1 (non-binding)

the default of 0 is really not friendly for production users.

Best,
Jingsong Lee


--
From: Zhu Zhu
Send Time: September 4, 2019 (Wednesday) 17:13
To: dev
Subject: Re: [VOTE] FLIP-62: Set default restart delay for FixedDelay- and
FailureRateRestartStrategy to 1s

+1 (non-binding)

Thanks,
Zhu Zhu

Till Rohrmann wrote on Wednesday, September 4, 2019 at 5:06 PM:


Hi everyone,

I would like to start the voting process for FLIP-62 [1], which
is discussed and reached consensus in this thread [2].

Since the change is rather small I'd like to shorten the voting period to
48 hours. Hence, I'll try to close it September 6th, 11:00 am CET, unless
there is an objection or not enough votes.

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-62%3A+Set+default+restart+delay+for+FixedDelay-+and+FailureRateRestartStrategy+to+1s
[2]

https://lists.apache.org/thread.html/9602b342602a0181fcb618581f3b12e692ed2fad98c59fd6c1caeabd@%3Cdev.flink.apache.org%3E

Cheers,
Till





Re: [DISCUSS] FLIP-62: Set default restart delay for FixedDelay- and FailureRateRestartStrategy to 1s

2019-09-04 Thread Chesnay Schepler
The issue we seem to run into again and again is that we want to find a
value that provides a good experience when trying out Flink, but is also
somewhat usable for production users.
We should look into solutions for this; maybe having a "recommended"
value in the docs would help sufficiently, or even configuration profiles
for Flink ("dev"/"production") which influence the default values.


On 03/09/2019 11:41, Till Rohrmann wrote:

Hi everyone,

I'd like to discuss changing the default restart delay for FixedDelay- and
FailureRateRestartStrategy to "1 s" [1].

According to a user survey about the default value of the restart delay
[2], it turned out that the current default value of "0 s" is not optimal.
In practice Flink users tend to set it to a non-zero value (e.g. "10 s") in
order to prevent restart storms originating from overloaded external
systems.

I would like to set the default restart delay of the
FixedDelayRestartStrategy ("restart-strategy.fixed-delay.delay") and of the
FailureRateRestartStrategy ("restart-strategy.failure-rate.delay") to "1
s". "1 s" should prevent restart storms originating from causes outside of
Flink (e.g. overloaded external systems) and still be fast enough to not
have a noticeable effect on most Flink deployments.
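Expressed as configuration, the proposal amounts to changing the defaults of two existing keys; an illustrative flink-conf.yaml fragment (the attempts value is an arbitrary example — only the two delay keys are the subject of this FLIP):

```yaml
# Fixed-delay strategy: wait 1 s (the proposed new default) between restarts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 1 s

# The failure-rate strategy has the analogous knob:
# restart-strategy.failure-rate.delay: 1 s
```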

However, this change will affect all users who currently rely on the
current default restart delay value ("0 s"). The plan is to add a release
note to make these users aware of this change when upgrading Flink.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-62%3A+Set+default+restart+delay+for+FixedDelay-+and+FailureRateRestartStrategy+to+1s
[2]
https://lists.apache.org/thread.html/107b15de6b8ac849610d99c4754715d2a8a2f32ddfe9f8da02af2ccc@%3Cdev.flink.apache.org%3E

Cheers,
Till





Re: [DISCUSS] Reducing build times

2019-09-04 Thread Chesnay Schepler
e2e tests on Travis add another 4-5 hours, but we never optimized these 
to make use of the cached Flink artifact.


On 04/09/2019 13:26, Till Rohrmann wrote:

How long do we need to run all e2e tests? They are not included in the 3.5
hours, I assume.

Cheers,
Till

On Wed, Sep 4, 2019 at 11:59 AM Robert Metzger  wrote:


Yes, we can ensure the same (or better) experience for contributors.

On the powerful machines, builds finish in 1.5 hours (without any caching
enabled).

Azure Pipelines offers 10 concurrent builds with a timeout of 6 hours for a
build for open source projects. Flink needs 3.5 hours on that infra (not
parallelized at all, no caching). These free machines are very similar to
those of Travis, so I expect no build time regressions, if we set it up
similarly.


On Wed, Sep 4, 2019 at 9:19 AM Chesnay Schepler 
wrote:


Will using more powerful machines for the project make it more difficult
to ensure that contributor builds still run in a reasonable time?

As an example of this happening on Travis, contributors currently cannot
run all e2e tests since they time out, but on apache we have a larger
timeout.

On 03/09/2019 18:57, Robert Metzger wrote:

Hi all,

I wanted to give a short update on this:
- Arvid, Aljoscha and I have started working on a Gradle PoC, currently
working on making all modules compile and test with Gradle. We've also
identified some problematic areas (shading being the most obvious one)
which we will analyse as part of the PoC.
The goal is to see how much Gradle helps to parallelise our build, and to
avoid duplicate work (incremental builds).

- I am working on setting up a Flink testing infrastructure based on
Azure Pipelines, using more powerful hardware. Alibaba kindly provided
me with two 32 core machines (temporarily), and another company reached
out to me privately, looking into options for cheap, fast machines :)
If nobody in the community disagrees, I am going to set up Azure
Pipelines with our apache/flink GitHub as a build infrastructure that
exists next to Flinkbot and flink-ci. I would like to make sure that
Azure Pipelines is equally or even more reliable than Travis, and I want
to see what the required maintenance work is.
On top of that, Azure Pipelines is a very feature-rich tool with a lot
of nice options for us to improve the build experience (statistics about
tests (flaky tests etc.), nice docker support, plenty of free build
resources for open source projects, ...)

Best,
Robert





On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger wrote:

Hi all,

I have summarized all arguments mentioned so far + some additional
research into a Wiki page here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279

I'm happy to hear further comments on my summary! I'm pretty sure we can
find more pros and cons for the different options.

My opinion after looking at the options:

 - Flink relies on an outdated build tool (Maven), while a good
   alternative is well-established (gradle), and will likely provide a
   much better CI and local build experience through incremental build
   and cached intermediates.
   Scripting around Maven, or splitting modules / test execution /
   repositories, won't solve this problem. We should rather spend the
   effort in migrating to a modern build tool which will provide us
   benefits in the long run.
 - Flink relies on a fairly slow build service (Travis CI), while simply
   putting more money onto the problem could cut the build time at least
   in half.
   We should consider using a build service that provides bigger machines
   to solve our build time problem.

My opinion is based on many assumptions (Gradle is actually as fast as
promised (I haven't used it before), we can build Flink with Gradle, we
find sponsors for bigger build machines) that we need to test first
through PoCs.

Best,
Robert




On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek <aljos...@apache.org> wrote:


I did a quick test: a normal "mvn clean install -DskipTests
-Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my
machine takes about 14 minutes. After removing all mentions of
maven-shade-plugin the build time goes down to roughly 11.5 minutes.
(Obviously the resulting Flink won't work, because some expected stuff is
not packaged, and most of the end-to-end tests use the shade plugin to
package the jars for testing.)


On 18. Aug 2019, at 19:52, Robert Metzger wrote:

Hi all,

I wanted to understand the impact of the hardware we are using for
running our tests. Each Travis worker has 2 virtual cores and 7.5 GB
memory [1]. They are using Google Cloud Compute Engine *n1-standard-2*
instances.

Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
Running the same workload on a 32 virtual cores, 64 GB machine takes
*1:21 h*.

What is interesting 

Re: Fine grained batch recovery vs. native libraries

2019-09-04 Thread Chesnay Schepler

This sounds like a serious bug, please open a JIRA ticket.

On 04/09/2019 13:41, David Morávek wrote:

Hi,

we're testing the newly released batch recovery and are running into class
loading related issues.

1) We have a per-job flink cluster
2) We use BATCH execution mode + region failover strategy

Point 1) should imply a single user code class loader per task manager
(because there is only a single pipeline, which reuses the class loader
cached in BlobLibraryCacheManager). We need this property because we have
UDFs that access C libraries using JNI (I think this may be a fairly
common use-case when dealing with legacy code). JDK internals make sure
that a single library can only be loaded by a single class loader per JVM.

When region recovery is triggered, vertices that need recovery are first
reset back to the CREATED state and then rescheduled. In case all tasks in
a task manager are reset, this results in the cached class loader being
released. This unfortunately causes job failure, because we try to reload
a native library in a newly created class loader.

I know that there is always the possibility to distribute native libraries
with Flink's libs and load them using the system class loader, but this
introduces a build & operations overhead and just makes it really
unfriendly for cluster users, so I'd rather not work around the issue this
way (a per-job cluster should be more user friendly).

I believe the correct approach would be not to release the cached class
loader if the job is recovering, even though there are no tasks currently
registered with the TM.

What do you think? Thanks for help.

D.





Re: [DISCUSS] Reducing build times

2019-09-05 Thread Chesnay Schepler
e 30 minutes on the big machine (while 31 CPUs are idling :) )

Let me know what you think about these results. If the community is
generally interested in further investigating that direction, I could
look into software to orchestrate this, as well as sponsors for such an
infrastructure.

[1] https://docs.travis-ci.com/user/reference/overview/


On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler wrote:

[...]

[ANNOUNCE] Java 11 cron builds activated on master

2019-09-05 Thread Chesnay Schepler

Hello everyone,

I just wanted to inform everyone that we now run Java 11 builds on 
Travis as part of the cron jobs, subsuming the existing Java 9 tests. 
All existing Java 9 build/test infrastructure has been removed.


If you spot any test failures that appear to be specific to Java 11, 
please add a sub-task to FLINK-10725.


I would also encourage everyone to try out Java 11 for local development 
and usage, so that we can find pain points in the dev and user experience.




Re: Is Flink documentation deployment script broken ?

2019-09-06 Thread Chesnay Schepler

The scripts are fine, but the buildbot slave is currently down.

I've already opened a ticket with INFRA: 
https://issues.apache.org/jira/browse/INFRA-18986


On 06/09/2019 11:44, Jark Wu wrote:

Hi all,

I merged several documentation pull requests[1][2][3] days ago.
AFAIK, the documentation deployment is scheduled every day.
However, I didn't see the changes are available in the Flink doc website[4]
until now.
The same to Till's PR[5] merged 3 days ago.


Best,
Jark

[1]: https://github.com/apache/flink/pull/9545
[2]: https://github.com/apache/flink/pull/9511
[3]: https://github.com/apache/flink/pull/9525
[4]: https://ci.apache.org/projects/flink/flink-docs-master/
[5]: https://github.com/apache/flink/pull/9571





[DISCUSS] FLIP-67: Global partitions lifecycle

2019-09-06 Thread Chesnay Schepler

Hello,

FLIP-36 (interactive programming) proposes a new programming paradigm
where jobs are built incrementally by the user.


To support this in an efficient manner, I propose to extend the partition
life-cycle to support the notion of /global partitions/, which are
partitions that can exist beyond the life-time of a job.

These partitions could then be re-used by subsequent jobs in a fairly
efficient manner, as they don't have to be persisted to an external
storage first, and consuming tasks could be scheduled to exploit
data-locality.


The FLIP outlines the required changes on the JobMaster, TaskExecutor 
and ResourceManager to support this from a life-cycle perspective.


This FLIP does /not/ concern itself with the /usage/ of global 
partitions, including client-side APIs, job-submission, scheduling and 
reading said partitions; these are all follow-ups that will either be 
part of FLIP-36 or spliced out into separate FLIPs.




Re: [Discussion] - Release major Flink version to support JDK 17 (LTS)

2023-04-24 Thread Chesnay Schepler

As it turns out Kryo isn't a blocker; we ran into a JDK bug.

On 31/03/2023 08:57, Chesnay Schepler wrote:

https://github.com/EsotericSoftware/kryo/wiki/Migration-to-v5#migration-guide

Kryo themselves state that v5 likely can't read v2 data.

However, both versions can be on the classpath without conflict, as v5
offers a versioned artifact that includes the version in the package.
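For reference, the versioned artifact works by placing Kryo 5 under a version-qualified package, so both serializers can coexist in one JVM; a sketch of the Maven coordinates (taken from the Kryo migration guide linked above — the exact version numbers are illustrative assumptions):

```xml
<!-- Classic Kryo v2, classes under com.esotericsoftware.kryo. -->
<dependency>
  <groupId>com.esotericsoftware.kryo</groupId>
  <artifactId>kryo</artifactId>
  <version>2.24.0</version>
</dependency>
<!-- Kryo v5 "versioned" artifact: classes live under
     com.esotericsoftware.kryo.kryo5, so they do not clash with v2. -->
<dependency>
  <groupId>com.esotericsoftware.kryo</groupId>
  <artifactId>kryo5</artifactId>
  <version>5.5.0</version>
</dependency>
```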


It probably wouldn't be difficult to migrate a savepoint to Kryo v5, 
purely from a read/write perspective.


The bigger question is how we expose this new Kryo version in the API. 
If we stick to the versioned jar we need to either duplicate all 
current Kryo-related APIs or find a better way to integrate other 
serialization stacks.


On 30/03/2023 17:50, Piotr Nowojski wrote:

Hey,

> 1. The Flink community agrees that we upgrade Kryo to a later 
version, which means breaking all checkpoint/savepoint compatibility 
and releasing a Flink 2.0 with Java 17 support added and Java 8 and 
Flink Scala API support dropped. This is probably the quickest way, 
but would still mean that we expose Kryo in the Flink APIs, which is 
the main reason why we haven't been able to upgrade Kryo at all.


This sounds pretty bad to me.

Has anyone looked into what it would take to provide a smooth 
migration from Kryo2 -> Kryo5?


Best,
Piotrek

On Thu, Mar 30, 2023 at 16:54 Alexis Sarda-Espinosa wrote:


Hi Martijn,

just to be sure, if all state-related classes use a POJO
serializer, Kryo will never come into play, right? Given
FLINK-16686 [1], I wonder how many users actually have jobs with
Kryo and RocksDB, but even if there aren't many, that still
leaves those who don't use RocksDB for checkpoints/savepoints.

If Kryo were to stay in the Flink APIs in v1.X, is it impossible
to let users choose between v2/v5 jars by separating them like
log4j2 jars?

[1] https://issues.apache.org/jira/browse/FLINK-16686

Regards,
Alexis.

On Thu, Mar 30, 2023 at 14:26, Martijn Visser wrote:

Hi all,

I also saw a thread on this topic from Clayton Wohl [1] on
this topic, which I'm including in this discussion thread to
avoid that it gets lost.

From my perspective, there's two main ways to get to Java 17:

1. The Flink community agrees that we upgrade Kryo to a later
version, which means breaking all checkpoint/savepoint
compatibility and releasing a Flink 2.0 with Java 17 support
added and Java 8 and Flink Scala API support dropped. This is
probably the quickest way, but would still mean that we
expose Kryo in the Flink APIs, which is the main reason why
we haven't been able to upgrade Kryo at all.
2. There's a contributor who makes a contribution that bumps
Kryo, but either a) automagically reads in all old
checkpoints/savepoints in using Kryo v2 and writes them to
new snapshots using Kryo v5 (like is mentioned in the Kryo
migration guide [2][3] or b) provides an offline tool that
allows users that are interested in migrating their snapshots
manually before starting from a newer version. That
potentially could prevent the need to introduce a new Flink
major version. In both scenarios, ideally the contributor
would also help with avoiding the exposure of Kryo so that we
will be in a better shape in the future.

It would be good to get the opinion of the community for
either of these two options, or potentially for another one
that I haven't mentioned. If it appears that there's an
overall agreement on the direction, I would propose that a
FLIP gets created which describes the entire process.

Looking forward to the thoughts of others, including the
Users (therefore including the User ML).

Best regards,

Martijn

[1]
https://lists.apache.org/thread/qcw8wy9dv8szxx9bh49nz7jnth22p1v2
[2]
https://lists.apache.org/thread/gv49jfkhmbshxdvzzozh017ntkst3sgq
[3] https://github.com/EsotericSoftware/kryo/wiki/Migration-to-v5

On Sun, Mar 19, 2023 at 8:16 AM Tamir Sagi
 wrote:

I agree, there are several options to mitigate the migration from v2 to
v5. Yet, Oracle's roadmap is to end JDK 11 support in September this
year.




From: ConradJam 
Sent: Thursday, March 16, 2023 4:36 AM
To: dev@flink.apache.org 
Subject: Re: [Discussion] - Release major Flink version
to support JDK 17 (LTS)

EXTERNAL EMAIL



Thanks for starting this discussion.

I have been tracking this problem for a long time, until I saw a
conversation in I

Re: [DISCUSS] Planning Flink 2.0

2023-04-25 Thread Chesnay Schepler

This is definitely a good discussion to have.

Some thoughts:

One aspect that wasn't mentioned is what this release means going
forward. I already waited a decade for 2.0; I don't really want to wait
another one to see Flink 3.0.
We should discuss how regularly we will ship major releases from now on
(e.g., every 2 years or something). Let's avoid again making breaking
changes because we "gotta do it now because 3.0 isn't happening anytime
soon".

Related to that we need to figure out how long 1.x will be supported and 
in what way (features+patches vs only patches).


The timeline/branch/release-manager bits sound good to me.

> /There are also opinions that we should stay focused as much as
possible on the breaking changes only. Incremental / non-breaking
improvements and features, or anything that can be added in 2.x minor
releases, should not block the 2.0 release./


I would definitely agree with this. I'd much rather focus on resolving 
technical debt and setting us up for improvements later than trying to 
tackle both at the same time.
The "marketing perspective" of having big key features to me just 
doesn't make sense considering what features we shipped with 1.x 
releases in the past years.

If that means 2.0 comes along faster, then that's a bonus in my book.
We may of course ship features (e.g., Java 17 which basically comes for 
free if we drop the Scala APIs), but they shouldn't be a focus.


> /With breaking API changes, we may need multiple 2.0-alpha/beta
versions to collect feedback./


Personally I wouldn't even aim for one big 2.0 release. I think that will
become quite a mess and very difficult to actually get feedback on.
My thinking goes rather in the direction of defining milestone releases,
each milestone targeting specific changes.
For example, one milestone could clean up the REST API (+ X, Y, Z), while
another removes deprecated APIs, etc.

Depending on the scope we could iterate quite fast on these.
(Note that I haven't thought this through yet from the dev workflow 
perspective, but it'd likely require longer-living feature branches)


There are some clear benefits to this approach; if we'd drop deprecated 
APIs in M1 then we could already offer users a version of Flink that 
works with Java 17.


On 25/04/2023 13:09, Xintong Song wrote:

Hi everyone,

I'd like to start a discussion on planning for a Flink 2.0 release.

AFAIK, in the past years this topic has been mentioned from time to time,
in mailing lists, jira tickets and offline discussions. However, few
concrete steps have been taken, due to the significant determination and
efforts it requires and distractions from other prioritized focuses. After
a series of offline discussions in the recent weeks, with folks mostly from
our team internally as well as a few from outside Alibaba / Ververica
(thanks for insights from Becket and Robert), we believe it's time to kick
this off in the community.

Below are some of our thoughts about the 2.0 release. Looking forward to
your opinions and feedback.


## Why plan for release 2.0?


Flink 1.0.0 was released in March 2016. In the past 7 years, many new
features have been added and the project has become different from what it
used to be. So what is Flink now? What will it become in the next 3-5
years? What do we think of Flink's position in the industry? We believe
it's time to rethink these questions, and draw a roadmap towards another
milestone, a milestone that warrants a new major release.


In addition, we are still providing backwards compatibility (maybe not
perfectly but largely) with APIs that we designed and claimed stable 7
years ago. While such backwards compatibility helps users to stick with the
latest Flink releases more easily, it sometimes, and more and more over
time, also becomes a burden for maintenance and a limitation for new
features and improvements. It's probably time to have a comprehensive
review and clean-up over all the public APIs.


Furthermore, next year is the 10th year for Flink as an Apache project.
Flink joined the Apache incubator in April 2014, and became a top-level
project in December 2014. That makes 2024 a perfect time for bringing out
the release 2.0 milestone. And for such a major release, we'd expect it
takes one year or even longer to prepare for, which means we probably
should start now.


## What should we focus on in release 2.0?


- Roadmap discussion - How do we define and position Flink for now and
in future? This is probably something we lacked. I believe some people have
thought about it, but at least it's not explicitly discussed and aligned in
the community. Ideally, the 2.0 release should be a result of the roadmap.
- Breaking changes - Important improvements, bugfixes, technical debts
that involve breaking of API backwards compatibility, which can only be
carried out in major releases.
   - With breaking API changes, we may need multiple 2.0-alpha/beta
   versions to collect f

Re: [DISCUSS] Preventing Mockito usage for the new code with Checkstyle

2023-04-25 Thread Chesnay Schepler

The checkstyle rule would just ban certain imports.
We'd add exclusions for all existing usages as we did when introducing 
other rules.

So far we usually disabled checkstyle rules for specific files.
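For illustration, a sketch of what such an import ban could look like in a Checkstyle configuration (module names are standard Checkstyle; the suppression file name and test file names are placeholders):

```xml
<!-- In checkstyle.xml, inside the TreeWalker: ban Mockito imports. -->
<module name="IllegalImport">
  <property name="illegalPkgs" value="org.mockito"/>
</module>
<!-- Also catch fully-qualified usages that skip the import statement. -->
<module name="RegexpSinglelineJava">
  <property name="format" value="org\.mockito\."/>
  <property name="message"
            value="Avoid Mockito; use reusable test implementations instead."/>
</module>

<!-- In suppressions.xml: grandfather existing usages per file. -->
<suppress files="SomeLegacyTest\.java"
          checks="IllegalImport|RegexpSinglelineJava"/>
```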

On 25/04/2023 16:34, Piotr Nowojski wrote:

+1 to the idea.

How would this checkstyle rule work? Are you suggesting to start with a
number of exclusions? On what level will those exclusions be? Per file? Per
line?

Best,
Piotrek

wt., 25 kwi 2023 o 13:18 David Morávek  napisał(a):


Hi Everyone,

A long time ago, the community decided not to use Mockito-based tests
because those are hard to maintain. This is already baked in our Code Style
and Quality Guide [1].

Because we still have Mockito imported into the code base, it's very easy
for newcomers to unconsciously introduce new tests violating the code style
because they're unaware of the decision.

I propose to prevent Mockito usage with a Checkstyle rule for a new code,
which would eventually allow us to eliminate it. This could also prevent
some wasted work and unnecessary feedback cycles during reviews.

WDYT?

[1]

https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#avoid-mockito---use-reusable-test-implementations

Best,
D.





Re: [DISCUSS] Planning Flink 2.0

2023-04-26 Thread Chesnay Schepler
> /Instead of defining compatibility guarantees as "this API won't 
change in all 1.x/2.x series", what if we define it as "this API won't 
change in the next 2/3 years"./


I can see some benefits to this approach (all APIs having a fixed 
minimum lifetime) but it's just gonna be difficult to communicate. 
Technically this implies that every minor release may contain breaking 
changes, which is exactly what users don't want.


What problems do you see in creating major releases every N years?

> /IIUC, the milestone releases are a breakdown of the 2.0 release, 
while we are free to introduce breaking changes between them. And you 
suggest using longer-living feature branches to keep the master branch 
in a releasable state (in terms of milestone releases). Am I 
understanding it correctly?/


I think you got the general idea. There are a lot of details to be 
ironed out though (e.g., do we release connectors for each milestone?...).


Conflicts in the long-lived branches are certainly a concern, but I 
think those will be inevitable. Right now I'm not _too_ worried about 
them, at least based on my personal wish-list.
Maybe the milestones could even help with that, as we could preemptively 
decide on an order for certain changes that have a high chance of 
conflicting with each other?

I guess we could do that anyway.
Maybe we should explicitly evaluate how invasive a change is (in 
relation to other breaking changes!) and manage things accordingly



Other thoughts:

We need to figure out what this release means for connectors 
compatibility-wise. The current rules for which versions a connector 
must support don't cover major releases at all.
(This depends a bit on the scope of 2.0; if we add binary compatibility 
to Public APIs and promote a few Evolving ones then compatibility across 
minor releases becomes trivial)


What process are you thinking of for deciding what breaking changes to 
make? The obvious choice would be FLIPs, but I'm worried that this will 
overload the mailing list / wiki for lots of tiny changes.


Provided that we agree on doing 2.0, when would we cut the 2.0 branch? 
Would we wait a few months for people to prepare/agree on changes so we 
reduce the time we need to merge things into 2 branches?


On 26/04/2023 05:51, Xintong Song wrote:

Thanks all for the positive feedback.

@Martijn

If we want to have that roadmap, should we consolidate this into a

dedicated Confluence page over storing it in a Google doc?


Having a dedicated wiki page is definitely a good way for the roadmap
discussion. I haven't created one yet because it's still a proposal to have
such roadmap discussion. If the community agrees with our proposal, the
release manager team can decide how they want to drive and track the
roadmap discussion.

@Chesnay

We should discuss how regularly we will ship major releases from now on.

Let's avoid again making breaking changes because we "gotta do it now
because 3.0 isn't happening anytime soon". (e.g., every 2 years or
something)


I'm not entirely sure about shipping major releases regularly. But I do
agree that we may want to avoid the situation that "breaking changes can
only happen now, or no idea when". Instead of defining compatibility
guarantees as "this API won't change in all 1.x/2.x series", what if we
define it as "this API won't change in the next 2/3 years". That should
allow us to incrementally iterate the APIs.

E.g., in 2.a, all APIs annotated as `@Stable` will be guaranteed compatible
until 2 years after 2.a is shipped, and in 2.b if the API is still
annotated `@Stable` it extends the compatibility guarantee to 2 years after
2.b is shipped. To remove an API, we would need to mark it as `@Deprecated`
and wait for 2 years after the last release in which it was marked
`@Stable`.
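Xintong's rolling-window guarantee can be made concrete with a small sketch; the versions and dates below are invented purely for illustration:

```python
from datetime import date, timedelta

# Hypothetical release history for one API:
# (version, release date, was the API still annotated @Stable in that release?)
releases = [
    ("2.a", date(2024, 6, 1), True),
    ("2.b", date(2024, 12, 1), True),
    ("2.c", date(2025, 6, 1), False),  # API marked @Deprecated here
]

GUARANTEE = timedelta(days=2 * 365)  # the proposed 2-year window

def earliest_removal(history):
    """The API may be removed 2 years after the last release
    in which it shipped as @Stable."""
    last_stable = max(d for _, d, stable in history if stable)
    return last_stable + GUARANTEE

print(earliest_removal(releases))  # 2 years after the 2.b release date
```

Each release that keeps the `@Stable` annotation pushes `last_stable` forward, which is what extends the guarantee incrementally rather than tying it to a major-version boundary.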

My thinking goes rather in the area of defining Milestone releases, each

Milestone targeting specific changes.


I'm trying to understand what you are suggesting here. IIUC, the milestone
releases are a breakdown of the 2.0 release, while we are free to introduce
breaking changes between them. And you suggest using longer-living feature
branches to keep the master branch in a releasable state (in terms of
milestone releases). Am I understanding it correctly?

I haven't thought this through. My gut feeling is this might be a good
direction to go, in terms of keeping things organized. The risk is the cost
of merging feature branches and rebasing feature branches after other
features are merged. That depends on how close the features are related to
each other. E.g., reorganization of the project modules and dependencies
may change the project structure a lot, which may significantly affect most
of the feature branches. Maybe we can identify such widely-affecting
changes and perform them at the beginning or end of the release cycle.

Best,

Xintong



On Wed, Apr 26, 2023 at 8:23 AM ConradJam  wrote:


Thanks Xintong and Jark for kicking off the great discussion!

I checked th

Re: [DISCUSS] Preventing Mockito usage for the new code with Checkstyle

2023-04-26 Thread Chesnay Schepler

* adds a note to not include "import " in the regex *

On 26/04/2023 11:22, Maximilian Michels wrote:

If we ban Mockito imports, I can still write tests using the full
qualifiers, right?

For example:
 
org.mockito.Mockito.when(somethingThatShouldHappen).thenReturn(somethingThatNeverActuallyHappens)

Just kidding, +1 on the proposal.

-Max

On Wed, Apr 26, 2023 at 9:02 AM Panagiotis Garefalakis
 wrote:

Thanks for bringing this up!  +1 for the proposal

@Jing Ge -- we don't necessarily need to completely migrate to Junit5 (even
though it would be ideal).
We could introduce the checkstyle rule and add suppressions for the
existing problematic paths (as we do today for other rules e.g.,
AvoidStarImport)

Cheers,
Panagiotis

On Tue, Apr 25, 2023 at 11:48 PM Weihua Hu  wrote:


Thanks for driving this.

+1 for Mockito and Junit4.

A clarity checkstyle will be of great help to new developers.

Best,
Weihua


On Wed, Apr 26, 2023 at 1:47 PM Jing Ge 
wrote:


This is a great idea, thanks for bringing this up. +1

Also +1 for Junit4. If I am not mistaken, it could only be done after the
Junit5 migration is done.

@Chesnay thanks for the hint. Do we have any doc about it? If not, it

might

deserve one. WDYT?

Best regards,
Jing

On Wed, Apr 26, 2023 at 5:13 AM Lijie Wang 
wrote:


Thanks for driving this. +1 for the proposal.

Can we also prevent Junit4 usage in new code in this way? Because

currently

we are aiming to migrate our codebase to JUnit 5.

Best,
Lijie

Piotr Nowojski  于2023年4月25日周二 23:02写道:


Ok, thanks for the clarification.

Piotrek

wt., 25 kwi 2023 o 16:38 Chesnay Schepler 

napisał(a):

The checkstyle rule would just ban certain imports.
We'd add exclusions for all existing usages as we did when

introducing

other rules.
So far we usually disabled checkstyle rules for a specific files.

On 25/04/2023 16:34, Piotr Nowojski wrote:

+1 to the idea.

How would this checkstyle rule work? Are you suggesting to start

with a

number of exclusions? On what level will those exclusions be? Per

file?

Per

line?

Best,
Piotrek

wt., 25 kwi 2023 o 13:18 David Morávek 

napisał(a):

Hi Everyone,

A long time ago, the community decided not to use Mockito-based

tests

because those are hard to maintain. This is already baked in our

Code

Style

and Quality Guide [1].

Because we still have Mockito imported into the code base, it's

very

easy

for newcomers to unconsciously introduce new tests violating the

code

style

because they're unaware of the decision.

I propose to prevent Mockito usage with a Checkstyle rule for a

new

code,

which would eventually allow us to eliminate it. This could also

prevent

some wasted work and unnecessary feedback cycles during reviews.

WDYT?

[1]



https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#avoid-mockito---use-reusable-test-implementations

Best,
D.







Re: [Discussion] - Release major Flink version to support JDK 17 (LTS)

2023-04-28 Thread Chesnay Schepler
We don't know yet. I wanted to run some more experiments to see if I 
can't get Scala 2.12.7 working on Java 17.


If that doesn't work, then it would also be an option to bump Scala in 
the Java 17 builds (breaking savepoint compatibility), and users should 
just use the Java APIs.


The alternative to _that_ is doing this when we drop the Scala API.

On 28/04/2023 01:11, Thomas Weise wrote:
Is the intention to bump the Flink major version and only support Java 
17+? If so, can Scala not be upgraded at the same time?


Thanks,
Thomas


On Thu, Apr 27, 2023 at 4:53 PM Martijn Visser 
 wrote:


Scala 2.12.7 doesn't compile on Java 17, see
https://issues.apache.org/jira/browse/FLINK-25000.

On Thu, Apr 27, 2023 at 3:11 PM Jing Ge  wrote:

> Thanks Tamir for the information. According to the latest
comment of the
> task FLINK-24998, this bug should be gone while using the latest
JDK 17. I
> was wondering whether it means that there are no more issues to
stop us
> releasing a major Flink version to support Java 17? Did I miss
something?
>
> Best regards,
> Jing
>
> On Thu, Apr 27, 2023 at 8:18 AM Tamir Sagi

> wrote:
>
>> More details about the JDK bug here
>> https://bugs.openjdk.org/browse/JDK-8277529
>>
>> Related Jira ticket
>> https://issues.apache.org/jira/browse/FLINK-24998
>>
>> --
    >> *From:* Jing Ge via user 
>> *Sent:* Monday, April 24, 2023 11:15 PM
>> *To:* Chesnay Schepler 
>> *Cc:* Piotr Nowojski ; Alexis
Sarda-Espinosa <
>> sarda.espin...@gmail.com>; Martijn Visser
;
>> dev@flink.apache.org ; user

>> *Subject:* Re: [Discussion] - Release major Flink version to
support JDK
>> 17 (LTS)
>>
>>
>> *EXTERNAL EMAIL*
>>
>>
>> Thanks Chesnay for working on this. Would you like to share
more info
>> about the JDK bug?
>>
>> Best regards,
>> Jing
>>
>> On Mon, Apr 24, 2023 at 11:39 AM Chesnay Schepler

>> wrote:
>>
>> As it turns out Kryo isn't a blocker; we ran into a JDK bug.
>>
>> On 31/03/2023 08:57, Chesnay Schepler wrote:
>>
>>
>>

https://github.com/EsotericSoftware/kryo/wiki/Migration-to-v5#migration-guide
>>
>> Kryo themselves state that v5 likely can't read v2 data.
>>
>> However, both versions can be on the classpath without conflicts, as v5
>> offers a versioned artifact that includes the version in the
package.
>>
>> It probably wouldn't be difficult to migrate a savepoint to
Kryo v5,
>> purely from a read/write perspective.
>>
>> The bigger question is how we expose this new Kryo version in
the API. If
>> we stick to the versioned jar we need to either duplicate all
current
>> Kryo-related APIs or find a better way to integrate other
serialization
>> stacks.
>> On 30/03/2023 17:50, Piotr Nowojski wrote:
>>
>> Hey,
>>
>> > 1. The Flink community agrees that we upgrade Kryo to a later
version,
>> which means breaking all checkpoint/savepoint compatibility and
releasing a
>> Flink 2.0 with Java 17 support added and Java 8 and Flink Scala
API support
>> dropped. This is probably the quickest way, but would still
mean that we
>> expose Kryo in the Flink APIs, which is the main reason why we
haven't been
>> able to upgrade Kryo at all.
>>
>> This sounds pretty bad to me.
>>
>> Has anyone looked into what it would take to provide a smooth
migration
>> from Kryo2 -> Kryo5?
>>
>> Best,
>> Piotrek
>>
>> czw., 30 mar 2023 o 16:54 Alexis Sarda-Espinosa

>> napisał(a):
>>
>> Hi Martijn,
>>
>> just to be sure, if all state-related classes use a POJO
serializer, Kryo
>> will never come into play, right? Given FLINK-16686 [1], I
wonder how many
>> users actually have jobs with Kryo and RocksDB, but even if
there aren't
>> many, that still leaves those who don't use RocksDB for
>> checkpoints/savepoints.
>>
>> If Kryo were to stay in the Flink APIs in v1.X, is it
impossible to let
>> users choose between v2/v5 jars by separating them like log4j2
jars?
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-1

[NOTICE] Flink master branch now uses Maven 3.8.6

2023-05-12 Thread Chesnay Schepler


 What happened?

I have just merged the last commits to properly support Maven 3.3+ on 
the Flink master branch.


mvnw and CI have been updated to use Maven 3.8.6.


 What does this mean for me?

 * You can now use Maven versions beyond 3.2.5 (duh).
 o Most versions should work, but 3.8.6 was the most tested and is
   thus recommended.
 o 3.8.*5* is known to *NOT* work.
 * Starting from 1.18.0 you need to use Maven 3.8.6 for releases.
 o This may change to a later version before the release of 1.18.0.
 o There have been too many issues with recent Maven releases to
   make a range acceptable.
 * *All dependencies that are bundled by a module must be marked as
   optional.*
 o *This is verified on CI
   
.*
 o *Background info can be found in the wiki
   .*


 Can I continue using Maven 3.2.5?

For now, yes, but support will eventually be removed.


 Does this affect users?

No.


Please ping me if you run into any issues.


Re: [DISCUSS] FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0

2023-06-01 Thread Chesnay Schepler
The version in the state is the serializer version, and applies to the 
entire state, independent of what it contains.
If you use Kryo2 for reading and Kryo5 for writing (which also implies 
writing the new serializer version into state), then I'd assume that a 
migration is an all-or-nothing kind of deal.
IOW, you'd have to load a savepoint and write out an entirely new 
savepoint with the new state.
Otherwise you may have only re-written part of the checkpoint, which now 
contains a mix of Kryo2/Kryo5 serialized classes; that should then fail 
_hard_ on any recovery attempt because we wouldn't use Kryo2 to read 
anything.
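The mixed-state situation follows from dispatching on a per-record serializer-version tag. A language-neutral sketch (tags and payloads are hypothetical, not Flink's actual state format):

```python
KRYO2, KRYO5 = 2, 5  # hypothetical serializer-version tags stored with the state

def deserialize(record):
    """Dispatch on the version tag stored with each piece of state."""
    version, payload = record
    if version == KRYO2:
        return ("read-with-kryo2", payload)
    if version == KRYO5:
        return ("read-with-kryo5", payload)
    raise ValueError(f"unknown serializer version {version}")

def write(payload):
    # New state is always written with the new serializer version.
    return (KRYO5, payload)

# An incremental checkpoint taken after a partial migration mixes both
# versions, so recovery must keep BOTH readers around -- or fail hard.
state = [(KRYO2, "old"), write("new")]
print([deserialize(r)[0] for r in state])  # ['read-with-kryo2', 'read-with-kryo5']
```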


If I'm right, then as is this sounds like quite a trap for users to fall 
into because from what I gathered this is the default behavior in the PR 
(I could be wrong though since I haven't fully dug through the 8k lines 
PR yet...)


What we kind of want is this:
1) Kryo5 is used as the default for new jobs. (maybe not even that, 
making it an explicit opt-in)

2) Kryo2 is used for reading AND writing for existing* jobs by default.
3) Users can explicitly (and easily!) do a full migration of their jobs, 
after which 2) should no longer apply.




In the PR you mentioned running into issues on Java 17; do you have 
some error stacktraces and example data/serializers still around?


On 30/05/2023 00:38, Kurt Ostfeld wrote:

I’d assumed that there wasn’t a good way to migrate state stored with an older 
version of Kryo to a newer version - if you’ve solved that, kudos.

I hope I've solved this. The pull request is supposed to do exactly this. 
Please let me know if you can propose a scenario that would break this.

The pull-request has both Kryo 2.x and 5.x dependencies. It looks at the state 
version number written to the state to determine which version of Kryo to use 
for deserialization. Kryo 2.x is not used to write new state.

--- Original Message ---
On Monday, May 29th, 2023 at 5:17 PM, Ken Krugler  
wrote:




Hi Kurt,

I personally think it’s a very nice improvement, and that the longer-term goal 
of removing built-in Kryo support for state serialization (while a good one) 
warrants a separate FLIP.

Perhaps an intermediate approach would be to disable the use of Kryo for state 
serialization by default, and force a user to disregard warnings and explicitly 
enable it if they want to go down that path.

I’d assumed that there wasn’t a good way to migrate state stored with an older 
version of Kryo to a newer version - if you’ve solved that, kudos.

— Ken



On May 29, 2023, at 2:21 PM, Kurt Ostfeld kurtostf...@proton.me.INVALID wrote:

Hi everyone. I would like to start the discussion thread for FLIP-317: Upgrade 
Kryo from 2.24.0 to 5.5.0 [1].

There is a pull-request associated with this linked in the FLIP.

I'd particularly like to hear about:

- Chesnay Schepler's request to consider removing Kryo serializers from the 
execution config. Is this a reasonable task to add into this FLIP? Is there 
consensus on how to resolve that? Would that be better addressed in a separate 
future FLIP after the Kryo upgrade FLIP is completed?

- Backwards compatibility. The automated CI tests have a lot of backwards 
compatibility tests that are passing. I also wrote a Flink application with 
keyed state using custom Kryo v2 serializers and then an upgraded version with 
both Kryo v2 and Kryo v5 serializers to stress test the upgrade process. I'd 
like to hear about additional scenarios that need to be tested.

- Is this worth pursuing or is the Flink project looking to go in a different 
direction? I'd like to do some more work on the pull request if this is being 
seriously considered for adoption.

I'm looking forward to hearing everyone's feedback and suggestions.

Thank you,
Kurt

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-317%3A+Upgrade+Kryo+from+2.24.0+to+5.5.0


--
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink, Pinot, Solr, Elasticsearch





Re: [VOTE] Release flink-connector-jdbc v3.1.1, release candidate #1

2023-06-06 Thread Chesnay Schepler
I'm a bit concerned that the last 4 CI runs haven't succeeded in the 3.1 
branch.


Has anyone looked into the failing oracle test (both 1.17/1.18)?
https://github.com/apache/flink-connector-jdbc/actions/runs/5058372107/jobs/9078398092

Why is a vote being opened when there's still a blocker ticket for this 
very test failure? (FLINK-31770)


How can it be that CI is broken for 2 months and no one notices?

On 24/05/2023 20:54, Martijn Visser wrote:

Hi everyone,
Please review and vote on the release candidate #1 for the version 3.1.1,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which are signed with the key with fingerprint
A5F3BCE4CBE993573EC5966A65321B8382B219AF [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag v3.1.1-rc1 [5],
* website pull request listing the new release [6].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Release Manager

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353281
[2]
https://dist.apache.org/repos/dist/dev/flink/flink-connector-jdbc-3.1.1-rc1
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1636/
[5] https://github.com/apache/flink-connector-jdbc/releases/tag/v3.1.1-rc1
[6] https://github.com/apache/flink-web/pull/654





Re: [DISCUSS] Status of Statefun Project

2023-06-06 Thread Chesnay Schepler
If you were to fork it /and want to redistribute it/ then the short 
version is that


1. you have to adhere to the Apache licensing requirements
2. you have to make it clear that your fork does not belong to the
   Apache Flink project. (Trademarks and all that)

Neither should be significant hurdles (there should also be plenty of 
online resources regarding 1), and if you do this then you can freely 
share your fork with others.


I've also pinged Martijn to take a look at this thread.
To my knowledge the project hasn't decided anything yet.

On 27/05/2023 04:05, Galen Warren wrote:

Ok, I get it. No interest.

If this project is being abandoned, I guess I'll work with my own fork. Is
there anything I should consider here? Can I share it with other people who
use this project?

On Tue, May 16, 2023 at 10:50 AM Galen Warren
wrote:


Hi Martijn, since you opened this discussion thread, I'm curious what your
thoughts are in light of the responses? Thanks.

On Wed, Apr 19, 2023 at 1:21 PM Galen Warren
wrote:


I use Apache Flink for stream processing, and StateFun as a hand-off

point for the rest of the application.
It serves well as a bridge between a Flink Streaming job and
micro-services.


This is essentially how I use it as well, and I would also be sad to see
it sunsetted. It works well; I don't know that there is a lot of new
development required, but if there are no new Statefun releases, then
Statefun can only be used with older Flink versions.

On Tue, Apr 18, 2023 at 10:04 PM Marco Villalobos <
mvillalo...@kineteque.com> wrote:


I am currently using Stateful Functions in my application.

I use Apache Flink for stream processing, and StateFun as a hand-off
point for the rest of the application.
It serves well as a bridge between a Flink Streaming job and
micro-services.

I would be disappointed if StateFun was sunsetted.  Its a good idea.

If there is anything I can do to help, as a contributor perhaps, please
let me know.


On Apr 3, 2023, at 2:02 AM, Martijn Visser

wrote:

Hi everyone,

I want to open a discussion on the status of the Statefun Project [1]

in Apache Flink. As you might have noticed, there hasn't been much
development over the past months in the Statefun repository [2]. There is
currently a lack of active contributors and committers who are able to help
with the maintenance of the project.

In order to improve the situation, we need to solve the lack of

committers and the lack of contributors.

On the lack of committers:

1. Ideally, there are some of the current Flink committers who have

the bandwidth and can help with reviewing PRs and merging them.

2. If that's not an option, it could be a consideration that current

committers only approve and review PRs, that are approved by those who are
willing to contribute to Statefun and if the CI passes

On the lack of contributors:

3. Next to having this discussion on the Dev and User mailing list, we

can also create a blog with a call for new contributors on the Flink
project website, send out some tweets on the Flink / Statefun twitter
accounts, post messages on Slack etc. In that message, we would inform how
those that are interested in contributing can start and where they could
reach out for more information.

There's also option 4. where a group of interested people would split

Statefun from the Flink project and make it a separate top level project
under the Apache Flink umbrella (similar as recently has happened with
Flink Table Store, which has become Apache Paimon).

If we see no improvements in the coming period, we should consider

sunsetting Statefun and communicate that clearly to the users.

I'm looking forward to your thoughts.

Best regards,

Martijn

[1]https://nightlies.apache.org/flink/flink-statefun-docs-master/  <

https://nightlies.apache.org/flink/flink-statefun-docs-master/>

[2]https://github.com/apache/flink-statefun  <

https://github.com/apache/flink-statefun>



Re: [DISCUSS] FLIP-317: Upgrade Kryo from 2.24.0 to 5.5.0

2023-06-08 Thread Chesnay Schepler

On 08/06/2023 16:06, Kurt Ostfeld wrote:

  If I understand correctly, the scenario is resuming from multiple checkpoint 
files or from a savepoint and checkpoint files which may be generated by 
different versions of Flink


No; it's the same version of Flink, you just didn't do a full migration 
of the savepoint from the start.


So, load old savepoint -> create an incremental checkpoint (which writes 
new state with Kryo5) -> job fails -> try to recover the job (which now 
has to read state that was written with either Kryo2 or Kryo5).


On 08/06/2023 16:06, Kurt Ostfeld wrote:

This pull-request build supports Java records


We'd have to see, but off the top of my head I doubt we want to use Kryo 
for that, and would rather extend our PojoSerializer. At least so far that was 
the plan.




Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-13 Thread Chesnay Schepler

On 13/06/2023 12:50, Jing Ge wrote:

One major issue we have, afaiu, is caused by the lack of housekeeping/house
cleaning, there are many APIs that were marked as deprecated a few years
ago and still don't get removed. Some APIs should be easy to remove and
others will need some more clear rules, like the issue discussed at [1].


This is by design. Most of these are @Public APIs that we had to carry 
around until Flink 2.0, because that was the initial guarantee that we 
gave people.



As for the FLIP, I like the idea of explicitly writing down a 
deprecation period for APIs, particularly PublicEvolving ones.
For Experimental I don't think it'd be a problem if we could change them 
right away,
but looking back a bit I don't think it hurts us to also enforce some 
deprecation period.

1 release for both of these sound fine to me.


My major point of contention is the removal of Public APIs between minor 
versions.

This to me would a major setback towards a simpler upgrade path for users.
If these can be removed in minor versions then what even is a major release?
The very definition we have for Public APIs is that they stick around 
until the next major one.
Any rule that theoretically allows for breaking changes in Public API in 
every minor release is in my opinion not a viable option.



The "carry recent Public APIs forward into the next major release" thing 
seems to presume a linear release history (aka, if 2.0 is released after 
1.20, then there will be no 1.21), which I doubt will be the case. The 
idea behind it is good, but I'd say the right conclusion would be to not 
make that API public if we know a new major release hits in 3 months and 
is about to modify it. With a regular schedule for major releases this 
wouldn't be difficult to do.


Re: [DISCUSS] FLIP-321: Introduce an API deprecation process

2023-06-15 Thread Chesnay Schepler

On 13/06/2023 17:26, Becket Qin wrote:

It would be valuable if we can avoid releasing minor versions for previous
major versions.


On paper I /absolutely/ agree, but I'm not sure how viable that is in 
practice.


On the current 2.0 agenda is potentially dropping support for Java 8/11, 
which may very well be a problem for our current users.



On 13/06/2023 17:26, Becket Qin wrote:

Thanks for the feedback and sorry for the confusion about Public API
deprecation. I just noticed that there was a mistake in the NOTES part for
Public API due to a copy-paste error... I just fixed it.
I'm very relieved to hear that. Glad we are on the same 
page on that note.



On 15/06/2023 15:20, Becket Qin wrote:

But it should be
completely OK to bump up the major version if we really want to get rid of
a public API, right?


Technically yes, but look at how long it took to get us to 2.0. ;)

There's a separate discussion to be had on the cadence of major releases 
going forward, and there seem to be different opinions on that.


If we take the Kafka example of 2 minor releases between major ones, 
that for us means that users have to potentially deal with breaking 
changes every 6 months, which seems like a lot.


Given our track record I would prefer a regular cycle (1-2 years) to 
force us to think about this whole topic, and not put it again to the 
wayside and giving us (and users) a clear expectation on when breaking 
changes can be made.


But again, maybe this should be in a separate thread.

On 14/06/2023 11:37, Becket Qin wrote:

Do you have an example of behavioral change in mind? Not sure I fully
understand the concern for behavioral change here.


This could be a lot of things. It can be performance in certain 
edge-cases, a bug fix that users (maybe unknowingly) relied upon 
(https://xkcd.com/1172/), a semantic change to some API.


For a concrete example, consider the job submission. A few releases back 
we made changes such that the initialization of the job master happens 
asynchronously.
This meant the job submission call returns sooner, and the job state 
enum was extended to cover this state.
API-wise we consider this a compatible change, but the observed behavior 
may be different.
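To illustrate how such an API-compatible change can still bite users, here is a hedged sketch: a status enum gains a new constant and caller code written against the old set of states mishandles it. The enum, the constant names, and the helper below are illustrative stand-ins, not the actual Flink classes.

```java
// Illustrative only: a simplified job-status enum. Suppose INITIALIZING
// was added in a later release to model asynchronous job-master startup.
enum JobStatus { CREATED, INITIALIZING, RUNNING, FINISHED, FAILED }

class StatusPoller {
    // Written against the old enum, before INITIALIZING existed:
    // the default branch silently treats the new state as unhealthy,
    // even though the API change itself was source/binary compatible.
    static boolean isHealthy(JobStatus s) {
        switch (s) {
            case CREATED:
            case RUNNING:
            case FINISHED:
                return true;
            default: // the new INITIALIZING constant falls through here
                return false;
        }
    }
}
```

Nothing broke at compile time, yet the observed behavior of the caller changed, which is exactly the kind of behavioral incompatibility that is hard to capture in a deprecation policy.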


Metrics are another example; I believe over time we changed what some 
metrics returned a few times.


Re: [DISCUSS] FLIP-322 Cooldown period for adaptive scheduler

2023-06-16 Thread Chesnay Schepler
1) Options specific to the adaptive scheduler should start with 
"jobmanager.adaptive-scheduler".


2)
There isn't /really/ a notion of a "scaling event". The scheduler is 
informed about new/lost slots and job failures, and reacts accordingly 
by maybe rescaling the job.
(sure, you can think of these as events, but you can think of 
practically everything as events)


There shouldn't be a queue for events. All the scheduler should have to 
know is that the next rescale check is scheduled for time T, which in 
practice boils down to a flag and a scheduled action that runs 
Executing#maybeRescale.
With that in mind, we also have to look at how we keep this state 
around. Presumably it is scoped to the current state, such that the 
cooldown is reset if a job fails.

Maybe we should add a separate ExecutingWithCooldown state; not sure yet.
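To illustrate the "flag plus scheduled action" idea, a minimal sketch of such a cooldown gate; the class and method names are hypothetical, and this is not the actual scheduler code:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the cooldown described above: instead of a
// queue of events, the scheduler only tracks the next time a rescale
// check (Executing#maybeRescale in Flink terms) is allowed to run.
final class RescaleGate {
    private final Duration cooldown;
    private Instant nextAllowedRescale = Instant.MIN;

    RescaleGate(Duration cooldown) { this.cooldown = cooldown; }

    // Called when new/lost slots are reported; returns true if the
    // rescale check may run now, and arms the cooldown if it does.
    boolean tryRescale(Instant now) {
        if (now.isBefore(nextAllowedRescale)) {
            return false; // cooling down; a check for time T is already pending
        }
        nextAllowedRescale = now.plus(cooldown);
        return true;
    }

    // On job failure the state (and with it the cooldown) is discarded,
    // matching the assumption that the cooldown is scoped to Executing.
    void reset() { nextAllowedRescale = Instant.MIN; }
}
```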

It would be good to clarify whether this FLIP only attempts to cover 
scale up operations, or also scale downs in case of slot losses.


We should also think about how it relates to the externalized 
declarative resource management. Should we always rescale immediately? 
Should we wait until the cooldown is over?
Related to this, there's the min-parallelism-increase option which, if 
for example set to "2", restricts rescale operations to only occur if 
the parallelism increases by at least 2.

Ideally however there would be a max timeout for this.

As such we could maybe think about this a bit differently:
Add 2 new options instead of 1:
jobmanager.adaptive-scheduler.scaling-interval.min: The minimum time the 
scheduler will wait until the next effective rescale operation.
jobmanager.adaptive-scheduler.scaling-interval.max: The maximum time the 
scheduler will wait until the next effective rescale operation.
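A minimal sketch of how the two proposed options could interact with min-parallelism-increase, under my reading of the proposal; the class, the method, and the exact semantics are assumptions, and none of this is implemented:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical decision logic for the two proposed options:
// - scaling-interval.min: never rescale more often than this
// - scaling-interval.max: after this long, rescale even if the change
//   is smaller than min-parallelism-increase
final class ScalingIntervals {
    private final Duration minInterval;
    private final Duration maxInterval;
    private final int minParallelismIncrease;
    private final Instant lastRescale; // caller records the last actual rescale

    ScalingIntervals(Duration min, Duration max, int minIncrease, Instant last) {
        this.minInterval = min;
        this.maxInterval = max;
        this.minParallelismIncrease = minIncrease;
        this.lastRescale = last;
    }

    boolean shouldRescale(Instant now, int parallelismIncrease) {
        Duration sinceLast = Duration.between(lastRescale, now);
        if (sinceLast.compareTo(minInterval) < 0) {
            return false; // within scaling-interval.min: no rescale at all
        }
        // past the min interval: rescale for a big enough change, or
        // once scaling-interval.max has elapsed, even for a small one
        return parallelismIncrease >= minParallelismIncrease
                || sinceLast.compareTo(maxInterval) >= 0;
    }
}
```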


3) It sounds fine that we lose the cooldown state, because imo we want 
to reset the cooldown anyway on job failures (because a job failure 
inherently implies a potential rescaling).


4) The stabilization time isn't really redundant and serves a different 
use-case. The idea behind it is that if a user adds multiple TMs at 
once, we don't want to rescale immediately at the first received slot. 
Without the stabilization time the cooldown would actually cause bad 
behavior here: not only would we rescale immediately upon receiving the 
minimum required slots to scale up, but we also wouldn't use the 
remaining slots just because the cooldown says so.
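For illustration, a sketch of such a stabilization window, as distinct from the cooldown: rescaling waits until no new slot has arrived for a while, so a burst of TMs added at once is absorbed by a single rescale. Names and semantics are assumptions, not the actual scheduler implementation.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical resource-stabilization window: resources count as
// "stable" only once `window` has passed without a new slot arriving.
final class StabilizationWindow {
    private final Duration window;
    private Instant lastSlotArrival = null;

    StabilizationWindow(Duration window) { this.window = window; }

    // Each new slot restarts the window, so a burst of TMs is
    // collected before any rescale decision is made.
    void onNewSlot(Instant now) { lastSlotArrival = now; }

    boolean resourcesStable(Instant now) {
        return lastSlotArrival != null
                && !now.isBefore(lastSlotArrival.plus(window));
    }
}
```

The contrast with the cooldown: the cooldown rate-limits how often rescales happen, while the window above delays a single rescale until the slot situation has settled.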


On 16/06/2023 15:47, Etienne Chauchot wrote:

Hi Robert,

Thanks for your feedback. I don't know the scheduler part well enough 
yet and I'm taking this ticket as a learning workshop.


Regarding your comments:

1. Taking a look at the AdaptiveScheduler class, which takes all its 
configuration from the JobManagerOptions, and also to be consistent 
with other parameter names, I'd suggest 
/jobmanager.scheduler-scaling-cooldown-period/


2. I thought scaling events existed already and the scheduler received 
them, as mentioned in FLIP-160 (cf. "Whenever the scheduler is in the 
Executing state and receives new slots") or in FLIP-138 (cf. "Whenever 
new slots are available the SlotPool notifies the Scheduler"). If that 
is not the case (i.e. it is the scheduler that asks for slots), then 
there is indeed no need to store scaling requests.


=> I need a confirmation here

3. If we lose the JobManager, we lose both the AdaptiveScheduler 
state and the CoolDownTimer state. So, upon recovery, it would be as 
if there were no ongoing coolDown period. A first re-scale could 
therefore happen right away and would start a coolDown period; a second 
re-scale would have to wait for the end of this period.


4. When a pipeline is re-scaled, it is restarted. Upon restart, the 
AdaptiveScheduler passes through the "waiting for resources" state 
again, as FLIP-160 suggests. If so, the coolDown period seems kind of 
redundant with the resource-stabilization-timeout. I guess that is not 
the case, otherwise the FLINK-21883 ticket would not have been 
created.


=> I need a confirmation here also.


Thanks for your views on point 2 and 4.


Best

Etienne

Le 15/06/2023 à 13:35, Robert Metzger a écrit :

Thanks for the FLIP.

Some comments:
1. Can you specify the full proposed configuration name?
"scaling-cooldown-period" is probably not the full config name?
2. Why is the concept of scaling events and a scaling queue needed? If I
remember correctly, the adaptive scheduler will just check how many
TaskManagers are available and then adjust the execution graph 
accordingly.

There's no need to store a number of scaling events. We just need to
determine the time to trigger an adjustment of the execution graph.
3. What's the behavior w.r.t. JobManager failures (e.g. we lose the
state of the Adaptive Scheduler)? My proposal would be to just reset the
cooldown period, so after recovery of a JobManager, we have to wait at
least for the cooldown period until further scaling operations are done.
4. What's the relationship to the
"jobmanage