Re: [ANNOUNCE] Contributing Alibaba's Blink

2019-01-22 Thread Kurt Young
Hi Driesprong,

Glad to hear that you're interested in Blink's code. Actually, Blink
only has one branch by itself, so either a separate repo or a branch in
Flink's repository works for sharing Blink's code.

Best,
Kurt


On Tue, Jan 22, 2019 at 2:30 PM Driesprong, Fokko 
wrote:

> Great news Stephan!
>
> Why not make the code available by having a fork of Flink on Alibaba's
> Github account? This will allow us to do easy diffs in the Github UI and
> create PRs of cherry-picked commits if needed. I can imagine that the
> Blink codebase has a lot of branches by itself, so just pushing a couple of
> branches to the main Flink repo is not ideal. Looking forward to it!
>
> Cheers, Fokko
>
> On Tue, Jan 22, 2019 at 3:48 AM, Shaoxuan Wang wrote:
>
> > Big +1 to contribute the Blink codebase directly into the Apache Flink
> > project. Looking forward to the new journey.
> >
> > Regards,
> > Shaoxuan
> >
> > On Tue, Jan 22, 2019 at 3:52 AM Xiaowei Jiang 
> > wrote:
> >
> > > Thanks Stephan! We are hoping to make the process as non-disruptive as
> > > possible to the Flink community. Making the Blink codebase public is
> > > the first step that hopefully facilitates further discussions.
> > > Xiaowei
> > >
> > > On Monday, January 21, 2019, 11:46:28 AM PST, Stephan Ewen <
> > > se...@apache.org> wrote:
> > >
> > > Dear Flink Community!
> > >
> > > Some of you may have heard it already from announcements or from a
> > > Flink Forward talk:
> > > Alibaba has decided to open source its in-house improvements to Flink,
> > > called Blink!
> > > First of all, big thanks to the team that developed these improvements
> > > and made this contribution possible!
> > >
> > > Blink has some very exciting enhancements, most prominently on the
> > > Table API/SQL side and the unified execution of these programs. For
> > > batch (bounded) data, the SQL execution has full TPC-DS coverage
> > > (which is a big deal), and the execution is more than 10x faster than
> > > the current SQL runtime in Flink. Blink has also added support for
> > > catalogs, improved the failover speed of batch queries and the
> > > resource management. It also makes some good steps in the direction of
> > > more deeply unifying the batch and streaming execution.
> > >
> > > The proposal is to merge Blink's enhancements into Flink, to give
> > > Flink's SQL/Table API and execution a big boost in usability and
> > > performance.
> > >
> > > Just to avoid any confusion: This is not a suggested change of focus
> > > to batch processing, nor would this break with any of the streaming
> > > architecture and vision of Flink.
> > > This contribution follows very much the principle of "batch is a
> > > special case of streaming". As a special case, batch makes special
> > > optimizations possible. In its current state, Flink does not exploit
> > > many of these optimizations. This contribution adds exactly these
> > > optimizations and makes the streaming model of Flink applicable to
> > > harder batch use cases.
> > >
> > > Assuming that the community is excited about this as well, and in
> > > favor of these enhancements to Flink's capabilities, below are some
> > > thoughts on how this contribution and integration could work.
> > >
> > > --- Making the code available ---
> > >
> > > At the moment, the Blink code is in the form of a big Flink fork
> > > (rather than isolated patches on top of Flink), so the integration is
> > > unfortunately not as easy as merging a few patches or pull requests.
> > >
> > > To support a non-disruptive merge of such a big contribution, I
> > > believe it makes sense to make the code of the fork available in the
> > > Flink project first.
> > > From there on, we can start to work on the details for merging the
> > > enhancements, including the refactoring of the necessary parts in the
> > > Flink master and the Blink code to make a merge possible without
> > > repeatedly breaking compatibility.
> > >
> > > The first question is where do we put the code of the Blink fork
> > > during the merging procedure?
> > > My first thought was to temporarily add a repository (like
> > > "flink-blink-staging"), but we could also put it into a special branch
> > > in the main Flink repository.
> > >
> > > I will start a separate thread about discussing a possible strategy to
> > > handle and merge such a big contribution.
> > >
> > > Best,
> > > Stephan


Re: [ANNOUNCE] Contributing Alibaba's Blink

2019-01-22 Thread Timo Walther
Thanks for driving these efforts, Stephan! Great news that the Blink 
code base will be available for everyone soon. I already got access to 
it, and the added functionality and improved architecture are impressive. 
These will be nice additions to Flink.


I guess the Blink code base will be continuously updated while the Flink 
community merges chunks of it, right? If yes, I would also be in favor 
of a separate repository, similar to flink-shaded.


Regards,
Timo





Re: [ANNOUNCE] Contributing Alibaba's Blink

2019-01-22 Thread jincheng sun
Thanks Stephan! This is very exciting news for the Flink community.

I recommend creating a branch for Blink in the Flink repository. Just like
feature development, the Blink branch would be a branch with many enhancements,
and the enhanced functionality would be continuously merged into the Flink master.

Cheers,
Jincheng


Re: [ANNOUNCE] Contributing Alibaba's Blink

2019-01-22 Thread Ufuk Celebi
Hey Stephan and others,

thanks for the summary. I'm very excited about the outlined improvements. :-)

Separate branch vs. fork: I'm fine with either suggestion.
Depending on the expected strategy for merging the changes, the expected
number of additional changes, etc., one or the other approach
might be better suited.

– Ufuk


Re: [ANNOUNCE] Contributing Alibaba's Blink

2019-01-22 Thread Dominik Wosiński
Hey!
I also think that creating a separate branch for Blink in the Flink repo is a
better idea than creating a fork, as IMHO it will allow merging changes
more easily.

Best Regards,
Dom.


Re: Apply for Flink contributor permission

2019-01-22 Thread Fabian Hueske
Hi Hongtao,

Welcome to the Flink community!
I gave you contributor permissions for Jira.

Best, Fabian

On Tue, Jan 22, 2019 at 7:04 AM, 张洪涛 wrote:

> Hi Guys,
>
> Could anyone give me the contributor permission ?
>
> My Jira ID is hongtao12310
>
> Regards,
> Hongtao
>


[DISCUSS] A strategy for merging the Blink enhancements

2019-01-22 Thread Stephan Ewen
Dear Flink community!

As a follow-up to the thread announcing Alibaba's offer to contribute the
Blink code [1], here are some thoughts on how this contribution could be
merged.

As described in the announcement thread, it is a big contribution, and we
need to carefully plan how to handle it. We would like to get the
improvements into Flink, while making the process as non-disruptive as
possible for the community.
I hope that this plan gives the community a better understanding of what
the proposed contribution would mean.

Here is an initial rough proposal, with thoughts from
Timo, Piotr, Dawid, Kurt, Shaoxuan, Jincheng, Jark, Aljoscha, Fabian,
Xiaowei:

  - It is obviously very hard to merge all changes in one quick move, because
    we are talking about several hundred thousand lines of code.

  - As much as possible, we want to maintain compatibility with the current
    Table API, so that this becomes a transparent change for most users.

  - The two areas with the most changes we identified were
      (1) The SQL/Table query processor
      (2) The batch scheduling/failover/shuffle

  - For the query processor part, this is what we found and propose:

    -> The Blink and Flink code have the same semantics (ANSI SQL) except
       for minor aspects (under discussion). Blink also covers more SQL
       operations.

    -> The Blink code is quite different from the current Flink SQL runtime.
       Merging it as a series of changes seems hardly feasible. From the
       current evaluation, the Blink query processor uses the more advanced
       architecture, so it would make sense to converge to that design.

    -> We propose to gradually build up the Blink-based query processor as
       a second query processor under the SQL/Table API. Think of it as two
       different runners for the Table API (see the sketch after this list).
       As the new query processor becomes fully merged and stable, we can
       deprecate and eventually remove the existing query processor. That
       should give the least disruption to Flink users and allow for a
       gradual merge/development.

    -> Some refactoring of the Table API is necessary to support the above
       strategy. Most of the prerequisite refactoring is around splitting
       the project into different modules, following a similar idea as
       FLIP-28 [2].

    -> A more detailed proposal is being worked on.

    -> Same as FLIP-28, this approach would probably need to suspend Table
       API contributions for a short while. We hope that this can be a very
       short period, to not impact the very active development in Flink on
       Table API/SQL too much.

  - For the batch scheduling and failover enhancements, we should be able
    to build on the currently ongoing refactoring of the scheduling logic
    [3]. That should make it easy to plug in a new scheduler and failover
    logic. We can port the Blink enhancements as a new scheduler / failover
    handler. We can later make it the default for bounded stream programs
    once the merge is completed and it is tested.

  - For the catalog and source/sink design and interfaces, we would like to
    continue with the already started design discussion threads. Once these
    have converged, we might use some of the Blink code for the
    implementation, if it is close to the outcome of the design discussions.
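To make the "two runners" idea concrete, here is a purely hypothetical
sketch of how selecting between the two query processors might eventually
look to a user. None of these names exist today; the actual API would come
out of the design discussions.

// Purely hypothetical API sketch; none of these names exist (yet).
TableEnvironment legacyEnv = TableEnvironment.create(
        RunnerSettings.newInstance().useCurrentRunner().build());

TableEnvironment blinkEnv = TableEnvironment.create(
        RunnerSettings.newInstance().useBlinkRunner().build());

// Both environments would expose the same Table API / SQL surface; only
// the underlying query processor differs, so programs should not need to
// change when switching runners.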

Best,
Stephan

[1]
https://lists.apache.org/thread.html/2f7330e85d702a53b4a2b361149930b50f2e89d8e8a572f8ee2a0e6d@%3Cdev.flink.apache.org%3E

[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-28%3A+Long-term+goal+of+making+flink-table+Scala-free

[3] https://issues.apache.org/jira/browse/FLINK-10429


[DISCUSS] Start new Review Process

2019-01-22 Thread Fabian Hueske
Hi everyone,

A few months ago the community discussed and agreed to improve the PR
review process [1-4].
The idea is to follow a checklist to avoid in-depth reviews on
contributions that might not be accepted for other reasons. Thereby,
reviewers and contributors do not spend their time on PRs that will not be
merged.
The checklist consists of five points:

1. The contribution is well-described.
2. There is consensus that the contribution should go into to Flink.
3. [Does not need specific attention | Needs specific attention for X | Has
attention for X by Y]
4. The architectural approach is sound.
5. Overall code quality is good.

Back then we added a review guide to the website [5] but did not put the
new process in place yet. I would like to start this now.
There is a PR [6] that adds the review checklist to the PR template.
Committers who review a PR should follow the checklist and tick and sign
off the boxes by updating the PR description. For that, committers need to
be members of the ASF Github organization.
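To illustrate, a partially reviewed PR description might then look like
this (the reviewer handle is made up):

1. [x] The contribution is well-described. (checked by @some-committer)
2. [x] There is consensus that the contribution should go into Flink. (checked by @some-committer)
3. [ ] Does not need specific attention | Needs specific attention for X | Has attention for X by Y
4. [ ] The architectural approach is sound.
5. [ ] Overall code quality is good.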

If nobody has concerns, I'll merge the PR in a few days.
Once the PR is merged, the reviews of all new PRs should follow the
checklist.

Best,
Fabian

[1]
https://lists.apache.org/thread.html/dcbe377eb477b531f49c462e90d8b1e50e0ff33c6efd296081c6934d@%3Cdev.flink.apache.org%3E
[2]
https://lists.apache.org/thread.html/172aa6d12ed442ea4da9ed2a72fe0894c9be7408fb2e1b7b50dfcb8c@%3Cdev.flink.apache.org%3E
[3]
https://lists.apache.org/thread.html/5e07c1be8078dd7b89d93c67b71defacff137f3df56ccf4adb04b4d7@%3Cdev.flink.apache.org%3E
[4]
https://lists.apache.org/thread.html/d7fd1fe45949f7c706142c62de85d246c7f6a1485a186fd3e9dced01@%3Cdev.flink.apache.org%3E
[5] https://flink.apache.org/reviewing-prs.html
[6] https://github.com/apache/flink/pull/6873


Re: [ANNOUNCE] Contributing Alibaba's Blink

2019-01-22 Thread Jark Wu
Great news! Looking forward to the new wave of developments.

If Blink needs to be continuously updated, with bug fixes and version
releases, maybe a separate repository is a better idea.

Best,
Jark


[jira] [Created] (FLINK-11406) Return TypeSerializerSchemaCompatibility.incompatible() when arity of nested serializers don't match in CompositeTypeSerializerSnapshot's compatibility check

2019-01-22 Thread Tzu-Li (Gordon) Tai (JIRA)
Tzu-Li (Gordon) Tai created FLINK-11406:
---

 Summary: Return TypeSerializerSchemaCompatibility.incompatible() 
when arity of nested serializers don't match in 
CompositeTypeSerializerSnapshot's compatibility check
 Key: FLINK-11406
 URL: https://issues.apache.org/jira/browse/FLINK-11406
 Project: Flink
  Issue Type: Improvement
  Components: Type Serialization System
Reporter: Tzu-Li (Gordon) Tai
Assignee: Tzu-Li (Gordon) Tai
 Fix For: 1.8.0


Right now, in 
{{CompositeTypeSerializerSnapshot.resolveSchemaCompatibility(...)}}, if the 
arity of nested serializers doesn't match between the snapshot and the 
provided new serializer, a runtime exception is thrown.

Ideally, this should instead return 
{{TypeSerializerSchemaCompatibility.incompatible()}}.
We should also make it clearer in the Javadocs that 
{{CompositeTypeSerializerSnapshot}} is intended for product serializers that 
have a fixed arity of nested serializers.
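A minimal sketch of the proposed behavior (the helper class and names below
are illustrative, not Flink's actual implementation):

import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.common.typeutils.TypeSerializerSchemaCompatibility;

// Illustrative only: compare nested-serializer arity and report
// incompatibility instead of throwing, as proposed above.
final class ArityCheckSketch {
    static <T> TypeSerializerSchemaCompatibility<T> checkNestedArity(
            TypeSerializer<?>[] nestedFromSnapshot,
            TypeSerializer<?>[] nestedFromNewSerializer) {
        if (nestedFromSnapshot.length != nestedFromNewSerializer.length) {
            // proposed: signal incompatibility instead of throwing
            return TypeSerializerSchemaCompatibility.incompatible();
        }
        return TypeSerializerSchemaCompatibility.compatibleAsIs();
    }
}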





[jira] [Created] (FLINK-11407) Allow providing reason messages for TypeSerializerSchemaCompatibility.incompatible()

2019-01-22 Thread Tzu-Li (Gordon) Tai (JIRA)
Tzu-Li (Gordon) Tai created FLINK-11407:
---

 Summary: Allow providing reason messages for 
TypeSerializerSchemaCompatibility.incompatible()
 Key: FLINK-11407
 URL: https://issues.apache.org/jira/browse/FLINK-11407
 Project: Flink
  Issue Type: Improvement
  Components: Type Serialization System
Reporter: Tzu-Li (Gordon) Tai
Assignee: Tzu-Li (Gordon) Tai
 Fix For: 1.8.0


There are a few different scenarios where a new serializer can be determined 
incompatible in a compatibility check.

Allowing the incompatible result to be accompanied by a message indicating why 
the new serializer is incompatible would be beneficial, and allows the state 
backends to throw more meaningful exceptions when they do encounter an 
incompatible new serializer.
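One possible shape for this, sketched as comments since the overload does
not exist at the time of writing (all names are hypothetical):

// Hypothetical addition on TypeSerializerSchemaCompatibility:
//
//     public static <T> TypeSerializerSchemaCompatibility<T> incompatible(String reason);
//
// A state backend could then surface the reason, e.g.:
//
//     throw new StateMigrationException(
//         "New serializer for state is incompatible: " + result.getIncompatibleReason());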





[jira] [Created] (FLINK-11408) ContinuousProcessingTimeTrigger: NPE on clear() and state is lost on merge

2019-01-22 Thread Cristian (JIRA)
Cristian created FLINK-11408:


 Summary: ContinuousProcessingTimeTrigger: NPE on clear() and state 
is lost on merge
 Key: FLINK-11408
 URL: https://issues.apache.org/jira/browse/FLINK-11408
 Project: Flink
  Issue Type: Bug
 Environment: Put both bugs in 
[https://github.com/casidiablo/flink-continuous-processing-trigger-bugs]

This is running Flink 1.7.1 locally.
Reporter: Cristian


I was testing session windows using processing time and found a couple of 
problems with the ContinuousProcessingTimeTrigger. The first one is an NPE 
in the clear method:

[https://github.com/casidiablo/flink-continuous-processing-trigger-bugs/blob/master/src/main/java/flink/bug/Bug1.java]

The second one, which is most likely related and the root cause of the 
first, is that the way trigger state is merged for merged windows confuses 
the trigger, so that it stops firing:

[https://github.com/casidiablo/flink-continuous-processing-trigger-bugs/blob/master/src/main/java/flink/bug/Bug2.java]

I managed to solve both of these by using a modified version of 
ContinuousProcessingTimeTrigger which does NOT call 
`ctx.mergePartitionedState(stateDesc);` in the onMerge method (see the 
sketch below).

This is what I understand happens at the trigger level:
 * The first element in the stream sets an initial fire time (logic is in 
ContinuousProcessingTimeTrigger#onElement()), if there is no trigger set.
 * From then on, ContinuousProcessingTimeTrigger#onProcessingTime() is 
responsible for scheduling the next trigger.
 * When windows are merged (in the case of session windows), somehow the 
clear and merge methods are called using the wrong window namespace (I think 
this is the root cause of the bug, but I'm not too familiar with that code).
 * Because the state is not cleared properly in the new window namespace, the 
previously scheduled trigger gets executed against the window that was cleared.
 * Moreover, the new window has the state of the previous window, which means 
that:
   1. onElement will NOT schedule a fire trigger
   2. onProcessingTime will never be called at all
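A sketch of the workaround described above, assuming a private copy of
ContinuousProcessingTimeTrigger in which only onMerge changes (the comments
describe my reading of the report, not verified code):

// In the copied trigger class:
@Override
public void onMerge(W window, OnMergeContext ctx) throws Exception {
    // Workaround: skip ctx.mergePartitionedState(stateDesc) entirely.
    // The merged window then starts with no fire-timestamp state, so the
    // next arriving element's onElement() registers a fresh timer instead
    // of inheriting a stale one from the pre-merge windows.
}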





Flink CEP : Doesn't generate output

2019-01-22 Thread dhanuka ranasinghe
Hi All,

I have used Flink CEP to filter some events and generate alerts based
on certain conditions, but unfortunately it doesn't print any result. I have
attached the source code herewith; could you please help me with this?




package org.monitoring.stream.analytics;

import java.io.IOException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.table.functions.ScalarFunction;
import org.apache.flink.table.shaded.org.apache.commons.lang3.StringUtils;
import org.monitoring.stream.analytics.util.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.MultiMap;

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

public class FlinkCEP {

    private final static Logger LOGGER = LoggerFactory.getLogger(FlinkCEP.class);

    public static void main(String[] args) throws Exception {

        // NOTE: the generic type parameters below were stripped by the mail
        // archive; the ones shown are reconstructed best guesses.
        String query = FileHandler.readInputStream(FileHandler.getResourceAsStream("query.sql"));
        if (query == null) {
            LOGGER.error("* Can't read resources *");
        } else {
            LOGGER.info("===== " + query + " =====");
        }
        Properties props = FileHandler.loadResourceProperties("application.properties");
        Properties kConsumer = FileHandler.loadResourceProperties("consumer.properties");
        Properties kProducer = FileHandler.loadResourceProperties("producer.properties");
        String hzConfig = FileHandler.readInputStream(FileHandler.getResourceAsStream("hazelcast-client.xml"));
        String schemaContent = FileHandler.readInputStream(FileHandler.getResourceAsStream("IRIC-schema.json"));

        props.setProperty("auto.offset.reset", "latest");
        props.setProperty("flink.starting-position", "latest");
        Map<String, String> tempMap = new HashMap<>();
        for (final String name : props.stringPropertyNames()) {
            tempMap.put(name, props.getProperty(name));
        }
        final ParameterTool params = ParameterTool.fromMap(tempMap);
        String jobName = props.getProperty(ApplicationConfig.JOB_NAME);

        LOGGER.info("%%% Desktop Responsibility Start %%%");

        LOGGER.info("$$$ Hz instance name " + props.toString());
        HazelcastInstance hzInst = HazelcastUtils.getClient(hzConfig, "");

        LOGGER.info("== schema " + schemaContent);

        MultiMap<String, String> distributedMap = hzInst.getMultiMap("masterDataSynch");
        distributedMap.put(jobName, query);

        LOGGER.info("%%% Desktop Responsibility End %%%");

        Collection<String> queries = distributedMap.get(jobName);
        Set<String> rules = new HashSet<>(queries);
        LOGGER.info("== query " + query);
        rules.add(query);
        hzInst.getLifecycleService().shutdown();
        final String sourceTable = "dataTable";

        String paral = props.getProperty(ApplicationConfig.FLINK_PARALLEL_TASK);
        String noOfOROperatorsValue = props.getProperty(ApplicationConfig.FLINK_NUMBER_OF_OR_OPERATORS);
        int noOfOROperators = 50;
        if (StringUtils.isNoneBlank(noOfOROperatorsValue)) {
            noOfOROperators = Integer.parseInt(noOfOROperatorsValue);
        }
        List<List<String>> subQueries = chunk(new ArrayList<>(rules), noOfOROperators);

        // define a schema

        // setup streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 1));
        env.enableCheckpointing(30); // original comment said "300 seconds"; the value here is 30 ms
        env.getConfig().setGlobalJobParameters(params);
        // env.getConfig().enableObjectReuse();
        env.getConfig().setUseSnapshotCompression(true);
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
        env.setParallelism(Integer.parseInt(paral));
        /* env.setStateBackend(new RocksDBStateBackend(env.getStateBackend(), true)); */

        FlinkKafkaConsumer010 kafka = new FlinkKafkaC
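For reference, a minimal self-contained CEP job that does print matches
looks roughly like the sketch below (illustrative names and data). One
common cause of "no output" is running under TimeCharacteristic.EventTime
without assigning timestamps and watermarks, in which case CEP buffers
events and never emits; the sketch sidesteps that by using processing time.

import java.util.List;
import java.util.Map;

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CepSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // processing time avoids the missing-watermark pitfall entirely
        env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

        DataStream<String> events = env.fromElements("ok", "ALERT-1", "ok", "ALERT-2");

        // match any single event whose payload starts with "ALERT"
        Pattern<String, String> pattern = Pattern.<String>begin("alert")
                .where(new SimpleCondition<String>() {
                    @Override
                    public boolean filter(String value) {
                        return value.startsWith("ALERT");
                    }
                });

        PatternStream<String> matches = CEP.pattern(events, pattern);
        matches.select(new PatternSelectFunction<String, String>() {
            @Override
            public String select(Map<String, List<String>> match) {
                return "matched: " + match.get("alert").get(0);
            }
        }).print();

        env.execute("cep-sketch");
    }
}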

[jira] [Created] (FLINK-11409) Make `ProcessFunction`, `ProcessWindowFunction` and etc. pure interfaces

2019-01-22 Thread Kezhu Wang (JIRA)
Kezhu Wang created FLINK-11409:
--

 Summary: Make `ProcessFunction`, `ProcessWindowFunction` and etc. 
pure interfaces
 Key: FLINK-11409
 URL: https://issues.apache.org/jira/browse/FLINK-11409
 Project: Flink
  Issue Type: Improvement
Reporter: Kezhu Wang


I found that these functions express no opinionated demands on implementing 
classes. It would be nice to define them as interfaces rather than abstract 
classes, since an abstract class is intrusive and hampers callers' use cases. 
For example, a client can't write an `AbstractFlinkRichFunction` to unify 
lifecycle management for all data processing functions in an easy way.

I dove into the history of some of these functions and found that some were 
converted from interfaces to abstract classes due to default method 
implementations; for example, `ProcessFunction` and `CoProcessFunction` were 
converted to abstract classes in FLINK-4460, which predates FLINK-7274. After 
FLINK-7274, [Java 8 default 
methods|https://docs.oracle.com/javase/tutorial/java/IandI/defaultmethods.html]
 would be a better solution.

I also notice that some functions introduced after FLINK-7274, such as 
`ProcessJoinFunction`, are implemented as abstract classes. I think it would 
be better to establish a well-known principle to guide both API authors and 
callers of data processing functions.

Personally, I prefer interfaces for all exported function callbacks, for the 
reason given in the first paragraph.

Besides this, with `AbstractRichFunction` and interfaces for the data 
processing functions, lots of rich data processing functions could be 
eliminated, as they are plain classes extending `AbstractRichFunction` and 
implementing the data processing interfaces; clients could write this in one 
line of code with a clear intention of both data processing and lifecycle 
management.

Following is a possibly incomplete list of data processing functions 
currently implemented as abstract classes:
 * `ProcessFunction`, `KeyedProcessFunction`, `CoProcessFunction` and 
`ProcessJoinFunction`
 * `ProcessWindowFunction` and `ProcessAllWindowFunction`
 * `BaseBroadcastProcessFunction`, `BroadcastProcessFunction` and 
`KeyedBroadcastProcessFunction`

All of the above functions are annotated with `@PublicEvolving`, so making 
them interfaces won't break Flink's compatibility guarantees, but 
compatibility is still a big consideration in evaluating this proposal.

Any thoughts on this proposal? Please do comment.
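To illustrate the idea, here is a sketch of what a `ProcessFunction`-like
interface could look like with Java 8 default methods (names and simplified
signatures are illustrative, not the real API):

import org.apache.flink.util.Collector;

// Illustrative sketch only: signatures are simplified relative to the real
// ProcessFunction (the nested Context types are omitted).
public interface ProcessFunctionLike<I, O> {

    // the one mandatory callback
    void processElement(I value, Collector<O> out) throws Exception;

    // optional callback with a default no-op body; an interface with
    // defaults leaves implementors free to extend their own base class
    default void onTimer(long timestamp, Collector<O> out) throws Exception {}
}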





[jira] [Created] (FLINK-11410) Kubernetes Setup page gives incorrect url of Flink UI

2019-01-22 Thread Frank Huang (JIRA)
Frank Huang created FLINK-11410:
---

 Summary: Kubernetes Setup page gives incorrect url of Flink UI
 Key: FLINK-11410
 URL: https://issues.apache.org/jira/browse/FLINK-11410
 Project: Flink
  Issue Type: Bug
Reporter: Frank Huang


In this section of the Kubernetes setup docs:
https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/deployment/kubernetes.html#deploy-flink-session-cluster-on-kubernetes
the URL should be
http://localhost:8081/api/v1/namespaces/default/services/flink-jobmanager:ui/proxy
rather than the one currently shown. The port should be 8081 instead of 8001.





Re: [DISCUSS] A strategy for merging the Blink enhancements

2019-01-22 Thread Zhang, Xuefu
Hi Stephan,

Thanks for bringing up the discussion. I'm +1 on the merging plan. One 
question though: since the merge will not be completed for some time and 
there might be users trying the blink branch, what's the plan for development 
in the branch? Personally, I think we should discourage big contributions to 
the branch, which would further complicate the merge, while we shouldn't stop 
critical fixes either.

What's your take on this?

Thanks,
Xuefu




Re: Issues regarding Table-API

2019-01-22 Thread Kurt Young
Hi Elias,

For your question 2: this is doable; I think it will be resolved in a future
version of Flink.
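In the meantime, one possible workaround might be to materialize the shared
subtree once by round-tripping through the DataSet API, so that later
references share a single operator in the plan instead of re-deriving it.
An untested sketch against the Flink 1.7 BatchTableEnvironment ("tEnv" and
"shared" stand for your environment and the common INTERSECT/JOIN subtree):

DataSet<Row> sharedDs = tEnv.toDataSet(shared, Row.class);  // plan fixed here
Table sharedOnce = tEnv.fromDataSet(sharedDs);
// the six downstream joins can now reference sharedOnce, which maps to a
// single DataSet node, so the subtree is executed once within the job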

Best,
Kurt


On Tue, Jan 15, 2019 at 10:35 PM Elias Saalmann <
es45g...@studserv.uni-leipzig.de> wrote:

> Hi there,
>
> I'm working on the Gradoop project at the University of Leipzig (
> https://github.com/dbs-leipzig/gradoop). Currently we're using the
> Batch-API - now we're investigating Table-API as an abstraction for
> Batch-API. I found 2 issues I want to discuss:
>
> 1. I get an error (Error while applying rule AggregateUnionAggregateRule)
> at compile time when having a DISTINCT on the result of a JOIN within a
> UNION, e.g.
>
> (
>   SELECT DISTINCT c
>   FROM a JOIN b ON a = b
> )
> UNION
> (
>   SELECT c
>   FROM c
> )
>
> Java example:
> https://gist.github.com/lordon/27fc5277b0d5abd58158f4ec40cda384
>
> 2. As we have large workflows, several parts of such a workflow are reused
> at different points within the workflow. For example: Two datasets get
> scanned, INTERSECTED and JOINED to another dataset. The resulting dataset
> is used as JOIN partner for six other datasets. Using Table-API the
> resulting operator tree looks like:
> [image: Workflow]
>
> As you can see, the whole part of INTERSECTING and JOINING is executed for
> each reference. I guess this is because you decided to treat Flink Tables
> as VIEWs which get recalculated on each reference. In fact this doesn't
> make sense for our large workflows (note we're using the BatchEnvironment
> only). Is there any chance to avoid that behavior? Is there a possibility
> to allow Calcite to optimize/combine such common sub trees in the operator
> tree?
>
> Thanks in advance!
>
> Best,
> Elias
>


Re: [ANNOUNCE] Contributing Alibaba's Blink

2019-01-22 Thread Hequn Cheng
Hi all,

@Stephan  Thanks a lot for driving these efforts. I think a lot of people
are already waiting for this.
+1 for opening the Blink source code.
Either a separate repository or a special branch is ok for me. Hopefully,
this will not last too long.

Best, Hequn



[DISCUSS] Shall we make SpillableSubpartition repeatedly readable to support fine grained recovery

2019-01-22 Thread Bo WANG
Hi all,

When running the batch WordCount example, I configured the job execution
mode as BATCH_FORCED and the failover-strategy as region, and I manually
injected some errors to make the execution fail in different phases. In
some cases, the job could recover from the failure and succeed, but in
other cases, the job retried several times and then failed.

Example:
- If the failure occurred before the task read any data, e.g., it failed
before invokable.invoke() in Task.java, failover could succeed.
- If the failure occurred after the task had read data, failover did not
work.

Problem diagnosis:
Running the example described above, each ExecutionVertex is defined as a
restart region, and the ResultPartitionType between executions is BLOCKING.
Thus, SpillableSubpartition and SpillableSubpartitionView are used to
write/read the shuffle data, and the data blocks are held as
BufferConsumers stored in a list called buffers. When a task requests input
data from the SpillableSubpartitionView, the BufferConsumers are REMOVED
from buffers. Thus, when a failure occurs after data has been read, some
BufferConsumers have already been released. Although the tasks are retried,
their input data is incomplete.

Fix Proposal:
- A BufferConsumer should not be removed from buffers until the consuming
ExecutionVertex is terminal.
- The SpillableSubpartition should not be released until the consuming
ExecutionVertex is terminal.
- The SpillableSubpartition could create multiple
SpillableSubpartitionViews, each corresponding to one ExecutionAttempt (see
the sketch below).
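
A minimal sketch of the intended behavior, using simplified stand-in types
(plain byte arrays instead of BufferConsumers) and illustrative names that
are not Flink's actual internals:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RepeatablyReadableSubpartition {

    // Stand-in for the list of BufferConsumers; entries are never removed
    // when they are read, only once the consumer is terminal.
    private final List<byte[]> buffers = new ArrayList<>();
    private final Map<Integer, ReadView> viewsPerAttempt = new HashMap<>();

    void add(byte[] block) {
        buffers.add(block);
    }

    // One view per ExecutionAttempt instead of a single destructive reader,
    // so a restarted attempt can re-read the full data from the beginning.
    ReadView createReadView(int attemptNumber) {
        return viewsPerAttempt.computeIfAbsent(attemptNumber, a -> new ReadView());
    }

    // Buffers are only freed once the consuming ExecutionVertex is terminal.
    void onConsumerTerminal() {
        buffers.clear();
        viewsPerAttempt.clear();
    }

    class ReadView {

        private int next = 0; // per-view cursor instead of removing buffers

        byte[] pollNext() {
            return next < buffers.size() ? buffers.get(next++) : null;
        }
    }
}

The trade-off is that the spilled data has to be retained until the
consumer finishes, which costs extra disk space, but it makes the
subpartition repeatedly readable for fine-grained recovery.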

Best,
Bo


Re: [Feature]Returning RuntimeException to REST client while job submission

2019-01-22 Thread Lavkesh Lahngir
Or maybe I am missing something? It looks like the JIRA is trying to solve
the same issues I stated 🤔
In the main method, I just threw a simple new Exception("Some message"), as
in the sketch below, and I got the response I mentioned from the REST API.
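
A hypothetical minimal job that reproduces this (the class name is made up
for illustration):

public class FailingJob {

    public static void main(String[] args) throws Exception {
        // This root cause is what the REST response should surface; instead
        // it only reports the generic ProgramInvocationException wrapper.
        throw new Exception("Some message");
    }
}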

Thanks.

On Wed, Jan 23, 2019 at 2:50 PM Lavkesh Lahngir  wrote:

> Hello,
> The change in FLINK-10312 makes the REST response of the job submission
> API not very informative. It strips the stack trace and returns a generic
> message. People using a flink-cluster deployment who do not have access
> to the job manager logs will not be able to figure out the root cause.
> In the case when the job submission fails, in 1.6.2 I get:
>
> {
>   "errors": [
>     "org.apache.flink.client.program.ProgramInvocationException: The main method caused an error."
>   ]
> }
>
> Is there a plan to improve error messages sent to the client?
> Is somebody working on this already?
>
> Thanks in advance.
> ~Lavkesh
>


Request for permission

2019-01-22 Thread Run
Hi guys,

I want to contribute to Apache Flink.
Would you please grant me contributor permissions?
My JIRA ID is fan_li_ya.

Thanks a lot.

Best,
Liya Fan

[Feature]Returning RuntimeException to REST client while job submission

2019-01-22 Thread Lavkesh Lahngir
Hello,
The change in FLINK-10312 makes the REST response of the job submission
API not very informative. It strips the stack trace and returns a generic
message. People using a flink-cluster deployment who do not have access
to the job manager logs will not be able to figure out the root cause.
In the case when the job submission fails, in 1.6.2 I get:

{
  "errors": [
    "org.apache.flink.client.program.ProgramInvocationException: The main method caused an error."
  ]
}
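
For contrast, a sketch of how the full cause chain could be rendered into
the error payload. The helper below is purely illustrative and not an
existing Flink API:

final class ErrorChains {

    private ErrorChains() {}

    // Renders the wrapper and every nested cause, one per line, so the
    // client would also see e.g. "java.lang.Exception: Some message".
    static String describeCauseChain(Throwable t) {
        StringBuilder sb = new StringBuilder();
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
            sb.append(cause.getClass().getName())
              .append(": ")
              .append(cause.getMessage())
              .append('\n');
        }
        return sb.toString();
    }
}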

Is there a plan to improve error messages sent to the client?
Is somebody working on this already?

Thanks in advance.
~Lavkesh


[jira] [Created] (FLINK-11411) Number of failover regions of RestartPipelinedRegionStrategy not shown in log due to incorrect parameter

2019-01-22 Thread BoWang (JIRA)
BoWang created FLINK-11411:
--

 Summary: Number of failover regions of RestartPipelinedRegionStrategy
not shown in log due to incorrect parameter
 Key: FLINK-11411
 URL: https://issues.apache.org/jira/browse/FLINK-11411
 Project: Flink
  Issue Type: Improvement
Reporter: BoWang
Assignee: BoWang

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [ANNOUNCE] Contributing Alibaba's Blink

2019-01-22 Thread zhijiang
Glad to see this announcement. I have heard many users in China asking
recently when Blink will be open sourced, and thanks Stephan for making
this happen.

All the enhancements and features in Blink would eventually be
contributed/merged into Flink in a more fine-grained way. Before Blink and
Flink reach the same point, users might use Blink to enjoy some of the
advantages early.
So from the user-experience perspective, Blink might have to push changes
such as bug fixes before it is completely merged into Flink. This should be
taken into account when making the decision.

Best,
Zhijiang


--
From: Hequn Cheng 
Send Time: Wed, Jan 23, 2019, 10:55
To: dev 
Subject:Re: [ANNOUNCE] Contributing Alibaba's Blink

Hi all,

@Stephan  Thanks a lot for driving these efforts. I think a lot of people
are already waiting for this.
+1 for opening the Blink source code.
Either a separate repository or a special branch is OK for me. Hopefully,
this will not take too long.

Best, Hequn

