Interest in Python seems to be on the rise, so this is a good discussion to have :)
So far there seems to be agreement that Beam's approach to Python and other non-JVM language support (language SDKs, portability layer, etc.) is the right direction: specification and execution are native Python, and it does not suffer from the shortcomings of Flink's Jython API and a few other approaches. Overall there is already good alignment between Beam and Flink in concepts and model, and a few of us are active in both communities.

The Beam Flink runner has made a lot of progress this year, but work on portability in Beam actually started well before that and was a very big change (originally there was just the Java SDK). Much of the code has been rewritten as part of that effort; that is what implementing a strong multi-language support story took. To have a decent shot at it, the equivalent of much of the Beam portability framework would need to be reinvented in Flink. This would fork resources and divert focus away from things that may be more core to Flink. As you can guess, I am in favor of option (1)!

We could take a look at SQL for reference. The Flink community has invested a lot in SQL, and a lot of work remains; the Beam community has done the same, and we now have two completely separate implementations. When I recently learned more about the Beam SQL work, one of my first questions was whether a joint effort would not lead to better user value. Calcite is common, but isn't there much more that could be shared? Someone had the idea that in such a world Flink could substitute portions (or all) of the graph provided by Beam with its own optimized version, while much of the tooling stayed the same.

IO connectors are another area where much effort is repeated. It takes a very long time to arrive at a solid, production-quality implementation (typically resulting from broad user exposure and running at scale). Currently there is discussion about how connectors can be done much better in both projects: SDF in Beam [1] and FLIP-27 in Flink.
It's a trade-off, but more synergy would be great!

Thomas

[1] https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/

On Tue, Dec 18, 2018 at 2:16 PM Austin Bennett <whatwouldausti...@gmail.com> wrote:

> Hi Shaoxuan,
>
> FWIW, Kenn Knowles (Beam PMC Chair) recently gave a talk at the Bay Area Apache Beam Meetup [1] which included a bit on a vision for how Beam could better leverage runner-specific optimizations -- as an example/extension, Beam SQL leveraging Flink-specific SQL optimizations (to address your point). So that is part of the eventual roadmap for Beam, and it illustrates how concrete efforts towards optimizations in the runner/SDK harness would likely yield the desired result of cross-language support (which could be done by leveraging Beam, while devoting focus to optimizing that processing on Flink).
>
> Cheers,
> Austin
>
> [1] https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/ -- I can post/share videos once available, should someone desire.
>
> On Fri, Dec 14, 2018 at 6:03 AM Maximilian Michels <m...@apache.org> wrote:
>
> > Hi Xianda, hi Shaoxuan,
> >
> > I'd be in favor of option (1). There is great potential in Beam and Flink joining forces on this one. Here's why:
> >
> > The Beam project spent at least a year developing a portability layer, with a reasonable number of people working on it. Developing a new portability layer from scratch will probably take about the same amount of time and resources.
> >
> > Concerning option (2): there is already a Python API for Flink, but an API is only one part of the portability story.
> > In Beam, portability is structured into three components:
> >
> > - SDK (the API, its Protobuf serialization, and interaction with the SDK harness)
> > - Runner (translation from the Protobuf pipeline to a Flink job)
> > - SDK harness (UDF execution, interaction with the SDK and the execution engine)
> >
> > I could imagine the Flink Python API would be another SDK, which could have its own API but would reuse the code for interaction with the SDK harness.
> >
> > We would be able to focus on the optimizations instead of rebuilding a portability layer from scratch.
> >
> > Thanks,
> > Max
> >
> > On 13.12.18 11:52, Shaoxuan Wang wrote:
> > > RE: Stephan's options (
> > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-python-and-flink-streaming-python-td25793.html
> > > )
> > > * Option (1): Language portability via Apache Beam
> > > * Option (2): Implement own Python API
> > > * Option (3): Implement own portability layer
> > >
> > > Hi Stephan,
> > > Eventually, I think we should support both option 1 and option 3. IMO, these two options are orthogonal. I agree with you that we can leverage the existing work and ecosystem in Beam by supporting option 1. But the problem with Beam is that it skips (to the best of my knowledge) the natural table/SQL optimization framework provided by Flink. We should spend all the needed effort to support option 1 (as it is the better alternative to the current Flink Python API), but we cannot solely bet on it. Option 3 is the ideal choice for Flink to support all non-JVM languages, which we should plan to achieve. We have done some preliminary prototypes for options 2 and 3, and it does not seem too complex or difficult to accomplish.
> > >
> > > Regards,
> > > Shaoxuan
> > >
> > > On Thu, Dec 13, 2018 at 4:58 PM Xianda Ke <kexia...@gmail.com> wrote:
> > >
> > >> Currently there is an ongoing survey about the Python usage of Flink [1]. Some discussion was also brought up there regarding non-JVM language support strategy in general. To avoid polluting the survey thread, we are starting this discussion thread and would like to move the discussions here.
> > >>
> > >> In the interest of facilitating the discussion, we would like to first share the following design doc, which describes what we have done at Alibaba about a Python API for Flink. It could serve as a good reference for the discussion.
> > >>
> > >> [DISCUSS] Flink Python API
> > >> <https://docs.google.com/document/d/1JNGWdLwbo_btq9RVrc1PjWJV3lYUgPvK0uEWDIfVNJI/edit?usp=drive_web>
> > >>
> > >> As of now, we've implemented and delivered Python UDFs for SQL for the internal users at Alibaba, and we are starting to implement a Python API.
> > >>
> > >> To recap and continue the discussion from the survey thread, I agree with @Stephan that we should figure out in which general direction Python support should go. Stephan also listed three options there:
> > >> * Option (1): Language portability via Apache Beam
> > >> * Option (2): Implement own Python API
> > >> * Option (3): Implement own portability layer
> > >>
> > >> From my perspective:
> > >> (1) Flink language APIs and Beam's language support are not mutually exclusive. It is nice that Beam has Python/NodeJS/Go APIs and supports Flink as the runner. Flink's own Python (or NodeJS/Go) APIs will benefit Flink's ecosystem.
> > >>
> > >> (2) Python API / portability layer
> > >> To support non-JVM languages in Flink:
> > >> * at the client side, Flink would provide language interfaces, which translate the user's application to a Flink StreamGraph;
> > >> * at the server side, Flink would execute the user's UDF code at runtime.
> > >> The non-JVM languages communicate with the JVM via RPC (or low-level sockets, an embedded interpreter, and so on). What the portability layer can do, perhaps, is abstract that RPC layer. Even when the portability layer is ready, there is still a lot of work to do for a specific language. Take Python: we may still have to write the interface classes by hand for the users, because generated code without detailed documentation is unacceptable to users, or handle the serialization of lambdas/closures, which is not a built-in feature in Python. Maybe we can start with a Python API, then extend to other languages and abstract the common logic into the portability layer.
> > >>
> > >> ---
> > >> References:
> > >> [1] [SURVEY] Usage of flink-python and flink-streaming-python
> > >>
> > >> Regards,
> > >> Xianda
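The three-component split Max describes (SDK, runner, SDK harness) can be caricatured in a few lines of stdlib Python. Everything here is illustrative only -- the function names, the registry, and the JSON wire format are made up for this sketch; Beam's real portability framework uses Protobuf pipeline definitions and gRPC services between runner and SDK harness.

```python
import json

# 1) "SDK": the user-facing API; it serializes the pipeline to a
#    language-neutral form that crosses the language boundary.
def build_pipeline():
    return {"transforms": [
        {"urn": "read:collection", "args": [1, 2, 3]},
        {"urn": "map:udf", "udf": "double"},
    ]}

# 2) "Runner": translates the serialized pipeline into engine
#    operations (in Flink's case, a Flink job).
def translate(pipeline):
    return [(t["urn"], t) for t in pipeline["transforms"]]

# 3) "SDK harness": executes user code (UDFs) on behalf of the
#    engine, in the UDF's native language.
UDF_REGISTRY = {"double": lambda x: x * 2}

def run(ops):
    data = []
    for urn, t in ops:
        if urn == "read:collection":
            data = list(t["args"])
        elif urn == "map:udf":
            data = [UDF_REGISTRY[t["udf"]](x) for x in data]
    return data

wire_format = json.dumps(build_pipeline())      # what the SDK emits
print(run(translate(json.loads(wire_format))))  # [2, 4, 6]
```

The point of the split is that a new language SDK only needs to emit the wire format and speak the harness protocol; the runner translation is written once per engine.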
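Xianda's point about lambda/closure serialization is worth making concrete. Python's built-in pickle serializes functions by reference (module plus qualified name), so module-level functions round-trip but lambdas do not -- which is why SDKs in this space typically lean on third-party picklers such as dill or cloudpickle for UDFs. A minimal demonstration:

```python
import pickle

# A module-level function pickles fine: pickle stores only a
# reference (module name + qualified name), not the function's code.
def double(x):
    return x * 2

restored = pickle.loads(pickle.dumps(double))
print(restored(21))  # 42

# A lambda has no importable qualified name, so the built-in pickler
# rejects it -- the serialization gap mentioned above.
try:
    pickle.dumps(lambda x: x * 2)
    lambda_picklable = True
except Exception:
    lambda_picklable = False

print(lambda_picklable)  # False
```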