Thanks for your feedback, Vino and Jeff! I have started a new thread outlining what we are proposing in the Python Table API.
The discussion thread "[DISCUSS] FLIP-38 Support python language in flink TableAPI" can be found here: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-38-Support-python-language-in-flink-TableAPI-td28061.html

Best,
Jincheng

On Thu, Mar 28, 2019 at 4:59 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> Hi Shaoxuan & Jincheng,

> Thanks for driving this initiative. Python would be a very big add-on for Flink adoption in the data science world. One additional suggestion: you may need to think about how to convert a Flink Table to a pandas DataFrame, since pandas is a very popular library in Python. You may also be interested in Apache Arrow, a common layer for transferring data efficiently across languages: https://arrow.apache.org/

> On Thu, Mar 28, 2019 at 2:44 PM, vino yang <yanghua1...@gmail.com> wrote:

>> Hi Jincheng,

>> Thanks for activating this discussion again. I personally look forward to your design draft.

>> Best,
>> Vino

>> On Thu, Mar 28, 2019 at 12:16 PM, jincheng sun <sunjincheng...@gmail.com> wrote:

>> > Hi everyone,
>> > Sorry to join this discussion late.

>> > Thanks to Xianda Ke for initiating this discussion. I also enjoyed the discussions and suggestions from Max, Austin, Thomas, Shaoxuan and others.

>> > Recently, I have clearly felt the desire of the community and of Flink users for Python support. Stephan also pointed out in the discussion on `Adding a mid-term roadmap` that "Table API becomes primary API for analytics use cases", while a large number of users in analytics use cases are accustomed to the Python language and have accumulated a large body of class libraries in the Python ecosystem.

>> > So I am very interested in participating in the discussion of supporting Python in Flink. With regard to the three options mentioned so far, leveraging Beam's language portability layer on Flink is greatly encouraging. For now, we can take a first step in Flink by adding a Python Table API. I believe that in this process we will gain a deeper understanding of how Flink can support Python. If we can quickly let users experience a first version of the Flink Python Table API, we can also collect feedback from many users and then consider the long-term goals of multi-language support on Flink.

>> > So if you agree, I volunteer to draft a document with the detailed design and implementation plan for the Python Table API on Flink.

>> > What do you think?

>> > On Thu, Feb 21, 2019 at 10:44 PM, Shaoxuan Wang <wshaox...@gmail.com> wrote:

>> > > Hey guys,
>> > > Thanks for your comments and sorry for the late reply.

>> > > The Beam Python API and the Flink Python Table API describe the DAG/pipeline in different manners. We got a chance to communicate with Tyler Akidau (from Beam) offline and explained why the Flink Table API needs a specific design for Python rather than purely leveraging the Beam portability layer.

>> > > In our proposal, most of the Python code is just a DAG/pipeline builder for the Table API. The majority of operators run purely in Java, while only Python UDFs are executed in a Python environment at runtime. This design does not affect the development and adoption of the Beam language portability layer with the Flink runner. The Flink and Beam communities will still collaborate, jointly developing and optimizing the JVM / non-JVM (Python, Go) bridge (the data and control connections between different processes) to ensure robustness and performance.

>> > > Regards,
>> > > Shaoxuan
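[To make the "DAG builder" idea above concrete, here is a minimal, self-contained Python sketch of the pattern. It is a toy illustration only; none of these names are actual Flink APIs. Every method call merely records a plan node, and nothing except the eventual UDFs would run in Python.]

    # Toy sketch of a "Python as DAG builder" Table-style API.
    # Each call only appends a node to a plan; in the proposed design
    # the plan would be handed to the Java Table API for optimization
    # and execution, with Python workers used solely for Python UDFs.
    class Table:
        def __init__(self, plan):
            self.plan = plan  # list of (operator, argument) pairs

        def select(self, expr):
            return Table(self.plan + [("select", expr)])

        def group_by(self, keys):
            return Table(self.plan + [("group_by", keys)])

    def scan(name):
        return Table([("scan", name)])

    t = scan("Orders").group_by("user").select("user, amount.sum")
    print(t.plan)
    # [('scan', 'Orders'), ('group_by', 'user'), ('select', 'user, amount.sum')]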
>> > > On Fri, Dec 21, 2018 at 1:39 PM Thomas Weise <t...@apache.org> wrote:

>> > > > Interest in Python seems to be on the rise, so this is a good discussion to have :)

>> > > > So far there seems to be agreement that Beam's approach towards Python and other non-JVM language support (language SDKs, portability layer, etc.) is the right direction? Specification and execution are native Python, and it does not suffer from the shortcomings of Flink's Jython API and a few other approaches.

>> > > > Overall, there is already good alignment between Beam and Flink in concepts and model. There are also a few of us who are active in both communities. The Beam Flink runner has made a lot of progress this year, but work on portability in Beam actually started much before that and was a very big change (originally there was just the Java SDK). Much of the code has been rewritten as part of the effort; that's what implementing a strong multi-language support story took. To have a decent shot at it, the equivalent of much of the Beam portability framework would need to be reinvented in Flink. This would fork resources and divert focus away from things that may be more core to Flink. As you can guess, I am in favor of option (1)!

>> > > > We could take a look at SQL for reference. The Flink community has invested a lot in SQL, and there remains a lot of work to do. The Beam community has done the same, and we have two completely separate implementations. When I recently learned more about the Beam SQL work, one of my first questions was whether a joint effort would not lead to better user value. Calcite is common, but isn't there much more that could be shared? Someone had the idea that, in such a world, Flink could substitute portions or all of the graph provided by Beam with its own optimized version, while much of the tooling could stay the same.

>> > > > IO connectors are another area where much effort is repeated. It takes a very long time to arrive at a solid, production-quality implementation (typically resulting from broad user exposure and running at scale). There are currently discussions about how connectors can be done much better in both projects: SDF in Beam [1] and FLIP-27.

>> > > > It's a trade-off, but more synergy would be great!

>> > > > Thomas

>> > > > [1] https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/
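[Picking up Jeff's earlier suggestion about pandas and Apache Arrow: below is a small, runnable illustration of the Arrow-to-pandas hop on the Python side, using the actual pyarrow API. How a Flink Table would produce the Arrow batches is exactly the open design question; the data here is made up. Requires pyarrow and pandas installed.]

    import pyarrow as pa

    # An Arrow record batch is the kind of language-neutral, columnar
    # chunk a JVM process could hand to Python without per-row
    # serialization overhead.
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
        names=["id", "name"],
    )

    # Cheap conversion into the pandas world on the Python side.
    df = batch.to_pandas()
    print(df)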
>> > > > On Tue, Dec 18, 2018 at 2:16 PM Austin Bennett <whatwouldausti...@gmail.com> wrote:

>> > > > > Hi Shaoxuan,

>> > > > > FWIW, Kenn Knowles (Beam PMC Chair) recently gave a talk at the Bay Area Apache Beam Meetup [1] which included a bit on a vision for how Beam could better leverage runner-specific optimizations -- as an example/extension, Beam SQL leveraging Flink-specific SQL optimizations (to address your point). So that is part of the eventual roadmap for Beam, and it illustrates how concrete efforts towards optimizations in the Runner/SDK Harness would likely yield the desired result of cross-language support (which could be achieved by leveraging Beam and devoting the focus to optimizing that processing on Flink).

>> > > > > Cheers,
>> > > > > Austin

>> > > > > [1] https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/ -- I can post/share the videos once available, should someone desire.

>> > > > > On Fri, Dec 14, 2018 at 6:03 AM Maximilian Michels <m...@apache.org> wrote:

>> > > > > > Hi Xianda, hi Shaoxuan,

>> > > > > > I'd be in favor of option (1). There is great potential in Beam and Flink joining forces on this one. Here's why:

>> > > > > > The Beam project spent at least a year developing a portability layer, with a reasonable number of people working on it. Developing a new portability layer from scratch will probably take about the same amount of time and resources.

>> > > > > > Concerning option (2): There is already a Python API for Flink, but an API is only one part of the portability story. In Beam the portability is structured into three components:

>> > > > > > - SDK (the API, its Protobuf serialization, and the interaction with the SDK Harness)
>> > > > > > - Runner (translation from the Protobuf pipeline to a Flink job)
>> > > > > > - SDK Harness (UDF execution, interaction with the SDK and the execution engine)

>> > > > > > I could imagine the Flink Python API being another SDK, which could have its own API but would reuse the code for the interaction with the SDK Harness.

>> > > > > > We would be able to focus on the optimizations instead of rebuilding a portability layer from scratch.

>> > > > > > Thanks,
>> > > > > > Max
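[For reference, this is roughly what the division of labor Max describes looks like from the user side today: a minimal sketch using Beam's real Python SDK against the portable Flink runner. The flags and the job-server endpoint are assumptions that vary with the Beam version and the local setup.]

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The SDK turns this program into a language-neutral (Protobuf)
    # pipeline; the PortableRunner submits it to a job server sitting
    # in front of Flink. The endpoint below is an assumed local setup.
    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(["a", "b", "a"])       # pipeline construction (SDK)
         | beam.combiners.Count.PerElement()  # translated and run by the Flink runner
         | beam.Map(print))                   # this UDF executes in the SDK Harness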
>> > > > > > On 13.12.18 11:52, Shaoxuan Wang wrote:

>> > > > > > > RE: Stephan's options (http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-python-and-flink-streaming-python-td25793.html):
>> > > > > > > * Option (1): Language portability via Apache Beam
>> > > > > > > * Option (2): Implement own Python API
>> > > > > > > * Option (3): Implement own portability layer

>> > > > > > > Hi Stephan,
>> > > > > > > Eventually, I think we should support both option (1) and option (3). IMO, these two options are orthogonal. I agree with you that we can leverage the existing work and ecosystem in Beam by supporting option (1). But the problem with Beam is that it skips (to the best of my knowledge) the natural Table/SQL optimization framework provided by Flink. We should spend all the needed effort to support option (1), as it is the better alternative to the current Flink Python API, but we cannot bet solely on it. Option (3) is the ideal choice for Flink to support all non-JVM languages, and we should plan to achieve it. We have done some preliminary prototypes for options (2) and (3), and they seem neither very complex nor difficult to accomplish.

>> > > > > > > Regards,
>> > > > > > > Shaoxuan

>> > > > > > > On Thu, Dec 13, 2018 at 4:58 PM Xianda Ke <kexia...@gmail.com> wrote:

>> > > > > > >> Currently there is an ongoing survey about the Python usage of Flink [1]. Some discussion was also brought up there regarding the non-JVM language support strategy in general. To avoid polluting the survey thread, we are starting this discussion thread and would like to move the discussions here.

>> > > > > > >> In the interest of facilitating the discussion, we would first like to share the following design doc, which describes what we have done at Alibaba about a Python API for Flink. It can serve as a good reference for the discussion.

>> > > > > > >> [DISCUSS] Flink Python API
>> > > > > > >> <https://docs.google.com/document/d/1JNGWdLwbo_btq9RVrc1PjWJV3lYUgPvK0uEWDIfVNJI/edit?usp=drive_web>

>> > > > > > >> As of now, we have implemented and delivered Python UDFs for SQL to internal users at Alibaba, and we are starting to implement the Python API.

>> > > > > > >> To recap and continue the discussion from the survey thread, I agree with @Stephan that we should figure out in which general direction Python support should go.
>> > > > > > >> Stephan also listed three options there:
>> > > > > > >> * Option (1): Language portability via Apache Beam
>> > > > > > >> * Option (2): Implement own Python API
>> > > > > > >> * Option (3): Implement own portability layer

>> > > > > > >> From my perspective:

>> > > > > > >> (1) Flink's language APIs and Beam's language support are not mutually exclusive. It is nice that Beam has Python/NodeJS/Go APIs and supports Flink as the runner, and Flink's own Python (or NodeJS/Go) APIs will benefit Flink's ecosystem.

>> > > > > > >> (2) Python API / portability layer. To support non-JVM languages in Flink:
>> > > > > > >> * at the client side, Flink would provide language interfaces, which translate the user's application into a Flink StreamGraph;
>> > > > > > >> * at the server side, Flink would execute the user's UDF code at runtime.
>> > > > > > >> The non-JVM languages communicate with the JVM via RPC (or low-level sockets, an embedded interpreter, and so on). What the portability layer can do, perhaps, is abstract this RPC layer. Even when the portability layer is ready, there is still a lot of work left for any specific language. For Python, say, we may still have to write the interface classes by hand, because generated code without detailed documentation is unacceptable to users, and we have to handle the serialization of lambdas/closures, which is not a built-in feature in Python. Maybe we can start with the Python API, then extend to other languages and abstract the common logic into the portability layer.

>> > > > > > >> ---
>> > > > > > >> References:
>> > > > > > >> [1] [SURVEY] Usage of flink-python and flink-streaming-python

>> > > > > > >> Regards,
>> > > > > > >> Xianda

> --
> Best Regards
>
> Jeff Zhang
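[One concrete note on Xianda's lambda/closure point above: the standard library's pickle cannot serialize lambdas, which is why systems that ship Python UDFs to worker processes typically reach for something like the third-party cloudpickle library. A minimal illustration of one possible approach, not a settled design choice:]

    import pickle
    import cloudpickle  # third-party: pip install cloudpickle

    udf = lambda x: x + 1

    # The stdlib pickle serializes functions by reference and fails here.
    try:
        pickle.dumps(udf)
    except Exception as e:
        print("stdlib pickle failed:", e)

    # cloudpickle serializes the code object itself, so the lambda can
    # be shipped to a remote Python worker process and executed there.
    payload = cloudpickle.dumps(udf)
    restored = cloudpickle.loads(payload)
    print(restored(41))  # -> 42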