Hi Shaoxuan & Jincheng,

Thanks for driving this initiative. Python support would be a big boost for Flink adoption in the data science world. One additional suggestion: you may need to think about how to convert a Flink Table to a pandas DataFrame, since pandas is a very popular library in Python. You may also be interested in Apache Arrow, a common layer for transferring data efficiently across languages: https://arrow.apache.org/
vino yang <yanghua1...@gmail.com> wrote on Thu, Mar 28, 2019 at 2:44 PM:

> Hi jincheng,
>
> Thanks for activating this discussion again. I personally look forward to your design draft.
>
> Best,
> Vino
>
> jincheng sun <sunjincheng...@gmail.com> wrote on Thu, Mar 28, 2019 at 12:16 PM:
>
> > Hi everyone,
> >
> > Sorry for joining this discussion late.
> >
> > Thanks to Xianda Ke for initiating this discussion. I also enjoyed the discussions and suggestions from Max, Austin, Thomas, Shaoxuan and others.
> >
> > Recently, I have really felt the desire of the community and Flink users for Python support. Stephan also pointed out in the discussion of "Adding a mid-term roadmap" that the "Table API becomes primary API for analytics use cases", while a large number of users in analytics use cases are accustomed to the Python language, and a large accumulation of class libraries lives in the Python ecosystem.
> >
> > So I am very interested in participating in the discussion of supporting Python in Flink. With regard to the three options mentioned so far, it is a great encouragement to leverage Beam's language portability layer on Flink. For now, we can start with a first step in Flink by adding a Python Table API. I believe that, in this process, we will gain a deeper understanding of how Flink can support Python. If we can quickly let users experience a first version of the Flink Python Table API, we can also collect feedback from many users and consider the long-term goals of multi-language support on Flink.
> >
> > So if you agree, I volunteer to draft a document with the detailed design and implementation plan of the Python Table API on Flink.
> >
> > What do you think?
> >
> > Shaoxuan Wang <wshaox...@gmail.com> wrote on Thu, Feb 21, 2019 at 10:44 PM:
> >
> > > Hey guys,
> > >
> > > Thanks for your comments and sorry for the late reply.
> > >
> > > The Beam Python API and the Flink Python Table API describe the DAG/pipeline in different manners.
> > > We got a chance to communicate with Tyler Akidau (from Beam) offline, and explained why the Flink Table API needs a specific design for Python, rather than purely leveraging the Beam portability layer.
> > >
> > > In our proposal, most of the Python code is just a DAG/pipeline builder for the Table API. The majority of operators run purely in Java, while only Python UDFs are executed in a Python environment at runtime. This design does not affect the development and adoption of the Beam language portability layer with the Flink runner. The Flink and Beam communities will still collaborate, jointly developing and optimizing the JVM / non-JVM (Python, Go) bridge (the data and control connections between different processes) to ensure robustness and performance.
> > >
> > > Regards,
> > > Shaoxuan
> > >
> > > On Fri, Dec 21, 2018 at 1:39 PM Thomas Weise <t...@apache.org> wrote:
> > >
> > > > Interest in Python seems to be on the rise, so this is a good discussion to have :)
> > > >
> > > > So far there seems to be agreement that Beam's approach to Python and other non-JVM language support (language SDKs, portability layer, etc.) is the right direction? Specification and execution are native Python, and it does not suffer from the shortcomings of Flink's Jython API and a few other approaches.
> > > >
> > > > Overall there is already good alignment between Beam and Flink in concepts and model. There are also a few of us who are active in both communities. The Beam Flink runner has made a lot of progress this year, but work on portability in Beam actually started much earlier and was a very big change (originally there was just the Java SDK). Much of the code has been rewritten as part of the effort; that's what implementing a strong multi-language support story took.
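As a plain-Python sketch of this DAG-builder pattern (all class and method names here are made up for illustration, not the proposed Flink API): the Python objects only record the logical plan; no data is processed in Python, and execution would happen in the JVM.

```python
class Table:
    """Hypothetical client-side handle: records operations, executes nothing."""

    def __init__(self, plan):
        self.plan = plan  # a list of logical operations, built up lazily

    def select(self, *fields):
        # Append a logical node; no data is touched on the Python side.
        return Table(self.plan + [("select", fields)])

    def where(self, predicate):
        return Table(self.plan + [("where", predicate)])

    def explain(self):
        # In a real implementation the plan would be serialized and shipped
        # to the JVM, where Flink's optimizer and Java operators run; only
        # Python UDF nodes would call back into a Python worker.
        return " -> ".join(op for op, _ in self.plan)


t = Table([("scan", "clicks")]).select("user", "cnt").where("cnt > 5")
print(t.explain())  # scan -> select -> where
```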
> > > > To have a decent shot at it, the equivalent of much of the Beam portability framework would need to be reinvented in Flink. This would split resources and divert focus away from things that may be more core to Flink. As you can guess, I am in favor of option (1)!
> > > >
> > > > We could take a look at SQL for reference. The Flink community has invested a lot in SQL and a lot of work remains; the Beam community has done the same, and we have two completely separate implementations. When I recently learned more about the Beam SQL work, one of my first questions was whether a joint effort would not lead to better user value. Calcite is common, but isn't there much more that could be shared? Someone had the idea that, in such a world, Flink could substitute portions or all of the graph provided by Beam with its own optimized version, but much of the tooling could be the same.
> > > >
> > > > IO connectors are another area where much effort is repeated. It takes a very long time to arrive at a solid, production-quality implementation (typically resulting from broad user exposure and running at scale). Currently there is discussion in both projects of how connectors can be done much better: SDF in Beam [1] and FLIP-27.
> > > >
> > > > It's a trade-off, but more synergy would be great!
> > > >
> > > > Thomas
> > > >
> > > > [1] https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/
> > > >
> > > > On Tue, Dec 18, 2018 at 2:16 PM Austin Bennett <whatwouldausti...@gmail.com> wrote:
> > > >
> > > > > Hi Shaoxuan,
> > > > >
> > > > > FWIW, Kenn Knowles (Beam PMC Chair) recently gave a talk at the Bay Area Apache Beam Meetup [1] which included a bit on a vision for how Beam could better leverage runner-specific optimizations -- for example, Beam SQL leveraging Flink-specific SQL optimizations (to address your point). So that is part of the eventual roadmap for Beam, and it illustrates how concrete efforts towards optimizations in the Runner/SDK Harness would likely yield the desired result of cross-language support (which could be done by leveraging Beam while devoting focus to optimizing that processing on Flink).
> > > > >
> > > > > Cheers,
> > > > > Austin
> > > > >
> > > > > [1] https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/ -- I can post/share videos once available should someone desire.
> > > > >
> > > > > On Fri, Dec 14, 2018 at 6:03 AM Maximilian Michels <m...@apache.org> wrote:
> > > > >
> > > > > > Hi Xianda, hi Shaoxuan,
> > > > > >
> > > > > > I'd be in favor of option (1). There is great potential in Beam and Flink joining forces on this one. Here's why:
> > > > > >
> > > > > > The Beam project spent at least a year developing a portability layer, with a reasonable number of people working on it. Developing a new portability layer from scratch will probably take about the same amount of time and resources.
> > > > > > Concerning option (2): There is already a Python API for Flink, but an API is only one part of the portability story. In Beam, portability is structured into three components:
> > > > > >
> > > > > > - SDK (the API, its Protobuf serialization, and interaction with the SDK Harness)
> > > > > > - Runner (translation from the Protobuf pipeline to a Flink job)
> > > > > > - SDK Harness (UDF execution, interaction with the SDK and the execution engine)
> > > > > >
> > > > > > I could imagine the Flink Python API being another SDK, which could have its own API but would reuse code for the interaction with the SDK Harness.
> > > > > >
> > > > > > We would be able to focus on the optimizations instead of rebuilding a portability layer from scratch.
> > > > > >
> > > > > > Thanks,
> > > > > > Max
> > > > > >
> > > > > > On 13.12.18 11:52, Shaoxuan Wang wrote:
> > > > > > > RE: Stephan's options (http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-python-and-flink-streaming-python-td25793.html):
> > > > > > > * Option (1): Language portability via Apache Beam
> > > > > > > * Option (2): Implement own Python API
> > > > > > > * Option (3): Implement own portability layer
> > > > > > >
> > > > > > > Hi Stephan,
> > > > > > >
> > > > > > > Eventually, I think we should support both option 1 and option 3. IMO, these two options are orthogonal. I agree with you that we can leverage the existing work and ecosystem of Beam by supporting option 1. But the problem with Beam is that it skips (to the best of my knowledge) the natural Table/SQL optimization framework provided by Flink.
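A toy sketch of that three-way split, with JSON standing in for the Protobuf pipeline representation (none of these names are real Beam or Flink identifiers; this only illustrates the separation of concerns):

```python
import json

# --- SDK side: build a language-neutral description of the pipeline ------
def build_pipeline():
    # In Beam the description is Protobuf; JSON is a stand-in here.
    return json.dumps({
        "transforms": [
            {"name": "read", "kind": "source"},
            {"name": "my_udf", "kind": "python_udf"},
            {"name": "write", "kind": "sink"},
        ]
    })

# --- Runner side: translate the description into engine operations -------
def translate(pipeline_json):
    transforms = json.loads(pipeline_json)["transforms"]
    # A real runner would emit Flink operators; "python_udf" transforms
    # would be wired to an SDK Harness process that executes the UDF.
    return [f"flink_op:{t['name']}" for t in transforms]

print(translate(build_pipeline()))
# ['flink_op:read', 'flink_op:my_udf', 'flink_op:write']
```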
> > > > > > > We should spend all the needed effort to support option 1 (as it is the better alternative to the current Flink Python API), but we cannot solely bet on it. Option 3 is the ideal choice for Flink to support all non-JVM languages, and we should plan to achieve it. We have done some preliminary prototypes for options 2 and 3, and they do not seem overly complex or difficult to accomplish.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Shaoxuan
> > > > > > >
> > > > > > > On Thu, Dec 13, 2018 at 4:58 PM Xianda Ke <kexia...@gmail.com> wrote:
> > > > > > >
> > > > > > >> Currently there is an ongoing survey about Python usage of Flink [1]. Some discussion was also brought up there regarding the non-JVM language support strategy in general. To avoid polluting the survey thread, we are starting this discussion thread and would like to move the discussions here.
> > > > > > >>
> > > > > > >> In the interest of facilitating the discussion, we would like to first share the following design doc, which describes what we have done at Alibaba for a Python API for Flink. It could serve as a good reference for the discussion.
> > > > > > >>
> > > > > > >> [DISCUSS] Flink Python API <https://docs.google.com/document/d/1JNGWdLwbo_btq9RVrc1PjWJV3lYUgPvK0uEWDIfVNJI/edit?usp=drive_web>
> > > > > > >>
> > > > > > >> As of now, we've implemented and delivered Python UDFs for SQL to the internal users at Alibaba. We are starting to implement the Python API.
> > > > > > >> To recap and continue the discussion from the survey thread, I agree with @Stephan that we should figure out in which general direction Python support should go. Stephan also listed three options there:
> > > > > > >> * Option (1): Language portability via Apache Beam
> > > > > > >> * Option (2): Implement own Python API
> > > > > > >> * Option (3): Implement own portability layer
> > > > > > >>
> > > > > > >> From my perspective:
> > > > > > >>
> > > > > > >> (1) Flink's language APIs and Beam's language support are not mutually exclusive. It is nice that Beam has Python/NodeJS/Go APIs and supports Flink as the runner. Flink's own Python (or NodeJS/Go) APIs will benefit Flink's ecosystem.
> > > > > > >>
> > > > > > >> (2) Python API / portability layer. To support non-JVM languages in Flink:
> > > > > > >> * at the client side, Flink would provide language interfaces, which translate the user's application into a Flink StreamGraph;
> > > > > > >> * at the server side, Flink would execute the user's UDF code at runtime.
> > > > > > >> The non-JVM languages communicate with the JVM via RPC (or low-level sockets, an embedded interpreter, and so on). What the portability layer can do is perhaps abstract the RPC layer. Even when the portability layer is ready, there is still a lot of work to do for a specific language.
> > > > > > >> For Python, say, we may still have to write the interface classes by hand for the users, because generated code without detailed documentation is unacceptable to users, or handle the serialization of lambdas/closures, which is not a built-in feature in Python. Maybe we can start with the Python API, then extend to other languages and abstract the common logic into the portability layer.
> > > > > > >>
> > > > > > >> ---
> > > > > > >> References:
> > > > > > >> [1] [SURVEY] Usage of flink-python and flink-streaming-python
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Xianda

--
Best Regards

Jeff Zhang
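The lambda/closure serialization issue Xianda raises above is easy to demonstrate with only the standard library; it is why projects in this space typically ship UDFs with a by-value serializer such as cloudpickle rather than plain pickle:

```python
import pickle

# A UDF written as a lambda -- a natural style in a Python Table API.
udf = lambda x: x * 2

# pickle serializes functions by their module-level name; a lambda has no
# usable name, so pickling it fails. Shipping such a UDF to a remote worker
# therefore needs a serializer that captures the code object and closure
# by value (cloudpickle is the common third-party choice).
try:
    pickle.dumps(udf)
    serializable = True
except (pickle.PicklingError, AttributeError):
    serializable = False

print(serializable)  # False
```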