Hi Shaoxuan & Jincheng,

Thanks for driving this initiative. Python support would be a big boost for Flink adoption in the data science world. One additional suggestion: you may need to think about how to convert a Flink Table to a pandas DataFrame, since pandas is a very popular library in Python. You may also be interested in Apache Arrow, a common layer for transferring data efficiently across languages: https://arrow.apache.org/
vino yang <yanghua1...@gmail.com> wrote on Thu, Mar 28, 2019 at 2:44 PM:

> Hi jincheng,
>
> Thanks for activating this discussion again. I personally look forward to your design draft.
>
> Best,
> Vino
>
> jincheng sun <sunjincheng...@gmail.com> wrote on Thu, Mar 28, 2019 at 12:16 PM:
>
> > Hi everyone,
> >
> > Sorry for joining this discussion late.
> >
> > Thanks to Xianda Ke for initiating this discussion. I also enjoyed the discussions and suggestions from Max, Austin, Thomas, Shaoxuan and others.
> >
> > Recently, I have really felt the desire of the community and Flink users for Python support. Stephan also pointed out in the discussion of "Adding a mid-term roadmap" that the "Table API becomes primary API for analytics use cases", while a large number of users in analytics use cases are accustomed to the Python language, and a large accumulation of class libraries lives in the Python ecosystem.
> >
> > So I am very interested in participating in the discussion of supporting Python in Flink. With regard to the three options mentioned so far, it is a great encouragement to leverage Beam's language portability layer on Flink. For now, we can start with a first step in Flink by adding a Python Table API. I believe that, in this process, we will gain a deeper understanding of how Flink can support Python. If we can quickly let users experience a first version of the Flink Python Table API, we can also collect feedback from many users and consider the long-term goals of multi-language support on Flink.
> >
> > So if you agree, I volunteer to draft a document with the detailed design and implementation plan of the Python Table API on Flink.
> >
> > What do you think?
> >
> > Shaoxuan Wang <wshaox...@gmail.com> wrote on Thu, Feb 21, 2019 at 10:44 PM:
> >
> > > Hey guys,
> > >
> > > Thanks for your comments and sorry for the late reply.
> > >
> > > The Beam Python API and the Flink Python Table API describe the DAG/pipeline in different manners.
> > > We got a chance to communicate with Tyler Akidau (from Beam) offline, and explained why the Flink Table API needs a specific design for Python, rather than purely leveraging the Beam portability layer.
> > >
> > > In our proposal, most of the Python code is just a DAG/pipeline builder for the Table API. The majority of operators run purely in Java, while only Python UDFs are executed in a Python environment at runtime. This design does not affect the development and adoption of the Beam language portability layer with the Flink runner. The Flink and Beam communities will still collaborate, jointly developing and optimizing the JVM / non-JVM (Python, Go) bridge (the data and control connections between different processes) to ensure robustness and performance.
> > >
> > > Regards,
> > > Shaoxuan
> > >
> > > On Fri, Dec 21, 2018 at 1:39 PM Thomas Weise <t...@apache.org> wrote:
> > >
> > > > Interest in Python seems to be on the rise, so this is a good discussion to have :)
> > > >
> > > > So far there seems to be agreement that Beam's approach to Python and other non-JVM language support (language SDKs, portability layer, etc.) is the right direction? Specification and execution are native Python, and it does not suffer from the shortcomings of Flink's Jython API and a few other approaches.
> > > >
> > > > Overall there is already good alignment between Beam and Flink in concepts and model. There are also a few of us who are active in both communities. The Beam Flink runner has made a lot of progress this year, but work on portability in Beam actually started much earlier and was a very big change (originally there was just the Java SDK). Much of the code has been rewritten as part of the effort; that's what implementing a strong multi-language support story took.
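As a plain-Python sketch of this DAG-builder pattern (all class and method names here are made up for illustration, not the proposed Flink API): the Python objects only record the logical plan; no data is processed in Python, and execution would happen in the JVM.

```python
class Table:
    """Hypothetical client-side handle: records operations, executes nothing."""

    def __init__(self, plan):
        self.plan = plan  # a list of logical operations, built up lazily

    def select(self, *fields):
        # Append a logical node; no data is touched on the Python side.
        return Table(self.plan + [("select", fields)])

    def where(self, predicate):
        return Table(self.plan + [("where", predicate)])

    def explain(self):
        # In a real implementation the plan would be serialized and shipped
        # to the JVM, where Flink's optimizer and Java operators run; only
        # Python UDF nodes would call back into a Python worker.
        return " -> ".join(op for op, _ in self.plan)


t = Table([("scan", "clicks")]).select("user", "cnt").where("cnt > 5")
print(t.explain())  # scan -> select -> where
```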
> > > > To have a decent shot at it, the equivalent of much of the Beam portability framework would need to be reinvented in Flink. This would split resources and divert focus away from things that may be more core to Flink. As you can guess, I am in favor of option (1)!
> > > >
> > > > We could take a look at SQL for reference. The Flink community has invested a lot in SQL and a lot of work remains; the Beam community has done the same, and we have two completely separate implementations. When I recently learned more about the Beam SQL work, one of my first questions was whether a joint effort would not lead to better user value. Calcite is common, but isn't there much more that could be shared? Someone had the idea that, in such a world, Flink could substitute portions or all of the graph provided by Beam with its own optimized version, but much of the tooling could be the same.
> > > >
> > > > IO connectors are another area where much effort is repeated. It takes a very long time to arrive at a solid, production-quality implementation (typically resulting from broad user exposure and running at scale). Currently there is discussion in both projects of how connectors can be done much better: SDF in Beam [1] and FLIP-27.
> > > >
> > > > It's a trade-off, but more synergy would be great!
> > > >
> > > > Thomas
> > > >
> > > > [1] https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/
> > > >
> > > > On Tue, Dec 18, 2018 at 2:16 PM Austin Bennett <whatwouldausti...@gmail.com> wrote:
> > > >
> > > > > Hi Shaoxuan,
> > > > >
> > > > > FWIW, Kenn Knowles (Beam PMC Chair) recently gave a talk at the Bay Area Apache Beam Meetup [1] which included a bit on a vision for how Beam could better leverage runner-specific optimizations -- for example, Beam SQL leveraging Flink-specific SQL optimizations (to address your point). So that is part of the eventual roadmap for Beam, and it illustrates how concrete efforts towards optimizations in the Runner/SDK Harness would likely yield the desired result of cross-language support (which could be done by leveraging Beam while devoting focus to optimizing that processing on Flink).
> > > > >
> > > > > Cheers,
> > > > > Austin
> > > > >
> > > > > [1] https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/ -- I can post/share videos once available should someone desire.
> > > > >
> > > > > On Fri, Dec 14, 2018 at 6:03 AM Maximilian Michels <m...@apache.org> wrote:
> > > > >
> > > > > > Hi Xianda, hi Shaoxuan,
> > > > > >
> > > > > > I'd be in favor of option (1). There is great potential in Beam and Flink joining forces on this one. Here's why:
> > > > > >
> > > > > > The Beam project spent at least a year developing a portability layer, with a reasonable number of people working on it. Developing a new portability layer from scratch will probably take about the same amount of time and resources.
> > > > > > Concerning option (2): There is already a Python API for Flink, but an API is only one part of the portability story. In Beam, portability is structured into three components:
> > > > > >
> > > > > > - SDK (the API, its Protobuf serialization, and interaction with the SDK Harness)
> > > > > > - Runner (translation from the Protobuf pipeline to a Flink job)
> > > > > > - SDK Harness (UDF execution, interaction with the SDK and the execution engine)
> > > > > >
> > > > > > I could imagine the Flink Python API being another SDK, which could have its own API but would reuse code for the interaction with the SDK Harness.
> > > > > >
> > > > > > We would be able to focus on the optimizations instead of rebuilding a portability layer from scratch.
> > > > > >
> > > > > > Thanks,
> > > > > > Max
> > > > > >
> > > > > > On 13.12.18 11:52, Shaoxuan Wang wrote:
> > > > > > > RE: Stephan's options (http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-python-and-flink-streaming-python-td25793.html):
> > > > > > > * Option (1): Language portability via Apache Beam
> > > > > > > * Option (2): Implement own Python API
> > > > > > > * Option (3): Implement own portability layer
> > > > > > >
> > > > > > > Hi Stephan,
> > > > > > >
> > > > > > > Eventually, I think we should support both option 1 and option 3. IMO, these two options are orthogonal. I agree with you that we can leverage the existing work and ecosystem of Beam by supporting option 1. But the problem with Beam is that it skips (to the best of my knowledge) the natural Table/SQL optimization framework provided by Flink.
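A toy sketch of that three-way split, with JSON standing in for the Protobuf pipeline representation (none of these names are real Beam or Flink identifiers; this only illustrates the separation of concerns):

```python
import json

# --- SDK side: build a language-neutral description of the pipeline ------
def build_pipeline():
    # In Beam the description is Protobuf; JSON is a stand-in here.
    return json.dumps({
        "transforms": [
            {"name": "read", "kind": "source"},
            {"name": "my_udf", "kind": "python_udf"},
            {"name": "write", "kind": "sink"},
        ]
    })

# --- Runner side: translate the description into engine operations -------
def translate(pipeline_json):
    transforms = json.loads(pipeline_json)["transforms"]
    # A real runner would emit Flink operators; "python_udf" transforms
    # would be wired to an SDK Harness process that executes the UDF.
    return [f"flink_op:{t['name']}" for t in transforms]

print(translate(build_pipeline()))
# ['flink_op:read', 'flink_op:my_udf', 'flink_op:write']
```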
> > > > > > > We should spend all the needed effort to support option 1 (as it is the better alternative to the current Flink Python API), but we cannot solely bet on it. Option 3 is the ideal choice for Flink to support all non-JVM languages, and we should plan to achieve it. We have done some preliminary prototypes for options 2 and 3, and they do not seem overly complex or difficult to accomplish.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Shaoxuan
> > > > > > >
> > > > > > > On Thu, Dec 13, 2018 at 4:58 PM Xianda Ke <kexia...@gmail.com> wrote:
> > > > > > >
> > > > > > >> Currently there is an ongoing survey about Python usage of Flink [1]. Some discussion was also brought up there regarding the non-JVM language support strategy in general. To avoid polluting the survey thread, we are starting this discussion thread and would like to move the discussions here.
> > > > > > >>
> > > > > > >> In the interest of facilitating the discussion, we would like to first share the following design doc, which describes what we have done at Alibaba for a Python API for Flink. It could serve as a good reference for the discussion.
> > > > > > >>
> > > > > > >> [DISCUSS] Flink Python API <https://docs.google.com/document/d/1JNGWdLwbo_btq9RVrc1PjWJV3lYUgPvK0uEWDIfVNJI/edit?usp=drive_web>
> > > > > > >>
> > > > > > >> As of now, we've implemented and delivered Python UDFs for SQL to the internal users at Alibaba. We are starting to implement the Python API.
> > > > > > >> To recap and continue the discussion from the survey thread, I agree with @Stephan that we should figure out in which general direction Python support should go. Stephan also listed three options there:
> > > > > > >> * Option (1): Language portability via Apache Beam
> > > > > > >> * Option (2): Implement own Python API
> > > > > > >> * Option (3): Implement own portability layer
> > > > > > >>
> > > > > > >> From my perspective:
> > > > > > >>
> > > > > > >> (1) Flink's language APIs and Beam's language support are not mutually exclusive. It is nice that Beam has Python/NodeJS/Go APIs and supports Flink as the runner. Flink's own Python (or NodeJS/Go) APIs will benefit Flink's ecosystem.
> > > > > > >>
> > > > > > >> (2) Python API / portability layer. To support non-JVM languages in Flink:
> > > > > > >> * at the client side, Flink would provide language interfaces, which translate the user's application into a Flink StreamGraph;
> > > > > > >> * at the server side, Flink would execute the user's UDF code at runtime.
> > > > > > >> The non-JVM languages communicate with the JVM via RPC (or low-level sockets, an embedded interpreter, and so on). What the portability layer can do is perhaps abstract the RPC layer. Even when the portability layer is ready, there is still a lot of work to do for a specific language.
> > > > > > >> For Python, say, we may still have to write the interface classes by hand for the users, because generated code without detailed documentation is unacceptable to users, or handle the serialization of lambdas/closures, which is not a built-in feature in Python. Maybe we can start with the Python API, then extend to other languages and abstract the common logic into the portability layer.
> > > > > > >>
> > > > > > >> ---
> > > > > > >> References:
> > > > > > >> [1] [SURVEY] Usage of flink-python and flink-streaming-python
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Xianda

--
Best Regards

Jeff Zhang
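The lambda/closure serialization issue Xianda raises above is easy to demonstrate with only the standard library; it is why projects in this space typically ship UDFs with a by-value serializer such as cloudpickle rather than plain pickle:

```python
import pickle

# A UDF written as a lambda -- a natural style in a Python Table API.
udf = lambda x: x * 2

# pickle serializes functions by their module-level name; a lambda has no
# usable name, so pickling it fails. Shipping such a UDF to a remote worker
# therefore needs a serializer that captures the code object and closure
# by value (cloudpickle is the common third-party choice).
try:
    pickle.dumps(udf)
    serializable = True
except (pickle.PicklingError, AttributeError):
    serializable = False

print(serializable)  # False
```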