Thanks for your feedback, Vino and Jeff! I have started a new thread outlining what we are proposing in the Python Table API.
The discussion thread "[DISCUSS] FLIP-38 Support python language in flink TableAPI" can be found here: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-38-Support-python-language-in-flink-TableAPI-td28061.html

Best,
Jincheng

On Thu, Mar 28, 2019 at 4:59 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> Hi Shaoxuan & Jincheng,

> Thanks for driving this initiative. Python would be a very big add-on for Flink adoption in the data science world. One additional suggestion: you may need to think about how to convert a Flink Table to a pandas DataFrame, since pandas is a very popular library in Python. You may also be interested in Apache Arrow, a common layer for transferring data efficiently across languages: https://arrow.apache.org/

> On Thu, Mar 28, 2019 at 2:44 PM, vino yang <yanghua1...@gmail.com> wrote:

>> Hi Jincheng,

>> Thanks for activating this discussion again. I personally look forward to your design draft.

>> Best,
>> Vino

>> On Thu, Mar 28, 2019 at 12:16 PM, jincheng sun <sunjincheng...@gmail.com> wrote:

>> > Hi everyone,
>> > Sorry to join this discussion late.

>> > Thanks to Xianda Ke for initiating this discussion. I also enjoyed the discussions and suggestions from Max, Austin, Thomas, Shaoxuan and others.

>> > Recently, I have clearly felt the desire of the community and of Flink users for Python support. Stephan also pointed out in the discussion on `Adding a mid-term roadmap` that "Table API becomes primary API for analytics use cases", while a large number of users in analytics use cases are accustomed to the Python language and have accumulated a large body of class libraries in the Python ecosystem.

>> > So I am very interested in participating in the discussion of supporting Python in Flink. With regard to the three options mentioned so far, leveraging Beam's language portability layer on Flink is greatly encouraging. For now, we can take a first step in Flink by adding a Python Table API. I believe that in this process we will gain a deeper understanding of how Flink can support Python. If we can quickly let users experience a first version of the Flink Python Table API, we can also collect feedback from many users and then consider the long-term goals of multi-language support on Flink.

>> > So if you agree, I volunteer to draft a document with the detailed design and implementation plan for the Python Table API on Flink.

>> > What do you think?

>> > On Thu, Feb 21, 2019 at 10:44 PM, Shaoxuan Wang <wshaox...@gmail.com> wrote:

>> > > Hey guys,
>> > > Thanks for your comments and sorry for the late reply.

>> > > The Beam Python API and the Flink Python Table API describe the DAG/pipeline in different manners. We got a chance to communicate with Tyler Akidau (from Beam) offline and explained why the Flink Table API needs a specific design for Python rather than purely leveraging the Beam portability layer.

>> > > In our proposal, most of the Python code is just a DAG/pipeline builder for the Table API. The majority of operators run purely in Java, while only Python UDFs are executed in a Python environment at runtime. This design does not affect the development and adoption of the Beam language portability layer with the Flink runner. The Flink and Beam communities will still collaborate, jointly developing and optimizing the JVM / non-JVM (Python, Go) bridge (the data and control connections between different processes) to ensure robustness and performance.

>> > > Regards,
>> > > Shaoxuan
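[To make the "DAG builder" idea above concrete, here is a minimal, self-contained Python sketch of the pattern. It is a toy illustration only; none of these names are actual Flink APIs. Every method call merely records a plan node, and nothing except the eventual UDFs would run in Python.]

    # Toy sketch of a "Python as DAG builder" Table-style API.
    # Each call only appends a node to a plan; in the proposed design
    # the plan would be handed to the Java Table API for optimization
    # and execution, with Python workers used solely for Python UDFs.
    class Table:
        def __init__(self, plan):
            self.plan = plan  # list of (operator, argument) pairs

        def select(self, expr):
            return Table(self.plan + [("select", expr)])

        def group_by(self, keys):
            return Table(self.plan + [("group_by", keys)])

    def scan(name):
        return Table([("scan", name)])

    t = scan("Orders").group_by("user").select("user, amount.sum")
    print(t.plan)
    # [('scan', 'Orders'), ('group_by', 'user'), ('select', 'user, amount.sum')]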
>> > > On Fri, Dec 21, 2018 at 1:39 PM Thomas Weise <t...@apache.org> wrote:

>> > > > Interest in Python seems to be on the rise, so this is a good discussion to have :)

>> > > > So far there seems to be agreement that Beam's approach towards Python and other non-JVM language support (language SDKs, portability layer, etc.) is the right direction? Specification and execution are native Python, and it does not suffer from the shortcomings of Flink's Jython API and a few other approaches.

>> > > > Overall, there is already good alignment between Beam and Flink in concepts and model. There are also a few of us who are active in both communities. The Beam Flink runner has made a lot of progress this year, but work on portability in Beam actually started much before that and was a very big change (originally there was just the Java SDK). Much of the code has been rewritten as part of the effort; that's what implementing a strong multi-language support story took. To have a decent shot at it, the equivalent of much of the Beam portability framework would need to be reinvented in Flink. This would fork resources and divert focus away from things that may be more core to Flink. As you can guess, I am in favor of option (1)!

>> > > > We could take a look at SQL for reference. The Flink community has invested a lot in SQL, and there remains a lot of work to do. The Beam community has done the same, and we have two completely separate implementations. When I recently learned more about the Beam SQL work, one of my first questions was whether a joint effort would not lead to better user value. Calcite is common, but isn't there much more that could be shared? Someone had the idea that, in such a world, Flink could substitute portions or all of the graph provided by Beam with its own optimized version, while much of the tooling could stay the same.

>> > > > IO connectors are another area where much effort is repeated. It takes a very long time to arrive at a solid, production-quality implementation (typically resulting from broad user exposure and running at scale). There are currently discussions about how connectors can be done much better in both projects: SDF in Beam [1] and FLIP-27.

>> > > > It's a trade-off, but more synergy would be great!

>> > > > Thomas

>> > > > [1] https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/
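[Picking up Jeff's earlier suggestion about pandas and Apache Arrow: below is a small, runnable illustration of the Arrow-to-pandas hop on the Python side, using the actual pyarrow API. How a Flink Table would produce the Arrow batches is exactly the open design question; the data here is made up. Requires pyarrow and pandas installed.]

    import pyarrow as pa

    # An Arrow record batch is the kind of language-neutral, columnar
    # chunk a JVM process could hand to Python without per-row
    # serialization overhead.
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
        names=["id", "name"],
    )

    # Cheap conversion into the pandas world on the Python side.
    df = batch.to_pandas()
    print(df)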
>> > > > On Tue, Dec 18, 2018 at 2:16 PM Austin Bennett <whatwouldausti...@gmail.com> wrote:

>> > > > > Hi Shaoxuan,

>> > > > > FWIW, Kenn Knowles (Beam PMC Chair) recently gave a talk at the Bay Area Apache Beam Meetup [1] which included a bit on a vision for how Beam could better leverage runner-specific optimizations -- as an example/extension, Beam SQL leveraging Flink-specific SQL optimizations (to address your point). So that is part of the eventual roadmap for Beam, and it illustrates how concrete efforts towards optimizations in the Runner/SDK Harness would likely yield the desired result of cross-language support (which could be achieved by leveraging Beam and devoting the focus to optimizing that processing on Flink).

>> > > > > Cheers,
>> > > > > Austin

>> > > > > [1] https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/ -- I can post/share the videos once available, should someone desire.

>> > > > > On Fri, Dec 14, 2018 at 6:03 AM Maximilian Michels <m...@apache.org> wrote:

>> > > > > > Hi Xianda, hi Shaoxuan,

>> > > > > > I'd be in favor of option (1). There is great potential in Beam and Flink joining forces on this one. Here's why:

>> > > > > > The Beam project spent at least a year developing a portability layer, with a reasonable number of people working on it. Developing a new portability layer from scratch will probably take about the same amount of time and resources.

>> > > > > > Concerning option (2): There is already a Python API for Flink, but an API is only one part of the portability story. In Beam the portability is structured into three components:

>> > > > > > - SDK (the API, its Protobuf serialization, and the interaction with the SDK Harness)
>> > > > > > - Runner (translation from the Protobuf pipeline to a Flink job)
>> > > > > > - SDK Harness (UDF execution, interaction with the SDK and the execution engine)

>> > > > > > I could imagine the Flink Python API being another SDK, which could have its own API but would reuse the code for the interaction with the SDK Harness.

>> > > > > > We would be able to focus on the optimizations instead of rebuilding a portability layer from scratch.

>> > > > > > Thanks,
>> > > > > > Max
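[For reference, this is roughly what the division of labor Max describes looks like from the user side today: a minimal sketch using Beam's real Python SDK against the portable Flink runner. The flags and the job-server endpoint are assumptions that vary with the Beam version and the local setup.]

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The SDK turns this program into a language-neutral (Protobuf)
    # pipeline; the PortableRunner submits it to a job server sitting
    # in front of Flink. The endpoint below is an assumed local setup.
    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(["a", "b", "a"])       # pipeline construction (SDK)
         | beam.combiners.Count.PerElement()  # translated and run by the Flink runner
         | beam.Map(print))                   # this UDF executes in the SDK Harness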
>> > > > > > On 13.12.18 11:52, Shaoxuan Wang wrote:

>> > > > > > > RE: Stephan's options (http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-python-and-flink-streaming-python-td25793.html):
>> > > > > > > * Option (1): Language portability via Apache Beam
>> > > > > > > * Option (2): Implement own Python API
>> > > > > > > * Option (3): Implement own portability layer

>> > > > > > > Hi Stephan,
>> > > > > > > Eventually, I think we should support both option (1) and option (3). IMO, these two options are orthogonal. I agree with you that we can leverage the existing work and ecosystem in Beam by supporting option (1). But the problem with Beam is that it skips (to the best of my knowledge) the natural Table/SQL optimization framework provided by Flink. We should spend all the needed effort to support option (1), as it is the better alternative to the current Flink Python API, but we cannot bet solely on it. Option (3) is the ideal choice for Flink to support all non-JVM languages, and we should plan to achieve it. We have done some preliminary prototypes for options (2) and (3), and they seem neither very complex nor difficult to accomplish.

>> > > > > > > Regards,
>> > > > > > > Shaoxuan

>> > > > > > > On Thu, Dec 13, 2018 at 4:58 PM Xianda Ke <kexia...@gmail.com> wrote:

>> > > > > > >> Currently there is an ongoing survey about the Python usage of Flink [1]. Some discussion was also brought up there regarding the non-JVM language support strategy in general. To avoid polluting the survey thread, we are starting this discussion thread and would like to move the discussions here.

>> > > > > > >> In the interest of facilitating the discussion, we would first like to share the following design doc, which describes what we have done at Alibaba about a Python API for Flink. It can serve as a good reference for the discussion.

>> > > > > > >> [DISCUSS] Flink Python API
>> > > > > > >> <https://docs.google.com/document/d/1JNGWdLwbo_btq9RVrc1PjWJV3lYUgPvK0uEWDIfVNJI/edit?usp=drive_web>

>> > > > > > >> As of now, we have implemented and delivered Python UDFs for SQL to internal users at Alibaba, and we are starting to implement the Python API.

>> > > > > > >> To recap and continue the discussion from the survey thread, I agree with @Stephan that we should figure out in which general direction Python support should go.
>> > > > > > >> Stephan also listed three options there:
>> > > > > > >> * Option (1): Language portability via Apache Beam
>> > > > > > >> * Option (2): Implement own Python API
>> > > > > > >> * Option (3): Implement own portability layer

>> > > > > > >> From my perspective:

>> > > > > > >> (1) Flink's language APIs and Beam's language support are not mutually exclusive. It is nice that Beam has Python/NodeJS/Go APIs and supports Flink as the runner, and Flink's own Python (or NodeJS/Go) APIs will benefit Flink's ecosystem.

>> > > > > > >> (2) Python API / portability layer. To support non-JVM languages in Flink:
>> > > > > > >> * at the client side, Flink would provide language interfaces, which translate the user's application into a Flink StreamGraph;
>> > > > > > >> * at the server side, Flink would execute the user's UDF code at runtime.
>> > > > > > >> The non-JVM languages communicate with the JVM via RPC (or low-level sockets, an embedded interpreter, and so on). What the portability layer can do, perhaps, is abstract this RPC layer. Even when the portability layer is ready, there is still a lot of work left for any specific language. For Python, say, we may still have to write the interface classes by hand, because generated code without detailed documentation is unacceptable to users, and we have to handle the serialization of lambdas/closures, which is not a built-in feature in Python. Maybe we can start with the Python API, then extend to other languages and abstract the common logic into the portability layer.

>> > > > > > >> ---
>> > > > > > >> References:
>> > > > > > >> [1] [SURVEY] Usage of flink-python and flink-streaming-python

>> > > > > > >> Regards,
>> > > > > > >> Xianda

> --
> Best Regards
>
> Jeff Zhang
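[One concrete note on Xianda's lambda/closure point above: the standard library's pickle cannot serialize lambdas, which is why systems that ship Python UDFs to worker processes typically reach for something like the third-party cloudpickle library. A minimal illustration of one possible approach, not a settled design choice:]

    import pickle
    import cloudpickle  # third-party: pip install cloudpickle

    udf = lambda x: x + 1

    # The stdlib pickle serializes functions by reference and fails here.
    try:
        pickle.dumps(udf)
    except Exception as e:
        print("stdlib pickle failed:", e)

    # cloudpickle serializes the code object itself, so the lambda can
    # be shipped to a remote Python worker process and executed there.
    payload = cloudpickle.dumps(udf)
    restored = cloudpickle.loads(payload)
    print(restored(41))  # -> 42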