Hi Xianda, hi Shaoxuan,
I'd be in favor of option (1). There is great potential in Beam and Flink
joining forces on this one. Here's why:
The Beam project spent at least a year developing a portability layer with a
reasonable number of people working on it. Developing a new portability layer
from scratch will probably take about the same amount of time and resources.
Concerning option (2): There is already a Python API for Flink, but an API is
only one part of the portability story. In Beam, portability is structured
into three components:
- SDK (API, its Protobuf serialization, and interaction with the SDK Harness)
- Runner (translation from Protobuf pipeline to Flink job)
- SDK Harness (UDF execution, interaction with the SDK and the execution engine)
I could imagine the Flink Python API would be another SDK which could have its
own API but would reuse code for the interaction with the SDK Harness.
We would be able to focus on the optimizations instead of rebuilding a
portability layer from scratch.
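To make the split between the three components concrete, here is a minimal toy sketch. It is not Beam's actual API: JSON stands in for the Protobuf pipeline representation, and `Pipeline`, `translate`, `harness_apply`, `run` and the `UDFS` registry are hypothetical names made up for illustration.

```python
import json

# Hypothetical registry of UDFs that the harness knows how to execute.
UDFS = {
    "increment": lambda x: x + 1,
    "is_even": lambda x: x % 2 == 0,
}


class Pipeline:
    """SDK role: user-facing API that records steps in a language-neutral form."""

    def __init__(self):
        self.steps = []

    def map(self, udf_name):
        self.steps.append({"op": "map", "udf": udf_name})
        return self

    def filter(self, udf_name):
        self.steps.append({"op": "filter", "udf": udf_name})
        return self

    def to_portable(self):
        # JSON is only a stand-in here for a Protobuf pipeline description.
        return json.dumps({"steps": self.steps})


def translate(portable):
    """Runner role: turn the portable description into an engine-side plan."""
    spec = json.loads(portable)
    return [(step["op"], step["udf"]) for step in spec["steps"]]


def harness_apply(udf_name, value):
    """SDK Harness role: execute one UDF on behalf of the engine."""
    return UDFS[udf_name](value)


def run(plan, records):
    """Toy engine loop that delegates every UDF call to the harness."""
    out = []
    for record in records:
        keep = True
        for op, udf in plan:
            if op == "map":
                record = harness_apply(udf, record)
            elif op == "filter" and not harness_apply(udf, record):
                keep = False
                break
        if keep:
            out.append(record)
    return out


plan = translate(Pipeline().map("increment").filter("is_even").to_portable())
result = run(plan, [1, 2, 3, 4])
```

In this sketch, a second SDK with a different user-facing API would only need to emit the same portable description to reuse `translate` and `harness_apply` unchanged.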
Thanks,
Max
On 13.12.18 11:52, Shaoxuan Wang wrote:
RE: Stephan's options (
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-python-and-flink-streaming-python-td25793.html
)
* Option (1): Language portability via Apache Beam
* Option (2): Implement own Python API
* Option (3): Implement own portability layer
Hi Stephan,
Eventually, I think we should support both option1 and option3. IMO, these
two options are orthogonal. I agree with you that we can leverage the
existing work and ecosystem in Beam by supporting option1. But the problem
with Beam is that it bypasses (to the best of my knowledge) the native
table/SQL optimization framework provided by Flink. We should spend all the
effort needed to support option1 (as it is the better alternative to the
current Flink Python API), but we cannot bet solely on it. Option3 is the
ideal choice for Flink to support all non-JVM languages, and we should
plan to achieve it. We have built some preliminary prototypes for
option2/option3, and they do not seem too complex or difficult to accomplish.
Regards,
Shaoxuan
On Thu, Dec 13, 2018 at 4:58 PM Xianda Ke <kexia...@gmail.com> wrote:
Currently there is an ongoing survey about Python usage of Flink [1]. Some
discussion was also brought up there regarding the non-JVM language support
strategy in general. To avoid polluting the survey thread, we are starting
this discussion thread and would like to move the discussion here.
To facilitate the discussion, we would like to first share the following
design doc, which describes what we have done at Alibaba on a Python API
for Flink. It could serve as a good reference for the discussion.
[DISCUSS] Flink Python API
<https://docs.google.com/document/d/1JNGWdLwbo_btq9RVrc1PjWJV3lYUgPvK0uEWDIfVNJI/edit?usp=drive_web>
As of now, we've implemented Python UDFs for SQL and delivered them to
internal users at Alibaba.
We are now starting to implement the Python API.
To recap and continue the discussion from the survey thread, I agree with
@Stephan that we should figure out in which general direction Python
support should go. Stephan also listed three options there:
* Option (1): Language portability via Apache Beam
* Option (2): Implement own Python API
* Option (3): Implement own portability layer
From my perspective,
(1). Flink language APIs and Beam's language support are not mutually
exclusive.
It is nice that Beam has Python/NodeJS/Go APIs and supports Flink as a
runner.
Flink's own Python (or NodeJS/Go) APIs will benefit Flink's ecosystem.
(2). Python API / portability layer
To support non-JVM languages in Flink,
* on the client side, Flink would provide language interfaces, which would
translate the user's application into a Flink StreamGraph.
* on the server side, Flink would execute the user's UDF code at runtime.
The non-JVM languages communicate with the JVM via RPC (or a low-level
socket, an embedded interpreter, and so on). What the portability layer can
do is perhaps abstract the RPC layer. Even when the portability layer is
ready, there is still a lot to do for each specific language. For Python,
say, we may still have to write the interface classes by hand, because
generated code without detailed documentation is unacceptable for users, and
handle the serialization of lambdas/closures, which is not a built-in feature
of Python. Maybe we can start with the Python API, then extend to other
languages and abstract the common logic into the portability layer.
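To illustrate the lambda/closure serialization issue: Python's standard `pickle` module serializes functions by reference (module plus qualified name), so a lambda or a nested function cannot be shipped to a remote worker as-is. Below is a minimal sketch of a by-value workaround, assuming the sender and receiver run the same Python version (3.8 or later) and that the captured values are themselves picklable. The helper names `try_pickle`, `dump_function` and `load_function` are hypothetical and not part of any Flink API.

```python
import marshal
import pickle
import types


def try_pickle(fn):
    """Return True if the standard pickle module can serialize fn."""
    try:
        pickle.dumps(fn)
        return True
    except (pickle.PicklingError, AttributeError):
        # Lambdas and nested functions cannot be looked up by name on import.
        return False


def dump_function(fn):
    """Serialize a function by value: bytecode plus captured closure values."""
    closure_values = [cell.cell_contents for cell in (fn.__closure__ or ())]
    return pickle.dumps({
        # marshal output is specific to the Python version that produced it.
        "code": marshal.dumps(fn.__code__),
        "name": fn.__name__,
        "defaults": fn.__defaults__,
        "closure": closure_values,
    })


def load_function(payload):
    """Rebuild a function produced by dump_function on the receiving side."""
    data = pickle.loads(payload)
    code = marshal.loads(data["code"])
    # Recreate one closure cell per captured value (types.CellType: 3.8+).
    closure = tuple(types.CellType(v) for v in data["closure"]) or None
    return types.FunctionType(
        code, globals(), data["name"], data["defaults"], closure
    )


def make_scaler(k):
    # The returned lambda captures k in a closure.
    return lambda x: x * k


scale = make_scaler(3)
restored = load_function(dump_function(scale))
```

This sketch ignores the global variables and imported modules that the function body refers to; existing libraries such as cloudpickle deal with those cases more thoroughly.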
---
References:
[1] [SURVEY] Usage of flink-python and flink-streaming-python
Regards,
Xianda