Re: [DISCUSS] What parts of the Python API should we focus on next ?

jincheng sun Wed, 18 Dec 2019 19:06:11 -0800

Also CC user-zh.

Best,
Jincheng



jincheng sun <sunjincheng...@gmail.com> wrote on Thu, Dec 19, 2019 at 10:20 AM:

> Hi folks,
>
> As release-1.10 is under feature freeze (stateless Python UDFs are
> already supported), it is time for us to plan the PyFlink features for
> the next release.
>
> To make sure the features supported in PyFlink are the ones most in demand
> in the community, we'd like to get more people involved. It would be
> best if all devs and users joined the discussion of which features
> are most important and urgent.
>
> We have already listed some features from different aspects, which you can
> find below; however, this is not the final plan. We appreciate any
> suggestions from the community, whether on functionality or
> performance improvements. If you want to suggest a feature, it would be
> great to include the following information:
>
> ---------
> - Feature description: xxxx
> - Benefits of the feature: xxxx
> - Use cases (optional): xxxx
> ----------
>
> ----Features in my mind----
>
> 1. Integration with the most popular Python libraries
>     - fromPandas/toPandas API
>        Description:
>           Support converting between Table and pandas.DataFrame.
>        Benefits:
>           Users could switch between the Flink and Pandas APIs, for
> example, do some analysis using Flink and then continue with the Pandas
> API if the result data is small enough to fit into memory, and vice versa.
>
>     - Support Scalar Pandas UDF
>        Description:
>           Support scalar Pandas UDFs in the Python Table API & SQL. Both
> the input and output of the UDF are pandas.Series.
>        Benefits:
>           1) Scalar Pandas UDFs perform better than row-at-a-time UDFs,
> by 3x to over 100x (figures from PySpark)
>           2) Users could use the Pandas/NumPy APIs in the Python UDF
> implementation when the input/output data type is pandas.Series
>
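To make the row-at-a-time versus batch distinction concrete, here is a minimal sketch in plain pandas, outside Flink. The names `add_row_at_a_time` and `add_vectorized` are made up for illustration and are not PyFlink API; the assumption is only that a scalar Pandas UDF would be handed whole batches of column values as `pandas.Series`.

```python
import pandas as pd


def add_row_at_a_time(a, b):
    # Called once per row with plain Python values.
    return a + b


def add_vectorized(a: pd.Series, b: pd.Series) -> pd.Series:
    # Called once per batch; the addition runs inside pandas/NumPy.
    return a + b


batch = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-at-a-time: one Python function call per row.
row_result = pd.Series(
    [add_row_at_a_time(a, b) for a, b in zip(batch["a"], batch["b"])]
)

# Vectorized: one Python function call for the whole batch.
vec_result = add_vectorized(batch["a"], batch["b"])
```

Both variants compute the same values; the difference is how many times the Python function is invoked per batch.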
>     - Support Pandas UDAF in batch GroupBy aggregation
>        Description:
>            Support Pandas UDAFs in batch GroupBy aggregation of the Python
> Table API & SQL. Both the input and output of the UDAF are
> pandas.DataFrame.
>        Benefits:
>           1) Pandas UDAFs perform more than 10x better than row-at-a-time
> UDAFs in certain scenarios
>           2) Users could use the Pandas/NumPy APIs in the Python UDAF
> implementation when the input/output data type is pandas.DataFrame
>
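The shape of such a function might look like the following sketch, again in plain pandas rather than Flink. `mean_price` and the sample data are hypothetical; the assumption is that the function receives all rows of one group as a `pandas.DataFrame` and returns a `pandas.DataFrame`, as described above.

```python
import pandas as pd


def mean_price(group: pd.DataFrame) -> pd.DataFrame:
    # Receives every row of one group and returns a one-row DataFrame.
    return pd.DataFrame({"avg_price": [group["price"].mean()]})


orders = pd.DataFrame(
    {"item": ["a", "b", "a", "b"], "price": [1.0, 4.0, 3.0, 6.0]}
)

# Stand-in for the engine: split by key, call the function once per group.
results = {key: mean_price(group) for key, group in orders.groupby("item")}
```

In a real batch job the grouping would be done by the engine; only the per-group function would be user code.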
> 2. Full support for all kinds of Python UDFs
>     - Support Python UDAFs (stateful) in GroupBy aggregation (NOTE: Please
> give us some use cases if you want this feature to be included in the next
> release)
>       Description:
>         Support UDAFs in GroupBy aggregation.
>       Benefits:
>         Users could define Python UDAFs and use them in GroupBy
> aggregation. Without this, users have to use Java/Scala UDAFs.
>
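As a rough picture of what a stateful UDAF involves, the sketch below keeps one accumulator per group key in plain Python. The class `WeightedAvg` and the driver `group_by_aggregate` are hypothetical; the method names are snake_case versions of the `createAccumulator` / `accumulate` / `getValue` methods on Flink's Java `AggregateFunction`, and a Python API may well look different.

```python
class WeightedAvg:
    """Hypothetical UDAF: weighted average kept in a small accumulator."""

    def create_accumulator(self):
        # The accumulator is the per-group state the runtime would manage.
        return {"sum": 0, "count": 0}

    def accumulate(self, acc, value, weight):
        acc["sum"] += value * weight
        acc["count"] += weight

    def get_value(self, acc):
        return acc["sum"] / acc["count"] if acc["count"] else None


def group_by_aggregate(rows, udaf):
    # Stand-in for the engine: one accumulator per grouping key.
    accumulators = {}
    for key, value, weight in rows:
        acc = accumulators.setdefault(key, udaf.create_accumulator())
        udaf.accumulate(acc, value, weight)
    return {key: udaf.get_value(acc) for key, acc in accumulators.items()}


rows = [("x", 10, 1), ("x", 20, 3), ("y", 5, 2)]
result = group_by_aggregate(rows, WeightedAvg())
```

The accumulator is what makes the function stateful: in a streaming job it would have to live in Flink state between records.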
>     - Support Python UDTF
>       Description:
>         Support Python UDTFs in the Python Table API & SQL.
>       Benefits:
>         Users could define and use Python UDTFs in the Python Table API &
> SQL. Without this, users have to use Java/Scala UDTFs.
>
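A table function maps one input row to zero or more output rows. A Python generator is a natural way to express that, as in this plain-Python sketch; `split_words` and the list comprehension standing in for a lateral join are illustrative assumptions, not PyFlink API.

```python
def split_words(line, sep=" "):
    # One input row in, zero or more (word, length) rows out.
    for word in line.split(sep):
        if word:
            yield (word, len(word))


lines = ["hello flink", "", "pyflink"]

# Stand-in for a lateral join: pair each input row with every row it emits.
joined = [
    (line, word, length)
    for line in lines
    for word, length in split_words(line)
]
```

Note that the empty input line produces no output rows at all, which is the behavior that distinguishes a table function from a scalar one.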
> 3. Debugging and Monitoring of Python UDF
>    - Support User-Defined Metrics
>      Description:
>        Allow users to define metrics and global job parameters in
> Python UDFs.
>      Benefits:
>        UDFs need metrics to monitor business or technical indicators;
> this is a common requirement for UDFs.
>
>    - Make the log level configurable
>      Description:
>        Allow users to configure the log level of Python UDFs.
>      Benefits:
>        Users could configure different log levels when debugging and
> deploying.
>
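One way this could work on the Python side is sketched below with the standard `logging` module. The configuration key `python.udf.log.level` and the logger name `python_udf` are made up for this example; no such option is defined in this thread.

```python
import logging


def configure_udf_logging(config):
    # "python.udf.log.level" is a hypothetical key used only in this sketch.
    level_name = str(config.get("python.udf.log.level", "INFO")).upper()
    level = getattr(logging, level_name, logging.INFO)
    if not isinstance(level, int):
        # Fall back if the name resolved to something that is not a level.
        level = logging.INFO
    logger = logging.getLogger("python_udf")
    logger.setLevel(level)
    return logger


# Verbose while debugging a job locally.
debug_logger = configure_udf_logging({"python.udf.log.level": "debug"})
debug_enabled = debug_logger.isEnabledFor(logging.DEBUG)

# Default level when nothing is configured, e.g. in a deployment.
prod_logger = configure_udf_logging({})
prod_debug_enabled = prod_logger.isEnabledFor(logging.DEBUG)
```

The UDF code itself stays the same in both cases; only the configuration passed at job submission would differ.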
> 4. Enrich the Python execution environment
>    - Docker Mode Support
>      Description:
>          Support running Python UDFs in Docker workers.
>      Benefits:
>          Support a variety of deployments to meet more users' requirements.
>
> 5. Expand the usage scope of Python UDF
>    - Support using Python UDFs via the SQL client
>      Description:
>          Support registering and using Python UDFs via the SQL client.
>      Benefits:
>          The SQL client is a very important interface for SQL users. This
> feature allows SQL users to use Python UDFs via the SQL client.
>
>    - Integrate Python UDFs with notebooks
>      Description:
>          For example Zeppelin (especially handling of Python dependencies)
>
>    - Support registering Python UDFs in the catalog
>       Description:
>           Support registering Python UDFs in the catalog.
>       Benefits:
>           1) The catalog is the centralized place to manage metadata such
> as tables, UDFs, etc. With it, users could register a UDF once and use it
> anywhere.
>           2) It's an important part of the SQL functionality. If Python
> UDFs cannot be registered in and used from the catalog, they cannot be
> shared between jobs.
>
> 6. Performance Improvements of Python UDF
>    - Cython improvements
>       Description:
>           Cython improvements in coders & operations
>       Benefits:
>           Initial tests show that Cython gives a 3x+ speedup in coder
> serialization/deserialization.
>
> 7. Add Python ML API
>    - Add Python ML Pipeline API
>      Description:
>          Align the Python ML Pipeline API with Java/Scala
>      Benefits:
>        1) We already have the Pipeline APIs for ML. It would be
> good to also have the corresponding Python APIs.
>        2) In many cases, algorithm engineers prefer the Python language.
>
>
> BTW, PyFlink is a new component, and there is still a lot of work
> to do. Everybody is therefore cordially welcome to contribute
> to PyFlink, including by asking questions, filing bug reports, proposing
> new features, joining discussions, contributing code or documentation ...
>
> Hope to see your feedback!
>
> Best,
> Jincheng
>
