Also CC user-zh.

Best,
Jincheng

On Thu, Dec 19, 2019 at 10:20 AM, jincheng sun <sunjincheng...@gmail.com> wrote:

> Hi folks,
>
> As release-1.10 is under feature freeze (the stateless Python UDF is
> already supported), it is time for us to plan the features of PyFlink for
> the next release.
>
> To make sure the features supported in PyFlink are the ones most in demand
> in the community, we'd like to get more people involved, i.e., it would be
> better if all of the devs and users joined the discussion of which kinds of
> features are more important and urgent.
>
> We have already listed some features from different aspects, which you can
> find below; however, it is not the final plan. We appreciate any
> suggestions from the community, either on functionality or
> performance improvements, etc. It would be great to have the following
> information if you want to suggest a feature:
>
> ---------
> - Feature description: xxxx
> - Benefits of the feature: xxxx
> - Use cases (optional): xxxx
> ----------
>
> ----Features in my mind----
>
> 1. Integration with most popular Python libraries
>   - fromPandas/toPandas API
>     Description:
>       Support converting between Table and pandas.DataFrame.
>     Benefits:
>       Users could switch between the Flink and Pandas APIs, for example, do
>       some analysis using Flink and then continue the analysis using the
>       Pandas API if the result data is small and fits into memory, and
>       vice versa.
>
>   - Support Scalar Pandas UDF
>     Description:
>       Support scalar Pandas UDF in Python Table API & SQL. Both the
>       input and output of the UDF are pandas.Series.
>     Benefits:
>       1) Scalar Pandas UDF performs better than row-at-a-time UDF,
>       ranging from 3x to over 100x (as reported for PySpark)
>       2) Users could use the Pandas/NumPy API in the Python UDF
>       implementation if the input/output data type is pandas.Series
>
>   - Support Pandas UDAF in batch GroupBy aggregation
>     Description:
>       Support Pandas UDAF in batch GroupBy aggregation of Python
>       Table API & SQL.
>       Both the input and output of the UDF are pandas.DataFrame.
>     Benefits:
>       1) Pandas UDAF performs more than 10x better than row-at-a-time
>       UDAF in certain scenarios
>       2) Users could use the Pandas/NumPy API in the Python UDAF
>       implementation if the input/output data type is pandas.DataFrame
>
> 2. Fully support all kinds of Python UDF
>   - Support Python UDAF (stateful) in GroupBy aggregation (NOTE: Please
>     give us some use cases if you want this feature to be included in the
>     next release)
>     Description:
>       Support UDAF in GroupBy aggregation.
>     Benefits:
>       Users could define a Python UDAF and use it in GroupBy
>       aggregation. Without it, users have to use Java/Scala UDAF.
>
>   - Support Python UDTF
>     Description:
>       Support Python UDTF in Python Table API & SQL
>     Benefits:
>       Users could define and use Python UDTF in Python Table API & SQL.
>       Without it, users have to use Java/Scala UDTF.
>
> 3. Debugging and Monitoring of Python UDF
>   - Support User-Defined Metrics
>     Description:
>       Allow users to define user-defined metrics and global job
>       parameters with Python UDFs.
>     Benefits:
>       UDFs need metrics to monitor business or technical indicators,
>       which is also a requirement for UDFs.
>
>   - Make the log level configurable
>     Description:
>       Allow users to configure the log level of Python UDF.
>     Benefits:
>       Users could configure different log levels when debugging and
>       deploying.
>
> 4. Enrich the Python execution environment
>   - Docker Mode Support
>     Description:
>       Support running Python UDF in Docker workers.
>     Benefits:
>       Support a variety of deployments to meet more users' requirements.
>
> 5. Expand the usage scope of Python UDF
>   - Support using Python UDF via SQL client
>     Description:
>       Support registering and using Python UDF via SQL client
>     Benefits:
>       SQL client is a very important interface for SQL users. This
>       feature allows SQL users to use Python UDFs via SQL client.
>
>   - Integrate Python UDF with Notebooks
>     Description:
>       Such as Zeppelin, etc. (especially Python dependencies)
>
>   - Support registering Python UDF into catalog
>     Description:
>       Support registering Python UDF into catalog
>     Benefits:
>       1) Catalog is the centralized place to manage metadata such as
>       tables, UDFs, etc. With it, users could register the UDFs once and
>       use them anywhere.
>       2) It's an important part of the SQL functionality. If Python
>       UDFs cannot be registered and used in catalog, Python UDFs
>       cannot be shared between jobs.
>
> 6. Performance Improvements of Python UDF
>   - Cython improvements
>     Description:
>       Cython improvements in coder & operations
>     Benefits:
>       Initial tests show that Cython gives a 3x+ speedup in coder
>       serialization/deserialization.
>
> 7. Add Python ML API
>   - Add Python ML Pipeline API
>     Description:
>       Align Python ML Pipeline API with Java/Scala
>     Benefits:
>       1) Currently, we already have the Pipeline APIs for ML. It would be
>       good to also have the related Python APIs.
>       2) In many cases, algorithm engineers prefer the Python language.
>
>
> BTW, PyFlink is a new component, and there is still a lot of work
> to do. Thus, everybody is cordially welcome to contribute
> to PyFlink, including asking questions, filing bug reports, proposing new
> features, joining discussions, contributing code or documentation ...
>
> Hope to see your feedback!
>
> Best,
> Jincheng
>
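
For readers less familiar with the UDF kinds named in item 2 of the quoted list, the difference is mainly how many rows go in and how many come out. The following is a minimal plain-Python sketch, not PyFlink code: the names to_upper, split_words, CountAgg, and group_by_aggregate are hypothetical, and the accumulator methods only loosely mirror the createAccumulator/accumulate/getValue shape of Flink's Java AggregateFunction.

```python
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple


def to_upper(word: str) -> str:
    """Scalar UDF: one input value produces exactly one output value."""
    return word.upper()


def split_words(line: str) -> Iterator[str]:
    """Table function (UDTF): one input row produces zero or more rows."""
    for word in line.split():
        yield word


class CountAgg:
    """Aggregate function (UDAF): many rows fold into one value per group.

    Hypothetical class for illustration; the state lives in an accumulator
    that the caller creates once per group and passes back on every row.
    """

    def create_accumulator(self) -> List[int]:
        # A one-element list serves as a mutable counter.
        return [0]

    def accumulate(self, acc: List[int], value: str) -> None:
        acc[0] += 1

    def get_value(self, acc: List[int]) -> int:
        return acc[0]


def group_by_aggregate(
    rows: Iterable[Tuple[Hashable, str]], agg: CountAgg
) -> Dict[Hashable, int]:
    """Toy GroupBy driver: keeps one accumulator per key (assumed helper)."""
    accs: Dict[Hashable, List[int]] = {}
    for key, value in rows:
        if key not in accs:
            accs[key] = agg.create_accumulator()
        agg.accumulate(accs[key], value)
    return {key: agg.get_value(acc) for key, acc in accs.items()}


lines = ["flink python udf", "python udaf", ""]
# UDTF expands each line into words; the scalar UDF maps each word.
words = [to_upper(w) for line in lines for w in split_words(line)]
# UDAF counts occurrences per word, grouped by the word itself.
counts = group_by_aggregate(((w, w) for w in words), CountAgg())
print(counts)
```

The empty line yields no rows from split_words, which is what separates a table function from a scalar function. The per-key accumulator in group_by_aggregate is the state that makes a UDAF stateful.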