Dian Fu created FLINK-16114:
-------------------------------

             Summary: Support Scalar Vectorized Python UDF in PyFlink
                 Key: FLINK-16114
                 URL: https://issues.apache.org/jira/browse/FLINK-16114
             Project: Flink
          Issue Type: New Feature
          Components: API / Python
            Reporter: Dian Fu
            Assignee: Dian Fu
             Fix For: 1.11.0


Scalar Python UDF has already been supported in Flink 1.10 
([FLIP-58|https://cwiki.apache.org/confluence/display/FLINK/FLIP-58%3A+Flink+Python+User-Defined+Stateless+Function+for+Table])
 and it operates one row at a time. It works in the way that the Java operator 
serializes one input row to bytes and sends them to the Python worker; the 
Python worker deserializes the input row and evaluates the Python UDF with it; 
the result row is serialized and sent back to the Java operator.

It suffers from the following problems:
 # High serialization/deserialization overhead
 # It’s difficult to leverage the popular Python libraries used by data 
scientists, such as Pandas, Numpy, etc which provide high performance data 
structure and functions.

We want to introduce vectorized Python UDF to address this problem. For 
vectorized Python UDF, a batch of rows are transferred between JVM and Python 
VM in columnar format. The batch of rows will be converted to a collection of 
Pandas.Series and given to the vectorized Python UDF which could then leverage 
the popular Python libraries such as Pandas, Numpy, etc for the Python UDF 
implementation.

More details could be found in 
[FLIP-97.|https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to