[DISCUSS] Support scalar vectorized Python UDF in PyFlink

dianfu Tue, 04 Feb 2020 20:46:27 -0800

Hi all,

Scalar Python UDFs are already supported in the upcoming 1.10 release 
(FLIP-58[1]). They operate on one row at a time: the Java operator serializes 
one input row to bytes and sends it to the Python worker; the Python worker 
deserializes the input row and evaluates the Python UDF with it; the result 
row is then serialized and sent back to the Java operator.
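
To make the per-row cost concrete, here is a rough sketch of that round trip. 
It is only an illustration: both sides are simulated in Python, pickle stands 
in for whatever wire format is actually used, and the names add_one, 
python_worker_eval and run_row_at_a_time are hypothetical, not PyFlink APIs.

```python
import pickle


def add_one(x):
    """Hypothetical scalar UDF: evaluated once per input row."""
    return x + 1


def python_worker_eval(udf, row_bytes):
    # Worker side: deserialize one row, evaluate the UDF on it,
    # and serialize the single result.
    row = pickle.loads(row_bytes)
    return pickle.dumps(udf(*row))


def run_row_at_a_time(udf, rows):
    # Operator side (Java in the real system, simulated in Python here):
    # every row pays for one serialize/deserialize round trip.
    results = []
    for row in rows:
        result_bytes = python_worker_eval(udf, pickle.dumps(row))
        results.append(pickle.loads(result_bytes))
    return results


row_results = run_row_at_a_time(add_one, [(1,), (2,), (3,)])
```

With N input rows this performs N serializations and N deserializations in 
each direction, and the UDF only ever sees plain Python values.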


This approach suffers from the following problems:
1) High serialization/deserialization overhead.
2) It is difficult to leverage the popular Python libraries used by data 
scientists, such as Pandas and NumPy, which provide high-performance data 
structures and functions.

Jincheng and I have discussed this offline and we would like to introduce 
vectorized Python UDFs to address the above problems. This feature was also 
mentioned in the discussion thread about the Python API plan[2]. For a 
vectorized Python UDF, a batch of rows is transferred between the JVM and the 
Python VM in columnar format. The batch of rows is converted to a collection 
of pandas.Series objects and passed to the vectorized Python UDF, which can 
then leverage popular Python libraries such as Pandas and NumPy in its 
implementation.
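
As a rough sketch of the idea (not the proposed API; see the design doc for 
that), a vectorized UDF receives one pandas.Series per input column and 
returns a pandas.Series of results. The names add_vectorized and 
run_vectorized below are hypothetical, and the columnar batch is modelled as 
plain Python lists purely for illustration.

```python
import pandas as pd


def add_vectorized(a: pd.Series, b: pd.Series) -> pd.Series:
    """Hypothetical vectorized UDF: called once per batch, not once per row."""
    return a + b


def run_vectorized(udf, columns):
    # `columns` models a batch of rows in columnar form: one list per column.
    # Each column becomes a pandas.Series and the UDF is evaluated once
    # on the whole batch.
    series_args = [pd.Series(col) for col in columns]
    result = udf(*series_args)
    return result.tolist()


# One batch of three rows with two columns each.
batch_results = run_vectorized(add_vectorized, [[1, 2, 3], [10, 20, 30]])
```

Here the UDF is invoked once for the whole batch, and the addition is carried 
out by Pandas rather than by a Python-level loop over rows.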

Please refer to the design doc[3] for more details. Any feedback is welcome.

Regards,
Dian

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-58%3A+Flink+Python+User-Defined+Stateless+Function+for+Table
[2] 
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-What-parts-of-the-Python-API-should-we-focus-on-next-tt36119.html
[3] 
https://docs.google.com/document/d/1ls0mt96YV1LWPHfLOh8v7lpgh1KsCNx8B9RrW1ilnoE/edit#heading=h.qivada1u8hwd
