Re: [DISCUSS] Support scalar vectorized Python UDF in PyFlink

Jingsong Li Wed, 05 Feb 2020 00:58:14 -0800

Hi Dian,

+1 for this, thanks for driving it.
The documentation looks very good. I can imagine a huge performance
improvement and better integration with other Python libraries.


A few thoughts:
- About data split: can we unify "python.fn-execution.arrow.batch.size"
with "python.fn-execution.bundle.size"?
- Use of Apache Arrow as the exchange format: do you mean that Arrow
supports zero-copy between Java and Python?
- ArrowFieldWriter: it seems we could implement it with code generation, but
it is OK for the initial version to use virtual function calls.
- ColumnarRow for vectorized reading: it seems we need to implement
ArrowColumnVectors.
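
To make the data-split question concrete, here is a minimal sketch in plain
Python. It is not PyFlink code: it assumes that one bundle of rows is cut
into columnar batches of at most a configured size, and the names
split_into_batches and to_columns are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple

Row = Tuple[object, ...]


def to_columns(rows: Sequence[Row]) -> List[list]:
    """Transpose a sequence of rows into a list of columns."""
    return [list(col) for col in zip(*rows)]


def split_into_batches(
    bundle: Sequence[Row], batch_size: int
) -> Iterator[List[list]]:
    """Yield columnar batches of at most batch_size rows from one bundle.

    Assumption: batch_size plays the role of an Arrow batch size and
    len(bundle) the role of a bundle size; neither is read from real config.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(bundle), batch_size):
        yield to_columns(bundle[start:start + batch_size])


# A bundle of 5 rows split with a batch size of 2 gives batches of 2, 2, 1.
bundle = [(i, i * 10) for i in range(5)]
batches = list(split_into_batches(bundle, batch_size=2))
```

If the two sizes were unified, every bundle would map to exactly one batch
in this sketch; keeping them separate lets a large bundle be processed in
smaller columnar chunks.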

Best,
Jingsong Lee

On Wed, Feb 5, 2020 at 12:45 PM dianfu <[email protected]> wrote:

> Hi all,
>
> Scalar Python UDFs are already supported in the upcoming 1.10 release
> (FLIP-58[1]). They operate one row at a time: the Java operator serializes
> one input row to bytes and sends it to the Python worker; the Python worker
> deserializes the input row and evaluates the Python UDF with it; the result
> row is then serialized and sent back to the Java operator.
>
> This approach suffers from the following problems:
> 1) High serialization/deserialization overhead.
> 2) It’s difficult to leverage the popular Python libraries used by data
> scientists, such as Pandas, NumPy, etc., which provide high-performance
> data structures and functions.
>
> Jincheng and I have discussed this offline, and we want to introduce
> vectorized Python UDFs to address the above problems. This feature has
> also been mentioned in the discussion thread about the Python API plan[2].
> For a vectorized Python UDF, a batch of rows is transferred between the
> JVM and the Python VM in columnar format. The batch of rows is converted
> to a collection of Pandas.Series and passed to the vectorized Python UDF,
> which can then leverage popular Python libraries such as Pandas, NumPy,
> etc. in its implementation.
>
> Please refer to the design doc[3] for more details. Any feedback is welcome.
>
> Regards,
> Dian
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-58%3A+Flink+Python+User-Defined+Stateless+Function+for+Table
> [2]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-What-parts-of-the-Python-API-should-we-focus-on-next-tt36119.html
> [3]
> https://docs.google.com/document/d/1ls0mt96YV1LWPHfLOh8v7lpgh1KsCNx8B9RrW1ilnoE/edit#heading=h.qivada1u8hwd
>
>
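
The row-at-a-time versus batched exchange described in the quoted proposal
can be sketched as follows. This is only an illustration under stated
assumptions: pickle stands in for the real wire format, plain lists stand
in for Pandas.Series, and both sides run in one process. The function names
are hypothetical and are not part of PyFlink.

```python
import pickle
from typing import Callable, List, Tuple

Row = Tuple[int, int]


def row_at_a_time(rows: List[Row], udf: Callable[[int, int], int]):
    """Serialize, evaluate, and send back one row per round trip."""
    results, round_trips = [], 0
    for row in rows:
        payload = pickle.dumps(row)        # sender serializes one input row
        a, b = pickle.loads(payload)       # worker deserializes the row
        out = pickle.dumps(udf(a, b))      # worker serializes the result
        results.append(pickle.loads(out))  # sender reads the result back
        round_trips += 1
    return results, round_trips


def vectorized(rows: List[Row], udf: Callable[[list, list], list],
               batch_size: int):
    """Serialize, evaluate, and send back one columnar batch per round trip."""
    results, round_trips = [], 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        columns = [list(col) for col in zip(*chunk)]  # rows -> columns
        payload = pickle.dumps(columns)
        col_a, col_b = pickle.loads(payload)
        out = pickle.dumps(udf(col_a, col_b))         # one call per batch
        results.extend(pickle.loads(out))
        round_trips += 1
    return results, round_trips


def add(a: int, b: int) -> int:
    return a + b


def add_columns(col_a: list, col_b: list) -> list:
    return [a + b for a, b in zip(col_a, col_b)]


rows = [(i, 2 * i) for i in range(10)]
scalar_results, scalar_trips = row_at_a_time(rows, add)
vector_results, vector_trips = vectorized(rows, add_columns, batch_size=4)
```

Both paths compute the same results, but the batched path needs one
serialization round trip per batch instead of one per row, which is where
the reduced overhead would come from.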

-- 
Best, Jingsong Lee
