Re: PyFlink UDF: When to use vectorized vs scalar

Yik San Chan Mon, 19 Apr 2021 01:41:08 -0700

Hmm, one more question. As I said, there are two gains from using a pandas UDF:
(1) smaller ser-de and invocation overhead, and (2) vector calculation.

(2) depends on the use case, but how about (1)? Does benefit (1) always hold?
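
To make the two gains concrete for the JSON example quoted further down, here is a minimal sketch in plain Python. It does not use PyFlink or pandas: `run_scalar`, `run_vectorized`, the `calls` counter, and the batch size of 4 are hypothetical stand-ins for how a runtime might hand rows to a UDF, not the actual PyFlink dispatch code. In real PyFlink the batch would arrive as a `pandas.Series` transferred in Arrow format; a plain list is used here only to keep the sketch self-contained.

```python
import json

# Hypothetical counters: how often the simulated runtime calls into user code.
calls = {"scalar": 0, "vectorized": 0}


def extract_scalar(json_value, field_name):
    """Row-at-a-time UDF body: invoked once per row."""
    calls["scalar"] += 1
    return json.loads(json_value)[field_name]


def extract_vectorized(json_values, field_name):
    """Batch UDF body: invoked once per batch.

    json.loads has no columnar form, so each element is still parsed
    individually inside the batch.
    """
    calls["vectorized"] += 1
    return [json.loads(v)[field_name] for v in json_values]


def run_scalar(rows, field_name):
    # Simulated runtime: one call into user code per row.
    return [extract_scalar(r, field_name) for r in rows]


def run_vectorized(rows, field_name, batch_size):
    # Simulated runtime: one call into user code per batch of rows.
    out = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        out.extend(extract_vectorized(batch, field_name))
    return out


rows = [json.dumps({"id": i, "name": f"user{i}"}) for i in range(10)]
scalar_result = run_scalar(rows, "id")
vectorized_result = run_vectorized(rows, "id", batch_size=4)
```

In this sketch both paths return the same values, and the number of calls into user code drops from one per row to one per batch, which is the kind of saving gain (1) refers to. The parsing work itself is unchanged, so gain (2) does not show up for this function. Whether the saving in (1) matters in practice would still need a real benchmark.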

Best,
Yik San

On Mon, Apr 19, 2021 at 4:33 PM Yik San Chan <evan.chanyik...@gmail.com>
wrote:

> Hi Fabian and Dian,
>
> Thanks for the replies. They make sense.
>
> Best,
> Yik San
>
> On Mon, Apr 19, 2021 at 9:49 AM Dian Fu <dian0511...@gmail.com> wrote:
>
>> Hi Yik San,
>>
>> It depends largely on what you want to do in your Python UDF implementation.
>> As you know, for a vectorized Python UDF (a.k.a. Pandas UDF), the input
>> data is organized in a columnar format. So if your Python UDF implementation
>> can benefit from this, e.g. by using functionality provided by
>> column-oriented libraries such as Pandas and NumPy, then a
>> vectorized Python UDF is usually the better choice. However, if you have to
>> operate on the input data one row at a time, then I guess that a
>> non-vectorized Python UDF is enough.
>>
>> PS: you could also run some performance tests when it’s unclear which one
>> is better.
>>
>> Regards,
>> Dian
>>
>> On Apr 16, 2021, at 8:24 PM, Fabian Paul <fabianp...@data-artisans.com> wrote:
>>
>> Hi Yik San,
>>
>> I think whether to use vectorized UDFs depends heavily on your input and
>> output formats. For your example, my first impression is that parsing
>> a JSON string is always a rather expensive operation, and vectorization
>> does not have much impact on that.
>>
>> I am CCing Dian Fu, who is more familiar with PyFlink.
>>
>> Best,
>> Fabian
>>
>> On 16. Apr 2021, at 11:04, Yik San Chan <evan.chanyik...@gmail.com>
>> wrote:
>>
>> The question is cross-posted on Stack Overflow
>> https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar
>>
>> Is there a simple set of rules to follow when deciding between vectorized
>> vs scalar PyFlink UDF?
>>
>> According to the [docs](
>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html),
>> a vectorized UDF has two advantages: (1) smaller ser-de and invocation
>> overhead, and (2) vector calculations are highly optimized thanks to
>> libraries such as NumPy.
>>
>> > Vectorized Python user-defined functions are functions which are
>> executed by transferring a batch of elements between JVM and Python VM in
>> Arrow columnar format. The performance of vectorized Python user-defined
>> functions are usually much higher than non-vectorized Python user-defined
>> functions as the serialization/deserialization overhead and invocation
>> overhead are much reduced. Besides, users could leverage the popular Python
>> libraries such as Pandas, Numpy, etc for the vectorized Python user-defined
>> functions implementation. These Python libraries are highly optimized and
>> provide high-performance data structures and functions.
>>
>> **QUESTION 1**: Is a vectorized UDF ALWAYS preferred?
>>
>> Let's say that, in my use case, I simply want to extract some fields from a
>> JSON column. This is not yet supported by Flink's [built-in functions](
>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html),
>> so I need to define my own UDF like this:
>>
>> ```python
>> @udf(...)
>> def extract_field_from_json(json_value, field_name):
>>     import json
>>     return json.loads(json_value)[field_name]
>> ```
>>
>> **QUESTION 2**: Will I also benefit from a vectorized UDF in this case?
>>
>> Best,
>> Yik San
>>
