Re: PyFlink UDF: When to use vectorized vs scalar

Yik San Chan Mon, 19 Apr 2021 05:23:25 -0700

Hi Dian,

By "access data at row basis", do you mean, for input X,


for row in X:
    doSomething(row)

If that's the case, I believe I am not accessing the vector like that. What
I do is pretty much, for input X1, X2 and X3:

model = ...
predictions = model.predict(X1, X2, X3)

Do I understand it correctly?
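
To make the two access patterns concrete, here is a minimal sketch, not tested against Flink. Plain Python lists stand in for the pandas.Series columns a Pandas UDF would receive, and LinearModel, predict_row_by_row and predict_columnar are made-up names for illustration, not PyFlink APIs:

```python
# Hypothetical stand-in for a real model; not a PyFlink or library class.
class LinearModel:
    def __init__(self, w1, w2, w3):
        self.w = (w1, w2, w3)

    def predict_row(self, a, b, c):
        # Scores a single row.
        w1, w2, w3 = self.w
        return w1 * a + w2 * b + w3 * c

    def predict(self, x1, x2, x3):
        # Consumes whole columns in one call. A real model would use
        # vectorized NumPy operations here; zip keeps the sketch
        # dependency-free.
        return [self.predict_row(a, b, c) for a, b, c in zip(x1, x2, x3)]


def predict_row_by_row(model, x1, x2, x3):
    # Row-basis access: for each i, read position i of every column.
    out = []
    for i in range(len(x1)):
        out.append(model.predict_row(x1[i], x2[i], x3[i]))
    return out


def predict_columnar(model, x1, x2, x3):
    # Column-basis access: hand the whole columns to the model at once.
    return model.predict(x1, x2, x3)


model = LinearModel(1.0, 2.0, 3.0)
x1, x2, x3 = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
row_result = predict_row_by_row(model, x1, x2, x3)
col_result = predict_columnar(model, x1, x2, x3)
```

Both functions return the same predictions; they differ only in whether the UDF body indexes into each column per row or passes the columns through untouched. My case corresponds to predict_columnar.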

Best,
Yik San

On Mon, Apr 19, 2021 at 7:45 PM Dian Fu <dian0511...@gmail.com> wrote:

> I have not tested this and so I have no direct answer to this question.
>
> There are some tricky things behind this. For a Pandas UDF, the input data
> is organized in columnar format. However, if there are multiple input
> arguments for the Pandas UDF and you access data at row basis in the Pandas
> UDF implementation, then cache locality may become a problem, because you
> need to access the element at position i in each of the columnar data
> structures when processing the ith row.
>
> Regards,
> Dian
>
> On Apr 19, 2021, at 4:40 PM, Yik San Chan <evan.chanyik...@gmail.com> wrote:
>
> Hmm, one more question - as I said, there are 2 gains from using a pandas UDF
> - (1) smaller ser-de and invocation overhead, and (2) vector calculation.
>
> (2) depends on the use case, but how about (1)? Does benefit (1) always hold?
>
> Best,
> Yik San
>
> On Mon, Apr 19, 2021 at 4:33 PM Yik San Chan <evan.chanyik...@gmail.com>
> wrote:
>
>> Hi Fabian and Dian,
>>
>> Thanks for the replies. They make sense.
>>
>> Best,
>> Yik San
>>
>> On Mon, Apr 19, 2021 at 9:49 AM Dian Fu <dian0511...@gmail.com> wrote:
>>
>>> Hi Yik San,
>>>
>>> It depends very much on what you want to do in your Python UDF
>>> implementation. As you know, for a vectorized Python UDF (aka Pandas
>>> UDF), the input data is organized in columnar format. So if your Python
>>> UDF implementation can benefit from this, e.g. by making use of the
>>> functionality provided by libraries such as Pandas, Numpy, etc., which
>>> are column-oriented, then a vectorized Python UDF is usually the better
>>> choice. However, if you have to operate on the input data one row at a
>>> time, then I guess the non-vectorized Python UDF is enough.
>>>
>>> PS: you could also run some performance tests when it’s unclear which one
>>> is better.
>>>
>>> Regards,
>>> Dian
>>>
>>> On Apr 16, 2021, at 8:24 PM, Fabian Paul <fabianp...@data-artisans.com> wrote:
>>>
>>> Hi Yik San,
>>>
>>> I think the usefulness of vectorized UDFs depends heavily on your input and
>>> output formats. For your example, my first impression is that parsing
>>> a JSON string is always a rather expensive operation, and vectorization
>>> does not have much impact on that.
>>>
>>> I am ccing Dian Fu, who is more familiar with PyFlink.
>>>
>>> Best,
>>> Fabian
>>>
>>> On 16. Apr 2021, at 11:04, Yik San Chan <evan.chanyik...@gmail.com>
>>> wrote:
>>>
>>> The question is cross-posted on Stack Overflow
>>> https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar
>>>
>>> Is there a simple set of rules to follow when deciding between
>>> vectorized vs scalar PyFlink UDF?
>>>
>>> According to [docs](
>>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html),
>>> a vectorized UDF has two advantages: (1) smaller ser-de and invocation
>>> overhead, and (2) vector calculations are highly optimized thanks to libs
>>> such as Numpy.
>>>
>>> > Vectorized Python user-defined functions are functions which are
>>> executed by transferring a batch of elements between JVM and Python VM in
>>> Arrow columnar format. The performance of vectorized Python user-defined
>>> functions are usually much higher than non-vectorized Python user-defined
>>> functions as the serialization/deserialization overhead and invocation
>>> overhead are much reduced. Besides, users could leverage the popular Python
>>> libraries such as Pandas, Numpy, etc for the vectorized Python user-defined
>>> functions implementation. These Python libraries are highly optimized and
>>> provide high-performance data structures and functions.
>>>
>>> **QUESTION 1**: Is vectorized UDF ALWAYS preferred?
>>>
>>> Let's say that, in my use case, I simply want to extract some fields from a
>>> JSON column, which is not yet supported by Flink [built-in functions](
>>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html),
>>> so I need to define my UDF like:
>>>
>>> ```python
>>> @udf(...)
>>> def extract_field_from_json(json_value, field_name):
>>>     import json
>>>     return json.loads(json_value)[field_name]
>>> ```
>>>
>>> **QUESTION 2**: Will I also benefit from vectorized UDF in this case?
>>>
>>> Best,
>>> Yik San
>>>
>>>
>>>
>>>
>
