Yes, your understanding is correct.

So model.predict accepts pandas.Series as inputs? If that's the case, then I 
guess a Pandas UDF is a perfect choice for your requirements.
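
If model.predict really does take pandas.Series inputs, the UDF could look 
roughly like the following untested sketch (the load_model() helper, the 
predict() signature and the FLOAT result type are assumptions for 
illustration):

import pandas as pd
from pyflink.table import DataTypes
from pyflink.table.udf import udf

# Hypothetical: load the pre-trained model once, outside the UDF body.
model = load_model("/path/to/model")  # placeholder loader, not a real API

@udf(result_type=DataTypes.FLOAT(), func_type="pandas")
def predict(x1: pd.Series, x2: pd.Series, x3: pd.Series) -> pd.Series:
    # Each argument is a whole pandas.Series (one batch per call), so the
    # model can score the batch with a single predict() call.
    return pd.Series(model.predict(x1, x2, x3))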

Regards,
Dian

> On Apr 19, 2021, at 8:23 PM, Yik San Chan <evan.chanyik...@gmail.com> wrote:
> 
> Hi Dian,
> 
> By "access data at row basis", do you mean, for input X,
> 
> for row in X:
>     doSomething(row)
> 
> If that's the case, I believe I am not accessing the vector like that. What I 
> do is pretty much, for input X1, X2 and X3:
> 
> model = ...
> predictions = model.predict(X1, X2, X3)
> 
> Do I understand it correctly?
> 
> Best,
> Yik San 
> 
> On Mon, Apr 19, 2021 at 7:45 PM Dian Fu <dian0511...@gmail.com> wrote:
> I have not tested this and so I have no direct answer to this question. 
> 
> There are some tricky things behind this. For a Pandas UDF, the input data 
> will be organized in a columnar format. However, if there are multiple input 
> arguments to the Pandas UDF and you access data on a row basis in the Pandas 
> UDF implementation, then cache locality may become a problem, as you need to 
> access the element at position i in each of the columnar data structures 
> when processing the ith row.
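> 
> As a rough, untested illustration of that row-wise access pattern (the 
> function and its per-row logic are made up for the example):
> 
> import pandas as pd
> from pyflink.table import DataTypes
> from pyflink.table.udf import udf
> 
> @udf(result_type=DataTypes.STRING(), func_type="pandas")
> def concat_rows(a: pd.Series, b: pd.Series, c: pd.Series) -> pd.Series:
>     results = []
>     # Handling the ith row means picking element i out of each of the three
>     # columnar Series, which is where cache locality can suffer.
>     for i in range(len(a)):
>         results.append(f"{a.iloc[i]}-{b.iloc[i]}-{c.iloc[i]}")
>     return pd.Series(results)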
> 
> Regards,
> Dian
> 
>> On Apr 19, 2021, at 4:40 PM, Yik San Chan <evan.chanyik...@gmail.com> wrote:
>> 
>> Hmm, one more question. As I said, there are two gains from using a Pandas 
>> UDF: (1) smaller ser-de and invocation overhead, and (2) vectorized 
>> calculation.
>> 
>> (2) depends on the use case, but how about (1)? Does benefit (1) always hold?
>> 
>> Best,
>> Yik San
>> 
>> On Mon, Apr 19, 2021 at 4:33 PM Yik San Chan <evan.chanyik...@gmail.com> wrote:
>> Hi Fabian and Dian,
>> 
>> Thanks for the reply. They make sense.
>> 
>> Best,
>> Yik San
>> 
>> On Mon, Apr 19, 2021 at 9:49 AM Dian Fu <dian0511...@gmail.com> wrote:
>> Hi Yik San,
>> 
>> It depends very much on what you want to do in your Python UDF 
>> implementation. As you know, for a vectorized Python UDF (aka Pandas UDF), 
>> the input data is organized in a columnar format. So if your Python UDF 
>> implementation can benefit from this, e.g. by making use of the 
>> functionality provided in column-oriented libraries such as Pandas, Numpy, 
>> etc., then a vectorized Python UDF is usually the better choice. However, 
>> if you have to operate on the input data one row at a time, then I guess a 
>> non-vectorized Python UDF is enough.
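>> 
>> To make the contrast concrete, here is a small, untested sketch of the same 
>> logic written as a scalar UDF and as a Pandas UDF:
>> 
>> import pandas as pd
>> from pyflink.table import DataTypes
>> from pyflink.table.udf import udf
>> 
>> # Non-vectorized (scalar) UDF: called once per row with plain Python values.
>> @udf(result_type=DataTypes.BIGINT())
>> def add_scalar(a, b):
>>     return a + b
>> 
>> # Vectorized (Pandas) UDF: called once per batch with pandas.Series, so the
>> # addition is a single columnar operation.
>> @udf(result_type=DataTypes.BIGINT(), func_type="pandas")
>> def add_vectorized(a: pd.Series, b: pd.Series) -> pd.Series:
>>     return a + b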
>> 
>> PS: you could also run some performance tests when it’s unclear which one 
>> is better.
>> 
>> Regards,
>> Dian
>> 
>>> On Apr 16, 2021, at 8:24 PM, Fabian Paul <fabianp...@data-artisans.com> wrote:
>>> 
>>> Hi Yik San,
>>> 
>>> I think the usefulness of vectorized UDFs highly depends on your input and 
>>> output formats. For your example, my first impression is that parsing a 
>>> JSON string is always a rather expensive operation, and vectorization does 
>>> not have much impact on that.
>>> 
>>> I am cc'ing Dian Fu, who is more familiar with PyFlink.
>>> 
>>> Best,
>>> Fabian
>>> 
>>>> On 16. Apr 2021, at 11:04, Yik San Chan <evan.chanyik...@gmail.com> wrote:
>>>> 
>>>> The question is cross-posted on Stack Overflow: 
>>>> https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar
>>>> 
>>>> Is there a simple set of rules to follow when deciding between vectorized 
>>>> and scalar PyFlink UDFs?
>>>> 
>>>> According to the 
>>>> [docs](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html), 
>>>> a vectorized UDF has two advantages: (1) smaller ser-de and invocation 
>>>> overhead, and (2) vector calculations are highly optimized thanks to 
>>>> libraries such as Numpy.
>>>> 
>>>> > Vectorized Python user-defined functions are functions which are 
>>>> > executed by transferring a batch of elements between JVM and Python VM 
>>>> > in Arrow columnar format. The performance of vectorized Python 
>>>> > user-defined functions are usually much higher than non-vectorized 
>>>> > Python user-defined functions as the serialization/deserialization 
>>>> > overhead and invocation overhead are much reduced. Besides, users could 
>>>> > leverage the popular Python libraries such as Pandas, Numpy, etc for the 
>>>> > vectorized Python user-defined functions implementation. These Python 
>>>> > libraries are highly optimized and provide high-performance data 
>>>> > structures and functions.
>>>> 
>>>> **QUESTION 1**: Is a vectorized UDF ALWAYS preferred?
>>>> 
>>>> Let's say, in my use case, I want to simply extract some fields from a 
>>>> JSON column, which is not yet supported by Flink's [built-in 
>>>> functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html), 
>>>> so I need to define my UDF like:
>>>> 
>>>> ```python
>>>> @udf(...)
>>>> def extract_field_from_json(json_value, field_name):
>>>>     import json
>>>>     return json.loads(json_value)[field_name]
>>>> ```
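>>>> 
>>>> For comparison, the vectorized counterpart of this function might be 
>>>> declared roughly like this (an untested sketch; the STRING result type 
>>>> and the explicit imports are assumptions for illustration):
>>>> 
>>>> ```python
>>>> import pandas as pd
>>>> from pyflink.table import DataTypes
>>>> from pyflink.table.udf import udf
>>>> 
>>>> @udf(result_type=DataTypes.STRING(), func_type="pandas")
>>>> def extract_field_from_json(json_value: pd.Series, field_name: pd.Series) -> pd.Series:
>>>>     import json
>>>>     # Both arguments arrive as whole pandas.Series; json.loads still runs
>>>>     # once per element inside the list comprehension.
>>>>     return pd.Series([json.loads(v)[f] for v, f in zip(json_value, field_name)])
>>>> ```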
>>>> 
>>>> **QUESTION 2**: Will I also benefit from a vectorized UDF in this case?
>>>> 
>>>> Best,
>>>> Yik San
>>> 
>> 
> 
