Re: PyFlink UDF: When to use vectorized vs scalar

Fabian Paul Fri, 16 Apr 2021 05:25:20 -0700

Hi Yik San,

I think the usage of vectorized udfs highly depends on your input and output 
formats. For your example my first impression would say that parsing a JSON 
string is always an rather expensive operation and the vectorization has not 
much impact on that.


I am ccing Dian Fu who is more familiar with pyflink

Best,
Fabian

> On 16. Apr 2021, at 11:04, Yik San Chan <evan.chanyik...@gmail.com> wrote:
> 
> The question is cross-posted on Stack Overflow 
> https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar
>  
> <https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar>
> 
> Is there a simple set of rules to follow when deciding between vectorized vs 
> scalar PyFlink UDF?
> 
> According to 
> [docs](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html
>  
> <https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html>),
>  vectorized UDF has advantages of: (1) smaller ser-de and invocation overhead 
> (2) Vector calculation are highly optimized thanks to libs such as Numpy.
> 
> > Vectorized Python user-defined functions are functions which are executed 
> > by transferring a batch of elements between JVM and Python VM in Arrow 
> > columnar format. The performance of vectorized Python user-defined 
> > functions are usually much higher than non-vectorized Python user-defined 
> > functions as the serialization/deserialization overhead and invocation 
> > overhead are much reduced. Besides, users could leverage the popular Python 
> > libraries such as Pandas, Numpy, etc for the vectorized Python user-defined 
> > functions implementation. These Python libraries are highly optimized and 
> > provide high-performance data structures and functions.
> 
> **QUESTION 1**: Is vectorized UDF ALWAYS preferred? 
> 
> Let's say, in my use case, I want to simply extract some fields from a JSON 
> column, that is not supported by Flink [built-in 
> functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html
>  
> <https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html>)
>  yet, therefore I need to define my udf like:
> 
> ```python
> @udf(...)
> def extract_field_from_json(json_value, field_name):
>     import json
>     return json.loads(json_value)[field_name]
> ```
> 
> **QUESTION 2**: Will I also benefit from vectorized UDF in this case?
> 
> Best,
> Yik San

Re: PyFlink UDF: When to use vectorized vs scalar

Reply via email to