Hi Yik San, I think the usage of vectorized udfs highly depends on your input and output formats. For your example my first impression would say that parsing a JSON string is always an rather expensive operation and the vectorization has not much impact on that.
I am ccing Dian Fu who is more familiar with pyflink Best, Fabian > On 16. Apr 2021, at 11:04, Yik San Chan <evan.chanyik...@gmail.com> wrote: > > The question is cross-posted on Stack Overflow > https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar > > <https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar> > > Is there a simple set of rules to follow when deciding between vectorized vs > scalar PyFlink UDF? > > According to > [docs](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html > > <https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html>), > vectorized UDF has advantages of: (1) smaller ser-de and invocation overhead > (2) Vector calculation are highly optimized thanks to libs such as Numpy. > > > Vectorized Python user-defined functions are functions which are executed > > by transferring a batch of elements between JVM and Python VM in Arrow > > columnar format. The performance of vectorized Python user-defined > > functions are usually much higher than non-vectorized Python user-defined > > functions as the serialization/deserialization overhead and invocation > > overhead are much reduced. Besides, users could leverage the popular Python > > libraries such as Pandas, Numpy, etc for the vectorized Python user-defined > > functions implementation. These Python libraries are highly optimized and > > provide high-performance data structures and functions. > > **QUESTION 1**: Is vectorized UDF ALWAYS preferred? > > Let's say, in my use case, I want to simply extract some fields from a JSON > column, that is not supported by Flink [built-in > functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html > > <https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html>) > yet, therefore I need to define my udf like: > > ```python > @udf(...) > def extract_field_from_json(json_value, field_name): > import json > return json.loads(json_value)[field_name] > ``` > > **QUESTION 2**: Will I also benefit from vectorized UDF in this case? > > Best, > Yik San