Hmm one more question - as I said, there are 2 gains from using pandas UDF - (1) smaller ser-de and invocation overhead, and (2) vector calculation.
(2) depends on use cases, how about (1)? Is the benefit (1) always-true? Best, Yik San On Mon, Apr 19, 2021 at 4:33 PM Yik San Chan <evan.chanyik...@gmail.com> wrote: > Hi Fabian and Dian, > > Thanks for the reply. They make sense. > > Best, > Yik San > > On Mon, Apr 19, 2021 at 9:49 AM Dian Fu <dian0511...@gmail.com> wrote: > >> Hi Yik San, >> >> It much depends on what you want to do in your Python UDF implementation. >> As you know that, for vectorized Python UDF (aka. Pandas UDF), the input >> data are organized as columnar format. So if your Python UDF implementation >> could benefit from this, e.g. making use of the functionalities provided in >> the libraries such as Pandas, Numpy, etc which are columnar oriented, then >> vectorized Python UDF is usually a better choice. However, if you have to >> operate the input data one row at a time, then I guess that the >> non-vectorized Python UDF is enough. >> >> PS: you could also run some performance test when it’s unclear which one >> is better. >> >> Regards, >> Dian >> >> 2021年4月16日 下午8:24,Fabian Paul <fabianp...@data-artisans.com> 写道: >> >> Hi Yik San, >> >> I think the usage of vectorized udfs highly depends on your input and >> output formats. For your example my first impression would say that parsing >> a JSON string is always an rather expensive operation and the vectorization >> has not much impact on that. >> >> I am ccing Dian Fu who is more familiar with pyflink >> >> Best, >> Fabian >> >> On 16. Apr 2021, at 11:04, Yik San Chan <evan.chanyik...@gmail.com> >> wrote: >> >> The question is cross-posted on Stack Overflow >> https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar >> >> Is there a simple set of rules to follow when deciding between vectorized >> vs scalar PyFlink UDF? >> >> According to [docs]( >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html), >> vectorized UDF has advantages of: (1) smaller ser-de and invocation >> overhead (2) Vector calculation are highly optimized thanks to libs such as >> Numpy. >> >> > Vectorized Python user-defined functions are functions which are >> executed by transferring a batch of elements between JVM and Python VM in >> Arrow columnar format. The performance of vectorized Python user-defined >> functions are usually much higher than non-vectorized Python user-defined >> functions as the serialization/deserialization overhead and invocation >> overhead are much reduced. Besides, users could leverage the popular Python >> libraries such as Pandas, Numpy, etc for the vectorized Python user-defined >> functions implementation. These Python libraries are highly optimized and >> provide high-performance data structures and functions. >> >> **QUESTION 1**: Is vectorized UDF ALWAYS preferred? >> >> Let's say, in my use case, I want to simply extract some fields from a >> JSON column, that is not supported by Flink [built-in functions]( >> https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html) >> yet, therefore I need to define my udf like: >> >> ```python >> @udf(...) >> def extract_field_from_json(json_value, field_name): >> import json >> return json.loads(json_value)[field_name] >> ``` >> >> **QUESTION 2**: Will I also benefit from vectorized UDF in this case? >> >> Best, >> Yik San >> >> >> >>