It would be possible to use Arrow on regular Python UDFs and avoid Pandas,
and there would probably be some performance improvement. The difficult
part will be to ensure that the data remains consistent in the conversions
between Arrow and Python; timestamps, for example, are a bit tricky (a small
sketch of that round trip is below). Given that we al…
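As a minimal sketch of the kind of round trip involved (pyarrow only, the
timestamp value and "us" unit are just illustrative):

import pyarrow as pa
from datetime import datetime

# Round-trip a naive timestamp through Arrow; the value survives here, but
# time zone handling (naive vs. tz-aware, unit truncation) is where
# Spark <-> Arrow <-> Python conversions tend to diverge.
arr = pa.array([datetime(2021, 1, 1, 12, 30, 0)], type=pa.timestamp("us"))
print(arr.to_pylist())  # [datetime.datetime(2021, 1, 1, 12, 30)]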
I was thinking of implementing that, but quickly realized that doing a
conversion of Spark -> Pandas -> Python causes errors.
A quick example is None in numeric data types: Pandas represents missing
numeric values only as NaN, while Spark distinguishes NULL from NaN
(see the small sketch below).
This is just one of the issues I ran into.
I'm not sure about som…
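To illustrate the NULL/NaN point (plain pandas, no Spark involved):

import pandas as pd
import numpy as np

# In a float64 pandas Series, both a missing value (Spark's NULL) and NaN
# collapse into NaN, so the distinction Spark keeps is lost after conversion.
s = pd.Series([1.0, None, np.nan])
print(s.isna().tolist())  # [False, True, True] -- None and NaN are indistinguishable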
Regular Python UDFs don't use PyArrow under the hood.
Yes, they could potentially benefit from it, but the lack of it can easily be
worked around via Pandas UDFs.
For instance, the two UDFs below are virtually identical.
from pyspark.sql.functions import udf, pandas_udf

# Plain Python UDF, called once per value ("long" return type is a placeholder).
@udf("long")
def func(col):
    return col

# Equivalent Pandas UDF, called once per batch with a pandas Series.
@pandas_udf("long")
def pandas_func(col):
    return col.apply(lambda v: v)
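As a quick check (a sketch, assuming the two definitions above and a running
SparkSession), applying both to the same column yields the same result:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(3)
# Both output columns are identical; only the execution path differs.
df.select(func("id"), pandas_func("id")).show()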