Re: Usage of PyArrow in Spark

2019-07-18 Thread Bryan Cutler
It would be possible to use Arrow on regular Python UDFs and avoid Pandas, and there would probably be some performance improvement. The difficult part will be ensuring that the data remains consistent in the conversions between Arrow and Python; e.g. timestamps are a bit tricky. Given that we al…
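
(A minimal sketch of the timestamp round-trip pitfall mentioned above; the use of pyarrow's timestamp type and to_pylist here is an illustrative assumption, not code from the thread:)

    import datetime
    import pyarrow as pa

    # Build a nanosecond-precision Arrow timestamp column.
    arr = pa.array([datetime.datetime(2019, 7, 18, 12, 0)],
                   type=pa.timestamp("ns"))

    # Converting back to plain Python objects goes through
    # datetime.datetime, which only has microsecond precision,
    # so any sub-microsecond detail would be dropped.
    print(arr.to_pylist())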

Re: Usage of PyArrow in Spark

2019-07-18 Thread Abdeali Kothari
I was thinking of implementing that, but quickly realized that a Spark -> Pandas -> Python conversion causes errors. A quick example is "None" in numeric data types: Pandas supports only NaN, while Spark supports both NULL and NaN. This is just one of the issues I came across. I'm not sure about som…
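
(A quick sketch of the None/NaN issue described above, assuming pandas is installed:)

    import pandas as pd

    # An integer column with a missing value: pandas has no integer
    # NULL, so it coerces the column to float64 and stores NaN.
    # Spark's distinction between NULL and NaN is lost here.
    s = pd.Series([1, 2, None])
    print(s.dtype)   # float64
    print(s.isna())  # the None is now indistinguishable from NaN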

Re: Usage of PyArrow in Spark

2019-07-17 Thread Hyukjin Kwon
Regular Python UDFs don't use PyArrow under the hood. Yes, they could potentially benefit, but this can easily be worked around via Pandas UDFs. For instance, the two below are virtually identical:

    @udf(...)
    def func(col):
        return col

    @pandas_udf(...)
    def pandas_func(col):
        return col.apply(lambda col: col)
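
(A runnable version of the comparison above; the "long" return type and the toy DataFrame are illustrative assumptions, since the original snippet used "..." placeholders:)

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, pandas_udf

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(3)  # toy input with a single "id" column

    @udf("long")         # row-at-a-time Python UDF
    def func(col):
        return col

    @pandas_udf("long")  # vectorized: col arrives as a pandas Series
    def pandas_func(col):
        return col.apply(lambda v: v)

    df.select(func("id"), pandas_func("id")).show()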