HyukjinKwon commented on code in PR #50034: URL: https://github.com/apache/spark/pull/50034#discussion_r1964752604
##########
python/docs/source/migration_guide/pyspark_upgrade.rst:
##########
@@ -75,6 +75,7 @@ Upgrading from PySpark 3.5 to 4.0
 * In Spark 4.0, ``compute.ops_on_diff_frames`` is on by default. To restore the previous behavior, set ``compute.ops_on_diff_frames`` to ``false``.
 * In Spark 4.0, the data type ``YearMonthIntervalType`` in ``DataFrame.collect`` no longer returns the underlying integers. To restore the previous behavior, set ``PYSPARK_YM_INTERVAL_LEGACY`` environment variable to ``1``.
 * In Spark 4.0, items other than functions (e.g. ``DataFrame``, ``Column``, ``StructType``) have been removed from the wildcard import ``from pyspark.sql.functions import *``, you should import these items from proper modules (e.g. ``from pyspark.sql import DataFrame, Column``, ``from pyspark.sql.types import StructType``).
+* In Spark 4.0, ``spark.sql.execution.pythonUDF.arrow.enabled`` is enabled by default. If users have PyArrow and pandas installed in their local and Spark Cluster, it automatically optimizes the regular Python UDFs with Arrow. To turn off the Arrow optimization, set ``spark.sql.execution.pythonUDF.arrow.enabled`` to ``false``.

Review Comment:
That is actually subtle. There are some type coercion differences when the returned instance does not match the declared return schema, e.g., https://github.com/apache/spark/blob/master/python/pyspark/sql/functions/builtin.py#L26484-L26502 vs https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/functions.py#L346-L364 (but those are internal, not public documentation). So my take here is that if users hit any issue running legacy Python UDFs in Spark 4.0, they will likely see Arrow errors, google them, read this note, and turn the optimization off.
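For anyone who lands here after hitting such Arrow errors, a minimal sketch of the opt-out described in the migration note might look like the following. The config key is the one quoted in the note; the helper names `arrow_udf_conf` and `build_session` are hypothetical and only for illustration, and `build_session` assumes PySpark is installed.

```python
# Config key quoted from the migration note above.
ARROW_UDF_KEY = "spark.sql.execution.pythonUDF.arrow.enabled"


def arrow_udf_conf(enabled: bool) -> dict:
    """Return the session config that toggles Arrow-optimized Python UDFs.

    Hypothetical helper; Spark config values are passed as strings.
    """
    return {ARROW_UDF_KEY: "true" if enabled else "false"}


def build_session(enabled: bool = False):
    """Create a SparkSession with the Arrow UDF flag set (assumes PySpark is installed)."""
    # Imported lazily so arrow_udf_conf stays usable without PySpark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in arrow_udf_conf(enabled).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


print(arrow_udf_conf(False))
# → {'spark.sql.execution.pythonUDF.arrow.enabled': 'false'}
```

On an already running session, the same key can be set with `spark.conf.set(...)`. A single UDF can also opt out with `udf(..., useArrow=False)`, which leaves the session default untouched.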