Note: forwarding to the list; I incorrectly hit "Reply" first instead of
"Reply List".
Hello,
Does your code run without enabling fallback mode? Arrow vectorization
might simply not be applied if you still observe "javaToPython" stages
in the Spark UI. Also data is not skewed (partitions are too larg
Hi,
I used these settings but did not see an obvious improvement (190 minutes
reduced to 170 minutes):
spark.sql.execution.arrow.pyspark.enabled: True
spark.sql.execution.arrow.pyspark.fallback.enabled: True
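
For anyone reproducing this, here is a minimal sketch of the check suggested earlier in the thread: keep Arrow enabled but switch fallback off, so that an Arrow problem surfaces as an error rather than a silent revert to the non-Arrow path. The names ARROW_CONF, to_submit_args and apply_to_session are hypothetical helpers made up for illustration; only the two configuration keys come from the thread, and the sketch assumes an existing SparkSession is passed in.

```python
# Settings quoted in the thread, with fallback switched OFF (an assumption
# made for this diagnostic run; the original job had it set to True).
ARROW_CONF = {
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    "spark.sql.execution.arrow.pyspark.fallback.enabled": "false",
}


def to_submit_args(conf):
    """Render a conf dict as a list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


def apply_to_session(spark, conf=None):
    """Set the options on an existing SparkSession at runtime.

    `spark` is expected to be a pyspark.sql.SparkSession; only its
    `conf.set(key, value)` method is used here.
    """
    for key, value in (conf or ARROW_CONF).items():
        spark.conf.set(key, value)
    return spark


if __name__ == "__main__":
    # Print the flags so they can be pasted into a spark-submit command.
    print(" ".join(to_submit_args(ARROW_CONF)))
```

If the job then fails instead of completing, the error message should indicate what is preventing Arrow from being used.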
This job makes heavy use of pandas UDFs and runs on an EMR cluster of 30 xlarge nodes.
Any idea why
Please ignore this question.
https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow
suggests that pandas UDFs should avoid JVM<->Python SerDe by keeping a
single copy of the data in memory. spark.sql.execution.arrow.enabled is false by default.
I think I missed e