subject:"Re\: use java in Grouped Map pandas udf to avoid serDe"

Re: use java in Grouped Map pandas udf to avoid serDe

2020-10-06 Thread Evgeniy Ignatiev

Note: forwarding to list, incorrectly hit "Repliy" first, instead of "Reply List" Hello, Does your code run without enabling fallback mode? Arrow vectorization might not just get applied - if you still observe "javaToPython" stages on Spark UI. Also data is not skewed (partitions are too larg

Re: use java in Grouped Map pandas udf to avoid serDe

2020-10-06 Thread Lian Jiang

Hi, I used these settings but did not see obvious improvement (190 minutes reduced to 170 minutes): spark.sql.execution.arrow.pyspark.enabled: True spark.sql.execution.arrow.pyspark.fallback.enabled: True This job heavily uses pandas udfs and it runs on a 30 xlarge node emr. Any idea why

Re: use java in Grouped Map pandas udf to avoid serDe

2020-10-04 Thread Lian Jiang

Please ignore this question. https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow shows pandas udf should have avoided jvm<->Python SerDe by maintaining one data copy in memory. spark.sql.execution.arrow.enabled is false by default. I think I missed e