RE: Profiling PySpark Pandas UDF

2022-08-29 Thread Luca Canali
Hi Abdeali, thanks for the support. Indeed, you can go ahead and test and review my latest PR for SPARK-34265 (Instrument Python UDF execution using SQL Metrics) if you want to: https://github.com/apache/spark/pull/33559 Currently I have reduced the scope of the instrumentation to just 3 si

deciding Spark tasks & optimization resource

2022-08-29 Thread rajat kumar
Hello Members, I have a query about Spark stages: why does every stage have a different number of tasks/partitions in Spark, and how is that determined? Also, where can I see the improvements made in Spark 3+? Thanks in advance, Rajat
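To make the question concrete: for file-based reads, Spark's first-stage task count tracks the input size, while later shuffle stages default to a fixed partition count. Below is a hedged, back-of-the-envelope sketch (not Spark's exact split logic, which also weighs file open cost and default parallelism); the config names `spark.sql.files.maxPartitionBytes` (default 128 MB) and `spark.sql.shuffle.partitions` (default 200) are real Spark settings, the helper function is illustrative only.

```python
import math

# Rough model: input tasks ~= input bytes / spark.sql.files.maxPartitionBytes.
# Shuffle stages instead default to spark.sql.shuffle.partitions (200),
# which is why consecutive stages can show different task counts.
MAX_PARTITION_BYTES = 128 * 1024 * 1024  # 128 MB default

def approx_input_tasks(file_size_bytes: int) -> int:
    """Illustrative estimate of first-stage tasks for one input file."""
    return math.ceil(file_size_bytes / MAX_PARTITION_BYTES)

print(approx_input_tasks(1_000_000_000))  # a 1 GB file -> 8 tasks here
```

So a 1 GB scan yields roughly 8 input tasks, while the stage after a `groupBy` on the same data would, by default, run 200 tasks.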

Re: Profiling PySpark Pandas UDF

2022-08-29 Thread Gourav Sengupta
Hi, I do send back those metrics as columns in the pandas dataframes when required, but what we ultimately need is to be able to find out the time for Java object conversion along with the UDF calls, as well as actual Python memory usage and other details, all of which we can do by tweaking the udf. B
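The "tweak the UDF to emit its own metrics" idea above can be sketched in plain Python. This is a hedged illustration only: in a real pipeline the wrapped body would sit inside a `pyspark.sql.functions.pandas_udf` and `batch` would be a pandas Series, with the elapsed time sent back as an extra column as described; here everything, including the `double` stand-in, is hypothetical stdlib code.

```python
import time

def with_timing(udf_body):
    """Wrap a per-row UDF body so each batch invocation also reports
    its wall-clock time alongside the results."""
    def wrapper(batch):
        start = time.perf_counter()
        result = [udf_body(x) for x in batch]
        elapsed = time.perf_counter() - start
        return result, elapsed
    return wrapper

@with_timing
def double(x):  # stand-in for the real per-row UDF logic
    return 2 * x

values, seconds = double([1, 2, 3])
print(values)  # [2, 4, 6]
```

Note this only measures time spent inside the Python body; the Java-to-Python conversion cost the thread is asking about happens outside the UDF and needs engine-side instrumentation (the point of the SPARK-34265 PR above).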

Re: deciding Spark tasks & optimization resource

2022-08-29 Thread Gibson
Hello Rajat, look up the Spark *pipelining* concept: any sequence of operations that feed data directly into each other without the need for shuffling will be packed into a single stage, i.e. select -> filter -> select (Spark SQL); map -> filter -> map (RDD). For any operation that requires shuffling (sort
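The pipelining-vs-shuffle distinction above can be sketched in pure Python (this is an illustrative model, not Spark code; the partition data and bucket count are made up, and Spark's shuffle uses its own partitioner rather than Python's `hash`):

```python
# Narrow transformations are pipelined: each record flows through the whole
# chain without materializing intermediates -- one stage, one task per partition.
partition = [1, 2, 3, 4, 5]
stage1_out = [x * 10 for x in partition if x % 2 == 1]  # filter -> map, fused

# A wide transformation (sort, groupBy, join) forces a shuffle: records are
# redistributed by key across a fixed number of output partitions, which is
# why the next stage can have a different task count than this one.
NUM_SHUFFLE_PARTITIONS = 4  # Spark SQL's default would be 200
shuffle_buckets = {p: [] for p in range(NUM_SHUFFLE_PARTITIONS)}
for record in stage1_out:
    shuffle_buckets[hash(record) % NUM_SHUFFLE_PARTITIONS].append(record)

print(stage1_out)  # [10, 30, 50]
```

The shuffle boundary is exactly where one stage ends and the next begins, and the downstream task count equals the number of shuffle buckets, not the number of upstream partitions.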