Hi Subash,

Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf
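In short, you enable the profiler via the spark.python.profile configuration (it must be set before the SparkContext starts) and then call show_profiles() after the UDF has run. A minimal sketch based on the linked docs:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

# The profiler flag must be set before the SparkContext is created.
spark = (
    SparkSession.builder
    .config("spark.python.profile", "true")
    .getOrCreate()
)

@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# Run an action so the UDF executes and profile data is collected.
spark.range(10_000).select(plus_one("id").alias("v")).collect()

# Print the accumulated cProfile stats, with a section per UDF.
spark.sparkContext.show_profiles()
```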
Hope it can help you. Thanks.

On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney <russell.jur...@gmail.com> wrote:

> Subash, I’m here to help :)
>
> I started a test script to demonstrate a solution last night but got a
> cold and haven’t finished it. Give me another day and I’ll get it to you.
> My suggestion is that you run PySpark locally in pytest, with a fixture to
> generate and yield your SparkContext and SparkSession, and then write
> tests that load some test data, perform some count operation and
> checkpoint to ensure that the data is loaded, start a timer, run your UDF
> on the DataFrame, checkpoint again or write some output to disk to make
> sure it finishes, and then stop the timer and compute how long it takes.
> I’ll show you some code (one possible shape for that harness is sketched
> at the end of this thread); I have to do this for Graphlet AI’s RTL utils
> and other tools anyway, to figure out how much overhead there is using
> Pandera and Spark together to validate data:
> https://github.com/Graphlet-AI/graphlet
>
> I’ll respond by tomorrow evening with code in a gist! We’ll see if it
> gets consistent, measurable and valid results! :)
>
> Russell Jurney
>
> On Thu, Aug 25, 2022 at 10:00 AM Sean Owen <sro...@gmail.com> wrote:
>
>> It's important to realize that while pandas UDFs and pandas on Spark are
>> both related to pandas, they are not themselves directly related. The
>> first lets you use pandas within Spark; the second lets you use pandas
>> on Spark.
>>
>> Hard to say with this info, but you want to look at whether you are
>> doing something expensive in each UDF call, and consider amortizing it
>> with the scalar iterator UDF pattern (a sketch of that pattern follows
>> at the end of this thread). Maybe.
>>
>> A pandas UDF is not Spark code itself, so no, there is no tool in Spark
>> to profile it. Conversely, any approach to profiling pandas or Python
>> would work here.
>>
>> On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Maybe I am jumping to conclusions and making stupid guesses, but have
>>> you tried Koalas now that it is natively integrated with PySpark? (A
>>> short example follows at the end of this thread.)
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, <
>>> subashpraba...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I was wondering if we have any best practices on using pandas UDFs.
>>>> Profiling a UDF is not an easy task, and our case requires some
>>>> drilling down into the logic of the function.
>>>>
>>>> Our use case:
>>>> We are using func(DataFrame) => DataFrame as the interface to the
>>>> pandas UDF. When we run just the function locally it is fast, but
>>>> when it is executed in the Spark environment the processing time is
>>>> longer than expected. We have one column whose values are large
>>>> (BinaryType -> 600KB); could this be making the Arrow computation
>>>> slower?
>>>>
>>>> Is there any profiling tool, or a recommended way to debug the cost
>>>> incurred when using pandas UDFs?
>>>>
>>>> Thanks,
>>>> Subash
>>>>
> --
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com

--
Takuya UESHIN
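One possible shape for the pytest harness Russell describes above — an untested sketch, with the fixture and timing details assumed rather than taken from his actual code:

```python
import time

import pandas as pd
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf


@pytest.fixture(scope="session")
def spark():
    """Yield a local SparkSession for the whole test session, then stop it."""
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("pandas-udf-timing")
        .getOrCreate()
    )
    yield spark
    spark.stop()


def test_pandas_udf_timing(spark):
    # Load test data and force materialization with a count, so that
    # data loading is excluded from the timed section.
    df = spark.range(1_000_000).cache()
    assert df.count() == 1_000_000

    @pandas_udf("long")
    def plus_one(s: pd.Series) -> pd.Series:
        return s + 1

    start = time.perf_counter()
    # The "noop" sink forces full execution without the cost of real I/O.
    df.select(plus_one("id").alias("out")) \
        .write.format("noop").mode("overwrite").save()
    elapsed = time.perf_counter() - start

    print(f"pandas UDF over 1M rows took {elapsed:.3f}s")
```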
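The scalar iterator UDF pattern Sean mentions moves expensive setup out of the per-batch path: the function body runs once per Python worker and iterates over the incoming Arrow batches. A minimal sketch, where load_expensive_resource is a hypothetical placeholder for whatever setup the UDF would otherwise repeat:

```python
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf


def load_expensive_resource():
    """Hypothetical stand-in for costly setup (model load, connection, ...)."""
    return {"offset": 1}


@pandas_udf("long")
def plus_offset(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # With the iterator form, this setup runs once per Python worker
    # process instead of once per Arrow batch.
    resource = load_expensive_resource()
    for batch in batches:
        yield batch + resource["offset"]

# Usage is the same as a scalar pandas UDF:
# df.select(plus_offset("id"))
```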
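And for Gourav's suggestion: Koalas ships inside PySpark 3.2+ as the pandas API on Spark, so no separate install is needed:

```python
import pyspark.pandas as ps  # formerly Koalas; bundled with PySpark since 3.2

# pandas-style syntax, executed on Spark.
psdf = ps.range(10)
print(psdf["id"].mean())  # 4.5
```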