Wow, lots of good suggestions. I didn’t know about the profiler either.
Great suggestion, @Takuya.

Thanks,
Subash

On Thu, 25 Aug 2022 at 19:30, Russell Jurney <russell.jur...@gmail.com> wrote:

> YOU know what you're talking about and aren't hacking a solution. You
> are my new friend :) Thank you, this is incredibly helpful!
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
> On Thu, Aug 25, 2022 at 10:52 AM Takuya UESHIN <ues...@happy-camper.st> wrote:
>
>> Hi Subash,
>>
>> Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
>> - https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf
>>
>> Hope it can help you.
>>
>> Thanks.
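A minimal sketch of turning that profiler on, going by the linked docs
(assumes Spark 3.3+, a "spark" session as in the pyspark shell; add_one
is just a stand-in UDF):

    # Launch with the profiler enabled, e.g.:
    #   pyspark --conf spark.python.profile=true
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("long")
    def add_one(s: pd.Series) -> pd.Series:  # stand-in for the real UDF
        return s + 1

    df = spark.range(10)
    df.select(add_one("id")).collect()  # run the UDF so profiles are collected
    spark.sparkContext.show_profiles()  # print cProfile stats for the UDF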
>> On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney <russell.jur...@gmail.com> wrote:
>>
>>> Subash, I’m here to help :)
>>>
>>> I started a test script to demonstrate a solution last night but got a
>>> cold and haven’t finished it. Give me another day and I’ll get it to
>>> you. My suggestion is that you run PySpark locally in pytest, with a
>>> fixture to generate and yield your SparkContext and SparkSession, and
>>> then write tests that load some test data, perform some count operation
>>> and checkpoint to ensure that data is loaded, start a timer, run your
>>> UDF on the DataFrame, checkpoint again or write some output to disk to
>>> make sure it finishes, then stop the timer and compute how long it
>>> takes. I’ll show you some code; I have to do this for Graphlet AI’s RTL
>>> utils and other tools to figure out how much overhead there is in using
>>> Pandera and Spark together to validate data:
>>> https://github.com/Graphlet-AI/graphlet
>>>
>>> I’ll respond by tomorrow evening with code in a gist! We’ll see if it
>>> gets consistent, measurable and valid results! :)
>>>
>>> Russell Jurney
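A rough sketch of the harness Russell describes (my_udf, the row count,
and the time budget are illustrative placeholders, not his actual code;
local-mode timings only approximate cluster behaviour):

    import time
    import pandas as pd
    import pytest
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    @pytest.fixture(scope="session")
    def spark():
        # One local SparkSession shared by every test in the run
        spark = (SparkSession.builder
                 .master("local[4]")
                 .appName("pandas-udf-bench")
                 .getOrCreate())
        yield spark
        spark.stop()

    @pandas_udf("double")
    def my_udf(s: pd.Series) -> pd.Series:  # placeholder for the real UDF
        return s * 2.0

    def test_udf_runtime(spark, tmp_path):
        df = spark.range(1_000_000).selectExpr("cast(id as double) as x").cache()
        assert df.count() > 0  # materialize the input before timing

        start = time.perf_counter()
        result = df.select(my_udf("x").alias("y"))
        result.write.mode("overwrite").parquet(str(tmp_path / "bench"))  # force execution
        elapsed = time.perf_counter() - start

        print(f"pandas UDF pass took {elapsed:.2f}s")
        assert elapsed < 60  # loose budget; tighten against a measured baseline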
>>> On Thu, Aug 25, 2022 at 10:00 AM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> It's important to realize that while pandas UDFs and pandas on Spark
>>>> are both related to pandas, they are not themselves directly related.
>>>> The first lets you use pandas within Spark; the second lets you use
>>>> pandas on Spark.
>>>>
>>>> Hard to say with this info, but you want to look at whether you are
>>>> doing something expensive in each UDF call and consider amortizing it
>>>> with the scalar iterator UDF pattern. Maybe.
>>>>
>>>> A pandas UDF is not Spark code itself, so no, there is no tool in
>>>> Spark to profile it. Conversely, any approach to profiling pandas or
>>>> Python would work here.
>>>>
>>>> On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Maybe I am jumping to conclusions and making stupid guesses, but have
>>>>> you tried Koalas now that it is natively integrated with PySpark?
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham <subashpraba...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I was wondering if we have any best practices for using pandas UDFs.
>>>>>> Profiling a UDF is not an easy task, and our case requires some
>>>>>> drilling down into the logic of the function.
>>>>>>
>>>>>> Our use case:
>>>>>> We are using func(DataFrame) => DataFrame as the interface to our
>>>>>> pandas UDF. Running only the function locally, it is fast, but when
>>>>>> it is executed in the Spark environment the processing time is
>>>>>> higher than expected. We have one column whose values are large
>>>>>> (BinaryType, ~600 KB), and we are wondering whether this could make
>>>>>> the Arrow computation slower.
>>>>>>
>>>>>> Is there any profiling tool, or a good way to debug the cost
>>>>>> incurred when using pandas UDFs?
>>>>>>
>>>>>> Thanks,
>>>>>> Subash
>>>>>>
>>>
>>> --
>>> Thanks,
>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>> <http://facebook.com/jurney> datasyndrome.com
>>
>> --
>> Takuya UESHIN
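For readers landing on this thread later, a sketch of the scalar
iterator UDF pattern Sean mentions: per-call setup is paid once per
partition instead of once per Arrow batch. load_model/predict are
hypothetical stand-ins for whatever the expensive step is, and the
mapInPandas variant matches the func(DataFrame) => DataFrame interface
from the original question:

    from typing import Iterator
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def score(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        model = load_model()   # hypothetical expensive setup, runs once...
        for batch in batches:  # ...then is reused for every Arrow batch
            yield pd.Series(model.predict(batch))

    # The same idea with a func(DataFrame) => DataFrame interface:
    def score_frames(frames: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        model = load_model()   # hypothetical, as above
        for pdf in frames:
            pdf["y"] = model.predict(pdf["x"])
            yield pdf

    # result = df.mapInPandas(score_frames, schema="x double, y double")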
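And on Gourav's suggestion: Koalas now ships inside PySpark as the
pandas API on Spark (since 3.2), so a minimal sketch of trying it, with
an illustrative frame, looks like:

    import pyspark.pandas as ps

    psdf = ps.DataFrame({"x": [1.0, 2.0, 3.0]})  # or df.pandas_api() on a Spark DataFrame
    psdf["y"] = psdf["x"] * 2.0                  # pandas-style syntax, executed by Spark
    sdf = psdf.to_spark()                        # back to a plain Spark DataFrame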