YOU know what you're talking about and aren't hacking a solution. You are my new friend :) Thank you, this is incredibly helpful!
Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney> datasyndrome.com

On Thu, Aug 25, 2022 at 10:52 AM Takuya UESHIN <ues...@happy-camper.st> wrote:

> Hi Subash,
>
> Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
> - https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf
>
> Hope it can help you.
>
> Thanks.
>
> On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney <russell.jur...@gmail.com> wrote:
>
>> Subash, I’m here to help :)
>>
>> I started a test script to demonstrate a solution last night but got a cold and haven’t finished it. Give me another day and I’ll get it to you. My suggestion is that you run PySpark locally in pytest, with a fixture to generate and yield your SparkContext and SparkSession, and then write tests that load some test data, perform a count operation and checkpoint to ensure that data is loaded, start a timer, run your UDF on the DataFrame, checkpoint again or write some output to disk to make sure it finishes, and then stop the timer and compute how long it takes. I’ll show you some code; I have to do this for Graphlet AI’s RTL utils and other tools to figure out how much overhead there is in using Pandera and Spark together to validate data: https://github.com/Graphlet-AI/graphlet
>>
>> I’ll respond by tomorrow evening with code in a gist! We’ll see if it gets consistent, measurable, and valid results! :)
>>
>> Russell Jurney
>>
>> On Thu, Aug 25, 2022 at 10:00 AM Sean Owen <sro...@gmail.com> wrote:
>>
>>> It’s important to realize that while pandas UDFs and pandas on Spark are both related to pandas, they are not themselves directly related. The first lets you use pandas within Spark; the second lets you use pandas on Spark.
>>>
>>> Hard to say with this info, but you want to look at whether you are doing something expensive in each UDF call and consider amortizing it with the scalar iterator UDF pattern. Maybe.
>>>
>>> A pandas UDF is not Spark code itself, so no, there is no tool in Spark to profile it. Conversely, any approach to profiling pandas or Python would work here.
>>>
>>> On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Maybe I am jumping to conclusions and making stupid guesses, but have you tried Koalas now that it is natively integrated with PySpark?
>>>>
>>>> Regards
>>>> Gourav
>>>>
>>>> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham <subashpraba...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I was wondering if we have any best practices on using pandas UDFs? Profiling a UDF is not an easy task, and our case requires some drilling down into the logic of the function.
>>>>>
>>>>> Our use case: we are using func(DataFrame) => DataFrame as the interface to the pandas UDF. Running only the function locally, it is fast, but when executed in the Spark environment the processing time is more than expected. We have one column where the value is large (BinaryType -> 600KB); we are wondering whether this could make the Arrow computation slower.
>>>>>
>>>>> Is there any profiling tool, or a best way to debug the cost incurred using pandas UDFs?
>>>>>
>>>>> Thanks,
>>>>> Subash

> --
> Takuya UESHIN