Wow, lots of good suggestions. I didn’t know about the profiler either.
Great suggestion @Takuya.


Thanks,
Subash

On Thu, 25 Aug 2022 at 19:30, Russell Jurney <russell.jur...@gmail.com>
wrote:

> YOU know what you're talking about and aren't hacking a solution. You are
> my new friend :) Thank you, this is incredibly helpful!
>
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
>
> On Thu, Aug 25, 2022 at 10:52 AM Takuya UESHIN <ues...@happy-camper.st>
> wrote:
>
>> Hi Subash,
>>
>> Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
>> -
>> https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf
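>>
>> A minimal sketch of enabling it (assuming Spark 3.3+; plus_one is just a
>> stand-in UDF):
>>
>> import pandas as pd
>> from pyspark import SparkConf
>> from pyspark.sql import SparkSession
>> from pyspark.sql.functions import pandas_udf
>>
>> # The profiler is enabled via this config at SparkContext creation time
>> conf = SparkConf().set("spark.python.profile", "true")
>> spark = SparkSession.builder.config(conf=conf).getOrCreate()
>>
>> @pandas_udf("long")
>> def plus_one(s: pd.Series) -> pd.Series:
>>     return s + 1
>>
>> spark.range(10).select(plus_one("id")).collect()
>> spark.sparkContext.show_profiles()  # prints cProfile stats per UDF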
>>
>> Hope it can help you.
>>
>> Thanks.
>>
>> On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney <russell.jur...@gmail.com>
>> wrote:
>>
>>> Subash, I’m here to help :)
>>>
>>> I started a test script to demonstrate a solution last night but got a
>>> cold and haven’t finished it. Give me another day and I’ll get it to you.
>>> My suggestion is that you run PySpark locally in pytest, with a fixture
>>> that creates and yields your SparkContext and SparkSession, and then write
>>> tests that load some test data, perform a count operation and checkpoint to
>>> ensure the data is loaded, start a timer, run your UDF on the DataFrame,
>>> checkpoint again or write some output to disk to make sure it finishes, and
>>> then stop the timer and compute how long it took. I'll show you some code;
>>> I have to do this for Graphlet AI's RTL utils and other tools to figure out
>>> how much overhead there is using Pandera and Spark together to validate
>>> data: https://github.com/Graphlet-AI/graphlet
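>>>
>>> Here is a minimal sketch of that layout (times_two, the row count, and the
>>> time budget are placeholders; the noop sink just forces execution without
>>> writing anything):
>>>
>>> import time
>>>
>>> import pandas as pd
>>> import pytest
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import pandas_udf
>>>
>>>
>>> @pytest.fixture(scope="session")
>>> def spark():
>>>     # One local SparkSession shared across the test session
>>>     spark = (SparkSession.builder.master("local[*]")
>>>              .appName("udf-timing").getOrCreate())
>>>     yield spark
>>>     spark.stop()
>>>
>>>
>>> def test_udf_timing(spark):
>>>     @pandas_udf("double")
>>>     def times_two(s: pd.Series) -> pd.Series:
>>>         return s * 2.0
>>>
>>>     df = spark.range(1_000_000)
>>>     assert df.count() == 1_000_000  # make sure the data is really loaded
>>>     start = time.perf_counter()
>>>     df.select(times_two("id")).write.format("noop").mode("overwrite").save()
>>>     elapsed = time.perf_counter() - start
>>>     assert elapsed < 60  # replace with a budget that fits your workload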
>>>
>>> I’ll respond by tomorrow evening with code in a gist! We’ll see if it
>>> gets consistent, measurable, and valid results! :)
>>>
>>> Russell Jurney
>>>
>>> On Thu, Aug 25, 2022 at 10:00 AM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> It's important to realize that while pandas UDFs and pandas on Spark
>>>> are both related to pandas, they are not themselves directly related. The
>>>> first lets you run pandas code inside a Spark query; the second gives you
>>>> a pandas-like API that executes on Spark.
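>>>>
>>>> A quick sketch of the distinction (assuming Spark 3.2+, where the
>>>> pandas-on-Spark API ships as pyspark.pandas):
>>>>
>>>> # pandas on Spark: a pandas-like API, executed by Spark
>>>> import pyspark.pandas as ps
>>>> psdf = ps.range(10)  # looks like pandas, runs distributed
>>>>
>>>> # pandas UDF: your own pandas code running inside a Spark query
>>>> import pandas as pd
>>>> from pyspark.sql.functions import pandas_udf
>>>>
>>>> @pandas_udf("long")
>>>> def plus_one(s: pd.Series) -> pd.Series:
>>>>     return s + 1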
>>>>
>>>> Hard to say with this info, but you want to look at whether you are
>>>> doing something expensive in each UDF call, and consider amortizing it with
>>>> the scalar iterator UDF pattern. Maybe.
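>>>>
>>>> A sketch of that pattern (load_model here is a hypothetical stand-in for
>>>> any expensive per-call setup):
>>>>
>>>> from typing import Iterator
>>>>
>>>> import pandas as pd
>>>> from pyspark.sql.functions import pandas_udf
>>>>
>>>> @pandas_udf("double")
>>>> def score(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
>>>>     model = load_model()  # hypothetical: runs once per task, not per batch
>>>>     for batch in batches:
>>>>         yield pd.Series(model.predict(batch))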
>>>>
>>>> A pandas UDF is not Spark code itself, so no, there is no tool in Spark
>>>> to profile it. Conversely, any approach to profiling pandas or Python would
>>>> work here.
>>>>
>>>> On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <
>>>> gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Maybe I am jumping to conclusions and making stupid guesses, but have
>>>>> you tried Koalas now that it is natively integrated with PySpark?
>>>>>
>>>>> Regards
>>>>> Gourav
>>>>>
>>>>> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, <
>>>>> subashpraba...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I was wondering if we have any best practices on using pandas UDFs?
>>>>>> Profiling a UDF is not an easy task, and our case requires some drilling
>>>>>> down into the logic of the function.
>>>>>>
>>>>>>
>>>>>> Our use case:
>>>>>> We are using a func(DataFrame) => DataFrame interface for our Pandas
>>>>>> UDF. When we run only the function locally, it is fast, but when it is
>>>>>> executed in the Spark environment, the processing time is more than
>>>>>> expected. We have one column whose values are large (BinaryType ->
>>>>>> 600KB), and we are wondering whether this could make the Arrow
>>>>>> computation slower?
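>>>>>>
>>>>>> Roughly, the interface looks like this (a simplified sketch using
>>>>>> mapInPandas; payload stands in for our large binary column):
>>>>>>
>>>>>> from typing import Iterator
>>>>>>
>>>>>> import pandas as pd
>>>>>> from pyspark.sql import SparkSession
>>>>>>
>>>>>> spark = SparkSession.builder.getOrCreate()
>>>>>> df = spark.range(3).selectExpr(
>>>>>>     "id", "cast(cast(id as string) as binary) as payload")
>>>>>>
>>>>>> # func(pd.DataFrame) => pd.DataFrame, applied per Arrow batch
>>>>>> def transform(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
>>>>>>     for pdf in batches:
>>>>>>         pdf["payload_len"] = pdf["payload"].map(len)
>>>>>>         yield pdf
>>>>>>
>>>>>> out = df.mapInPandas(
>>>>>>     transform, schema="id long, payload binary, payload_len long")
>>>>>>
>>>>>> Would lowering spark.sql.execution.arrow.maxRecordsPerBatch (default
>>>>>> 10000), so each Arrow batch stays a manageable size with values this
>>>>>> large, be a reasonable thing to test?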
>>>>>>
>>>>>> Is there any profiler, or a good way to debug the cost incurred when
>>>>>> using pandas UDFs?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Subash
>>>>>>
>>>>>> --
>>>
>>> Thanks,
>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>> <http://facebook.com/jurney> datasyndrome.com
>>>
>>
>>
>> --
>> Takuya UESHIN
>>
>>
