Hi Subash,

Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
-
https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf
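
A minimal sketch of enabling it (assuming Spark 3.3+; note that
spark.python.profile must be set before the SparkContext starts):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

# The profiler is controlled by spark.python.profile; set it
# before the SparkContext is created.
spark = (
    SparkSession.builder
    .config("spark.python.profile", "true")
    .getOrCreate()
)

@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(10).select(plus_one("id")).collect()

# Prints the cProfile results gathered from the executors, per UDF.
spark.sparkContext.show_profiles()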

Hope it can help you.

Thanks.

On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney <russell.jur...@gmail.com>
wrote:

> Subash, I’m here to help :)
>
> I started a test script to demonstrate a solution last night but got a
> cold and haven’t finished it. Give me another day and I’ll get it to you.
> My suggestion is that you run PySpark locally in pytest, with a fixture
> that generates and yields your SparkContext and SparkSession. Then write
> tests that load some test data and perform a count or checkpoint to ensure
> the data is loaded; start a timer; run your UDF on the DataFrame;
> checkpoint again or write some output to disk to make sure it finishes;
> then stop the timer and compute how long it took. I’ll show you some code,
> since I have to do this for Graphlet AI’s RTL utils and other tools anyway,
> to figure out how much overhead there is in using Pandera and Spark
> together to validate data: https://github.com/Graphlet-AI/graphlet
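>
> Roughly the shape I have in mind (just a sketch; my_pandas_udf and the
> file paths are placeholders):
>
> import time
>
> import pytest
> from pyspark.sql import SparkSession
>
> @pytest.fixture(scope="session")
> def spark():
>     # One local SparkSession shared across the whole test session.
>     session = (
>         SparkSession.builder.master("local[*]")
>         .appName("pandas-udf-benchmark")
>         .getOrCreate()
>     )
>     yield session
>     session.stop()
>
> def test_pandas_udf_timing(spark):
>     df = spark.read.parquet("tests/fixtures/sample.parquet")
>     df.count()  # force the load before the timer starts
>
>     start = time.time()
>     out = df.select(my_pandas_udf("some_column"))
>     out.write.mode("overwrite").parquet("/tmp/udf_out")  # force execution
>     elapsed = time.time() - start
>     print(f"pandas UDF took {elapsed:.2f}s")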
>
> I’ll respond by tomorrow evening with code in a gist! We’ll see if it gets
> consistent, measurable and valid results! :)
>
> Russell Jurney
>
> On Thu, Aug 25, 2022 at 10:00 AM Sean Owen <sro...@gmail.com> wrote:
>
>> It's important to realize that while pandas UDFs and pandas on Spark are
>> both related to pandas, they are not themselves directly related. The first
>> lets you run pandas code inside a Spark UDF; the second gives you a
>> pandas-like API on top of Spark DataFrames.
>>
>> Hard to say with this info, but you want to look at whether you are doing
>> something expensive in each UDF call and consider amortizing it with the
>> scalar iterator UDF pattern. Maybe.
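>>
>> For example, a sketch of that iterator pattern (load_model is a
>> hypothetical expensive setup step, paid once per executor process instead
>> of once per batch):
>>
>> from typing import Iterator
>>
>> import pandas as pd
>> from pyspark.sql.functions import pandas_udf
>>
>> @pandas_udf("long")
>> def predict(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
>>     model = load_model()  # hypothetical expensive setup, amortized here
>>     for batch in batches:
>>         yield pd.Series(model.predict(batch))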
>>
>> A pandas UDF is not Spark code itself, so no, there is no tool in Spark to
>> profile it. Conversely, any approach to profiling pandas or Python code
>> would work here.
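>>
>> E.g., a sketch of wrapping the UDF body in cProfile (do_the_real_work is
>> a placeholder for your actual logic; the output lands in the executor
>> logs):
>>
>> import cProfile
>> import io
>> import pstats
>>
>> import pandas as pd
>>
>> def func(pdf: pd.DataFrame) -> pd.DataFrame:
>>     profiler = cProfile.Profile()
>>     profiler.enable()
>>     out = do_the_real_work(pdf)  # placeholder for the real UDF logic
>>     profiler.disable()
>>     buf = io.StringIO()
>>     pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
>>     print(buf.getvalue())  # shows up in the executor stderr/logs
>>     return out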
>>
>> On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Maybe I am jumping to conclusions and making stupid guesses, but have you
>>> tried Koalas, now that it is natively integrated with PySpark?
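>>>
>>> Something like this (the pandas API on Spark that ships with PySpark;
>>> the path and column names are placeholders):
>>>
>>> import pyspark.pandas as ps
>>>
>>> psdf = ps.read_parquet("/path/to/data")  # pandas-like API, runs on Spark
>>> psdf["doubled"] = psdf["value"] * 2
>>> print(psdf.head())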
>>>
>>> Regards
>>> Gourav
>>>
>>> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, <
>>> subashpraba...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I was wondering if we have any best practices on using pandas UDFs.
>>>> Profiling a UDF is not an easy task, and our case requires some drilling
>>>> down into the logic of the function.
>>>>
>>>>
>>>> Our use case:
>>>> We are using func(DataFrame) => DataFrame as the interface for our pandas
>>>> UDF. When we run just the function locally it is fast, but when it is
>>>> executed in the Spark environment the processing time is more than
>>>> expected. We have one column whose values are large (BinaryType, ~600KB
>>>> each), and we are wondering whether this could make the Arrow computation
>>>> slower?
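>>>>
>>>> For context, the shape of what we do is roughly this (column names,
>>>> decode_blob, and output_schema are illustrative):
>>>>
>>>> import pandas as pd
>>>>
>>>> def process(pdf: pd.DataFrame) -> pd.DataFrame:
>>>>     # works on the large BinaryType column (~600KB per value)
>>>>     pdf["decoded"] = pdf["payload"].map(decode_blob)
>>>>     return pdf
>>>>
>>>> result = df.groupBy("key").applyInPandas(process, schema=output_schema)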
>>>>
>>>> Is there any profiling tool, or a good way to debug the cost incurred by
>>>> a pandas UDF?
>>>>
>>>>
>>>> Thanks,
>>>> Subash
>>>>
>>>> --
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>


-- 
Takuya UESHIN
