YOU know what you're talking about and aren't hacking a solution. You are
my new friend :) Thank you, this is incredibly helpful!

Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com


On Thu, Aug 25, 2022 at 10:52 AM Takuya UESHIN <ues...@happy-camper.st>
wrote:

> Hi Subash,
>
> Have you tried the Python/Pandas UDF Profiler introduced in Spark 3.3?
> -
> https://spark.apache.org/docs/latest/api/python/development/debugging.html#python-pandas-udf
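>
> A minimal sketch of how that profiler can be enabled (assuming a local
> SparkSession and a toy UDF, purely for illustration):
>
>   from pyspark.sql import SparkSession
>   from pyspark.sql.functions import pandas_udf
>   import pandas as pd
>
>   # Enable the Python/Pandas UDF profiler (the config must be set before
>   # the SparkContext is created).
>   spark = (SparkSession.builder
>            .master("local[*]")
>            .config("spark.python.profile", "true")
>            .getOrCreate())
>
>   @pandas_udf("long")
>   def plus_one(s: pd.Series) -> pd.Series:
>       return s + 1
>
>   spark.range(100).select(plus_one("id")).collect()
>
>   # Print the accumulated per-UDF profiles (or dump them to files with
>   # spark.sparkContext.dump_profiles(path)).
>   spark.sparkContext.show_profiles()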
>
> Hope it can help you.
>
> Thanks.
>
> On Thu, Aug 25, 2022 at 10:18 AM Russell Jurney <russell.jur...@gmail.com>
> wrote:
>
>> Subash, I’m here to help :)
>>
>> I started a test script to demonstrate a solution last night but got a
>> cold and haven’t finished it. Give me another day and I’ll get it to you.
>> My suggestion is that you run PySpark locally in pytest with a fixture to
>> generate and yield your SparkContext and SparkSession, and then write tests
>> that load some test data, perform some count operation and checkpoint to
>> ensure the data is loaded, start a timer, run your UDF on the DataFrame,
>> checkpoint again or write some output to disk to make sure it finishes, and
>> then stop the timer and compute how long it takes. I’ll show you some code;
>> I have to do this for Graphlet AI’s RTL utils and other tools to figure out
>> how much overhead there is using Pandera and Spark together to validate
>> data: https://github.com/Graphlet-AI/graphlet
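>>
>> Roughly the shape I have in mind (an untested sketch; the UDF, data size and
>> output path are just placeholders):
>>
>>   import time
>>   import pandas as pd
>>   import pytest
>>   from pyspark.sql import SparkSession
>>   from pyspark.sql.functions import pandas_udf
>>
>>   @pytest.fixture(scope="session")
>>   def spark():
>>       # One local SparkSession shared by every test in the session
>>       spark = SparkSession.builder.master("local[*]").appName("udf-bench").getOrCreate()
>>       yield spark
>>       spark.stop()
>>
>>   @pandas_udf("long")
>>   def my_udf(s: pd.Series) -> pd.Series:
>>       return s * 2  # stand-in for the real UDF
>>
>>   def test_udf_timing(spark):
>>       df = spark.range(1_000_000)
>>       assert df.count() > 0  # make sure the data is loaded before timing
>>
>>       start = time.perf_counter()
>>       out = df.select(my_udf("id").alias("y"))
>>       out.write.mode("overwrite").parquet("/tmp/udf_bench")  # force execution
>>       elapsed = time.perf_counter() - start
>>       print(f"pandas UDF run took {elapsed:.2f}s")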
>>
>> I’ll respond by tomorrow evening with code in a gist! We’ll see if it
>> gets consistent, measurable and valid results! :)
>>
>> Russell Jurney
>>
>> On Thu, Aug 25, 2022 at 10:00 AM Sean Owen <sro...@gmail.com> wrote:
>>
>>> It's important to realize that while pandas UDFs and pandas on Spark are
>>> both related to pandas, they are not themselves directly related. The first
>>> lets you use pandas within Spark, the second lets you use pandas on Spark.
>>>
>>> Hard to say with this info but you want to look at whether you are doing
>>> something expensive in each UDF call and consider amortizing it with the
>>> scalar iterator UDF pattern. Maybe.
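>>>
>>> A minimal sketch of that pattern, with a made-up expensive_init() standing
>>> in for whatever per-call setup might be the culprit:
>>>
>>>   from typing import Iterator
>>>   import pandas as pd
>>>   from pyspark.sql.functions import pandas_udf
>>>
>>>   @pandas_udf("double")
>>>   def scored(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
>>>       model = expensive_init()  # hypothetical: load a model/resource once per task
>>>       for batch in batches:     # ...then reuse it for every Arrow batch
>>>           yield pd.Series(model.predict(batch))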
>>>
>>> A pandas UDF is not Spark code itself, so there is no tool in Spark to
>>> profile it. Conversely, any approach to profiling pandas or Python would
>>> work here.
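>>>
>>> For instance, plain cProfile works inside the UDF body; a rough,
>>> illustrative way to do it (the profile output lands in the executor's
>>> stderr log):
>>>
>>>   import cProfile, io, pstats
>>>   import pandas as pd
>>>   from pyspark.sql.functions import pandas_udf
>>>
>>>   @pandas_udf("long")
>>>   def profiled_udf(s: pd.Series) -> pd.Series:
>>>       prof = cProfile.Profile()
>>>       prof.enable()
>>>       result = s * 2  # the real work would go here
>>>       prof.disable()
>>>       buf = io.StringIO()
>>>       pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(10)
>>>       print(buf.getvalue())  # visible in the executor's stderr log
>>>       return result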
>>>
>>> On Thu, Aug 25, 2022, 11:22 AM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Maybe I am jumping to conclusions and making stupid guesses, but have
>>>> you tried koalas, now that it is natively integrated with PySpark?
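>>>>
>>>> It is one import away in recent releases (a toy example, assuming Spark
>>>> 3.2+ where it ships as pyspark.pandas):
>>>>
>>>>   import pyspark.pandas as ps  # formerly koalas, bundled since Spark 3.2
>>>>
>>>>   psdf = ps.range(10)
>>>>   psdf["double"] = psdf["id"] * 2
>>>>   print(psdf.head())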
>>>>
>>>> Regards
>>>> Gourav
>>>>
>>>> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, <
>>>> subashpraba...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I was wondering if we have any best practices on using pandas UDFs.
>>>>> Profiling a UDF is not an easy task, and our case requires some drilling
>>>>> down into the logic of the function.
>>>>>
>>>>>
>>>>> Our use case:
>>>>> We are using func(DataFrame) => DataFrame as the interface to our pandas
>>>>> UDF. When running only the function locally, it runs fast, but when it is
>>>>> executed in the Spark environment the processing time is more than expected.
>>>>> We have one column whose value is large (BinaryType -> 600KB), and we are
>>>>> wondering whether this could make the Arrow computation slower.
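>>>>>
>>>>> For concreteness, the shape of the interface I mean (simplified, not our
>>>>> actual code):
>>>>>
>>>>>   from typing import Iterator
>>>>>   import pandas as pd
>>>>>
>>>>>   def func(pdfs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
>>>>>       for pdf in pdfs:
>>>>>           yield pdf  # the real per-batch transformation goes here
>>>>>
>>>>>   result = df.mapInPandas(func, schema=df.schema)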
>>>>>
>>>>> Is there any profiling tool or best way to debug the cost incurred when
>>>>> using pandas UDFs?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Subash
>>>>>
>>>>> --
>>
>> Thanks,
>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>> <http://facebook.com/jurney> datasyndrome.com
>>
>
>
> --
> Takuya UESHIN
>
>
