Re: Profiling PySpark Pandas UDF

Andrew Melo Thu, 25 Aug 2022 09:50:08 -0700

Hi Gourav,

Since Koalas needs the same round-trip between the JVM and Python, I
expect UDF performance to be nearly the same in either API.
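
On the profiling question itself, one option is to measure the function
on its own first, so that whatever time is left over in the Spark job can
be attributed to the JVM/Python round-trip. The following is only a
minimal sketch using the Python standard library. The names profiled and
transform are hypothetical (they are not part of Spark or of the original
code), and transform is a stand-in for the real func(DataFrame) =>
DataFrame logic.

```python
import cProfile
import functools
import io
import pstats
import time


def profiled(func, top=10):
    """Hypothetical helper: wrap func so every call records its wall time
    and a cProfile report of the most expensive calls inside it."""
    stats = {"calls": 0, "seconds": 0.0, "report": ""}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        start = time.perf_counter()
        try:
            # runcall profiles exactly this one invocation of func.
            return profiler.runcall(func, *args, **kwargs)
        finally:
            stats["seconds"] += time.perf_counter() - start
            stats["calls"] += 1
            buf = io.StringIO()
            report = pstats.Stats(profiler, stream=buf)
            report.sort_stats("cumulative").print_stats(top)
            # Keep only the report of the most recent call.
            stats["report"] = buf.getvalue()

    wrapper.stats = stats
    return wrapper


def transform(rows):
    # Stand-in for the real UDF body; here it only measures payload sizes.
    return [len(r) for r in rows]


timed = profiled(transform)
# Five payloads of 600,000 bytes, mimicking the large binary column.
result = timed([b"x" * 600_000] * 5)
print(timed.stats["calls"], "call(s),", round(timed.stats["seconds"], 4), "s")
print(timed.stats["report"])
```

Run locally on a sample, this gives the pure-Python cost of the function.
Inside Spark the stats dictionary lives in the Python worker process, not
on the driver, so the report would have to be logged from the worker to be
useful there. As a rough sanity check on the large column: with the default
spark.sql.execution.arrow.maxRecordsPerBatch of 10000, rows of 600 KB each
would add up to roughly 6 GB per Arrow batch, so lowering that setting may
be worth trying if the function is applied batch by batch.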


Cheers
Andrew

On Thu, Aug 25, 2022 at 11:22 AM Gourav Sengupta
<gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> Maybe I am jumping to conclusions and making stupid guesses, but have you
> tried Koalas now that it is natively integrated with PySpark?
>
> Regards
> Gourav
>
> On Thu, 25 Aug 2022, 11:07 Subash Prabanantham, <subashpraba...@gmail.com> 
> wrote:
>>
>> Hi All,
>>
>> I was wondering if we have any best practices for using pandas UDFs?
>> Profiling a UDF is not an easy task, and our case requires some drilling
>> down into the logic of the function.
>>
>>
>> Our use case:
>> We are using func(DataFrame) => DataFrame as the interface for the pandas
>> UDF. Run locally on its own, the function is fast, but when executed in a
>> Spark environment the processing time is longer than expected. We have one
>> column whose values are large (BinaryType, 600 KB), and we wonder whether
>> this could make the Arrow computation slower.
>>
>> Is there any profiling tool or recommended way to debug the cost incurred
>> by a pandas UDF?
>>
>>
>> Thanks,
>> Subash
>>
