Kube estimate for Spark

2021-06-03 Thread Subash Prabanantham
Hi Team, I am trying to understand how to estimate Kubernetes CPU with respect to Spark executor cores. For example, the job configuration (given at start) was: cores/executor = 4, # of executors = 240. But the resources allocated when we ran the job were: cores/executor = 4, # of executors = 47. So the q
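
The numbers quoted in this snippet can be compared as total cores, that is, cores per executor times executor count. A minimal sketch of that arithmetic only; `total_cores` is a hypothetical helper, and nothing here models how Kubernetes actually schedules executor pods:

```python
def total_cores(cores_per_executor: int, num_executors: int) -> int:
    """Total executor cores (hypothetical helper, for illustration only)."""
    return cores_per_executor * num_executors


# Figures taken from the message: 4 cores/executor, 240 requested vs 47 allocated.
requested = total_cores(4, 240)
allocated = total_cores(4, 47)

# Share of the requested cores that the job actually received.
fraction = allocated / requested

print(f"requested={requested} allocated={allocated} fraction={fraction:.3f}")
```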

Profiling PySpark Pandas UDF

2022-08-25 Thread Subash Prabanantham
Hi All, I was wondering if we have any best practices for using pandas UDFs? Profiling a UDF is not an easy task, and our case requires some drilling down into the logic of the function. Our use case: We are using func(DataFrame) => DataFrame as the interface to use a pandas UDF, while running locally only

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Subash Prabanantham
ing something expensive in each UDF call and consider amortizing it with the scalar iterator UDF pattern. Maybe. A pandas UDF is not spark code itself so no there is no tool in spark to profile it. Conversely any approach to p
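
The amortization idea mentioned in this reply can be illustrated without Spark: pay the expensive setup once per iterator of batches instead of once per batch. This is a plain-Python sketch under stated assumptions; `expensive_setup`, `per_batch_udf`, and `iterator_udf` are hypothetical names, and ordinary lists stand in for the pandas Series batches a real UDF would receive.

```python
from typing import Callable, Iterator, List

setup_calls = 0  # counts how often the (hypothetical) expensive setup runs


def expensive_setup() -> Callable[[float], float]:
    """Stand-in for costly initialization, e.g. loading a model."""
    global setup_calls
    setup_calls += 1
    return lambda x: x * 2.0


def per_batch_udf(batch: List[float]) -> List[float]:
    """Naive shape: setup is repeated on every call."""
    f = expensive_setup()
    return [f(x) for x in batch]


def iterator_udf(batches: Iterator[List[float]]) -> Iterator[List[float]]:
    """Iterator shape: setup runs once, then every batch reuses it."""
    f = expensive_setup()
    for batch in batches:
        yield [f(x) for x in batch]


batches = [[1.0, 2.0], [3.0], [4.0, 5.0]]

naive = [per_batch_udf(b) for b in batches]
naive_setups = setup_calls

setup_calls = 0
amortized = list(iterator_udf(iter(batches)))
amortized_setups = setup_calls

print(naive_setups, amortized_setups)
```

In PySpark 3.x the corresponding pandas UDF is declared with the type hints `Iterator[pd.Series] -> Iterator[pd.Series]`; the sketch above only mirrors that shape and is not the code discussed in the thread.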

[Spark Structured Streaming] Two sink from Single stream

2023-11-15 Thread Subash Prabanantham
Hi Team, I am working on a basic streaming aggregation where I have one file stream source and two write sinks (Hudi tables). The only difference between them is the aggregation performed, hence I am using the same Spark session to perform both operations. (File Source) --> Agg1 -> DF1 -->