Re: pyspark loop optimization

2022-01-12 Thread Ramesh Natarajan
Sorry for the confusion between cume_dist and percent_rank. I was playing around with both to see whether the difference in computation made any difference, and I must have copied the percent_rank version accidentally. My requirement is to compute cume_dist. I have a dataframe with a bunch of columns (10+ columns) th
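
A minimal sketch of what computing cume_dist over a window looks like in PySpark, assuming a hypothetical dataframe with columns "group_col" and "value_col" (the actual column names and data are not given in the thread):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data; the real dataframe reportedly has 10+ columns.
    df = spark.createDataFrame(
        [("a", 10.0), ("a", 20.0), ("a", None), ("b", 5.0)],
        ["group_col", "value_col"],
    )

    w = Window.partitionBy("group_col").orderBy("value_col")

    # cume_dist: fraction of rows in the partition whose value is <= the current row's value.
    result = df.withColumn("value_cd", F.cume_dist().over(w))
    result.show()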

Re: pyspark loop optimization

2022-01-11 Thread Gourav Sengupta
Hi, I am not sure what you are trying to achieve here; are cume_dist and percent_rank not different? If I am able to follow your question correctly, you are looking to filter out the NULLs before applying the function on the dataframe, and I think it will be fine if you just create another dataframe
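
A minimal sketch of that suggestion, assuming the same hypothetical dataframe and column names as above: split the NULLs into a separate dataframe, apply cume_dist only to the non-NULL rows, and union the NULL rows back if they still need to be carried along.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Separate dataframe without NULLs, as suggested above.
    non_null_df = df.filter(F.col("value_col").isNotNull())
    null_df = df.filter(F.col("value_col").isNull())

    w = Window.partitionBy("group_col").orderBy("value_col")
    scored = non_null_df.withColumn("value_cd", F.cume_dist().over(w))

    # Optionally re-attach the NULL rows with a NULL score so no rows are lost.
    result = scored.unionByName(
        null_df.withColumn("value_cd", F.lit(None).cast("double"))
    )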