Sorry for the confusion between cume_dist and percent_rank. I was playing
around with both to see whether the difference in computation mattered, and I
must have copied the percent_rank version accidentally. My requirement is to
compute cume_dist.
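
For reference, here is a minimal plain-Python sketch of how the two functions differ once NULLs are dropped. It assumes the usual SQL window-function definitions (cume_dist is the fraction of rows less than or equal to the current value, percent_rank is (rank - 1) / (n - 1)). The helper names and sample values are made up for illustration, and this is not the Spark implementation.

```python
def cume_dist(values):
    """Map each non-null value to the fraction of non-null values <= it."""
    kept = [v for v in values if v is not None]  # drop NULLs first
    n = len(kept)
    return {v: sum(1 for w in kept if w <= v) / n for v in kept}


def percent_rank(values):
    """Map each non-null value to (rank - 1) / (n - 1), ties sharing a rank."""
    kept = [v for v in values if v is not None]  # drop NULLs first
    n = len(kept)
    if n < 2:
        # With fewer than two rows there is nothing to rank against.
        return {v: 0.0 for v in kept}
    # rank - 1 equals the number of values strictly smaller than v.
    return {v: sum(1 for w in kept if w < v) / (n - 1) for v in kept}


# Hypothetical column containing one NULL and one tie.
sample = [10, None, 20, 20, 40]
cd = cume_dist(sample)
pr = percent_rank(sample)
print(cd)
print(pr)
```

The results agree only at the top value. For the tied value 20, cume_dist counts the ties themselves (3 of 4 rows) while percent_rank counts only the rows strictly below (1 of 3), so the two are not interchangeable.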
I have a dataframe with a bunch of columns (10+ columns) th
Hi,
I am not sure what you are trying to achieve here. Are cume_dist and
percent_rank not different?
If I am following your question correctly, you are looking to filter out
NULLs before applying the function on the dataframe, and I think it
will be fine if you just create another dataframe
I want to compute cume_dist on a bunch of columns in a spark dataframe, but
want to remove NULL values before doing so.
I have this loop in pyspark. It works, but I see the driver running at
100% while the executors are idle for the most part. I am reading that
running a loop is an anti-pattern a