I have a DataFrame with around 6 billion rows and about 20 columns. First, I want to write this DataFrame out to Parquet. Then, out of the 20 columns, I have 3 columns of interest, and I want to find how many distinct values of those columns there are in the file. I don't need the actual distinct values, just the count. I know there are around 10-16 million distinct values.
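For context, here is a simplified sketch of the pipeline. The paths are placeholders and the column names "a", "b", "c" stand in for my real columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

object DistinctCountJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DistinctCountJob")
      .getOrCreate()

    // Load the ~6 billion row DataFrame (source path is a placeholder).
    val df = spark.read.parquet("/data/input")

    // Write the full DataFrame out to Parquet.
    df.write.parquet("/data/output")

    // Count distinct values over the three columns of interest;
    // I only need the count, not the values themselves.
    val distinct = df.select(countDistinct("a", "b", "c")).first().getLong(0)
    println(s"Distinct count: $distinct")

    spark.stop()
  }
}
```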
Before I write the DataFrame to Parquet, I do df.cache. After writing the file out, I do df.countDistinct("a", "b", "c").collect(). When I run this, I see that the memory usage on my driver steadily increases until it starts getting future timeouts; I guess it's spending most of its time in GC. Does countDistinct cause this behavior? Does Spark try to bring all 10 million distinct values into the driver? Is countDistinct not recommended for DataFrames with a large number of distinct values? What's the solution? Should I use approx_count_distinct?
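For reference, this is roughly what I think the approximate version would look like. It's untested at this scale, and my understanding is that approx_count_distinct takes a single expression rather than a column list, so I wrap the three columns in a struct:

```scala
import org.apache.spark.sql.functions.{approx_count_distinct, col, struct}

// Approximate variant I am considering: HyperLogLog++ based, here with a
// 1% relative standard deviation. The struct combines the three columns
// into one value so distinct (a, b, c) tuples are counted together.
val approx = df
  .select(approx_count_distinct(struct(col("a"), col("b"), col("c")), 0.01))
  .first()
  .getLong(0)

println(s"Approximate distinct count: $approx")
```

Is this the recommended way to handle a distinct count of this size, or is there a better option?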