I have a DataFrame with around 6 billion rows and about 20 columns. First of
all, I want to write this DataFrame out to parquet. Then, out of the 20 columns,
I have 3 columns of interest, and I want to find how many distinct values those
columns have in the file. I don't need the actual distinct values, just the
count. I know that there are around 10-16 million distinct values.

Before I write the DataFrame to parquet, I do df.cache(). After writing the file
out, I do df.agg(countDistinct("a", "b", "c")).collect().
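
Roughly, this is what my code looks like (a simplified Scala sketch; the output
path and the column names a/b/c are placeholders):

    import org.apache.spark.sql.functions.countDistinct

    df.cache()                               // cache before running the two actions
    df.write.parquet("/path/to/output")      // first action: write out to parquet
    // second action: exact distinct count over (a, b, c) tuples
    val result = df.agg(countDistinct("a", "b", "c")).collect()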

When I run this, I see that the memory usage on my driver steadily increases
until it starts getting future timeouts. I guess it's spending time in GC.
Does countDistinct cause this behavior? Does Spark try to pull all 10 million
distinct values into the driver? Is countDistinct not recommended for data
frames with a large number of distinct values?

What's the solution? Should I use approx_count_distinct?
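
Something like this is what I have in mind (a sketch; the 1% relative error and
the per-column counts, rather than the count of distinct (a, b, c) tuples, are
my own assumptions about what would be acceptable):

    import org.apache.spark.sql.functions.approx_count_distinct

    // HyperLogLog++-based approximate counts, rsd = 0.01 (1% relative error)
    val approx = df.agg(
      approx_count_distinct("a", 0.01),
      approx_count_distinct("b", 0.01),
      approx_count_distinct("c", 0.01)
    ).collect()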
