Hi, I am running KMeans (MLlib) on a cluster with 12 workers. Each worker has 128 GB of RAM and 24 cores. I run 48 tasks per machine. The total data size is only 40 GB.
When the dimension of the data set is about 10^7, each task takes about 30 s, of which about 20 s is spent in GC. When I reduce the dimension to 10^4, the GC time becomes small. Why is the GC cost so high when the dimension is larger? Or is this caused by MLlib?
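
To put rough numbers on the two cases, here is a back-of-envelope sketch. It assumes each task holds one dense array of doubles per cluster center, which is only a guess at the cause and not something I have confirmed. `K = 10` is a hypothetical value for illustration; the 48 tasks per machine and the two dimensions are the ones from my setup.

```python
BYTES_PER_DOUBLE = 8  # a JVM double is 8 bytes


def dense_centers_bytes(k, dims):
    """Bytes needed for k dense vectors of doubles with `dims` entries each."""
    return k * dims * BYTES_PER_DOUBLE


def per_machine_bytes(k, dims, concurrent_tasks):
    """Total bytes if every concurrent task holds its own copy of the k dense vectors."""
    return dense_centers_bytes(k, dims) * concurrent_tasks


K = 10                  # hypothetical number of clusters (assumption)
TASKS_PER_MACHINE = 48  # from the setup described above

for dims in (10**4, 10**7):
    per_task = dense_centers_bytes(K, dims)
    per_machine = per_machine_bytes(K, dims, TASKS_PER_MACHINE)
    print(f"dims={dims:>10,}: {per_task / 2**20:10.2f} MiB per task, "
          f"{per_machine / 2**30:8.2f} GiB per machine")
```

Under these assumptions, the 10^7 case needs about 800 MB of dense arrays per task, versus under 1 MB for the 10^4 case. If something like this is allocated for every task, it might explain the difference in GC time.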