Hi,
I am running KMeans (MLlib) on a cluster with 12 workers. Each worker has
128 GB of RAM and 24 cores, and I run 48 tasks per machine. The total data
size is only 40 GB.
When the dimension of the data set is about 10^7, each task takes about
30 s, but the GC time is about 20 s.
When I reduce the dimension to 10^4, the GC time becomes small.
So why is the GC time so high when the dimension is larger? Or is this
caused by MLlib?
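
To put rough numbers on the question, here is a minimal back-of-envelope sketch in Python. It assumes, without confirmation from the MLlib source, that each running task holds k dense vectors of doubles whose length equals the data dimension. The value of k is not stated above, so `ASSUMED_K = 10` is purely hypothetical, and the helper names are made up for illustration. Only the 48 tasks per machine and the two dimensions come from the setup described above.

```python
# Back-of-envelope estimate; every value marked "assumed" is hypothetical.
BYTES_PER_DOUBLE = 8  # a JVM double occupies 8 bytes


def dense_vector_bytes(dim):
    """Payload size of one dense vector of doubles (object headers ignored)."""
    return dim * BYTES_PER_DOUBLE


def per_machine_buffer_bytes(dim, k, tasks_per_machine):
    """Memory needed if every concurrent task holds k dense vectors of length dim."""
    return dense_vector_bytes(dim) * k * tasks_per_machine


TASKS_PER_MACHINE = 48  # from the setup described above
ASSUMED_K = 10          # assumed: the number of clusters is not stated

for dim in (10**4, 10**7):
    total = per_machine_buffer_bytes(dim, ASSUMED_K, TASKS_PER_MACHINE)
    print(f"dim={dim}: {dense_vector_bytes(dim) / 1e6:.2f} MB per vector, "
          f"{total / 1e9:.3f} GB per machine")
```

If that assumption holds, the per-machine figure grows linearly with the dimension, so it would be about 1000 times larger at 10^7 than at 10^4, which might be related to the difference in GC time.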
