I am running Spark MLlib KMeans on a single EC2 m3.2xlarge instance with 8 CPU
cores and 30GB memory. Executor memory is set to 15GB, and driver memory is
also set to 15GB.

What I observe is that when the input data is smaller than 15GB,
performance is quite stable. However, once the input reaches or exceeds
that size, performance becomes extremely unpredictable. For example, for a
15GB input with inputRDD.persist(MEMORY_ONLY), I got three dramatically
different results: 27 mins, 61 mins and 114 mins. (All settings were the
same for the 3 tests, and I created the input data immediately before
running each test to keep the OS buffer cache hot.)
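
In case it helps frame the question, here is a rough back-of-the-envelope
sketch of how much of the 15GB executor heap might be usable for cached
partitions. It assumes the legacy static memory manager defaults
(spark.storage.memoryFraction = 0.6 and spark.storage.safetyFraction = 0.9),
which I have not confirmed apply to my Spark version; the function names and
the expansion parameter are hypothetical, only for illustration.

```python
# Rough estimate of executor heap available for persisted RDD blocks.
# ASSUMPTION: legacy static memory manager defaults
# (spark.storage.memoryFraction = 0.6, spark.storage.safetyFraction = 0.9).
# Newer Spark versions use a different (unified) memory model.


def storage_budget_gb(executor_heap_gb, memory_fraction=0.6, safety_fraction=0.9):
    """Approximate heap (in GB) usable for cached blocks."""
    return executor_heap_gb * memory_fraction * safety_fraction


def cacheable_share(input_gb, budget_gb, expansion=1.0):
    """Fraction of the input that could stay cached.

    expansion is a hypothetical ratio of in-memory size to on-disk size;
    1.0 means the deserialized data is as large as the input file.
    """
    in_memory_gb = input_gb * expansion
    return min(1.0, budget_gb / in_memory_gb)


if __name__ == "__main__":
    budget = storage_budget_gb(15.0)       # executor memory from my setup
    share = cacheable_share(15.0, budget)  # 15GB input from my tests
    print(f"storage budget: {budget:.1f} GB")
    print(f"cacheable share of a 15 GB input: {share:.0%}")
```

Under those assumed defaults the budget comes to roughly 8.1GB. With
MEMORY_ONLY, partitions that do not fit are not cached and are recomputed
when needed, so this may be related, but it is only an estimate and I am not
sure it accounts for the variance between runs.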

Can anyone help explain this? Thanks very much!
