Hi Burak,

Unfortunately, I am expected to do my work in an HDInsight environment, which
only supports Microsoft's flavor of Spark 1.2.0. I cannot simply replace it
with Spark 1.3.

I think the problem I am observing is caused by the k-means|| initialization
step. I will open another thread to discuss it.
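As a quick sanity check, one way to confirm whether k-means|| seeding is the bottleneck is to switch the initialization mode to random and compare timings. A minimal sketch against the MLlib API (assuming `vectors` is your already-parsed `RDD[Vector]`; the k and iteration values are illustrative):

```scala
import org.apache.spark.mllib.clustering.KMeans

// Random initialization skips the multi-pass k-means|| seeding,
// which can dominate runtime for large k (e.g. k=5000).
val model = new KMeans()
  .setK(5000)               // target cluster count for the task
  .setMaxIterations(20)     // illustrative; tune for convergence
  .setInitializationMode(KMeans.RANDOM)  // instead of KMeans.K_MEANS_PARALLEL
  .run(vectors)             // vectors: RDD[Vector], assumed prepared upstream
```

If the job finishes in reasonable time with random initialization but stalls with the default k-means||, that would support the hypothesis that the seeding step is at fault.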


Thanks,
David





Xi Shen
about.me/davidshen <http://about.me/davidshen>

On Sun, Mar 29, 2015 at 4:34 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Hi David,
>
> Can you also try with Spark 1.3 if possible? I believe there was a 2x
> improvement on K-Means between 1.2 and 1.3.
>
> Thanks,
> Burak
>
>
>
> On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 <davidshe...@gmail.com>
> wrote:
>
>> Hi Jao,
>>
>> Sorry to pop up this old thread. I have the same problem that you did. I
>> want to know if you have figured out how to improve k-means on Spark.
>>
>> I am using Spark 1.2.0. My data set has about 270k vectors, each with
>> about 350 dimensions. If I set k=500, the job takes about 3 hrs on my
>> cluster. The cluster has 7 executors, each with 8 cores...
>>
>> If I set k=5000, which is the required value for my task, the job goes on
>> forever...
>>
>>
>> Thanks,
>> David
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
