Caching after doing the multiply is a good idea. Keep in mind that during the first iteration of KMeans the cached rows haven't been materialized yet, so that first pass is doing both the multiply and the first KMeans iteration at once. To isolate which part is slow, call an action such as cachedRows.count() (or newData.numRows() on the RowMatrix) to force materialization before you run KMeans.
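For example, here is a minimal sketch of that isolation step. It assumes `data` and `bcPrincipalComponents` are the RowMatrix and broadcast principal-components matrix from your snippet below; the k and iteration count are placeholders.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Project the data onto the principal components, then force the
    // projected rows to materialize so the multiply isn't folded into
    // KMeans' first pass.
    val projected: RowMatrix = data.multiply(bcPrincipalComponents.value)
    val cachedRows = projected.rows.persist()
    cachedRows.count()  // action: runs the multiply and fills the cache

    // Any timing from here on reflects only the clustering work.
    val model = KMeans.train(cachedRows, 10, 20)  // placeholder k and maxIterations
    cachedRows.unpersist()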
Also, KMeans is optimized to run quickly on both sparse and dense data. The result of PCA is going to be dense, but if your input data has roughly as many nonzeros per row as the PCA output has columns (#nnz ~= size(pca data)), performance might be about the same. (I haven't actually verified this last point; a quick way to check it is sketched below the quoted message.) Finally, speed also depends on how much data you have relative to scheduler overheads: if your input data is small, the cost of distributing the tasks can be greater than the time spent actually computing. That usually shows up as the stages taking about the same amount of time even though the datasets you pass have different dimensionality.

On Tue, Sep 30, 2014 at 9:00 AM, st553 <sthompson...@gmail.com> wrote:

> Thanks for your response Burak, it was very helpful.
>
> I am noticing that if I run PCA before KMeans, the KMeans algorithm will
> actually take longer to run than if I had just run KMeans without PCA. I
> was hoping that by using PCA first it would actually speed up the KMeans
> algorithm.
>
> I have followed the steps you've outlined, but I'm wondering if I need to
> cache/persist the RDD[Vector] rows of the RowMatrix returned after
> multiplying. Something like:
>
> val newData: RowMatrix = data.multiply(bcPrincipalComponents.value)
> val cachedRows = newData.rows.persist()
> KMeans.run(cachedRows)
> cachedRows.unpersist()
>
> It doesn't seem intuitive to me that a smaller-dimensional version of my
> data set would take longer for KMeans... unless I'm missing something?
>
> Thanks!
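PS: here is the quick check I mentioned above for comparing the input's nonzeros per row against the width of the PCA output. It assumes the same `data` RowMatrix and broadcast principal-components matrix as in your snippet.

    import org.apache.spark.mllib.linalg.{SparseVector, Vector}

    // Average number of stored entries per row of the original data.
    val avgNnz = data.rows.map {
      case sv: SparseVector => sv.indices.length.toDouble
      case dv: Vector       => dv.size.toDouble
    }.mean()

    // Width of every row after projecting onto the principal components.
    val pcaWidth = bcPrincipalComponents.value.numCols

    // If avgNnz is close to pcaWidth, KMeans does roughly the same amount
    // of per-row arithmetic before and after PCA.
    println(s"avg nnz per input row: $avgNnz, projected width: $pcaWidth")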