To run KMeans after PCA, multiply the original RowMatrix (on the left) by the DenseMatrix of principal components using RowMatrix.multiply. The result is another RowMatrix, so you can get an RDD[Vector] from its .rows field and train KMeans on that.
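As a self-contained illustration of the projection step (plain Scala, no Spark; the object name PcaProjection, the helper project, and the sample data are all hypothetical), multiplying each data row by the principal-components matrix is just a matrix product:

```scala
object PcaProjection {
  // Multiply an n x d data matrix (one Array per row) by a d x k
  // principal-components matrix, yielding the n x k projected data.
  // Each output row is the original row expressed in the PCA basis.
  def project(data: Array[Array[Double]],
              pc: Array[Array[Double]]): Array[Array[Double]] =
    data.map { row =>
      Array.tabulate(pc(0).length) { j =>
        row.indices.map(i => row(i) * pc(i)(j)).sum
      }
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical 2-D points projected onto a single component (1, 0),
    // which simply keeps the first coordinate of each point.
    val data = Array(Array(1.0, 2.0), Array(3.0, 4.0))
    val pc   = Array(Array(1.0), Array(0.0))
    val projected = project(data, pc)
    println(projected.map(_.mkString(",")).mkString(";")) // prints "1.0;3.0"
  }
}
```

In Spark this same product is what RowMatrix.multiply computes, distributed across the rows of the RDD; the projected rows are then what you cluster.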
Don't forget to broadcast the matrix returned from RowMatrix.computePrincipalComponents(); otherwise it gets serialized into every task closure and you can hit an OOME. Here's how to do it in Scala (didn't run the code, but it should be something like this):

val data: RowMatrix = ...
val pc = data.computePrincipalComponents(k) // k = target dimensionality
val bcPrincipalComponents = data.rows.context.broadcast(pc)
val newData: RowMatrix = data.multiply(bcPrincipalComponents.value)
KMeans.train(newData.rows, numClusters, numIterations)

Best,
Burak

----- Original Message -----
From: "st553" <sthompson...@gmail.com>
To: u...@spark.incubator.apache.org
Sent: Wednesday, September 17, 2014 12:21:38 PM
Subject: How to run kmeans after pca?

I would like to reduce the dimensionality of my data before running kmeans. The problem I'm having is that both RowMatrix.computePrincipalComponents() and RowMatrix.computeSVD() return a DenseMatrix, whereas KMeans.train() requires an RDD[Vector]. Does MLlib provide a way to do this conversion?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-kmeans-after-pca-tp14473.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org