I think we discussed this a while ago (?), and the problem was that even verifying the sorted state took too long.
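For context, the check under discussion is a single O(nnz) pass over the index array. A minimal sketch of such a check (the helper name is hypothetical, not an existing MLlib method):

// Returns true iff the index array of a sparse vector is strictly
// increasing, i.e. the layout MLlib's sparse code paths assume.
def indicesStrictlyIncreasing(indices: Array[Int]): Boolean = {
  var i = 1
  while (i < indices.length) {
    if (indices(i) <= indices(i - 1)) return false
    i += 1
  }
  true
}

// e.g. indicesStrictlyIncreasing(Array(0, 2)) == true
//      indicesStrictlyIncreasing(Array(2, 0)) == false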
On Thu, Apr 23, 2015 at 3:31 AM, Joseph Bradley <jos...@databricks.com> wrote:
> Hi Chunnan,
>
> There is currently Scala documentation for the constructor parameters:
> https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L515
>
> There is one benefit to not checking for validity (ordering) within the
> constructor: if you need to translate between SparseVector and some other
> library's type (e.g., Breeze), you can do so with a few reference copies,
> rather than iterating through or copying the actual data. It might be good
> to provide this check within Vectors.sparse(), but we'd need to check
> through MLlib for uses of Vectors.sparse() that expect it to be a cheap
> operation. What do you think?
>
> It is documented in the programming guide too:
> https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/docs/mllib-data-types.md
> But perhaps that should be more prominent.
>
> If you think it would be helpful, then please do make a JIRA about adding
> a check to Vectors.sparse().
>
> Joseph
>
> On Wed, Apr 22, 2015 at 8:29 AM, Chunnan Yao <yaochun...@gmail.com> wrote:
>
>> Hi all,
>> I am using Spark 1.3.1 to write a spectral clustering algorithm. This
>> really confused me today. At first I thought my implementation was wrong;
>> it turns out it's an issue in MLlib. Fortunately, I've figured it out.
>>
>> I suggest adding a hint to the MLlib user documentation (as far as I
>> know, there is no such hint yet) that the indices of a local SparseVector
>> must be in ascending order. Because I was unaware of this, I spent a lot
>> of time looking for reasons why computeSVD of RowMatrix did not run
>> correctly on sparse data. I don't know how a SparseVector with unordered
>> indices affects other functions, but I believe it is necessary to let
>> users know, or to fix it. Actually, it's very easy to fix: just sort the
>> indices (e.g., with a sortBy) in the internal construction of
>> SparseVector.
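A minimal sketch of that proposed normalization, written as a hypothetical sortedSparse helper rather than as a change to the actual SparseVector constructor:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sort indices and values together by index before constructing the
// vector, so callers may pass entries in any order.
def sortedSparse(size: Int, indices: Array[Int], values: Array[Double]): Vector = {
  val (sortedIdx, sortedVal) = indices.zip(values).sortBy(_._1).unzip
  Vectors.sparse(size, sortedIdx, sortedVal)
}

// sortedSparse(3, Array(2, 0), Array(1.0, 9.0)) is equivalent to
// Vectors.sparse(3, Array(0, 2), Array(9.0, 1.0))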
>> Here is an example to reproduce the effect of an unordered SparseVector
>> on computeSVD.
>> ================================================
>> //in spark-shell, Spark 1.3.1
>> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>> import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector, Vectors}
>>
>> val sparseData_ordered = Seq(
>>   Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
>>   Vectors.sparse(3, Array(0, 1, 2), Array(3.0, 4.0, 5.0)),
>>   Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
>>   Vectors.sparse(3, Array(0, 2), Array(9.0, 1.0))
>> )
>> val sparseMat_ordered = new RowMatrix(sc.parallelize(sparseData_ordered, 2))
>>
>> val sparseData_not_ordered = Seq(
>>   Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
>>   Vectors.sparse(3, Array(2, 1, 0), Array(5.0, 4.0, 3.0)),
>>   Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
>>   Vectors.sparse(3, Array(2, 0), Array(1.0, 9.0))
>> )
>> val sparseMat_not_ordered = new RowMatrix(sc.parallelize(sparseData_not_ordered, 2))
>>
>> // sparseMat_ordered and sparseMat_not_ordered are essentially the same
>> // matrix; however, computeSVD gives different results for the two. Users
>> // should be notified about this situation.
>> println(sparseMat_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
>> println("===================")
>> println(sparseMat_not_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
>> ======================================================
>> The results are:
>>
>> ordered:
>> [-0.10972870132786407,-0.18850811494220537]
>> [-0.44712472003608356,-0.24828866611663725]
>> [-0.784520738744303,-0.3080692172910691]
>> [-0.4154110101064339,0.8988385762953358]
>>
>> not ordered:
>> [-0.10830447119599484,-0.1559341848984378]
>> [-0.4522713511277327,-0.23449829541447448]
>> [-0.7962382310594706,-0.3130624059305111]
>> [-0.43131320303494614,0.8453864703362308]
>>
>> Looking into this issue, I can see its cause is in RowMatrix.scala (line
>> 629): the implementation of the sparse dspr there requires ordered
>> indices, because it scans the indices consecutively to skip empty columns.
>>
>> -----
>> Feel the sparking Spark!
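The consecutive scan Chunnan describes is easiest to see in stripped-down form. The following is not the actual dspr code from RowMatrix.scala, just a minimal merge-style scan with the same ascending-index assumption; on unordered input it silently drops terms instead of failing:

// Dot product of two sparse vectors given as (indices, values) pairs.
// Correct only when both index arrays are in ascending order.
def sparseDot(xIdx: Array[Int], xVal: Array[Double],
              yIdx: Array[Int], yVal: Array[Double]): Double = {
  var i = 0
  var j = 0
  var sum = 0.0
  while (i < xIdx.length && j < yIdx.length) {
    if (xIdx(i) == yIdx(j)) { sum += xVal(i) * yVal(j); i += 1; j += 1 }
    else if (xIdx(i) < yIdx(j)) i += 1  // skip a column y does not have
    else j += 1                         // skip a column x does not have
  }
  sum
}

sparseDot(Array(0, 2), Array(3.0, 4.0), Array(0, 2), Array(1.0, 2.0)) // 11.0
sparseDot(Array(2, 0), Array(4.0, 3.0), Array(0, 2), Array(1.0, 2.0)) // 8.0, silently wrong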