I think we discussed this a while ago (?), and the problem was that even verifying the sorted state took too long.
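For context, the check under discussion is a single O(nnz) pass over the index array. A minimal sketch of such a check (the helper name is hypothetical, not an existing MLlib method):

// Returns true iff the index array of a sparse vector is strictly
// increasing, i.e. the layout MLlib's sparse code paths assume.
def indicesStrictlyIncreasing(indices: Array[Int]): Boolean = {
  var i = 1
  while (i < indices.length) {
    if (indices(i) <= indices(i - 1)) return false
    i += 1
  }
  true
}

// e.g. indicesStrictlyIncreasing(Array(0, 2)) == true
//      indicesStrictlyIncreasing(Array(2, 0)) == false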
On Thu, Apr 23, 2015 at 3:31 AM, Joseph Bradley <jos...@databricks.com> wrote:
> Hi Chunnan,
>
> There is currently Scala documentation for the constructor parameters:
> https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L515
>
> There is one benefit to not checking for validity (ordering) within the
> constructor: if you need to translate between SparseVector and some other
> library's type (e.g., Breeze), you can do so with a few reference copies,
> rather than iterating through or copying the actual data. It might be good
> to provide this check within Vectors.sparse(), but we'd need to check
> through MLlib for uses of Vectors.sparse() that expect it to be a cheap
> operation. What do you think?
>
> It is documented in the programming guide too:
> https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/docs/mllib-data-types.md
> But perhaps that should be more prominent.
>
> If you think it would be helpful, then please do make a JIRA about adding
> a check to Vectors.sparse().
>
> Joseph
>
> On Wed, Apr 22, 2015 at 8:29 AM, Chunnan Yao <yaochun...@gmail.com> wrote:
>
>> Hi all,
>> I am using Spark 1.3.1 to write a spectral clustering algorithm. This
>> really confused me today. At first I thought my implementation was wrong;
>> it turns out it's an issue in MLlib. Fortunately, I've figured it out.
>>
>> I suggest adding a hint to the MLlib user documentation (as far as I
>> know, there is no such hint yet) that the indices of a local SparseVector
>> must be in ascending order. Because I was unaware of this, I spent a lot
>> of time looking for reasons why computeSVD of RowMatrix did not run
>> correctly on sparse data. I don't know how a SparseVector with unordered
>> indices affects other functions, but I believe it is necessary to let
>> users know, or to fix it. Actually, it's very easy to fix: just sort the
>> indices (e.g., with a sortBy) in the internal construction of
>> SparseVector.
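A minimal sketch of that proposed normalization, written as a hypothetical sortedSparse helper rather than as a change to the actual SparseVector constructor:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sort indices and values together by index before constructing the
// vector, so callers may pass entries in any order.
def sortedSparse(size: Int, indices: Array[Int], values: Array[Double]): Vector = {
  val (sortedIdx, sortedVal) = indices.zip(values).sortBy(_._1).unzip
  Vectors.sparse(size, sortedIdx, sortedVal)
}

// sortedSparse(3, Array(2, 0), Array(1.0, 9.0)) is equivalent to
// Vectors.sparse(3, Array(0, 2), Array(9.0, 1.0))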
>> Here is an example to reproduce the effect of an unordered SparseVector
>> on computeSVD.
>> ================================================
>> //in spark-shell, Spark 1.3.1
>> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>> import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector, Vectors}
>>
>> val sparseData_ordered = Seq(
>>   Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
>>   Vectors.sparse(3, Array(0, 1, 2), Array(3.0, 4.0, 5.0)),
>>   Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
>>   Vectors.sparse(3, Array(0, 2), Array(9.0, 1.0))
>> )
>> val sparseMat_ordered = new RowMatrix(sc.parallelize(sparseData_ordered, 2))
>>
>> val sparseData_not_ordered = Seq(
>>   Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
>>   Vectors.sparse(3, Array(2, 1, 0), Array(5.0, 4.0, 3.0)),
>>   Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
>>   Vectors.sparse(3, Array(2, 0), Array(1.0, 9.0))
>> )
>> val sparseMat_not_ordered = new RowMatrix(sc.parallelize(sparseData_not_ordered, 2))
>>
>> // sparseMat_ordered and sparseMat_not_ordered are essentially the same
>> // matrix; however, computeSVD gives different results for the two. Users
>> // should be notified about this situation.
>> println(sparseMat_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
>> println("===================")
>> println(sparseMat_not_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
>> ======================================================
>> The results are:
>>
>> ordered:
>> [-0.10972870132786407,-0.18850811494220537]
>> [-0.44712472003608356,-0.24828866611663725]
>> [-0.784520738744303,-0.3080692172910691]
>> [-0.4154110101064339,0.8988385762953358]
>>
>> not ordered:
>> [-0.10830447119599484,-0.1559341848984378]
>> [-0.4522713511277327,-0.23449829541447448]
>> [-0.7962382310594706,-0.3130624059305111]
>> [-0.43131320303494614,0.8453864703362308]
>>
>> Looking into this issue, I can see its cause is in RowMatrix.scala (line
>> 629): the implementation of the sparse dspr there requires ordered
>> indices, because it scans the indices consecutively to skip empty columns.
>>
>> -----
>> Feel the sparking Spark!
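The consecutive scan Chunnan describes is easiest to see in stripped-down form. The following is not the actual dspr code from RowMatrix.scala, just a minimal merge-style scan with the same ascending-index assumption; on unordered input it silently drops terms instead of failing:

// Dot product of two sparse vectors given as (indices, values) pairs.
// Correct only when both index arrays are in ascending order.
def sparseDot(xIdx: Array[Int], xVal: Array[Double],
              yIdx: Array[Int], yVal: Array[Double]): Double = {
  var i = 0
  var j = 0
  var sum = 0.0
  while (i < xIdx.length && j < yIdx.length) {
    if (xIdx(i) == yIdx(j)) { sum += xVal(i) * yVal(j); i += 1; j += 1 }
    else if (xIdx(i) < yIdx(j)) i += 1  // skip a column y does not have
    else j += 1                         // skip a column x does not have
  }
  sum
}

sparseDot(Array(0, 2), Array(3.0, 4.0), Array(0, 2), Array(1.0, 2.0)) // 11.0
sparseDot(Array(2, 0), Array(4.0, 3.0), Array(0, 2), Array(1.0, 2.0)) // 8.0, silently wrong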