Hi Reza,

That was the fix we needed. After sorting, the transposed entries are gone!

Thanks a bunch,
rick

On Sat, May 9, 2015 at 5:17 PM, Reza Zadeh <[email protected]> wrote:

> Hi Richard,
>
> One reason that could be happening is that the rows of your matrix are
> using SparseVectors, but the entries in your vectors aren't sorted by
> index. Is that the case? Sparse Vectors
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala>
> need sorted indices.
>
> Reza
>
> On Sat, May 9, 2015 at 8:51 AM, Richard Bolkey <[email protected]> wrote:
>
>> Hi Reza,
>>
>> After a bit of digging, I had my previous issue a little bit wrong. We're
>> not getting duplicate (i,j) entries, but we are getting transposed entries
>> (i,j) and (j,i) with potentially different scores. We assumed the output
>> would be a triangular matrix. Still, let me know if that's expected. A
>> transposed entry occurs for about 5% of our output entries.
>>
>> scala> matrix.entries.filter(x => (x.i,x.j) == (22769,539029)).collect()
>> res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
>> Array(MatrixEntry(22769,539029,0.00453050595770095))
>>
>> scala> matrix.entries.filter(x => (x.i,x.j) == (539029,22769)).collect()
>> res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
>> Array(MatrixEntry(539029,22769,0.002265252978850475))
>>
>> I saved a subset of vectors to object files that replicates the issue.
>> It's about 300 MB. Should I try to whittle that down some more? What would
>> be the best way to get that to you?
>>
>> Many thanks,
>> Rick
>>
>> On Thu, May 7, 2015 at 8:58 PM, Reza Zadeh <[email protected]> wrote:
>>
>>> This shouldn't be happening. Do you have an example to reproduce it?
>>>
>>> On Thu, May 7, 2015 at 4:17 PM, rbolkey <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a question regarding one of the oddities we encountered while
>>>> running mllib's column similarities operation. When we examine the
>>>> output, we find duplicate matrix entries (the same i,j). Sometimes the
>>>> entries have the same value/similarity score, but they're frequently
>>>> different too.
>>>>
>>>> Is this a known issue? An artifact of the probabilistic nature of the
>>>> output? Which output score should we trust (lower vs higher one when
>>>> different)? We're using a threshold of 0.3, and running Spark 1.3.1 on
>>>> a 10-node cluster.
>>>>
>>>> Thanks
>>>> Rick
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Duplicate-entries-in-output-of-mllib-column-similarities-tp22807.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
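
For anyone finding this thread later: the fix discussed above was to make sure the indices of each sparse row vector are sorted before the matrix is built. Below is a minimal sketch of that sorting step, written in plain Scala so it runs without Spark. `SortedSparse` and `sortedArrays` are hypothetical names chosen for illustration, and this is not the code Rick actually used.

```scala
// Hypothetical helper (not from the thread): sort (index, value) pairs by
// index so they meet the sorted-indices requirement of sparse vectors.
object SortedSparse {
  def sortedArrays(entries: Seq[(Int, Double)]): (Array[Int], Array[Double]) = {
    // sortBy is stable, so the relative order of equal indices is preserved.
    val sorted = entries.sortBy(_._1)
    (sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }

  def main(args: Array[String]): Unit = {
    // Unsorted input, as it might come out of a map or a groupBy.
    val raw = Seq((7, 0.5), (2, 1.0), (4, 0.25))
    val (indices, values) = sortedArrays(raw)
    println(indices.mkString(","))  // prints 2,4,7
    println(values.mkString(","))   // prints 1.0,0.25,0.5
  }
}
```

With Spark MLlib on the classpath, the two arrays could then be passed to `org.apache.spark.mllib.linalg.Vectors.sparse(size, indices, values)`, where `size` is the vector length. The sketch assumes each index appears at most once per row; duplicate indices would need to be merged separately.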
