Hi Reza,

After a bit of digging, it turns out I described my previous issue slightly wrong. We're not getting duplicate (i,j) entries; we're getting transposed entries, (i,j) and (j,i), with potentially different scores. We had assumed the output would be a triangular matrix, so let me know if this is actually expected. About 5% of our output entries have a transposed counterpart.
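In case it helps, here is a rough sketch of how transposed pairs like these could be listed from the spark-shell. It assumes `matrix` is the CoordinateMatrix returned by columnSimilarities; `transposedPairs` is a hypothetical helper name, not part of MLlib, and this is only one way to do the check.

```scala
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of MLlib): pairs each (i,j) entry with its
// (j,i) counterpart, if one exists, and reports each such pair once.
def transposedPairs(matrix: CoordinateMatrix): RDD[((Long, Long), (Double, Double))] = {
  // Key every entry by its own index pair ...
  val byIndex = matrix.entries.map(e => ((e.i, e.j), e.value))
  // ... and again by the flipped index pair.
  val byFlipped = matrix.entries.map(e => ((e.j, e.i), e.value))
  // A join hit means both (i,j) and (j,i) are present. Keeping only i < j
  // drops diagonal self-matches and the mirrored copy of each pair.
  byIndex.join(byFlipped).filter { case ((i, j), _) => i < j }
}
```

Calling `transposedPairs(matrix).take(5)` would then show a few ((i,j), (score at (i,j), score at (j,i))) tuples, and `.count()` gives the number of affected pairs.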
For example:

scala> matrix.entries.filter(x => (x.i,x.j) == (22769,539029)).collect()
res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(MatrixEntry(22769,539029,0.00453050595770095))

scala> matrix.entries.filter(x => (x.i,x.j) == (539029,22769)).collect()
res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(MatrixEntry(539029,22769,0.002265252978850475))

I saved a subset of the vectors that reproduces the issue to object files. It's about 300 MB. Should I try to whittle it down some more? What would be the best way to get it to you?

Many thanks,
Rick

On Thu, May 7, 2015 at 8:58 PM, Reza Zadeh <r...@databricks.com> wrote:

> This shouldn't be happening, do you have an example to reproduce it?
>
> On Thu, May 7, 2015 at 4:17 PM, rbolkey <rbol...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a question regarding one of the oddities we encountered while
>> running mllib's column similarities operation. When we examine the
>> output, we find duplicate matrix entries (the same i,j). Sometimes the
>> entries have the same value/similarity score, but they're frequently
>> different too.
>>
>> Is this a known issue? An artifact of the probabilistic nature of the
>> output? Which output score should we trust (lower vs higher one when
>> different)? We're using a threshold of 0.3, and running Spark 1.3.1 on
>> a 10 node cluster.
>>
>> Thanks
>> Rick