Hi Reza,

After a bit of digging, it turns out I had my previous issue slightly wrong. We're
not getting duplicate (i,j) entries, but we are getting transposed entries
(i,j) and (j,i), sometimes with different scores. We had assumed the output
would be an upper-triangular matrix. Still, let me know if that's expected.
Transposed entries account for about 5% of our output entries.

scala> matrix.entries.filter(x => (x.i,x.j) == (22769,539029)).collect()
res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
Array(MatrixEntry(22769,539029,0.00453050595770095))

scala> matrix.entries.filter(x => (x.i,x.j) == (539029,22769)).collect()
res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
Array(MatrixEntry(539029,22769,0.002265252978850475))
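
For what it's worth, here's a rough sketch (plain Scala, no Spark, on a
collected sample; the Entry case class and values are illustrative, not
MLlib's API) of how we're flagging these pairs: canonicalize each index
pair to (min, max) and group, so (i,j) and (j,i) land in the same bucket.

```scala
// Each similarity entry: row index, column index, score.
case class Entry(i: Long, j: Long, score: Double)

// Sample data, including the transposed pair from above.
val entries = Seq(
  Entry(22769L, 539029L, 0.00453050595770095),
  Entry(539029L, 22769L, 0.002265252978850475),
  Entry(1L, 2L, 0.5) // no transposed counterpart
)

// Group by the unordered index pair so both orientations collide,
// then keep only the buckets with more than one entry.
val transposed = entries
  .groupBy(e => (math.min(e.i, e.j), math.max(e.i, e.j)))
  .filter { case (_, es) => es.size > 1 }

transposed.foreach { case (pair, es) =>
  println(s"$pair appears ${es.size} times with scores ${es.map(_.score)}")
}
```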

I saved a subset of vectors to object files that replicates the issue.
It's about 300 MB. Should I try to whittle that down further? What would
be the best way to get it to you?

Many thanks,
Rick

On Thu, May 7, 2015 at 8:58 PM, Reza Zadeh <r...@databricks.com> wrote:

> This shouldn't be happening, do you have an example to reproduce it?
>
> On Thu, May 7, 2015 at 4:17 PM, rbolkey <rbol...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a question regarding one of the oddities we encountered while
>> running
>> mllib's column similarities operation. When we examine the output, we find
>> duplicate matrix entries (the same i,j). Sometimes the entries have the
>> same
>> value/similarity score, but they're frequently different too.
>>
>> Is this a known issue? An artifact of the probabilistic nature of the
>> output? Which output score should we trust (lower vs higher one when
>> different)? We're using a threshold of 0.3, and running Spark 1.3.1 on a
>> 10
>> node cluster.
>>
>> Thanks
>> Rick
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Duplicate-entries-in-output-of-mllib-column-similarities-tp22807.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>