Re: Duplicate entries in output of mllib column similarities

Richard Bolkey Tue, 12 May 2015 07:43:23 -0700

Hi Reza,

That was the fix we needed. After sorting, the transposed entries are gone!
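
A minimal sketch of that kind of fix, assuming each row is built from
parallel index and value arrays. The thread does not show the code that was
actually used, and `SortedSparse.sortByIndex` is a hypothetical helper name,
not part of MLlib:

```scala
// Hypothetical helper (not an MLlib API): reorder parallel index/value arrays
// so that the indices are ascending, as SparseVector expects.
object SortedSparse {
  def sortByIndex(indices: Array[Int],
                  values: Array[Double]): (Array[Int], Array[Double]) = {
    require(indices.length == values.length,
      "indices and values must have the same length")
    // Pair each index with its value, sort the pairs by index, then split again.
    val (sortedIdx, sortedVals) = indices.zip(values).sortBy(_._1).unzip
    (sortedIdx, sortedVals)
  }
}

// Usage with Spark on the classpath (the size 10 is an arbitrary example):
//   val (idx, vals) = SortedSparse.sortByIndex(Array(5, 0, 2), Array(0.5, 1.0, 2.0))
//   val row = org.apache.spark.mllib.linalg.Vectors.sparse(10, idx, vals)
```

The helper itself has no Spark dependency, so it can be checked in a plain
Scala REPL before wiring it into the job that builds the RowMatrix.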


Thanks a bunch,
rick

On Sat, May 9, 2015 at 5:17 PM, Reza Zadeh <[email protected]> wrote:

> Hi Richard,
> One reason that could be happening is that the rows of your matrix are
> using SparseVectors, but the entries in your vectors aren't sorted by
> index. Is that the case? Sparse Vectors
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala>
> need sorted indices.
> Reza
>
> On Sat, May 9, 2015 at 8:51 AM, Richard Bolkey <[email protected]> wrote:
>
>> Hi Reza,
>>
>> After a bit of digging, I had my previous issue a little bit wrong. We're
>> not getting duplicate (i,j) entries, but we are getting transposed entries
>> (i,j) and (j,i) with potentially different scores. We assumed the output
>> would be a triangular matrix. Still, let me know if that's expected. A
>> transposed entry occurs for about 5% of our output entries.
>>
>> scala> matrix.entries.filter(x => (x.i,x.j) == (22769,539029)).collect()
>> res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
>> Array(MatrixEntry(22769,539029,0.00453050595770095))
>>
>> scala> matrix.entries.filter(x => (x.i,x.j) == (539029,22769)).collect()
>> res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
>> Array(MatrixEntry(539029,22769,0.002265252978850475))
>>
>> I saved a subset of vectors to object files that replicates the issue.
>> It's about 300 MB. Should I try to whittle that down some more? What would
>> be the best way to get that to you?
>>
>> Many thanks,
>> Rick
>>
>> On Thu, May 7, 2015 at 8:58 PM, Reza Zadeh <[email protected]> wrote:
>>
>>> This shouldn't be happening. Do you have an example to reproduce it?
>>>
>>> On Thu, May 7, 2015 at 4:17 PM, rbolkey <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a question regarding one of the oddities we encountered while
>>>> running mllib's column similarities operation. When we examine the
>>>> output, we find duplicate matrix entries (the same i,j). Sometimes the
>>>> entries have the same value/similarity score, but they're frequently
>>>> different too.
>>>>
>>>> Is this a known issue? An artifact of the probabilistic nature of the
>>>> output? Which output score should we trust (lower vs higher one when
>>>> different)? We're using a threshold of 0.3, and running Spark 1.3.1 on
>>>> a 10 node cluster.
>>>>
>>>> Thanks
>>>> Rick
>>>>
>>>
>>
>
