+1 As Robin noted, this patch will affect some of the clustering code and it will conflict with the changes I've been working for MAHOUT-236. On balance, fixing the whole Vector equivalence mess seems prudent and I will deal with the rework. You've done a pile of work here and I think factoring out the names into a separate wrapper makes sense, so commit it and I will sort out the conflicts.

On 4/20/10 3:38 AM, Sean Owen (JIRA) wrote:
     [ 
https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858807#action_12858807
 ]

Sean Owen commented on MAHOUT-379:
----------------------------------

I'd like to commit this patch as it addresses a couple issues, but it's a big 
one. Deserves some looking if you have a moment. We can retroactively undo some 
elements later, but best to have a glance now.

SequentialAccessSparseVector.equals does not agree with 
AbstractVector.equivalent
---------------------------------------------------------------------------------

                 Key: MAHOUT-379
                 URL: https://issues.apache.org/jira/browse/MAHOUT-379
             Project: Mahout
          Issue Type: Bug
          Components: Math
    Affects Versions: 0.4
            Reporter: Danny Leshem
            Assignee: Sean Owen
            Priority: Minor
             Fix For: 0.3

         Attachments: MAHOUT-379.patch, MAHOUT-379.patch, MAHOUT-379.patch


When a SequentialAccessSparseVector is serialized and deserialized using 
VectorWritable, the result vector and the original vector are equivalent, yet 
equals returns false.
The following unit-test reproduces the problem:
{code}
@Test
public void testSequentialAccessSparseVectorEquals() throws Exception {
     final Vector v = new SequentialAccessSparseVector(1);
     final VectorWritable vectorWritable = new VectorWritable(v);
     final VectorWritable vectorWritable2 = new VectorWritable();
     writeAndRead(vectorWritable, vectorWritable2);
     final Vector v2 = vectorWritable2.get();
     assertTrue(AbstractVector.equivalent(v, v2));
     assertEquals(v, v2); // This line fails!
}
private void writeAndRead(Writable toWrite, Writable toRead) throws IOException 
{
     final ByteArrayOutputStream baos = new ByteArrayOutputStream();
     final DataOutputStream dos = new DataOutputStream(baos);
     toWrite.write(dos);
     final ByteArrayInputStream bais = new 
ByteArrayInputStream(baos.toByteArray());
     final DataInputStream dis = new DataInputStream(bais);
     toRead.readFields(dis);
}
{code}
The problem seems to be that the original vector name is null, while the new 
vector's name is an empty string. The same issue probably also happens with 
RandomAccessSparseVector.
SequentialAccessSparseVectorWritable (line 40):
{code}
dataOutput.writeUTF(getName() == null ? "" : getName());
{code}
RandomAccessSparseVectorWritable (line 42):
{code}
dataOutput.writeUTF(this.getName() == null ? "" : this.getName());
{code}
The simplest fix is probably to change the default Vector's name from null to 
the empty string.

Reply via email to