4.0 and 4.1 FieldCacheImpl.DocTermsImpl.exists(docid) possibly broken

Doron Cohen Thu, 18 Jul 2013 04:51:18 -0700

Hi, just an FYI - may be helpful for anyone obliged to use 4.0.0 or 4.1.0 -
it seems that this method is actually doing the opposite of its intention.


I did not find mentions of this in the lists or elsewhere.

This is the code for o.a.l.search.FieldCacheImpl.DocTermsImpl.exists(int):
public boolean exists(int docID) {
      return docToOffset.get(docID) == 0;
}

Its description says: "Returns true if this doc has this field and is not
deleted".
But it returns true for docs not containing the field and false for those
that do contain it.

A simple workaround is to not to call this method before calling getTerm()
but rather just rely on getTerm()  logic: "... returns the same BytesRef,
or an empty (length=0) BytesRef if the doc did not have this field or was
deleted."

So usage code can be like this:
DocTerms values =  FieldCache.DEFAULT.getTerms(reader, FIELD_NAME);
 BytesRef term = new BytesRef();
for (int docid=0; docid<values.size(); docid++) {
term = values.getTerm(docid, term);
 if (term.length>0) {
doSomethingWith(term.utf8ToString());
}
 }
FieldCache.DEFAULT.purge(reader);

I am not sure about the overhead of this comparing to first checking
exists(), but it at least work correctly.

The code for exists() was as above until R1442497 (
http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/FieldCacheImpl.java?revision=1442497&view=markup)
and then in R1443717 (
http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/FieldCacheImpl.java?r1=1442497&r2=1443717&diff_format=h)
the API was change as part of LUCENE-4547 (DocValues improvements) which
was included in 4.2.

Simple code to demonstrate this (here with 4.1 but same results with 4.0):

RAMDirectory d = new RAMDirectory();
IndexWriter w = new IndexWriter(d, new IndexWriterConfig(Version.LUCENE_41,
new SimpleAnalyzer(Version.LUCENE_41)));
 w.addDocument(new Document()); // Empty doc (0, 0)
 Document doc = new Document(); // Real doc (1, 1)
doc.add(new StringField("f1", "v1", Store.NO));
w.addDocument(doc);
 w.addDocument(new Document()); // Empty doc (2, 2)
w.addDocument(new Document()); // Empty doc (3, 3)
w.commit(); // Commit - so we'll have two atomic readers
 doc = new Document(); // RealDoc (0, 4)
doc.add(new StringField("f1", "v2", Store.NO));
w.addDocument(doc);
w.addDocument(new Document()); // Empty doc (1, 5)
w.close();

IndexReader r = DirectoryReader.open(d);
BytesRef br = new BytesRef();
for (AtomicReaderContext leaf : r.leaves()) {
System.out.println("--- new atomic reader");
AtomicReader reader = leaf.reader();
DocTerms a = FieldCache.DEFAULT.getTerms(reader, "f1");
for (int i = 0; i < reader.maxDoc(); ++i) {
int n = leaf.docBase + i;
System.out.println(n+" exists: "+a.exists(i));
br = a.getTerm(i, br);
if (br.length > 0) {
System.out.println(n + "  "+br.utf8ToString());
}
}
}

The result printing:

      --- new atomic reader
      0 exists: true
      1 exists: false
      1  v1
      2 exists: true
      3 exists: true
      --- new atomic reader
     4 exists: false
     4  v2
     5 exists: true

Indeed, exists() results are wrong.

So again, just an FYI, as this API no longer exists...

Regards,
Doron

4.0 and 4.1 FieldCacheImpl.DocTermsImpl.exists(docid) possibly broken

Reply via email to