Luke Kot-Zaniewski created SOLR-18125:
-----------------------------------------

             Summary: LukeRequestHandler's getFirstLiveDoc is (Doubly) Broken
                 Key: SOLR-18125
                 URL: https://issues.apache.org/jira/browse/SOLR-18125
             Project: Solr
          Issue Type: Bug
            Reporter: Luke Kot-Zaniewski


Funny little expedition as I was testing a patch I had to support distrib=true 
for LukeRequestHandler. It turns out the logic to fetch the first live doc from 
the index for a particular field (used to populate the index flags for that 
field) is broken in two surprising ways. As far as I can tell it's been broken 
since 2012. The net effect is you'll just randomly not get index flags for 
fields as I show in the test.

Timeline:
 # LUCENE-4355 Before [this 
change|https://github.com/apache/solr/commit/dc02cdd38450f6a269964bc5d24584994578fa54]
 docsEnum = reader.termDocsEnum(liveDocs, field, termText, offset) apparently 
would return null, perhaps when there were no live docs for this particular 
term. There used to be a guard in the loop condition which ensured that it 
would exit on the first non-null docsEnum it encountered across all terms. Not 
entirely sure why exactly this extra condition was necessary; it would only 
matter if the first non-null docsEnum also didn't have any docs. Either away, 
reaching such a state apparently warranted ending the search altogether. 
LUCENE-4355 changed the semantics to termsEnum.docs(liveDocs, docsEnum, offset) 
which could _not_ return null, as much is obvious from the fact that the null 
guard disappeared. However, the docsEnum == null remained in the loop 
condition, almost certainly by mistake.
 # LUCENE-6553 [this massive 
change|https://github.com/apache/solr/commit/d671dd8d890a8e5eb56cbcd94873c3057745a17f]
 seemed to introduce the exact opposite liveDocs check you'd expect from this 
method (somehow exiting the loop with no result when a live doc _was_ 
encountered). I suspect it is, again, an oversight given how large the change 
was and how little discussion there was around it.

LUCENE-4355 Diff:
{code:java}
       -      Term term = new Term(fieldName, text);
       -      docsEnum = reader.termDocsEnum(reader.getLiveDocs(),
       -          term.field(),
       -          new BytesRef(term.text()),
       -          0);
       -      if (docsEnum != null) {
       -        int docId;
       -        if ((docId = docsEnum.nextDoc()) != 
DocIdSetIterator.NO_MORE_DOCS) {
       -          return reader.document(docId);
       -        }
       -      }
       +      docsEnum = termsEnum.docs(reader.getLiveDocs(), docsEnum, 0);
       +      if (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
       +        return reader.document(docsEnum.docID());
       +      }
{code}
LUCENE-6553 diff
{code:java}
      postingsEnum = termsEnum.postings(reader.getLiveDocs(), postingsEnum, 
PostingsEnum.NONE);
      postingsEnum = termsEnum.postings(postingsEnum, PostingsEnum.NONE);
      final Bits liveDocs = reader.getLiveDocs();
      if (postingsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        if (liveDocs != null && liveDocs.get(postingsEnum.docID())) {
          continue;
        }
        return reader.document(postingsEnum.docID());
      }
    }
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to