Luke Kot-Zaniewski created SOLR-18125:
-----------------------------------------
Summary: LukeRequestHandler's getFirstLiveDoc is (Doubly) Broken
Key: SOLR-18125
URL: https://issues.apache.org/jira/browse/SOLR-18125
Project: Solr
Issue Type: Bug
Reporter: Luke Kot-Zaniewski
Funny little expedition as I was testing a patch I had to support distrib=true
for LukeRequestHandler. It turns out the logic to fetch the first live doc from
the index for a particular field (used to populate the index flags for that
field) is broken in two surprising ways. As far as I can tell it's been broken
since 2012. The net effect is you'll just randomly not get index flags for
fields as I show in the test.
Timeline:
# LUCENE-4355 Before [this
change|https://github.com/apache/solr/commit/dc02cdd38450f6a269964bc5d24584994578fa54]
docsEnum = reader.termDocsEnum(liveDocs, field, termText, offset) apparently
would return null, perhaps when there were no live docs for this particular
term. There used to be a guard in the loop condition which ensured that it
would exit on the first non-null docsEnum it encountered across all terms. Not
entirely sure why exactly this extra condition was necessary; it would only
matter if the first non-null docsEnum also didn't have any docs. Either away,
reaching such a state apparently warranted ending the search altogether.
LUCENE-4355 changed the semantics to termsEnum.docs(liveDocs, docsEnum, offset)
which could _not_ return null, as much is obvious from the fact that the null
guard disappeared. However, the docsEnum == null remained in the loop
condition, almost certainly by mistake.
# LUCENE-6553 [this massive
change|https://github.com/apache/solr/commit/d671dd8d890a8e5eb56cbcd94873c3057745a17f]
seemed to introduce the exact opposite liveDocs check you'd expect from this
method (somehow exiting the loop with no result when a live doc _was_
encountered). I suspect it is, again, an oversight given how large the change
was and how little discussion there was around it.
LUCENE-4355 Diff:
{code:java}
- Term term = new Term(fieldName, text);
- docsEnum = reader.termDocsEnum(reader.getLiveDocs(),
- term.field(),
- new BytesRef(term.text()),
- 0);
- if (docsEnum != null) {
- int docId;
- if ((docId = docsEnum.nextDoc()) !=
DocIdSetIterator.NO_MORE_DOCS) {
- return reader.document(docId);
- }
- }
+ docsEnum = termsEnum.docs(reader.getLiveDocs(), docsEnum, 0);
+ if (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
+ return reader.document(docsEnum.docID());
+ }
{code}
LUCENE-6553 diff
{code:java}
postingsEnum = termsEnum.postings(reader.getLiveDocs(), postingsEnum,
PostingsEnum.NONE);
postingsEnum = termsEnum.postings(postingsEnum, PostingsEnum.NONE);
final Bits liveDocs = reader.getLiveDocs();
if (postingsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
if (liveDocs != null && liveDocs.get(postingsEnum.docID())) {
continue;
}
return reader.document(postingsEnum.docID());
}
}
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]