[ https://issues.apache.org/jira/browse/LUCENE-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428670#comment-16428670 ]
Dawid Weiss commented on LUCENE-8221:
-------------------------------------
I don't mind changing the formula (even if I disagree that catering for the
internal representation of deleted documents justifies this), but not as part
of this issue. Changing the formula will change the results people get from
MLT, so it should go into a major release, not a point release. What I patched
was a trivial overflow problem that doesn't touch any internals.
bq. And so is the range check in your patch, because percentage can be larger
than 100% with the broken numDocs formula used here. When a percentage can be
bigger than 100, man that's your first sign that shit is wrong!
To me the percentage remains within 0-100% with numDocs: you compute the
threshold against the current state of your index (live documents). The
computed value of the cutoff threshold is correct; it is the comparison against
docFreq that isn't sound here, because docFreq doesn't account for deleted
documents. I don't quite understand why you perceive only one of those as
"correct" vs. "utter shit" and I don't think I want to explore this subject
further.
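
To make the numDocs vs. maxDoc point concrete, here is a small sketch with
made-up numbers. It is not code from Lucene or from the patch; the class and
method names are hypothetical. It assumes an index where many documents have
been deleted but not yet merged away, so a term's docFreq still counts them.

```java
final class DocFreqPctSketch {
    /** docFreq expressed as a percentage of the given document count. */
    static long pct(int docFreq, int docCount) {
        return 100L * docFreq / docCount;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000;  // all documents, including deleted ones
        int numDocs = 400;   // live documents only
        int docFreq = 500;   // per-term count, unaware of deletions

        // Against live documents the ratio can exceed 100%.
        System.out.println("vs numDocs: " + pct(docFreq, numDocs) + "%");
        // Against maxDoc it stays within 0-100%.
        System.out.println("vs maxDoc:  " + pct(docFreq, maxDoc) + "%");
    }
}
```

Either denominator gives a well-defined threshold; the mismatch only appears
when that threshold is compared with a docFreq that still includes deletions.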
Is it OK if I apply the overflow fix to 7.x and master, then create a new issue
for cutting over to maxDoc (everywhere in MLT) and apply that to master only?
If not, speak up.
> MoreLikeThis.setMaxDocFreqPct can easily int-overflow on larger indexes
> -----------------------------------------------------------------------
>
> Key: LUCENE-8221
> URL: https://issues.apache.org/jira/browse/LUCENE-8221
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Attachments: LUCENE-8221.patch
>
>
> {code}
> public void setMaxDocFreqPct(int maxPercentage) {
>   this.maxDocFreq = maxPercentage * ir.numDocs() / 100;
> }
> {code}
> The above overflows the integer range into negative numbers even on fairly
> small indexes (for maxPercentage = 75, it happens at just over 28 million
> documents).
> We should do the computation in long arithmetic so that it doesn't overflow,
> and validate the argument more strictly.
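
A minimal sketch of the overflow described above and one possible way to avoid
it. This is not the attached patch: the class and method names are
hypothetical, and the 0-100 range check is an assumption about what "stricter
validation" could look like.

```java
final class MaxDocFreqSketch {
    /** The original arithmetic: the int multiplication overflows before the division. */
    static int naive(int maxPercentage, int numDocs) {
        return maxPercentage * numDocs / 100;
    }

    /** Widens to long before multiplying and rejects out-of-range percentages. */
    static int safe(int maxPercentage, int numDocs) {
        if (maxPercentage < 0 || maxPercentage > 100) {
            throw new IllegalArgumentException(
                "maxPercentage must be in [0, 100], got: " + maxPercentage);
        }
        // With maxPercentage <= 100 the result never exceeds numDocs,
        // so the narrowing conversion cannot fail.
        return Math.toIntExact((long) maxPercentage * numDocs / 100);
    }

    public static void main(String[] args) {
        int numDocs = 30_000_000;
        System.out.println("naive: " + naive(75, numDocs)); // negative
        System.out.println("safe:  " + safe(75, numDocs));  // 22500000
    }
}
```

For maxPercentage = 75 the product first exceeds Integer.MAX_VALUE
(2,147,483,647) at 28,633,116 documents, which matches the "just over 28
million" figure in the description.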