[
https://issues.apache.org/jira/browse/LUCENE-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413517#comment-16413517
]
Dawid Weiss commented on LUCENE-8221:
-------------------------------------
Here is what I think.
From an information retrieval point of view it's a fairly common threshold: a
percentage of documents present in the collection at the time. I didn't add
this method originally, but I understand its intent and it's clear to me. In
the end, this method computes and sets an absolute document frequency cutoff
value. It's simple and correlated with the number of documents present in the
index at the time when you make the call.
If the number of documents is 0, nothing really happens. What I'd consider
odd behavior is if the result of this method could fluctuate depending on how
many deletes or merges you had up to the point of invoking it... It'd confuse
me (and I guess others not familiar with Lucene indexing internals) a lot.
Finally, this issue is about fixing the overflow, really, not changing how it
works. If you'd like to change the implementation to maxDoc, can we discuss this
as part of another issue? (You can probably sense from the above paragraph that
I don't quite agree maxDoc is a better choice here; it seems counter-intuitive
to me, even if more consistent with docFreq use later on.)
> MoreLikeThis.setMaxDocFreqPct can easily int-overflow on larger indexes
> -----------------------------------------------------------------------
>
> Key: LUCENE-8221
> URL: https://issues.apache.org/jira/browse/LUCENE-8221
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Priority: Minor
> Attachments: LUCENE-8221.patch
>
>
> {code}
> public void setMaxDocFreqPct(int maxPercentage) {
>   this.maxDocFreq = maxPercentage * ir.numDocs() / 100;
> }
> {code}
> The above overflows integer range into negative numbers on even fairly small
> indexes (for maxPercentage = 75, it happens for just over 28 million
> documents).
> We should make the computations in long range so that they don't overflow and
> have stricter argument validation.
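
The overflow described above can be reproduced outside Lucene with plain integer arithmetic. What follows is only a minimal sketch, not the attached patch: the class and method names are hypothetical, and the [0, 100] range check is an assumption about what "stricter argument validation" might mean. The document count is the smallest one for which 75 * numDocs exceeds Integer.MAX_VALUE.

```java
class MaxDocFreqPctSketch {
    // Same arithmetic as the quoted method: the multiplication happens in int
    // and silently wraps around once the product exceeds Integer.MAX_VALUE.
    public static int cutoffInt(int maxPercentage, int numDocs) {
        return maxPercentage * numDocs / 100;
    }

    // Hypothetical fix: validate the percentage (assumed range [0, 100]),
    // widen to long before multiplying, then narrow the result back to int.
    public static int cutoffLong(int maxPercentage, int numDocs) {
        if (maxPercentage < 0 || maxPercentage > 100) {
            throw new IllegalArgumentException(
                "maxPercentage must be in [0, 100], got: " + maxPercentage);
        }
        long cutoff = (long) maxPercentage * numDocs / 100;
        // With a percentage of at most 100 the cutoff never exceeds numDocs,
        // so this narrowing conversion cannot fail.
        return Math.toIntExact(cutoff);
    }

    public static void main(String[] args) {
        int numDocs = 28_633_116; // 75 * 28_633_116 = 2_147_483_700 > Integer.MAX_VALUE
        System.out.println(cutoffInt(75, numDocs));  // prints -21474835
        System.out.println(cutoffLong(75, numDocs)); // prints 21474837
    }
}
```

One document fewer (28,633,115) still fits in int range, which matches the "just over 28 million" figure in the description.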