[
https://issues.apache.org/jira/browse/LUCENE-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304050#comment-16304050
]
Adrien Grand commented on LUCENE-8087:
--------------------------------------
bq. Maybe it doesn't belong in the terms dict? Thinking about the poisonous
docs, that kinda implies we should be looking at skipdata or similar? I don't
know the new disjunction stuff well, but seems like instead of advance()'ing to
the first docid > N, you instead want advance() to work differently.
Agreed. I think this is needed anyway if we want to be able to speed up top-k
selection on term queries?
bq. seems hard without baking in the similarity's logic at index time.
I guess the only way to avoid recording similarity-specific information is to
record all competitive (freq,norm) pairs for every block of X documents. X
would likely need to be quite large since we would need to compute the score
for every pair to know the best score in the block.
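
To make "competitive (freq,norm) pairs" concrete, here is a minimal sketch.
It is not a Lucene API: the class and method names are hypothetical, and it
assumes that a larger norm means a longer document, that the score never
decreases when freq increases, and that it never increases when norm
increases. Under those assumptions a pair is competitive if no other pair in
the block has both a freq that is at least as high and a norm that is at
least as small.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not a Lucene API: keeps only the competitive pairs of a block. */
final class CompetitivePairs {

    /** One posting's term frequency and norm. Assumption: larger norm = longer document. */
    static final class FreqNorm {
        final int freq;
        final long norm;

        FreqNorm(int freq, long norm) {
            this.freq = freq;
            this.norm = norm;
        }
    }

    /**
     * Returns the pairs that are not dominated by another pair, assuming the
     * score is non-decreasing in freq and non-increasing in norm.
     */
    static List<FreqNorm> competitive(List<FreqNorm> block) {
        List<FreqNorm> sorted = new ArrayList<>(block);
        // Highest freq first; among equal freqs, smallest norm first.
        sorted.sort(Comparator.comparingInt((FreqNorm p) -> p.freq).reversed()
                .thenComparingLong(p -> p.norm));

        List<FreqNorm> result = new ArrayList<>();
        long smallestNormSoFar = Long.MAX_VALUE;
        for (FreqNorm p : sorted) {
            // Every pair already visited has freq >= p.freq, so p survives
            // only if its norm is strictly smaller than all of theirs.
            if (result.isEmpty() || p.norm < smallestNormSoFar) {
                result.add(p);
                smallestNormSoFar = p.norm;
            }
        }
        return result;
    }
}
```

The best score in a block is then the maximum of the similarity's score over
the surviving pairs only, so the similarity itself does not need to be known
at index time.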
> Record per-term max term frequencies
> ------------------------------------
>
> Key: LUCENE-8087
> URL: https://issues.apache.org/jira/browse/LUCENE-8087
> Project: Lucene - Core
> Issue Type: Wish
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-8087.patch
>
>
> I was mostly interested in doing that in order to get better score upper
> bounds for LUCENE-4100. However this doesn't help, at least with the tasks
> that we have for wikimedium10m. I dug into this a bit, and this is due to the fact
> that the upper bound is not much better if we can't make assumptions about
> the value of the length. Ideally we'd need something like the maximum term
> frequency for each norm value. I'll post the patch in case someone has
> another use-case for per-term max term frequencies.
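
As an illustration of why the maximum term frequency alone gives a weak
bound, the sketch below uses the textbook BM25 term-frequency component,
which is not necessarily the exact formula Lucene evaluates. The class name,
method names and the example lengths are hypothetical; k1 = 1.2 and b = 0.75
are the usual defaults. Without any information about the length, the bound
has to assume the shortest possible document.

```java
/** Hypothetical illustration, not Lucene code: textbook BM25 term-frequency saturation. */
final class Bm25Bound {
    static final double K1 = 1.2;  // usual default
    static final double B = 0.75;  // usual default

    /** Term-frequency part of BM25; the IDF factor is omitted because it is constant per term. */
    static double tfPart(double freq, double docLength, double avgDocLength) {
        double lengthNorm = K1 * (1 - B + B * docLength / avgDocLength);
        return freq / (freq + lengthNorm);
    }

    /** Bound when only the max freq is known: the length must be assumed minimal. */
    static double upperBound(int maxFreq, double minDocLength, double avgDocLength) {
        return tfPart(maxFreq, minDocLength, avgDocLength);
    }
}
```

For example, if the highest frequency of 50 occurs in a document that is ten
times longer than average, `upperBound(50, 1, 500)` is still much higher than
the real `tfPart(50, 5000, 500)`, and it is barely below the
frequency-independent limit of 1.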