[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974223#action_12974223 ]
Robert Muir commented on LUCENE-2829:
-------------------------------------
bq. edit: and as robert previously pointed out, if we cached misses as well,
then we could avoid needless seeks on segments that don't contain the term.
True, this is a good idea, just a little trickier:
* In trunk, we have TermsEnum.seek(BytesRef text, boolean useCache), with
  useCache defaulting to true.
* FilteredTermsEnum passes false here, so the MultiTermQueries (MTQ) don't
  populate the cache with garbage while enumerating (e.g. foo*), only
  explicitly at the end with cacheTerm() (per-segment) for the terms that were
  actually accepted. They sum up their docFreq themselves to prevent the first
  wasted seek in TermQuery.
* So this solution would make MTQ worse, as it would cause them to trash the
  caches in the second wasted seek (the DocsEnum), where they do not today,
  with negative entries for the segments where the term doesn't exist. Today
  they do this wasted seek, but they don't trash the cache here. The only
  solution to prevent that is the PerReaderTermState (or something equally
  complicated).
* We would have to look at other places where negative entries would hurt. For
  example, rebuilding spellcheck indexes uses this 'termExists()' method
  implemented with docFreq. So we would likely have to change spellcheck's
  code to use a TermsEnum and seek(term, false)... Using a TermsEnum in
  parallel with the spellcheck dictionary would obviously be more efficient
  for the index-based spellcheck case (forget about caching) versus
  docFreq()'ing every term... *but* we cannot assume the spellcheck
  "Dictionary" is actually in term order (imagine the File-based dictionary
  case), so we can't implement this today.
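To make the cache-trashing concern concrete, here is a toy model of one segment with a small LRU lookup cache that also stores misses. It is only a sketch under stated assumptions: ToySegment, addTerm, isCached and the String-keyed seek(term, useCache) are hypothetical names for illustration, not Lucene's classes or its real cache implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Toy stand-in for one segment's term dictionary plus an LRU lookup cache. Not Lucene code. */
class ToySegment {
    private final TreeMap<String, Integer> termToDocFreq = new TreeMap<String, Integer>();
    // Access-ordered map used as an LRU; Optional.empty() marks a cached miss (negative entry).
    private final LinkedHashMap<String, Optional<Integer>> cache;
    int dictionarySeeks = 0; // counts real (uncached) dictionary lookups

    ToySegment(final int capacity) {
        cache = new LinkedHashMap<String, Optional<Integer>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Optional<Integer>> eldest) {
                return size() > capacity; // evict the least recently used entry
            }
        };
    }

    void addTerm(String term, int docFreq) {
        termToDocFreq.put(term, docFreq);
    }

    /** Returns the docFreq, or 0 if the term is absent. Reads and fills the cache only if useCache is true. */
    int seek(String term, boolean useCache) {
        if (useCache) {
            Optional<Integer> cached = cache.get(term);
            if (cached != null) {
                return cached.orElse(0); // hit, possibly a cached miss
            }
        }
        dictionarySeeks++;
        Integer docFreq = termToDocFreq.get(term);
        if (useCache) {
            cache.put(term, Optional.ofNullable(docFreq)); // misses are stored too
        }
        return docFreq == null ? 0 : docFreq;
    }

    boolean isCached(String term) {
        return cache.containsKey(term);
    }

    public static void main(String[] args) {
        ToySegment seg = new ToySegment(2);
        seg.addTerm("id42", 1);
        seg.seek("id7", true);   // miss: one real seek, negative entry stored
        seg.seek("id7", true);   // answered from the negative entry
        System.out.println("real seeks after two lookups of a missing key: " + seg.dictionarySeeks);
        seg.seek("id42", true);  // hot entry
        seg.seek("foo1", true);  // enumeration-style lookups with caching on
        seg.seek("foo2", true);
        System.out.println("id42 still cached: " + seg.isCached("id42"));
    }
}
```

In this model the negative entry saves the repeated seek for a missing key, but a run of lookups with useCache=true for terms the segment does not contain pushes the useful entries out. The same run with useCache=false leaves the cache untouched, which is the behaviour the bullets above say FilteredTermsEnum relies on.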
On 3.x I think it's slightly less complicated, as there is already a hack in
the cache to prevent sequential term enumerations from trashing it (e.g. foo*),
and pretty much all the MTQs just enumerate sequentially anyway... (except NRQ,
which doesn't enumerate many terms anyway, so it's likely not a problem).
But we would have to at least fix the spellcheck case there too, I think.
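For the spellcheck case, the difference between probing the index once per word and walking a term enumeration in parallel with the dictionary can be sketched as below. This is a toy model, not the spellchecker's code: SpellcheckExistsSketch, missingByLookup and missingByMerge are hypothetical names, a SortedSet<String> stands in for the index terms, and Java's String ordering stands in for the index's term order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy comparison of per-word existence checks versus a parallel walk. Not Lucene code. */
class SpellcheckExistsSketch {

    /** One index probe per word; correct for any dictionary order. */
    static List<String> missingByLookup(Iterable<String> dictionary, SortedSet<String> indexTerms) {
        List<String> missing = new ArrayList<String>();
        for (String word : dictionary) {
            if (!indexTerms.contains(word)) {
                missing.add(word);
            }
        }
        return missing;
    }

    /** Single forward pass over both inputs; only correct if the dictionary is in term order. */
    static List<String> missingByMerge(Iterable<String> sortedDictionary, SortedSet<String> indexTerms) {
        List<String> missing = new ArrayList<String>();
        Iterator<String> terms = indexTerms.iterator();
        String current = terms.hasNext() ? terms.next() : null;
        for (String word : sortedDictionary) {
            // Advance the term cursor until it reaches or passes the word; it never moves back.
            while (current != null && current.compareTo(word) < 0) {
                current = terms.hasNext() ? terms.next() : null;
            }
            if (current == null || !current.equals(word)) {
                missing.add(word);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        SortedSet<String> index = new TreeSet<String>(Arrays.asList("apple", "banana", "cherry"));
        List<String> sorted = Arrays.asList("apple", "blueberry", "cherry");
        List<String> unsorted = Arrays.asList("cherry", "apple");
        System.out.println("sorted, merge:    " + missingByMerge(sorted, index));
        System.out.println("unsorted, lookup: " + missingByLookup(unsorted, index));
        System.out.println("unsorted, merge:  " + missingByMerge(unsorted, index));
    }
}
```

With a dictionary that is not in term order, the forward-only cursor has already passed "apple" by the time the word arrives, so the parallel walk wrongly reports it as missing. That is the reason the parallel approach cannot be assumed safe for something like a file-based dictionary.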
Not saying I don't like your idea... just saying there's more work to do it.
> improve termquery "pk lookup" performance
> -----------------------------------------
>
> Key: LUCENE-2829
> URL: https://issues.apache.org/jira/browse/LUCENE-2829
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Robert Muir
> Attachments: LUCENE-2829.patch
>
>
> For things that are like primary keys and don't exist in some segments (worst
> case is primary/unique key that only exists in 1)
> we do wasted seeks.
> While LUCENE-2694 tries to solve some of this issue with TermState, I'm
> concerned about whether we could ever backport that to 3.1, for example.
> This is a simpler solution here just to solve this one problem in
> termquery... we could just revert it in trunk when we resolve LUCENE-2694,
> but I don't think we should leave things as they are in 3.x
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.