[jira] Commented: (LUCENE-2829) improve termquery "pk lookup" performance

Robert Muir (JIRA) Wed, 22 Dec 2010 06:47:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974223#action_12974223
 ]


Robert Muir commented on LUCENE-2829:
-------------------------------------

bq. edit: and as robert previously pointed out, if we cached misses as well, 
then we could avoid needless seeks on segments that don't contain the term.

True, this is a good idea, just a little trickier:
* In trunk, we have TermsEnum.seek(BytesRef text, boolean useCache),
  defaulting to true.
* FilteredTermsEnum passes false here, so the MultiTermQueries don't populate
  the cache with garbage while enumerating (e.g. foo*). They populate it only
  explicitly at the end, with cacheTerm() (per-segment), for the terms that
  were actually accepted. They sum up their docFreq themselves to prevent the
  first wasted seek in TermQuery.
* So this solution would make MTQ worse: it would cause them to trash the
  caches in the second wasted seek (the docsenum), where they do not today,
  with negative entries for the segments where the term doesn't exist. Today
  they do this wasted seek, but they don't trash the cache here. The only
  solution to prevent that is the PerReaderTermState (or something equally
  complicated).
* We would have to look at other places where negative entries would hurt.
  For example, rebuilding spellcheck indexes uses this 'termExists()' method
  implemented with docFreq, so we would likely have to change spellcheck's
  code to use a TermsEnum and seek(term, false). Using a termsenum in
  parallel with the spellcheck dictionary would obviously be more efficient
  for the index-based spellcheck case (forget about caching) than
  docFreq()'ing every term, *but* we cannot assume the spellcheck
  "Dictionary" is actually in term order (imagine the File-based dictionary
  case), so we can't implement this today.
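
To make the trade-off in the list above concrete, here is a minimal,
self-contained sketch of a per-segment lookup cache that also stores misses.
It is not Lucene code: SegmentTermsSketch and its methods are hypothetical
names invented for illustration, and the only idea borrowed from the
discussion is a seek that takes a useCache flag. With useCache=true a miss is
remembered, so a repeated primary-key lookup skips the second seek; with
useCache=false an enumerating caller leaves the cache untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for one segment's term dictionary and its cache.
// None of these names are real Lucene APIs.
final class SegmentTermsSketch {
    private final Set<String> terms;
    // term -> TRUE (present) or FALSE (cached miss, i.e. a negative entry)
    private final Map<String, Boolean> cache = new HashMap<>();
    private int seeks = 0;

    SegmentTermsSketch(Set<String> terms) {
        this.terms = terms;
    }

    // Returns whether the term exists in this segment.
    // useCache=false neither reads nor writes the cache.
    boolean seek(String term, boolean useCache) {
        if (useCache) {
            Boolean cached = cache.get(term);
            if (cached != null) {
                return cached; // hit or cached miss: no real seek needed
            }
        }
        seeks++; // simulate the expensive term dictionary seek
        boolean found = terms.contains(term);
        if (useCache) {
            cache.put(term, found); // remembers misses as well as hits
        }
        return found;
    }

    int seeks() {
        return seeks;
    }

    int cacheSize() {
        return cache.size();
    }

    public static void main(String[] args) {
        SegmentTermsSketch segment =
            new SegmentTermsSketch(java.util.Collections.singleton("id1"));
        segment.seek("id2", true);   // real seek, miss gets cached
        segment.seek("id2", true);   // answered from the negative entry
        segment.seek("foo1", false); // enumeration-style seek, cache untouched
        System.out.println("seeks=" + segment.seeks()
            + " cacheSize=" + segment.cacheSize());
    }
}
```

The sketch also shows the downside raised above: any caller that seeks with
useCache=true for terms that are absent from a segment fills that segment's
cache with negative entries, which is why enumerating callers would need to
pass false.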

On 3.x I think it's slightly less complicated, as there is already a hack in
the cache to prevent sequential termsenums from trashing it (e.g. foo*), and
pretty much all the MTQs just enumerate sequentially anyway (except NRQ, which
doesn't enumerate many terms anyway, so it is likely not a problem).

But we would have to at least fix the spellcheck case there too I think.
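
The term-order caveat for spellcheck can be illustrated with a small sketch.
It assumes a hypothetical forward-only cursor over sorted index terms
(standing in for walking a terms enum in parallel with the dictionary);
ParallelWalkSketch and its method are invented names, not Lucene APIs, and
plain String ordering is used in place of the index's real term order. The
walk is cheap, but it only gives the right answer when the dictionary is
sorted the same way as the index terms.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene's spellcheck code.
final class ParallelWalkSketch {

    // Counts dictionary words that exist in sortedIndexTerms using a single
    // cursor that only moves forward. Correct only if 'dictionary' is sorted
    // in the same order as 'sortedIndexTerms'.
    static int countExistingForwardOnly(List<String> sortedIndexTerms,
                                        List<String> dictionary) {
        int cursor = 0;
        int found = 0;
        for (String word : dictionary) {
            // advance past index terms that sort before the current word
            while (cursor < sortedIndexTerms.size()
                    && sortedIndexTerms.get(cursor).compareTo(word) < 0) {
                cursor++;
            }
            if (cursor < sortedIndexTerms.size()
                    && sortedIndexTerms.get(cursor).equals(word)) {
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("apple", "banana", "cherry");
        // Sorted dictionary: both existing words are found.
        System.out.println(countExistingForwardOnly(
            index, Arrays.asList("apple", "cherry", "zebra")));
        // Unsorted dictionary: the cursor has already passed "apple",
        // so that word is missed.
        System.out.println(countExistingForwardOnly(
            index, Arrays.asList("cherry", "apple")));
    }
}
```

With an unsorted source such as a file-based dictionary, the forward-only walk
silently misses terms, so each word would need an independent lookup instead.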

Not saying I don't like your idea... just saying there's more work to do it.


> improve termquery "pk lookup" performance
> -----------------------------------------
>
>                 Key: LUCENE-2829
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2829
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Robert Muir
>         Attachments: LUCENE-2829.patch
>
>
> For things that are like primary keys and don't exist in some segments (worst 
> case is primary/unique key that only exists in 1)
> we do wasted seeks.
> While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
> concerned about whether we could ever backport that to 3.1, for example.
> This is a simpler solution here just to solve this one problem in 
> termquery... we could just revert it in trunk when we resolve LUCENE-2694,
> but I don't think we should leave things as they are in 3.x
