[ https://issues.apache.org/jira/browse/LUCENE-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909717#action_12909717 ]

Simon Willnauer commented on LUCENE-2588:
-----------------------------------------

bq. Should we really change StandardCodec to support this [non-binary order]?
I'm not sure that we should, but we should at least document the limitation. 
People who work at that level do read doc strings - if they don't, let them be 
doomed. But if you run into the bug we had in LUCENE-2622, you will have a 
super hard time figuring out what is going on without knowing Lucene very, 
very well. 


bq. Can't we just fix the test not to use StandardCodec? I mean we aren't 
taking any feature away here. 

+1 I think we should fix this test ASAP, either by using byte sort order or by 
adding some MockCodec (as Robert has suggested). 


> terms index should not store useless suffixes
> ---------------------------------------------
>
>                 Key: LUCENE-2588
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2588
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2588.patch, LUCENE-2588.patch
>
>
> This idea came up when discussing w/ Robert how to improve our terms index...
> The terms dict index today simply grabs whatever term is at an index that is 
> 0 mod 128 (by default).
> But this is wasteful because you often don't need the suffix of the term at 
> that point.
> EG if the 127th term is aa and the 128th (indexed) term is abcd123456789, 
> instead of storing that full term you only need to store ab.  The suffix is 
> useless, and uses up RAM since we load the terms index into RAM.
> The patch is very simple.  The optimization is particularly easy because 
> terms are now byte[] and we sort in binary order.
> I tested on the first 10M 1KB Wikipedia docs, and this reduces the terms 
> index (tii) file from 3.9 MB -> 3.3 MB = 16% smaller (using StandardAnalyzer, 
> indexing the body field tokenized but the title / date fields untokenized).  
> I expect on noisier terms dicts, especially ones w/ bad terms accidentally 
> indexed, that the savings will be even greater.
> In the future we could do crazier things.  EG there's no real reason why the 
> indexed terms must be regular (every N terms), so, we could instead pick 
> terms more carefully, say "approximately" every N, but favor terms that have 
> a smaller net prefix.  We can also index more sparsely in regions where the 
> net docFreq is lowish, since we can afford somewhat higher seek+scan time to 
> these terms since enuming their docs will be much faster.
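The suffix trimming described in the issue can be sketched in a few lines: since terms are sorted byte[]s, an indexed term only needs to keep one byte past its common prefix with the previous term to still sort correctly between its neighbors. This is a minimal standalone illustration under that assumption, not Lucene's actual codec code; the class and method names here are hypothetical:

```java
public class IndexTermShortener {

    /**
     * Returns the shortest prefix of {@code current} that still distinguishes
     * it from {@code previous} in binary sort order: the length of their
     * common prefix, plus one byte (capped at the full term length).
     */
    static byte[] shortestDistinguishingPrefix(byte[] previous, byte[] current) {
        int limit = Math.min(previous.length, current.length);
        int common = 0;
        while (common < limit && previous[common] == current[common]) {
            common++;
        }
        // Keep one byte past the common prefix; if current merely extends
        // previous (e.g. "aa" -> "aab"), keep the whole term.
        int keep = Math.min(common + 1, current.length);
        byte[] prefix = new byte[keep];
        System.arraycopy(current, 0, prefix, 0, keep);
        return prefix;
    }

    public static void main(String[] args) {
        // The example from the issue: 127th term "aa", 128th term "abcd123456789".
        byte[] p = shortestDistinguishingPrefix("aa".getBytes(), "abcd123456789".getBytes());
        System.out.println(new String(p)); // prints "ab"
    }
}
```

Only the shortened prefix would be written to the in-RAM terms index, which is where the tii size savings reported above come from.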

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

