[
https://issues.apache.org/jira/browse/LUCENE-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910072#action_12910072
]
Michael McCandless commented on LUCENE-2588:
--------------------------------------------
After I commit the simple renaming of standard codec's terms dicts
(LUCENE-2647), I plan to make this suffix-stripping opto private to
StandardCodec (I think by refactoring SimpleTermsIndexWriter to add a method
that can alter the indexed term before it's written).
Since StandardCodec hardwires the term sort to unicode order, the opto is safe
there.
In general, if a codec uses a different term sort (such as this test's codec)
it's conceivable a different opto could apply. EG I think this test could
prune suffix based on the term after the index term. But, it makes no sense to
spend time exploring this until a "real" use case arrives... this is just a
simple test to assert that a codec is in fact free to customize the sort order.
Also, there are other fun optos we could explore w/ the terms index. EG we
could "wiggle" the index term selection a bit, so it wouldn't be fixed to every
N terms, to try to find terms that are small after removing the useless suffix.
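A rough sketch of what such "wiggling" might look like, assuming terms are
byte[] sorted in unsigned binary order. The class, the method names and the
wiggle parameter are all hypothetical (nothing here is existing Lucene API):
around each regular position k*interval it looks at most wiggle terms to either
side and picks the term whose stripped form is shortest.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code.
final class WiggledIndexTermSelector {

    /** Number of leading bytes of term needed to tell it apart from prior (prior sorts before term). */
    static int prefixLength(byte[] prior, byte[] term) {
        int limit = Math.min(prior.length, term.length);
        for (int i = 0; i < limit; i++) {
            if (prior[i] != term[i]) {
                return i + 1;
            }
        }
        // prior is a prefix of term: one extra byte is enough.
        return Math.min(term.length, limit + 1);
    }

    /**
     * Returns the ordinals of the terms to index. The first term is always indexed;
     * after that, each regular position k*interval may move by up to wiggle terms
     * in either direction if that gives a shorter stripped index term.
     */
    static List<Integer> select(List<byte[]> sortedTerms, int interval, int wiggle) {
        // Requiring 2*wiggle < interval keeps neighbouring windows from overlapping.
        if (interval < 1 || wiggle < 0 || 2 * wiggle >= interval) {
            throw new IllegalArgumentException("need interval >= 1 and 0 <= 2*wiggle < interval");
        }
        List<Integer> chosen = new ArrayList<>();
        if (sortedTerms.isEmpty()) {
            return chosen;
        }
        chosen.add(0);
        for (int target = interval; target < sortedTerms.size(); target += interval) {
            int lo = target - wiggle; // always >= 1 given the check above
            int hi = Math.min(sortedTerms.size() - 1, target + wiggle);
            int best = target;
            int bestLen = prefixLength(sortedTerms.get(target - 1), sortedTerms.get(target));
            for (int i = lo; i <= hi; i++) {
                int len = prefixLength(sortedTerms.get(i - 1), sortedTerms.get(i));
                if (len < bestLen) {
                    best = i;
                    bestLen = len;
                }
            }
            chosen.add(best);
        }
        return chosen;
    }
}
```

With the terms aa, aab, aac, b, baaa, baab, baac, interval 4 and wiggle 1, this
picks ordinal 3 (b, which strips to 1 byte) instead of the regular ordinal 4
(baaa, which would need 2 bytes).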
Separately, we could choose index terms according to docFreq -- eg one simple
policy would be to plant an index term on term X if either 1) term X's docFreq
is over a threshold, or 2) it's been > N terms since the last indexed term.
This could be a powerful way to further reduce the RAM usage of the terms
index, because it'd ensure that high-cost terms (ie, many docs/freqs/positions
to visit) are in fact fast to look up. The low-freq terms can afford a higher
seek time since it'll be so fast to enum their docs.
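To make the docFreq policy concrete, here is a minimal sketch. The class name,
the docFreqThreshold and maxGap parameters, and the exact boundary conditions
(>= rather than >) are assumptions for illustration, not anything in the patch.
It is fed terms in sorted order and answers, per term, whether to plant an index
term there.

```java
// Hypothetical sketch, not Lucene code.
final class DocFreqIndexTermPolicy {
    private final int docFreqThreshold; // terms at or above this docFreq are always indexed
    private final int maxGap;           // never go more than this many terms without an index term
    private int termsSinceLastIndexed;
    private boolean seenAnyTerm;

    DocFreqIndexTermPolicy(int docFreqThreshold, int maxGap) {
        if (maxGap < 1) {
            throw new IllegalArgumentException("maxGap must be >= 1");
        }
        this.docFreqThreshold = docFreqThreshold;
        this.maxGap = maxGap;
    }

    /** Call once per term, in sorted order; returns true if this term should be indexed. */
    boolean shouldIndex(int docFreq) {
        boolean index;
        if (!seenAnyTerm) {
            // Always index the very first term so every lookup has a starting point.
            seenAnyTerm = true;
            index = true;
        } else {
            termsSinceLastIndexed++;
            index = docFreq >= docFreqThreshold || termsSinceLastIndexed >= maxGap;
        }
        if (index) {
            termsSinceLastIndexed = 0;
        }
        return index;
    }
}
```

For example, with a threshold of 10 and a max gap of 3, the docFreq sequence
1, 1, 50, 1, 1, 1, 1 gets index terms at positions 0 (first term), 2 (high
docFreq) and 5 (three terms since the last indexed one).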
> terms index should not store useless suffixes
> ---------------------------------------------
>
> Key: LUCENE-2588
> URL: https://issues.apache.org/jira/browse/LUCENE-2588
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2588.patch, LUCENE-2588.patch
>
>
> This idea came up when discussing w/ Robert how to improve our terms index...
> The terms dict index today simply grabs whatever term was at a 0 mod 128
> index (by default).
> But this is wasteful because you often don't need the suffix of the term at
> that point.
> EG if the 127th term is aa and the 128th (indexed) term is abcd123456789,
> instead of storing that full term you only need to store ab. The suffix is
> useless, and uses up RAM since we load the terms index into RAM.
> The patch is very simple. The optimization is particularly easy because
> terms are now byte[] and we sort in binary order.
> I tested on first 10M 1KB Wikipedia docs, and this reduces the terms index
> (tii) file from 3.9 MB -> 3.3 MB = 16% smaller (using StandardAnalyzer,
> indexing body field tokenized but title / date fields untokenized). I expect
> that on noisier terms dicts, especially ones w/ bad terms accidentally
> indexed, the savings will be even greater.
> In the future we could do crazier things. EG there's no real reason why the
> indexed terms must be regular (every N terms), so, we could instead pick
> terms more carefully, say "approximately" every N, but favor terms that have
> a smaller net prefix. We can also index more sparsely in regions where the
> net docFreq is lowish, since we can afford somewhat higher seek+scan time to
> these terms since enuming their docs will be much faster.
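For reference, the suffix stripping described in the issue can be sketched as
below, assuming terms are distinct byte[] values sorted in unsigned binary
order. The class and method names are made up for illustration and are not taken
from the attached patch. The idea is to keep only as many leading bytes of the
indexed term as are needed to still sort after the term just before it.

```java
import java.util.Arrays;

// Hypothetical sketch, not the code from LUCENE-2588.patch.
final class IndexTermPrefix {

    /**
     * Returns the shortest prefix of indexedTerm that differs from priorTerm.
     * Assumes priorTerm sorts strictly before indexedTerm in unsigned byte order.
     */
    static byte[] shortestDistinguishingPrefix(byte[] priorTerm, byte[] indexedTerm) {
        int limit = Math.min(priorTerm.length, indexedTerm.length);
        for (int i = 0; i < limit; i++) {
            if (priorTerm[i] != indexedTerm[i]) {
                // First differing byte: everything after it is useless suffix.
                return Arrays.copyOf(indexedTerm, i + 1);
            }
        }
        // priorTerm is a prefix of indexedTerm: keep one byte past the shared part.
        return Arrays.copyOf(indexedTerm, Math.min(indexedTerm.length, limit + 1));
    }
}
```

With the issue's own example, a prior term of aa and an indexed term of
abcd123456789 give the 2-byte prefix ab. Note this only holds for a binary sort
order; a codec with a different term sort would need a different rule.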