[
https://issues.apache.org/jira/browse/LUCENE-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910140#action_12910140
]
Robert Muir commented on LUCENE-2588:
-------------------------------------
{quote}
Also, there are other fun optimizations we could explore w/ the terms index. EG
we could "wiggle" the index term selection a bit, so it wouldn't be fixed to
every N, to try to find terms that are small after removing the useless suffix.
Separately, we could choose index terms according to docFreq - eg one simple
policy would be to plant an index term on term X if either 1) term X's docFreq
is over a threshold, or, 2) it's been > N terms since the last indexed term.
This could be a powerful way to even further reduce RAM usage of the terms
index, because it'd ensure that high-cost terms (ie, many docs/freqs/positions
to visit) are in fact fast to look up. The low-freq terms can afford a higher
seek time since it'll be so fast to enum their docs.
{quote}
It would be great to come up with a heuristic that balances all three of these:
selecting % 32 is silly if it gives you "abracadabra" when the previous term is
"a" and a little wiggle would give you a smaller index term (of course it also
depends on what the next index term would be, and on the docFreq optimization).
It sounds tricky, but right now we are selecting index terms with no basis at
all (essentially at random), and then trying to deal with bad selections by
trimming wasted suffixes, etc.
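A minimal Java sketch of that kind of "wiggle" heuristic (hypothetical names, not the actual Lucene API): pick an index term roughly every n positions, but let the pick move within a small window, favoring the candidate whose distinguishing prefix against its predecessor is shortest.

```java
import java.util.ArrayList;
import java.util.List;

public class WiggleIndexTermSelector {

    // Length of the shortest prefix of `term` that still distinguishes it
    // from `prev` in binary sort order (the "trim the useless suffix" idea).
    static int distinguishingPrefixLen(byte[] prev, byte[] term) {
        int i = 0;
        int lim = Math.min(prev.length, term.length);
        while (i < lim && prev[i] == term[i]) {
            i++;
        }
        return Math.min(i + 1, term.length);
    }

    // Pick roughly every nth term as an index term, but allow the pick to
    // "wiggle" within +/- window positions, favoring the candidate with the
    // shortest distinguishing prefix against its predecessor.
    static List<Integer> selectIndexTerms(List<byte[]> terms, int n, int window) {
        List<Integer> picks = new ArrayList<>();
        picks.add(0); // the first term is always indexed
        int target = n;
        while (target < terms.size()) {
            int lo = Math.max(target - window, picks.get(picks.size() - 1) + 1);
            int hi = Math.min(target + window, terms.size() - 1);
            int best = lo;
            int bestLen = Integer.MAX_VALUE;
            for (int i = lo; i <= hi; i++) {
                int len = distinguishingPrefixLen(terms.get(i - 1), terms.get(i));
                if (len < bestLen) {
                    bestLen = len;
                    best = i;
                }
            }
            picks.add(best);
            target = best + n;
        }
        return picks;
    }

    public static void main(String[] args) {
        // The example from the issue: previous term "aa", indexed term
        // "abcd123456789" -- only 2 bytes ("ab") need to be stored.
        System.out.println(distinguishingPrefixLen(
                "aa".getBytes(), "abcd123456789".getBytes())); // prints 2
    }
}
```

This doesn't yet fold in docFreq; a fuller version would also force a plant on high-docFreq candidates inside the window.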
> terms index should not store useless suffixes
> ---------------------------------------------
>
> Key: LUCENE-2588
> URL: https://issues.apache.org/jira/browse/LUCENE-2588
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-2588.patch, LUCENE-2588.patch
>
>
> This idea came up when discussing w/ Robert how to improve our terms index...
> The terms dict index today simply grabs whatever term was at a 0 mod 128
> index (by default).
> But this is wasteful because you often don't need the suffix of the term at
> that point.
> EG if the 127th term is aa and the 128th (indexed) term is abcd123456789,
> instead of storing that full term you only need to store ab. The suffix is
> useless, and uses up RAM since we load the terms index into RAM.
> The patch is very simple. The optimization is particularly easy because
> terms are now byte[] and we sort in binary order.
> I tested on the first 10M 1KB Wikipedia docs, and this reduces the terms
> index (tii) file from 3.9 MB -> 3.3 MB = 16% smaller (using StandardAnalyzer,
> indexing the body field tokenized but the title / date fields untokenized). I
> expect the savings will be even greater on noisier terms dicts, especially
> ones w/ bad terms accidentally indexed.
> In the future we could do crazier things. EG there's no real reason why the
> indexed terms must be regular (every N terms), so we could instead pick terms
> more carefully, say "approximately" every N, but favor terms that have a
> smaller net prefix. We can also index more sparsely in regions where the net
> docFreq is lowish, since we can afford somewhat higher seek+scan time for
> these terms since enuming their docs will be much faster.
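The docFreq-based policy described above could look roughly like this (a hypothetical illustration, not code from the patch): plant an index term when a term's docFreq crosses a threshold, or when maxGap terms have gone by since the last indexed term.

```java
import java.util.ArrayList;
import java.util.List;

public class DocFreqIndexTermPolicy {

    // Index term i if its docFreq is over `threshold` (expensive terms must
    // be fast to find), or if `maxGap` terms have gone by since the last
    // indexed term (bounding worst-case scan length). The first term is
    // always indexed.
    static List<Integer> selectIndexTerms(int[] docFreqs, int threshold, int maxGap) {
        List<Integer> picks = new ArrayList<>();
        int last = -1;
        for (int i = 0; i < docFreqs.length; i++) {
            if (i == 0 || docFreqs[i] > threshold || i - last >= maxGap) {
                picks.add(i);
                last = i;
            }
        }
        return picks;
    }

    public static void main(String[] args) {
        // The two high-docFreq terms get planted immediately; the low-freq
        // terms between them only get an entry when the gap forces one.
        int[] docFreqs = {5, 100, 3, 2, 50, 1};
        System.out.println(selectIndexTerms(docFreqs, 40, 3)); // [0, 1, 4]
    }
}
```

The trade-off is exactly the one described: sparse indexing over low-docFreq regions costs a longer scan to reach a term, but enuming that term's few docs is cheap, so total seek+enum time stays balanced.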
--
This message is automatically generated by JIRA.