[
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707709#comment-13707709
]
Han Jiang commented on LUCENE-3069:
-----------------------------------
I ran CheckIndex on wikimediumall.trunk.Lucene41.nd33.3326M.
Here is the bit-width summary for the "body" field:
||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|25| 0 | 0| 0|
|26| 0 | 0| 0|
|27| 0 | 0| 0|
|28| 0 | 0| 0|
|29| 0 | 0| 0|
|30| 0 | 0| 0|
|31| 0 | 0| 0|
|32| 0 | 0| 0|
So 66.4% of the terms have df==1, and 78.5% have df==ttf.
Taking the different bit widths into account, the new df+ttf encoding
saves a total of 57.3MB out of 148.7MB, using the following estimate:
{noformat}
old_size = col[2] * vIntByteSize(rownumber) + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber)
{noformat}
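The estimate above can be reproduced from the table with a short script. This is only a sketch under a few assumptions that the comment does not spell out: vIntByteSize(n) is taken to be ceil(n / 7) (a VInt byte carries 7 payload bits), col[1], col[2], col[3] are read as the #(df==ttf), #df and #ttf columns, and 1MB is taken as 10^6 bytes. The names ROWS, vint_byte_size and estimate are made up for this illustration.

```python
# Rows copied from the "body" field table: (bit width, #(df==ttf), #df, #ttf).
# Rows 25..32 are all zero and are left out.
ROWS = [
    (1, 43532656, 48860170, 43532656),
    (2, 10328824, 13979539, 16200377),
    (3, 2682453, 5032450, 6532755),
    (4, 836109, 2471794, 3134437),
    (5, 262696, 1324704, 1718862),
    (6, 86487, 755797, 990563),
    (7, 29276, 442974, 571996),
    (8, 11257, 263874, 339382),
    (9, 4627, 161402, 205662),
    (10, 2060, 102198, 128034),
    (11, 979, 63955, 79531),
    (12, 386, 39377, 48805),
    (13, 170, 24321, 30113),
    (14, 65, 14686, 18437),
    (15, 10, 9055, 10918),
    (16, 2, 5229, 6821),
    (17, 0, 2669, 3595),
    (18, 0, 1312, 1897),
    (19, 0, 696, 914),
    (20, 0, 209, 509),
    (21, 0, 44, 148),
    (22, 0, 4, 38),
    (23, 0, 0, 8),
    (24, 0, 0, 1),
]


def vint_byte_size(bits: int) -> int:
    """Bytes a VInt needs for a value of this bit width (assumed: ceil(bits / 7))."""
    return max(1, -(-bits // 7))


def estimate(rows):
    """Return (old_size, new_size) in bytes, following the two formulas above."""
    old_size = sum((df + ttf) * vint_byte_size(bit) for bit, _eq, df, ttf in rows)
    # New scheme: df costs one extra bit, ttf is dropped whenever df == ttf.
    new_size = sum(
        df * vint_byte_size(bit + 1) + (ttf - eq) * vint_byte_size(bit)
        for bit, eq, df, ttf in rows
    )
    return old_size, new_size


old_size, new_size = estimate(ROWS)
total_terms = sum(df for _bit, _eq, df, _ttf in ROWS)
equal_terms = sum(eq for _bit, eq, _df, _ttf in ROWS)

print(f"df==1:   {ROWS[0][2] / total_terms:.1%}")   # 66.4%
print(f"df==ttf: {equal_terms / total_terms:.1%}")  # 78.5%
print(f"old: {old_size / 1e6:.1f}MB, saved: {(old_size - new_size) / 1e6:.1f}MB")
# old: 148.7MB, saved: 57.3MB
```

Under these assumptions the script gives the same 66.4%, 78.5%, 148.7MB and 57.3MB figures as quoted above.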
By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader.
When we know that df==ttf, we can be sure that every in-doc frq is 1.
For example, when the bit width ranges from 2 to 8 (inclusive), df is not
large enough to create ForBlocks, so we have to VInt-encode each in-doc freq.
For this 'body' field, I think we can reduce the index size by about 67.5MB
(here I only consider the vInt block, since a 1-bit ForBlock is usually small).
For all the fields in wikimediumall, we can save 60.8MB from 245.2MB (for
df+ttf only), while the vInt frq block we can omit from PBF is about 95.8MB,
I suppose.
I'll test this later.
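To make the frq-omission idea concrete, here is a toy round trip. It is not the Lucene41 postings format and not the patch: the layout (a header VInt holding df with a low-order "df==ttf" flag, then doc deltas, with a VInt freq after each delta only when the flag is clear) and the names encode_term and decode_term are assumptions made for this sketch. It relies only on the fact that every in-doc freq is at least 1, so df==ttf means all freqs are 1.

```python
def write_vint(out: bytearray, value: int) -> None:
    """Append value as a VInt: 7 payload bits per byte, high bit = "more bytes"."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_vint(buf: bytes, pos: int):
    """Read one VInt starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b < 0x80:
            return value, pos
        shift += 7


def encode_term(doc_ids, freqs) -> bytes:
    """Toy encoding: skip ttf and all per-doc freqs when df == ttf."""
    df, ttf = len(doc_ids), sum(freqs)
    all_ones = df == ttf  # every freq is >= 1, so equality means all are 1
    out = bytearray()
    write_vint(out, (df << 1) | (1 if all_ones else 0))
    if not all_ones:
        write_vint(out, ttf)
    prev = 0
    for doc, freq in zip(doc_ids, freqs):
        write_vint(out, doc - prev)  # doc ids are delta-encoded
        prev = doc
        if not all_ones:
            write_vint(out, freq)
    return bytes(out)


def decode_term(buf: bytes):
    """Inverse of encode_term; returns (doc_ids, freqs, ttf)."""
    header, pos = read_vint(buf, 0)
    df, all_ones = header >> 1, bool(header & 1)
    ttf = df
    if not all_ones:
        ttf, pos = read_vint(buf, pos)
    doc_ids, freqs, prev = [], [], 0
    for _ in range(df):
        delta, pos = read_vint(buf, pos)
        prev += delta
        doc_ids.append(prev)
        freq = 1  # implied when df == ttf
        if not all_ones:
            freq, pos = read_vint(buf, pos)
        freqs.append(freq)
    return doc_ids, freqs, ttf


same = encode_term([3, 10, 42], [1, 1, 1])  # df == ttf: no ttf, no freqs
diff = encode_term([3, 10, 42], [1, 2, 1])  # df != ttf: ttf and freqs written
print(len(same), len(diff))
```

In this toy layout the df==ttf term costs one header byte plus one byte per doc delta, while the other term pays an extra VInt for ttf and one per document for its freq.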
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Simon Willnauer
> Assignee: Han Jiang
> Labels: gsoc2013
> Fix For: 4.4
>
> Attachments: example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a
> delta codec file for scanning to terms. Some environments have enough memory
> available to keep the entire FST based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds a FST from the entire term
> not just the delta.