[jira] [Commented] (LUCENE-4509) Make CompressingStoredFieldsFormat the new default StoredFieldsFormat impl

Adrien Grand (JIRA) Fri, 26 Oct 2012 14:29:14 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485211#comment-13485211
 ]


Adrien Grand commented on LUCENE-4509:
--------------------------------------

bq. Well you say you use a separate packed ints structure for the offsets 
right? so these would all be zero?

These are absolute offsets in the fields data file. For example, to look up a 
document, the lookup first performs a binary search in the first array (the one 
that contains the first document ID of every chunk). The resulting index is 
then used to read the start offset of the chunk of compressed documents from 
the second array. The data at this offset in the fields data file starts with a 
packed ints array that stores the uncompressed length of every document in the 
chunk, followed by the compressed data. I'll add file format docs soon...
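
The lookup described above can be sketched roughly as follows. This is only an 
illustration under stated assumptions, not the actual Lucene code: the class 
name ChunkIndexSketch, its fields and its methods are hypothetical, and plain 
Java arrays stand in for the packed ints structures. The offsetInChunk helper 
additionally assumes that the per-document uncompressed lengths are summed to 
find where a document starts inside the uncompressed chunk.

```java
import java.util.Arrays;

/** Hypothetical sketch of the two-array chunk index; not Lucene's real classes. */
public class ChunkIndexSketch {
    // First array: first document ID of every chunk, in ascending order.
    private final int[] firstDocIds;
    // Second array: absolute start offset of every chunk in the fields data file.
    private final long[] startOffsets;

    public ChunkIndexSketch(int[] firstDocIds, long[] startOffsets) {
        if (firstDocIds.length != startOffsets.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.firstDocIds = firstDocIds;
        this.startOffsets = startOffsets;
    }

    /** Index of the last chunk whose first document ID is <= docId. */
    public int chunkIndex(int docId) {
        int idx = Arrays.binarySearch(firstDocIds, docId);
        if (idx < 0) {
            // Not an exact match: binarySearch returns -(insertionPoint) - 1,
            // and the containing chunk is the one just before the insertion point.
            idx = -idx - 2;
        }
        if (idx < 0) {
            throw new IllegalArgumentException("docId precedes the first chunk: " + docId);
        }
        return idx;
    }

    /** Absolute offset in the fields data file at which the chunk holding docId starts. */
    public long startOffset(int docId) {
        return startOffsets[chunkIndex(docId)];
    }

    /** Position of docId within its chunk (0 for the chunk's first document). */
    public int docIndexInChunk(int docId) {
        return docId - firstDocIds[chunkIndex(docId)];
    }

    /**
     * Assumption: given the uncompressed lengths stored at the start of the chunk,
     * a document's offset in the uncompressed data is the sum of the lengths of
     * the documents that precede it in the chunk.
     */
    public static int offsetInChunk(int[] docLengths, int docIndexInChunk) {
        int offset = 0;
        for (int i = 0; i < docIndexInChunk; i++) {
            offset += docLengths[i];
        }
        return offset;
    }
}
```

With three chunks starting at document IDs 0, 10 and 25, a lookup for document 
12 resolves to the second chunk, and its position within that chunk is 2.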
                
> Make CompressingStoredFieldsFormat the new default StoredFieldsFormat impl
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-4509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4509
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: core/store
>            Reporter: Adrien Grand
>            Priority: Minor
>
> What would you think of making CompressingStoredFieldsFormat the new default 
> StoredFieldsFormat?
> Stored fields compression has many benefits:
>  - it makes the I/O cache work for us,
>  - file-based index replication/backup becomes cheaper.
> Things to know:
>  - even with incompressible data, there is less than 0.5% overhead with LZ4,
>  - LZ4 compression requires ~ 16kB of memory and LZ4 HC compression requires 
> ~ 256kB,
>  - LZ4 uncompression has almost no memory overhead,
>  - on my low-end laptop, the LZ4 impl in Lucene uncompresses at ~ 300 MB/s.
> I think we could use the same default parameters as in CompressingCodec:
>  - LZ4 compression,
>  - in-memory stored fields index that is very memory-efficient (less than 12 
> bytes per block of compressed docs) and uses binary search to locate 
> documents in the fields data file,
>  - 16 kB blocks (small enough so that there is no major slowdown when the 
> whole index would fit into the I/O cache anyway, and large enough to provide 
> interesting compression ratios; for example Robert got a 0.35 compression 
> ratio with the geonames.org database).
> Any concerns?
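
To put the index figures above in perspective, here is a rough upper-bound 
calculation. It is a sketch only: the class and method names are hypothetical, 
and it assumes that each block covers about 16 kB of uncompressed documents, 
which the description does not state explicitly. The 12 bytes per block is the 
upper bound quoted in the issue.

```java
/** Hypothetical back-of-the-envelope estimate; not part of Lucene. */
public class IndexMemoryEstimate {
    // Block size quoted in the issue (assumed here to be uncompressed bytes).
    static final long BLOCK_SIZE_BYTES = 16 * 1024;
    // Upper bound on index memory per block quoted in the issue.
    static final long MAX_INDEX_BYTES_PER_BLOCK = 12;

    /** Number of blocks needed for the given amount of stored fields data, rounded up. */
    static long blockCount(long uncompressedBytes) {
        return (uncompressedBytes + BLOCK_SIZE_BYTES - 1) / BLOCK_SIZE_BYTES;
    }

    /** Upper bound, in bytes, on the in-memory stored fields index. */
    static long indexMemoryUpperBound(long uncompressedBytes) {
        return blockCount(uncompressedBytes) * MAX_INDEX_BYTES_PER_BLOCK;
    }
}
```

Under these assumptions, 10 GiB of stored fields would need 655,360 blocks and 
at most 7,864,320 bytes (about 7.5 MiB) of index memory.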

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

