[jira] [Commented] (LUCENE-4509) Make CompressingStoredFieldsFormat the new default StoredFieldsFormat impl

Robert Muir (JIRA) Mon, 12 Nov 2012 19:03:16 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495889#comment-13495889
 ]


Robert Muir commented on LUCENE-4509:
-------------------------------------

Docs look good, +1 to commit.

A few suggestions:
* under "known limitations", maybe replace "documents" with "individual 
documents" to make it clear you are talking about 2 gigabyte documents and 
not files? I think someone was already a little confused about that.
* rather than repeating the formulas for signed vlong (zigzag), we could link 
to it? https://developers.google.com/protocol-buffers/docs/encoding#types
* separately, if we find ourselves using this more often, maybe we should 
just add it to DataOutput/Input (the vlong version would be enough). We 
already use this in kuromoji's ConnectionCosts.java too...
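
For reference, a minimal sketch of the zigzag idea as described on the protocol 
buffers page linked above: small negative and small positive values both map to 
small unsigned values, which then fit in few variable-length bytes. The class 
and method names here (ZigZagVLong, writeZLong, readZLong) are made up for 
illustration and are not existing DataOutput/DataInput methods; the byte layout 
assumed is the usual 7-bits-per-byte vlong with a continuation bit.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: hypothetical helper, not part of Lucene's DataOutput/DataInput.
final class ZigZagVLong {
    private ZigZagVLong() {}

    // Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    static long zigZagEncode(long l) {
        return (l >> 63) ^ (l << 1);
    }

    // Inverse of zigZagEncode.
    static long zigZagDecode(long l) {
        return (l >>> 1) ^ -(l & 1);
    }

    // Zigzag-encodes the value, then writes it 7 bits at a time,
    // setting the high bit on every byte except the last.
    static byte[] writeZLong(long value) {
        long v = zigZagEncode(value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    // Reads one value written by writeZLong from the start of the array.
    static long readZLong(byte[] bytes) {
        long v = 0;
        int shift = 0;
        for (byte b : bytes) {
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return zigZagDecode(v);
    }
}
```

With this sketch, -1 is written as a single byte, whereas a plain vlong would 
need the maximum number of bytes for any negative value.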

                
> Make CompressingStoredFieldsFormat the new default StoredFieldsFormat impl
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-4509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4509
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-4509.patch, LUCENE-4509.patch
>
>
> What would you think of making CompressingStoredFieldsFormat the new default 
> StoredFieldsFormat?
> Stored fields compression has many benefits:
>  - it makes the I/O cache work for us,
>  - file-based index replication/backup becomes cheaper.
> Things to know:
>  - even with incompressible data, there is less than 0.5% overhead with LZ4,
>  - LZ4 compression requires ~ 16 kB of memory and LZ4 HC compression requires 
> ~ 256 kB,
>  - LZ4 decompression has almost no memory overhead,
>  - on my low-end laptop, the LZ4 impl in Lucene decompresses at ~ 300 MB/s.
> I think we could use the same default parameters as in CompressingCodec :
>  - LZ4 compression,
>  - in-memory stored fields index that is very memory-efficient (less than 12 
> bytes per block of compressed docs) and uses binary search to locate 
> documents in the fields data file,
>  - 16 kB blocks (small enough that there is no major slowdown when the 
> whole index would fit into the I/O cache anyway, and large enough to provide 
> interesting compression ratios; for example, Robert got a 0.35 compression 
> ratio with the geonames.org database).
> Any concerns?
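
To make the "binary search to locate documents" point in the description above 
concrete, here is a rough sketch of such a lookup. It is not the actual index 
layout in the patch: the class BlockIndexSketch and its fields are hypothetical, 
and it uses plain int/long arrays for readability, which cost 12 bytes per 
block; getting below that, as the description states, would need a more compact 
encoding.

```java
// Illustrative only: hypothetical in-memory index over compressed blocks.
final class BlockIndexSketch {
    private final int[] docBases;       // first doc ID of each block, ascending
    private final long[] startPointers; // offset of each block in the fields data file

    BlockIndexSketch(int[] docBases, long[] startPointers) {
        if (docBases.length == 0 || docBases.length != startPointers.length) {
            throw new IllegalArgumentException("need one start pointer per block");
        }
        this.docBases = docBases;
        this.startPointers = startPointers;
    }

    // Returns the index of the last block whose first doc ID is <= docID.
    // Assumes docID >= docBases[0].
    int blockOf(int docID) {
        int lo = 0;
        int hi = docBases.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1; // round up so the loop always makes progress
            if (docBases[mid] <= docID) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    // Offset at which to start reading (and decompressing) the block holding docID.
    long startPointer(int docID) {
        return startPointers[blockOf(docID)];
    }
}
```

A reader would then seek to startPointer(docID), decompress that one block, and 
skip to the requested document inside it.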
