[
https://issues.apache.org/jira/browse/LUCENE-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485187#comment-13485187
]
Adrien Grand commented on LUCENE-4509:
--------------------------------------
bq. How would this work with laaaarge documents that might be > 16KB in size?
Actually 16 kB is the minimum size of an uncompressed chunk of documents.
CompressingStoredFieldsWriter fills a buffer with documents until its size is
>= 16 kB, compresses it, and then flushes it to disk. If all documents are
larger than 16 kB, then every chunk will contain exactly one document.
It also means you could end up with a chunk made of 15 documents of 1 kB and
1 document of 256 kB. (In this case there is no performance problem for the
first 15 documents, since decompression stops as soon as enough data has been
decompressed.)
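To illustrate, here is a minimal sketch of the chunking behavior described above. This is not the actual CompressingStoredFieldsWriter code; the class and method names are made up, and compression/IO is omitted so only the grouping logic remains:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (NOT Lucene's actual code): documents are buffered
// until the pending bytes reach the chunk size (16 kB by default), at
// which point the chunk would be compressed and flushed to disk.
class ChunkingSketch {
    static final int CHUNK_SIZE = 16 * 1024;

    /** Groups doc IDs (0..n-1) into chunks based on their sizes in bytes. */
    static List<List<Integer>> chunkDocs(int[] docSizes) {
        List<List<Integer>> chunks = new ArrayList<>();
        List<Integer> pending = new ArrayList<>();
        int pendingBytes = 0;
        for (int i = 0; i < docSizes.length; i++) {
            pending.add(i);
            pendingBytes += docSizes[i];
            if (pendingBytes >= CHUNK_SIZE) { // flush: compress + write chunk
                chunks.add(pending);
                pending = new ArrayList<>();
                pendingBytes = 0;
            }
        }
        if (!pending.isEmpty()) chunks.add(pending); // trailing partial chunk
        return chunks;
    }
}
```

With this logic, 15 documents of 1 kB followed by one of 256 kB end up in a single 16-document chunk, and documents that are each larger than 16 kB get one chunk apiece.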
bq. Does this mean with the default CompressingStoredFieldsIndex setting that
now he pays 12-bytes/doc in RAM (because docsize > blocksize)? If so, lets
think of ways to optimize that case.
Probably less than 12 bytes. The default CompressingStoredFieldsIndex impl uses
two packed ints arrays of size numChunks (the number of chunks, <= numDocs).
The first array stores the doc ID of the first document of each chunk, while
the second array stores the start offset of that chunk in the fields data
file.
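As an illustration, here is a hypothetical sketch of how such a two-array index can locate a document: binary-search the first-doc-ID array to find the containing chunk, then read that chunk's file offset from the second array. Field and method names are made up for the example, not Lucene's actual API:

```java
import java.util.Arrays;

// Hypothetical sketch of the two-array chunk index described above.
class ChunkIndexSketch {
    final int[] firstDocIds;   // doc ID of each chunk's first document, ascending
    final long[] startOffsets; // start offset of each chunk in the fields data file

    ChunkIndexSketch(int[] firstDocIds, long[] startOffsets) {
        this.firstDocIds = firstDocIds;
        this.startOffsets = startOffsets;
    }

    /** Returns the index of the chunk containing docId. */
    int chunkOf(int docId) {
        int idx = Arrays.binarySearch(firstDocIds, docId);
        // On a miss, binarySearch returns -(insertionPoint) - 1;
        // the containing chunk is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Returns the file offset of the chunk containing docId. */
    long offsetOf(int docId) {
        return startOffsets[chunkOf(docId)];
    }
}
```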
So if your fields data file is fdtBytes bytes long, the actual memory usage is
~ {{numChunks * (ceil(log2(numDocs)) + ceil(log2(fdtBytes))) / 8}} bytes.
For example, with 10M documents of 16 kB each (fdtBytes ~= 160GB), we'd have
numChunks == numDocs and a memory usage per document of (24 + 38) / 8 = 7.75
bytes, i.e. ~ 77.5 MB of memory overall.
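A quick back-of-the-envelope check of that formula, assuming ceil(log2(n)) bits per packed-ints entry (a simplification of the real PackedInts machinery; the class below is just for the arithmetic):

```java
// Sanity check of the memory estimate for the default
// CompressingStoredFieldsIndex: numChunks entries of
// ceil(log2(numDocs)) + ceil(log2(fdtBytes)) bits each.
class FieldsIndexMemory {
    /** ceil(log2(x)) for x >= 1, via the bit length of x - 1. */
    static int ceilLog2(long x) {
        return 64 - Long.numberOfLeadingZeros(x - 1);
    }

    /** Approximate bytes of RAM used by the in-memory chunk index. */
    static double estimateBytes(long numChunks, long numDocs, long fdtBytes) {
        return numChunks * (ceilLog2(numDocs) + ceilLog2(fdtBytes)) / 8.0;
    }
}
```

Plugging in 10M docs and a ~160GB fields data file gives 10M * (24 + 38) / 8 = 77.5M bytes, matching the ~ 77.5 MB figure above.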
bq. 100GB of compressed stored fields == 6.25M index entries == 75MB RAM
Thanks for the figures, Yonik! Did you use RamUsageEstimator to compute the
amount of used memory?
> Make CompressingStoredFieldsFormat the new default StoredFieldsFormat impl
> --------------------------------------------------------------------------
>
> Key: LUCENE-4509
> URL: https://issues.apache.org/jira/browse/LUCENE-4509
> Project: Lucene - Core
> Issue Type: Wish
> Components: core/store
> Reporter: Adrien Grand
> Priority: Minor
>
> What would you think of making CompressingStoredFieldsFormat the new default
> StoredFieldsFormat?
> Stored fields compression has many benefits:
> - it makes the I/O cache work for us,
> - file-based index replication/backup becomes cheaper.
> Things to know:
> - even with incompressible data, there is less than 0.5% overhead with LZ4,
> - LZ4 compression requires ~ 16 kB of memory and LZ4 HC compression requires
> ~ 256 kB,
> - LZ4 decompression has almost no memory overhead,
> - on my low-end laptop, the LZ4 impl in Lucene decompresses at ~ 300 MB/s.
> I think we could use the same default parameters as in CompressingCodec :
> - LZ4 compression,
> - in-memory stored fields index that is very memory-efficient (less than 12
> bytes per block of compressed docs) and uses binary search to locate
> documents in the fields data file,
> - 16 kB blocks (small enough that there is no major slowdown when the
> whole index would fit into the I/O cache anyway, and large enough to provide
> interesting compression ratios; for example Robert got a 0.35 compression
> ratio with the geonames.org database).
> Any concerns?