[
https://issues.apache.org/jira/browse/LUCENE-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-6115:
---------------------------------
Attachment: LUCENE-6115.patch
Here is a patch. Here are the differences with today:
- when doing random access, the header is always completely decoded (whereas we
previously sometimes stopped early)
- when merging, we make sure to only decompress a given block of documents a
single time
- checksums are verified up-front instead of on-the-fly (like most other
formats actually)
- I made the logic for large blocks more robust. Up to now, if you stored a
document so large that the block would grow beyond 2x the chunk size, the
document was split into exact 16KB slices in order to make decompression a bit
more memory-efficient. The reader used to duplicate the writer's logic
(if block_len >= 2 * chunk_size), but this information is now encoded in the
stream.
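To make the slicing decision concrete, here is a rough sketch (class and method names are made up for illustration, not Lucene's actual API) of splitting an oversized document into exact 16KB slices, with the trigger condition the reader used to re-derive:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the large-document slicing described above.
class SliceSketch {
    static final int SLICE_SIZE = 16 * 1024; // exact 16KB slices

    /** Writer-side trigger: slice when the block would grow beyond 2x the
     *  chunk size. The patch encodes this flag in the stream so the reader
     *  no longer has to duplicate this check. */
    static boolean shouldSlice(int blockLen, int chunkSize) {
        return blockLen >= 2 * chunkSize;
    }

    /** Split a document's serialized bytes into exact 16KB slices
     *  (the last slice may be shorter). */
    static List<byte[]> slice(byte[] doc) {
        List<byte[]> slices = new ArrayList<>();
        for (int off = 0; off < doc.length; off += SLICE_SIZE) {
            int len = Math.min(SLICE_SIZE, doc.length - off);
            byte[] s = new byte[len];
            System.arraycopy(doc, off, s, 0, len);
            slices.add(s);
        }
        return slices;
    }
}
```

Because each slice decompresses independently, a reader that only needs part of a huge document never has to hold the whole decompressed document in memory at once.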
I did some benchmarking and found no significant difference in indexing or
read speed, so having to read the whole header every time does not seem to
hurt (the bottleneck is likely decompressing documents).
I tried removing the specialized merging to see whether it was still needed,
and unfortunately it seems so: merging times were about 20% slower without it.
(In that case, specialized merging still decompresses and recompresses every
block; it only saves some decoding and reuses the serialized bytes of each
document directly.)
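The merge instance essentially amounts to caching the last decompressed block so that the sequential reads done during a merge decompress each block only once. A minimal sketch of that idea (the class and its decompressBlock callback are illustrative, not Lucene's real classes):

```java
import java.util.function.IntFunction;

// Illustrative sketch of a stateful merge-time reader: unlike the stateless
// random-access path, it remembers the last block it decompressed.
class MergeReaderSketch {
    private final IntFunction<byte[][]> decompressBlock; // blockId -> docs
    private final int docsPerBlock;
    private int cachedBlock = -1;
    private byte[][] cachedDocs;
    int decompressions = 0; // exposed for demonstration only

    MergeReaderSketch(IntFunction<byte[][]> decompressBlock, int docsPerBlock) {
        this.decompressBlock = decompressBlock;
        this.docsPerBlock = docsPerBlock;
    }

    /** Return the serialized bytes of a document, decompressing its block
     *  only if it is not the block already cached. */
    byte[] document(int docId) {
        int block = docId / docsPerBlock;
        if (block != cachedBlock) {
            cachedDocs = decompressBlock.apply(block);
            cachedBlock = block;
            decompressions++;
        }
        return cachedDocs[docId % docsPerBlock];
    }
}
```

With this shape, reading N documents of the same block costs one decompression instead of N, which is exactly where the quoted ~20% merge-time difference comes from.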
> Add getMergeInstance to CompressingStoredFieldsReader
> -----------------------------------------------------
>
> Key: LUCENE-6115
> URL: https://issues.apache.org/jira/browse/LUCENE-6115
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-6115.patch
>
>
> CompressingStoredFieldsReader is currently terrible at merging with different
> codecs or wrapped readers since it does not keep state. If you want to get
> 5 documents that come from the same block, you have to decode the block
> header and decompress the block 5 times. There are some optimizations so
> that if you want the 2nd doc of the block, decompression stops soon after
> the 2nd document, but that doesn't help much with merging since we want all
> documents.
> We should implement getMergeInstance and behave differently when merging:
> decompress everything up-front and then reuse the decompressed bytes for all
> documents of the block.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]