[
https://issues.apache.org/jira/browse/LUCENE-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-5602:
---------------------------------
Attachment: LUCENE-5602.patch
Thanks Robert for adding the version check. I think we should indeed never copy
raw bytes across versions.
The clone is necessary because otherwise the naive merge routine, which is used
e.g. when chunks contain deleted documents, would seek on the underlying
IndexInput without notifying the checksumming wrapper about it.
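To illustrate the problem, here is a minimal sketch. ByteInput and
ChecksummingInput are hypothetical stand-ins for an index input and a
checksumming wrapper, not Lucene classes, and the 16-byte "file" is made up.
It only counts how many bytes the checksum ends up covering when the wrapper
shares its input with the naive merge versus when it wraps a clone.

```java
import java.util.zip.CRC32;

// Hypothetical sketch: these classes only imitate the roles of an index input
// and a checksumming wrapper; they are not Lucene classes.
class ChecksumCloneSketch {

    /** Stand-in for an index input: shared bytes plus a private read position. */
    static final class ByteInput {
        final byte[] data;
        int pos;
        ByteInput(byte[] data, int pos) { this.data = data; this.pos = pos; }
        byte readByte() { return data[pos++]; }
        void seek(int target) { pos = target; }
        /** Like a clone: same bytes, independent position. */
        ByteInput copy() { return new ByteInput(data, pos); }
    }

    /** Wrapper that checksums only the bytes that are read through it. */
    static final class ChecksummingInput {
        final ByteInput in;
        final CRC32 crc = new CRC32();
        int bytesSeen;
        ChecksummingInput(ByteInput in) { this.in = in; }
        void readByte() { crc.update(in.readByte()); bytesSeen++; }
    }

    /**
     * Verifies half the file, lets a "naive merge" seek on the original input,
     * then verifies the rest. Returns how many bytes the checksum covered.
     */
    static int bytesVerified(byte[] data, boolean useClone) {
        ByteInput original = new ByteInput(data, 0);
        ChecksummingInput checked =
            new ChecksummingInput(useClone ? original.copy() : original);
        for (int i = 0; i < data.length / 2; i++) {
            checked.readByte();
        }
        // Naive merge: seeks and reads directly, bypassing the wrapper.
        original.seek(data.length - 1);
        original.readByte();
        // Resume verification from wherever the wrapped input now points.
        while (checked.in.pos < data.length) {
            checked.readByte();
        }
        return checked.bytesSeen;
    }

    public static void main(String[] args) {
        byte[] data = new byte[16];
        System.out.println("shared input: " + bytesVerified(data, false) + " bytes verified");
        System.out.println("cloned input: " + bytesVerified(data, true) + " bytes verified");
    }
}
```

With the shared input the wrapper silently skips the second half of the file
(8 of 16 bytes covered), while the clone keeps its own position and covers all
16 bytes.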
I tried to fix it by using a checksumming index input for this naive merge
routine as well, but that doesn't work: for every document it needs to seek
to the start offset of the block, which implies seeking backward if you need to
merge two term vectors that are in the same block.
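The backward-seek problem can be sketched as follows. ForwardChecksumInput is
a hypothetical forward-only checksumming reader (an assumption for
illustration, not the Lucene class), and the offsets are made up. A forward
seek is implemented by reading and checksumming the skipped bytes, so a
backward seek cannot be supported without corrupting the checksum.

```java
import java.util.zip.CRC32;

// Hypothetical sketch, not Lucene code.
class BackwardSeekSketch {

    /** Forward-only checksumming reader over a byte array. */
    static final class ForwardChecksumInput {
        private final byte[] data;
        private final CRC32 crc = new CRC32();
        private int pos;

        ForwardChecksumInput(byte[] data) { this.data = data; }

        /** Seeking forward means reading (and checksumming) the skipped bytes. */
        void seek(int target) {
            if (target < pos) {
                throw new IllegalStateException(
                    "cannot seek backwards: pos=" + pos + " target=" + target);
            }
            crc.update(data, pos, target - pos);
            pos = target;
        }
    }

    /**
     * Naive merge: for every document, seek to the start of its block, then
     * read through to the end of the document. Returns true if every seek
     * succeeded, false if a backward seek was attempted.
     */
    static boolean naiveMerge(byte[] data, int[] blockStart, int[] docEnd) {
        ForwardChecksumInput in = new ForwardChecksumInput(data);
        try {
            for (int doc = 0; doc < blockStart.length; doc++) {
                in.seek(blockStart[doc]); // always restart from the block start
                in.seek(docEnd[doc]);     // read up to the end of this document
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[64];
        // Two documents in different blocks: forward seeks only.
        System.out.println(naiveMerge(data, new int[] {0, 32}, new int[] {10, 40}));
        // Two documents in the same block: the second seek goes backward.
        System.out.println(naiveMerge(data, new int[] {0, 0}, new int[] {10, 20}));
    }
}
```

The first call succeeds, the second fails on the seek back to offset 0 of the
shared block.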
I added a minor modification to Robert's patch in order to make sure that the
index input of the vectors reader and the clone that is used for checksumming
are moved in parallel. Otherwise, if you stop using bulk merging after a few
documents (e.g. if deletions shift chunks in such a way that all of them need to
be rebuilt), the checksum input might need to read the whole file when checking
integrity.
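A rough sketch of the idea, under simplifying assumptions: the names
(ForwardChecksumInput, bytesReadByFinalCheck, bulkDocs) and the offsets are
hypothetical, and the actual merging of each document is left out. It compares
how many bytes the final integrity check still has to read when the checksum
input is only advanced during bulk merging versus when it is advanced for
every document.

```java
import java.util.zip.CRC32;

// Hypothetical sketch, not the code from the patch.
class ParallelChecksumSketch {

    /** Forward-only checksumming reader over a byte array. */
    static final class ForwardChecksumInput {
        private final byte[] data;
        private final CRC32 crc = new CRC32();
        private int pos;

        ForwardChecksumInput(byte[] data) { this.data = data; }

        int position() { return pos; }

        /** Seeking forward reads and checksums the skipped bytes. */
        void seek(int target) {
            if (target < pos) {
                throw new IllegalStateException("cannot seek backwards");
            }
            crc.update(data, pos, target - pos);
            pos = target;
        }
    }

    /**
     * Walks the documents in order. The first bulkDocs documents are assumed
     * to be bulk merged; the rest are merged naively through the main input.
     * Returns the number of bytes the final integrity check has to read.
     */
    static int bytesReadByFinalCheck(byte[] data, int[] startPointers,
                                     int bulkDocs, boolean moveInParallel) {
        ForwardChecksumInput checksumInput = new ForwardChecksumInput(data);
        for (int doc = 0; doc < startPointers.length; doc++) {
            boolean bulk = doc < bulkDocs;
            // Advance the checksum input only when it is behind this document.
            if ((bulk || moveInParallel)
                    && startPointers[doc] > checksumInput.position()) {
                checksumInput.seek(startPointers[doc]);
            }
            // The document itself would be merged here (omitted).
        }
        int remaining = data.length - checksumInput.position();
        checksumInput.seek(data.length); // final check reads whatever is left
        return remaining;
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];
        int[] startPointers = new int[10];
        for (int i = 0; i < startPointers.length; i++) {
            startPointers[i] = i * 10; // one document every 10 bytes
        }
        System.out.println(bytesReadByFinalCheck(data, startPointers, 2, false));
        System.out.println(bytesReadByFinalCheck(data, startPointers, 2, true));
    }
}
```

In this toy setup, bulk merging stops after two documents. Without the
parallel move the final check reads 90 of the 100 bytes in one go; with it,
only the last 10 bytes remain.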
> always verify term vectors on bulk merge
> ----------------------------------------
>
> Key: LUCENE-5602
> URL: https://issues.apache.org/jira/browse/LUCENE-5602
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Attachments: LUCENE-5602.patch, LUCENE-5602.patch
>
>
> Similar to LUCENE-5580: it's scary that we just copy bytes and don't verify
> anything. We should at least verify in the bulk merge case.
> This is just a quick hack: it would be nice, though scary, to remove the
> clone() of the index input...