[ 
https://issues.apache.org/jira/browse/LUCENE-5602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-5602:
---------------------------------

    Attachment: LUCENE-5602.patch

Thanks Robert for adding the version check. I think we should indeed never copy 
raw bytes across versions.

The clone is necessary because otherwise the naive merge routine, which is 
used e.g. when chunks contain deleted documents, would seek on the underlying 
IndexInput without notifying the checksumming wrapper about it.

I tried to fix it by using a checksumming index input for this naive merge 
routine as well, but that doesn't work: for every document the routine seeks 
to the start offset of the block, which means seeking backward whenever you 
need to merge two term vectors that are in the same block.
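
To illustrate why a backward seek is a problem for a checksumming input, here 
is a minimal sketch. It is not Lucene's code: the class 
ForwardOnlyChecksumReader and its methods are hypothetical, and it works on an 
in-memory byte array with java.util.zip.CRC32. The point is only that a 
running checksum can absorb skipped bytes when moving forward, but cannot 
"un-read" bytes when moving backward.

```java
import java.util.zip.CRC32;

// Hypothetical illustration only; this is not Lucene's checksumming wrapper.
final class ForwardOnlyChecksumReader {
    private final byte[] data;
    private final CRC32 crc = new CRC32();
    private int pos = 0;

    ForwardOnlyChecksumReader(byte[] data) {
        this.data = data;
    }

    int position() {
        return pos;
    }

    // Every byte that is read is also fed to the running checksum.
    byte readByte() {
        byte b = data[pos++];
        crc.update(b);
        return b;
    }

    // Seeking forward is fine: the skipped bytes are fed to the checksum.
    // Seeking backward would mean hashing some bytes twice, so it is rejected.
    void seek(int target) {
        if (target < pos) {
            throw new IllegalStateException(
                "cannot seek backward: pos=" + pos + ", target=" + target);
        }
        if (target > data.length) {
            throw new IllegalArgumentException("target past end of data: " + target);
        }
        crc.update(data, pos, target - pos);
        pos = target;
    }

    // Checksum of all bytes in [0, position()).
    long checksum() {
        return crc.getValue();
    }
}
```

With such a reader, seeking back to the start of a block in order to decode a 
second document from the same block throws, which is the situation described 
above for the naive merge routine.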

I made a minor modification to Robert's patch to make sure that the index 
input of the vectors reader and the clone that is used for checksumming are 
moved in parallel. Otherwise, if you stop using bulk merging after a few 
documents (e.g. if deletions shift chunks in such a way that all of them need 
to be rebuilt), the checksum input might need to read the whole file when 
checking integrity.

> always verify term vectors on bulk merge
> ----------------------------------------
>
>                 Key: LUCENE-5602
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5602
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-5602.patch, LUCENE-5602.patch
>
>
> Similar to LUCENE-5580: it's scary that we just copy bytes and don't verify 
> anything. We should at least verify in the bulk merge case.
> This is just a quick hack: it would be nice, though scary, to remove the 
> clone() of the index input... 



