Crazy. It would be helpful if you could provide a repro of this as a
small Java program one could run on a (small) NTFS partition (perhaps
hammering it over and over to trigger the effect?).

Dawid

On Tue, Nov 8, 2016 at 5:04 AM, Thomas Kappler
<[email protected]> wrote:
> Hi all,
>
>
>
> We’re occasionally observing corrupted indexes in production, on Windows
> Server. We tracked it down to the way NTFS behaves in case of partial
> writes.
>
>
>
> When the disk or the machine fails during a flush, it’s possible on NTFS that
> the file being written to has already been extended to its new length, but
> the new content has not yet reached the disk. For security reasons, NTFS
> returns all 0s when reading past the last successfully written point after
> the system restarts.
>
>
>
> Lucene's commit code relies on committing an updated .gen file as the last
> step of an index flush/update. In this case the file is there but contains
> only 0s, so Lucene cannot read it. A failure at this point therefore leaves
> the index in an unreadable state.
>
>
>
> We think that the safest approach, which is robust to reordered writes, is
> to consider a gen file with all zeroes the same as a non-existing gen file.
> This assumes that by the time the gen file is fsync'ed all other files have
> been flushed to disk explicitly. If that's not the case, then there's still
> exposure to reordered writes.
>
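
A minimal sketch of the check proposed above, assuming it would run before
the gen file is parsed. This is not Lucene's code: the class and method names
(ZeroGenCheck, isBlank, genFileUsable) are hypothetical, and treating an
empty file the same as a zero-filled one is an extra assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Lucene.
class ZeroGenCheck {

    /** Returns true if the file is empty or every byte in it is 0. */
    static boolean isBlank(Path file) throws IOException {
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] != 0) {
                        return false; // real content found
                    }
                }
            }
        }
        return true;
    }

    /** A missing or blank gen file is treated as "no gen file". */
    static boolean genFileUsable(Path genFile) throws IOException {
        return Files.exists(genFile) && !isBlank(genFile);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("gen-check");
        Path zeroed = dir.resolve("zeroed.gen"); // placeholder file name
        Files.write(zeroed, new byte[20]);       // 20 zero bytes
        System.out.println("usable: " + genFileUsable(zeroed));
    }
}
```

A caller would fall back to whatever it already does when the gen file is
absent. As the quoted paragraph notes, this only helps if all other index
files were durably written before the gen file.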
>
>
> I don’t have a repro at this point. Before digging deeper into this I wanted
> to see what the Lucene devs think. Does the proposed fix make sense? Any
> ideas on how to set up a reproducible test for this issue?
>
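
One possible starting point for a test, sketched under a clear assumption: it
does not reproduce the NTFS crash behaviour itself, it only fabricates the
on-disk state described above (file at full length, unpersisted tail reading
as 0s). The names TornWriteFixture and writeTorn, the file name, and the
payload are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical test fixture, not part of Lucene.
class TornWriteFixture {

    /**
     * Writes a file with the same length as `intended`, but keeps only the
     * first `persisted` bytes; the remainder is left as zeros.
     */
    static void writeTorn(Path file, byte[] intended, int persisted) throws IOException {
        if (persisted < 0 || persisted > intended.length) {
            throw new IllegalArgumentException("persisted out of range: " + persisted);
        }
        byte[] onDisk = new byte[intended.length]; // Java zero-initializes arrays
        System.arraycopy(intended, 0, onDisk, 0, persisted);
        Files.write(file, onDisk);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("torn-write");
        Path file = dir.resolve("torn.gen"); // placeholder file name
        byte[] intended = "hypothetical gen payload".getBytes(StandardCharsets.US_ASCII);

        writeTorn(file, intended, 0); // nothing persisted: file is all zeros

        byte[] read = Files.readAllBytes(file);
        boolean allZero = Arrays.equals(read, new byte[intended.length]);
        System.out.println("same length: " + (read.length == intended.length)
                + ", all zero: " + allZero);
    }
}
```

A test could overwrite the gen file of a freshly committed index this way and
then try to open the index, which would at least show how the reader reacts
to the zero-filled file without needing a real power failure.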
>
>
> We verified this on Elasticsearch 1.7.1 which uses Lucene 4.10.4. Are there
> significant changes to this area in newer Lucene versions?
>
>
>
> // Thomas
>
>
