[ https://issues.apache.org/jira/browse/KAFKA-15312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

José Armando García Sancio resolved KAFKA-15312.
------------------------------------------------
    Resolution: Fixed

> FileRawSnapshotWriter must flush before atomic move
> ---------------------------------------------------
>
>                 Key: KAFKA-15312
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15312
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>            Reporter: José Armando García Sancio
>            Assignee: José Armando García Sancio
>            Priority: Major
>             Fix For: 3.3.3, 3.6.0, 3.4.2, 3.5.2
>
>
> On ext4 file systems it is possible for KRaft to create zero-length snapshot
> files. Not all file systems fsync data to disk on close. For KRaft to
> guarantee that the data has made it to disk before calling rename, it needs
> to make sure that the file has been fsynced.
> We have seen cases where the snapshot file has zero-length data on ext4 file
> systems.
> {quote} "Delayed allocation" means that the filesystem tries to delay the 
> allocation of physical disk blocks for written data for as long as possible. 
> This policy brings some important performance benefits. Many files are 
> short-lived; delayed allocation can keep the system from writing fleeting 
> temporary files to disk at all. And, for longer-lived files, delayed 
> allocation allows the kernel to accumulate more data and to allocate the 
> blocks for data contiguously, speeding up both the write and any subsequent 
> reads of that data. It's an important optimization which is found in most 
> contemporary filesystems.
> But, if blocks have not been allocated for a file, there is no need to write 
> them quickly as a security measure. Since the blocks do not yet exist, it is 
> not possible to read somebody else's data from them. So ext4 will not 
> (cannot) write out unallocated blocks as part of the next journal commit 
> cycle. Those blocks will, instead, wait until the kernel decides to flush 
> them out; at that point, physical blocks will be allocated on disk and the 
> data will be made persistent. The kernel doesn't like to let file data sit 
> unwritten for too long, but it can still take a minute or so (with the 
> default settings) for that data to be flushed - far longer than the five 
> seconds normally seen with ext3. And that is why a crash can cause the loss 
> of quite a bit more data when ext4 is being used. 
> {quote}
> from: [https://lwn.net/Articles/322823/]
> {quote}auto_da_alloc ( * ), noauto_da_alloc
> Many broken applications don't use fsync() when replacing existing files via 
> patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/ 
> rename("foo.new", "foo"), or worse yet, fd = open("foo", 
> O_TRUNC)/write(fd,..)/close(fd). If auto_da_alloc is enabled, ext4 will 
> detect the replace-via-rename and replace-via-truncate patterns and force 
> that any delayed allocation blocks are allocated such that at the next 
> journal commit, in the default data=ordered mode, the data blocks of the new 
> file are forced to disk before the rename() operation is committed. This 
> provides roughly the same level of guarantees as ext3, and avoids the 
> "zero-length" problem that can happen when a system crashes before the 
> delayed allocation blocks are forced to disk.
> {quote}
> from: [https://www.kernel.org/doc/html/latest/admin-guide/ext4.html]
>  
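The fix the issue title describes (flush the snapshot file before the atomic move) can be sketched in Java with java.nio. This is a minimal, hypothetical sketch, not the actual FileRawSnapshotWriter code: the class name DurableSnapshotWrite, the method writeDurably, and the ".part" temporary-file suffix are illustrative assumptions. FileChannel.force(true) is the step that pushes the file's data to the device before Files.move(..., ATOMIC_MOVE) publishes it under its final name. The parent-directory fsync at the end is an extra, Linux-oriented precaution for making the rename itself durable; it is not something the issue text asks for.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration of the write -> fsync -> atomic rename pattern.
public class DurableSnapshotWrite {

    // Writes data to a temporary sibling file, fsyncs it, then atomically
    // renames it onto target. The ".part" suffix is an assumption.
    static void writeDurably(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".part");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // Flush data and metadata to the device BEFORE the rename, so a
            // crash cannot leave a renamed but zero-length file behind.
            ch.force(true);
        }
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);

        // Optional: fsync the parent directory so the rename is durable too.
        // Opening a directory as a FileChannel works on Linux but may fail
        // on other platforms (e.g. Windows), so failures are ignored here.
        Path parent = target.toAbsolutePath().getParent();
        try (FileChannel dir = FileChannel.open(parent, StandardOpenOption.READ)) {
            dir.force(true);
        } catch (IOException ignored) {
            // Best effort only.
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("snapshot-demo");
        Path target = dir.resolve("example.checkpoint"); // hypothetical name
        writeDurably(target, "snapshot-bytes".getBytes(StandardCharsets.UTF_8));
        System.out.println(Files.size(target));
    }
}
```

Calling force() before the move addresses the window described in the LWN quote: by the time the rename is committed, the file's data blocks have already been allocated and written, so the new name can no longer point at an empty file.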



--
This message was sent by Atlassian Jira
(v8.20.10#820010)