[ https://issues.apache.org/jira/browse/KAFKA-15312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
José Armando García Sancio resolved KAFKA-15312.
------------------------------------------------
    Resolution: Fixed

> FileRawSnapshotWriter must flush before atomic move
> ---------------------------------------------------
>
>                 Key: KAFKA-15312
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15312
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>            Reporter: José Armando García Sancio
>            Assignee: José Armando García Sancio
>            Priority: Major
>             Fix For: 3.3.3, 3.6.0, 3.4.2, 3.5.2
>
>
> On ext4 file systems it is possible for KRaft to create zero-length snapshot
> files. Not all file systems fsync to disk on close. For KRaft to guarantee
> that the data has made it to disk before calling rename, it needs to make
> sure that the file has been fsynced.
> We have seen cases where the snapshot file has zero-length data on ext4 file
> systems.
> {quote} "Delayed allocation" means that the filesystem tries to delay the
> allocation of physical disk blocks for written data for as long as possible.
> This policy brings some important performance benefits. Many files are
> short-lived; delayed allocation can keep the system from writing fleeting
> temporary files to disk at all. And, for longer-lived files, delayed
> allocation allows the kernel to accumulate more data and to allocate the
> blocks for data contiguously, speeding up both the write and any subsequent
> reads of that data. It's an important optimization which is found in most
> contemporary filesystems.
> But, if blocks have not been allocated for a file, there is no need to write
> them quickly as a security measure. Since the blocks do not yet exist, it is
> not possible to read somebody else's data from them. So ext4 will not
> (cannot) write out unallocated blocks as part of the next journal commit
> cycle. Those blocks will, instead, wait until the kernel decides to flush
> them out; at that point, physical blocks will be allocated on disk and the
> data will be made persistent. The kernel doesn't like to let file data sit
> unwritten for too long, but it can still take a minute or so (with the
> default settings) for that data to be flushed - far longer than the five
> seconds normally seen with ext3. And that is why a crash can cause the loss
> of quite a bit more data when ext4 is being used.
> {quote}
> from: [https://lwn.net/Articles/322823/]
> {quote}auto_da_alloc ( * ), noauto_da_alloc
> Many broken applications don't use fsync() when replacing existing files via
> patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/
> rename("foo.new", "foo"), or worse yet, fd = open("foo",
> O_TRUNC)/write(fd,..)/close(fd). If auto_da_alloc is enabled, ext4 will
> detect the replace-via-rename and replace-via-truncate patterns and force
> that any delayed allocation blocks are allocated such that at the next
> journal commit, in the default data=ordered mode, the data blocks of the new
> file are forced to disk before the rename() operation is committed. This
> provides roughly the same level of guarantees as ext3, and avoids the
> "zero-length" problem that can happen when a system crashes before the
> delayed allocation blocks are forced to disk.
> {quote}
> from: [https://www.kernel.org/doc/html/latest/admin-guide/ext4.html]
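
For illustration, the write / fsync / rename ordering that the description calls for could look roughly like the sketch below. It is not the actual FileRawSnapshotWriter change. The class name FlushBeforeMove, the method writeDurably and the ".part" suffix are hypothetical names made up for this example; only FileChannel.force and Files.move with ATOMIC_MOVE are standard JDK calls.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not Kafka's FileRawSnapshotWriter.
class FlushBeforeMove {

    // Writes data to a temporary sibling file, fsyncs it, then renames it
    // onto the target so a crash cannot expose a zero-length target file.
    static void writeDurably(Path target, byte[] data) throws IOException {
        // Assumed naming convention for the in-progress file.
        Path tmp = target.resolveSibling(target.getFileName() + ".part");

        try (FileChannel channel = FileChannel.open(
                tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            // A single write() call may be partial, so loop until drained.
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // Ask the OS to push file content and metadata to the storage
            // device. close() alone does not guarantee this.
            channel.force(true);
        }

        // Only after the data has been forced to disk is the file renamed.
        // Whether an existing target is replaced is implementation specific;
        // on Linux this maps to rename(), which replaces it.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("snapshot-demo");
        Path target = dir.resolve("snapshot.checkpoint");
        writeDurably(target, "snapshot-bytes".getBytes(StandardCharsets.UTF_8));
        System.out.println("size after move: " + Files.size(target));
    }
}
```

The sketch forces only the file itself. Making the rename durable as well may additionally require an fsync of the parent directory, which is left out here.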