Hi,

On 12/24/24 18:54, Michael Tokarev wrote:

> The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
> issues, where a power-cut in the middle of a filesystem metadata
> operation (which dpkg does a lot) might result in an inconsistent
> filesystem state.

The thing it protects against is a missing ordering between a write() to the contents of an inode and a rename() updating the name referring to it.

These are unrelated operations even in other file systems, unless you use data journaling ("data=journal") to force all operations through the journal, in order. Normally ("data=ordered") you only get the metadata update marking the data valid after the data has been written, but with no ordering relative to the file name change.
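
To make that concrete, here is a minimal sketch in C of the pattern in question (made-up path names, error handling omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* hypothetical paths, for illustration only */
        int fd = open("file.dpkg-new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, "new contents\n", 13);
        close(fd);

        /* Without an fsync() here, the journal may commit the rename
         * before the data blocks of file.dpkg-new reach the disk; a
         * power cut in that window leaves "file" with no valid data. */
        rename("file.dpkg-new", "file");
        return 0;
    }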

The order of operation needs to be

1. create .dpkg-new file
2. write data to .dpkg-new file
3. link existing file to .dpkg-old
4. rename .dpkg-new file over final file name
5. clean up .dpkg-old file

When we reach step 4, the data needs to be written to disk and the metadata in the inode referenced by the .dpkg-new file updated, otherwise we atomically replace the existing file with one that is not yet guaranteed to be written out.
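
As a rough sketch in C of that sequence, with the fsync() that makes step 4 safe (illustrative only, not dpkg's actual code; error handling omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("file.dpkg-new", O_WRONLY | O_CREAT | O_TRUNC, 0644); /* 1 */
        write(fd, "new contents\n", 13);                                    /* 2 */

        fsync(fd);                 /* data and inode of .dpkg-new now on disk */
        close(fd);

        link("file", "file.dpkg-old");       /* 3: backup link to the old file */
        rename("file.dpkg-new", "file");     /* 4: atomic replace, now safe */
        unlink("file.dpkg-old");             /* 5: clean up */
        return 0;
    }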

We get two assurances from the file system here:

1. the file will not contain garbage data -- the number of bytes marked valid will be less than or equal to the number of bytes actually written. The number of valid bytes starts out as zero, and only after the data has been written out is the metadata update that changes it to the final value added to the journal.

2. creation of the inode itself will be written into the journal before the rename operation, so the file never vanishes.

What this does not protect against is the file pointing to a zero-size inode. The only way to avoid that is either data journaling, which is horribly slow and creates extra writes, or fsync().

> Today, doing an fsync() really hurts - with SSDs/flash it reduces
> the lifetime of the storage, and for many modern filesystems it is a
> costly operation which bloats the metadata tree significantly,
> making all further operations inefficient.

This should not make any difference in the number of write operations necessary; it only affects their ordering. The data, the metadata journal and the metadata update still have to be written.

The only way this could be improved is with a filesystem-level transaction, where we can ask the file system to perform the entire update atomically -- then all the metadata updates can be queued in RAM, held back until the data has been synchronized by the kernel in the background, and then added to the journal in one go. I would expect fsync() to become cheap on such a file system: it would just be added to the transaction, and if the kernel gets around to writing the data before the entire transaction is synchronized at the end, it becomes a no-op.
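
Purely to illustrate the idea -- no such interface exists in Linux today, and the names below are invented -- this is roughly what it could look like from dpkg's side:

    /* Entirely hypothetical -- no such kernel interface exists.
     * The stubs below only make the sketch compile. */
    static int fs_tx_begin(void)    { return 0; }
    static int fs_tx_commit(int tx) { (void)tx; return 0; }

    int main(void)
    {
        int tx = fs_tx_begin();

        /* Unpack files, run maintainer scripts, update the dpkg status.
         * Every write(), rename() and fsync() would only be queued
         * against the transaction; nothing is forced out individually. */

        return fs_tx_commit(tx);    /* one ordered journal commit at the end */
    }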

This assumes that maintainer scripts can be included in the transaction (otherwise we need to flush the transaction before invoking a maintainer script), and that no external process records the successful execution and expects it to be persisted. apt makes no such assumption, because it reads the dpkg status, so it is safe; but e.g. puppet might become confused if an operation it marked as successful is rolled back by a power loss.

What could make sense is more aggressively promoting this option for containers and similar throwaway installations where there is a guarantee that a power loss will have the entire workspace thrown away, such as when working in a CI environment.
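
dpkg already exposes this as --force-unsafe-io; one way to enable it persistently in an image build is a dpkg.cfg.d drop-in (the file name here is just an example):

    # /etc/dpkg/dpkg.cfg.d/unsafe-io  (example path)
    force-unsafe-io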

However, even that is not guaranteed: if I create a Docker image for reuse, Docker will mark the image creation as successful when the command returns. Again, there is no ordering guarantee between the container contents and the database recording the success of the operation outside.

So no, we cannot drop the fsync(). :\

   Simon
