On Tue, Dec 24, 2024 at 11:10:27PM +0900, Simon Richter wrote:
> Hi,
> 
> On 12/24/24 18:54, Michael Tokarev wrote:
> 
> > The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
> > issues, where a power-cut in the middle of a filesystem metadata
> > operation (which dpkg does a lot of) might result in an inconsistent
> > filesystem state.
> 
> The thing it protects against is a missing ordering between a write()
> to the contents of an inode and a rename() updating the name referring
> to it.
> 
> These are unrelated operations even in other file systems, unless you
> use data journaling ("data=journaled") to force all operations to the
> journal, in order. Normally ("data=ordered") you only get the metadata
> update marking the data valid after the data has been written, but
> with no ordering relative to the file name change.
> 
> The order of operations needs to be:
> 
> 1. create the .dpkg-new file
> 2. write data to the .dpkg-new file
> 3. link the existing file to .dpkg-old
> 4. rename the .dpkg-new file over the final file name
> 5. clean up the .dpkg-old file
> 
> When we reach step 4, the data needs to be written to disk and the
> metadata in the inode referenced by the .dpkg-new file updated,
> otherwise we atomically replace the existing file with one that is not
> yet guaranteed to be written out.
> 
> We get two assurances from the file system here:
> 
> 1. the file will not contain garbage data -- the number of bytes
>    marked valid will be less than or equal to the number of bytes
>    actually written. The number of valid bytes will be zero initially,
>    and only after the data has been written out is the metadata update
>    changing it to the final value added to the journal.
> 
> 2. creation of the inode itself will be written into the journal
>    before the rename operation, so the file never vanishes.
> 
> What this does not protect against is the file pointing to a zero-size
> inode.
> The only way to avoid that is either data journaling, which is
> horribly slow and creates extra writes, or fsync().
> 
> > Today, doing an fsync() really hurts - with SSDs/flash it reduces
> > the lifetime of the storage, and for many modern filesystems it is a
> > costly operation which bloats the metadata tree significantly,
> > resulting in all further operations becoming inefficient.
> 
> This should not make any difference in the number of write operations
> necessary, and only affects ordering. The data, the metadata journal
> and the metadata update still have to be written.
> 
> The only way this could be improved is with a filesystem-level
> transaction, where we can ask the file system to perform the entire
> update atomically -- then all the metadata updates can be queued in
> RAM, held back until the data has been synchronized by the kernel in
> the background, and then added to the journal in one go. I would
> expect that with such a file system, fsync() becomes cheap, because it
> would just be added to the transaction, and if the kernel gets around
> to writing the data before the entire transaction is synchronized at
> the end, it becomes a no-op.
> 
> This assumes that maintainer scripts can be included in the
> transaction (otherwise we need to flush the transaction before
> invoking a maintainer script), and that no external process records
> the successful execution and expects it to be persisted (apt makes no
> such assumption, because it reads the dpkg status, so this is safe,
> but e.g. puppet might become confused if an operation it marked as
> successful is rolled back by a power loss).
> 
> What could make sense is more aggressively promoting this option for
> containers and similar throwaway installations where there is a
> guarantee that a power loss will have the entire workspace thrown
> away, such as when working in a CI environment.
> However, even that is not guaranteed: if I create a Docker image for
> reuse, Docker will mark the image creation as successful when the
> command returns. Again, there is no ordering guarantee between the
> container contents and the database recording the success of the
> operation outside.
> 
> So no, we cannot drop the fsync(). :\
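For reference, the five-step replace sequence and the fsync() ordering described above can be sketched roughly as follows. This is a minimal illustration in Python, not dpkg's actual C implementation; the helper name and paths are mine:

```python
import os

def replace_file(path: str, data: bytes) -> None:
    """Sketch of the safe replace sequence; the fsync() is what
    --force-unsafe-io would skip."""
    new = path + ".dpkg-new"
    old = path + ".dpkg-old"

    # 1. create the .dpkg-new file
    fd = os.open(new, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # 2. write data to the .dpkg-new file
        os.write(fd, data)
        # Without this fsync(), the rename() below may reach the journal
        # before the data does, and a power cut can leave the final name
        # pointing at a zero-size inode.
        os.fsync(fd)
    finally:
        os.close(fd)

    # 3. link the existing file to .dpkg-old (if there is one)
    if os.path.exists(path):
        os.link(path, old)
    # 4. atomically rename the .dpkg-new file over the final name
    os.rename(new, path)
    # 5. clean up the .dpkg-old file
    if os.path.exists(old):
        os.unlink(old)
```

With data=ordered semantics, steps 1-2 plus the fsync() guarantee the inode's contents are durable before step 4 makes them visible under the final name; the .dpkg-old link keeps the previous version recoverable until step 5.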
I do have a plan, namely to merge the btrfs snapshot integration into
apt: if we took a snapshot, we run dpkg with --force-unsafe-io. The
cooler solution would be to take the snapshot, run dpkg inside it, and
then switch over - but one step after the other; that's still very much
WIP.

-- 
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer                              i speak de, en