On Tue, Dec 24, 2024 at 11:10:27PM +0900, Simon Richter wrote:
> Hi,
> 
> On 12/24/24 18:54, Michael Tokarev wrote:
> 
> > The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
> > issues, where a power-cut in the middle of filesystem metadata
> > operation (which dpkg does a lot) might result in an inconsistent
> > filesystem state.
> 
> The thing it protects against is a missing ordering of write() to the
> contents of an inode, and a rename() updating the name referring to it.
> 
> These are unrelated operations even in other file systems, unless you use
> > data journaling ("data=journal") to force all operations through the journal,
> in order. Normally ("data=ordered") you only get the metadata update marking
> the data valid after the data has been written, but with no ordering
> relative to the file name change.
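For reference, these are real ext4 mount options (the journaling mode is spelled `data=journal`); the device and mountpoint below are illustrative:

```shell
# data=ordered (the ext4 default): file data is forced out before the
# metadata marking it valid is committed -- but with no ordering
# relative to a rename() of the name referring to it.
mount -o data=ordered /dev/sdb1 /mnt

# data=journal: all data goes through the journal, in order; safe
# against the rename-before-data hazard, but slow and write-amplifying.
mount -o data=journal /dev/sdb1 /mnt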
> 
> The order of operation needs to be
> 
> 1. create .dpkg-new file
> 2. write data to .dpkg-new file
> 3. link existing file to .dpkg-old
> 4. rename .dpkg-new file over final file name
> 5. clean up .dpkg-old file
> 
> When we reach step 4, the data needs to be written to disk and the metadata
> in the inode referenced by the .dpkg-new file updated, otherwise we
> atomically replace the existing file with one that is not yet guaranteed to
> be written out.
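The five steps, including the write-out requirement before step 4, can be sketched in shell. This is a simplified illustration, not dpkg's actual code: coreutils `sync FILE` (which issues fsync(2) on that file) stands in for dpkg's internal fsync, `mv` over an existing name is rename(2), and the file name is made up.

```shell
f=demo.conf

# 1+2: create the .dpkg-new file and write the data to it
printf 'new contents\n' > "$f.dpkg-new"

# Without this fsync(2), step 4 could atomically install a name
# pointing at an inode whose data has not yet reached disk.
sync "$f.dpkg-new"

# 3: keep the existing file reachable under .dpkg-old (if it exists)
[ -e "$f" ] && ln -f "$f" "$f.dpkg-old"

# 4: rename(2) the .dpkg-new file over the final name -- atomic
mv "$f.dpkg-new" "$f"

# 5: clean up the .dpkg-old file
rm -f "$f.dpkg-old"
```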
> 
> We get two assurances from the file system here:
> 
> 1. the file will not contain garbage data -- the number of bytes marked
> valid will be less than or equal to the number of bytes actually written.
> > The number of valid bytes will be zero initially; only after the data has
> > been written out is the metadata update changing it to the final value
> > added to the journal.
> 
> 2. creation of the inode itself will be written into the journal before the
> rename operation, so the file never vanishes.
> 
> What this does not protect against is the file pointing to a zero-size
> inode. The only way to avoid that is either data journaling, which is
> horribly slow and creates extra writes, or fsync().
> 
> > > Today, doing an fsync() really hurts - with SSDs/flash it reduces
> > > the lifetime of the storage, and for many modern filesystems it is a
> > > costly operation which bloats the metadata tree significantly,
> > > resulting in all further operations becoming inefficient.
> 
> This should not make any difference in the number of write operations
> necessary, and only affect ordering. The data, metadata journal and metadata
> update still have to be written.
> 
> The only way this could be improved is with a filesystem level transaction,
> where we can ask the file system to perform the entire update atomically --
> then all the metadata updates can be queued in RAM, held back until the data
> has been synchronized by the kernel in the background, and then added to the
> journal in one go. I would expect that with such a file system, fsync()
> becomes cheap, because it would just be added to the transaction, and if the
> kernel gets around to writing the data before the entire transaction is
> synchronized at the end, it becomes a no-op.
> 
> This assumes that maintainer scripts can be included in the transaction
> (otherwise we need to flush the transaction before invoking a maintainer
> script), and that no external process records the successful execution and
> expects it to be persisted (apt makes no such assumption, because it reads
> the dpkg status, so this is safe, but e.g. puppet might become confused if
> an operation it marked as successful is rolled back by a power loss).
> 
> What could make sense is more aggressively promoting this option for
> containers and similar throwaway installations where there is a guarantee
> that a power loss will have the entire workspace thrown away, such as when
> working in a CI environment.
> 
> However, even that is not guaranteed: if I create a Docker image for reuse,
> Docker will mark the image creation as successful when the command returns.
> Again, there is no ordering guarantee between the container contents and the
> database recording the success of the operation outside.
> 
> So no, we cannot drop the fsync(). :\

I do have a plan, namely to merge the btrfs snapshot integration into
apt: if we took a snapshot, we run dpkg with --force-unsafe-io.

The cool solution would be to take the snapshot, run dpkg inside it,
and then switch to it - but one step at a time; that's still very
much WIP.
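A hedged sketch of that idea (subvolume paths and the package name are illustrative, and these commands need root on a btrfs root filesystem):

```shell
# Take a snapshot first; a crash mid-upgrade can then be recovered by
# rolling back to it, so dpkg's per-file fsync()s become unnecessary.
btrfs subvolume snapshot / /.snapshots/pre-upgrade

dpkg --force-unsafe-io -i some-package.deb

# On success, the safety snapshot can be discarded.
btrfs subvolume delete /.snapshots/pre-upgrade
```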

-- 
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer                              i speak de, en
