Hi Julian,

How would that work for non-BTRFS systems? And if it doesn't, would that make Debian a BTRFS-only system?

I'm personally fine with "This works faster on BTRFS, because we implemented X", but not with "Debian only works on BTRFS".

Cheers,

Hakan

On 12/27/24 11:18 AM, Julian Andres Klode wrote:
On Tue, Dec 24, 2024 at 11:10:27PM +0900, Simon Richter wrote:
Hi,

On 12/24/24 18:54, Michael Tokarev wrote:

The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
issues, where a power cut in the middle of a filesystem metadata
operation (of which dpkg does a lot) might result in an inconsistent
filesystem state.

What it protects against is the missing ordering between write()s to the
contents of an inode and a rename() updating the name that refers to it.

These are unrelated operations even in other file systems, unless you use
data journaling ("data=journal") to force all operations through the
journal, in order. Normally ("data=ordered") the metadata update marking
the data as valid is only written after the data itself, but with no
ordering relative to the file name change.
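
To make the missing ordering concrete, here is a minimal sketch (Python
standing in for the syscalls; the file names are made up):

    import os

    # write new contents next to the target
    fd = os.open("/tmp/target.new", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(fd, b"new contents\n")
    os.close(fd)

    # No fsync(): with data=ordered, nothing forces the data above to
    # disk before this rename is journaled. After a power cut,
    # /tmp/target may point at a zero-length inode.
    os.rename("/tmp/target.new", "/tmp/target")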

The order of operations needs to be:

1. create .dpkg-new file
2. write data to .dpkg-new file
3. link existing file to .dpkg-old
4. rename .dpkg-new file over final file name
5. clean up .dpkg-old file

By the time we reach step 4, the data needs to have been written to disk
and the metadata in the inode referenced by the .dpkg-new file updated,
otherwise we atomically replace the existing file with one whose contents
are not yet guaranteed to be on disk.

We get two assurances from the file system here:

1. the file will not contain garbage data -- the number of bytes marked
valid will be less than or equal to the number of bytes actually written.
The number of valid bytes starts at zero, and only after the data has
been written out is the metadata update changing it to the final value
added to the journal.

2. creation of the inode itself will be written into the journal before the
rename operation, so the file never vanishes.

What this does not protect against is the file pointing to a zero-size
inode. The only way to avoid that is either data journaling, which is
horribly slow and creates extra writes, or fsync().
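
For contrast, here is the full sequence from above with the fsync() in
place (again a minimal sketch with made-up file names, Python standing in
for the syscalls):

    import os

    path = "/usr/bin/tool"   # hypothetical target, assumed to exist

    # steps 1 and 2: create the .dpkg-new file and write the payload
    fd = os.open(path + ".dpkg-new", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o755)
    os.write(fd, b"...")     # payload elided
    os.fsync(fd)             # data and inode metadata are on disk after this
    os.close(fd)

    # step 3: keep the old version reachable as .dpkg-old
    os.link(path, path + ".dpkg-old")

    # step 4: atomic switch; the inode behind .dpkg-new is known-good now
    os.rename(path + ".dpkg-new", path)

    # step 5: clean up
    os.unlink(path + ".dpkg-old")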

Today, doing an fsync() really hurts: with SSDs/flash it reduces the
lifetime of the storage, and for many modern filesystems it is a costly
operation that bloats the metadata tree significantly, making all further
operations less efficient.

This should not make any difference in the number of write operations
needed; it only affects their ordering. The data, the metadata journal
entry and the metadata update still have to be written.

The only way this could be improved is with a filesystem-level
transaction, where we can ask the file system to perform the entire
update atomically -- then all the metadata updates can be queued in RAM,
held back until the data has been synchronized by the kernel in the
background, and then added to the journal in one go. I would expect
fsync() to become cheap on such a file system, because it would just be
added to the transaction, and if the kernel gets around to writing the
data before the entire transaction is synchronized at the end, it becomes
a no-op.
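
No mainline Linux filesystem exposes such a transaction API today, so
purely as a hypothetical sketch of the idea (fs_tx_begin() and
fs_tx_commit() are invented names):

    import os

    def fs_tx_begin(mountpoint):  # invented; no such call exists today
        pass

    def fs_tx_commit(tx):         # invented; flushes all queued metadata at once
        pass

    updates = [("/tmp/example", b"data\n")]   # hypothetical files to replace

    tx = fs_tx_begin("/")
    for path, payload in updates:
        fd = os.open(path + ".new", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        os.write(fd, payload)
        os.fsync(fd)   # inside the transaction this could be a no-op:
                       # it only queues the metadata update
        os.close(fd)
        os.rename(path + ".new", path)
    fs_tx_commit(tx)   # data was synced in the background; metadata
                       # hits the journal in one go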

This assumes that maintainer scripts can be included in the transaction
(otherwise we need to flush the transaction before invoking a maintainer
script), and that no external process records the successful execution
and expects it to be persisted. (apt makes no such assumption, because it
reads the dpkg status, so this is safe; but e.g. puppet might become
confused if an operation it marked as successful is rolled back by a
power loss.)

What could make sense is more aggressively promoting this option for
containers and similar throwaway installations where there is a guarantee
that a power loss will have the entire workspace thrown away, such as when
working in a CI environment.

However, even that is not guaranteed: if I create a Docker image for reuse,
Docker will mark the image creation as successful when the command returns.
Again, there is no ordering guarantee between the container contents and the
database recording the success of the operation outside.

So no, we cannot drop the fsync(). :\

I do have a plan, namely to merge the btrfs snapshot integration into
apt: if we took a snapshot, we run dpkg with --force-unsafe-io.

The cool solution would be to take the snapshot, run dpkg inside it, and
then switch over to it -- but one step after the other; that's still very
much WIP.
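
Roughly, the first variant could look like this (a sketch only: the
subvolume layout and package name are made up, and the real apt
integration is still being worked out):

    import subprocess

    root = "/"                        # assumed: the root is a btrfs subvolume
    snap = "/.snapshots/pre-upgrade"  # hypothetical snapshot location

    # take a snapshot as the rollback point
    subprocess.run(["btrfs", "subvolume", "snapshot", root, snap], check=True)

    # with a rollback point in place, the per-file fsync()s can be skipped
    subprocess.run(["dpkg", "--force-unsafe-io", "-i", "package.deb"], check=True)

    # after a power cut mid-upgrade, boot from the snapshot instead of
    # the possibly inconsistent root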

