Hi Julian,

How would that work for non-BTRFS systems, and if it doesn't, will that make Debian a BTRFS-only system?
I'm personally fine with "This works faster in BTRFS, because we implemented X", but not with "Debian only works on BTRFS".
Cheers,
Hakan

On 12/27/24 11:18 AM, Julian Andres Klode wrote:
> On Tue, Dec 24, 2024 at 11:10:27PM +0900, Simon Richter wrote:
>> Hi,
>>
>> On 12/24/24 18:54, Michael Tokarev wrote:
>>> The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
>>> issues, where a power-cut in the middle of a filesystem metadata
>>> operation (of which dpkg does a lot) might result in an inconsistent
>>> filesystem state.
>>
>> The thing it protects against is the missing ordering between a
>> write() to the contents of an inode and the rename() that updates the
>> name referring to it. These are unrelated operations even in other
>> file systems, unless you use data journaling ("data=journal") to
>> force all operations through the journal, in order. Normally
>> ("data=ordered") you only get the metadata update marking the data
>> valid after the data has been written, but with no ordering relative
>> to the file name change.
>>
>> The order of operations needs to be:
>>
>> 1. create the .dpkg-new file
>> 2. write data to the .dpkg-new file
>> 3. link the existing file to .dpkg-old
>> 4. rename the .dpkg-new file over the final file name
>> 5. clean up the .dpkg-old file
>>
>> When we reach step 4, the data needs to have been written to disk and
>> the metadata in the inode referenced by the .dpkg-new file updated;
>> otherwise we atomically replace the existing file with one whose
>> contents are not yet guaranteed to be on disk.
>>
>> We get two assurances from the file system here:
>>
>> 1. the file will not contain garbage data -- the number of bytes
>>    marked valid will be less than or equal to the number of bytes
>>    actually written. The number of valid bytes will be zero
>>    initially, and only after the data has been written out is the
>>    metadata update changing it to the final value added to the
>>    journal.
>>
>> 2. creation of the inode itself will be written into the journal
>>    before the rename operation, so the file never vanishes.
>>
>> What this does not protect against is the file name pointing to a
>> zero-size inode. The only way to avoid that is either data
>> journaling, which is horribly slow and creates extra writes, or
>> fsync().
>>
>>> Today, doing an fsync() really hurts: with SSDs/flash it reduces the
>>> lifetime of the storage, and for many modern filesystems it is a
>>> costly operation that bloats the metadata tree significantly, making
>>> all further operations less efficient.
>>
>> This should not make any difference in the number of write operations
>> necessary; it only affects their ordering. The data, the metadata
>> journal entry, and the metadata update still have to be written.
>>
>> The only way this could be improved is with a filesystem-level
>> transaction, where we can ask the file system to perform the entire
>> update atomically -- then all the metadata updates can be queued in
>> RAM, held back until the data has been synchronized by the kernel in
>> the background, and then added to the journal in one go. I would
>> expect fsync() to become cheap on such a file system, because it
>> would just be added to the transaction, and if the kernel gets around
>> to writing the data before the entire transaction is synchronized at
>> the end, it becomes a no-op.
>>
>> This assumes that maintainer scripts can be included in the
>> transaction (otherwise we need to flush the transaction before
>> invoking a maintainer script), and that no external process records
>> the successful execution and expects it to be persisted. (apt makes
>> no such assumption, because it reads the dpkg status, so this is
>> safe; but e.g. puppet might become confused if an operation it marked
>> as successful is rolled back by a power loss.)
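
To make the ordering concrete, here is a minimal C sketch of the
five-step sequence described above, with the fsync() before the rename.
The function name, buffer sizes, and error handling are illustrative
only, not dpkg's actual code:

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Sketch of the safe-replace sequence; not dpkg's real implementation. */
  static int install_file(const char *final, const void *data, size_t len)
  {
      char tmp[4096], old[4096];
      snprintf(tmp, sizeof tmp, "%s.dpkg-new", final);
      snprintf(old, sizeof old, "%s.dpkg-old", final);

      /* 1. create the .dpkg-new file */
      int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (fd < 0)
          return -1;

      /* 2. write the data to the .dpkg-new file */
      if (write(fd, data, len) != (ssize_t)len) {
          close(fd);
          return -1;
      }

      /* Without this fsync(), step 4 could atomically install a name
         that still points at a zero-size inode after a power loss. */
      if (fsync(fd) < 0) {
          close(fd);
          return -1;
      }
      close(fd);

      /* 3. keep the old version reachable as .dpkg-old
         (remove any stale backup first; ENOENT means fresh install) */
      unlink(old);
      if (link(final, old) < 0 && errno != ENOENT)
          return -1;

      /* 4. atomically replace the final name */
      if (rename(tmp, final) < 0)
          return -1;

      /* 5. clean up the backup link */
      unlink(old);
      return 0;
  }

Running dpkg with --force-unsafe-io essentially skips the fsync() step,
which is safe only when the whole tree can be discarded or rolled back
after a crash.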
>> What could make sense is more aggressively promoting this option for
>> containers and similar throwaway installations where there is a
>> guarantee that a power loss will have the entire workspace thrown
>> away, such as when working in a CI environment.
>>
>> However, even that is not guaranteed: if I create a Docker image for
>> reuse, Docker will mark the image creation as successful when the
>> command returns. Again, there is no ordering guarantee between the
>> container contents and the database outside that records the success
>> of the operation.
>>
>> So no, we cannot drop the fsync(). :\
>
> I do have a plan, namely to merge the btrfs snapshot integration into
> apt: if we took a snapshot, we run dpkg with --force-unsafe-io. The
> cooler solution would be to take the snapshot, run dpkg inside it, and
> then switch over to it, but one step after the other; that part is
> still very much WIP.
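
And a rough sketch of the shape the snapshot flow Julian describes
could take, written as a C program driving the existing command-line
tools. The subvolume path and package name are hypothetical
placeholders, and the real apt integration is still WIP and will look
different:

  #include <stdlib.h>

  int main(void)
  {
      /* 1. Snapshot the root subvolume so a crash during the upgrade
         can be rolled back wholesale. "/.snapshots/pre-upgrade" is a
         made-up path. */
      if (system("btrfs subvolume snapshot / /.snapshots/pre-upgrade") != 0)
          return EXIT_FAILURE;

      /* 2. With the snapshot as a fallback, per-file durability is no
         longer needed, so dpkg can skip its fsync()s. "pkg.deb" is a
         placeholder. */
      if (system("dpkg --force-unsafe-io -i pkg.deb") != 0)
          return EXIT_FAILURE;   /* roll back to the snapshot here */

      /* 3. Success: the snapshot is no longer needed. */
      system("btrfs subvolume delete /.snapshots/pre-upgrade");
      return EXIT_SUCCESS;
  }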