Hi,
On 12/24/24 18:54, Michael Tokarev wrote:
> The no-unsafe-io workaround in dpkg was needed for 2005-era ext2fs
> issues, where a power-cut in the middle of a filesystem metadata
> operation (which dpkg does a lot of) might result in an inconsistent
> filesystem state.
The thing it protects against is a missing ordering between a write()
to the contents of an inode and a rename() updating the name referring
to it.
These are unrelated operations even in other file systems, unless you
use data journaling ("data=journal") to force all operations to the
journal, in order. Normally ("data=ordered") you only get the metadata
update marking the data valid after the data has been written, but with
no ordering relative to the file name change.
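To make the hazard concrete, here is a hypothetical C sketch (invented
file names, error handling omitted -- not dpkg's actual code) of the
unordered pattern under data=ordered:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    void unsafe_replace(const char *data, size_t len)
    {
        int fd = open("config.dpkg-new",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);
        /* the contents may still sit only in the page cache here */
        write(fd, data, len);
        close(fd);
        /* the name change is journaled independently of the data, so
         * after a power cut "config" may exist but be empty */
        rename("config.dpkg-new", "config");
    }
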
The order of operations needs to be:
1. create .dpkg-new file
2. write data to .dpkg-new file
3. link existing file to .dpkg-old
4. rename .dpkg-new file over final file name
5. clean up .dpkg-old file
When we reach step 4, the data needs to be written to disk and the
metadata in the inode referenced by the .dpkg-new file updated,
otherwise we atomically replace the existing file with one that is not
yet guaranteed to be written out.
We get two assurances from the file system here:
1. the file will not contain garbage data -- the number of bytes marked
valid will be less than or equal to the number of bytes actually
written. The number of valid bytes will be zero initially, and only
after the data has been written out is the metadata update to the final
value added to the journal.
2. creation of the inode itself will be written into the journal before
the rename operation, so the file never vanishes.
What this does not protect against is the file pointing to a zero-size
inode. The only way to avoid that is either data journaling, which is
horribly slow and creates extra writes, or fsync().
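As a minimal sketch (again with invented file names and no error
handling, not dpkg's actual code), the fsync() slots into the sequence
above like this:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    void safe_replace(const char *data, size_t len)
    {
        /* 1. create the .dpkg-new file */
        int fd = open("config.dpkg-new",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);

        /* 2. write the new contents */
        write(fd, data, len);

        /* force the data -- and the inode metadata marking it valid --
         * to disk before the rename below, so the final name can never
         * end up pointing at an empty or partially written inode */
        fsync(fd);
        close(fd);

        /* 3. keep a backup link to the existing file */
        link("config", "config.dpkg-old");

        /* 4. atomically switch the name over to the new inode */
        rename("config.dpkg-new", "config");

        /* 5. clean up the backup */
        unlink("config.dpkg-old");
    }
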
> Today, doing an fsync() really hurts - with SSDs/flash it reduces
> the lifetime of the storage, and for many modern filesystems it is a
> costly operation which bloats the metadata tree significantly,
> resulting in all further operations becoming inefficient.
This should not make any difference in the number of write operations
necessary, and only affect ordering. The data, metadata journal and
metadata update still have to be written.
The only way this could be improved is with a filesystem-level
transaction, where we can ask the file system to perform the entire
update atomically -- then all the metadata updates can be queued in RAM,
held back until the data has been synchronized by the kernel in the
background, and then added to the journal in one go. I would expect that
with such a file system, fsync() becomes cheap, because it would just be
added to the transaction, and if the kernel gets around to writing the
data before the entire transaction is synchronized at the end, it
becomes a no-op.
This assumes that maintainer scripts can be included in the transaction
(otherwise we need to flush the transaction before invoking a
maintainer script), and that no external process records the successful
execution and expects it to be persisted. apt makes no such assumption
because it reads the dpkg status, so it is safe -- but e.g. Puppet
might become confused if an operation it marked as successful is rolled
back by a power loss.
What could make sense is more aggressively promoting this option for
containers and similar throwaway installations where it is guaranteed
that after a power loss the entire workspace is thrown away, such as
when working in a CI environment.
However, even that is not guaranteed: if I create a Docker image for
reuse, Docker will mark the image creation as successful when the
command returns. Again, there is no ordering guarantee between the
container contents and the database outside the container that records
the success of the operation.
So no, we cannot drop the fsync(). :\
Simon