Ian Jackson wrote:
> Jamie Lokier writes ("Re: [Qemu-devel] [PATCH] ide.c make write cacheing
> controllable by guest"):
> > I'm imagining that fdatasync() will flush the necessary metadata,
> > including file size, when a file is extended.  As would O_DSYNC.
>
> Do you have a reference to support this supposition ?

Not a _standard_, of course, as you found with SUSv3.  More a folk
understanding, which admittedly might be lacking in some
implementations (like Linux perhaps...).

Take a look at your references.

> HP-UX 11i's fdatasync manpage:
>
>   fdatasync() causes all modified data and file attributes of fildes
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^
>   required to retrieve the data to be written to disk.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That means size, bitmap updates, block pointers, extents etc. needed to
retrieve the data.

> The glibc info manual:
>
>   Sometimes it is not even necessary to write all data associated with
>   a file descriptor.  E.g., in database files which do not change in size
>   it is enough to write all the file content data to the device.

A bit more from Glibc:

    Meta-information, like the modification time etc., are not that
    important and leaving such information uncommitted does not prevent a
    successful recovering of the file in case of a problem.

    When a call to the `fdatasync' function returns, it is ensured that
    all of the file data is written to the device.
    For all pending I/O operations, the parts guaranteeing data integrity
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    finished.
    ^^^^^^^^

Draw your own conclusion.

> The Solaris manpage says that fdatasync does the same as O_DSYNC,

That's right, it's the common meaning of O_DSYNC.

> and it calls the service "synchronized I/O data integrity
> completion" which is defined by the `Programming Interfaces Guide'
> to include this:
>
>   * For writes, the operation has been completed, or diagnosed if
>     unsuccessful. The write operation succeeds when the data specified
>     in the write request is successfully transferred. Furthermore, all
>     file system information required to retrieve the data must be
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>     successfully transferred.
      ^^^^^^^^^^^^^^^^^^^^^^^^

That's quite clear.

> But then the next bullet point is this:
>
>   * File attributes that are not necessary for data retrieval are not
>     transferred prior to returning to the calling process.
>
> which says `are not transferred' when it ought to say `are not
> necessarily transferred' so it may be unwise to rely on the precise
> wording.

That's fine and consistent with the previous text.  It means size
increase, bitmaps, pointers, extents etc. are written (those are the
attributes necessary for data retrieval).  Attributes like modification
time, access time, change time, permissions etc. are not (necessarily)
transferred.

You're right it should say "not necessarily", but that's implicit: they
can be transferred at any time anyway, by normal background writeback.

> I looked at various other manpages but they all say useless things
> like `metadata such as modification time' which leaves open the
> question of whether the file size is included.

I agree it's a bit ambiguous.  My understanding is that _increases_ in
size are included, by convention as much as anything, since the larger
size is necessary to retrieve the data later.  This is supported by the
fact that O_DSYNC has a tendency to become very slow on some systems
when extending a file, compared with writing in place.

> If the size is supposed to be included then the OS is required to keep
> a flag to say whether the file has been extended so that it knows that
> the next fdatasync ought really to be an fsync and write the inode
> too.  (In a traditional filesystem structure.)

That's right.

> Or perhaps fsck needs
> to extend the file as necessary to include the data blocks past the
> nominal end of file.

Well, in general, if your system is such that fsck following a crash is
part of normal filesystem operations, then fsck could be allowed to do a
lot more than extend the size attribute.

That doesn't matter to the application, though.
What matters is that it writes data (including extending the file),
calls fdatasync() (or uses O_DSYNC), and when the fdatasync returns it
knows after a crash and recovery that it will be able to retrieve that
data with the appropriate confidence level.

> This seems like rather a minefield.

The implementation details seem like a minefield, but the intent and
documentation and tradition of fdatasync() seem quite clear to me.

However, I suppose you might want to be careful and check, when
deploying your new database which depends on fdatasync(), if the target
systems really do sync size changes :-)  It's easy enough to check, as
it greatly slows down extending writes.

But I suppose, for an app writer, as you know it's going to involve a
slower than normal write anyway, it's also easy enough to extend by a
big chunk then call fsync() once, if you prefer to not have to trust
fdatasync() on this.

-- Jamie
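
P.S. A minimal sketch of that last suggestion, in C, assuming a POSIX
system.  The helper names extend_and_fsync() and write_in_place() are
made up for illustration; they are not from QEMU or any library.  The
idea is to grow the file by writing zeros and commit the new size with
one full fsync(), so that later writes stay inside the synced size and
only need fdatasync().  Whether this is actually faster depends on the
filesystem, so treat it as something to measure, not a guarantee.

```c
/* Sketch only: helper names are hypothetical, error handling is minimal. */
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Grow the file behind fd to new_size by writing zeros (so the blocks
 * are really allocated), then commit data and metadata with fsync().
 * Returns 0 on success, -1 on error.
 */
int extend_and_fsync(int fd, off_t new_size)
{
    static const char zeros[4096];
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;

    off_t off = st.st_size;
    while (off < new_size) {
        size_t n = sizeof zeros;
        if ((off_t)n > new_size - off)
            n = (size_t)(new_size - off);
        ssize_t w = pwrite(fd, zeros, n, off);
        if (w <= 0)
            return -1;
        off += w;
    }
    /* Full fsync(): does not depend on fdatasync() syncing the size. */
    return fsync(fd);
}

/*
 * Overwrite len bytes at off and wait for the data with fdatasync().
 * Refuses any write that would extend the file, so the size committed
 * by extend_and_fsync() never changes here.
 * Returns 0 on success, -1 on error or if the write would extend.
 */
int write_in_place(int fd, const void *buf, size_t len, off_t off)
{
    assert(buf != NULL);

    struct stat st;
    if (fstat(fd, &st) < 0)
        return -1;
    if (off < 0 || off + (off_t)len > st.st_size)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t w = pwrite(fd, p, len, off);
        if (w <= 0)
            return -1;
        p += w;
        off += w;
        len -= (size_t)w;
    }
    return fdatasync(fd);
}
```

Typical use would be: open the file with O_RDWR, call
extend_and_fsync(fd, size) once with a generous size, then call
write_in_place() for each record.  When the file fills up, extend it
again by another big chunk.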