On Thu, Jun 18, 2020 at 05:28:11PM +0300, Reco wrote:
Hi.
On Thu, Jun 18, 2020 at 08:57:48AM -0400, Michael Stone wrote:
On Thu, Jun 18, 2020 at 08:50:49AM +0300, Reco wrote:
> On Wed, Jun 17, 2020 at 05:54:51PM -0400, Michael Stone wrote:
> > On Wed, Jun 17, 2020 at 11:45:53PM +0300, Reco wrote:
> > > Long story short, if you need a primitive I/O benchmark, you're better
> > > with both dsync and nocache.
> >
> > Not unless that's your actual workload, IMO. Almost nothing does sync i/o;
>
> Almost everything does (see my previous e-mails). Not everything does
> it with O_DSYNC, that's true.
You're not using the words the way most people use them, which
certainly confuses the conversation.
Earlier in this thread someone posted a link to a Wikipedia article on
the matter. Whatever terminology I'm using is consistent with it.
That qualifies as "common terminology" IMO.
It would really be better to just drop any kind of metaphysical argument
about what to call things and just focus on command lines and other
concrete examples. Again, you seem fixated on certain APIs and then
making leaps in other contexts where the distinctions you're trying to
make don't apply.
writing one block at a time is *really* *really* bad for performance.
True. But it's also good for the integrity of the written data, which
is presumably why sqlite upstream did it.
strace -e open,openat,fsync,fdatasync sqlite3 test.sqlite3
[snip]
SQLite version 3.32.2 2020-06-04 12:58:43
Enter ".help" for usage hints.
[snip]
sqlite> create table test (test varchar);
openat(AT_FDCWD, "test.sqlite3", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/test.sqlite3", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC,
0644) = 5
openat(AT_FDCWD, "/tmp/test.sqlite3-journal",
O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 6
openat(AT_FDCWD, "/dev/urandom", O_RDONLY|O_CLOEXEC) = 7
fdatasync(6) = 0
openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC) = 7
fdatasync(7) = 0
fdatasync(6) = 0
fdatasync(5) = 0
sqlite> insert into test VALUES ('foo');
openat(AT_FDCWD, "/tmp/test.sqlite3-journal",
O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 6
fdatasync(6) = 0
openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC) = 7
fdatasync(7) = 0
fdatasync(6) = 0
fdatasync(5) = 0
sqlite> update test set test = 'bar';
openat(AT_FDCWD, "/tmp/test.sqlite3-journal",
O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 6
fdatasync(6) = 0
openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC) = 7
fdatasync(7) = 0
fdatasync(6) = 0
fdatasync(5) = 0
No O_DSYNCs to be seen, but quite a few fdatasync's! You don't seem to
be checking that what you're saying matches actual practice & behavior.
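The pattern the trace shows can be sketched in a few lines of Python
(the file name is made up, and this is only an illustration of the
syscall pattern, not sqlite's actual code):

```python
import os, tempfile

# Durability here comes from an explicit fdatasync() on an ordinary
# descriptor, not from opening the file with O_DSYNC -- exactly what
# the strace output above shows.
path = os.path.join(tempfile.mkdtemp(), "test.db")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)  # note: no O_DSYNC
os.write(fd, b"some page data")
os.fdatasync(fd)  # explicit flush at a commit point, like the trace
os.close(fd)
```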
Most applications for which I/O performance is important allow writes to
buffer, then
flush the buffers as needed for data integrity.
No objections here. Most applications write their files as a whole,
and it makes total sense to do it this way. But there are exceptions
to this rule, and if an application modifies its files piecewise, it
probably uses O_DSYNC to be sure.
See above.
> > simply using conv=fdatasync to make sure that the cache is flushed
> > before exiting is going to be more representative.
>
> If you're answering the question "how fast are my programs going to
> write there" - sure. If you're answering the question "how fast my
> drive(s) actually is (are)" - nope, you need O_DSYNC.
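The difference between the two dd flavours reduces to two syscall
patterns, which can be sketched directly (paths and sizes here are
made up; the actual numbers depend entirely on the device and cache):

```python
import os, tempfile, time

BLOCK = b"\0" * 65536   # 64 KiB per write; sizes are arbitrary
COUNT = 16              # 1 MiB total, kept small so the sketch is quick

def buffered_then_sync(path):
    # conv=fdatasync style: the page cache absorbs every write,
    # and one fdatasync() at the end flushes the lot.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    t0 = time.perf_counter()
    for _ in range(COUNT):
        os.write(fd, BLOCK)
    os.fdatasync(fd)
    elapsed = time.perf_counter() - t0
    os.close(fd)
    return elapsed

def dsync_per_block(path):
    # oflag=dsync style: every single write() returns only after the
    # data has reached stable storage -- this measures the drive.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
    t0 = time.perf_counter()
    for _ in range(COUNT):
        os.write(fd, BLOCK)
    elapsed = time.perf_counter() - t0
    os.close(fd)
    return elapsed

d = tempfile.mkdtemp()
print("buffered + final fdatasync:", buffered_then_sync(os.path.join(d, "a")))
print("O_DSYNC on every block:    ", dsync_per_block(os.path.join(d, "b")))
```

On a rotating disk the second number is typically far worse, which is
the disagreement in a nutshell: the first mode measures what buffered
programs see, the second what the device sustains per synchronous write.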
While OF COURSE the question people want answered is "how fast are my
programs going to write there", the most important hidden question here
is: which programs? The ones that write their files in one big chunk
(which is common), or the ones that do it one piece at a time (any
RDBMS, for instance)?
See above. RDBMS usually try really hard to coalesce write operations
rather than writing little tiny pieces, even at the cost of writing the
data twice. (Once in a sequential journal, and again as part of combined
random writes.)
Real programs that write large amounts of data have to handle the
possibility of partial writes *even if* they are using O_DSYNC. In
non-trivial cases, if you're doing the work to handle the problems that
occur with a partial write, you can just as easily write larger amounts
of data unsynchronized to get better performance, then establish a
synchronization point with f(data)sync. There are cases where O_DSYNC
might be the best option, mostly around appending in relatively small
chunks. Otherwise, as above, you're probably using some kind of journal
and there's no reason to slow down every operation when you only need
things to hit the disk in a certain relative order to get the same
level of integrity.
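That journal-plus-ordering idea can be sketched as follows (a
hypothetical toy, not any real database's code): no O_DSYNC anywhere,
just fdatasync() at the two points where relative order matters.

```python
import os, tempfile

def journaled_update(data_path, journal_path, old, new):
    # Save the bytes being overwritten, and make the journal durable
    # *before* touching the main file -- the only ordering that
    # integrity actually requires.
    jfd = os.open(journal_path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(jfd, old)
    os.fdatasync(jfd)          # sync point 1: journal on disk first
    os.close(jfd)

    # The bulk write itself runs unsynchronized, however large it is.
    dfd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(dfd, new)
    os.fdatasync(dfd)          # sync point 2: update is now durable
    os.close(dfd)

    os.unlink(journal_path)    # a missing journal marks the commit done

d = tempfile.mkdtemp()
data = os.path.join(d, "data")
journal = os.path.join(d, "data-journal")
with open(data, "wb") as f:
    f.write(b"old contents")
journaled_update(data, journal, b"old contents", b"new contents")
```

If a crash happens before sync point 1, the old data is untouched; after
it, the journal allows a rollback -- so every individual write can stay
fast and buffered.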
This is also all stuff that's evolved over time and across systems.
Programs may behave one way on one system and another way on another
system because they have or lack certain guarantees or because of
dramatic performance differences. E.g., postgresql's write-ahead log is
an obvious candidate for O_DSYNC, but even there on linux it defaults to
fdatasync because of historic cases where O_DSYNC behaved dramatically
worse. (The two should be close to identical if you fdatasync after
every single write, with slightly higher overhead when making two
separate system calls, but in some cases that wasn't happening. I think
that doesn't happen anymore, but there's no strong incentive to change
the default: when things are working properly the difference isn't
large, but when they're broken it is huge.)
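For concreteness, the two wal_sync_method flavours mentioned above
reduce to these call sequences (a sketch with made-up paths, not
postgresql's actual code):

```python
import os, tempfile

d = tempfile.mkdtemp()

# open_datasync style: O_DSYNC makes every append durable on return.
fd = os.open(os.path.join(d, "wal_a"),
             os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC, 0o600)
os.write(fd, b"wal record")   # returns only once the data is stable
os.close(fd)

# fdatasync style (the Linux default discussed above): same guarantee,
# one extra system call after each record.
fd = os.open(os.path.join(d, "wal_b"),
             os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
os.write(fd, b"wal record")
os.fdatasync(fd)
os.close(fd)
```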