On 1/14/14, 3:41 PM, Dave Chinner wrote:
On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <mgor...@suse.de> wrote:
IOWs, using sync_file_range() does not avoid the need to fsync() a
file for data integrity purposes...
I believe the PG community understands that, but thanks for the heads-up.
Whether the problem is with the system
call or the programmer is harder to determine. I think the problem is
in part that it's not exactly clear when we should call it. So
suppose we want to do a checkpoint. What we used to do a long time
ago is write everything, and then fsync it all, and then call it good.
But that produced horrible I/O storms. So what we do now is do the
writes over a period of time, with sleeps in between, and then fsync
it all at the end, hoping that the kernel will write some of it before
the fsyncs arrive so that we don't get a huge I/O spike.
And that sorta works, and it's definitely better than doing it all at
full speed, but it's pretty imprecise. If the kernel doesn't write
enough of the data out in advance, then there's still a huge I/O storm
when we do the fsyncs and everything grinds to a halt. If it writes
out more data than needed in advance, it increases the total number of
physical writes because we get less write-combining, and that hurts
performance, too.
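To make the "write slowly, fsync at the end" pattern concrete, here is a minimal sketch. It is not the actual PostgreSQL checkpointer code; the function name, the `pages` list of (offset, bytes) pairs, and the even pacing are all assumptions made for illustration.

```python
import os
import time


def spread_checkpoint(path, pages, total_write_seconds, sleep=time.sleep):
    """Write pages with pauses in between, then fsync once at the end.

    Hypothetical helper: `pages` is a list of (offset, data) tuples and
    the pause between writes is simply total_write_seconds / len(pages).
    """
    pause = total_write_seconds / len(pages) if pages else 0.0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for offset, data in pages:
            # This only dirties the page cache; nothing is durable yet.
            os.pwrite(fd, data, offset)
            # Sleep in the hope that kernel writeback flushes some of it.
            sleep(pause)
        # The only durability point. If the kernel wrote little in
        # advance, this is where the I/O storm shows up.
        os.fsync(fd)
    finally:
        os.close(fd)
    return pause
```

Nothing in this loop tells the kernel how much to write before the fsync arrives, which is exactly the imprecision described above.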
I think there's a pretty important bit that Robert didn't mention: we have a
specific *time* target for when we want all the fsyncs to complete. People
that have problems here tend to tune checkpoints to complete every 5-15
minutes, and they want the write traffic for the checkpoint spread out over 90%
of that time interval. To put it another way, the fsyncs should finish once 90%
of the time to the next checkpoint has elapsed, but preferably not much before then.
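As a rough illustration of that time target, the arithmetic looks like the sketch below. The function names and the "pause if ahead of schedule" rule are assumptions for the example, not a description of the real implementation.

```python
def write_deadline(interval_s, target=0.9):
    """Seconds into the checkpoint interval by which writes should be done."""
    return interval_s * target


def is_ahead_of_schedule(done_fraction, elapsed_s, interval_s, target=0.9):
    """True if more work is done than the elapsed time calls for.

    A writer that is ahead can sleep; one that is behind should keep
    writing so that everything lands by the deadline.
    """
    expected_fraction = elapsed_s / write_deadline(interval_s, target)
    return done_fraction > expected_fraction
```

With a 10-minute checkpoint interval the write window is about 540 seconds. A checkpoint that is half written after 100 seconds is ahead and can pause; one that is only 10% written after 270 seconds is behind.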
Yup, the kernel defaults to maximising bulk write throughput, which
means it waits until the last possible moment to issue write IO. And
that's exactly to maximise write combining, optimise delayed
allocation, etc. There are many good reasons for doing this, and for
the majority of workloads it is the right behaviour to have.
It sounds to me like you want the kernel to start background
writeback earlier so that it doesn't build up as much dirty data
before you require a flush. There are several ways to do this by
tweaking writeback knobs. The simplest is probably just to set
/proc/sys/vm/dirty_background_bytes to an appropriate threshold (say
50MB) and dirty_expire_centisecs to a few seconds so that
background writeback starts and walks all dirty inodes almost
immediately. This will keep a steady stream of low level background
IO going, and fsync should then not take very long.
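For anyone wanting to try this, the small sketch below just computes the two values and prints them as sysctl.conf-style lines. The knob names are the ones found under /proc/sys/vm (the expiry knob is spelled dirty_expire_centisecs there). Reading "50MB" as 50 MiB and "a few seconds" as 3 seconds are my assumptions, and the helper name is made up.

```python
def writeback_sysctl_lines(background_mb=50, expire_seconds=3):
    """Return sysctl.conf-style lines for earlier background writeback.

    Assumptions: background_mb is interpreted as MiB, and 3 seconds
    stands in for "a few seconds". Tune both for the actual storage.
    """
    return [
        # Threshold of dirty data at which background writeback starts.
        f"vm.dirty_background_bytes = {background_mb * 1024 * 1024}",
        # Age after which dirty data is eligible for writeback, in 1/100 s.
        f"vm.dirty_expire_centisecs = {expire_seconds * 100}",
    ]


if __name__ == "__main__":
    print("\n".join(writeback_sysctl_lines()))
```

The output is only text. Applying it still means writing the values to /proc/sys/vm or loading them with sysctl as root.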
Except that still won't throttle writes, right? That's the big issue here: our
users often can't tolerate big spikes in IO latency. They want user requests to
always happen within a specific amount of time.
So while delaying writes potentially reduces the total amount of data you're
writing, users that run into problems here ultimately care more about ensuring
that their foreground IO completes in a timely fashion.
Fundamentally, though, we need bug reports from people seeing these
problems when they see them so we can diagnose them on their
systems. Trying to discuss/diagnose these problems without knowing
anything about the storage, the kernel version, writeback
thresholds, etc really doesn't work because we can't easily
determine a root cause.
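A report along those lines could start from something like the sketch below, which gathers the kernel version and the current writeback thresholds. The helper name and the choice of knobs are assumptions. It says nothing about the storage itself, which would still need to be described by hand.

```python
import platform
from pathlib import Path

# Writeback-related knobs under /proc/sys/vm on Linux.
VM_KNOBS = (
    "dirty_background_bytes",
    "dirty_background_ratio",
    "dirty_bytes",
    "dirty_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def collect_writeback_report(vm_dir="/proc/sys/vm"):
    """Return a dict with the kernel release and writeback settings."""
    report = {"kernel": platform.release()}
    for knob in VM_KNOBS:
        try:
            report[knob] = (Path(vm_dir) / knob).read_text().strip()
        except OSError:
            # Knob missing or unreadable (for example on a non-Linux system).
            report[knob] = "unavailable"
    return report


if __name__ == "__main__":
    for key, value in collect_writeback_report().items():
        print(f"{key}: {value}")
```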
So is lsf...@linux-foundation.org the best way to accomplish that?
Also, along the lines of collaboration, it would be awesome to see kernel hackers at
PGCon (http://pgcon.org) for further discussion of this stuff. That is the conference
that has more Postgres internal developers than any other. There's a variety of different
ways collaboration could happen there, so it's probably best to start a separate
discussion with those from the Linux community who'd be interested in attending. PGCon
also directly follows BSDCan (http://bsdcan.org) at the same venue... so we could
potentially kill two OS birds with one stone, so to speak... :) If there's enough
interest we could potentially do a "mini Postgres/OS conference" in-between
BSDCan and the formal PGCon. There's also potential for the Postgres community to sponsor
attendance for kernel hackers if money is a factor.
Like I said... best to start a separate thread if there's significant interest
in meeting at PGCon. :)
--
Jim C. Nasby, Data Architect j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)