On Sat, Jan 15, 2011 at 5:47 AM, Greg Smith <g...@2ndquadrant.com> wrote:
> No toe damage, this is great, I hadn't gotten to coding for this angle yet
> at all. Suffering from an overload of ideas and (mostly wasted) test data,
> so thanks for exploring this concept and proving it works.
Yeah - obviously I want to make sure that someone reviews the logic carefully, since a loss of fsyncs or a corruption of the request queue could affect system stability, but only very rarely, since you'd need full fsync queue + crash. But the code is pretty simple, so it should be possible to convince ourselves as to its correctness (or otherwise). Obviously, major credit to you and Simon for identifying the problem and coming up with a proposed fix.

> I'm not sure what to do with the rest of the work I've been doing in this
> area here, so I'm tempted to just combine this new bit from you with the
> older patch I submitted, streamline, and see if that makes sense. Expected
> to be there already, then "how about spending 5 minutes first checking out
> that autovacuum lock patch again" turned out to be a wild underestimate.

I'd rather not combine the patches, because this one is pretty simple and just does one thing, but feel free to write something that applies over top of it. Looking through your old patch (sync-spread-v3), there seem to be a couple of components there:

- Compact the fsync queue based on percentage fill rather than number of writes per absorb. If we apply my queue-compacting logic, do we still need this? The queue compaction may hold the BgWriterCommLock for slightly longer than AbsorbFsyncRequests() would, but I'm not inclined to jump to the conclusion that this is worth getting excited about. The whole idea of accessing BgWriterShmem->num_requests without the lock gives me the willies anyway - sure, it'll probably work OK most of the time, especially on x86, but it seems hard to predict whether there will be occasional bad behavior on platforms with weak memory ordering.

- Call pgstat_send_bgwriter() at the end of AbsorbFsyncRequests(). Not sure what the motivation for this is.

- CheckpointSyncDelay(), to make sure that we absorb fsync requests and free up buffers during a long checkpoint. I think this part is clearly valuable, although I'm not sure we've satisfactorily solved the problem of how to spread out the fsyncs and still complete the checkpoint on schedule.

As to that, I have a couple of half-baked ideas I'll throw out so you can laugh at them. Some of these may be recycled versions of ideas you've already had/mentioned, so, again, credit to you for getting the ball rolling.

Idea #1: When we absorb fsync requests, don't just remember that there was an fsync request; also remember the time of said fsync request. If a new fsync request arrives for a segment for which we're already remembering an fsync request, update the timestamp on the request. Periodically scan the fsync request queue for requests older than, say, 30 s, and perform one such request. The idea is - if we wrote a bunch of data to a relation and then haven't touched it for a while, force it out to disk before the checkpoint actually starts so that the volume of work required by the checkpoint is lessened.

Idea #2: At the beginning of a checkpoint when we scan all the buffers, count the number of buffers that need to be synced for each relation. Use the same hashtable that we use for tracking pending fsync requests. Then, interleave the writes and the fsyncs. Start by performing any fsyncs that need to happen but have no buffers to sync (i.e. everything that must be written to that relation has already been written). Then, start performing the writes, decrementing the pending-write counters as you go. If the pending-write count for a relation hits zero, you can add it to the list of fsyncs that can be performed before the writes are finished. If it doesn't hit zero (perhaps because a non-bgwriter process wrote a buffer that we were going to write anyway), then we'll do it at the end. One problem with this - aside from complexity - is that most likely most fsyncs would either happen at the beginning or very near the end, because there's no reason to assume that buffers for the same relation would be clustered together in shared_buffers. But I'm inclined to think that at least the idea of performing fsyncs for which no dirty buffers remain in shared_buffers at the beginning of the checkpoint rather than at the end might have some value.
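To make idea #1 a little more concrete, here's a toy, standalone sketch of the bookkeeping I have in mind. It hand-waves over the real pending-ops hash table and the shared-memory request queue - the names and structures below are made up for illustration - so treat it as a doodle, not a patch:

    /*
     * Idea #1 doodle: remember when each segment last had an fsync request,
     * and push out segments that have been quiet for a while before the
     * checkpoint starts, so the checkpoint itself has less work to do.
     */
    #include <fcntl.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define MAX_PENDING   16
    #define STALE_SECONDS 30

    typedef struct
    {
        char   path[256];        /* segment file that needs an fsync */
        time_t last_request;     /* time of the most recent request for it */
        int    in_use;
    } PendingFsync;

    static PendingFsync pending[MAX_PENDING];

    /* Absorb a request: if we already remember one for this segment, just
     * update its timestamp; otherwise grab a free slot. */
    void
    remember_fsync_request(const char *path)
    {
        int i, free_slot = -1;

        for (i = 0; i < MAX_PENDING; i++)
        {
            if (pending[i].in_use && strcmp(pending[i].path, path) == 0)
            {
                pending[i].last_request = time(NULL);
                return;
            }
            if (!pending[i].in_use && free_slot < 0)
                free_slot = i;
        }
        if (free_slot >= 0)
        {
            strncpy(pending[free_slot].path, path,
                    sizeof(pending[free_slot].path) - 1);
            pending[free_slot].last_request = time(NULL);
            pending[free_slot].in_use = 1;
        }
    }

    /* Called every so often: sync at most one segment that nobody has
     * touched for STALE_SECONDS, and forget about it. */
    void
    sync_one_stale_segment(void)
    {
        time_t now = time(NULL);
        int    i;

        for (i = 0; i < MAX_PENDING; i++)
        {
            if (pending[i].in_use &&
                now - pending[i].last_request >= STALE_SECONDS)
            {
                int fd = open(pending[i].path, O_WRONLY);

                if (fd >= 0)
                {
                    fsync(fd);
                    close(fd);
                }
                pending[i].in_use = 0;
                return;
            }
        }
    }

The real version would of course key on relfilenode and segment number rather than a path, and the overflow case is where the queue-compacting logic would come in.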
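And, since I can hear the laughter already, a similarly hand-wavy sketch of the interleaving in idea #2 - the real thing would have to live in the pending-ops hashtable and the checkpoint's buffer-write loop, so this just shows the shape of the three phases (all names invented):

    /*
     * Idea #2 doodle: count each relation's dirty buffers at checkpoint
     * start, then issue a relation's fsync as soon as its pending-write
     * count reaches zero instead of saving every fsync for the very end.
     */
    #include <stdio.h>

    #define NRELS 3

    typedef struct
    {
        const char *relname;
        int         pending_writes;   /* dirty buffers counted at start */
        int         needs_fsync;
    } CkptRelState;

    static void
    do_fsync(CkptRelState *rel)
    {
        printf("fsync %s\n", rel->relname);
        rel->needs_fsync = 0;
    }

    int
    main(void)
    {
        CkptRelState rels[NRELS] = {
            {"rel_a", 0, 1},    /* nothing left to write: sync up front */
            {"rel_b", 2, 1},
            {"rel_c", 3, 1},    /* a backend will write one of these */
        };
        /* Order in which the buffer scan hits dirty buffers; note that
         * buffers of the same relation generally won't be adjacent, which
         * is the clustering problem mentioned above. */
        int  write_order[] = {1, 2, 1, 2};
        int  nwrites = sizeof(write_order) / sizeof(write_order[0]);
        int  i;

        /* Phase 1: fsyncs with no remaining writes can happen right away. */
        for (i = 0; i < NRELS; i++)
            if (rels[i].needs_fsync && rels[i].pending_writes == 0)
                do_fsync(&rels[i]);

        /* Phase 2: write buffers, syncing a relation the moment its
         * pending-write count hits zero. */
        for (i = 0; i < nwrites; i++)
        {
            CkptRelState *rel = &rels[write_order[i]];

            printf("write one buffer of %s\n", rel->relname);
            if (--rel->pending_writes == 0 && rel->needs_fsync)
                do_fsync(rel);
        }

        /* Phase 3: whatever never hit zero (here rel_c, because someone
         * else wrote one of its buffers for us) gets synced at the end. */
        for (i = 0; i < NRELS; i++)
            if (rels[i].needs_fsync)
                do_fsync(&rels[i]);

        return 0;
    }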
Idea #3: Stick with the idea of a fixed delay between fsyncs, but compute how many fsyncs you think you're ultimately going to need at the start of the checkpoint, and back up the target completion time by 3 s per fsync from the get-go, so that the checkpoint still finishes on schedule.

Idea #4: For ext3 filesystems that like to dump the entire buffer cache instead of only the requested file, write a little daemon that runs alongside of (and completely independently of) PostgreSQL. Every 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and closes the file, thus dumping the cache and preventing a ridiculous growth in the amount of data to be sync'd at checkpoint time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company