I have thought a while about this and I have some ideas. Ideally, we would be able to trickle the sync of individual blocks during the checkpoint, but we can't, because we rely on fsync() to make the kernel write out every dirty block that hasn't yet made it to disk. We could trickle the fsync() calls themselves, but that just lengthens the window and increases the amount of post-checkpoint-dirtied data we end up writing. In an ideal world, we would be able to fsync() only part of a file at a time, and only those blocks that were dirtied pre-checkpoint, but I don't see that happening anytime soon (which is one reason many commercial databases bypass the kernel cache).
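Just to make the "fsync() only part of a file" idea concrete, something like the following untested sketch would be the shape of it, using Linux's sync_file_range(). It is Linux-only, and it does not flush file metadata or the drive's write cache, so it is not a durability substitute for fsync(); the block size and the function name here are purely illustrative:

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>

#define REL_BLCKSZ  8192        /* illustrative; PostgreSQL's default BLCKSZ */

/*
 * Sketch only: push a single block of an already-open relation file
 * toward disk, without forcing out every other dirty page of that file
 * the way fsync() does.
 */
static int
flush_one_block(int fd, unsigned int blkno)
{
    off64_t offset = (off64_t) blkno * REL_BLCKSZ;

    return sync_file_range(fd, offset, REL_BLCKSZ,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

Even with something like this, the checkpoint would still need a final fsync()/fdatasync() per file to cover metadata, so at best it spreads out the data writes rather than replacing the final sync.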
So, in the real world, one conclusion seems to be that our existing method of tuning the background writer just isn't good enough for the average user:

    #bgwriter_delay = 200ms           # 10-10000ms between rounds
    #bgwriter_lru_percent = 1.0       # 0-100% of LRU buffers scanned/round
    #bgwriter_lru_maxpages = 5        # 0-1000 buffers max written/round
    #bgwriter_all_percent = 0.333     # 0-100% of all buffers scanned/round
    #bgwriter_all_maxpages = 5        # 0-1000 buffers max written/round

These settings control what the bgwriter does, but they do not clearly relate to checkpoint timing, which is the whole point of the bgwriter, and they don't change during the checkpoint interval, which is also less than ideal. If set too aggressively, the bgwriter writes too much; if set too low, the checkpoint does too much I/O. We clearly need more bgwriter activity as the checkpoint approaches, and the behavior should be more auto-tuned, like many of our other parameters. I think we created these settings largely to see how they worked in the field, so it is probably time to reevaluate them based on field reports.

I think the bgwriter should keep track of how far away the next checkpoint is, and use that information to increase its write activity. Right now, during a checkpoint, the bgwriter does a full buffer scan and fsync's all dirty files, so it jumps from the parameter-defined behavior straight to 100% activity. Ideally we could ramp the writes up so that when we are 95% of the way to the next checkpoint, the bgwriter is doing 95% of the activity it would do during the checkpoint itself (a rough sketch of what I mean is at the bottom of this mail). My guess is that this would give us much smoother performance, because the period just after a checkpoint already carries extra WAL writes for newly-dirtied pages, while the new setup would shift the bgwriter's write activity toward the time just before the checkpoint.

One other idea is for the bgwriter to use O_DIRECT or O_SYNC to bypass the kernel cache, so we are sure the data is on disk by checkpoint time. We avoided that in the past because second-guessing the kernel's disk I/O scheduling algorithms is expensive.

---------------------------------------------------------------------------

Tom Lane wrote:
> "Kevin Grittner" <[EMAIL PROTECTED]> writes:
> > "Jim C. Nasby" <[EMAIL PROTECTED]> wrote:
> >> Generally, I try and configure the all* settings so that you'll get 1
> >> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
> >> have any actual tests to back that methodology up.
>
> > We got to these numbers somewhat scientifically. I studied I/O
> > patterns under production load and figured we should be able to handle
> > about 800 writes in per 200 ms without causing problems. I have to
> > admit that I based the percentages and the ratio between "all" and "lru"
> > on gut feel after musing over the documentation.
>
> I like Kevin's settings better than what Jim suggests. If the bgwriter
> only makes one sweep between checkpoints then it's hardly going to make
> any impact at all on the number of dirty buffers the checkpoint will
> have to write. The point of the bgwriter is to reduce the checkpoint
> I/O spike by doing writes between checkpoints, and to have any
> meaningful impact on that, you'll need it to make the cycle several times.
>
> Another point here is that you want checkpoints to be pretty far apart
> to minimize the WAL load from full-page images. So again, a bgwriter
> that's only making one loop per checkpoint is not gonna be doing much.
>
> I wonder whether it would be feasible to teach the bgwriter to get more
> aggressive as the time for the next checkpoint approaches? Writes
> issued early in the interval have a much higher probability of being
> wasted (because the page gets re-dirtied later). But maybe that just
> reduces to what Takahiro-san already suggested, namely that
> checkpoint-time writes should be done with the same kind of scheduling
> the bgwriter uses outside checkpoints. We still have the problem that
> the real I/O storm is triggered by fsync() not write(), and we don't
> have a way to spread out the consequences of fsync().
>
> 			regards, tom lane

--
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
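P.S. Here is a rough sketch of the kind of checkpoint-aware ramping I have in mind for the bgwriter. This is not working code, just a C sketch: checkpoint_progress(), MAX_CHECKPOINT_RATE, and the linear shape of the ramp are all placeholders for whatever we would actually settle on.

/*
 * Sketch only: decide how many dirty buffers the bgwriter should write
 * this round, ramping from the configured baseline up toward
 * checkpoint-level activity as the next checkpoint gets closer.
 */
static int
bgwriter_pages_this_round(int bgwriter_all_maxpages)
{
    /*
     * checkpoint_progress() is hypothetical: 0.0 right after a
     * checkpoint, 1.0 when the next one is due, derived from
     * checkpoint_timeout and/or WAL segments consumed versus
     * checkpoint_segments.
     */
    double      progress = checkpoint_progress();

    /*
     * MAX_CHECKPOINT_RATE is hypothetical too: roughly the number of
     * buffers a checkpoint-style full scan would write per
     * bgwriter_delay round.
     */
    int         checkpoint_rate = MAX_CHECKPOINT_RATE;

    if (progress < 0.0)
        progress = 0.0;
    if (progress > 1.0)
        progress = 1.0;

    /*
     * Linear ramp: at 95% of the way to the next checkpoint, do roughly
     * 95% of checkpoint-level write activity.
     */
    return bgwriter_all_maxpages +
        (int) (progress * (checkpoint_rate - bgwriter_all_maxpages));
}

Whether the ramp should be linear, and whether it should drive maxpages, the scan percentage, or both, is exactly the kind of thing we would need field reports to settle.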