On Mon, Feb 7, 2011 at 10:48 AM, Bruce Momjian <br...@momjian.us> wrote:
> Robert Haas wrote:
>> On Sat, Feb 5, 2011 at 4:31 PM, Bruce Momjian <br...@momjian.us> wrote:
>> > Uh, in this C comment:
>> >
>> > +        * or not we want to take the time to write it.  We allow up to 5% of
>> > +        * otherwise-not-dirty pages to be written due to hint bit changes,
>> >
>> > 5% of what?  5% of all buffers?  5% of all hint-bit-dirty ones?  Can you
>> > clarify this in the patch?
>>
>> 5% of buffers that are hint-bit-dirty but not otherwise dirty.  ISTM
>> that's exactly what the comment you just quoted says on its face, but
>> I'm open to some other wording you want to propose.
>
> How about:
>
>     otherwise-not-dirty -> only-hint-bit-dirty
>
> So 95% of your hint bit modifications are discarded if the page is not
> otherwise dirtied?  That seems pretty radical.
No, it's more subtle than that, although I admit it *is* radical.

There are three ways that pages can get written out to disk:

1. Checkpoints.

2. Background writer activity.

3. Backends writing out dirty buffers themselves because there are no clean buffers available to allocate.

What the latest version of the patch implements is:

1. Checkpoints no longer write only-hint-bit-dirty pages to disk. Since a checkpoint doesn't evict pages from memory, the hint bits are still there to be written out (or not) by (2) or (3), below.

2. When the background writer's cleaning scan hits an only-hint-bit-dirty page, it writes it, same as before. This definitely doesn't result in the loss of any hint bits.

3. When a backend writes out dirty buffers itself, because there are no clean buffers available to allocate, it initially writes them. But if there are more than 100 such pages per block of 2000 allocations, it recycles any after the first 100 without writing them (a rough sketch of this accounting appears below).

In normal operation, I suspect that there will be very little impact from this change. The change described in #1 may slightly reduce the size of some checkpoints, but it's unclear whether that will be enough to be material. The change described in #3 will probably also not matter, because, in a well-tuned system, the background writer should be set aggressively enough to provide a supply of clean pages; backends therefore shouldn't be doing many writes themselves, most buffer allocations will be of already-clean pages, and the logic described in #3 will probably never kick in. Even if backends are writing a lot of buffers themselves, the logic in #3 still won't kick in if many of the pages being written are actually dirty - it will only matter if the backends are writing out lots and lots of pages *solely because they are only-hint-bit-dirty*.

Where I expect this to make a big difference is on sequential scans of just-loaded tables. In that case, the BufferAccessStrategy machinery will force the backend to reuse the same buffers over and over again, and all of those pages will be only-hint-bit-dirty. So the backend has to do a write for every page it allocates, and even though those writes are being absorbed by the OS cache, it's still slow. With this patch, what will happen is that the backend will write about 100 pages, then perform the next 1900 allocations without writing, then write another 100 pages, and so on. So at the end of the scan, instead of having written an amount of data equal to the size of the table, we will have written 5% of that amount, and 5% of the hint bits will be on disk. Each subsequent scan will get another 5% of the hint bits onto disk, until after 20 scans they are all set. So the work of setting the hint bits is spread out across the first 20 table scans instead of all being done the first time through.

Clearly, there's further jiggering that can be done here. But the overall goal is simply that some of our users don't seem to like it when the first scan of a newly loaded table generates a huge storm of *write* traffic. Given that the hint bits appear to be quite important from a performance perspective (see benchmark numbers upthread), we don't really have the option of just not writing them - but we can try not to do it all at once, if we think that's an improvement, which I think is likely.
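Just to illustrate the accounting in #3, here is a minimal sketch of what the throttle amounts to, written as standalone C rather than in terms of the real bufmgr code. The constants match the description above (100 only-hint-bit-dirty writes allowed per block of 2000 allocations, i.e. 5%), but the function and variable names are invented for the example and are not the ones the patch actually uses:

    #include <stdbool.h>

    #define ALLOCS_PER_BLOCK        2000    /* allocations per accounting block */
    #define HINT_WRITES_PER_BLOCK    100    /* 5% of the block */

    static int  allocs_this_block = 0;
    static int  hint_writes_this_block = 0;

    /* Call once for every buffer allocation this backend performs. */
    static void
    note_buffer_allocation(void)
    {
        if (++allocs_this_block >= ALLOCS_PER_BLOCK)
        {
            /* Start a new block of 2000 allocations with a fresh 5% budget. */
            allocs_this_block = 0;
            hint_writes_this_block = 0;
        }
    }

    /*
     * Call when the victim buffer is dirty only because of hint-bit changes.
     * Returns true if it should be written before being recycled, false if
     * the hint bits should simply be discarded and the buffer reused.
     */
    static bool
    write_hint_only_victim(void)
    {
        if (hint_writes_this_block < HINT_WRITES_PER_BLOCK)
        {
            hint_writes_this_block++;
            return true;        /* still within the 100-page budget */
        }
        return false;           /* over budget: recycle without writing */
    }

Pages that are dirty for any other reason never reach this check and are always written; the throttle only affects buffers whose sole change since they were last written is a hint bit.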
Overall, I'm inclined to move this patch to the next CommitFest and forget about it for now. I don't think we're going to get enough testing of this in the next week to be really confident that it's right. I might be willing to commit with a more moderate amount of testing if we were right at the beginning of a development cycle, figuring that we'd shake out any warts as the cycle went along, but this doesn't seem like the right time for that kind of change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company