On Tue, Jan 14, 2014 at 11:57 AM, James Bottomley
<james.bottom...@hansenpartnership.com> wrote:
> On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote:
>> On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
>> <james.bottom...@hansenpartnership.com> wrote:
>> > No, I'm sorry, that's never going to be possible. No user space
>> > application has all the facts. If we give you an interface to force
>> > unconditional holding of dirty pages in core you'll livelock the system
>> > eventually because you made a wrong decision to hold too many dirty
>> > pages. I don't understand why this has to be absolute: if you advise
>> > us to hold the pages dirty and we do up until it becomes a choice to
>> > hold on to the pages or to thrash the system into a livelock, why would
>> > you ever choose the latter? And if, as I'm assuming, you never would,
>> > why don't you want the kernel to make that choice for you?
>>
>> If you don't understand how write-ahead logging works, this
>> conversation is going nowhere. Suffice it to say that the word
>> "ahead" is not optional.
>
> No, I do ... you mean the order of write out, if we have to do it, is
> important. In the rest of the kernel, we do this with barriers which
> causes ordered grouping of I/O chunks. If we could force a similar
> ordering in the writeout code, is that enough?

Probably not. There are a whole raft of problems here. For that to be
of any use, we'd have to move to mmap()ing each buffer instead of
read()ing them in, and apparently mmap() doesn't scale well to
millions of mappings. And even if it did, then we'd have a solution
that only works on Linux. Plus, as Tom pointed out, there are
critical sections where it's not just a question of ordering but in
fact you need to completely hold off writes.

In terms of avoiding double-buffering, here's my thought after reading
what's been written so far. Suppose we read a page into our buffer
pool. As long as the page is clean, it would be ideal for the mapping
to be shared between the buffer cache and our pool, sort of like
copy-on-write. That way, if we decide to evict the page, it will
still be in the OS cache if we end up needing it again (remember, the
OS cache is typically much larger than our buffer pool). But if the
page is dirtied, then instead of copying it, just have the buffer pool
forget about it, because at that point we know we're going to write
the page back out anyway before evicting it.

This would be pretty similar to copy-on-write, except without the
copying. It would just be forget-from-the-buffer-pool-on-write.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
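
The forget-on-write idea in the message above can be illustrated with a
toy model. This is only a sketch under stated assumptions, not anything
the kernel or PostgreSQL implements: the class ToySystem and all of its
fields are hypothetical, "sharing" is modelled as two dictionaries
pointing at the same Python object, and it assumes the side that drops
its reference when a page is dirtied is the OS cache (one reading of the
proposal). It also assumes that writing a dirty page back puts it in the
OS cache again.

```python
class ToySystem:
    """Toy model: one OS page cache and one small database buffer pool."""

    def __init__(self, disk, pool_capacity):
        self.disk = dict(disk)   # block number -> bytes on stable storage
        self.os_cache = {}       # block number -> bytearray held by the OS
        self.pool = {}           # block number -> bytearray held by the database
        self.dirty = set()       # blocks modified in the pool, not yet written
        self.pool_capacity = pool_capacity
        self.disk_reads = 0      # how often we had to go to stable storage

    def read(self, block):
        """Bring a block into the pool, sharing the OS cache's copy."""
        if block not in self.pool:
            if len(self.pool) >= self.pool_capacity:
                # Evict the oldest entry (FIFO, purely for simplicity).
                self.evict(next(iter(self.pool)))
            if block not in self.os_cache:
                self.disk_reads += 1
                self.os_cache[block] = bytearray(self.disk[block])
            # Clean page: both sides reference the same object, no second copy.
            self.pool[block] = self.os_cache[block]
        return self.pool[block]

    def write(self, block, data):
        """Modify a page in the pool; the OS side forgets it instead of copying."""
        page = self.read(block)
        self.os_cache.pop(block, None)  # sharing ends here, nothing is copied
        page[:] = data
        self.dirty.add(block)

    def evict(self, block):
        """Drop a block from the pool, writing it back first if it is dirty."""
        page = self.pool.pop(block)
        if block in self.dirty:
            self.disk[block] = bytes(page)
            self.os_cache[block] = page  # assumption: write-back repopulates the OS cache
            self.dirty.discard(block)


# Usage: a pool with room for a single page.
system = ToySystem({1: b"aaaa", 2: b"bbbb"}, pool_capacity=1)
system.read(1)             # clean: pool and OS cache share one object
system.write(1, b"AAAA")   # dirtied: only the pool holds block 1 now
system.read(2)             # pool is full, so dirty block 1 is written back first
system.read(1)             # served from the OS cache, no extra disk read
```

In this model a clean page evicted from the pool is still available from
the OS cache, while a dirty page exists in exactly one place until it is
written back, which is the double-buffering saving the proposal is after.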