Trond Myklebust <tron...@gmail.com> writes:
> On Jan 14, 2014, at 10:39, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> "Don't be aggressive" isn't good enough.  The prohibition on early write
>> has to be absolute, because writing a dirty page before we've done
>> whatever else we need to do results in a corrupt database.  It has to
>> be treated like a write barrier.

> Then why are you dirtying the page at all? It makes no sense to tell the
> kernel "we're changing this page in the page cache, but we don't want you to
> change it on disk": that's not consistent with the function of a page cache.

As things currently stand, we dirty the page in our internal buffers,
and we don't write it to the kernel until we've written and fsync'd the
WAL data that needs to get to disk first.  The discussion here is about
whether we could somehow avoid double-buffering between our internal
buffers and the kernel page cache.

I personally think there is no chance of using mmap for that; the
semantics of mmap are pretty much dictated by POSIX and they don't work
for this.  However, disregarding the fact that the two communities
speaking here don't control the POSIX spec, you could maybe imagine
making it work if *both* pending WAL file contents and data file
contents were mmap'd, and there were kernel APIs allowing us to say
"you can write this mmap'd page if you want, but not till you've written
that mmap'd data over there".  That'd provide the necessary
write-barrier semantics, and avoid the cache coherency question, because
all the data visible to the kernel could be thought of as the "current"
filesystem contents; it just might not all have reached disk yet, which
is the behavior of the kernel disk cache already.

I'm dubious that this sketch is implementable with adequate efficiency,
though, because in a live system the kernel would be forced to deal with
a whole lot of active barrier restrictions.  Within Postgres we can
reduce write-ordering tests to a very simple comparison: don't write
this page until WAL is flushed to disk at least as far as WAL sequence
number XYZ.  I think any kernel API would have to be a great deal more
general and thus harder to optimize.

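To make the "very simple comparison" concrete, here is a minimal toy
model in Python.  It is only a sketch under stated assumptions, not
PostgreSQL's actual code: the names Wal, Page, BufferPool, can_write and
write_page are hypothetical, LSNs are plain integers, and "disk" is a
dictionary standing in for handing the page to the kernel.

```python
from dataclasses import dataclass


@dataclass
class Wal:
    """Toy write-ahead log (hypothetical): LSNs are just increasing integers."""
    insert_lsn: int = 0    # LSN of the most recently appended record
    flushed_lsn: int = 0   # everything up to this LSN is assumed durable

    def append(self) -> int:
        # Append one record in memory and return its LSN.
        self.insert_lsn += 1
        return self.insert_lsn

    def flush(self, upto: int) -> None:
        # Stands in for writing and fsync'ing the WAL up to `upto`.
        self.flushed_lsn = max(self.flushed_lsn, min(upto, self.insert_lsn))


@dataclass
class Page:
    data: str = ""
    lsn: int = 0           # LSN of the last WAL record that touched this page
    dirty: bool = False


class BufferPool:
    """Toy buffer manager (hypothetical) enforcing WAL-before-data ordering."""

    def __init__(self, wal: Wal) -> None:
        self.wal = wal
        self.disk = {}     # page_id -> contents already handed to the kernel

    def modify(self, page: Page, new_data: str) -> None:
        # Log the change first, then dirty the page in the internal buffer.
        page.lsn = self.wal.append()
        page.data = new_data
        page.dirty = True

    def can_write(self, page: Page) -> bool:
        # The whole ordering test: has WAL been flushed at least this far?
        return page.lsn <= self.wal.flushed_lsn

    def write_page(self, page_id: int, page: Page) -> None:
        # Never let the page out before its WAL; flush WAL first if needed.
        if not self.can_write(page):
            self.wal.flush(page.lsn)
        self.disk[page_id] = page.data
        page.dirty = False
```

The point of the sketch is that the ordering constraint per page is a
single integer comparison against one global "flushed so far" value,
which is much narrower than a general "write A before B" barrier
between arbitrary mmap'd pages.
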
Another difficulty with merging our internal buffers with the kernel
cache is that when we're in the process of applying a change to a page,
there are intermediate states of the page data that should under no
circumstances reach disk (e.g., we might need to shuffle records around
within the page).  We can deal with that fairly easily right now by not
issuing a write() while a page change is in progress.  I don't see that
it's even theoretically possible in an mmap'd world; there are no atomic
updates to an mmap'd page that are larger than whatever is an atomic
update for the CPU.

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)