On Mon, Aug 3, 2020 at 5:26 AM Daniel Wood <hexexp...@comcast.net> wrote:
> If we can't eliminate FPWs, can we at least reduce their impact? Instead of
> writing the before-images of pages inline into the WAL, which increases
> COMMIT latency, write those same images to a separate physical log file.
> The key idea is that I don't believe COMMITs require these buffers to be
> immediately flushed to the physical log. We only need to flush them before
> the dirty pages are written. This delay allows the before-image I/Os to be
> decoupled and done efficiently, without impacting COMMITs.
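For concreteness, the ordering rule proposed above (flush the before-image log only when a dirty page is about to be written back, not at COMMIT time) could look roughly like the sketch below. This is purely illustrative: the file descriptors, the "flushed up to" bookkeeping, and the per-page requirement value are invented names, not part of any existing PostgreSQL interface.

    /* Illustrative sketch only -- invented names, error handling omitted. */
    #include <sys/types.h>
    #include <unistd.h>

    extern int   before_image_log_fd;        /* hypothetical separate physical log */
    extern off_t before_image_flushed_upto;  /* how far that log is known durable */

    /*
     * Hypothetical hook run before a dirty data page is written back.
     * COMMIT never waits on before_image_log_fd; only page write-back does.
     */
    static void
    flush_before_images_then_write_page(int data_fd, const char *page,
                                        off_t page_offset, off_t needed_upto)
    {
        if (before_image_flushed_upto < needed_upto)
        {
            (void) fsync(before_image_log_fd);    /* decoupled from COMMIT latency */
            before_image_flushed_upto = needed_upto;
        }

        /* Only now is it safe to overwrite the 8 kB page in place. */
        (void) pwrite(data_fd, page, 8192, page_offset);
    }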
I think this is what's called a double-write buffer, or at least it's what
was tried some years ago under that name. A significant problem is that you
have to fsync() the double-write buffer before you can write the WAL. So
instead of this:

- write WAL to OS
- fsync WAL

you have to do this:

- write double-write buffer to OS
- fsync double-write buffer
- write WAL to OS
- fsync WAL

Note that you cannot overlap these steps -- the first fsync must complete
before the second write can begin, or else you might try to replay WAL for
which the double-write buffer information is not available. Because of this,
I think this is actually quite expensive. COMMIT requires the WAL to be
flushed, unless you configure synchronous_commit=off, so this would double
the number of fsyncs we have to do.

It's not as bad as all that, because the individual fsyncs would be smaller,
and that makes a significant difference. For a big transaction that writes a
lot of WAL, you'd probably not notice much difference: instead of writing
1000 pages to WAL, you might write 770 pages to the double-write buffer and
270 pages to WAL, or something like that. But for short transactions, such
as those performed by pgbench, you'd probably end up with a lot of cases
where you had to write 3 pages instead of 2. Not only that, but the writes
would have to be consecutive rather than simultaneous, and to different
parts of the disk rather than sequential. That would likely hurt a lot.

It's entirely possible that these kinds of problems could be mitigated
through really good engineering, maybe to the point where this kind of
solution outperforms what we have now for some or even all workloads, but it
seems equally possible that it's just always a loser. I don't really know.
It seems like a very difficult project.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
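To make the sequencing constraint above concrete, here is a minimal sketch of the commit path being described, written in plain C with invented file descriptors and no error handling; it is not actual PostgreSQL code. The point is only that the fsync() of the double-write buffer must finish before the WAL write can even start, so the two flushes cannot be overlapped.

    /* Illustrative sketch only -- invented names, error handling omitted. */
    #include <stddef.h>
    #include <unistd.h>

    extern int dwb_fd;   /* hypothetical double-write buffer file */
    extern int wal_fd;   /* WAL segment file */

    static void
    commit_with_double_write_buffer(const char *page_images, size_t images_len,
                                    const char *wal_records, size_t wal_len)
    {
        /* 1. Write the full-page images to the double-write buffer. */
        (void) write(dwb_fd, page_images, images_len);

        /* 2. Make them durable first; WAL replay must not outrun them. */
        (void) fsync(dwb_fd);

        /* 3. Only now is it safe to write the (smaller) WAL records. */
        (void) write(wal_fd, wal_records, wal_len);

        /* 4. Flush WAL; COMMIT waits here unless synchronous_commit=off. */
        (void) fsync(wal_fd);
    }

Steps 2 and 3 are strictly ordered, which is where the extra round trip on the commit path comes from compared with the single write-then-fsync of WAL today.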