On Thu, Dec 22, 2011 at 11:16 AM, Kevin Grittner
<kevin.gritt...@wicourts.gov> wrote:
> Jignesh Shah <jks...@gmail.com> wrote:
>
>> When we use Doublewrite with checksums, we can safely disable
>> full_page_write causing a HUGE reduction to the WAL traffic
>> without loss of reliability due to a write fault since there are
>> two writes always. (Implementation detail discussable).
>
> The "always" there surprised me. It seemed to me that we only need
> to do the double-write where we currently do full page writes or
> unlogged writes. In thinking about your message, it finally struck

Currently PG only does a full page write for the first change that
makes a page dirty after a checkpoint. This scheme works because all
later changes are relative to that first page image, so when a
checkpoint write fails, the page can be recreated from the full page
write plus all the delta changes in WAL.

In the double write implementation, every checkpoint write is double
written. If the first write (to the doublewrite area) fails, then the
original page is not corrupted, and if the second write (to the actual
data page) fails, then one can recover it from the earlier write.

Now, while it is true that there are 2X writes during checkpoint, I
can argue that there are the same 2X writes right now, except that 1X
of the writes goes to WAL DURING TRANSACTION COMMIT. Also, since the
doublewrite area is generally written in its own file, it is
essentially sequential, so it doesn't have the same write latencies as
the actual checkpoint write. So if you look at the net amount of
writes, it is the same.

For unlogged tables, even if you do doublewrite it is not much of a
penalty, although those pages may not have been logged in the WAL
before. By doing the double write for them, it is still safe and gives
those tables resilience, even though it is not required. The net
result is that the underlying page is never "irrecoverable" due to
failed writes.

> me that this might require a WAL record to be written with the
> checksum (or CRC; whatever we use). Still, writing a WAL record
> with a CRC prior to the page write would be less data than the full
> page. Doing double-writes instead for situations without the torn
> page risk seems likely to be a net performance loss, although I have
> no benchmarks to back that up (not having a double-write
> implementation to test). And if we can get correct behavior without
> doing either (the checksum WAL record or the double-write), that's
> got to be a clear win.

I am not sure why one would want to write the checksum to WAL.
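
To make the ordering argument above concrete, here is a rough sketch in
Python. It is not PostgreSQL code and not a proposed implementation:
the Store class, the function names, the 4-byte CRC32 prefix, and the
"tear at half a page" crash model are all assumptions made up for
illustration. It only shows why, with a checksum on every page, a torn
write in either location leaves one intact copy to recover from.

```python
import zlib

PAGE_SIZE = 8192  # PostgreSQL's default block size


def seal(payload: bytes) -> bytes:
    """Prefix the payload with a CRC32 so a torn page can be detected."""
    assert len(payload) == PAGE_SIZE - 4
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def is_intact(page: bytes) -> bool:
    """True if the page is full-sized and its stored CRC matches its contents."""
    return (len(page) == PAGE_SIZE
            and zlib.crc32(page[4:]) == int.from_bytes(page[:4], "big"))


def torn(old: bytes, new: bytes, nbytes: int) -> bytes:
    """Simulate a write that stopped after nbytes: new prefix over the old image."""
    return new[:nbytes] + old[nbytes:]


class Store:
    """Hypothetical in-memory stand-in for the data file and doublewrite file."""

    def __init__(self):
        self.datafile = {}     # page number -> page image
        self.doublewrite = {}  # page number -> page image


def checkpoint_write(store, pageno, new_page, crash_in=None):
    """Write new_page to the doublewrite area first, then to the data file.

    crash_in may be "doublewrite" or "datafile" to simulate a torn write
    at that step; nothing after the crash point is executed.
    """
    half = PAGE_SIZE // 2
    if crash_in == "doublewrite":
        old = store.doublewrite.get(pageno, bytes(PAGE_SIZE))
        store.doublewrite[pageno] = torn(old, new_page, half)
        return  # crashed before the data file was touched
    store.doublewrite[pageno] = new_page
    if crash_in == "datafile":
        old = store.datafile.get(pageno, bytes(PAGE_SIZE))
        store.datafile[pageno] = torn(old, new_page, half)
        return
    store.datafile[pageno] = new_page


def recover(store):
    """Repair torn data pages from intact doublewrite copies; return repaired page numbers."""
    repaired = []
    for pageno, copy in store.doublewrite.items():
        if not is_intact(store.datafile.get(pageno, b"")) and is_intact(copy):
            store.datafile[pageno] = copy
            repaired.append(pageno)
    return repaired
```

If the crash hits the doublewrite step, the data page still holds the
old, intact image and recover() has nothing to do. If it hits the data
file step, recover() copies the intact doublewrite image back over the
torn page.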

As for the double writes, in fact there is not a net loss, because:

(a) The writes to the doublewrite area are sequential, so the write
calls are relatively very fast and in fact do not cause any latency
increase to any transactions, unlike full_page_writes.

(b) The doublewrite area can be moved to a different location to put
no stress on the default tablespace, if you are worried about that
spindle handling 2X writes. (The same concern is mitigated for
full_page_writes if you move pg_xlog to a different spindle.)

My own tests support that the net result is almost as fast as
full_page_writes=off, but not the same due to the extra write (which
gives you the desired reliability), and way better than
full_page_writes=on.

Regards,
Jignesh

> -Kevin