On Wed, May 25, 2011 at 10:09 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
> On Wed, May 25, 2011 at 9:34 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Tue, May 24, 2011 at 10:52 PM, Jeff Davis <pg...@j-davis.com> wrote:
>>> On Tue, 2011-05-24 at 16:34 -0400, Robert Haas wrote:
>>>> As I think about it a bit more, we'd
>>>> need to XLOG not only the parts of the page we actually modifying, but
>>>> any that the WAL record would need to be correct on replay.
>>>
>>> I don't understand that statement. Can you clarify?
>>
>> I'll try. Suppose we have two WAL records A and B, with no
>> intervening checkpoint, that both modify the same page. A reads chunk
>> 1 of that page and then modifies chunk 2. B modifies chunk 1. Now,
>> suppose we make A do a "partial page write" on chunk 2 only, and B do
>> the same for chunk 1. At the point the system crashes, A and B are
>> both on disk, and the page has already been written to disk as well.
>> Replay begins from a checkpoint preceding A.
>>
>> Now, when we get to the record for A, what are we to do? If it were a
>> full page image, we could just restore it, and everything would be
>> fine after that. But if we replay the partial page write, we've got
>> trouble. A will now see the state of the chunk 1 as it existed after
>> the action protected by B occurred, and will presumably do the wrong
>> thing.
>
> If this is really true, full page writes would also cause the similar problem.
> No? Imagine the case where A reads page 1, then modifies page 2, and B
> modifies page 1. At the recovery, A will see the state of page 1 as it existed
> after the action by B.

Yeah, but it won't matter, because the LSN interlock will prevent A
from taking any action. If you only write parts of the page, though,
the concept of "the" LSN of the page becomes a bit murky, because you
may have different parts of the page from different points in the WAL
stream. I believe it's possible to cope with that if we design it
carefully, but it does seem rather complex and error-prone (which is
not necessarily the best design for a recovery system, but hey).

Anyway, you can either have the partial page write for A restore the
older LSN, or not. If you do, then you have the problem as I
described it. If you don't, then the effects of A vanish into the
ether. Either way, it doesn't work.

> The replay of the WAL record for A doesn't rely on the content of chunk 1
> which B modified. So I don't think that "partial page writes" has such
> a problem.
> No?

Sorry. WAL records today DO rely on the prior state of the page. If
they didn't, we wouldn't need full page writes. They don't rely on
it terribly heavily - things like where pd_upper is pointing, and
what the page LSN is. But they do rely on it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
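
For readers who want the LSN interlock argument spelled out, here is a toy
model in Python. It is only a sketch under stated assumptions, not PostgreSQL
code: the Page class, the replay() helper, the two redo functions, and the
LSN values 5, 10 and 20 are all hypothetical names and numbers chosen for
illustration. It shows two things from the discussion above: with the page
LSN intact, replay skips A, so A never looks at chunk 1 as B left it; and if
something rewinds the page LSN without also restoring chunk 1, A's redo runs
against B's version of chunk 1 and produces a different result.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Toy page: one LSN plus two independently modifiable chunks."""
    lsn: int
    chunks: dict = field(default_factory=dict)


def redo_a(page):
    # Hypothetical record A: derives chunk 2 from whatever chunk 1 holds.
    page.chunks[2] = "A saw " + page.chunks[1]


def redo_b(page):
    # Hypothetical record B: overwrites chunk 1.
    page.chunks[1] = "new"


def replay(page, record_lsn, redo):
    """LSN interlock: skip the record if the page is already at or past it."""
    if page.lsn >= record_lsn:
        return False
    redo(page)
    page.lsn = record_lsn
    return True


def crashed_page():
    # Assumed on-disk state at crash time: A (LSN 10), then B (LSN 20),
    # were both applied before the page was flushed.
    return Page(lsn=20, chunks={1: "new", 2: "A saw old"})


# Case 1: the page LSN is intact, so both records are skipped on replay.
intact = crashed_page()
applied_intact = [replay(intact, 10, redo_a), replay(intact, 20, redo_b)]

# Case 2: the page LSN is rewound to before A while chunk 1 still holds
# B's data (an assumption standing in for "restore the older LSN").
rewound = crashed_page()
rewound.lsn = 5
applied_rewound = [replay(rewound, 10, redo_a), replay(rewound, 20, redo_b)]

print(applied_intact, intact.chunks)
print(applied_rewound, rewound.chunks)
```

In the first case chunk 2 keeps the value A originally computed. In the
second case A is replayed against chunk 1 as B left it, so chunk 2 ends up
different from what it was before the crash.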