On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi <horikyota....@gmail.com> wrote: > This is the fixed verison v22.
I'd like to offer a few thoughts on this thread and on these patches, which is now more than 4 years old and more than 150 messages in length. First, I'd like to restate my understanding of the problem just to see whether I've got the right idea and whether we're all on the same page. When wal_level=minimal, we sometimes try to skip WAL logging on newly-created relations in favor of fsync-ing the relation at commit time. The idea is that if the transaction aborts or is aborted by a crash, the contents of the relation don't need to be reproduced because they are irrelevant, so no WAL is needed, and if the transaction commits we can't lose any data on a crash because we've already fsync'd, and standbys don't matter because wal_level=minimal precludes having any. However, we're not entirely consistent about skipping WAL-logging: some operations do and others don't, and this causes confusion if a crash occurs, because we might try to replay some of the things that happened to that relation but not all of them. For example, the original poster complained about a sequence of steps where an index truncation was logged but subsequent index insertions were not; a badly-timed crash will replay the truncation but can't replay the index insertions because they weren't logged in the first place; consequently, while the state was actually OK at the beginning of replay, it's no longer OK by the end. Replaying nothing would've been OK, but replaying some things and not others isn't. Second, for anyone who is not following this thread closely but is interested in a summary, I'd like to summarize how I believe that the current patch proposes to solve the problem. As I understand it, the approach taken by the patch is to try to change things so that we log nothing at all for relations created or truncated in the current top-level transaction, and everything for others. To achieve this, the patch makes a number of changes, three of which seem to me to be particularly key. One, the patch changes the relcache infrastructure with the goal of making it possible to reliably identify whether a relation has been created or truncated in the current toplevel transaction; our current code does have tracking for this, but it's not 100% accurate. Two, the patch changes the definition of RelationNeedsWAL() so that it not only checks that the relation is a permanent one, but also that either wal_level != minimal or the relation was not created in the current transaction. It seems to me that if RelationNeedsWAL() is used to gate every test for whether or not to write WAL pertaining to a particular relation, this ought to achieve the desired behavior of logging either everything or nothing. It is not quite clear to me how we can be sure that we use that in every relevant place. Three, the patch replaces the various ad-hoc bits of code which fsync relations which perform unlogged operations on permanent relations with a new tracking mechanism that arranges to perform all of the relevant fsync() calls at commit time. This is further augmented with a mechanism that instead logs all the relation pages in lieu of fsync()ing if the relation is very small, on the theory that logging a few FPIs will be cheaper than an fsync(). I view this additional mechanism as perhaps a bit much for a bug fix patch, but I understand that the goal is to prevent a performance regression, and it's not really over the top, so I think it's probably OK. Third, I'd like to offer a couple of general comments on the state of these patches. Broadly, I think they look pretty good. They seem quite well-engineered to me and as far as I can see the overall approach is sound. I think there are a number of places where the comments could be better; I'll include a few points on that further down. I also think that the code in swap_relation_files() which takes ExclusiveLock on the relations looks quite strange. It's hard to understand why it's needed at all, or why that lock level is used. On the flip side, I think that the test suite looks really impressive and should be of considerable help not only in making sure that this is fixed but detecting if it gets broken again in the future. Perhaps it doesn't cover every scenario we care about, but if that turns out to be the case, it seems like it would be easily to further generalize. I really like the idea of this *kind* of test framework. Comments on comments, and other nitpicking: - in-trasaction is mis-spelled in the doc patch. accidentially is mis-spelled in the 0002 patch. - I think the header comment for the new TAP test could do a far better job explaining the overall goal of this testing than it actually does. - I think somewhere in relcache.c or rel.h there ought to be comments explaining the precise degree to which rd_createSubid, rd_newRelfilenodeSubid, and rd_firstRelfilenodeSubid are reliable, including problem scenarios. This patch removes some language of this sort from CopyFrom(), which was a funny place to have that information in the first place, but I don't see that it adds anything to replace it. I also think that we ought to explain - for the fields that are reliable - that they need to be reliable precisely for the purpose of not breaking this stuff. There's a bit of this right now: + * rd_firstRelfilenodeSubid is the ID of the first subtransaction the + * relfilenode change has took place in the current transaction. Unlike + * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID + * means that the currently active relfilenode is transaction-local and we + * sync the relation at commit instead of WAL-logging. ...but I think that needs to be somewhat expanded and clarified. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company