On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <si...@2ndquadrant.com> wrote: >> Doesn't this seem awfully bad for performance on Hot Standby servers? >> I agree that it fixes the problem with un-WAL-logged pages there, but >> I seem to recall some recent complaining about performance features >> that work on the master but not the standby. Durable hint bits are >> one such feature. > > It's impossible for it to work, in this case, since we cannot write > new WAL to prevent torn pages. > > Note that hint bit setting on a dirty block is allowed, so many hints > will still be set in Hot Standby.
To me, it seems that you are applying a double standard. You have twice attempted to insist that I do extra work to make major features that I worked on - unlogged tables and index-only scans - work in Hot Standby mode, despite the existence of significant technological obstacles. But when it comes to your own feature, you simply state that it cannot be done, and therefore we need not do it. Of course, this feature, like those, CAN be made to work. It just involves solving difficult problems that have little to do with the primary purpose of the patch. To be honest, I don't use Hot Standby enough to care very much about this particular issue, and I'm not saying we should reject it on these grounds. But I do think it's a mistake to dismiss it entirely, since every limitation of Hot Standby narrows the set of cases in which it can be deployed. And at any rate, I want the same standard applied to my work as to yours. >> I am slightly worried that this expansion in the use of this mechanism >> (formerly called inCommit, for those following along at home) could >> lead to checkpoint starvation. Suppose we've got one or two large >> table scans wandering along, setting hint bits, and now suddenly it's >> time to checkpoint. How long will it take the checkpoint process to >> find a time when nobody's got delayChkpt set? > > We don't need to wait until nobody has it set, we just need to wait > for the people that had it set when we first checked to be out of that > state momentarily. Ah... good point. >> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it >> doesn't have a checksum, PD_CHECKSUM2 is not set? What good does that >> do? > > As explained in detailed comments, the purpose of this is to implement > Heikki's suggestion that we have a bit set to zero so we can detect > failures that cause a run of 1s. I think it's nonsensical to pretend that there's anything special about that particular bit. If we want to validate the page header before trusting the lack of a checksum bit, we can do that far more thoroughly than just checking that one bit. There are a whole bunch of bits that ought to always be zero, and there are other things we can validate as well (e.g. LSN not in future). If we're concerned about the checksum-enabled bit getting flipped (and I agree that we should be), we can check some combination of that stuff in the hope of catching it, and that'll be a far better guard than just checking one arbitrarily selected bit. That having been said, I don't feel very good about the idea of relying on the contents of the page to tell us whether or not the page has a checksum. There's no guarantee that an error that flips the has-checksum bit will flip any other bit on the page, or that it won't flip everything else we're relying on as a backstop in exactly the manner that foils whatever algorithm we put in place. Random corruption is, perhaps, unlikely to do that, but somehow I feel like it'll happen more often than random chance suggests. Things never fail the way you want them to. Another disadvantage of the current scheme is that there's no particularly easy way to know that your whole cluster has checksums. No matter how we implement checksums, you'll have to rewrite every table in the cluster in order to get them fully turned on. But with the current design, there's no easy way to know how much of the cluster is actually checksummed. If you shut checksums off, they'll linger until those pages are rewritten, and there's no easy way to find the relations from which they need to be removed, either. I'm tempted to suggest a relation-level switch: when you want checksums, you use ALTER TABLE to turn them on, and when you don't want them any more you use ALTER TABLE to shut them off again, in each case rewriting the table. That way, there's never any ambiguity about what's in the data pages in a given relation: either they're either all checksummed, or none of them are. This moves the decision about whether checksums are enabled or disabled quite a but further away from the data itself, and also allows you to determine (by catalog inspection) which parts of the cluster do in fact have checksums. It might be kind of a pain to implement, though: you'd have to pass the information about how any given relation was configured down to the place where we validate page sanity. I'm not sure whether that's practical. I also think that the question of where exactly the checksum ought to be put might bear some more thought. Tom expressed some concern about stealing the page version field, and it seems to me we could work around that by instead stealing something less valuable. The top eight bits of the pd_pagesize_version field probably meet that criteria, since in practice basically everybody uses 8K blocks, and the number of errors that will be caught by checksums is probably much larger than the number of errors that will be caught by the page-size cross-check. But maybe the other 8 bits should come from somewhere else, maybe pd_tli or pd_flags. For almost all practical purposes, storing only the low-order 8 bits of the TLI ought to provide just as much of a safety check as storing the low-order 16 bits, so I think that might be the way to go. We could set it up so that we always check only the low 8 bits against the TLI, regardless of whether checksums are enabled; then we basically free up that bit space at no cost in code complexity. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers