On 15 March 2013 13:08, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2013-03-15 14:32:57 +0200, Ants Aasma wrote:
>> On Wed, Mar 6, 2013 at 1:34 PM, Heikki Linnakangas
>> <hlinnakan...@vmware.com> wrote:
>> > Fletcher's checksum is good in general, I was mainly worried about
>> > truncating the Fletcher-64 into two 8-bit values. I can't spot any obvious
>> > weakness in it, but if it's indeed faster and as good as a straightforward
>> > Fletcher-16, I wonder why that method is not more widely used.
>>
>> As implemented, the fletcher algorithm results in:
>>
>> checksum low byte  = (blkno + sum over i in [0..N) (x_i)) % 255 + 1
>> checksum high byte = (blkno + sum over i in [0..N) ((N - i)*x_i)) % 255 + 1
>>
>> Where N is the number of 4-byte words in the page and x_i is the i-th
>> word. As modular arithmetic is a ring, it is easy to show that any
>> addition or subtraction of a multiple of 255 = 0xFF will result in no
>> change to the resulting value. The most obvious case here is that you
>> can swap any number of bytes from 0x00 to 0xFF or back without
>> affecting the hash.
>
> I commented on this before; I personally think this property makes fletcher a
> not-so-good fit for this. It's not uncommon for parts of a block to be
> all-zero, and many disk corruptions actually change whole runs of bytes.
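For anyone who wants to see that blind spot concretely, here is a minimal
standalone sketch of the two truncated bytes exactly as described above
(demo_checksum, PAGE_SIZE and the offsets are names and numbers made up for
this illustration, not code from the patch). Overwriting a run of 0x00 bytes
with 0xFF changes each affected word by a multiple of 255, so neither byte
of the checksum moves:

/* Illustrative sketch of the truncated checksum quoted above. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8192
#define NWORDS (PAGE_SIZE / 4)          /* N: number of 4-byte words */

static uint16_t
demo_checksum(const uint32_t *page, uint32_t blkno)
{
    uint64_t    sum1 = blkno;           /* blkno + sum of x_i           */
    uint64_t    sum2 = blkno;           /* blkno + sum of (N - i) * x_i */

    for (uint32_t i = 0; i < NWORDS; i++)
    {
        sum1 += page[i];
        sum2 += (uint64_t) (NWORDS - i) * page[i];
    }

    uint8_t     lo = sum1 % 255 + 1;
    uint8_t     hi = sum2 % 255 + 1;

    return ((uint16_t) hi << 8) | lo;
}

int
main(void)
{
    static uint32_t page[NWORDS];       /* all-zero 8 kB "page" */
    uint32_t    blkno = 42;

    uint16_t    before = demo_checksum(page, blkno);

    /* Corrupt a 256-byte run inside the zeroed region with 0xFF. */
    memset((char *) page + 1024, 0xFF, 256);

    uint16_t    after = demo_checksum(page, blkno);

    /*
     * Each 0x00 -> 0xFF byte changes its word by 255 * 256^k, a multiple
     * of 255, so both sums are unchanged mod 255 and the corruption is
     * missed.
     */
    printf("before=%04X after=%04X %s\n",
           (unsigned) before, (unsigned) after,
           before == after ? "(collision, undetected)" : "(detected)");
    return 0;
}

This prints the same value before and after the corruption, which is exactly
the run-of-0xFF-over-zeros case Andres is worried about.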
I think you're right to pick up on this point, and Ants has done a great job
of explaining the issue more clearly. My perspective, after some thought, is
that this doesn't matter much to the overall effectiveness of the feature.

PG blocks do have large runs of 0x00 in them, but those are in the hole in
the centre of the block. If we don't detect problems there, it's not such a
big deal. Most other data we store doesn't consist of large runs of 0x00 or
0xFF; most data is more complex than that, so runs of 0s or 1s written over
it will still be detected.

So what we need to look at is how this weakness affects the quality of our
detection. My guess is that it might make our detection only 99% effective,
rather than 100% effective. I'm not sure the issue is even that bad, but
let's look at what would happen if it were.

Checksums are for detecting problems. What kind of problems? Sporadic bit
flips, or repeated errors? If we were trying to trap isolated bit changes
then CRC-32 would be appropriate. But I'm assuming that whatever causes the
problem is going to recur, so what we really want is to detect hardware that
is starting to go bad and needs to be replaced. In that case errors show a
repetitive pattern, increasing in frequency and coverage over time; an
"issue" is not an isolated incident, it's the beginning of a series of
related problems. It's much the same as the idea that for every mouse you
see in your house there are another 10 you don't, and if you ignore the
sighting of a mouse, the problem will get worse, often quickly. What we want
to detect is the infestation/mouse colony, not the isolated, non-repeating
visitor.

Running checksums on the whole block gives us about x1000 better chance of
detecting a run of issues than we have with just header checks (rough
arithmetic at the end of this mail). Whether the actual check is 99% or 100%
perfect doesn't alter much the overall *gain* in detection rate we get from
using checksums, so it is less important that the check itself be watertight.
In fact no checksum is watertight; it is a technique that trades performance
for detection quality. Even a detector that spotted only 90% of real errors
would still be a massive gain in overall detection, because we are applying
the check across the whole block.

What we need is a cheap way of detecting problems as early as possible.
Checksums don't prevent disk corruption, they just alert us to its presence,
allowing us to avoid losing data by reverting to backups. If we don't detect
things early enough, we find that reverting to backup doesn't work because
the backed-up blocks are corrupt too.

Fletcher-16 seems to be the best combination of speed v quality. We could
allow people to set their checksum algorithm with a plugin, but that opens
up the possibility of user error: people changing checksum algorithms
without realising that this won't change values already calculated. That
would be a bad usability problem in itself and is almost certain to bite,
since user error is a larger source of real-world problems than hardware
error. So I'd rather not do that.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
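To put rough numbers on the "x1000" claim above, here is a back-of-envelope
sketch. The inputs are assumptions for illustration only, not measurements:
an 8192-byte page, 24 bytes of it covered by the existing header sanity
checks, those checks noticing perhaps a third of the corruptions that land
on them, and the pessimistic 99% whole-block figure used above.

/* Back-of-envelope arithmetic behind the detection-rate argument. */
#include <stdio.h>

int
main(void)
{
    double      block_size   = 8192.0;  /* bytes per page */
    double      header_size  = 24.0;    /* bytes covered by header sanity checks */
    double      header_eff   = 0.33;    /* fraction of header hits those checks notice (assumed) */
    double      checksum_eff = 0.99;    /* pessimistic whole-block detection rate */

    /* Chance that a corruption at a uniformly random offset is noticed. */
    double      header_only = (header_size / block_size) * header_eff;
    double      whole_block = checksum_eff;

    printf("header-only detection ~ %.4f%%\n", header_only * 100.0);
    printf("whole-block detection ~ %.2f%%\n", whole_block * 100.0);
    printf("relative gain          ~ x%.0f\n", whole_block / header_only);
    return 0;
}

With those assumed inputs the gain comes out around x1000; the exact figure
moves with the assumptions, but it stays at two to three orders of magnitude,
which is the point: the 99%-vs-100% question barely registers against it.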