This is per a report by an EnterpriseDB customer and a bunch of off-list analysis by Kevin Grittner and Rahila Syed.
Suppose you have a large relation with OID 123456. There are segment files 123456, 123456.1, and 123456.2. Due to some kind of operating system malfeasance, 123456.1 disappears; you are officially in trouble.

Now, a funny thing happens. The next time you call mdnblocks() on this relation, which will probably happen pretty quickly since every sequential scan does one, it will see that 123456 is a complete segment and it will create an empty 123456.1. It and any future mdnblocks() calls will report that the length of the relation is equal to the length of one full segment, because they don't check for the next segment unless the current segment is completely full.

Now, if subsequent to this an index scan happens to sweep through and try to fetch a block in 123456.2, it will work! This happens because _mdfd_getseg() doesn't care about the length of the segments; it only cares whether or not they exist. If 123456.1 were actually missing, then we'd never test whether 123456.2 exists and we'd get an error. But because mdnblocks() created 123456.1, _mdfd_getseg() is now quite happy to fetch blocks in 123456.2; it considers the empty 123456.1 file to be a sufficient reason to look for 123456.2, and seeing that file, and finding the block it wants therein, it happily returns that block, blithely ignoring the fact that it passed over a completely empty .1 segment before returning a block from .2. This is maybe not the smartest thing ever.

The comment in mdnblocks() says this:

 * Because we pass O_CREAT, we will create the next segment (with
 * zero length) immediately, if the last segment is of length
 * RELSEG_SIZE. While perhaps not strictly necessary, this keeps
 * the logic simple.

I don't really see how this "keeps the logic simple". What it does is allow sequential scans and index scans to have two completely different notions of how many accessible blocks there are in the relation. Granted, this kind of thing really shouldn't happen, but sometimes bad things do happen.
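
To make the mismatch concrete, here is a toy model in C. It is only a sketch of the behavior described above, not PostgreSQL code and not the attached patch: ToyRel, toy_nblocks(), toy_can_fetch(), toy_lose_segment() and toy_demo() are hypothetical names, the "files" are just array slots, and RELSEG_SIZE is set to a tiny value so the arithmetic is easy to follow.

```c
#include <assert.h>
#include <stdbool.h>

#define RELSEG_SIZE 4   /* blocks per segment; tiny value for illustration */
#define MAX_SEGS    8   /* arbitrary cap for this toy model */

/* Toy stand-in for the segment files of one relation. */
typedef struct {
    bool exists[MAX_SEGS];   /* does segment file N exist? */
    int  nblocks[MAX_SEGS];  /* length of segment N, in blocks */
} ToyRel;

/* Simulate a segment file vanishing from disk. */
static void toy_lose_segment(ToyRel *rel, int segno)
{
    rel->exists[segno] = false;
    rel->nblocks[segno] = 0;
}

/*
 * Models the mdnblocks() behavior described above: advance only while the
 * current segment is completely full, and create the next segment (empty)
 * if it is missing.  Segment 0 is assumed to exist.
 */
static int toy_nblocks(ToyRel *rel)
{
    int segno = 0;

    for (;;) {
        int n = rel->nblocks[segno];

        if (n < RELSEG_SIZE || segno + 1 >= MAX_SEGS)
            return segno * RELSEG_SIZE + n;

        segno++;
        if (!rel->exists[segno]) {
            rel->exists[segno] = true;   /* the O_CREAT side effect */
            rel->nblocks[segno] = 0;
        }
    }
}

/*
 * Models the _mdfd_getseg() behavior described above: every segment up to
 * the target must exist, but the lengths of the earlier ones are ignored.
 */
static bool toy_can_fetch(const ToyRel *rel, int blkno)
{
    int target = blkno / RELSEG_SIZE;

    if (target >= MAX_SEGS)
        return false;
    for (int s = 0; s <= target; s++)
        if (!rel->exists[s])
            return false;
    return (blkno % RELSEG_SIZE) < rel->nblocks[target];
}

/* Walk through the scenario: segments 0 and 1 full, segment 2 half full. */
static void toy_demo(void)
{
    ToyRel rel = { {true, true, true}, {RELSEG_SIZE, RELSEG_SIZE, 2} };
    int blk_in_seg2 = 2 * RELSEG_SIZE;

    toy_lose_segment(&rel, 1);                 /* 123456.1 disappears */
    assert(!toy_can_fetch(&rel, blk_in_seg2)); /* missing file: error */

    assert(toy_nblocks(&rel) == RELSEG_SIZE);  /* scan sees one segment */
    assert(rel.exists[1]);                     /* ...and recreated .1 */

    assert(toy_can_fetch(&rel, blk_in_seg2));  /* index fetch now works */
}
```

In this model, a fetch into the third segment fails while the middle segment is missing, but succeeds once toy_nblocks() has recreated it as an empty file, even though toy_nblocks() itself keeps reporting only RELSEG_SIZE blocks.
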
Therefore, I propose the attached patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
RM36310.patch
Description: binary/octet-stream