06.10.2016 20:59, Pavan Deolasee:
I investigated a bug report from one of our customers and it looked
very similar to previous bug reports here [1], [2], [3] (and probably
more). In these reports, the error looks something like this:
ERROR: could not read block 28991 in file "base/16390/572026": read
only 0 of 8192 bytes
I traced it to the following code in MarkBufferDirtyHint(). The
function returns without setting the DIRTY bit on the standby:
3413 /*
3414 * If we're in recovery we cannot dirty a page
because of a hint.
3415 * We can set the hint, just not dirty the page as a
result so the
3416 * hint is lost when we evict the page or shutdown.
3417 *
3418 * See src/backend/storage/page/README for longer
discussion.
3419 */
3420 if (RecoveryInProgress())
3421 return;
3422
freespace.c freely uses MarkBufferDirtyHint() whenever changes are
made to the FSM. I think that's usually alright because FSM changes
are not WAL logged and if FSM ever returns a block with less free
space than the caller needs, the caller is usually prepared to update
the FSM and request for a new block. But if it returns a block that is
outside the size of the relation, then we've a trouble. The very next
ReadBuffer() fails to handle such a block and throws the error.
When a relation is truncated, the FSM is truncated too to remove
references to the heap blocks that are being truncated. But since the
FSM buffer may not be marked DIRTY on the standby, if the buffer gets
evicted from the buffer cache, the on-disk copy of the FSM page may be
left with references to the truncated heap pages. When the standby is
later promoted to be the master, and an insert/update is attempted to
the table, the FSM may return a block that is outside the valid range
of the relation. That results in the said error.
Once this was clear, it was easy to put together a fully reproducible
test case. See the attached script; you'll need to adjust to your
environment. This affects all releases starting 9.3 and the script can
reproduce the problem on all these releases.
I believe the fix is very simple. The FSM change during truncation is
critical and the buffer must be marked by MarkBufferDirty() i.e. those
changes must make to the disk. I think it's alright not to WAL log
them because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs.
But it must not be lost across a checkpoint. Also, since it happens
only during relation truncation, I don't see any problem from
performance perspective.
What bothers me is how to fix the problem for already affected
standbys. If the FSM for some table is already corrupted at the
standby, users won't notice it until the standby is promoted to be the
new master. If the standby starts throwing errors suddenly after
failover, it will be a very bad situation for the users, like we
noticed with our customers. The fix is simple and users can just
delete the FSM (and VACUUM the table), but that doesn't sound nice and
they would not know until they see the problem.
One idea is to always check if the block returned by the FSM is
outside the range and discard such blocks after setting the FSM
(attached patch does that). The problem with that approach is that
RelationGetNumberOfBlocks() is not really cheap and invoking it
everytime FSM is consulted may not be a bright idea. Can we cache that
value in the RelationData or some such place (BulkInsertState?) and
use that as a hint for quickly checking if the block is (potentially)
outside the range and discard it? Any other ideas?
The other concern I've and TBH that's what I initially thought as the
real problem, until I saw RecoveryInProgress() specific code, is: can
this also affect stand-alone masters? The comments at
MarkBufferDirtyHint() made me think so:
3358 * 3. This function does not guarantee that the buffer is always
marked dirty
3359 * (due to a race condition), so it cannot be used for
important changes.
So I was working with a theory that somehow updates to the FSM page
are lost because the race mentioned in the comment actually kicks in.
But I'm not sure if the race is only possible when the caller is
holding a SHARE lock on the buffer. When the FSM is truncated, the
caller holds an EXCLUSIVE lock on the FSM buffer. So probably we're
safe. I could not reproduce the issue on a stand-alone master. But
probably worth checking.
It might also be a good idea to inspect other callers of
MarkBufferDirtyHint() and see if any of them is vulnerable, especially
from standby perspective. I did one round, and couldn't see another
problem.
Thanks,
Pavan
[1]
https://www.postgresql.org/message-id/CAJakt-8%3DaXa-F7uFeLAeSYhQ4wFuaX3%2BytDuDj9c8Gx6S_ou%3Dw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/20160601134819.30392.85...@wrigleys.postgresql.org
[3]
https://www.postgresql.org/message-id/AMSPR06MB504CD8FE8AA30D4B7C958AAE39E0%40AMSPR06MB504.eurprd06.prod.outlook.com
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Could you please add the patches to commitfest?
I'm going to test them and write a review in a few days.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company