Re: [HACKERS] FSM corruption leading to errors

Anastasia Lubennikova Fri, 07 Oct 2016 07:51:06 -0700

06.10.2016 20:59, Pavan Deolasee:

I investigated a bug report from one of our customers and it looked very similar to previous bug reports here [1], [2], [3] (and probably more). In these reports, the error looks something like this:
ERROR: could not read block 28991 in file "base/16390/572026": read only 0 of 8192 bytes
I traced it to the following code in MarkBufferDirtyHint(). The function returns without setting the DIRTY bit on the standby:
    /*
     * If we're in recovery we cannot dirty a page because of a hint.
     * We can set the hint, just not dirty the page as a result so the
     * hint is lost when we evict the page or shutdown.
     *
     * See src/backend/storage/page/README for longer discussion.
     */
    if (RecoveryInProgress())
        return;

freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to the FSM. I think that's usually alright because FSM changes are not WAL logged and if the FSM ever returns a block with less free space than the caller needs, the caller is usually prepared to update the FSM and request a new block. But if it returns a block that is outside the size of the relation, then we're in trouble. The very next ReadBuffer() fails to handle such a block and throws the error.
When a relation is truncated, the FSM is truncated too to remove references to the heap blocks that are being truncated. But since the FSM buffer may not be marked DIRTY on the standby, if the buffer gets evicted from the buffer cache, the on-disk copy of the FSM page may be left with references to the truncated heap pages. When the standby is later promoted to be the master, and an insert/update is attempted on the table, the FSM may return a block that is outside the valid range of the relation. That results in the said error.
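The sequence above can be illustrated with a small self-contained toy model. This is only a sketch and not PostgreSQL code: the ToyFsmBuffer struct and all toy_* functions are hypothetical names invented for illustration, and a single "highest advertised block" field stands in for the whole FSM page. It contrasts a hint-style dirtying call that is skipped during recovery with an unconditional one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for one FSM page: it only tracks the highest heap block
 * the page still advertises. All names here are hypothetical. */
typedef struct {
    uint32_t cached_max_block;  /* copy held in the buffer cache */
    uint32_t disk_max_block;    /* copy on disk */
    bool     dirty;             /* would the page be written on eviction? */
} ToyFsmBuffer;

/* Mimics the quoted MarkBufferDirtyHint() behaviour: during recovery
 * the buffer is left clean, so the change lives only in memory. */
static void toy_mark_dirty_hint(ToyFsmBuffer *buf, bool in_recovery)
{
    if (in_recovery)
        return;
    buf->dirty = true;
}

/* Mimics an unconditional dirtying call: the change will reach disk. */
static void toy_mark_dirty(ToyFsmBuffer *buf)
{
    buf->dirty = true;
}

/* Evict the buffer (write back only if dirty), then read it back. */
static void toy_evict_and_reload(ToyFsmBuffer *buf)
{
    if (buf->dirty) {
        buf->disk_max_block = buf->cached_max_block;
        buf->dirty = false;
    }
    buf->cached_max_block = buf->disk_max_block;
}

/* Replay a truncation on a standby and return the highest block the
 * FSM advertises once the page has been evicted and read back. */
uint32_t toy_truncate_on_standby(uint32_t old_max, uint32_t new_max,
                                 bool use_hint)
{
    ToyFsmBuffer buf = { old_max, old_max, false };

    buf.cached_max_block = new_max;         /* FSM truncated in memory */
    if (use_hint)
        toy_mark_dirty_hint(&buf, true);    /* in recovery: no-op */
    else
        toy_mark_dirty(&buf);
    toy_evict_and_reload(&buf);
    return buf.cached_max_block;
}
```

In this model, the hint path leaves the old, larger block number on "disk" after eviction, which corresponds to the stale reference that later surfaces as the read error after promotion.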
Once this was clear, it was easy to put together a fully reproducible test case. See the attached script; you'll need to adjust it to your environment. This affects all releases starting with 9.3 and the script can reproduce the problem on all these releases.
I believe the fix is very simple. The FSM change during truncation is critical and the buffer must be marked by MarkBufferDirty(), i.e. those changes must make it to disk. I think it's alright not to WAL log them because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But they must not be lost across a checkpoint. Also, since it happens only during relation truncation, I don't see any problem from a performance perspective.
What bothers me is how to fix the problem for already affected standbys. If the FSM for some table is already corrupted on the standby, users won't notice it until the standby is promoted to be the new master. If the standby suddenly starts throwing errors after failover, it will be a very bad situation for the users, like we noticed with our customers. The fix is simple and users can just delete the FSM (and VACUUM the table), but that doesn't sound nice and they would not know until they see the problem.
One idea is to always check if the block returned by the FSM is outside the range and discard such blocks after updating the FSM (attached patch does that). The problem with that approach is that RelationGetNumberOfBlocks() is not really cheap and invoking it every time the FSM is consulted may not be a bright idea. Can we cache that value in the RelationData or some such place (BulkInsertState?) and use that as a hint for quickly checking if the block is (potentially) outside the range and discard it? Any other ideas?
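The cached-size idea might look roughly like the following toy sketch. It is not the attached patch and does not use PostgreSQL's real structures: ToyRelation, toy_get_nblocks() and toy_validate_fsm_block() are hypothetical, BlockNumber and InvalidBlockNumber are redefined locally as stand-ins, and a counter models how often the expensive size lookup would run.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles on its own. */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical per-relation state. */
typedef struct {
    BlockNumber actual_nblocks;  /* what the expensive lookup would report */
    BlockNumber cached_nblocks;  /* cached hint; may lag behind */
    unsigned    size_lookups;    /* number of expensive lookups performed */
} ToyRelation;

/* Models the expensive size lookup and refreshes the cached hint. */
static BlockNumber toy_get_nblocks(ToyRelation *rel)
{
    rel->size_lookups++;
    rel->cached_nblocks = rel->actual_nblocks;
    return rel->actual_nblocks;
}

/* Return blkno if it lies inside the relation, else InvalidBlockNumber
 * so that the caller can discard it and fix up the FSM. */
BlockNumber toy_validate_fsm_block(ToyRelation *rel, BlockNumber blkno)
{
    if (blkno == InvalidBlockNumber)
        return InvalidBlockNumber;

    /* Fast path: the block is below the cached size. */
    if (blkno < rel->cached_nblocks)
        return blkno;

    /* Slow path: the hint says "potentially outside", so do the real
     * lookup once and refresh the cache. */
    if (blkno < toy_get_nblocks(rel))
        return blkno;

    return InvalidBlockNumber;
}
```

Note that the fast path trusts the cached value, so in this sketch a truncation that happens after the value was cached would still let a bad block through. A real implementation would need to invalidate or refresh the cached size on truncation, which is left open here.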
The other concern I have, and TBH that's what I initially thought was the real problem until I saw the RecoveryInProgress()-specific code, is: can this also affect stand-alone masters? The comments at MarkBufferDirtyHint() made me think so:
     * 3. This function does not guarantee that the buffer is always marked dirty
     *    (due to a race condition), so it cannot be used for important changes.
So I was working with a theory that somehow updates to the FSM page are lost because the race mentioned in the comment actually kicks in. But I'm not sure if the race is only possible when the caller is holding a SHARE lock on the buffer. When the FSM is truncated, the caller holds an EXCLUSIVE lock on the FSM buffer. So probably we're safe. I could not reproduce the issue on a stand-alone master. But it's probably worth checking.
It might also be a good idea to inspect other callers of MarkBufferDirtyHint() and see if any of them is vulnerable, especially from the standby perspective. I did one round, and couldn't see another problem.
Thanks,
Pavan
[1] https://www.postgresql.org/message-id/CAJakt-8%3DaXa-F7uFeLAeSYhQ4wFuaX3%2BytDuDj9c8Gx6S_ou%3Dw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20160601134819.30392.85...@wrigleys.postgresql.org
[3] https://www.postgresql.org/message-id/AMSPR06MB504CD8FE8AA30D4B7C958AAE39E0%40AMSPR06MB504.eurprd06.prod.outlook.com
--
 Pavan Deolasee http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Could you please add the patches to commitfest?
I'm going to test them and write a review in a few days.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
