On Thu, Apr 19, 2012 at 7:55 AM, Noah Misch <n...@leadboat.com> wrote: > On Mon, Mar 12, 2012 at 10:50:36PM -0400, Tom Lane wrote: >> There is one more (known) stop-ship problem in SPGiST, which I'd kind of >> like to get out of the way now before I let my knowledge of that code >> get swapped out again. This is that SPGiST is unsafe for use by hot >> standby slaves. > > I suspect that swap-out has passed, but ... > >> The problem comes from "redirect" tuples, which are short-lifespan >> objects that replace a tuple that's been moved to another page. >> A redirect tuple can be recycled as soon as no active indexscan could >> be "in flight" from the parent index page to the moved tuple. SPGiST >> implements this by marking each redirect tuple with the XID of the >> creating transaction, and assuming that the tuple can be recycled once >> that XID is below the OldestXmin horizon (implying that all active >> transactions started after it ended). This is fine as far as >> transactions on the master are concerned, but there is no guarantee that >> the recycling WAL record couldn't be replayed on a hot standby slave >> while there are still HS transactions that saw the old state of the >> parent index tuple. >> >> Now, btree has a very similar problem with deciding when it's safe to >> recycle a deleted index page: it has to wait out transactions that could >> be in flight to the page, and it does that by marking deleted pages with >> XIDs. I see that the problem has been patched for btree by emitting a >> special WAL record just before a page is recycled. However, I'm a bit >> nervous about copying that solution, because the details are a bit >> different. In particular, I see that btree marks deleted pages with >> ReadNewTransactionId() --- that is, the next-to-be-assigned XID --- >> rather than the XID of the originating transaction, and then it >> subtracts one from the XID before sending it to the WAL stream. >> The comments about this are not clear enough for me, and so I'm > > Attempting to write an explanation for that btree code led me conclude that > the code is incorrect. (FWIW, I caused that wrongness.) I will start a > separate thread to fix it.
Wrong or not, we need to better document why we picked ReadNewTransactionID(), rather than OldestXmin which seems the more obvious and cheaper choice. > All hot standby transactions holding snapshots taken before the startup > process applies the tuple-mover transaction's commit record will have xmin <= > its XID. Therefore, passing that XID to ResolveRecoveryConflictWithSnapshot() > meets the need here precisely. Yes, agreed. i.e. don't subtract 1. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers