Andres Freund wrote:
> On 2013-12-11 14:00:05 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:
> > > On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote:
> > > > Andres mentioned the idea of sharing some code between
> > > > heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't
> > > > explored that.
> > >
> > > My idea would just be to have heap_tuple_needs_freeze() call
> > > heap_prepare_freeze_tuple() and check whether it returns true. Yes,
> > > that's slightly more expensive than the current
> > > heap_tuple_needs_freeze(), but it's only called when we couldn't get a
> > > cleanup lock on a page, so that seems ok.
> >
> > Doesn't seem a completely bad idea, but let's leave it for a separate
> > patch. This should be changed in master only IMV anyway, while the rest
> > of this patch is to be backpatched to 9.3.
>
> I am not so sure it shouldn't be backpatched together with this. We now
> have similar complex logic in both functions.

Any other opinions on this?
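For concreteness, the wrapper Andres describes would be about this much
code. Untested sketch only, assuming the xl_heap_freeze_tuple scratch
struct this patch adds to heapam_xlog.h:

#include "access/heapam.h"
#include "access/heapam_xlog.h"		/* xl_heap_freeze_tuple, per this patch */

bool
heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
						MultiXactId cutoff_multi, Buffer buf)
{
	xl_heap_freeze_tuple frz;	/* scratch state; computed and discarded */

	/*
	 * heap_prepare_freeze_tuple() returns true iff it set up some change
	 * to the tuple, which is exactly the "needs freezing" condition.
	 * Computing state we then throw away is the "slightly more expensive"
	 * part, but we only get here when the cleanup lock on the page
	 * couldn't be obtained.  (buf is unused in this sketch.)
	 */
	return heap_prepare_freeze_tuple(tuple, cutoff_xid, cutoff_multi, &frz);
}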
> > > > !	if (ISUPDATE_from_mxstatus(members[i].status) &&
> > > > !		!TransactionIdDidAbort(members[i].xid))
> > >
> > > It makes me wary to see a DidAbort() without a previous InProgress()
> > > call. Also, after we crashed, doesn't DidAbort() possibly return
> > > false for transactions that were in progress before we crashed? At
> > > least that's how I always understood it, and how tqual.c is written.
> >
> > Yes, that's correct. But note that here we're not doing a tuple
> > liveliness test, which is what tqual.c is doing. What we do with this
> > info is to keep the Xid as part of the multi if it's still running or
> > committed. We also keep it if the xact crashed, which is fine because
> > the Xid will be removed by some later step. If we know for certain
> > that the update Xid is aborted, then we can ignore it, but this is
> > just an optimization and not needed for correctness.
>
> But why deviate that way? It doesn't seem to save us much?

Well, it does save something -- there not being a live update means we
are likely to be able to invalidate the Xmax completely if there are no
lockers; and even in the case where there are lockers, we will be able
to set LOCK_ONLY, which means faster access in several places.

> > > >   	if (tuple->t_infomask & HEAP_MOVED_OFF)
> > > > ! 		frz->frzflags |= XLH_FREEZE_XVAC;
> > > >   	else
> > > > ! 		frz->frzflags |= XLH_INVALID_XVAC;
> > >
> > > Hm. Isn't this case inverted? I.e. shouldn't you set XLH_FREEZE_XVAC
> > > and XLH_INVALID_XVAC exactly the other way round? I really don't
> > > understand the moved in/off, since the code has been gone longer than
> > > I've followed the code...
> >
> > Yep, fixed.
>
> I wonder how many of the HEAP_MOVED_* cases around are actually
> correct... What was the last version those were generated? 8.4?

8.4, yeah, before VACUUM FULL got rewritten. I don't think anybody
tests these code paths, because doing so requires a database that was
upgraded straight from 8.4 and saw VACUUM FULL executed while still on
8.4. I think we should consider removing these things, or at least have
some mechanism to ensure they don't survive from pre-9.0 installs.
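For the archives, the corrected branch is oriented like this -- a
sketch reconstructed from the fix above, not a verbatim hunk from the
patch. A MOVED_OFF tuple that is still around means the old-style
VACUUM FULL transaction (xvac) must have aborted, so the xvac is
invalidated; a surviving MOVED_IN tuple means the xvac committed, so it
can be frozen:

	if (tuple->t_infomask & HEAP_MOVED_OFF)
		frz->frzflags |= XLH_INVALID_XVAC;
	else
		frz->frzflags |= XLH_FREEZE_XVAC;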
> > (I was toying with the "desc" code because it misbehaves when applied
> > to records as they are created, as opposed to being applied to records
> > as they are replayed. I'm pretty sure everyone already knows about
> > this, and it's the reason why everybody has skimped on examining
> > arrays of things stored in followup data records. I was naive enough
> > to write code that tries to decode the followup record that contains
> > the members of the multixact we're creating, which works fine during
> > replay but gets them completely wrong during regular operation. This
> > is the third time I'm surprised by this misbehavior; blame my bad
> > memory for not remembering that it's not supposed to work in the
> > first place.)
>
> I am not really sure what you are talking about. That you cannot
> properly decode records before they have been processed by
> XLogInsert()? If so, yes, that's pretty clear and I am pretty sure it
> will break in lots of places if you try?

Well, I'm not sure about "lots of places". The only misbehavior I have
seen is in those desc routines. Of course, the redo routines might also
fail, but then there's no way for them to be running ...

> > Right now there is one case in this code that returns
> > FRM_INVALIDATE_XMAX when it's not strictly necessary, i.e. it would
> > also work to keep the Multi as is and return FRM_NOOP instead; and it
> > also returns FRM_NOOP in one case when we could return
> > FRM_INVALIDATE_XMAX instead. Neither does any great damage, but there
> > is a consideration that future examiners of the tuple would have to
> > resolve the MultiXact by themselves (==> performance hit). On the
> > other hand, returning INVALIDATE causes the block to be dirtied, which
> > is undesirable if it's not already dirty.
>
> Otherwise it will be marked dirty the next time something reads the
> page, so I don't think this is problematic.

Not necessarily. I mean, if somebody sees a multi, they might just
resolve it to its members and otherwise leave the page alone. Or in
some cases not even resolve it to its members (if it's LOCK_ONLY and
old enough to be behind the oldest visible multi).

> > !	{
> > !		if (ISUPDATE_from_mxstatus(members[i].status))
> > !		{
> > !			/*
> > !			 * It's an update; should we keep it? If the transaction is
> > !			 * known aborted then it's okay to ignore it, otherwise not.
> > !			 * (Note this is just an optimization and not needed for
> > !			 * correctness, so it's okay to get this test wrong; for
> > !			 * example, in case an updater is crashed, or a running
> > !			 * transaction in the process of aborting.)
> > !			 */
> > !			if (!TransactionIdDidAbort(members[i].xid))
> > !			{
> > !				newmembers[nnewmembers++] = members[i];
> > !				Assert(!TransactionIdIsValid(update_xid));
> > !
> > !				/*
> > !				 * Tell caller to set HEAP_XMAX_COMMITTED hint while we
> > !				 * have the Xid in cache. Again, this is just an
> > !				 * optimization, so it's not a problem if the transaction
> > !				 * is still running and in the process of committing.
> > !				 */
> > !				if (TransactionIdDidCommit(update_xid))
> > !					update_committed = true;
> > !
> > !				update_xid = newmembers[i].xid;
> > !			}
>
> I don't think the conclusions here are correct - we might be setting
> HEAP_XMAX_COMMITTED a smidge too early that way. If the updating
> transaction is in progress, there's the situation that we have updated
> the clog, but have not yet removed ourselves from the procarray. I.e. a
> situation in which TransactionIdIsInProgress() and
> TransactionIdDidCommit() both return true. Afaik it is only correct to
> set HEAP_XMAX_COMMITTED once TransactionIdIsInProgress() returns false.

Hmm ... Is there an actual difference? I mean, a transaction that
marked itself as committed in pg_clog cannot return to any other state,
regardless of what happens elsewhere.
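For reference, the ordering Andres describes -- the one tqual.c-style
code follows -- would look about like this (sketch only, reusing the
update_xid/update_committed variables from the hunk above;
TransactionIdIsInProgress() is declared in storage/procarray.h and
TransactionIdDidCommit() in access/transam.h):

	/*
	 * In the window between a backend updating pg_clog and removing
	 * itself from the procarray, TransactionIdIsInProgress() and
	 * TransactionIdDidCommit() can both return true.  Checking
	 * InProgress first means the hint bit is only set once nobody can
	 * still consider the xact to be running.
	 */
	if (!TransactionIdIsInProgress(update_xid) &&
		TransactionIdDidCommit(update_xid))
		update_committed = true;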
> > !			/*
> > !			 * Checking for very old update Xids is critical: if the
> > !			 * update member of the multi is older than cutoff_xid, we
> > !			 * must remove it, because otherwise a later liveliness check
> > !			 * could attempt pg_clog access for a page that was truncated
> > !			 * away by the current vacuum. Note that if the update had
> > !			 * committed, we wouldn't be freezing this tuple because it
> > !			 * would have gotten removed (HEAPTUPLE_DEAD) by
> > !			 * HeapTupleSatisfiesVacuum; so it either aborted or crashed.
> > !			 * Therefore, ignore this update_xid.
> > !			 */
> > !			if (TransactionIdPrecedes(update_xid, cutoff_xid))
> > !			{
> > !				update_xid = InvalidTransactionId;
> > !				update_committed = false;
>
> I vote for an Assert(!TransactionIdDidCommit(update_xid)) here.

Will add.

> > !	else
> > !	{
> > !		/*
> > !		 * Create a new multixact with the surviving members of the
> > !		 * previous one, to set as new Xmax in the tuple.
> > !		 *
> > !		 * If this is the first possibly-multixact-able operation in the
> > !		 * current transaction, set my per-backend OldestMemberMXactId
> > !		 * setting. We can be certain that the transaction will never
> > !		 * become a member of any older MultiXactIds than that.
> > !		 */
> > !		MultiXactIdSetOldestMember();
> > !		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
> > !		*flags |= FRM_RETURN_IS_MULTI;
> > !	}
>
> I worry that this MultiXactIdSetOldestMember() will be problematic in
> long-running vacuums like an anti-wraparound vacuum of a huge table.
> There's no real need to set MultiXactIdSetOldestMember() here, since we
> will not become a member of a multi. So I think you should either move
> the Assert() in MultiXactIdCreateFromMembers() to its other callers, or
> add a parameter to skip it.

I would like to have the Assert() work automatically, that is, check
the PROC_IN_VACUUM flag in MyProc->vacuumFlags ... but this probably
won't work with CLUSTER. That said, I think we *should* call
SetOldestMember in CLUSTER. So maybe both things should be conditional
on PROC_IN_VACUUM. (Either way it will be ugly.)

> > ! /*
> > !  * heap_prepare_freeze_tuple
> >      *
> >      * Check to see whether any of the XID fields of a tuple (xmin,
> >      * xmax, xvac)
> > !  * are older than the specified cutoff XID and cutoff MultiXactId.
> > !  * If so, set up enough state (in the *frz output argument) to later
> > !  * execute and WAL-log what we would need to do, and return TRUE.
> > !  * Return FALSE if nothing is to be changed.
> > !  *
> > !  * Caller is responsible for setting the offset field, if
> > !  * appropriate. This is only necessary if the freeze is to be
> > !  * WAL-logged.
>
> I'd leave off that second sentence; if you want to freeze a whole page
> but not WAL-log it, you'd need to set offset as well...

I can buy that.

> I worry that all these multixact accesses will create huge performance
> problems due to the inefficiency of the multixactid cache. If you scan
> a huge table there very well might be millions of different multis we
> touch, and afaics most of them will end up in the multixactid cache.
> That can't end well.
> I think we need to either regularly delete that cache when it goes
> past, say, 100 entries, or just bypass it entirely.

Delete the whole cache, or just prune it of the least recently used
entries? Maybe the cache should be a dlist instead of the open-coded
stuff that's there now; that would enable pruning of the oldest
entries. I think a blanket deletion might be a cure worse than the
disease. I see your point anyhow. A rough shape of what I mean is
sketched below.
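To make that concrete, something along these lines using lib/ilist.h.
The entry layout, the allocation context, and the 100-entry cap are all
illustrative only, not what multixact.c has today:

#include "postgres.h"
#include "access/multixact.h"	/* MultiXactId, MultiXactMember */
#include "lib/ilist.h"			/* dlist_* */
#include "utils/memutils.h"		/* MemoryContextAlloc, TopTransactionContext */

#define MXACT_CACHE_MAX 100		/* illustrative cap */

typedef struct mXactCacheEnt
{
	dlist_node	node;			/* list links; head = most recently used */
	MultiXactId multi;
	int			nmembers;
	MultiXactMember members[FLEXIBLE_ARRAY_MEMBER];
} mXactCacheEnt;

static dlist_head MXactCache = DLIST_STATIC_INIT(MXactCache);
static int	MXactCacheEntries = 0;

static void
mXactCachePut(MultiXactId multi, int nmembers, MultiXactMember *members)
{
	mXactCacheEnt *entry;

	/* cache lives for the duration of the transaction */
	entry = (mXactCacheEnt *)
		MemoryContextAlloc(TopTransactionContext,
						   offsetof(mXactCacheEnt, members) +
						   nmembers * sizeof(MultiXactMember));
	entry->multi = multi;
	entry->nmembers = nmembers;
	memcpy(entry->members, members, nmembers * sizeof(MultiXactMember));

	/* newest entry goes to the head */
	dlist_push_head(&MXactCache, &entry->node);

	if (++MXactCacheEntries > MXACT_CACHE_MAX)
	{
		/* prune the least recently used entry from the tail */
		dlist_node *oldest = dlist_tail_node(&MXactCache);

		dlist_delete(oldest);
		MXactCacheEntries--;
		pfree(dlist_container(mXactCacheEnt, node, oldest));
	}
}

On a cache hit, the lookup side would dlist_move_head() the entry, so
the tail always holds the least recently used one.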
-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services