On Mon, 2007-02-12 at 09:24 +0530, Pavan Deolasee wrote: > On 2/12/07, Heikki Linnakangas <[EMAIL PROTECTED]> wrote: > Hannu Krosing wrote: > > Ühel kenal päeval, P, 2007-02-11 kell 12:35, kirjutas Tom > Lane: > >> Hannu Krosing <[EMAIL PROTECTED]> writes: > >>> What if we would just reuse the root tuple directly > instead of turning > >>> it into a stub ? > >>> This would create a cycle of ctid pointers, which changes > the lookup > >>> process from 'follow ctid chaint until the end' to 'follow > the tid chain > >>> until you reach the start'. > >> How do you know which one is newest? > > > > By xmin,cmin of course . > > > >> What happens when you have to put a newer version off-page > for lack of space? > > > > Then this scheme won't work. > > Couldn't the ctid of the latest tuple point to the off-page > tuple as usual? > > It could. But then you may lose reference for older version(s). We can > do > the whole page scans to locate older versions, but that might prove > too > costly > > > It might be acceptable, if it was only stored on those tuples > that are > (HOT) updated. But it's not clear to me what you're proposing > to do with > the field, anyway, Which tuples would have it, and what would > it point to? > > My guess what Hannu is suggesting is to have a circular chain of > HOT-updated > tuples in a page. The index can point to any tuple in the chain. When > update goes off-page or is a COLD update, t_ctid points to the newer > version > as usual. So a tuple which is COLD updated would need two pointers, > one > which points to the oldest version in the circular on-page chain and > other which > points to the new version.
I very much like Hannu's idea, but it does present some issues. My feeling is we need to get some clarity on whether to go this route, or not, because it takes us away from the pointer-swinging idea. To maximise our chances of a something good for 8.3 we really need to pick just one idea now. Maybe we've already done that, by default. I've tried to analyse all aspects of the discussion to see if we can get something out of this: Hannu has described the possibility of having the index point to a tuple version which isn't the earliest one on the block. That raises the issue of how will we locate earlier tuple versions when the current Snapshot cannot see the root tuple? First off, we only need to locate the earlier tuple when the root was re-placed by a HOT update - we can tell that, so we're good so far. Second, we would like to be able to validate the xmax == xmin rule. If the original root tuple was dead, so will the prior versions that are off-block, so nobody will ever follow the chain to the root tuple again as part of EvalPlanQual. So we need to check newroot.xmax == nextinchain.xmin only. Third we need to locate the appropriate tuple version. We have a number of approaches: a) Block Seq Scan: using a scan to get there is possible, but not good. The best mechanism is to scan the block backwards (i.e. highest item to lowest item) picking up the tuple which points to the root, then scanning backwards picking up the parts of the chain as they are seen until we get to a visible tuple that we can validate as part of the chain. If we scanned forwards we'd need to remember the whole chain as we went, which would be less useful. *But* this doesn't allow us to validate the xmax == xmin rule, so that kills this sub-option, IMHO. b) Additional tuple fields We could have additional fields on the root tuple to help us locate the "true root" or oldest version of this tuple on the block. Those additional header fields are only required on the root tuple - no other tuple needs this info. So we need not complain about space savings/losses - the comparison is between extra tuple headers and a TupleStub, which is a full header width (24 bytes). This would need to include a copy of the xmin of the original root tuple, to allow us to validate the xmax == xmin rule (4 bytes). It would also need to include an in-page pointer to the true root (2 bytes). So additional tuple fields of 6 bytes would be added to the root tuple, probably maxaligning to 8 more bytes than normal - a saving of 16 bytes in comparison with the TupleStub approach. Since there is no separate TupleStub, we can replace the root when required, not just at full table VACUUM time. The presence, or not, of those fields can be marked using various info bits. The NULL bitmap is effectively an optional header field, so the idea seems workable - or at least as workable as pointer swinging. The in-page pointer is only ever followed as a result of an index scan. In all those cases we don't record the xmin, so we don't check it - we only need to store the xmin of the tuple pointed to by the in-page pointer. There may be some off-block tuple versions that point at the new root, but those links will never be followed because they are by-definition dead rows. I didn't start this analysis with any preconceived outcome, but ISTM now that Hannu's idea can be made to work robustly, whilst avoiding the need for pointer swinging, which thus allows retail VACUUMs, *and* saving space on-block overall. What it requires is two optional fields on any root tuple that has been replaced following a HOT update. If that is acceptable, then we have a better design as a result of Hannu's suggestion. Comments? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq