I've been looking into Pavan Deolasee's recent discovery that when storing a maximum-length toast tuple, heap_insert uselessly recurses into toast_insert_or_update, wasting a nontrivial number of cycles. It turns out there are several interrelated mistakes here, which are wasting space as well as cycles.

First off, as to the exact nature of what's happening: the toast code is designed so that when a large datum is broken down, it's divided into rows with data payloads of exactly TOAST_MAX_CHUNK_SIZE bytes each. On a 4-byte-MAXALIGN machine, this means the rows have total t_len of exactly TOAST_TUPLE_THRESHOLD, which is what was intended. However, that value is not a multiple of 4. Hence, when heapam.c tests

    MAXALIGN(tup->t_len) > TOAST_TUPLE_THRESHOLD

it decides the tuple needs re-toasting.

I noted before that this does not happen on an 8-byte-MAXALIGN machine, but had not understood exactly why. The reason is the outer MAXALIGN call in the definition

    #define TOAST_MAX_CHUNK_SIZE  (TOAST_TUPLE_THRESHOLD -            \
                MAXALIGN(                                             \
                    MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + \
                    sizeof(Oid) +                                     \
                    sizeof(int32) +                                   \
                    VARHDRSZ))

On a 4-byte machine that call doesn't do anything, but on an 8-byte machine it reduces the value of TOAST_MAX_CHUNK_SIZE by 4, which means that t_len of a toast row comes out 4 bytes smaller than on a 4-byte machine, which makes it smaller than TOAST_TUPLE_THRESHOLD even after maxalign'ing. Hence no recursion.

That MAXALIGN is actually *wrong* now that I look at it: it effectively supposes that there is padding alignment after the varlena length word for the chunk data, which of course there is not. But we can't change TOAST_MAX_CHUNK_SIZE without forcing an initdb.

Instead, we can fix the recursion by removing the MAXALIGN() operations in heapam.c and tuptoaster.c that compare tuple lengths to the thresholds. This effectively moves the threshold for tuple compression up a couple of bytes, which is a safe change to make, and it makes the comparisons slightly cheaper to boot. I propose doing that in 8.2 (and maybe older branches after we get a bit more testing of it).
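For concreteness, here is roughly what that change would look like at the heap_insert call site (a sketch only; the surrounding code and the toast_insert_or_update argument list are approximations of the 8.2-era sources, not exact quotations):

    /* Before: the MAXALIGN can push a maximum-size toast chunk's t_len
     * just over the threshold, triggering a useless re-toast pass. */
    if (HeapTupleHasExternal(tup) ||
        MAXALIGN(tup->t_len) > TOAST_TUPLE_THRESHOLD)
        heaptup = toast_insert_or_update(relation, tup, NULL);

    /* After: compare the raw t_len.  A tuple whose unaligned length is
     * at or below the threshold gains nothing from re-toasting, and we
     * save the alignment arithmetic on every insertion. */
    if (HeapTupleHasExternal(tup) ||
        tup->t_len > TOAST_TUPLE_THRESHOLD)
        heaptup = toast_insert_or_update(relation, tup, NULL);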
But the real problem is that we've got sloppy choices of the thresholds and sizes.

In the first place, TOAST_MAX_CHUNK_SIZE is being set at a value that makes every toast row have two wasted padding bytes after it (it turns out to be the same waste on both 4- and 8-byte machines, though the specific size of the rows differs). This is silly; we should be using a TOAST_MAX_CHUNK_SIZE that makes the actual row length come out at exactly a MAXALIGN multiple.

In the second place, examination of toast tables will show you that on a page with four maximum-length toast rows, there are 12 free bytes on a 4-byte machine and 28 free on an 8-byte machine (not counting the aforementioned padding bytes after each row). That's fine at first glance; because of alignment considerations it's actually the best we can do. The trouble is that TOAST_TUPLE_THRESHOLD is derived from MaxTupleSize, which is derived on the assumption that we should leave 32 bytes of "special space" on heap pages. If we actually had such special space, the four rows wouldn't fit. This happens because the threshold calculation is just

    #define TOAST_TUPLE_THRESHOLD  (MaxTupleSize / 4)

which fails to account for the line pointers needed for all but the first tuple. These errors cancel out at the moment, but wouldn't if we changed anything about the page header or special space layout.

What I suggest we do about this in HEAD is:

1. Rename MaxTupleSize to MaxHeapTupleSize, and get rid of the MaxSpecialSpace allotment in its calculation. We don't use special space on heap pages, and we shouldn't be artificially restricting tuple length to allow for something that's unlikely to appear in the future.

(Note: yes, I know it's been suggested to keep free-space maps in some heap pages, but that need not factor into a MaxHeapTupleSize limit: big tuples can simply go into a page without any free-space map.)

2. Fix TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET to be correctly calculated (properly allowing for line pointers) and to be MAXALIGN multiples. The threshold value should be exactly the size of the largest tuple that you can put four of onto one page. Fix TOAST_MAX_CHUNK_SIZE so that it is *not* necessarily a MAXALIGN multiple, but rather causes the total length of a toast tuple to come out that way. This guarantees minimum space wastage on toast pages. (A sketch of what the corrected macros might look like follows.)

This will force an initdb, due to the changed chunk sizes in toast tables, but unless we're going to reject Heikki's patch to merge cmin/cmax there is no hope of an in-place upgrade for 8.3 anyway.
BTW, while I was looking at this I noticed that BTMaxItemSize is incorrectly calculated as well: it's coming out a couple of bytes smaller than it could safely be, and with a different page header size it could come out a couple of bytes larger instead :-(. Perhaps this is related to Heikki's recent observation that there always seemed to be some extra free space on btree pages? I think the correct calculation would be

    #define BTMaxItemSize(page) \
        MAXALIGN_DOWN((PageGetPageSize(page) -                                      \
                       MAXALIGN(sizeof(PageHeaderData) + 2 * sizeof(ItemIdData)) -  \
                       MAXALIGN(sizeof(BTPageOpaqueData))) / 3)

where MAXALIGN_DOWN rounds down to the nearest maxalign multiple, instead of up (as sketched above).

Comments?

			regards, tom lane