Hi David,

On Fri, Mar 6, 2026 at 12:10 PM David Rowley <[email protected]> wrote:
>
> One of my goals for proactively populating
> CompactAttribute.attcacheoff is to make it so we're able to support
> deforming only a subset of columns. If we only need a small number of
> columns from the tuple and all those columns have a known attcacheoff
> and no NULLs come prior, then we can quite efficiently just go to
> those cached offsets and fetch only the attributes that we need.  To
> do this, we'll need an extra array to store which attnums we're
> interested in, rather than deforming all attrs up to the highest
> attnum that we need, as we do today.  I expect that looking at this
> new array will slow things down a bit when we're accessing either most
> or all columns in, say, a SELECT * query. So, IMO, it'd be bad to
> *replace* the current deforming code with code which does this.
> Instead, I propose we add an additional deform operator and have some
> heuristic which decides which one is best to use. I expect
> ExecPushExprSetupSteps() could make that choice fairly easily. Perhaps
> something cheap like checking whether bms_num_members(scan_attrs) is less
> than half of bms_prev_member(scan_attrs, -1) (the highest member).
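As a quick illustration of that heuristic, here is a standalone toy model
(not the patch's code; the function name and exact threshold are assumptions):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model (not the patch's code) of the heuristic sketched above:
 * use selective deforming only when the needed columns are sparse,
 * i.e. fewer than half of the highest needed attnum.  In the real
 * code the two inputs would come from bms_num_members(scan_attrs)
 * and bms_prev_member(scan_attrs, -1).
 */
static bool
choose_selective_deform(int num_needed_attrs, int highest_attnum)
{
	return num_needed_attrs * 2 < highest_attnum;
}
```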
>
> There's going to be many cases where the attcacheoff isn't known in
> the attributes being selected. So that we still get some gains when
> that's the case, I've coded it up so that we start walking the tuple
> at the last attribute that has an attcacheoff. In many cases, that'll
> mean we don't need to walk the entire tuple. Often, leading columns
> are fixed-width, so this means that there's likely some benefit to
> most cases. There might need to be a bit more education or
> documentation about best column ordering practices.
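A standalone sketch of that idea, under the assumption that a per-attribute
attcacheoff array is available (names and layout here are illustrative only,
and alignment padding is ignored):

```c
#include <assert.h>

/*
 * Toy model (not the patch's code) of starting the tuple walk at the
 * last attribute whose offset is cached.  attcacheoff[i] >= 0 means
 * attribute i's byte offset is known up front.
 */
static int
deform_start_attr(const int *attcacheoff, int natts, int target_attnum,
				  int *start_off)
{
	int		start = 0;

	/* find the last attribute at or before the target with a known offset */
	for (int i = 0; i <= target_attnum && i < natts; i++)
	{
		if (attcacheoff[i] >= 0)
			start = i;
	}

	*start_off = attcacheoff[start];
	return start;				/* begin deforming here instead of attr 0 */
}
```

With fixed-width leading columns, the walk skips straight past them even when
the target attribute itself has no cached offset.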
>
> There are a few hurdles to make this work, and one is the physical
> tlist optimization. If the planner replaces the targetlist with a
> physical tlist, the executor is going to think we need all columns,
> which would have it likely choose not to do the selective deforming.
> To make this work, I've added some code in createplan.c to extract the
> attnums we need from the qual and tlist before the physical tlist is
> installed. That's recorded in a Bitmapset and passed down to the
> executor and to the code which sets up the ExprStates. Currently,
> mostly to exercise this code as much as possible, I've coded it to
> always do the selective deforming when the Bitmapset isn't empty. So
> far, I've only done this for Seq Scan, but I expect all the scans that
> deform tuples could use this.
>
> I've attached the code which does all this in the 0006 patch.
> Ideally, I'd have had this at least to the current state about 2-3
> months ago, so I don't intend that 0006 is v19 material, but I wanted
> to share to show where I intend this work to go.
>
> Performance:
>
> Using the t_1_40 table from the deform_test_setup.sh script I sent in
> [1], running "select a from t_1_40 where a = 0;" ("a" is the 43rd
> column in that table), on my Zen2 machine, I get the following from
> perf top and pgbench:
>
> master:
>   75.57%  postgres   [.] tts_buffer_heap_getsomeattrs
>    4.70%  postgres   [.] ExecInterpExpr
>    2.85%  postgres   [.] ExecSeqScanWithQualProject
>    1.94%  postgres   [.] heapgettup_pagemode
>    1.21%  postgres   [.] UnlockBuffer
>    1.15%  postgres   [.] slot_getsomeattrs_int
>
> $ for i in {1..3}; do pgbench -n -f bench.sql -M prepared -T 10
> postgres | grep latency; done
> latency average = 154.175 ms
> latency average = 156.780 ms
> latency average = 157.599 ms
>
> 0001-0005:
>   64.24%  postgres   [.] tts_buffer_heap_getsomeattrs
>   15.01%  postgres   [.] ExecInterpExpr
>    3.22%  postgres   [.] ExecSeqScanWithQualProject
>    3.01%  postgres   [.] heapgettup_pagemode
>    1.57%  postgres   [.] ExecStoreBufferHeapTuple
>    1.53%  postgres   [.] heap_prepare_pagescan
>
> $ for i in {1..3}; do pgbench -n -f bench.sql -M prepared -T 10
> postgres | grep latency; done
> latency average = 130.981 ms
> latency average = 134.700 ms
> latency average = 134.898 ms
>
> 0001-0006:
>   42.28%  postgres          [.] heapgettup_pagemode
>   11.38%  postgres          [.] ExecInterpExpr
>    7.13%  postgres          [.] ExecSeqScanWithQualProject
>    5.92%  postgres          [.] tts_buffer_heap_selectattrs <-- it's down here.
>    5.69%  postgres          [.] ExecStoreBufferHeapTuple
>    5.11%  postgres          [.] heap_getnextslot
>    3.87%  postgres          [.] heap_prepare_pagescan
>
> $ for i in {1..3}; do pgbench -n -f bench.sql -M prepared -T 10
> postgres | grep latency; done
> latency average = 71.689 ms
> latency average = 75.638 ms
> latency average = 75.149 ms
>
> Keep in mind that this is one of the best cases as t_1_40 has no NULLs
> and only has fixed-width columns. The only slightly better case would
> be to add more columns and fetch only the final one. 40 columns doesn't
> seem excessively unrealistic, so this gives an idea of the gains that
> someone *could* see.
>
> You can see from the perf top report that tts_buffer_heap_getsomeattrs
> dropped from taking 75.57% down to 64.24% with 0001-0005.  Adding 0006
> sees that replaced with tts_buffer_heap_selectattrs which takes less
> than 6% of the CPU time. It also highlights the next most interesting
> thing we should probably make faster, heapgettup_pagemode().
>
> I've attached v12 of the patch. There are a few changes in 0001-0005
> that should help make things a bit faster than v11. I've also attached
> the new selective deforming code in 0006. There's no JIT support for
> 0006 yet, I don't need to be told about that :-)
>
> I'm planning on starting to go through 0002-0005 in much more detail
> from mid next week with my committer hat on. If anyone wants to relook
> at any of the 0002-0005 patches, there's still time. I'm also happy to

I have some comments on v12-0004.

1.

+		off += cattr->attlen;
+		firstNonCachedOffsetAttr = i + 1;
+	}
+
+	tupdesc->firstNonCachedOffsetAttr = firstNonCachedOffsetAttr;
+	tupdesc->firstNonGuaranteedAttr = firstNonGuaranteedAttr;
+}

The firstNonCachedOffsetAttr seems to be the first variable-width
attribute, but the offset of that attribute itself can still be cached.
For example, in a table defined as (int, text), the offset of the
firstNonCachedOffsetAttr should be 4; is that correct?

If TupleDescFinalize recorded the offset of firstNonCachedOffsetAttr,
it might save one iteration of the deforming loop. For example,
something like the following could be added after the code quoted above.

if (firstNonCachedOffsetAttr < tupdesc->natts)
{
	cattr = TupleDescCompactAttr(tupdesc, firstNonCachedOffsetAttr);
	cattr->attcacheoff = off;
}

2.

In slot_deform_heap_tuple, there are multiple statements setting
firstNonCacheOffsetAttr:

+	firstNonCacheOffsetAttr = tupleDesc->firstNonCachedOffsetAttr;

+	/* We can only use any cached offsets until the first NULL attr */
+	firstNonCacheOffsetAttr = Min(firstNonCacheOffsetAttr,
+								  firstNullAttr);

+	/* We can only fetch as many attributes as the tuple has. */
+	firstNonCacheOffsetAttr = Min(firstNonCacheOffsetAttr, natts);

Based on the logic, it seems the second statement could be moved to the
third position, and the original third statement could then be safely
removed, since firstNullAttr should never exceed natts?
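To make the question concrete, here is a standalone model of the clamps (not
the patch's code); whether firstNullAttr can ever exceed natts is exactly the
open question:

```c
#include <assert.h>

#define Min(x, y) ((x) < (y) ? (x) : (y))

/* The three quoted statements, modeled standalone (not the patch's code). */
static int
clamp_three(int cached, int firstNullAttr, int natts)
{
	int		v = cached;

	v = Min(v, firstNullAttr);	/* cached offsets valid only before a NULL */
	v = Min(v, natts);			/* can't fetch more attrs than the tuple has */
	return v;
}

/*
 * Candidate simplification: if firstNullAttr never exceeds natts,
 * then Min(Min(c, f), n) == Min(c, f), so the natts clamp is redundant.
 */
static int
clamp_two(int cached, int firstNullAttr)
{
	return Min(cached, firstNullAttr);
}
```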

> receive feedback on 0006, but I will address concerns with that at a
> lower priority. One thing that's still left todo in the 0004 patch is
> enable the TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS optimisation for a few
> other scan types.
>
> Thanks for reading
>
> David
>
> [1] 
> https://postgr.es/m/caaphdvo1i-ycacwnk3l7zastum8mw46kvrqmauhd46hsujm...@mail.gmail.com



-- 
Regards
Junwang Zhao

