Re: Qual push down to table AM

Robert Haas Wed, 10 Dec 2025 07:41:47 -0800

On Tue, Dec 9, 2025 at 6:08 PM Andres Freund <[email protected]> wrote:
> On 2025-12-09 16:40:17 -0500, Robert Haas wrote:
> > On Fri, Aug 29, 2025 at 4:38 AM Julien Tachoires <[email protected]> 
> > wrote:
> > Potentially, there could be a performance problem
>
> I think the big performance hazard with this is repeated deforming. The
> scankey infrastructure deforms attributes one-by-one *and* it does not
> "persist" the work of deforming for later accesses.  So if you e.g. have
> something like
>
>   SELECT sum(col_29) FROM tbl WHERE col_30 = common_value;
> or
>   SELECT * FROM tbl WHERE col_30 = common_value;
>
> we'll now deform col_30 in isolation for the ScanKey evaluation and then we'll
> deform columns 1-29 in the slot (because we always deform all the leading
> columns), during projection.


Hmm, this is a good point, and I agree that it's a huge challenge for
this patch set. Repeated tuple deforming is REALLY expensive, which is
why we've spent so much energy trying to use slots in an as many
places as possible. I find it easy to believe that HeapKeyTest's loop
over heap_getattr() is going to prohibitively painful and that this
code will need to somehow also be slot-ified for this to be a viable
project.

> I don't really see this being viable without first tackling two nontrivial
> projects:
>
> 1) Make slot deforming for expressions & projections selective, i.e. don't
>    deform all the leading columns, but only ones that will eventually be
>    needed
> 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
>    and to make deforming of multiple columns sufficiently efficient.

IOW, I agree that we probably need to do #2. I am not entirely sure
about #1. I'm a little afraid that trying to skip over columns without
deforming them will add a bunch of code complexity that doesn't really
pay off. You have to do the bookkeeping to know what to skip, and then
how much are you really gaining by skipping it? If you can skip over a
bunch of fixed-width columns, that's cool, but it's probably fairly
normal to have lots of varlena columns, and then I don't really see
that you're gaining much here. You still have to iterate through the
tuple, and not storing the pointer to the start of each column as you
find it doesn't seem like it will save much. What's your reasoning
behind thinking that #1 will be necessary?

> > So, somewhat to my surprise, I think that v4-0001 might be basically
> > fine. I wonder if anyone else sees a problem that I'm missing?
>
> I doubt this would be safe as-is: ISTM that if you release the page lock
> between tuples, things like the number of items on the page can change. But we
> store stuff like that in registers / on the stack, which could change while
> the lock is not held.
>
> We could refetch the number items on the page for every loop iteration, but
> that'd probably not be free. OTOH, it's probably nothing compared to the cost
> of relocking the page...

We still hold a pin, though, which I think means very little can
change. More items can be added to the page, so we might want to
refresh the number of items on the page at least when we think we're
done, but I believe that any sort of more invasive page rearrangement
would be precluded by the pin.

I kind of wonder if it would be good to make a change along the lines
of v4-0001 even if this patch set doesn't move forward overall, or
will need a lot of slot-ification to do so. It seems weird to me that
we're OK with calling out to arbitrary code with a buffer lock held,
and even weirder that whether or not we do that depends on whether
SO_ALLOW_PAGEMODE was set. I don't think a difference of this kind
between pagemode behavior and non-pagemode behavior would survive
review if someone proposed it today; the fact that it works the way it
does is probably an artifact of this mechanism having been added
twenty years ago when the project was in a very different place.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Qual push down to table AM

Reply via email to