Query plan for "id IS NULL" on PK

2023-02-14 Thread Ben Chrobot
Hello,

Long time listener, first time caller.

We have a large table (~470 million rows) with integer primary key id (not
null) on a Postgres 14.5 cluster. A third-party tool is attempting to
perform a SELECT-based full table copy in preparation for log-based sync
with a query like the following:

SELECT "id", "other_column_a", "other_column_b", "created_at", "updated_at"
FROM "public"."my_large_table"
WHERE (("id" > ? OR "id" IS NULL)) AND (("id" <= ?))
ORDER  BY "id" LIMIT 5;

The lower bound increments by batch size (50k) while the upper bound is
always the `max(id)`, in our case around 575,000,000.

The query plan produced is very slow as the index condition does basically
nothing:


  QUERY PLAN

 Limit  (cost=0.57..21901.46 rows=5 width=417) (actual
time=1708920.675..1709198.995 rows=5 loops=1)
   ->  Index Scan using my_large_table_pkey on
my_large_table  (cost=0.57..135230792.97 rows=308733624 width=417) (actual
time=1708920.673..1709195.926 rows=5 loops=1)
 Index Cond: (id <= 575187488)
 Filter: ((id > 193208795) OR (id IS NULL))
 Rows Removed by Filter: 157784540
 Planning Time: 0.186 ms
 Execution Time: 1709200.618 ms
(7 rows)
Time: 1709231.721 ms (28:29.232)

We cannot change the query being executed. Is there any way we can make the
query planner ignore `OR (id IS NULL)` (as that will never be the case for
the PK) and use both `id` clauses in the index condition?

We have provided the vendor with the same query without the `id is null`
showing that it's significantly faster (see below). They have informed us
that addressing the null check on a not null PK is on their roadmap to
address but no timeline.

explain analyze
SELECT "id", "other_column_a", "other_column_b", "created_at", "updated_at"
FROM "public"."my_large_table"
WHERE (("id" > ?)) AND (("id" <= ?))
ORDER  BY "id" LIMIT 5;

QUERY PLAN
-
Limit (cost=0.57..13930.19 rows=5 width=416) (actual
time=2.118..400.937 rows=5 loops=1)
-> Index Scan using my_large_table_pkey on my_large_table
(cost=0.57..85429173.84 rows=306645829 width=416) (actual
time=2.117..398.325 rows=5 loops=1)
Index Cond: ((id > 193208795) AND (id <= 575187488))
Planning Time: 0.166 ms
Execution Time: 402.376 ms

We have tried leading the planner to water with this view but it did not
change the slow query plan:

create view my_fast_large_table as
select *
from my_large_table
where id is not null;

Any other tricks to try here?

Thank you,
Ben Chrobot


Re: Query plan for "id IS NULL" on PK

2023-02-17 Thread Ben Chrobot
Thank you all for your responses!

I will continue to put pressure on the vendor (Stitch Data, if anyone knows
folks there) to address the issue on their end with the query being issued.

Best,
Ben Chrobot


On Tue, Feb 14, 2023 at 11:11 PM Tom Lane  wrote:

> David Rowley  writes:
> > On Wed, 15 Feb 2023 at 11:39, Peter J. Holzer  wrote:
> >> OTOH it could also be argued that the optimizer should be able to
> >> perform the same simplifications as I did above and produce the same
> >> code for WHERE (("id" > ? OR "id" IS NULL)) AND (("id" <= ?))
> >> as for WHERE (("id" > ?)) AND (("id" <= ?)).
>
> > You're right, and it has been brought up quite a few times in the
> > past.  To make it work, it's a fairly trivial change. We'd just need
> > to record all the attnotnull columns during something like
> > get_relation_info() then when adding baserestrictinfos to the base
> > relations, we could look to see if the qual is a NullTest and skip
> > that if we deem the qual as constantly true.
>
> There's an order-of-operations issue that makes this more painful
> than you might think at first.  In the above example, the NullTest
> node *isn't* going to be a top-level restrictinfo: it's buried inside
> an OR.  Really, the only reasonable place to suppress such a NullTest
> is during eval_const_expressions, which already has the logic that would
> get rid of the now-unnecessary OR above it.  And that's problematic
> because it's done way ahead of where we know any relation-specific
> information.  (Since eval_const_expressions happens ahead of join
> removal, for $good_reasons, moving the plancat.c fetching to someplace
> earlier than that wouldn't be cost-free either.)
>
> > The problem with that is that doing that has an above zero cost and
> > since it likely only applies to nearly zero real-world cases, it just
> > does not seem like useful cycles to add to the planner.
>
> Yeah, this.  In the end there is a low threshold on expensive stuff
> that we're willing to do to clean up after brain-dead ORMs, because
> the costs of that will also be paid by not-so-brain-dead applications.
> In the example at hand, it's hard to argue that the query generator
> sending this query shouldn't know better, since as Peter points out
> the IS NULL check is redundant on its face, primary key or not.
>
> regards, tom lane
>
>
>