Hi, Alena! Thank you for your work on the subject.
On Wed, Oct 4, 2023 at 10:21 PM a.rybakina <a.rybak...@postgrespro.ru> wrote: > I fixed the kernel dump issue and all the regression tests were successful, > but I discovered another problem when I added my own regression tests. > Some queries that contain "or" expressions do not convert to "ANY". I have > described this in more detail using diff as expected and real results: > > diff -U3 > /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out > /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out > --- /home/alena/postgrespro__copy6/src/test/regress/expected/create_index.out > 2023-10-04 21:54:12.496282667 +0300 > +++ /home/alena/postgrespro__copy6/src/test/regress/results/create_index.out > 2023-10-04 21:55:41.665422459 +0300 > @@ -1925,17 +1925,20 @@ > EXPLAIN (COSTS OFF) > SELECT count(*) FROM tenk1 > WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3) OR thousand = 41; > - QUERY PLAN > --------------------------------------------------------------------------------------------------------- > + QUERY PLAN > +--------------------------------------------------------------------------------------------------------------------------- > Aggregate > -> Bitmap Heap Scan on tenk1 > - Recheck Cond: (((thousand = 42) AND (tenthous = ANY > ('{1,3}'::integer[]))) OR (thousand = 41)) > + Recheck Cond: ((((thousand = 42) AND (tenthous = 1)) OR ((thousand > = 42) AND (tenthous = 3))) OR (thousand = 41)) > -> BitmapOr > - -> Bitmap Index Scan on tenk1_thous_tenthous > - Index Cond: ((thousand = 42) AND (tenthous = ANY > ('{1,3}'::integer[]))) > + -> BitmapOr > + -> Bitmap Index Scan on tenk1_thous_tenthous > + Index Cond: ((thousand = 42) AND (tenthous = 1)) > + -> Bitmap Index Scan on tenk1_thous_tenthous > + Index Cond: ((thousand = 42) AND (tenthous = 3)) > -> Bitmap Index Scan on tenk1_thous_tenthous > Index Cond: (thousand = 41) > -(8 rows) > +(11 rows) I think this query is not converted, because you only convert top-level ORs in the transform_ors() function. But in the example given, the target OR lays under AND, which in turn lays under another OR. I think you need to make transform_ors() recursive to handle cases like this. I wonder about the default value of the parameter or_transform_limit of 500. In [1] and [2] you show the execution time degradation from 0 to ~500 OR clauses. I made a simple SQL script with the query "SELECT * FROM pgbench_accounts a WHERE aid = 1 OR aid = 2 OR ... OR aid = 100;". The pgbench results for a single connection in prepared mode are the following. master: 936 tps patched (or_transform_limit == 0) :1414 tps So, transformation to ANY obviously accelerates the execution. I think it's important to identify the cases where this patch causes the degradation. Generally, I don't see why ANY could be executed slower than the equivalent OR clause. So, the possible degradation cases are slower plan generation and worse plans. I managed to find both. As you stated before, currently the OR transformation has a quadratic complexity depending on the number of or-clause-groups. I made a simple test to evaluate this. containing 10000 or-clause-groups. SELECT * FROM pgbench_accounts a WHERE aid + 1 * bid = 1 OR aid + 2 * bid = 1 OR ... OR aid + 10000 * bid = 1; master: 316ms patched: 7142ms Note, that the current or_transform_limit GUC parameter is not capable of cutting such cases, because it cuts cases lower than the limit not higher than the limit. In the comment, you mention that we could invent something like hash to handle this. Hash should be nice, but the problem is that we currently don't have a generic facility to hash nodes (or even order them). It would be nice to add this facility, that would be quite a piece of work. I would propose to limit this patch for now to handle just a single Var node as a non-const side of the clause and implement a simple hash for Vars. Another problem is the possible generation of worse plans. I made an example table with two partial indexes. create table test as (select (random()*10)::int x, (random()*1000) y from generate_series(1,1000000) i); create index test_x_1_y on test (y) where x = 1; create index test_x_2_y on test (y) where x = 2; vacuum analyze test; Without the transformation of ORs to ANY, our planner manages to use both indexes with a Bitmap scan. # explain select * from test where (x = 1 or x = 2) and y = 100; QUERY PLAN -------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on test (cost=8.60..12.62 rows=1 width=12) Recheck Cond: (((y = '100'::double precision) AND (x = 1)) OR ((y = '100'::double precision) AND (x = 2))) -> BitmapOr (cost=8.60..8.60 rows=1 width=0) -> Bitmap Index Scan on test_x_1_y (cost=0.00..4.30 rows=1 width=0) Index Cond: (y = '100'::double precision) -> Bitmap Index Scan on test_x_2_y (cost=0.00..4.30 rows=1 width=0) Index Cond: (y = '100'::double precision) (7 rows) With transformation, the planner can't use indexes. # explain select * from test where (x = 1 or x = 2) and y = 100; QUERY PLAN ----------------------------------------------------------------------------- Gather (cost=1000.00..12690.10 rows=1 width=12) Workers Planned: 2 -> Parallel Seq Scan on test (cost=0.00..11690.00 rows=1 width=12) Filter: ((x = ANY (ARRAY[1, 2])) AND (y = '100'::double precision)) (4 rows) The solution I see would be to tech Bitmap scan to handle ANY clause in the same way as the OR clause. I think the entry point for the relevant logic is the choose_bitmap_and() function. Regarding the GUC parameter, I don't see we need a limit. It's not yet clear whether a small number or a large number of OR clauses are more favorable for transformation. I propose to have just a boolean enable_or_transformation GUC. Links 1. https://www.postgresql.org/message-id/6b97b517-f36a-f0c6-3b3a-0cf8cfba220c%40yandex.ru 2. https://www.postgresql.org/message-id/938d82e1-98df-6553-334c-9db7c4e288ae%40yandex.ru ------ Regards, Alexander Korotkov