Hi hackers,

There's currently an unfortunate CPU sink in planning for partitioned tables. It happens both for declarative partitioning and for partitioning through inheritance, as used in TimescaleDB.
The gist of it is that there's a linear search for the EM corresponding to a given child relation in find_ec_member_matching_expr(). For a partitioned table, this adds up to time quadratic in the number of partitions, which can make some queries unusable after you hit 1k partitions or so. The usual callers are prepare_sort_from_pathkeys() and generate_join_implied_equalities().

There's a very simple fix for this that makes use of the fact that these EMs are normally looked up in partitioning order. We can cache the position of the previously found EM in a static variable in this function, which makes the search constant-time for the typical scenarios. This is arguably a hack, but a simple one. The problem is really unfortunate and is the subject of many complaints from our customers. There was another attempt to fix it in [1], but it's been going on for years without reaching a committable shape. This patch proposes a quick and dirty alternative that can be adopted until the more principled solution is worked out.

I did a test with 10k partitions, and this patch makes planning there 3.5 times faster (2.4 s -> 0.7 s):

create table t(x int) partition by range(x);
select format('create table t_%s partition of t for values from (%s) to (%s)',
    x, x * 10000, (x + 1) * 10000)
from generate_series(0, 9999) x \gexec
insert into t select generate_series(0, 10000 * 9999);
vacuum analyze t;
set max_parallel_workers_per_gather = 0;
set enable_partitionwise_aggregate = 1;
set enable_hashagg = 0;
explain select count(*) from t group by x;

The query is supposed to generate a partitionwise aggregation over sort, like this:

Append
  ->  GroupAggregate
        Group Key: t.x
        ->  Sort
              Sort Key: t.x
              ->  Seq Scan on t_0 t
  ->  GroupAggregate
        Group Key: t_1.x
        ->  Sort
              Sort Key: t_1.x
              ->  Seq Scan on t_1
  ...

Would be glad to hear your opinions on this. CCing some committers who were recently involved with partitioning.
References:
1: https://www.postgresql.org/message-id/flat/CAJ2pMkZNCgoUKSE%2B_5LthD%2BKbXKvq6h2hQN8Esxpxd%2Bcxmgomg%40mail.gmail.com

---
Alexander Kuzmenkov
Timescale
From 4bc0c6d0ef14006b2dea8db871ba0e4a247bec71 Mon Sep 17 00:00:00 2001
From: Alexander Kuzmenkov <36882414+ak...@users.noreply.github.com>
Date: Wed, 22 Jan 2025 16:33:40 +0100
Subject: [PATCH v1] Fix quadratic equivalence member search for partitioned
 tables

Currently, find_ec_member_matching_expr() uses a linear search. When
creating ordered paths for a partitioned table, the total time taken by
this linear search becomes effectively quadratic in the number of
partitions, which starts seriously affecting the planning performance
after we hit 1k partitions. Ideally, ec_members should be a hash table,
but as a quick fix, we can make use of the fact that this search usually
looks up either the same EM as before, or the next one, and start the
search from the previous position. This speeds up the common case.
---
 src/backend/optimizer/path/equivclass.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 7cafaca33c..9d8fb5e40d 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -764,16 +764,26 @@ find_ec_member_matching_expr(EquivalenceClass *ec,
 							Expr *expr,
 							Relids relids)
 {
-	ListCell   *lc;
-
 	/* We ignore binary-compatible relabeling on both ends */
 	while (expr && IsA(expr, RelabelType))
 		expr = ((RelabelType *) expr)->arg;

-	foreach(lc, ec->ec_members)
+	/*
+	 * When creating ordered paths for a partitioned table, the total time
+	 * taken by this linear search becomes effectively quadratic in the
+	 * number of partitions, which starts seriously affecting the planning
+	 * performance after we hit 1k partitions. Ideally, ec_members should be
+	 * a hash table, but as a quick fix, we can make use of the fact that
+	 * this search usually looks up either the same EM as before, or the
+	 * next one, and start the search from the previous position. This
+	 * speeds up the common case.
+	 */
+	static int	previous_em_found_at = 0;
+	int			n = list_length(ec->ec_members);
+
+	for (int search_offset = 0; search_offset < n; search_offset++)
 	{
-		EquivalenceMember *em = (EquivalenceMember *) lfirst(lc);
-		Expr	   *emexpr;
+		const int	current_position = (search_offset + previous_em_found_at) % n;
+		EquivalenceMember *em = list_nth_node(EquivalenceMember, ec->ec_members,
+											  current_position);

 		/*
 		 * We shouldn't be trying to sort by an equivalence class that
@@ -792,12 +802,15 @@ find_ec_member_matching_expr(EquivalenceClass *ec,
 		/*
 		 * Match if same expression (after stripping relabel).
 		 */
-		emexpr = em->em_expr;
+		Expr	   *emexpr = em->em_expr;
 		while (emexpr && IsA(emexpr, RelabelType))
 			emexpr = ((RelabelType *) emexpr)->arg;

 		if (equal(emexpr, expr))
+		{
+			previous_em_found_at = current_position;
 			return em;
+		}
 	}

 	return NULL;
--
2.34.1