Hello experts,

Recently, we've been encountering poor query plans on our database workload.
We've successfully diagnosed the cause as reaching the default limit of 8
for join_collapse_limit.

We've already dealt with the situation successfully. However, I've been
discussing with some colleagues the possibility of raising the
join_collapse_limit parameter value (and from_collapse_limit with it) in
our production environment. I was trying to determine what would be a
sensible new value when I stumbled upon this thread on OLTP Star Joins:
https://www.postgresql.org/message-id/1ea167aa-457d-422a-8422-b025bb660ef3%40vondra.me

Our data model is hierarchical, so we rather have child tables instead of
the dimension tables mentioned above. I created a similar script as the one
found on the OLTP Star Join thread to test the effects of different values
of join_collapse_limit locally on an hierarchical model which is more
similar to ours. I ran pg_bench with 30 threads and 30 clients for 5
seconds. The following results were obtained on my machine (AMD Ryzen 7
5700U with 8 cores and 32 GB RAM, PostgreSQL 17.5 (Debian 17.5-1.pgdg120+1)
on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit):

OLTP starjoin (10 dimensions)
- join_collapse_limit = 1: 6.4k TPS
- join_collapse_limit = 8: 2.8k TPS
- join_collapse_limit = 11: 400 TPS

Tree join (10 levels)
- join_collapse_limit = 1:  7.4k TPS
- join_collapse_limit = 8: 4.7k TPS
- join_collapse_limit = 11:  3.9k TPS

Tree join (14 levels)
- join_collapse_limit = 1:  5.1k TPS
- join_collapse_limit = 8:  3.4k TPS
- join_collapse_limit = 11:  3k TPS
- join_collapse_limit = 14, geqo = off:  2.2k TPS
- join_collapse_limit = 14, geqo = on: 200 TPS

Note: geqo_threshold was unchanged from the default (12)

I assume that the reason why the hierarchical "tree join" is much faster is
due to the dependencies among tables, so the standard join search has a
much narrower range of possible query paths compared to the OLTP Star Join
case. What surprised me, however, is that when GEQO is turned on, the TPS
falls dramatically. Given that the documentation states that GEQO "...
reduces planning time for complex queries (those joining many relations),
at the cost of producing plans that are sometimes inferior to those found
by the normal exhaustive-search algorithm", it made me wonder what could be
the cause of this much slower planning. I'm not really familiar with
genetic algorithms, so perhaps I might be missing something, but is this
kind of planning performance hit normal when GEQO is on? I was hoping
someone could help us on this topic.

Regards,
Carlo

Attachment: treejoin-10-dims.sql
Description: application/sql

Attachment: treejoin-14-dims.sql
Description: application/sql

Attachment: treejoin-create.sql
Description: application/sql

Reply via email to