Hi there,

When upgrading from PostgreSQL 14.4, I noticed that one of my somewhat complex analytical queries sometimes gets an inefficient plan under PostgreSQL 16, 17, and 18. Under 14.4, the query runs with a stable plan and completes in 19 to 22 seconds. In the newer versions, the plan seems to be unstable: sometimes the query completes in 17 to 20 seconds, sometimes it runs for 5 to 18 minutes with the inefficient plan. This happens even though the data has not changed significantly.
While I was preparing this email last week, I saw another thread on a similar topic; this email is unrelated to that thread.

After some experiments, I now have four cloned VMs, one per PostgreSQL version (14.4, 16.11, 17.6, 18.2), and I can more or less reproduce the plan flip. I use the vacuumdb command to trigger a flip, but under production-like test conditions it seems to flip randomly. To test, I created a bash script that runs the analytical query via psql in an endless loop; the script prints the runtime, or aborts the query after 28 seconds. I also tried to get an inefficient plan on 14.4; for that, I must set default_statistics_target to 9 or less. While 14.4 (so far) always chooses the inefficient plan with default_statistics_target < 10, 18.2 seems to be able to randomly produce the efficient plan even with default_statistics_target = 1.

My questions so far: where should I begin, how can I find out why the plan changes, and how can I find out what I must change to stabilize the plan? Any advice is welcome. I can provide more info if needed, and I can also recompile pg with debug, change settings, etc. if that helps.

The environment:
- PostgreSQL runs in a development VM with 6 cores and 16 GB RAM.
- OS is Debian 13: Linux d13pg18 6.12.63+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.63-1 (2025-12-30) x86_64 GNU/Linux
- PostgreSQL 18.2 compiled from source with the same settings as 14.4 (tried so far: 16.11, 17.6, 18.0, 18.1).
- I am the only user in the VM and on the host, and during the test only one of the VMs is running.
- All tables required by the query are cached (atop: RAM Total 15.6G, Free: 12.0G, Cache: 2.7G).
- All tables use the default statistics.
- The database gets no changes and no other requests. The analytical query makes no changes to the db (it only reads data; no temp tables, no materialized views, no temp files).
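For reference, the test loop is essentially the following sketch (simplified; "db1" and "query.sql" stand in for the real database and query file):

```shell
#!/bin/bash
# Run the analytical query in an endless loop and print each runtime.
# A run that exceeds 28 seconds is killed by timeout(1) and reported.
# db1 / query.sql are placeholders for the real database and query file.
while true; do
    start=$(date +%s.%N)
    if timeout 28 psql -U postgres -d db1 -f query.sql > /dev/null; then
        end=$(date +%s.%N)
        printf 'runtime: %.1f s\n' "$(echo "$end - $start" | bc)"
    else
        echo "aborted after 28 s"
    fi
done
```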
Preparation:
- clone VM
- compile and install pg
- initdb
- create database, create role
- set default_statistics_target = 170 in postgresql.conf (I tried 1, 5, 9, 10, 50, 100, 170, 180, 190, and 200 with no difference; at 500 I could not get an OK plan, or did not try long enough)
- set other params in the conf, mostly as in 14.4, or only for this test (see below)
- restart postgres
- psql < dump.sql
- vacuumdb -Upostgres -avz -j5
- restart VM

After that, when I execute "vacuumdb -Upostgres -vZ -t schema1.tbl_used_in_query db1", the plan sometimes flips from inefficient to OK or back. Once flipped, it stays stable for multiple runs. To make it flip again I must execute the above command or "vacuumdb -Upostgres -vZa -j5". The table "schema1.tbl_used_in_query" is one of the tables used in the query; to trigger the flip, the vacuumed table must be one of the tables used by the query.

For example, a real case: after a VM restart, the query is running in an endless loop and the data is not changing.

## 1x run, the first, aborted after 28 sec.
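In command form, the preparation on each cloned VM is roughly the following (a sketch; the build invocation, $PGDATA, role/db names, and the dump path are placeholders, except the vacuumdb commands, which are the ones I actually run):

```shell
# Build with the same settings as 14.4, then set up a fresh cluster.
./configure && make && make install
initdb -D "$PGDATA"
# Edit postgresql.conf: default_statistics_target = 170 etc. (full conf below),
# then start the server and load the data.
pg_ctl -D "$PGDATA" start
psql -U postgres < dump.sql
vacuumdb -Upostgres -avz -j5
# Reboot the VM. Afterwards, this command sometimes flips the plan:
vacuumdb -Upostgres -vZ -t schema1.tbl_used_in_query db1
```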
(the first run after a reboot is always aborted because of uncached data)

## PLAN OK (5x runs)

vacuumdb -Upostgres -vZ -t schema1.tbl_used_in_query db1
vacuumdb: vacuuming database "db1"
INFO:  analyzing "schema1.tbl_used_in_query"
INFO:  "tbl_used_in_query": scanned 51000 of 171971 pages, containing 977840 live rows and 0 dead rows; 51000 rows in sample, 3297257 estimated total rows
INFO:  finished analyzing table "db1.schema1.tbl_used_in_query"
avg read rate: 58.830 MB/s, avg write rate: 0.232 MB/s
buffer usage: 26748 hits, 27651 reads, 109 dirtied
WAL usage: 507 records, 101 full page images, 662069 bytes, 0 buffers full
system usage: CPU: user: 3.55 s, system: 0.09 s, elapsed: 3.67 s

## PLAN WRONG (5x runs)

vacuumdb -Upostgres -vZ -t schema1.tbl_used_in_query db1
vacuumdb: vacuuming database "db1"
INFO:  analyzing "schema1.tbl_used_in_query"
INFO:  "tbl_used_in_query": scanned 51000 of 171971 pages, containing 977837 live rows and 0 dead rows; 51000 rows in sample, 3297247 estimated total rows
INFO:  finished analyzing table "db1.schema1.tbl_used_in_query"
avg read rate: 57.663 MB/s, avg write rate: 0.234 MB/s
buffer usage: 26319 hits, 28077 reads, 114 dirtied
WAL usage: 502 records, 101 full page images, 654978 bytes, 0 buffers full
system usage: CPU: user: 3.64 s, system: 0.14 s, elapsed: 3.80 s

## PLAN WRONG (5x runs)

vacuumdb -Upostgres -vZ -t schema1.tbl_used_in_query db1
vacuumdb: vacuuming database "db1"
INFO:  analyzing "schema1.tbl_used_in_query"
INFO:  "tbl_used_in_query": scanned 51000 of 171971 pages, containing 977527 live rows and 0 dead rows; 51000 rows in sample, 3296202 estimated total rows
INFO:  finished analyzing table "db1.schema1.tbl_used_in_query"
avg read rate: 61.615 MB/s, avg write rate: 0.253 MB/s
buffer usage: 26604 hits, 27785 reads, 114 dirtied
WAL usage: 416 records, 10 full page images, 205744 bytes, 0 buffers full
system usage: CPU: user: 3.38 s, system: 0.12 s, elapsed: 3.52 s

## PLAN OK (22x runs)

vacuumdb -Upostgres -vZ -t schema1.tbl_used_in_query db1
vacuumdb: vacuuming database "db1"
INFO:  analyzing "schema1.tbl_used_in_query"
INFO:  "tbl_used_in_query": scanned 51000 of 171971 pages, containing 978434 live rows and 0 dead rows; 51000 rows in sample, 3299260 estimated total rows
INFO:  finished analyzing table "db1.schema1.tbl_used_in_query"
avg read rate: 55.409 MB/s, avg write rate: 0.229 MB/s
buffer usage: 26670 hits, 27639 reads, 114 dirtied
WAL usage: 477 records, 102 full page images, 655391 bytes, 0 buffers full
system usage: CPU: user: 3.75 s, system: 0.13 s, elapsed: 3.89 s

## PLAN WRONG (43x runs)

vacuumdb -Upostgres -vZ -t schema1.tbl_used_in_query db1
vacuumdb: vacuuming database "db1"
INFO:  analyzing "schema1.tbl_used_in_query"
INFO:  "tbl_used_in_query": scanned 51000 of 171971 pages, containing 978347 live rows and 0 dead rows; 51000 rows in sample, 3298967 estimated total rows
INFO:  finished analyzing table "db1.schema1.tbl_used_in_query"
avg read rate: 57.883 MB/s, avg write rate: 0.248 MB/s
buffer usage: 26731 hits, 27584 reads, 118 dirtied
WAL usage: 484 records, 110 full page images, 719407 bytes, 0 buffers full
system usage: CPU: user: 3.59 s, system: 0.11 s, elapsed: 3.72 s

## PLAN OK (55x runs)

#postgresql.conf of this VM
escape_string_warning = off
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
track_activities = on
logging_collector = on
log_directory = '/usr/local/foo/logs/pg_logs'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_min_messages = info
log_min_error_statement = warning
log_temp_files = 1
huge_pages = try
shared_buffers = 768MB
temp_buffers = 512MB
effective_cache_size = 4GB
work_mem = 768MB
hash_mem_multiplier = 2.5
maintenance_work_mem = 1GB
max_stack_depth = 4MB
max_locks_per_transaction = 256
wal_buffers = -1
jit = off
max_worker_processes = 4
max_parallel_workers_per_gather = 4
max_parallel_workers = 4
effective_io_concurrency = 16
checkpoint_completion_target = 0.9
seq_page_cost = 1.0
random_page_cost = 1.2
default_statistics_target = 170
vacuum_cost_limit = 2000

Thanks

Regards,
Attila Soki
