Huge performance penalty with parallel queries in Windows x64 v. Linux x64
When developing a solution for a new customer request I created a new query over the production data. Despite the relatively low row counts of the involved tables (all < 100k) I noticed a quite long execution time of about 85 ms to 100 ms. The EXPLAIN ANALYZE output showed a parallel execution plan with 2 workers. The data structure and index structure were not optimal for this kind of query (which does not matter in this case).

Comparing the EXPLAIN ANALYZE plans on Windows x64 and Linux x64 showed about 3 times longer execution on Windows. For the comparison I restored the production data into two test databases on different virtual machines on the same hardware (the very same machine, with Hyper-V virtualization). The steps were only (on a mostly idle machine):

1. create a test database
2. pg_restore of the production data from the same dump file
3. ANALYZE on the database
4. run the query multiple times (about 5 times) and take the fastest EXPLAIN ANALYZE

On Fedora 34 64-bit, PG 13.2 self-compiled without modifications, the query took about 33 ms.
On Windows Server 2019 64-bit, PG 13.2 from the EDB download packages, the query took about 85 ms.

version
------------------------------------------------------------------------------------------------------
PostgreSQL 13.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.1.1 20210428 (Red Hat 11.1.1-1), 64-bit
PostgreSQL 13.2, compiled by Visual C++ build 1914, 64-bit

The corresponding explain plans are available at explain.depesz.com:

-- fedora
https://explain.depesz.com/s/Mq3P
-- windows
https://explain.depesz.com/s/VLtZ

The main difference is the time shown for the Gather Merge step (65 ms vs. 7 ms). The explain plans and the non-standard GUCs are included in the attachments; the configuration of the two databases seems quite comparable. Unfortunately I cannot disclose the query and the table data.
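For reference, the comparison procedure above can be sketched as a psql/shell session. The database name and dump-file name are placeholders, and the customer query itself is not disclosed:

```sql
-- 1. create a fresh test database (from a shell: createdb perftest)
-- 2. restore the production dump into it (from a shell):
--      pg_restore -d perftest production.dump
-- 3. refresh planner statistics so both machines plan from the same numbers:
ANALYZE;
-- 4. run the query several times and keep the fastest timing:
EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;  -- the customer query (not disclosed)
```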
My experience with parallel queries is not very wide, but this massive execution-time difference for the exact same query on the exact same data on the exact same hardware, with the same unmodified latest stable Postgres version, is very astonishing. BTW, I have generally observed slower execution under Windows, so production has now moved to Linux. There seem to be no relevant GUC differences concerning query execution, so the performance penalty of 300% to 900% (for one step alone) is not easily explainable. The databases remain on the system so the queries can be repeated on request during further investigation.

Thanks for looking.

Hans Buschmann

QUERY PLAN
Append  (cost=8284.93..8437.72 rows=2022 width=198) (actual time=31.506..32.637 rows=34 loops=1)
  CTE qsum
    ->  Sort  (cost=8280.38..8284.93 rows=1822 width=180) (actual time=31.502..32.577 rows=22 loops=1)
          Sort Key: or_followup.of_season, orders.or_clis_sub_code
          Sort Method: quicksort  Memory: 28kB
          ->  Hash Left Join  (cost=7864.72..8181.70 rows=1822 width=180) (actual time=31.274..32.566 rows=22 loops=1)
                Hash Cond: (or_followup.of_season = seasons.id_sea)
                ->  Hash Left Join  (cost=7862.95..8165.94 rows=1822 width=106) (actual time=31.252..32.523 rows=22 loops=1)
                      Hash Cond: (orders.or_clis_sub_code = clients_sub.clis_sub_code)
                      ->  Finalize GroupAggregate  (cost=7818.93..8098.89 rows=1822 width=94) (actual time=30.987..32.251 rows=22 loops=1)
                            Group Key: or_followup.of_season, orders.or_clis_sub_code
                            ->  Gather Merge  (cost=7818.93..8028.30 rows=1518 width=94) (actual time=30.978..32.207 rows=30 loops=1)
                                  Workers Planned: 2
                                  Workers Launched: 2
                                  ->  Partial GroupAggregate  (cost=6818.91..6853.06 rows=759 width=94) (actual time=24.688..25.201 rows=10 loops=3)
                                        Group Key: or_followup.of_season, orders.or_clis_sub_code
                                        ->  Sort  (cost=6818.91..6820.80 rows=759 width=20) (actual time=24.601..24.652 rows=1334 loops=3)
                                              Sort Key: or_followup.of_season, orders.or_clis_sub_code
                                              Sort Method: quicksort  Memory: 53kB
                                              Worker 0:  Sort Method: quicksort  Memory: 179kB
                                              Worker 1:
Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64
> No Windows here, but could it be super slow at launching workers? How
> does a trivial parallel query compare, something like?
>
> SET force_parallel_mode = on;
> EXPLAIN ANALYZE SELECT 42;

Indeed, this query takes about 40 ms on Windows and 7 ms on Linux (lowest values). Because the machine is accessed remotely, the reported times vary quite a bit.

The problem seems to be that this (probably inherent) performance disadvantage of Windows is not reflected in the cost model. This causes little-to-moderately complex queries to prefer parallel execution on Windows, which is certainly not the best option in these cases. The starting of processes should carry an adequate cost penalty to guide the planner in the right direction. Generally disabling parallel queries seems not a viable option with mixed loads.

Here are the query plans:

QUERY PLAN (Windows)
Gather  (cost=1000.00..1000.11 rows=1 width=4) (actual time=34.995..38.207 rows=1 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  Single Copy: true
  ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=1)
Planning Time: 0.016 ms
Execution Time: 39.136 ms
(7 rows)

QUERY PLAN (Linux)
Gather  (cost=1000.00..1000.11 rows=1 width=4) (actual time=6.864..7.764 rows=1 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  Single Copy: true
  ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=1)
Planning Time: 0.026 ms
Execution Time: 7.812 ms
(7 rows)

Hans Buschmann
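To isolate the worker-startup overhead from the query itself, the same measurement can be repeated with parallelism switched off in the session (a sketch; both GUCs are standard in PG 13):

```sql
-- force even a trivial query through a parallel worker (as quoted above)
SET force_parallel_mode = on;
EXPLAIN ANALYZE SELECT 42;

-- now forbid Gather nodes entirely and compare the serial timing
SET force_parallel_mode = off;
SET max_parallel_workers_per_gather = 0;
EXPLAIN ANALYZE SELECT 42;
```

On a platform with expensive process creation, the gap between the two runs is almost entirely worker launch time, since the Result node itself executes in microseconds in both cases.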
Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64
Thank you Thomas for pointing me to this GUC, which I hadn't been aware of before.

From the documentation I take it that a cost of 1.0 is assigned to a sequential page fetch.

In my opinion, even on Linux the default for parallel_setup_cost (1000) is set too low. It corresponds to the sequential access of 1000 pages, which is normally even faster than that on modern hardware when served from the buffer cache. On Windows these setup costs are much higher still, so I would propose raising the default substantially, perhaps to 25000, to reflect the real parallel overhead.

(BTW: is this cost multiplied by the actual number of workers chosen (max_parallel_workers_per_gather), or is it a single value independent of the number of workers? This would matter in Windows high-parallelism scenarios.)

The inadequate default yields more and more slower-than-necessary plans as people move to newer PG versions with good parallel support. For them it is, as it was for me, a little surprise, which most won't even notice, remedy, or fully understand. In bigger installations knowledge of query tuning is more likely, and people can react to their actual situation.

Perhaps someone with more knowledge of parallel queries can do some profiling / performance tests to justify my proposals (e.g. what is the sequential-page-access equivalent of 40 ms on selected platforms)?

New defaults proposal:

-- Linux and comparable architectures with fast process creation:
parallel_setup_cost = 2500
-- Windows
parallel_setup_cost = 25000

Thanks

Hans Buschmann
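Until any default changes, the proposal can be tried out locally. A sketch (the 25000 value is the suggestion from this thread, not a vetted default; ALTER SYSTEM requires superuser):

```sql
-- session-level experiment first: does the serial plan now win?
SET parallel_setup_cost = 25000;   -- default is 1000
EXPLAIN ANALYZE SELECT ... ;       -- re-run the affected query

-- make it persistent for the whole cluster once satisfied
ALTER SYSTEM SET parallel_setup_cost = 25000;
SELECT pg_reload_conf();
```

Raising parallel_setup_cost only shifts the break-even point at which the planner chooses a Gather; genuinely large queries still parallelize, which is why this is preferable to disabling parallelism outright under mixed loads.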
LZ4 missing in pg14-beta1 Windows build, OLD VS/compiler used
I tried to test a customer case of using bytea columns with the new LZ4 compression. But LZ4 support is not included in the standard binaries downloadable through the PostgreSQL website (from EDB).

For easy testing on Windows this should be enabled in the upcoming releases of pg14: not everybody is willing or able to self-compile a Windows distribution…

I also noticed that VS2017 is still used for pg14. After two years it would be better to distribute the Windows version compiled with the Visual Studio 2019 environment.

Environment:

select version();
                           version
-------------------------------------------------------------
 PostgreSQL 14beta1, compiled by Visual C++ build 1914, 64-bit
(1 row)

Thanks for investigating

Hans Buschmann
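For anyone wanting to verify whether a given pg14 build includes LZ4, a quick session sketch (on a build compiled without --with-lz4, the SET fails with an "invalid value" error and the CREATE TABLE is rejected):

```sql
-- succeeds only if the server was built with LZ4 support
SET default_toast_compression = 'lz4';

-- pg14 per-column compression syntax on a throwaway table
CREATE TABLE lz4_test (payload bytea COMPRESSION lz4);
\d+ lz4_test   -- psql: the Compression column shows lz4 per attribute
```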