Huge performance penalty with parallel queries in Windows x64 v. Linux x64

2021-05-03 Thread Hans Buschmann

When developing a solution for a new customer request I created a new query 
over the production data.

Despite the relatively low row counts of the involved tables (all < 100k) I 
noticed quite a long execution time of about 85 ms to 100 ms.

The EXPLAIN ANALYZE output showed a parallel execution plan with 2 workers.

The data structure and index structure were not quite optimal for this kind of 
query (which does not matter in this case).

The comparison of the EXPLAIN ANALYZE plans on Windows x64 and Linux x64 showed 
about 3 times longer execution on Windows.

For comparison I reinstalled the production data on two test databases on 
different virtual machines on the same hardware (the very same machine with 
Hyper-V virtualization).

The steps were only (on a mostly idle machine):
1. create the test database
2. pg_restore of the production data from the same dump file
3. ANALYZE on the database
4. run the query multiple times (about 5 times) and keep the fastest EXPLAIN 
ANALYZE result
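In shell terms the steps above correspond roughly to the following sketch (database and dump file names are placeholders, and `<query>` stands for the undisclosed query):

```
# 1. create the test database
createdb testdb
# 2. restore the production data from the same dump file
pg_restore -d testdb production.dump
# 3. update planner statistics
psql -d testdb -c 'ANALYZE;'
# 4. run several times, keeping the fastest timing
psql -d testdb -c 'EXPLAIN ANALYZE <query>;'
```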

On Fedora 34 64-bit, PG 13.2 unmodified and self-compiled, the query took about 
33 ms.
On Windows Server 2019 64-bit, PG 13.2 from the EDB download packages, the 
query took about 85 ms.

 version
-----------------------------------------------------------------------------------------------------------
 PostgreSQL 13.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.1.1 20210428 (Red Hat 11.1.1-1), 64-bit

 version
------------------------------------------------------------
 PostgreSQL 13.2, compiled by Visual C++ build 1914, 64-bit

The corresponding explain plans are available at explain.depesz.com

-- fedora
https://explain.depesz.com/s/Mq3P

-- windows
https://explain.depesz.com/s/VLtZ

The main difference is the time shown for the Gather Merge step (65 ms vs. 7 ms).

The explain plans and the non-standard GUCs are included in the attachments; 
the configuration of the two databases seems quite comparable.

Unfortunately I cannot disclose the query and the table data.

My experience with parallel queries is limited, but this massive 
execution-time difference for the exact same query, on the exact same data, on 
the exact same hardware, with the same unmodified latest stable PostgreSQL 
version is very surprising.

BTW, I have generally observed slower execution under Windows, so production 
has now moved to Linux.

There seem to be no relevant GUC differences concerning query execution, so 
the performance penalty of 300% to 900% (for one step alone) is not easily 
explainable.

The databases remain on the system, so the queries can be repeated on request 
for further investigation.


Thanks for looking.

Hans Buschmann


 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Append  (cost=8284.93..8437.72 rows=2022 width=198) (actual time=31.506..32.637 rows=34 loops=1)
   CTE qsum
     ->  Sort  (cost=8280.38..8284.93 rows=1822 width=180) (actual time=31.502..32.577 rows=22 loops=1)
           Sort Key: or_followup.of_season, orders.or_clis_sub_code
           Sort Method: quicksort  Memory: 28kB
           ->  Hash Left Join  (cost=7864.72..8181.70 rows=1822 width=180) (actual time=31.274..32.566 rows=22 loops=1)
                 Hash Cond: (or_followup.of_season = seasons.id_sea)
                 ->  Hash Left Join  (cost=7862.95..8165.94 rows=1822 width=106) (actual time=31.252..32.523 rows=22 loops=1)
                       Hash Cond: (orders.or_clis_sub_code = clients_sub.clis_sub_code)
                       ->  Finalize GroupAggregate  (cost=7818.93..8098.89 rows=1822 width=94) (actual time=30.987..32.251 rows=22 loops=1)
                             Group Key: or_followup.of_season, orders.or_clis_sub_code
                             ->  Gather Merge  (cost=7818.93..8028.30 rows=1518 width=94) (actual time=30.978..32.207 rows=30 loops=1)
                                   Workers Planned: 2
                                   Workers Launched: 2
                                   ->  Partial GroupAggregate  (cost=6818.91..6853.06 rows=759 width=94) (actual time=24.688..25.201 rows=10 loops=3)
                                         Group Key: or_followup.of_season, orders.or_clis_sub_code
                                         ->  Sort  (cost=6818.91..6820.80 rows=759 width=20) (actual time=24.601..24.652 rows=1334 loops=3)
                                               Sort Key: or_followup.of_season, orders.or_clis_sub_code
                                               Sort Method: quicksort  Memory: 53kB
                                               Worker 0:  Sort Method: quicksort  Memory: 179kB
                                               Worker 1: 
Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64

2021-05-04 Thread Hans Buschmann

>No Windows here, but could it be super slow at launching workers?  How
>does a trivial parallel query compare, something like?

>SET force_parallel_mode = on;
>EXPLAIN ANALYZE SELECT 42;

Indeed, this query takes about 40 ms on Windows and 7 ms on Linux (lowest values).

Because I access the machine remotely, the reported times vary quite a bit.

The problem seems to be that this (probably inherent) performance disadvantage 
of Windows is not reflected in the cost model.

This causes queries of low to medium complexity to prefer parallel execution 
on Windows, which is certainly not the best option in these cases.

Starting worker processes should carry an adequate cost penalty to guide the 
planner in the right direction.

Disabling parallel queries entirely does not seem a viable option with mixed 
workloads.
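Until the costing accounts for this, the penalty can at least be compensated per session (or via ALTER SYSTEM / per-role settings); the sketch below uses 25000 purely as an illustrative value, not a measured one:

```sql
-- Make the planner account for slow worker startup on Windows
-- (25000 is an illustrative value, not a measured calibration):
SET parallel_setup_cost = 25000;

-- Or, for a single problematic query, suppress workers entirely:
SET max_parallel_workers_per_gather = 0;
```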

Here are the query plans:

 QUERY PLAN Windows
----------------------------------------------------------------------------------------
 Gather  (cost=1000.00..1000.11 rows=1 width=4) (actual time=34.995..38.207 rows=1 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=1)
 Planning Time: 0.016 ms
 Execution Time: 39.136 ms
(7 rows)

 QUERY PLAN Linux
----------------------------------------------------------------------------------------
 Gather  (cost=1000.00..1000.11 rows=1 width=4) (actual time=6.864..7.764 rows=1 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=1)
 Planning Time: 0.026 ms
 Execution Time: 7.812 ms
(7 rows)


Hans Buschmann



Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64

2021-05-04 Thread Hans Buschmann
Thank you, Thomas, for pointing me to this GUC, which I wasn't aware of before.

From the documentation I take it that a cost of 1.0 corresponds to one 
sequential page fetch.

In my opinion, even on Linux the default for parallel_setup_cost (1000) is set 
too low. It corresponds to the sequential access of 1000 pages, which on 
modern hardware is normally much faster from the buffer cache than starting 
parallel workers.

For Windows, these costs are much higher, so I would propose raising the 
default substantially, perhaps to 25000, to reflect the real parallel overhead.

(BTW: Is this cost multiplied by the actual number of workers chosen 
(max_parallel_workers_per_gather), or is it a single value independent of the 
number of workers? This would matter in Windows high-parallelism scenarios.)

The inadequate default produces more and more slower-than-necessary plans as 
people move to newer PG versions with good parallel support. For them, as for 
me, it comes as a small surprise, which most won't even notice, remedy, or 
fully understand.

Bigger installations are more likely to have query-tuning expertise, and 
people there can react to their real situation.

Perhaps someone with more parallel-query experience could run some profiling / 
performance tests to justify my proposals (e.g. what is the 
sequential-page-access equivalent of 40 ms on selected platforms):

New defaults proposal:
-- Linux and comparable architectures with fast process creation:
parallel_setup_cost  =  2500

-- Windows
parallel_setup_cost = 25000
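As a rough back-of-the-envelope check of these numbers: if one cached sequential page fetch (cost 1.0) takes about 1 µs (an assumption, not a measurement), the measured startup times translate into cost units as follows.

```python
# Translate measured worker-startup time into planner cost units.
# ASSUMPTION: one cached sequential page fetch (seq_page_cost = 1.0)
# takes roughly 1 microsecond; this figure is illustrative only.
page_fetch_us = 1.0          # assumed wall time per cost unit, in µs
startup_ms_windows = 40.0    # measured: EXPLAIN ANALYZE SELECT 42 on Windows
startup_ms_linux = 7.0       # measured: same trivial query on Linux

cost_equiv_windows = startup_ms_windows * 1000 / page_fetch_us
cost_equiv_linux = startup_ms_linux * 1000 / page_fetch_us

print(cost_equiv_windows)  # 40000.0
print(cost_equiv_linux)    # 7000.0
```

Under that unverified 1 µs assumption, the proposed defaults of 2500 and 25000 would at least be in the right order of magnitude.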


Thanks

Hans Buschmann



LZ4 missing in pg14-beta1 Windows build, OLD VS/compiler used

2021-06-03 Thread Hans Buschmann

I tried to test a customer case of using bytea columns with the new lz4 
compression.

But lz4 support is not included in the standard binaries downloadable through 
the PostgreSQL website (from EDB).

For easy testing on Windows this should be enabled in the upcoming releases of 
pg14: not everybody is willing or able to self-compile a Windows 
distribution…
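For reference, the kind of test I wanted to run uses the new PG 14 column-compression syntax (table and column names here are made up for illustration; the server must be built with --with-lz4):

```sql
-- Session default for newly compressed values:
SET default_toast_compression = 'lz4';

-- Or per column:
CREATE TABLE blobs (
    id   integer PRIMARY KEY,
    data bytea COMPRESSION lz4
);

-- Check which method was actually used for a stored value:
-- SELECT pg_column_compression(data) FROM blobs;
```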

I also noticed that VS2017 is still used for pg14.

After two years, it would be better to distribute the Windows version compiled 
with Visual Studio 2019.

Environment:

select version ();
 version
---------------------------------------------------------------
 PostgreSQL 14beta1, compiled by Visual C++ build 1914, 64-bit
(1 row)

Thanks for investigating

Hans Buschmann