Hi,

On 2017-11-15 14:09:13 -0500, Robert Haas wrote:
> On Wed, Nov 15, 2017 at 1:35 PM, Andres Freund <and...@anarazel.de> wrote:
> > But this does bug me, and I think it's what made me pause here to make a
> > bad joke. The way that parallelism treats work_mem makes it even more
> > useless of a config knob than it was before. Parallelism, especially
> > after this patch, shouldn't compete / be benchmarked against a
> > single-process run with the same work_mem. To make it "fair" you need to
> > compare parallelism against a single threaded run with work_mem *
> > max_parallelism.
>
> I don't really know how to do a fair comparison between a parallel
> plan and a non-parallel plan. Even if the parallel plan contains zero
> nodes that use work_mem, it might still use more memory than the
> non-parallel plan, because a new backend uses a bunch of memory. If
> you really want a comparison that is fair on the basis of memory
> usage, you have to take that into account somehow.
That's not quite what I'm concerned about. Consider something
(completely artificial) like:

tpch_5[18786][1]=# SET work_mem = '50MB';
tpch_5[18786][1]=# EXPLAIN SELECT c_name, count(*)
    FROM orders JOIN customer ON (o_custkey = c_custkey)
    WHERE o_orderdate BETWEEN '1995-01-01' AND '1995-01-05'
    GROUP BY 1 ORDER BY count(*) DESC LIMIT 10;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                        QUERY PLAN                                                          │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=77344.16..77344.19 rows=10 width=27)                                                                          │
│   ->  Sort  (cost=77344.16..77379.68 rows=14206 width=27)                                                                  │
│         Sort Key: (count(*)) DESC                                                                                          │
│         ->  HashAggregate  (cost=76895.12..77037.18 rows=14206 width=27)                                                   │
│               Group Key: customer.c_name                                                                                   │
│               ->  Hash Join  (cost=35347.04..76824.09 rows=14206 width=19)                                                 │
│                     Hash Cond: (orders.o_custkey = customer.c_custkey)                                                     │
│                     ->  Bitmap Heap Scan on orders  (cost=302.04..41599.74 rows=14206 width=4)                             │
│                           Recheck Cond: ((o_orderdate >= '1995-01-01'::date) AND (o_orderdate <= '1995-01-05'::date))      │
│                           ->  Bitmap Index Scan on i_o_orderdate  (cost=0.00..298.49 rows=14206 width=0)                   │
│                                 Index Cond: ((o_orderdate >= '1995-01-01'::date) AND (o_orderdate <= '1995-01-05'::date))  │
│                     ->  Hash  (cost=25670.00..25670.00 rows=750000 width=23)                                               │
│                           ->  Seq Scan on customer  (cost=0.00..25670.00 rows=750000 width=23)                             │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(13 rows)

tpch_5[18786][1]=# SET work_mem = '10MB';
tpch_5[18786][1]=# EXPLAIN SELECT c_name, count(*)
    FROM orders JOIN customer ON (o_custkey = c_custkey)
    WHERE o_orderdate BETWEEN '1995-01-01' AND '1995-01-05'
    GROUP BY 1 ORDER BY count(*) DESC LIMIT 10;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                           QUERY PLAN                                                             │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=82847.92..82847.94 rows=10 width=27)                                                                                │
│   ->  Sort  (cost=82847.92..82883.43 rows=14206 width=27)                                                                        │
│         Sort Key: (count(*)) DESC                                                                                                │
│         ->  HashAggregate  (cost=82398.87..82540.93 rows=14206 width=27)                                                         │
│               Group Key: customer.c_name                                                                                         │
│               ->  Merge Join  (cost=42580.44..82327.84 rows=14206 width=19)                                                      │
│                     Merge Cond: (customer.c_custkey = orders.o_custkey)                                                          │
│                     ->  Index Scan using i_c_custkey on customer  (cost=0.42..37663.43 rows=750000 width=23)                     │
│                     ->  Sort  (cost=42579.54..42615.05 rows=14206 width=4)                                                       │
│                           Sort Key: orders.o_custkey                                                                             │
│                           ->  Bitmap Heap Scan on orders  (cost=302.04..41599.74 rows=14206 width=4)                             │
│                                 Recheck Cond: ((o_orderdate >= '1995-01-01'::date) AND (o_orderdate <= '1995-01-05'::date))      │
│                                 ->  Bitmap Index Scan on i_o_orderdate  (cost=0.00..298.49 rows=14206 width=0)                   │
│                                       Index Cond: ((o_orderdate >= '1995-01-01'::date) AND (o_orderdate <= '1995-01-05'::date))  │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(14 rows)

Note how the plan switched from a hashjoin to a mergejoin solely on
the basis of different work_mem settings, and that there are
obviously different costs associated with the two plans.
What I'm basically worried about is that the *only* reason for some
plans to choose parallelism is the difference in effective work_mem:
the parallel plan gets to use (max_parallel_workers_per_gather + 1) *
work_mem. Which might push queries to use parallelism even if it's
not actually beneficial in reducing runtime.

Thomas' earlier comparison of this behaviour with e.g. parallel
oblivious hash nodes does *NOT* seem apt to me. There's currently
effectively no cost pressure for/against parallelism for those (even
if there potentially should be). Which means they do not favor
parallel queries solely because they're allowed to use more memory,
and thus it's far less likely that each of those nodes uses the
maximum allotted work_mem.

I think it's wrong to just multiply the amount of work_mem that way,
and it'll bite us. Introducing a separate GUC, perhaps inheriting
from work_mem if set to -1, that limits the amount of memory inside a
parallel node seems saner. That value then would not be multiplied by
the chosen number of workers. (A rough sketch of what I mean is at
the end of this mail.)

> But even then, the parallel plan is also almost certainly consuming
> more CPU cycles to produce the same results. Parallelism is all about
> trading away efficiency for execution time. Not just because of
> current planner and executor limitations, but intrinsically, parallel
> plans are less efficient. The globally optimal solution on a system
> that is short on either memory or CPU cycles is to turn parallelism
> off.

I do think that our parallelism isn't properly tunable on that front.
I think we really need something like a 'parallel_resource_efficiency'
[0..1] GUC.

As far as I understand it we currently cost a gather's startup cost
once, not per worker. That startup cost effectively includes
redundant work like materializing a table, building parallel
oblivious hashtables, re-sorting the same table for a mergejoin, etc.
That's fine if your goal is solely to return a single query as fast
as possible, even if doubling the resources only gives you a minimal
cost advantage. If instead your goal is to optimize both for
individual query performance and for overall system throughput,
that's obviously not good.

I think we should cost the startup cost of parallel nodes more like

  startup_cost * (max_parallel_workers_per_gather * parallel_resource_efficiency + 1)

which'd allow tuning queries to use parallelism even for tiny
benefits (parallel_resource_efficiency = 0), and conversely tuning it
so parallelism only gets used if the overall loss of efficiency is
small (parallel_resource_efficiency = 0.9 or such). (I'm skipping
over a lot of details for such a proposal; a second sketch below
tries to make the formula concrete.)

Now, currently that'd have the weakness that we'd sometimes end up
not using parallelism at all because, at the determined degree of
parallelism, it's not beneficial, even though a lower degree of
parallelism would still be worthwhile. But given we currently don't
plan for multiple degrees of parallelism, that seems the right thing
to do if you care about efficiency / overall throughput.
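To make the separate-GUC idea a bit more concrete, here's a minimal
standalone sketch in C. The names (parallel_work_mem,
parallel_node_mem_limit()) are hypothetical, nothing that exists in
the tree today; units are kB, like work_mem:

#include <stdio.h>

/*
 * Sketch only, not actual PostgreSQL source.  A hypothetical
 * parallel_work_mem GUC limits the memory used inside a parallel
 * node; -1 means "inherit from work_mem".  The crucial property is
 * that the resulting limit is per node, and is *not* multiplied by
 * (max_parallel_workers_per_gather + 1).
 */
static int work_mem = 4096;         /* kB; the existing GUC, default 4MB */
static int parallel_work_mem = -1;  /* kB; hypothetical GUC, -1 = inherit */

static long
parallel_node_mem_limit(void)
{
    int limit_kb;

    limit_kb = (parallel_work_mem == -1) ? work_mem : parallel_work_mem;
    return (long) limit_kb * 1024L;     /* bytes */
}

int
main(void)
{
    printf("per-node limit: %ld bytes\n", parallel_node_mem_limit());
    return 0;
}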
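And a similarly hedged sketch of the proposed startup-cost discount
(parallel_resource_efficiency doesn't exist anywhere yet, and the
numbers are purely illustrative):

#include <stdio.h>

/*
 * Sketch only, not actual planner code.  parallel_resource_efficiency
 * = 0 charges a parallel node's redundant startup work once, roughly
 * what we do today; 1 charges it once per planned worker.
 */
static double parallel_resource_efficiency = 0.5;  /* proposed [0..1] GUC */

static double
parallel_startup_cost(double startup_cost, int nworkers)
{
    return startup_cost * (nworkers * parallel_resource_efficiency + 1.0);
}

int
main(void)
{
    /* e.g. a hashtable build costed at 1000.0, with 4 planned workers */
    printf("charged startup cost: %.1f\n",
           parallel_startup_cost(1000.0, 4));   /* prints 3000.0 */
    return 0;
}

Greetings,

Andres Freund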