On Thu, Nov 16, 2017 at 7:35 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2017-11-15 08:37:11 -0500, Robert Haas wrote:
>> I mean, the very first version of this patch that Thomas submitted
>> was benchmarked by Rafia and had phenomenally good performance
>> characteristics.  That turned out to be because it wasn't respecting
>> work_mem; you can often do a lot better with more memory, and
>> generally you can't do nearly as well with less.  To make comparisons
>> meaningful, they have to be comparisons between algorithms that use
>> the same amount of memory.  And it's not just about testing.  If we
>> add an algorithm that will run twice as fast with equal memory but
>> only allow it half as much, it will probably never get picked and the
>> whole patch is a waste of time.
>
> But this does bug me, and I think it's what made me pause here to make
> a bad joke.  The way that parallelism treats work_mem makes it an even
> more useless config knob than it was before.  Parallelism, especially
> after this patch, shouldn't compete with / be benchmarked against a
> single-process run with the same work_mem.  To make it "fair" you need
> to compare parallelism against a single-threaded run with work_mem *
> max_parallelism.
>
> Thomas argues that this makes hash joins be treated fairly vis-a-vis
> parallel-oblivious hash join etc.  And I think he has somewhat of a
> point.  But I don't think it's quite right either: in several of these
> cases the planner will not prefer the multi-process plan because it
> uses more work_mem; that's a cost to be paid.  Whereas this'll
> optimize towards using work_mem * max_parallel_workers_per_gather
> amount of memory.
>
> This makes it pretty much impossible to afterwards tune work_mem on a
> server in a reasonable manner.  Previously you'd tune it to something
> like free_server_memory - (max_connections * work_mem *
> 80%_most_complex_query).  You can't really do that anymore; now you'd
> also need to multiply by max_parallel_workers_per_gather.  Which means
> that you might end up "forcing" parallelism on a bunch of plans that'd
> normally execute in too short a time to make parallelism worth it.
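
To make the tuning arithmetic in that last paragraph concrete, here's a
back-of-envelope sketch in Python (all of the numbers are invented, and
it ignores shared_buffers, maintenance_work_mem and everything else
that really goes into sizing work_mem):

# Rough work_mem sizing, before and after factoring in parallel query.
# Invented numbers, purely for illustration.
free_server_memory_mb = 64000        # memory we're willing to hand out
max_connections = 100
nodes_in_complex_query = 4           # sorts/hashes that each take ~work_mem
max_parallel_workers_per_gather = 4

# Old rule of thumb: keep max_connections * nodes * work_mem inside the
# budget, i.e. solve for work_mem.
work_mem_old_mb = free_server_memory_mb / (max_connections *
                                           nodes_in_complex_query)

# Once every participant (leader + workers) can use work_mem for each
# of those nodes, the same budget has to be divided further.
work_mem_new_mb = work_mem_old_mb / (max_parallel_workers_per_gather + 1)

print(work_mem_old_mb)   # 160.0
print(work_mem_new_mb)   # 32.0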
Currently our way of choosing the number of workers is 'rule-based': we
use a simple formula that takes relation sizes, some GUCs and
per-relation options as inputs.  The comparison against non-parallel
plans is cost-based of course, but we won't consider any other number
of workers.  Suppose we had 'cost-based' worker number selection
instead: simply try planning with various worker counts and pick the
winner.  Then I think we'd see the moral hazard in this scheme more
clearly: the planner effectively has a free memory printing press.  It
would think it a good idea to use a huge number of workers to get more
and more work_mem-sized hash tables or in-memory sorts into memory
(whether that's with partition-wise join, Parallel Hash or something
else).

We could switch to a model where work_mem is divided by the number of
workers.  Parallel Hash would still get a full work_mem by combining
the participants' shares, while a parallel-oblivious hash join could
use only work_mem / participants.  That would be the other way to give
Parallel Hash a fair amount of memory compared to the competition, but
I didn't propose it because it would change already-released
behaviour: I'd have been saying "hey, look at this new plan I wrote,
it performs really well if you first tie the other plans' shoelaces
together".  It may actually be the better model though, even without
Parallel Hash in the picture.

--
Thomas Munro
http://www.enterprisedb.com
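
PS: To spell out the difference between the two memory models with toy
numbers, here's a sketch in Python (the little helper and the figures
are mine, purely for illustration, not anything the planner actually
computes).  It assumes a leader plus four workers and work_mem = 64MB:

WORK_MEM_MB = 64
PARTICIPANTS = 5                  # leader + 4 workers

def budgets(divide_work_mem):
    # Allowance each participant gets for one hash join node.
    if divide_work_mem:
        per_participant = WORK_MEM_MB / PARTICIPANTS
    else:
        per_participant = WORK_MEM_MB

    # Parallel Hash pools every participant's allowance into one shared
    # table; a parallel-oblivious hash join builds a private copy per
    # participant, each limited to its own allowance.
    shared_table = per_participant * PARTICIPANTS
    private_table = per_participant

    # Either way, the total footprint is allowance * participants.
    total_footprint = per_participant * PARTICIPANTS
    return shared_table, private_table, total_footprint

print("current model:", budgets(divide_work_mem=False))  # (320, 64, 320)
print("divided model:", budgets(divide_work_mem=True))   # (64.0, 12.8, 64.0)

In the current model both plan shapes can touch work_mem * participants
in total; Parallel Hash just gets to spend it on one big shared table
instead of several identical private ones.  In the divided model the
total is pinned at work_mem, and the parallel-oblivious join is the one
whose shoelaces get tied.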