Hi,

On 2017-04-07 11:44:39 +0530, Amit Khandekar wrote:
> On 6 April 2017 at 07:33, Andres Freund <and...@anarazel.de> wrote:
> > On 2017-04-05 14:52:38 +0530, Amit Khandekar wrote:
> >> This is what the earlier versions of my patch had done: just add up
> >> per-subplan parallel_workers (1 for a non-partial subplan and
> >> subpath->parallel_workers for partial subplans) and set this total
> >> as the Append parallel_workers.
> >
> > I don't think that's great; consider e.g. the case that you have one
> > very expensive query and a bunch of cheaper ones. Most of those
> > workers wouldn't do much while waiting for the expensive query. What
> > I'm basically thinking we should do is something like the following
> > pythonesque pseudocode:
> >
> > best_nonpartial_cost = -1
> > best_nonpartial_nworkers = -1
> >
> > for numworkers in 1...#max workers:
> >     worker_work = [0 for x in range(0, numworkers)]
> >
> >     nonpartial_cost += startup_cost * numworkers
> >
> >     # distribute all nonpartial tasks over workers. Assign tasks to the
> >     # worker with the least amount of work already performed.
> >     for task in all_nonpartial_subqueries:
> >         least_busy_worker = worker_work.smallest()
> >         least_busy_worker += task.total_nonpartial_cost
> >
> >     # the nonpartial cost here is the largest amount any single worker
> >     # has to perform.
> >     nonpartial_cost += worker_work.largest()
> >
> >     total_partial_cost = 0
> >     for task in all_partial_subqueries:
> >         total_partial_cost += task.total_nonpartial_cost
> >
> >     # Compute resources needed by partial tasks. First compute how much
> >     # cost we can distribute to workers that take shorter than the
> >     # "busiest" worker doing non-partial tasks.
> >     remaining_avail_work = 0
> >     for i in range(0, numworkers):
> >         remaining_avail_work += worker_work.largest() - worker_work[i]
> >
> >     # Equally divide up remaining work over all workers
> >     if remaining_avail_work < total_partial_cost:
> >         nonpartial_cost += (worker_work.largest - remaining_avail_work) / numworkers
> >
> >     # check if this is the best number of workers
> >     if best_nonpartial_cost == -1 or best_nonpartial_cost > nonpartial_cost:
> >         best_nonpartial_cost = worker_work.largest
> >         best_nonpartial_nworkers = nworkers
> >
> > Does that make sense?
>
> Yeah, I gather what you are trying to achieve is: allocate a number of
> workers such that the total cost does not exceed the cost of the first
> non-partial plan (i.e. the costliest one, because the plans are sorted
> by descending cost).
>
> So for non-partial costs such as (20, 10, 5, 2), allocate only 2
> workers, because the 2nd worker will execute (10, 5, 2) while the 1st
> worker executes (20).
>
> But for costs such as (4, 4, 4, .... 20 times), the logic would give
> us 20 workers because we want to finish the Append in 4 time units;
> and this is what we want to avoid when we go with the
> don't-allocate-too-many-workers approach.
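For concreteness, here is one runnable rendering of the pseudocode
quoted above. The names (choose_num_workers, the flat per-worker
startup_cost parameter) are made up for the illustration, and two spots
that read like typos in the pseudocode are adjusted: the leftover
partial cost spread across workers is taken as (total_partial_cost -
remaining_avail_work), and the "best" bookkeeping compares the full
cost computed for each candidate worker count. It is a sketch of the
idea, not code from any patch:

def choose_num_workers(nonpartial_costs, partial_costs, max_workers,
                       startup_cost=0.0):
    """Pick the worker count with the lowest estimated total cost."""
    best_cost = None
    best_nworkers = None

    for numworkers in range(1, max_workers + 1):
        worker_work = [0.0] * numworkers
        cost = startup_cost * numworkers

        # Distribute non-partial subplans greedily, costliest first,
        # always onto the currently least busy worker.
        for task_cost in sorted(nonpartial_costs, reverse=True):
            least_busy = worker_work.index(min(worker_work))
            worker_work[least_busy] += task_cost

        # The non-partial contribution is the busiest worker's load.
        busiest = max(worker_work)
        cost += busiest

        # Partial work that fits "under" the busiest worker is absorbed
        # by the idler workers; any excess is split equally over all.
        total_partial_cost = sum(partial_costs)
        remaining_avail_work = sum(busiest - w for w in worker_work)
        if total_partial_cost > remaining_avail_work:
            cost += (total_partial_cost - remaining_avail_work) / numworkers

        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_nworkers = numworkers

    return best_nworkers, best_cost

# The two examples discussed above, with startup cost left at zero:
print(choose_num_workers([20, 10, 5, 2], [], max_workers=8))   # (2, 20.0)
print(choose_num_workers([4] * 20, [], max_workers=64))        # (20, 4.0)

With startup cost ignored, this reproduces the reading above: the
(20, 10, 5, 2) case settles on 2 workers, while the (4, 4, ... 20 times)
case drifts all the way to 20 workers.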
I guess my problem is that I don't agree with that as a goal in and of
itself. If you actually want to run your query quickly, you *want* 20
workers here. The costs of backend startup overhead (already modelled
via parallel_setup_cost), concurrency and the like should be modelled
explicitly, not buried in a formula the user can't change. If we want
to make it less and less likely to start more workers, we should make
that configurable, not the default.

I think there's some precedent taken from the parallel seqscan case
that's not actually applicable here. Parallel seqscans have a good
amount of shared state, both on the kernel and pg side, and that shared
state reduces the gains of increasing the number of workers. But with
non-partial scans such shared state largely doesn't exist.

> > especially if we get partitionwise joins.
>
> About that I am not sure, because we already have support for parallel
> joins, so wouldn't the join subpaths corresponding to all of the
> partitions be partial paths? I may be wrong about that.

We'll probably generate both, and then choose the cheaper one. The
startup cost for partitionwise joins should usually be a lot lower
(because e.g. for hashtables we'll generate smaller hashtables), and we
should have less cost of concurrency.

> But if the subplans are foreign scans, then yes, all would be
> non-partial plans. This may provoke off-topic discussion, but here,
> instead of assigning so many workers to all these foreign plans and
> all those workers waiting for the results, a single asynchronous
> execution node (which is still in the making) would be desirable,
> because it would do the job of all these workers.

That's something that probably shouldn't be modelled through a parallel
append, I agree - it shouldn't be too hard to develop a costing model
for that, however.

Greetings,

Andres Freund
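As a toy illustration of the point about modelling startup overhead
explicitly rather than capping workers: once a per-worker setup cost
(the role parallel_setup_cost plays for the real planner) enters the
comparison, the (4, 4, ... 20 times) case stops asking for 20 workers
on its own. The cost function below is a deliberately simplified form
of the model sketched earlier, specialised to equal-cost non-partial
subplans, and the numbers are invented for the example:

def append_cost(nsubplans, subplan_cost, nworkers, setup_cost):
    # Per-worker startup overhead plus the busiest worker's share of
    # the equal-cost subplans (ceil(nsubplans / nworkers) of them).
    busiest_share = -(-nsubplans // nworkers) * subplan_cost
    return setup_cost * nworkers + busiest_share

for setup in (0.0, 1.0, 50.0):
    best = min(range(1, 21), key=lambda k: append_cost(20, 4, k, setup))
    print(setup, best, append_cost(20, 4, best, setup))
# setup =  0.0 -> 20 workers (cost   4.0)
# setup =  1.0 -> 10 workers (cost  18.0)
# setup = 50.0 ->  1 worker  (cost 130.0)

The knob, rather than a hard-coded cap, then decides how eagerly extra
workers get started.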