The v2 patch was not rebased over the latest master branch commits. Please refer to the attached ParallelAppend_v3.patch instead.
On 6 February 2017 at 11:06, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:
>>> We may want to think about a third goal: preventing too many workers from executing the same plan. As per the comment in get_parallel_divisor() we do not see any benefit if more than 4 workers execute the same node. So, an append node can distribute more than 4 worker nodes equally among the available subplans. It might be better to do that as a separate patch.
>>
>> I think that comment is for calculating leader contribution. It does not say that 4 workers is too many workers in general.
>>
>> But yes, I agree, and I have it in mind as the next improvement. Basically, it does not make sense to give more than 3 workers to a subplan when parallel_workers for that subplan is 3. For example, if the Gather max workers count is 10, and we have 2 Append subplans s1 and s2 with parallel_workers 3 and 5 respectively, then with the current patch it will distribute 4 workers to each of these subplans. What we should do is: once both of the subplans get 3 workers each, we should give the 7th and 8th worker to s2.
>>
>> Now that I think of it, for implementing the above we need to keep track of per-subplan max_workers in the Append path; and with that, the bitmap will be redundant. Instead, it can be replaced with max_workers. Let me check if it is easy to do that. We don't want to have the bitmap if we are sure it would be replaced by some other data structure.
>
> Attached is the v2 patch, which implements the above. Now the Append plan node stores a list of per-subplan max worker counts, rather than the Bitmapset. But the Bitmapset still turned out to be necessary for AppendPath. More details are in the subsequent comments.
>
>>> Goal A : Allow non-partial subpaths in Partial Append.
>>> Goal B : Distribute workers across the Append subplans.
>>> Both of these require some kind of synchronization while choosing the next subplan. So, goal B is achieved by doing all the synchronization stuff, and the implementation of goal A requires that goal B is implemented. So there is a dependency between these two goals. While implementing goal B, we should keep in mind that it should also work for goal A; it does not make sense to later change the synchronization logic for goal A.
>>>
>>> I am ok with splitting the patch into 2 patches:
>>> a) changes required for goal A
>>> b) changes required for goal B.
>>> But I think we should split it only when we are ready to commit them (commit for B, immediately followed by commit for A). Until then, we should consider both of these together because they are interconnected as explained above.
>>
>> For B, we need to know how much gain that brings and in which cases. If that gain is not worth the complexity added, we may have to defer Goal B. Goal A would certainly be useful since it will improve performance of the targeted cases. The synchronization required for Goal A is simpler than that of B, and thus if we choose to implement only A, we can live with a simpler synchronization.
>
> For Goal A, the logic for a worker synchronously choosing a subplan will be: go to the next subplan; if that subplan has not already been assigned its max workers, choose it, otherwise go to the next subplan, and so on.
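
To make that concrete, here is a minimal sketch of what such a selection loop could look like (purely illustrative; the struct and function names below are made up and are not taken from the patch, and the caller is assumed to already hold the Append's mutex):

    /* Illustrative only -- not the patch's code. */
    typedef struct PASubplanInfo
    {
        int     num_workers;    /* workers currently assigned to this subplan */
        int     max_workers;    /* cap; 1 for a non-partial subpath */
    } PASubplanInfo;

    /* Round-robin search for the next subplan that still has worker capacity. */
    static int
    pa_choose_next_subplan(PASubplanInfo *subplans, int nplans, int prev)
    {
        int     i;

        for (i = 1; i <= nplans; i++)
        {
            int     next = (prev + i) % nplans;

            if (subplans[next].num_workers < subplans[next].max_workers)
            {
                subplans[next].num_workers++;   /* caller holds the lock */
                return next;
            }
        }
        return -1;              /* no subplan left for this worker */
    }
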
> For Goal B, the logic will be: among the subplans which are yet to reach their max workers, choose the subplan with the minimum number of workers currently assigned.
>
> I don't think there is a significant difference between the complexity of the above two algorithms. So complexity does not look like a factor on which we can base the choice of logic; we should choose the logic which has more potential for benefits. The logic for goal B will work for goal A as well. And secondly, if the subplans are using their own different system resources, the resource contention might be less. One case is: all subplans using different disks. A second case is: some of the subplans may be using a foreign scan, so foreign server resources would start being used sooner. These benefits apply when the Gather max workers count is not sufficient for running all the subplans at their full capacity. If it is sufficient, then the workers will be distributed over the subplans under both logics; only the order in which workers are assigned to subplans will differ.
>
> Also, I don't see a disadvantage if we follow the logic of Goal B.
>
>> BTW, right now the patch does not consider non-partial paths for a child which has partial paths. Do we know for sure that a path containing partial paths for a child that has them is always going to be cheaper than one which includes the non-partial path? If not, should we build another path which contains non-partial paths for all child relations? This sounds like a 0/1 knapsack problem.
>
> I didn't quite get this. We do create a non-partial Append path using non-partial child paths anyway.
>
>>>> Here are some review comments
>>> I will handle the other comments, but first, just a quick response to some important ones:
>>>
>>>> 6. By looking at the parallel_workers field of a path, we can say whether it's partial or not. We probably do not need to maintain a bitmap for that in the Append path. The bitmap can be constructed, if required, at the time of creating the partial append plan. The reason to take this small step is: 1. we want to minimize our work at the time of creating paths, 2. while freeing a path in add_path, we don't free the internal structures, in this case the Bitmapset, which will waste memory if the path is not chosen while planning.
>>>
>>> Let me try keeping the per-subplan max_worker info in the Append path itself, like I mentioned above. If that works, the bitmap will be replaced by a max_worker field. In case of a non-partial subpath, max_worker will be 1. (This is the same info kept in the AppendState node in the patch, but now we might need to keep it in the Append path node as well.)
>>
>> It will be better if we can fetch that information from each subpath when creating the plan. As I have explained before, a path is a minimal structure, which should be easily disposable when throwing away the path.
>
> Now in the v2 patch, we store the per-subplan worker count. But we still cannot use path->parallel_workers to determine whether it's a partial path. This is because even for a non-partial path, parallel_workers can be non-zero; for example, create_subqueryscan_path() sets path->parallel_workers to subpath->parallel_workers, but this path is added as a non-partial path. So we need separate info as to which of the subpaths in the Append path are partial subpaths. So in the v2 patch, I continued to use a Bitmapset in AppendPath. But in the Append plan node, the number of workers is calculated using this bitmapset. Check the new function get_append_num_workers().
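
Just to illustrate the shape of that calculation, something along the following lines (this is only a guess at the idea, not the patch's actual get_append_num_workers(); bms_is_member(), the List iteration macros and max_parallel_workers_per_gather are existing PostgreSQL symbols, everything else here is made up):

    /*
     * Illustrative sketch: each partial subpath can use up to its own
     * parallel_workers, a non-partial subpath gets exactly one worker,
     * and the total is capped by max_parallel_workers_per_gather.
     */
    static int
    append_num_workers_sketch(List *subpaths, Bitmapset *partial_subpaths)
    {
        int         num_workers = 0;
        int         i = 0;
        ListCell   *lc;

        foreach(lc, subpaths)
        {
            Path   *subpath = (Path *) lfirst(lc);

            if (bms_is_member(i, partial_subpaths))
                num_workers += subpath->parallel_workers;   /* partial subpath */
            else
                num_workers++;                              /* single worker */
            i++;
        }

        return Min(num_workers, max_parallel_workers_per_gather);
    }
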
>>>> 7. If we consider 6, we don't need concat_append_subpaths(),
>
> As explained above, I have kept the Bitmapset for AppendPath.
>
>>>> but still, here are some comments about that function. Instead of accepting two separate arguments, childpaths and child_partial_subpaths_set, which need to be in sync, we can just pass the path which contains both of those. In the same way, the following code may be optimized by adding a utility function to Bitmapset which advances all members by a given offset, and using that function with bms_union() to merge the bitmapsets, e.g.
>>>> bms_union(*partial_subpaths_set,
>>>>           bms_advance_members(bms_copy(child_partial_subpaths_set),
>>>>                               append_subpath_len));
>>>> if (partial_subpaths_set)
>
> I will get back on this after more thought.
>
>>>> 12. cost_append() essentially adds the costs of all the subpaths and then divides by the parallel_divisor. This might work if all the subpaths are partial paths. But for the subpaths which are not partial, a single worker will incur the whole cost of that subpath. Hence just dividing the total cost doesn't seem the right thing to do. We should apply different logic for costing non-partial subpaths and partial subpaths.
>>>
>>> With the current partial path costing infrastructure, it is assumed that a partial path node should return the average per-worker cost. Hence, I thought it would be best to do it in a similar way for Append. But let me think if we can do something; with the current parallelism costing infrastructure, I am not sure, though.
>>
>> The current parallel mechanism is in sync with that costing. Each worker is supposed to take the same burden, hence the same (average) cost. But it will change when a single worker has to scan an entire child relation and different child relations have different sizes.
>
> I gave this more thought. Considering that each subplan has a different number of workers, I think it makes sense to calculate the average per-worker cost even in parallel Append. In case of a non-partial subplan, a single worker will execute it, but it will next choose another subplan. So on average each worker is going to process the same number of rows, and also the same amount of CPU. And that CPU cost and rows cost should be calculated by taking the total and dividing it by the number of workers (the parallel_divisor, actually).
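
As a rough worked example of that averaging (assuming the divisor comes from get_parallel_divisor() on master, which, as far as I understand it, adds a leader contribution of 1 - 0.3 * parallel_workers while that value is positive):

    subplan total costs:   100 + 200 + 300        = 600
    parallel_workers = 2:  parallel_divisor       = 2 + (1 - 0.3 * 2) = 2.4
    per-worker Append total cost                  = 600 / 2.4 = 250

So each worker is costed as if it does 1/2.4 of the total work, regardless of which mix of partial and non-partial subplans it ends up running.
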
>> Here are some review comments
>>
>> 1. struct ParallelAppendDescData is being used at other places. The declaration style doesn't seem to be very common in the code or in the directory where the file is located.
>> +struct ParallelAppendDescData
>> +{
>> +    slock_t     pa_mutex;       /* mutual exclusion to choose next subplan */
>> +    parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
>> +};
>> Defining it like
>> typedef struct ParallelAppendDescData
>> {
>>     slock_t     pa_mutex;       /* mutual exclusion to choose next subplan */
>>     parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
>> } ParallelAppendDescData;
>> will make its use handy. Instead of struct ParallelAppendDescData, you will need to use just ParallelAppendDescData. Maybe we want to rename parallel_append_info as ParallelAppendInfo and change the style to match other declarations.
>>
>> 2. The comment below refers to the constant which it describes, which looks odd. Maybe it should be worded as "A special value of AppendState::as_whichplan, to indicate no plans left to be executed.". Also, using INVALID for "no plans left ..." seems to be a misnomer.
>> /*
>>  * For Parallel Append, AppendState::as_whichplan can have PA_INVALID_PLAN
>>  * value, which indicates there are no plans left to be executed.
>>  */
>> #define PA_INVALID_PLAN -1
>>
>> 3. The sentence "We have got NULL" looks odd. Probably we don't need it, as it's clear from the code above that this code deals with the case when the current subplan didn't return any row.
>> /*
>>  * We have got NULL. There might be other workers still processing the
>>  * last chunk of rows for this same node, but there's no point for new
>>  * workers to run this node, so mark this node as finished.
>>  */
>> 4. In the same comment, I guess the word "node" refers to "subnode" and not the node pointed to by variable "node". Maybe you want to use the word "subplan" here.
>>
>> 4. set_finished()'s prologue has different indentation compared to other functions in the file.
>>
>> 5. Multilevel comment starts with an empty line.
>> +    /* Keep track of the node with the least workers so far. For the very
>>
> Done 1. to 5. above, as per your suggestions.
>
>> 9. Shouldn't this function return double?
>> int
>> get_parallel_divisor(int parallel_workers)
>
> The v2 patch is rebased on the latest master branch, which already contains this function returning double.
>
>> 10. We should probably move the parallel_safe calculation out of cost_append().
>> +    path->parallel_safe = path->parallel_safe &&
>> +        subpath->parallel_safe;
>>
>> 11. This check shouldn't be part of cost_append().
>> +    /* All child paths must have same parameterization */
>> +    Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
>
> Yet to handle the above ones.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
ParallelAppend_v3.patch