Oops, forgot to attach the latest patch in the earlier mail.
On Fri, Nov 11, 2016 at 6:26 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote:
>
> On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
>> On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia <rushabh.lat...@gmail.com> wrote:
>> > Please find attached the latest patch, which fixes the review points
>> > as well as some additional clean-up.
>>
>> I've signed up to review this patch and I'm planning to do some
>> testing.  Here's some initial feedback after a quick read-through:
>
> Thanks Thomas.
>
>> + if (gather_merge_readnext(gm_state, i, initialize ? false : true))
>>
>> Clunky ternary operator... how about "!initialize".
>
> Fixed.
>
>> +/*
>> + * Function clear out a slot in the tuple table for each gather merge
>> + * slots and returns the clear clear slot.
>> + */
>>
>> Maybe better like this?  "_Clear_ out a slot in the tuple table for
>> each gather merge _slot_ and _return_ the _cleared_ slot."
>
> Fixed.
>
>> + /* Free tuple array as we no more need it */
>>
>> "... as we don't need it any more"
>
> Fixed.
>
>> +/*
>> + * Read the next tuple for gather merge.
>> + *
>> + * Function fetch the sorted tuple out of the heap.
>> + */
>>
>> "_Fetch_ the sorted tuple out of the heap."
>
> Fixed.
>
>> + * Otherwise, pull the next tuple from whichever participate we
>> + * returned from last time, and reinsert the index into the heap,
>> + * because it might now compare differently against the existing
>>
>> s/participate/participant/
>
> Fixed.
>
>> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
>> + * Portions Copyright (c) 1994, Regents of the University of California
>>
>> Shouldn't this say just "(c) 2016, PostgreSQL Global Development
>> Group"?
>
> Fixed.
>
>> Are we supposed to be blaming the University of California
>> for new files?
>
> Not quite sure about this, so keeping it as it is.
>
>> +#include "executor/tqueue.h"
>> +#include "miscadmin.h"
>> +#include "utils/memutils.h"
>> +#include "utils/rel.h"
>> +#include "lib/binaryheap.h"
>>
>> Not correctly sorted.
>
> Copied from nodeGather.c, but fixed here.
>
>> + /*
>> + * store the tuple descriptor into gather merge state, so we can use it
>> + * later while initilizing the gather merge slots.
>> + */
>>
>> s/initilizing/initializing/
>
> Fixed.
>
>> +/* ----------------------------------------------------------------
>> + * ExecEndGatherMerge
>> + *
>> + * frees any storage allocated through C routines.
>> + * ----------------------------------------------------------------
>>
>> The convention in Postgres code seems to be to use a form like "Free
>> any storage ..." in function documentation.  Not sure if that's an
>> imperative, an infinitive, or if the word "we" is omitted since
>> English is so fuzzy like that, but it's inconsistent with other
>> documentation to use "frees" here.  Oh, I see that exact wording is in
>> several other files.  I guess I'll just leave this as a complaint
>> about all those files then :-)
>
> Sure.
>
>> + * Pull atleast single tuple from each worker + leader and set up the heap.
>>
>> s/atleast single/at least a single/
>
> Fixed.
>
>> + * Read the tuple for given reader into nowait mode, and form the tuple array.
>>
>> s/ into / in /
>
> Fixed.
>
>> + * Function attempt to read tuple for the given reader and store it into reader
>>
>> s/Function attempt /Attempt /
>
> Fixed.
>
>> + * Function returns true if found tuple for the reader, otherwise returns
>>
>> s/Function returns /Return /
>
> Fixed.
>
>> + * First try to read tuple for each worker (including leader) into nowait
>> + * mode, so that we initialize read from each worker as well as leader.
>>
>> I wonder if it would be good to standardise on the terminology we use
>> when we mean workers AND the leader.  In my Parallel Shared Hash work,
>> I've been saying 'participants' if I mean = workers + leader.  What do
>> you think?
>
> I am not quite sure about "participants".  In my opinion, when we
> explicitly say workers + leader it is clearer.  I am open to changing
> it if a committer thinks otherwise.
>
>> + * After this if all active worker unable to produce the tuple, then
>> + * re-read and this time read the tuple into wait mode. For the worker,
>> + * which was able to produced single tuple in the earlier loop and still
>> + * active, just try fill the tuple array if more tuples available.
>> + */
>>
>> How about this?  "After this, if all active workers are unable to
>> produce a tuple, then re-read and this time use wait mode.  For workers
>> that were able to produce a tuple in the earlier loop and are still
>> active, just try to fill the tuple array if more tuples are
>> available."
>
> Fixed.
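>
> For reference, after these fixes the initialization has roughly this
> shape (just a sketch: slot_filled() here is a made-up stand-in for the
> real per-reader check, and the third argument is the nowait flag from
> the API change discussed below; see gather_merge_init() in the
> attached patch for the actual code):
>
>     /* First pass: try each worker plus the leader without blocking. */
>     for (i = 0; i < gm_state->nreaders + 1; i++)
>         (void) gather_merge_readnext(gm_state, i, true);
>
>     /*
>      * Second pass: wait for any reader that has not produced a tuple
>      * yet; for the rest, just try to top up their tuple arrays.
>      */
>     for (i = 0; i < gm_state->nreaders + 1; i++)
>     {
>         if (!slot_filled(gm_state, i))
>             (void) gather_merge_readnext(gm_state, i, false);
>         else
>             form_tuple_array(gm_state, i);
>     }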
>
>> + * The heap is never spilled to disk, since we assume N is not very large. So
>> + * this is much simple then cost_sort.
>>
>> s/much simple then/much simpler than/
>
> Fixed.
>
>> + /*
>> + * Avoid log(0)...
>> + */
>> + N = (path->num_workers < 2) ? 2.0 : (double) path->num_workers;
>> + logN = LOG2(N);
>> ...
>> + /* Per-tuple heap maintenance cost */
>> + run_cost += path->path.rows * comparison_cost * 2.0 * logN;
>>
>> Why multiply by two?  The comment above this code says "about log2(N)
>> comparisons to delete the top heap entry and another log2(N)
>> comparisons to insert its successor".  In fact gather_merge_getnext
>> calls binaryheap_replace_first, which replaces the top element without
>> any comparisons at all and then performs a sift-down in log2(N)
>> comparisons to find its new position.  There is no per-tuple "delete"
>> involved.  We "replace" the top element with the value it already had,
>> just to trigger the sift-down, because we know that our comparator
>> function might have a new opinion of the sort order of this element.
>> Very clever!  The comment and the 2.0 factor in cost_gather_merge seem
>> to be wrong though -- or am I misreading the code?
>
> See cost_merge_append.
>
>> Also, shouldn't we add 1 to N to account for the leader?  Suppose
>> there are 2 workers.  There are 3 elements in the binary heap.  The
>> element to be sifted down must be compared against either 1 or 2
>> others to reorganise the heap.  Surely in that case we should estimate
>> log2(3) = ~1.58 comparisons, not log2(2) = 1 comparison.
>
> Yes, good catch.  For Gather Merge the leader always participates, so
> we should use num_workers + 1.
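>
> With that change the costing becomes something like this (a sketch
> only; the 2.0 factor stays for consistency with cost_merge_append, and
> the exact code is in the attached patch):
>
>     /* The leader always participates, so the heap holds N + 1 entries. */
>     N = (double) path->num_workers + 1;
>     logN = LOG2(N);
>     ...
>     /* Per-tuple heap maintenance cost, as in cost_merge_append. */
>     run_cost += path->path.rows * comparison_cost * 2.0 * logN;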
>
>> I suspect that the leader's contribution will be equivalent to a whole
>> worker if the plan involves a sort: as soon as the leader pulls a
>> tuple in gather_merge_init, the sort node will pull all the tuples it
>> can in a tight loop.  It's unfortunate that cost_seqscan has to
>> estimate what the leader's contribution will be without knowing
>> whether it has a "greedy" high-startup-cost consumer like a sort or
>> hash node, where the leader will contribute a whole backend's full
>> attention as soon as it executes the plan, or a lazy consumer, where
>> the leader will probably not contribute much if there are enough
>> workers to keep it distracted.  In the case of a Gather Merge -> Sort
>> -> Parallel Seq Scan plan, I think we will overestimate the number of
>> rows (per participant), because cost_seqscan will guess that the
>> leader is spending 30% of its time per worker servicing the workers,
>> when in fact it will be sucking tuples into a sort node just as fast
>> as anyone else.  But I don't see what this patch can do about that...
>
> Exactly.  It is a very thin line when it comes to calculating the cost.
> In general, while calculating the cost for Gather Merge, I just tried
> to stay similar to Gather + MergeAppend.
>
>> + * When force is true, function reads the tuple into wait mode. For gather
>> + * merge we need to fill the slot from which we returned the earlier tuple, so
>> + * this require tuple to be read into wait mode. During initialization phase,
>> + * once we try to read the tuple into no-wait mode as we want to initialize all
>> + * the readers. Refer gather_merge_init() for more details.
>> + *
>> + * Function returns true if found tuple for the reader, otherwise returns
>> + * false.
>> + */
>> +static bool
>> +gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)
>>
>> s/into wait mode/in wait mode/
>>
>> This appears throughout the comments; not sure if I can explain this
>> well but "in wait mode" describes a state of being which is wanted
>> here, "into wait mode" describes some kind of change or movement or
>> insertion.
>>
>> Perhaps it would be better to say "reads the tuple _queue_ in wait
>> mode", just to make clearer that this is talking about the wait/nowait
>> feature of tuple queues, and perhaps also note that the leader always
>> waits since it executes the plan.
>
> Fixed.  I just chose to s/into wait mode/in wait mode/.
>
>> Maybe we should use "bool nowait" here anyway, mirroring the TupleQueue
>> interface?  Why introduce another terminology for the same thing with
>> inverted sense?
>
> Agree with you.  Changed the gm_readnext_tuple() and
> gather_merge_readnext() APIs accordingly.
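>
> The signatures now look roughly like this (a sketch; the parameter
> lists here are illustrative, see the patch for the final form):
>
>     static bool gather_merge_readnext(GatherMergeState *gm_state, int reader,
>                                       bool nowait);
>     static HeapTuple gm_readnext_tuple(GatherMergeState *gm_state, int reader,
>                                        bool nowait);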
>
>> +/*
>> + * Read the tuple for given reader into nowait mode, and form the tuple array.
>> + */
>> +static void
>> +form_tuple_array(GatherMergeState *gm_state, int reader)
>>
>> This function is strangely named.  How about try_to_fill_tuple_buffer
>> or something?
>>
>> + GMReaderTuple *gm_tuple = &gm_state->gm_tuple[reader];
>>
>> I wonder if the purpose of gm_tuple would be clearer if it were
>> called gm_tuple_buffers.  Plural because it holds one buffer per
>> reader.  Then the variable on the left hand side could be called
>> tuple_buffer (singular), because it's the buffer of tuples for one
>> single reader.
>
> Yes, you are right.  I renamed the variable as well as the structure.
>
> PFA the latest patch, which addresses the review comments as well as a
> few other clean-ups.
>
> Apart from this, my colleague Rafia Sabih reported one regression with
> Gather Merge: if we set work_mem high enough to accommodate the sort
> operation, the Gather Merge path gets selected even though Sort
> performs much better.
>
> Example:
>
> create table t (i int);
> insert into t values(generate_series(1,10000000));
> set work_mem = 1024000;
> explain analyze select * from t order by i;
> set enable_gathermerge = off;
> explain analyze select * from t order by i;
>
> postgres=# explain analyze select * from t order by i;
>                                                             QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------
>  Gather Merge  (cost=335916.26..648415.76 rows=2499996 width=4) (actual time=2234.145..7628.555 rows=10000000 loops=1)
>    Workers Planned: 4
>    Workers Launched: 4
>    ->  Sort  (cost=334916.22..341166.21 rows=2499996 width=4) (actual time=2226.609..2611.041 rows=2000000 loops=5)
>          Sort Key: i
>          Sort Method: quicksort  Memory: 147669kB
>          ->  Parallel Seq Scan on t  (cost=0.00..69247.96 rows=2499996 width=4) (actual time=0.034..323.129 rows=2000000 loops=5)
>  Planning time: 0.061 ms
>  Execution time: 8143.809 ms
> (9 rows)
>
> postgres=# set enable_gathermerge = off;
> SET
> postgres=# explain analyze select * from t order by i;
>                                                        QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=1306920.83..1331920.79 rows=9999985 width=4) (actual time=3521.143..4854.148 rows=10000000 loops=1)
>    Sort Key: i
>    Sort Method: quicksort  Memory: 854075kB
>    ->  Seq Scan on t  (cost=0.00..144247.85 rows=9999985 width=4) (actual time=0.113..1340.758 rows=10000000 loops=1)
>  Planning time: 0.100 ms
>  Execution time: 5535.560 ms
> (6 rows)
>
> Looking at the plan, I realized this is happening because of wrong
> costing for Gather Merge: the row count it estimates is wrong.  The
> earlier patch took rows = subpath->rows, which is not right because the
> subpath is a partial path, so its row count is per worker and needs to
> be multiplied by the number of workers.  The attached patch fixes this
> issue too.  I also ran the TPC-H benchmark with the patch and the
> results are the same as earlier.
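>
> The fix amounts to something like this (illustrative only; the
> variable names here are made up, and the real change is where the
> Gather Merge path's row estimate gets set):
>
>     /*
>      * subpath->rows is a per-participant estimate, since subpath is a
>      * partial path; scale it up to a total, counting the leader too.
>      */
>     total_rows = subpath->rows * (pathnode->num_workers + 1);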
>
> Thanks,
> Rushabh Lathia
> www.EnterpriseDB.com

--
Rushabh Lathia

gather_merge_v4.patch