Hi,

Here is a new version of the patch series addressing complaints from Rafia, Peter, Andres and Robert -- see below.
First, two changes not already covered in this thread:

1.  Today Robert asked me a question off-list that I hadn't previously considered: since I am sharing tuples between backends, don't I have the same kind of transient record remapping problems that tqueue.c has to deal with?  The answer must be yes, and in fact it's a trickier version because there are N 'senders' and N 'receivers' communicating via the shared hash table.  So I decided to avoid the problem by not planning shared hash tables if the tuples could include transient RECORD types: see tlist_references_transient_type() in 0007-hj-shared-single-batch-v8.patch.  Perhaps in the future we can find a way for parallel query to keep local types in sync, so this restriction could be lifted.  (I've tested this with a specially modified build, because I couldn't figure out how to actually get any transient types to be considered in a parallel query, but if someone has a suggestion for a good query for that I'd love to put one into the regression test.)

2.  Earlier versions included support for Shared Hash (= one worker builds, other workers wait, but we get to use work_mem * P memory) and Parallel Shared Hash (= all workers build).  Shared Hash is by now quite hard to reach, since so many hash join inner plans are now parallelisable.  I decided to remove support for it from the latest patch series: I think it adds cognitive load and patch lines for little or no gain.  With time running out, I thought that it would be better to rip it out for now to try to simplify things and avoid some difficult questions about how to cost that mode.  It could be added with a separate patch after some more study if it really does make sense.
>> On Mon, Mar 13, 2017 at 8:40 PM, Rafia Sabih
>> <rafia.sa...@enterprisedb.com> wrote:
>>> In an attempt to test v7 of this patch on TPC-H 20 scale factor I found a
>>> few regressions,
>>> Q21: 52 secs on HEAD and 400 secs with this patch

As already mentioned, there is a planner bug which we can fix separately from this patch series.  Until that is resolved, please see that other thread[1] for the extra tweak required for sane results when testing Q21.

Even with that tweak, there was a slight regression with fewer than 3 workers at 1GB for Q21.  That turned out to be because the patched version was not always using as many workers as unpatched.  To fix that, I had to rethink the deadlock avoidance system to make it a bit less conservative about giving up workers: see src/backend/utils/misc/leader_gate.c in 0007-hj-shared-single-batch-v8.patch.

Here are some speed-up numbers comparing master to patched that I recorded on TPC-H scale 10 with work_mem = 1GB.  These are the queries whose plans change with the patch.  Both master and v8 were patched with fix-neqseljoin-for-semi-joins.patch.
 query | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------
 Q3    | 0.94x | 1.06x | 1.25x | 1.46x | 1.64x | 1.87x | 1.99x | 1.67x | 1.67x
 Q5    | 1.17x | 1.03x | 1.23x | 1.27x | 1.44x | 0.56x | 0.95x | 0.94x | 1.16x
 Q7    | 1.13x | 1.04x | 1.31x | 1.06x | 1.15x | 1.28x | 1.31x | 1.35x | 1.13x
 Q8    | 0.99x | 1.13x | 1.23x | 1.22x | 1.36x | 0.42x | 0.82x | 0.78x | 0.81x
 Q9    | 1.16x | 0.95x | 1.92x | 1.68x | 1.90x | 1.89x | 2.02x | 2.05x | 1.81x
 Q10   | 1.01x | 1.03x | 1.08x | 1.10x | 1.16x | 1.17x | 1.09x | 1.01x | 1.07x
 Q12   | 1.03x | 1.19x | 1.42x | 0.75x | 0.74x | 1.00x | 0.99x | 1.00x | 1.01x
 Q13   | 1.10x | 1.66x | 1.99x | 1.00x | 1.12x | 1.00x | 1.12x | 1.01x | 1.13x
 Q14   | 0.97x | 1.13x | 1.22x | 1.45x | 1.43x | 1.55x | 1.55x | 1.50x | 1.45x
 Q16   | 1.02x | 1.13x | 1.07x | 1.09x | 1.10x | 1.10x | 1.13x | 1.10x | 1.11x
 Q18   | 1.05x | 1.43x | 1.33x | 1.21x | 1.07x | 1.57x | 1.76x | 1.09x | 1.09x
 Q21   | 0.99x | 1.01x | 1.07x | 1.18x | 1.28x | 1.37x | 1.63x | 1.26x | 1.60x

These tests are a bit short and noisy, and clearly there are some strange dips in there that need some investigation, but the trend is positive.

Here are some numbers from some simple test joins, so that you can see the raw speedup of large hash joins without all the other things going on in those TPC-H plans.
I executed 1-join, 2-join and 3-join queries like this:

  CREATE TABLE simple AS
    SELECT generate_series(1, 10000000) AS id,
           'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
  ANALYZE simple;

  SELECT COUNT(*) FROM simple r JOIN simple s USING (id);

  SELECT COUNT(*) FROM simple r JOIN simple s USING (id)
                  JOIN simple t USING (id);

  SELECT COUNT(*) FROM simple r JOIN simple s USING (id)
                  JOIN simple t USING (id)
                  JOIN simple u USING (id);

Unpatched master can make probing go faster by adding workers, but not building, so in these self-joins the ability to scale with more CPUs is limited (here w = 1 shows the speedup compared to the base case of w = 0):

 joins | w = 0       | w = 1 | w = 2 | w = 3 | w = 4 | w = 5
-------+-------------+-------+-------+-------+-------+-------
 1     | 10746.395ms | 1.46x | 1.66x | 1.63x | 1.49x | 1.36x
 2     | 20057.117ms | 1.41x | 1.58x | 1.59x | 1.43x | 1.28x
 3     | 30108.872ms | 1.29x | 1.39x | 1.36x | 1.35x | 1.03x

With the patch, scalability is better because extra CPUs can be used for both of these phases (though probably limited here by my 4 core machine):

 joins | w = 0       | w = 1 | w = 2 | w = 3 | w = 4 | w = 5
-------+-------------+-------+-------+-------+-------+-------
 1     | 10820.613ms | 1.86x | 2.62x | 2.99x | 3.04x | 2.90x
 2     | 20348.011ms | 1.83x | 2.54x | 2.71x | 3.06x | 3.17x
 3     | 30074.413ms | 1.82x | 2.49x | 2.79x | 3.08x | 3.27x

On Thu, Feb 16, 2017 at 3:36 PM, Andres Freund <and...@anarazel.de> wrote:
> I think the synchronization protocol with the various phases needs to be
> documented somewhere. Probably in nodeHashjoin.c's header.

I will supply that shortly.

> Don't we also need to somehow account for the more expensive hash-probes
> in the HASHPATH_TABLE_SHARED_* cases? Seems quite possible that we'll
> otherwise tend to use shared tables for small hashed tables that are
> looked up very frequently, even though a private one will likely be
> faster.
In this version I have two GUCs:

  cpu_shared_tuple_cost to account for the extra cost of building a
  shared hash table.

  cpu_synchronization_cost to account for the cost of waiting for a
  barrier between building and probing, probing and unmatched-scan if
  outer, and so on for future batches.

I'm not yet sure what their default settings should be, but these provide the mechanism to discourage the case you're talking about.

On Wed, Mar 8, 2017 at 12:58 PM, Andres Freund <and...@anarazel.de> wrote:
> +static void *dense_alloc(HashJoinTable hashtable, Size size,
> +                         bool respect_work_mem);
>
> I still dislike this, but maybe Robert's point of:
> ...
> Is enough.

In this version I changed the name to load_(private|shared)_tuple, and made it return NULL to indicate that work_mem would be exceeded.  The caller needs to handle that by trying to shrink the hash table.  Is this better?

On Fri, Mar 10, 2017 at 3:02 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Thu, Mar 9, 2017 at 4:29 PM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>> Yeah, this seems to fall out of the requirement to manage a growable
>> number of partition files in a fixed space.  I wonder how this could
>> go wrong.  One way would be for a crash-restart to happen (which
>> leaves all temporary files in place by design, though it could clean
>> them up like a normal restart if it wanted to), followed by a very
>> long running cluster eventually generating the same (pid, set number)
>> pair.  I think I see a simple way to defend against that, which I'll
>> write about in the PHJ thread.
>
> I am not expressing any real opinion about the idea of relying on or
> suppressing ENOENT-on-unlink() just yet. What's clear is that that's
> unorthodox. I seldom have any practical reason to make a distinction
> between unorthodox and unacceptable. It's usually easier to just not
> do the unorthodox thing. Maybe this is one of the rare exceptions.
In 0008-hj-shared-buf-file-v8.patch, the problem I mentioned above is addressed; see make_tagged_segment().

>> Thanks.  I will respond with code and comments over on the PHJ thread.
>> Aside from the broken extendBufFile behaviour you mentioned, I'll look
>> into the general modularity complaints I'm hearing about fd.c and
>> buffile.c interaction.
>
> buffile.c should stop pretending to care about anything other than
> temp files, IMV. 100% of all clients that want temporary files go
> through buffile.c. 100% of all clients that want non-temp files (files
> which are not marked FD_TEMPORARY) access fd.c directly, rather than
> going through buffile.c.

I still need BufFile because I want buffering.  There are 3 separate characteristics enabled by flags with 'temporary' in their name.  I think we should consider separating the concerns by splitting and renaming them:

1.  Segmented BufFile behaviour.  I propose renaming BufFile's isTemp member to isSegmented, because that is what it really does.  I want that feature independently without getting confused about lifetimes.  Tested with small MAX_PHYSICAL_FILESIZE as you suggested.

2.  The temp_file_limit system.  Currently this applies to fd.c files opened with FD_TEMPORARY.  You're right that we shouldn't be able to escape that sanity check on disk space just because we want to manage disk file ownership differently.  I propose that we create a new flag FD_TEMP_FILE_LIMIT that can be set independently of the flags controlling disk file lifetime.  When working with SharedBufFileSet, the limit applies to each backend in respect of files it created, while it has them open.  This seems a lot simpler than any shared-temp-file-limit type scheme and is vaguely similar to the way work_mem applies in each backend for parallel query.

3.  Delete-on-close/delete-at-end-of-xact.  I don't want to use that facility so I propose disconnecting it from the above.
We could rename those fd.c-internal flags FD_TEMPORARY and FD_XACT_TEMPORARY to FD_DELETE_AT_CLOSE and FD_DELETE_AT_EOXACT, as shown in 0008-hj-shared-buf-file-v8.patch.

Thoughts?

[1] https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
parallel-shared-hash-v8.tgz
Description: GNU Zip compressed data
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers