On Tue, Dec 20, 2016 at 5:14 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> Imagine a data structure that is stored in dynamic shared memory and
>> contains space for a filename, a reference count, and a mutex. Let's
>> call this thing a SharedTemporaryFile or something like that. It
>> offers these APIs:
>>
>> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
>> extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
>> extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
>> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);
>
> I'm a little bit tired right now, and I have yet to look at Thomas'
> parallel hash join patch in any detail. I'm interested in what you
> have to say here, but I think that I need to learn more about its
> requirements in order to have an informed opinion.
Attached is V7 of the patch. The overall emphasis with this revision is on clarifying how much can be accomplished using generalized infrastructure, on explaining the unification mechanism coherently, and on related issues.

Notable changes
---------------

* Rebased to work with the newly simplified logtape.c representation (the recent removal of "indirect blocks" by Heikki). Heikki's work significantly simplified the whole unification mechanism; I think that there was over a 50% reduction in logtape.c lines of code in this revision.

* randomAccess cases are now able to reclaim disk space from blocks originally written by workers. This further simplifies the logtape.c changes significantly. I don't think that this is important because some future randomAccess caller might otherwise have double the storage overhead for their parallel sort, or even because of the disproportionate performance penalty such a caller would experience; rather, it's important because it removes previous special cases (that were internal to logtape.c).

For example, aside from the fact that worker tapes within a unified tapeset will often have a non-zero offset, there is no state that actually remembers that this is a unified tapeset, because that isn't needed anymore. And, even though we reclaim blocks from workers, we only have one central chokepoint for applying worker offsets in the leader (that chokepoint is ltsReadFillBuffer(); see the sketch after this list). Routines tasked with things like positional seeking for mark/restore for certain tuplesort clients (which are, in general, poorly tested) now need to have no knowledge of unification while still working just the same. This is a consequence of the fact that ltsWriteBlock() callers (and ltsWriteBlock() itself) never have to think about offsets. I'm pretty happy about that.

* pg_restore now prevents the planner from deciding that parallelism should be used, in order to make restoration behavior more consistent and predictable. Iff a dump being restored happens to have a CREATE INDEX with the new index storage parameter parallel_workers set, then pg_restore will use parallel CREATE INDEX. This is accomplished with a new GUC, enable_parallelddl (since "max_parallel_workers_maintenance = 0" will disable parallel CREATE INDEX across the board, ISTM that a second new GUC is required). I think that this behavior is the right trade-off as far as pg_restore goes, although I still don't feel particularly strongly about it. There is now a concrete proposal on what to do about pg_restore, if nothing else. To recap, the general concern addressed here is that there are typically no ANALYZE stats available for the planner to work with when pg_restore runs CREATE INDEX, although that isn't always true, which made the previous behavior both surprising and inconsistent.

* Addresses the problem of anonymous record types and their need for "remapping" across parallel workers. I've simply pushed the responsibility onto callers as part of the tuplesort.h contract; parallel CREATE INDEX callers don't need to care about this, as explained there. (CLUSTER tuplesorts would also be safe.)

* Puts the whole rationale for unification into one large comment above the function BufFileUnify(), and removes traces of the same kind of discussion from everywhere else. I think that buffile.c is the right central place to discuss the unification mechanism, now that logtape.c has been greatly simplified.
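To make the "single chokepoint" point concrete, here is a minimal standalone sketch of the idea. To be clear, this is not code from the patch: the struct, the field, and the function names below are all invented for illustration, and the real logtape.c layout is different. The only thing it demonstrates is that just the leader's read-fill path ever needs to translate a tape-local block number into a tapeset-wide one; writers never do.

/*
 * Standalone sketch only -- not code from the patch.  Every name here
 * (TapeSketch, readFillBuffer(), etc.) is made up for illustration.
 */
#include <stdio.h>

typedef struct TapeSketch
{
    long    offsetBlockNumber;  /* non-zero only for a worker tape that was
                                 * unified into the leader's tapeset */
} TapeSketch;

/* Stand-in for the low-level block I/O that buffile.c/fd.c would do */
static void
readBlock(long physBlock)
{
    printf("read physical block %ld\n", physBlock);
}

static void
writeBlock(long physBlock)
{
    printf("write physical block %ld\n", physBlock);
}

/*
 * Read path: the one and only place where a worker tape's offset is
 * applied.  Callers above this level use tape-local block numbers and
 * never know whether the tape originated in a worker.
 */
static void
readFillBuffer(TapeSketch *lt, long localBlock)
{
    readBlock(localBlock + lt->offsetBlockNumber);
}

/*
 * Write path: no offset is ever applied, because a tape is only ever
 * written by the process that owns it, using its own block numbers.
 */
static void
writeTapeBlock(long localBlock)
{
    writeBlock(localBlock);
}

int
main(void)
{
    TapeSketch leaderTape = { 0 };      /* leader's own tape: no offset */
    TapeSketch workerTape = { 1024 };   /* worker's blocks start 1024 blocks in */

    writeTapeBlock(0);                  /* writers think in local block numbers */
    readFillBuffer(&leaderTape, 0);     /* reads physical block 0 */
    readFillBuffer(&workerTape, 0);     /* reads physical block 1024 */
    return 0;
}

The offset is just the position within the unified tapeset where a given worker's blocks begin, so applying it in exactly one place keeps everything above that level offset-oblivious.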
All the fd.c changes are in routines that are only ever called by buffile.c anyway, and are not too complicated (in general, temp fd.c files are only ever owned transitively, through BufFiles). So, morally, the unification mechanism is something that wholly belongs to buffile.c, since unification is all about temp files, and buffile.h is the interface through which temp files are owned and accessed in general, without exception.

Unification remains specialized
-------------------------------

On the one hand, BufFileUnify() now describes the whole idea of unification in detail, in its own general terms, including its performance characteristics, but on the other hand it doesn't pretend to be more general than it is (that's why we really have to talk about performance characteristics). It doesn't go as far as admitting to being the thing that logtape.c uses for parallel sort, but even that doesn't seem totally unreasonable to me. I think that BufFileUnify() might also end up being used by tuplestore.c, so it isn't entirely non-general, but I now realize that it's unlikely to be used by parallel hash join.

So, while randomAccess reclamation of worker blocks within the leader now occurs, I have not followed Robert's suggestion in full. For example, I didn't do this: "ltsGetFreeBlock() need to be re-engineered to handle concurrency with other backends". The more I've thought about it, the more appropriate the kind of specialization I've come up with seems. I've concluded:

- Sorting is important, and therefore worth adding non-general infrastructure in support of. It's important enough to have its own logtape.c module, so why not this? Much of buffile.c was explicitly written with sorting and hashing in mind from the beginning. We use BufFiles for other things, but those two things are by far the two most important users of temp files, and the only really compelling candidates for parallelization.

- There are limited opportunities to share BufFile infrastructure between parallel sorting and parallel hashing. Hashing is the inverse of sorting conceptually, so it should not be surprising that this is the case. By that I mean that hashing is characterized by logical division and physical combination, whereas sorting is characterized by physical division and logical combination. Parallel tuplesort naturally allows each worker to do an enormous amount of work with whatever data it is fed by the parallel heap scan that it joins, *long* before the data needs to be combined with data from other workers in any way.

Consider this code from Thomas' parallel hash join patch:

> +bool
> +ExecHashCheckForEarlyExit(HashJoinTable hashtable)
> +{
> +   /*
> +    * The golden rule of leader deadlock avoidance: since leader processes
> +    * have two separate roles, namely reading from worker queues AND
> +    * executing the same plan as workers, we must never allow a leader to
> +    * wait for workers if there is any possibility those workers have
> +    * emitted tuples.  Otherwise we could get into a situation where a
> +    * worker fills up its output tuple queue and begins waiting for the
> +    * leader to read, while the leader is busy waiting for the worker.
> +    *
> +    * Parallel hash joins with shared tables are inherently susceptible to
> +    * such deadlocks because there are points at which all participants
> +    * must wait (you can't start checking for unmatched tuples in the hash
> +    * table until probing has completed in all workers, etc).

Parallel sort will never have to do anything like this.
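To make the contrast concrete, the only cross-process coordination parallel tuplesort needs amounts to "all workers have finished writing their runs". The following is a minimal standalone sketch of that single dependency, using invented names and plain pthreads purely for illustration; the actual patch uses PostgreSQL's shared memory machinery and wait primitives, and real workers also stay attached in a quiescent state so that their temp files remain available to the leader (not modeled here).

/*
 * Standalone sketch only -- not code from the patch.  Illustrates the
 * single phase dependency: workers finish, then the leader merges.
 */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  allDone = PTHREAD_COND_INITIALIZER;
static int workersFinished = 0;

static void *
worker_main(void *arg)
{
    (void) arg;

    /* ... sort this worker's share of the input, writing sorted runs ... */

    pthread_mutex_lock(&lock);
    workersFinished++;              /* advertise completion, then go quiescent */
    pthread_cond_signal(&allDone);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int
main(void)
{
    pthread_t workers[NWORKERS];

    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&workers[i], NULL, worker_main, NULL);

    /* The leader's only wait: every worker must have finished its runs. */
    pthread_mutex_lock(&lock);
    while (workersFinished < NWORKERS)
        pthread_cond_wait(&allDone, &lock);
    pthread_mutex_unlock(&lock);

    printf("all runs written; leader merges now\n");

    for (int i = 0; i < NWORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}

That one wait is essentially the whole synchronization story for the sort itself, which is why nothing like the deadlock-avoidance reasoning quoted above is needed.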
There is minimal IPC before the leader's merge, and the dependencies between phases are extremely simple (there is only one: workers need to finish before the leader can merge, and must stick around in a quiescent state throughout). Data throughput is what tuplesort cares about; it doesn't really care about latency. Whereas, I gather that there needs to be continual gossip between hash join workers (those building a hash table) about the number of batches. They don't have to be in perfect lockstep, but they need to cooperate closely; the IPC is pretty eager, and therefore latency sensitive. Thomas makes use of atomic ops in his patch, which makes sense, but I'd never bother with anything like that for parallel tuplesort; there'd be no measurable benefit there.

In general, it's not obvious to me that the SharedTemporaryFile() API that Robert sketched recently (or any very general shared file interface that does things like buffer writes in shared memory, uses a shared read pointer, etc.) is right for either parallel hash join or parallel sort. I don't see that there is much to be said for a reference count mechanism for parallel sort BufFiles, since the dependencies are so simple and fixed, and for hash join, a much tighter mechanism seems desirable. I can't see why Thomas would want a shared read pointer, since the way he builds the shared hash table leaves it immutable once probing is underway; ISTM that he'll want that kind of mechanism to operate at a higher level, in a more specialized way. That said, I don't actually know what Thomas has in mind for multi-batch parallel hash joins, since that's only a TODO item in the most recent revision of his patch (maybe I missed something he wrote on this topic, though). Thomas is working on a revision that resolves that open item, at which point we'll know more. I understand that a new revision of his patch that closes out the TODO item isn't too far from being posted.

-- 
Peter Geoghegan
0002-Add-temporary-testing-tools.patch.gz
Description: GNU Zip compressed data
0001-Add-parallel-B-tree-index-build-sorting.patch.gz
Description: GNU Zip compressed data