On Wed, Jan 11, 2017 at 12:05 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Wed, Jan 11, 2017 at 2:20 PM, Peter Geoghegan <p...@heroku.com> wrote: >> You'd probably still want to throw an error when workers ended up not >> deleting BufFile segments they owned, though, at least for parallel >> tuplesort. > > Don't see why.
Simply because that's not expected as things stand -- why should the file go away in that context? (Admittedly, that doesn't seem like an excellent reason now.) I actually like the idea of a reference count, the more I think about it, since it doesn't actually have any tension with my original idea of ownership. If something like a randomAccess parallel tuplesort leader merge needs to write new segments (which it almost certainly *won't* anyway, due to my recent V7 changes), then it can still own those new segments itself, alone, and delete them on its own in the manner of conventional temp files, because we can still restrict the shared refcount mechanism to the deletion of "initial" segments. The refcount == 0 deleter only deletes those initial segments, and not any same-BufFile segments that might have been added (added to append to our unified BufFile within leader). I think that parallel hash join won't use this at all, and, as I said, it's only a theoretical requirement for parallel tuplesort, which will generally recycle blocks from worker temp files for its own writes all the time for randomAccess cases, the only cases that ever write within logtape.c. So, the only BufFile shared state needed, that must be maintained over time, is the refcount variable itself. The size of the "initial" BufFile (from which we derive number of new segments during unification) is passed, but it doesn't get maintained in shared memory. BufFile size remains a one way, one time message needed during unification. I only really need to tweak things in fd.c temp routines to make all this work. This is something I like because it makes certain theoretically useful things easier. Say, for example, we wanted to have tuplesort workers merge worker final materialized tapes (their final output), in order to arrange for the leader to have fewer than $NWORKER runs to merge at the end -- that's made easier by the refcount stuff. (I'm still not convinced that that's actually going to make CREATE INDEX faster. Still, it should, on general principle, be easy to write a patch that makes it happen -- a good overall design should leave things so that writing that prototype patch is easy.) >> This idea is something that's much more limited than the >> SharedTemporaryFile() API that you sketched on the parallel sort >> thread, because it only concerns resource management, and not how to >> make access to the shared file concurrency safe in any special, >> standard way. > > Actually, I only intended that sketch to be about resource management. > Sounds like I didn't explain very well. I'm glad to hear that, because I was very puzzled by what you said. I guess I was thrown off by "shared read pointers". I don't want to get into the business of flushing out dirty buffers, or making sure that every backend stays in lockstep about what the total size of the BufFile needs to be. It's so much simpler to just have clear "barriers" for each parallel operation, where backends present a large amount of immutable state to one other backend at the end, and tells it how big its BufFile is only once. (It's not quite immutable, since randomAccess recycle of temp files can happen within logtape.c, but the point is that there should be very little back and forth -- that needs to be severely restricted.) -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers