On Thu, Feb 16, 2017 at 8:45 AM, Peter Geoghegan <p...@bowt.ie> wrote:
>> I do not think there should be any reason why we can't get the
>> resource accounting exactly correct here. If a single backend manages
>> to remove every temporary file that it creates exactly once (and
>> that's currently true, modulo system crashes), a group of cooperating
>> backends ought to be able to manage to remove every temporary file
>> that any of them create exactly once (again, modulo system crashes).
>
> I believe that we are fully in agreement here. In particular, I think
> it's bad that there is an API that says "caller shouldn't throw an
> elog error between these two points", and that will be fixed before
> too long. I just think that it's worth acknowledging a certain nuance.
I attach V9 of the patch. I came up with an approach to resource
management that I think meets every design goal that we have for
shared/unified BufFiles:

* Avoids both resource leaks and spurious double-freeing of resources
(e.g., a second unlink() for a file from a different process) when there
are errors. The latter problem was possible before, a known issue with V8
of the patch. I believe that this revision avoids these problems in a way
that is *absolutely bulletproof* in the face of arbitrary failures (e.g.,
palloc() failure) in any process at any time. Be warned, though, that
there is a remaining open item concerning resource management in the
leader-as-worker case, which I go into below.

There are now what you might call "critical sections" in one function.
That is, there are points where we cannot throw an error (without a
BEGIN_CRIT_SECTION()!), but those are entirely confined to unification
code within the leader, where we can be completely sure that no error can
be raised. The leader can even fail when only some (but not all) of a
particular worker's segments have made it into its local resource
manager, and we still do the right thing.

I've been testing this by adding code that randomly throws errors at
points interspersed throughout the worker and leader unification hand-off
points. I then leave this stress-test build to run for a few hours, while
monitoring for leaked files and for spurious fd.c reports of
double-unlink() and similar issues. Test builds change LOG to PANIC in
several places in fd.c, while MAX_PHYSICAL_FILESIZE was reduced from 1GiB
to BLCKSZ.

All of these guarantees are provided without requiring any special care
from buffile.c's callers. The only V9 change to tuplesort.c or logtape.c
in this general area is that they have to pass a dynamic shared memory
segment to buffile.c, so that it can register a new callback (a rough
sketch of the idea appears below). That's it. This may be of particular
interest to Thomas. All complexity is confined to buffile.c.

* No expansion in the use of shared memory to manage resources. The
BufFile refcount is still per-worker. The role of local resource managers
is unchanged.

* Additional complexity over and above ordinary BufFile resource
management is confined to the leader process and its on_dsm_detach()
callback. Only the leader registers a callback. Of course, refcount
management within BufFileClose() can still take place in workers, but
that isn't something that we rely on (that's only for non-error paths).
In general, worker processes still have their resource managers manage
their temp file segments as something that has nothing to do with
BufFiles (BufFiles are still not owned by resowner.c/fd.c -- they're
blissfully unaware of all of this).

* In general, unified BufFiles can still be treated in exactly the same
way as conventional BufFiles, and things just work, without any special
cases being exercised internally.

There is still an open item here, though: the leader-as-worker
Tuplesortstate, a special case, can still leak files. So, stress-testing
will only show the patch to be completely robust against resource leaks
when nbtsort.c is modified to enable FORCE_SINGLE_WORKER testing. Despite
the name FORCE_SINGLE_WORKER, you can also modify that file to force
there to be arbitrarily many workers requested (just change "requested =
1" to something else). The leader-as-worker problem is avoided because
the leader does not participate as a worker this way, which would
otherwise present issues for resowner.c that I haven't got around to
fixing just yet.
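Before getting to that open item, here is a rough illustration of the
kind of leader-only on_dsm_detach() cleanup described above. To be clear,
this is not code from the patch: the struct, function, and constant names
(UnifiedBufFileState, unified_buffile_cleanup, MAX_UNIFIED_SEGMENTS,
RegisterUnifiedBufFileCleanup) are hypothetical, and the real bookkeeping
certainly differs. It only shows the shape of a callback that never
throws and that unlinks just those worker segments whose ownership never
made it into the leader's local resource manager:

/*
 * Illustrative only -- not code from the patch.  All names here
 * (UnifiedBufFileState, unified_buffile_cleanup, MAX_UNIFIED_SEGMENTS,
 * RegisterUnifiedBufFileCleanup) are hypothetical.
 */
#include "postgres.h"

#include <unistd.h>

#include "storage/dsm.h"

#define MAX_UNIFIED_SEGMENTS	64		/* arbitrary, for illustration */

/* Leader-local bookkeeping; nothing extra is put in shared memory */
typedef struct UnifiedBufFileState
{
	int			nsegments;		/* worker temp segments to account for */
	bool		adopted[MAX_UNIFIED_SEGMENTS];	/* owned by leader resowner? */
	char		segpaths[MAX_UNIFIED_SEGMENTS][MAXPGPATH];
} UnifiedBufFileState;

/*
 * on_dsm_detach() callback, registered by the leader only.  It must not
 * throw errors.  It unlinks just those worker segments whose ownership
 * never made it into the leader's local resource manager, so every
 * segment is removed exactly once even if unification fails part-way.
 */
static void
unified_buffile_cleanup(dsm_segment *seg, Datum arg)
{
	UnifiedBufFileState *state = (UnifiedBufFileState *) DatumGetPointer(arg);
	int			i;

	for (i = 0; i < state->nsegments; i++)
	{
		if (!state->adopted[i])
			(void) unlink(state->segpaths[i]);	/* best effort, ignore errors */
	}
}

/* Called once by the leader when unification begins */
void
RegisterUnifiedBufFileCleanup(dsm_segment *seg, UnifiedBufFileState *state)
{
	on_dsm_detach(seg, unified_buffile_cleanup, PointerGetDatum(state));
}

The essential property is that each segment has exactly one owner at any
instant, so the callback and the ordinary resowner/fd.c path can never
both unlink the same file.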
It isn't hard to see why the leader-as-worker case is a problem: one
backend ending up with two fd.c file descriptors (vFDs) for certain temp
segments is just going to cause problems for resowner.c without
additional special care. It didn't seem worth blocking on that; I want to
prove that my general approach is workable first. The problem is confined
to one backend's resource manager, when that backend is the leader
participating as a worker. It is not a refcount problem. The simplest
solution would be to ban the leader-as-worker case by contract.
Alternatively, we could pass fd.c segments from the leader-as-worker
Tuplesortstate's BufFile to the leader Tuplesortstate's BufFile without
opening or closing anything, so that there is never a second vFD entry
for any segment at any time.

I've also made several changes to the cost model, changes agreed to over
on the "Cost model for parallel CREATE INDEX" thread. No need to recap
those changes here. In short, things have been *significantly* simplified
in that area.

Finally, note that I decided to throw out more code within tuplesort.c.
A parallel leader is now explicitly set up to be exactly consistent with
a conventional/serial external sort whose merge is about to begin. In
particular, it now uses mergeruns() (a rough outline of the idea appears
at the end of this mail). Robert said that he thinks of this as a patch
that is to some degree about parallelism, and to some degree about
sorting. I'd say that by now it's roughly 5% about sorting, in terms of
the proportion of code that expressly considers sorting. Most of the new
code in tuplesort.c is about managing dependencies between participating
backends. I've really focused on avoiding new special cases, especially
with V9.
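To make that last point a little more concrete, here is a purely
illustrative outline of the shape such a takeover could have. Again, this
is not the patch's code: SharedSortState and adopt_worker_run() are names
invented for this sketch, while Tuplesortstate, TSS_BUILDRUNS and
mergeruns() are existing private tuplesort.c internals, so a fragment
like this could only live inside tuplesort.c itself:

/*
 * Illustrative only -- not from the patch.  SharedSortState and
 * adopt_worker_run() are made-up names; Tuplesortstate, TSS_BUILDRUNS
 * and mergeruns() are existing tuplesort.c internals.
 */
static void
leader_takeover_worker_runs(Tuplesortstate *state, SharedSortState *shared)
{
	int			workerno;

	/*
	 * Adopt each worker's sorted run (its unified BufFile) as one tape in
	 * the leader's logical tape set; BufFile/logtape.c plumbing omitted.
	 */
	for (workerno = 0; workerno < shared->nworkers; workerno++)
		adopt_worker_run(state, shared, workerno);

	/*
	 * The leader now looks exactly like a serial external sort that has
	 * just finished writing out its runs, so the pre-existing merge code
	 * runs unchanged.
	 */
	state->status = TSS_BUILDRUNS;
	mergeruns(state);
}

--
Peter Geoghegan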
0001-Add-parallel-B-tree-index-build-sorting.patch.gz
Description: GNU Zip compressed data
0002-Add-temporary-testing-tools.patch.gz
Description: GNU Zip compressed data