Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Robert Haas Tue, 20 Dec 2016 20:04:32 -0800

On Tue, Dec 20, 2016 at 8:14 PM, Peter Geoghegan <p...@heroku.com> wrote:
> Without meaning to sound glib, unification is the process by which
> parallel CREATE INDEX has the leader read temp files from workers
> sufficient to complete its final on-the-fly merge.


That's not glib, but you can't in the end define BufFile unification
in terms of what parallel CREATE INDEX needs.  Whatever changes we
make to lower-level abstractions in the service of some higher-level
goal need to be explainable on their own terms.

>>It's fine to make up new words -- indeed, in some sense that is the essence
>> of writing any complex problem -- but you have to define them.
>
> I invite you to help me define this new word.

If at some point I'm able to understand what it means, I'll try to do
that.  I think you're loosely using "unification" to mean combining
stuff from different backends in some way that depends on the
particular context, so that "BufFile unification" can be different
from "LogicalTape unification".  But that's just punting the question
of what each of those things actually are.

> Maybe it's inferior to that, but I think what Heikki proposes is more
> or less complementary to what I've proposed, and has nothing to do
> with resource management and plenty to do with making the logtape.c
> interface look nice, AFAICT. It's also about refactoring/simplifying
> logtape.c itself, while we're at it. I believe that Heikki has yet to
> comment either way on my approach to resource management, one aspect
> of the patch that I was particularly keen on your looking into.

My reading of Heikki's point was that there's not much point in
touching the BufFile level of things if we can do all of the necessary
stuff at the LogicalTape level, and I agree with him about that.  If a
shared BufFile had a shared read-write pointer, that would be a good
justification for having it.  But it seems like unification at the
BufFile level is just concatenation, and that can be done just as well
at the LogicalTape level, so why tinker with BufFile?  As I've said, I
think there's some low-level hacking needed here to make sure files
get removed at the correct time in all cases, but apart from that I
see no good reason to push the concatenation operation all the way
down into BufFile.

> The theory of operation here is that workers own their own BufFiles,
> and are responsible for deleting them when they die. The assumption,
> rightly or wrongly, is that it's sufficient that workers flush
> everything out (write out temp files), and yield control to the
> leader, which will open their temp files for the duration of the
> leader final on-the-fly merge. The resource manager in the leader
> knows it isn't supposed to ever delete worker-owned files (just
> close() the FDs), and the leader errors if it cannot find temp files
> that match what it expects. If there is a an error in the leader, it
> shuts down workers, and they clean up, more than likely. If there is
> an error in the worker, or if the files cannot be deleted (e.g., if
> there is a classic hard crash scenario), we should also be okay,
> because nobody will trip up on some old temp file from some worker,
> since fd.c has some gumption about what workers need to do (and what
> the leader needs to avoid) in the event of a hard crash. I don't see a
> risk of file descriptor leaks, which may or may not have been part of
> your concern (please clarify).

I don't think there's any new issue with file descriptor leaks here,
but I think there is a risk of calling unlink() too early or too late
with your design.  My proposal was an effort to nail that down real
tight.

>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> I'll remind you that parallel CREATE INDEX doesn't actually ever need
> to be randomAccess, and so we are not actually going to ever need to
> do this as things stand. I wrote the code that way in order to not
> break the existing interface, which seemed like a blocker to posting
> the patch. I am open to the idea of such an "adoption" occurring, even
> though it actually wouldn't help any case that exists in the patch as
> proposed. I didn't go that far in part because it seemed premature,
> given that nobody had looked at my work to date at the time, and given
> the fact that there'd be no initial user-visible benefit, and given
> how the exact meaning of "unification" was (and is) somewhat in flux.
>
> I see no good reason to not do that, although that might change if I
> actually seriously undertook to teach the leader about this kind of
> "adoption". I suspect that the interface specification would make for
> confusing reading, which isn't terribly appealing, but I'm sure I
> could manage to make it work given time.

I think the interface is pretty clear: the worker's logical tapes get
incorporated into the leader's LogicalTapeSet as if they'd been there
all along.  After all, by the time this is happening, IIUC (please
confirm), the worker is done with those tapes and will never read or
modify them again.  If that's right, the worker just needs a way to
identify those tapes to the leader, which can then add them to its
LogicalTapeSet.  That's it.  It needs a way to identify them, but I
think that shouldn't be hard; in fact, I think your patch has
something like that already.  And it needs to make sure that the files
get removed at the right time, but I already sketched a solution to
that problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Reply via email to