On Mon, Nov 4, 2019 at 2:04 PM Andres Freund <and...@anarazel.de> wrote:
> Is that really true? In the case where it started and failed we expect
> the error queue to have been attached to, and there to be either an
> error 'E' or a 'X' response (cf HandleParallelMessage()). It doesn't
> strike me as very complicated to keep track of whether any worker has
> sent an 'E' or not, no? I don't think we really need the
One of us is confused here, because I don't think that helps. Consider
three background workers, Alice, Bob, and Charlie. Alice fails to launch
because fork() fails. Bob launches but then exits unexpectedly. Charlie
has no difficulties and carries out his assigned duties. Now, the system
you are proposing will say that Charlie is OK but Alice and Bob are a
problem. However, that's the way it already works. What Tom wants is to
distinguish Alice from Bob, and your proposal is of no help at all with
that problem, so far as I can see.

> > We certainly can't ignore a worker that managed to start and
> > then bombed out, because it might've already, for example, claimed a
> > block from a Parallel Seq Scan and not yet sent back the corresponding
> > tuples. We could ignore a worker that never started at all, due to
> > EAGAIN or whatever else, but the original process that registered the
> > worker has no way of finding this out.
>
> Sure, but in that case we'd have gotten either an error back from the
> worker, or postmaster would have PANIC restarted everyone due to an
> unhandled error in the worker, no?

An unhandled ERROR in the worker is not a PANIC. I think it's just an
ERROR that ends up being fatal in effect, but even if it's actually
promoted to FATAL, it's not a PANIC. It is *generally* true that if a
worker hits an ERROR, the error will be propagated back to the leader,
but it is not an invariable rule. One pretty common way that it fails to
happen - common in the sense that it comes up during development, not
common on production systems I hope - is if a worker dies before
reaching the call to pq_redirect_to_shm_mq(). Before that, there's no
possibility of communicating anything. Granted, at that point we
shouldn't yet have done any work that might mess up the query results.
Similarly, once we reach that point, we are dependent on a certain
amount of good behavior for things to work as expected; yeah, any code
that calls proc_exit() is supposed to signal an ERROR or FATAL first,
but what if it doesn't? Granted, in that case we'd probably fail to send
an 'X' message, too, so the leader would still have a chance to realize
something is wrong.

I guess I agree with you to this extent: I made a policy decision that
if a worker is successfully registered but then fails to show up, that's
an ERROR. It would be possible to adopt the opposite policy, namely that
if a worker doesn't show up, that's an "oh well." You'd have to be very
certain that the worker wasn't going to show up later, though. For
instance, suppose you check all of the shared memory queues used for
returning tuples and find that every queue is either in a state where
(1) nobody's ever attached to it or (2) somebody attached and then
detached. This is not good enough, because it's possible that after you
checked queue #1 and found it in the former state, a worker attached to
it and read a block, while queue #2 entered the latter state before you
got around to checking it. If you decide that it's OK to conclude that
we're done at this point, you'll never return the tuples that are pushed
through queue #1. But, assuming you nailed the door shut so that such
problems could not occur, I think we could make a decision to ignore
workers that failed before doing anything interesting. Whether that
would be a good policy decision is pretty questionable in my mind.
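To make the race concrete, the naive check might look roughly like the
sketch below. This is hedged illustration only, not code from the tree:
shm_mq_get_sender() and shm_mq_receive() are the real shm_mq calls, but
the function and its tupqueues/leader_mqh/nqueues arguments are invented
for the example, and the closing comment marks where the race bites.

/*
 * Hedged sketch only, not actual PostgreSQL source: a naive "is every
 * worker done?" check of the kind described above.  shm_mq_get_sender()
 * and shm_mq_receive() are real shm_mq APIs; the function itself and the
 * tupqueues/leader_mqh/nqueues variables are made up for illustration.
 */
#include "postgres.h"

#include "storage/shm_mq.h"

static bool
all_tuple_queues_look_finished(shm_mq **tupqueues,
                               shm_mq_handle **leader_mqh,
                               int nqueues)
{
    for (int i = 0; i < nqueues; i++)
    {
        Size        nbytes;
        void       *data;

        /* State (1): no worker has ever attached as sender. */
        if (shm_mq_get_sender(tupqueues[i]) == NULL)
            continue;

        /* State (2): a worker attached and has since detached. */
        if (shm_mq_receive(leader_mqh[i], &nbytes, &data, true) ==
            SHM_MQ_DETACHED)
            continue;

        /* Otherwise a worker is still attached, or tuples are pending. */
        return false;
    }

    /*
     * THE RACE: after queue #1 was judged "never attached" above, a
     * late-starting worker can still attach to it and claim a block from
     * the Parallel Seq Scan; by the time queue #2 is checked and found
     * detached, this function has already decided queue #1 can be
     * ignored, so its tuples would never be returned.
     */
    return true;
}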
In addition to what I mentioned before, I think there's a serious danger
that errors that users would have really wanted to know about - or
developers would really want to have known about - would get ignored.
You could have some horrible problem that's making your workers fail to
launch, and the system would just carry on as if everything were fine,
except with bad query plans. I realize that you and others might say
"oh, well, monitor your logs, then," but I think there is certainly some
value in an ordinary user being able to know that things didn't go well
without having to look into the PostgreSQL log for errors. Now, maybe
you think that's not enough value to justify having it work the way it
does today, and I certainly respect that, but I don't view it that way
myself.

What I mostly want to emphasize here is that, while parallel query has
had a number of bugs in this area that were the result of shoddy design
or inadequate testing - principally by me - this isn't one of them. This
decision was made consciously by me because I thought it gave us the
best chance of having a system that would be reliable and have
satisfying behavior for users. Sounds like not everybody agrees, and
that's fine, but I just want to get it out there that this wasn't
accidental on my part.

> > And even if you solved for all of that, I think you might still find
> > that it breaks some parallel query (or parallel create index) code
> > that expects the number of workers to change at registration time, but
> > not afterwards. So, that code would all need to be adjusted.
>
> Fair enough. Although I think practically nearly everything has to be
> ready to handle workers just being slow to start up anyway, no? There's
> plenty of cases where we just finish before all workers are getting
> around to do work.

Because of the shutdown race mentioned above, we generally have to wait
for workers to exit before we can shut down parallelism. See commit
2badb5afb89cd569500ef7c3b23c7a9d11718f2f (whose commit message also
documents some of the behaviors now in question). So we tolerate slow
startup in that it doesn't prevent us from getting started on query
execution, but not to the extent that we can finish query execution
without knowing definitively that every worker is either already gone or
will never be showing up.

(Is it possible to do better there? Perhaps. If we could somehow throw
up a brick wall to prevent new workers from doing anything that would
cause problems, then verify that every worker which got past the brick
wall has exited cleanly, then we could ignore the risk of more workers
showing up later, because they'd hit the brick wall before causing any
trouble.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
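As a rough illustration of the wait-for-exit requirement described
above, a sketch follows. It is hedged: WaitForBackgroundWorkerShutdown()
and the BGWH_* status codes are the real background-worker API, but the
struct, the function, and the error message are hypothetical, not the
actual parallel.c code.

/*
 * Hedged sketch, not the actual parallel.c logic: the shape of the
 * "wait until every worker is definitively gone" rule described above.
 * WaitForBackgroundWorkerShutdown() and the BGWH_* statuses are the real
 * bgworker API; the struct and function here are hypothetical.
 */
#include "postgres.h"

#include "postmaster/bgworker.h"

typedef struct MyParallelWorkers
{
    int          nworkers;
    /* handles obtained from RegisterDynamicBackgroundWorker() */
    BackgroundWorkerHandle **handles;
} MyParallelWorkers;

static void
wait_for_workers_to_exit(MyParallelWorkers *w)
{
    for (int i = 0; i < w->nworkers; i++)
    {
        BgwHandleStatus status;

        /*
         * BGWH_STOPPED means the worker either ran and exited, or will
         * never be started at all; only then is it safe to assume it
         * cannot show up later and claim work behind our back.
         */
        status = WaitForBackgroundWorkerShutdown(w->handles[i]);
        if (status == BGWH_POSTMASTER_DIED)
            elog(FATAL, "postmaster exited during a parallel operation");
    }
}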