On Mon, Jul 16, 2018 at 3:25 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> Oh, right, I forgot the patch also adds the leader into the group, for
> some reason (I agree it's unclear why that would be necessary, as you
> pointed out later).
>
> But all this is happening while holding the partition lock (in exclusive
> mode). And the decoding backends do synchronize on the lock correctly
> (although, man, the rechecks are confusing ...).
>
> But now I see ProcKill accesses decodeGroupLeader in multiple places,
> and only the first one is protected by the lock, for some reason
> (interestingly enough the one in lockGroupLeader block). Is that what
> you mean?
I haven't traced out the control flow completely, but it sure looks to
me like there are places where decodeGroupLeader is checked without
holding any LWLock at all. Also, it looks to me like in some places
(like where we're trying to find a PGPROC by XID) we use ProcArrayLock,
and in others -- I guess where we're checking the decodeGroupBlah
stuff -- we are using the lock manager locks. I don't know how safe
that is, and there are not a lot of comments justifying it. I also
wonder why we're using the lock manager locks to protect the
decodeGroup stuff rather than backendLock.

> FWIW I suspect the ProcKill part is borked due to incorrectly resolved
> merge conflict or something, per my initial response from today.

Yeah, I wasn't seeing the code the way I thought you were describing it
in that response, but I'm dumb this week so maybe I just misunderstood.

>> I think that's probably not going to work out, but of course it's up
>> to you how you want to spend your time!
>
> Well, yeah. I'm sure I could think of more fun things to do, but OTOH I
> also have patches that require the capability to decode in-progress
> transactions.

It's not a matter of fun; it's a matter of whether it can be made to
work. Don't get me wrong -- I want the ability to decode in-progress
transactions. I complained about that aspect of the design to Andres
when I was reviewing and committing logical slots & logical decoding,
and I complained about it probably more than I complained about any
other aspect of it, largely because it instantaneously generates a
large lag when a bulk load commits. But not liking something about the
way things are is not the same as knowing how to make them better. I
believe there is a way to make it work because I believe there's a way
to make anything work. But I suspect that it's at least one order of
magnitude more complex than this patch currently is, and likely an
altogether different design.

> But the way I understand it, it pretty much *is* a list of waiters,
> along with a couple of flags to allow the processes to notify the other
> side about lock/unlock/abort. It does resemble the lock groups, but I
> don't think it has the same goals.

So the parts that aren't relevant shouldn't be copied over.

>> That having been said, I still don't see how that's really going to
>> work. Just to take one example, suppose that the leader is trying to
>> ERROR out, and the decoding workers are blocked waiting for a lock
>> held by the leader. The system has no way of detecting this deadlock
>> and resolving it automatically, which certainly seems unacceptable.
>> The only way that's going to work is if the leader waits for the
>> worker by trying to acquire a lock held by the worker. Then the
>> deadlock detector would know to abort some transaction. But that
>> doesn't really work either - the deadlock was created by the
>> foreground process trying to abort, and if the deadlock detector
>> chooses that process as its victim, what then? We're already trying
>> to abort, and the abort code isn't supposed to throw further errors,
>> or fail in any way, lest we break all kinds of other things. Not to
>> mention the fact that running the deadlock detector in the abort path
>> isn't really safe to begin with, again because we can't throw errors
>> when we're already in an abort path.
>
> Fair point, not sure. I'll leave this up to Nikhil.

That's fine, but please understand that I think there's a basic design
flaw here that just can't be overcome by any amount of hacking on the
details.
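For what it's worth, here is roughly the kind of thing I have in mind
for the backendLock question above -- a minimal sketch only, where
GetDecodeGroupLeader is a made-up name and decodeGroupLeader is the
field the patch adds to PGPROC (it does not exist in core):

#include "postgres.h"

#include "storage/lwlock.h"
#include "storage/proc.h"

/*
 * Sketch: read the patch's decodeGroupLeader field under the
 * per-backend LWLock rather than a lock manager partition lock.
 */
static PGPROC *
GetDecodeGroupLeader(PGPROC *proc)
{
    PGPROC *leader;

    LWLockAcquire(&proc->backendLock, LW_SHARED);
    leader = proc->decodeGroupLeader;
    LWLockRelease(&proc->backendLock);

    return leader;
}

Anything that sets or clears the field (ProcKill included) would take
the same lock in LW_EXCLUSIVE mode, so at least the locking rule would
be obvious and easy to comment.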
I think we need a much higher-level consideration of the problem here
and probably a lot of new infrastructure to support it. One idea might
be to initially support decoding of in-progress transactions only if
they don't modify any catalog state. That would leave out a bunch of
cases we'd probably like to support, such as CREATE TABLE + COPY in the
same transaction, but it would likely dodge a lot of really hard
problems, too, and we could improve things later.

One approach to the problem of catalog changes would be to prevent
catalog tuples from being removed even after transaction abort until
such time as there's no decoding in progress that might care about
them. That is not by itself sufficient, because a transaction can abort
after inserting a heap tuple but before inserting the corresponding
index tuple, and we can't look at the catalog in an inconsistent state
like that and expect reasonable results. But it helps: for example, if
you are decoding a transaction that has inserted a WAL record with a
cmin or cmax value of 4, and you know that none of the catalog records
created by that transaction can have been pruned, then it should be
safe to use a snapshot with CID 3 or smaller to decode the catalogs. So
consider a case like:

BEGIN;
CREATE TABLE blah ...           -- command ID 0
COPY blah FROM '/tmp/blah' ...  -- command ID 1

Once we see the COPY show up in the WAL, it should be safe to decode
the CREATE TABLE command and figure out what a snapshot with command
ID 0 can see (again, assuming we've suppressed pruning in the catalogs
in a sufficiently-well-considered way). Then, as long as the COPY
command doesn't do any DML via a trigger or a datatype input function
(!) or whatever, we should be able to use that snapshot to decode the
data inserted by COPY.

I'm not quite sure what happens if the COPY does do some DML or
something like that -- we might have to stop decoding until the
following command begins in the live transaction. Or maybe we don't
have to do that; I'm not totally sure how the command counter is
managed for catalog snapshots. However it works in detail, we will get
into trouble if we ever use a catalog snapshot that can see a change
that the live transaction is still in the midst of making. Even with
pruning prevented, we can only count on the catalogs to be in a
consistent state once the live transaction has finished the command --
otherwise, for example, it might have increased pg_class.relnatts but
not yet added the pg_attribute entry at the time it aborts, or
something like that.

I'm blathering a little bit, but hopefully you get the point: I think
the way forward is for somebody to think carefully through how and
under what circumstances using a catalog snapshot can be made safe even
if an abort has occurred afterwards -- not by trying to postpone the
abort, which I think is never going to be right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
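To spell out the CID rule above in code form -- again just a sketch;
CatalogSnapshotSafeForDecoding and its parameters are invented names,
and only the rule itself comes from the discussion above:

#include "postgres.h"

/*
 * A catalog snapshot with command id snap_cid can be trusted while
 * decoding an in-progress transaction only if (a) catalog pruning has
 * been suppressed for that transaction and (b) snap_cid is strictly
 * less than the highest cmin/cmax already seen in the transaction's
 * WAL: commands at or after that point may have left the catalogs
 * half-updated if the transaction aborted mid-command.
 */
static bool
CatalogSnapshotSafeForDecoding(CommandId snap_cid,
                               CommandId max_cid_seen_in_wal,
                               bool pruning_suppressed)
{
    if (!pruning_suppressed)
        return false;

    return snap_cid < max_cid_seen_in_wal;
}

In the CREATE TABLE + COPY example, once WAL generated by the COPY
(command ID 1) has been seen, a snapshot with command ID 0 passes this
test, matching the reasoning above.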