On Mon, Nov 18, 2013 at 6:44 AM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
> I think it's important to recap the design goals of this.
Seems reasonable to list them out.

> * It should be usable and perform well for both large batch updates and
> small transactions.

I think that's a secondary goal, a question to be considered but
perhaps deferred during this initial effort. I agree that it certainly
is important.

> * It should perform well both when there are no duplicates, and when there
> are lots of duplicates

I think this is very important.

> And from that follows some finer requirements:
>
> * Performance when there are no duplicates should be close to raw INSERT
> performance.
>
> * Performance when all rows are duplicates should be close to raw UPDATE
> performance.
>
> * We should not leave behind large numbers of dead tuples in either case.

I agree with all that.

> Anything else I'm missing?

I think so, yes. I'll add:

* Should not deadlock unreasonably.

If the UPDATE case is to work and perform almost as well as a regular
UPDATE, that must mean that it has essentially the same characteristics
as a plain UPDATE. In particular, I feel fairly strongly that it is not
okay for upserts to deadlock with each other unless the possibility
exists of each transaction locking multiple rows (in an inconsistent
order). I don't want to repeat the mistakes of MySQL here. This is a
point that I stressed to Robert on a previous occasion [1]. It's why
value locks and row locks cannot be held at the same time.
Incidentally, that implies that all alternative schemes involving bloat
will bloat once per attempt, I believe.
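To make that concrete, here is a hand-wavy sketch of the control flow I
have in mind. Every function name below is made up for illustration;
this is not the patch's actual code:

/*
 * Sketch of the upsert retry loop (hypothetical function names).  The
 * invariant is that value locks are always released before any row
 * lock is taken, so a single-row upsert is no more deadlock-prone
 * than a plain UPDATE.
 */
for (;;)
{
    ItemPointerData conflictTid;

    if (!value_lock_and_check_unique(rel, slot, &conflictTid))
    {
        /* No duplicate: do the physical insertion under value locks */
        insert_heap_and_index_tuples(rel, slot);
        release_value_locks(rel);
        break;
    }

    /*
     * Duplicate found.  Release *all* value locks before trying to
     * lock the conflicting row -- never hold both kinds at once.
     */
    release_value_locks(rel);

    if (lock_tuple_for_update(rel, &conflictTid))
        break;                  /* row is locked; the UPDATE can proceed */

    /* The conflicting tuple went away; retry from scratch */
}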
I'll also add:

* Should anticipate a day when Postgres needs plumbing for SQL MERGE,
which is still something we want, particularly for batch operations.

I realize that the standard doesn't strictly require MERGE to handle
the concurrency issues, but even so I don't think that an
implementation that doesn't is practicable -- does such an
implementation currently exist in any other system?

> What about exclusion constraints? I'd like to see this work for them as
> well. Currently, exclusion constraints are checked after the tuple is
> inserted, and you abort if the constraint was violated. We could still
> insert the heap and index tuples first, but instead of aborting on
> violation, we would kill the heap tuple we already inserted and retry.
> There are some complications there, like how to wake up any other
> backends that are waiting to grab a lock on the tuple we just killed,
> but it seems doable.

I agree that it's at least doable.

> That would, however, perform badly and leave garbage behind if there
> are duplicates. A refinement of that would be to first check for
> constraint violations, then insert the tuple, and then check again.
> That would avoid the garbage in most cases, but would perform much more
> poorly when there are no duplicates, because it needs two index scans
> for every insertion. A further refinement would be to keep track of how
> many duplicates there have been recently, and switch between the two
> strategies based on that.

Seems like an awful lot of additional mechanism.

> That cost of doing two scans could be alleviated by using
> markpos/restrpos to do the second scan. That is presumably cheaper than
> starting a whole new scan with the same key. (markpos/restrpos don't
> currently work for non-MVCC snapshots, so that'd need to be fixed,
> though)

Well, it seems like we could already use a "pick up where you left off"
mechanism in the case of regular btree index tuple insertions into
unique indexes -- after all, we don't do that today in the event of
blocking pending the outcome of another transaction (one that inserted
a duplicate that we need to conclusively know has or has not
committed). The fact that this doesn't already exist leaves me less
than optimistic about the prospect of making it work to facilitate a
scheme such as the one you describe here. (Today we still need to catch
a committed version of the tuple that would make ours a duplicate from
a fresh index scan, only *after* waiting for the other transaction to
commit/abort at the end of our original index scan.) So we're already
pretty naive about this, even though it would pay not to be.

Making something like markpos work for the purposes of an upsert
implementation seems not only hard, but also like a possible modularity
violation. Are we not unreasonably constraining the implementation
going forward? My patch respects the integrity of the am abstraction,
and doesn't really add any knowledge to the core system about how
amcanunique index methods might go about implementing the new "amlock"
method. The core system worries a bit about the "low level locks" (as
it naively refers to value locks), and doesn't consider that it has the
right to hold on to them for more than an instant, but that's about it.
Plus we don't have to worry about whether something does or does not
work for a certain snapshot type with my approach, because, as with the
current unique index btree coding, it operates at a lower level than
that, and does not need to consider visibility as such. The markpos and
restrpos am methods are only called for regular index (only) scans,
which don't need to worry about things that are not visible. Of course,
upsert needs to worry about invisible-but-conclusively-live things.
This seems much harder, and basically implies value locking of some
kind, if I'm not mistaken. So have you really gained anything?

So what I've done, aside from being, as you say below, close to
optimal, is in a sense defined in terms of existing, well-established
abstractions. I feel it's easier to reason about the implications of
holding value locks (whatever the implementation) for longer and across
multiple operations than it is to do all this instead. What I've done
with locking is scary, but not as scary as the worst case of
alternative implementations.
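To make the modularity point concrete, the division of labour I'm
describing amounts to something like the following. These signatures
are made up for illustration; they are not the patch's actual ones:

/*
 * Illustration only -- hypothetical signatures.  The core system knows
 * just this much: an amcanunique AM can take a "low level lock" on a
 * proposed value, reporting any conflicting tuple, and the lock is
 * never held for more than an instant.  How the lock is implemented
 * (for btree, buffer locks on the relevant leaf pages) is private to
 * the access method.
 */
typedef bool (*amlock_function) (Relation indexRelation,
                                 Datum *values,
                                 bool *isnull,
                                 ItemPointer conflictTid);  /* output */

/* The inverse operation; core code calls it again "an instant" later,
 * and never while doing anything that could block indefinitely. */
typedef void (*amrelease_function) (Relation indexRelation);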
> And that detour with exclusion constraints takes me back to the
> current patch :-). What if you implemented the unique check in a
> similar fashion too (when doing INSERT ON DUPLICATE KEY ...)? First,
> scan for a conflicting key, and mark the position. Then do the
> insertion to that position. If the insertion fails because of a
> duplicate key (which was inserted after we did the first scan), mark
> the heap tuple as dead, and start over. The indexam changes would be
> quite similar to the changes you made in your patch, but instead of
> keeping the page locked, you'd only hold a pin on the target page (if
> even that). The first indexam call would check that the key doesn't
> exist, and remember the insert position. The second call would re-find
> the previous position, and insert the tuple, checking again that there
> really wasn't a duplicate key violation. The locking aspects would be
> less scary than your current patch.
>
> I'm not sure if that would perform as well as your current patch. I
> must admit your current approach is pretty optimal performance-wise.
> But I'd like to see it, and that would be a solution for exclusion
> constraints in any case.

I'm certainly not opposed to making something like this work for
exclusion constraints. Certainly, I want this to be as general as
possible. But I don't think that it needs to be a blocker, and I don't
think we gain anything in code footprint by taking the most general
possible approach to the basic concurrency issue. After all, we're
going to have to repeat the basic pattern in multiple modules.

With exclusion constraints, we'd have to worry about a single slot
proposed for insertion violating (and therefore presumably obliging us
to lock) every row in the table. Are we going to have a mechanism for
spilling a tid array potentially sized in gigabytes to disk (relating
to just one slot proposed for insertion)? Is it principled to have that
one slot project out rejects consisting of (say) the entire table? Is
it even useful to lock multiple rows if we can't really update them,
because they'll overlap each other when all updated with the one value?
These are complicated questions, and frankly I don't have the bandwidth
to answer them too soon. I just want to implement a feature for which
there is obviously huge pent-up demand, and whose absence has in the
past put Postgres at a strategic disadvantage.

I don't think it is unsound to define ON DUPLICATE KEY in terms of
unique indexes. That's how we represent uniques... it isn't spelt ON
OVERLAPPING or whatever. Exclusion constraint support seems like an
addition, a nice-to-have, and maybe not even that, because
exclusion-constrained columns *aren't* keys, and people aren't likely
to want to upsert details of a booking (the typical exclusion
constraint use case) with the booking range in the UPDATE part's
predicate. They'd just do it by key, because they'd already have a
booking number PK value or whatever.

Making this perform as well as possible is an important consideration.
All alternative approaches that involve bloat concern me, for reasons
that I'm not sure were fully appreciated during earlier discussion on
this thread: I'm worried about the worst case, not the average case. I
am worried about a so-called "thundering herd" scenario. You need
something like LockTuple() to arbitrate ordering, which seems complex,
and like a further modularity violation. If this is to perform well
when there are lots of existing tuples to be updated (with contention
that wouldn't be considered unreasonable for plain updates), the amount
of bloat generated by a thundering herd could be really bad: once per
attempt per "head of cattle"/upserter. It's hard to say for sure how
much of a problem this is, but I think it needs to be considered. It's
a problem that I'm not sure we have the tools to analyze ahead of time.
It's easier to pin down and reason about the conventional value locking
stuff, because we know how deadlocks work. We know how to do analysis
of deadlock hazards, and the surface area actually turns out to be not
too large there.
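To spell out the worst case I'm describing (pseudocode with made-up
function names; the numbers are purely illustrative, not measurements):

/*
 * The insert-first family of schemes, sketched.  Every failed attempt
 * leaves a dead heap tuple (plus index entries) behind, so with N
 * upserters contending on the same key, one "wave" can leave up to
 * N - 1 dead tuples, and a herd averaging R retries each leaves on
 * the order of N * R -- e.g. 1000 upserters x 5 retries ~= 5000 dead
 * tuples per wave.
 */
for (;;)
{
    insert_heap_tuple(rel, slot);       /* optimistic physical insertion */
    insert_index_tuples(rel, slot);

    if (!unique_violation_detected(rel, slot))
        break;                          /* we won the race */

    kill_heap_tuple(rel, slot);         /* bloat: one dead tuple per attempt */
    /* ...then lock/update the conflicting row, or loop and retry */
}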
> One fairly serious limitation with your current approach is that the
> number of lwlocks you can hold simultaneously is limited
> (MAX_SIMUL_LWLOCKS == 100). Another limitation is that the minimum for
> shared_buffers is only 16.
>
> Neither of those is a serious problem in real applications - no-one
> runs with shared_buffers=16 and no sane schema has a hundred unique
> indexes, but it's still something to consider.

I was under the impression, based on previous feedback, that what I've
done with LWLocks was unlikely to be accepted. I proceeded under the
assumption that we'll be able to ameliorate these problems, for example
by implementing an alternative value locking mechanism (an SLRU?) that
is similar to what I've done to date (in particular, very cheap and
fast), but without all the downsides that concerned Robert and Andres,
and now you. As I said, I still think that's easier and safer than all
alternative approaches described to date. It just so happens that I
also believe it will perform a lot better in the average case too, but
that isn't a key advantage to my mind.

You're right that the value locking is scary. I think we need to
consider it very carefully, once I have buy-in on the basic approach. I
really do think it's the least-worst approach described to date. It
isn't as though we can't discuss making it inherently less scary, but I
hesitate to do that now, given that I don't know if that discussion
will go anywhere.

Thanks for your efforts on reviewing my work here! Do you think it
would be useful at this juncture to write a patch to make the order of
locking across unique indexes well-defined? I think it may well have
independent value: it gets the insertions into unique indexes (which
can throw errors) out of the way as early as possible when doing a
regular slot insertion. Better to abort the transaction as soon as
possible.

[1] http://www.postgresql.org/message-id/CAM3SWZRfrw+zXe7CKt6-QTCuvKQ-Oi7gnbBOPqQsvddU=9m...@mail.gmail.com

-- 
Peter Geoghegan