On Thu, Nov 24, 2016 at 11:06 AM, Greg Stark <st...@mit.edu> wrote:
> Fwiw, Oracle does not use the undo log for snapshot fetches. It's used
> only for transaction rollback and recovery.
>
> For snapshot isolation Oracle has yet a *third* copy of the data in a
> space called the "rollback segment(s)". When you update a row in a
> block you save the whole block in the rollback segment. When you try
> to access a block you check if the CSN -- which is basically
> equivalent to our LSN -- is newer than your snapshot and if it is you
> fetch the old version of the block from the rollback.
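(Purely for illustration, the mechanism described above amounts to roughly
the following sketch. The types and names are invented for the example --
this is not Oracle's or PostgreSQL's actual code, and as noted below the
description itself may not be accurate.)

#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical structures: these are NOT Oracle's or PostgreSQL's real
 * data structures, they only illustrate the per-block check described
 * in the quoted paragraph above.
 */
typedef struct BlockVersion
{
    uint64_t             csn;    /* commit sequence number of the change
                                    that produced this image of the block */
    struct BlockVersion *older;  /* prior image saved in the rollback
                                    segment, or NULL if none */
    char                 data[8192];
} BlockVersion;

/*
 * Per-block snapshot read: if the current block image is newer than the
 * snapshot, walk back through the saved images until we find one that
 * the snapshot is allowed to see.
 */
static const BlockVersion *
block_for_snapshot(const BlockVersion *current, uint64_t snapshot_csn)
{
    const BlockVersion *blk = current;

    while (blk != NULL && blk->csn > snapshot_csn)
        blk = blk->older;

    return blk;                  /* NULL here would mean "snapshot too old" */
}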
My understanding is that this description isn't correct. I think the
rollback segments are what they call the thing that stores UNDO; see e.g.
http://ss64.com/ora/syntax-redo.html

Having said that, I'm by no means an Oracle expert.

> Essentially their MVCC is done on a per-block level rather than a
> per-row level and they keep only the newest version of the block in
> the table, the rest are in the rollback segment. For what it's worth
> I think our approach is cleaner and more flexible. They had a lot of
> trouble with their approach over the years and it works well only
> because they invested an enormous amount of development in it and also
> because people throw a lot of hardware at it too.

People do throw a lot of hardware at Oracle, but I don't think I'd accept
the contention that Oracle generally gets less performance out of the
same amount of hardware than PostgreSQL. There are certainly some things
we often do faster -- including inserts, deletes, and transaction aborts
-- but their implementation of updates seems to be very fast. Obviously,
a lot depends on configuration and workload, so it's hard to make general
statements, but I think it's more correct to imagine that people throw a
lot of hardware at Oracle because that's where their big, critical
databases are than to imagine that it's because Oracle is a resource hog.

> I think the main use case we have trouble with is actually the "update
> every row in the table" type of update which requires we write to
> every block, plus a second copy of every block, plus write full pages
> of both copies, then later set hint bits dirtying pages again and
> generating more full page writes, then later come along and vacuum
> which requires two more writes of every block, etc. If we had a
> solution for the special case of an update that replaces every row in
> a page that I think would complement HOT nicely and go a long way
> towards fixing our issues.

This case is certainly a problem, but I don't believe it's anywhere close
to being our only problem. My experience is that our system works fairly
well when there are only a few indexes. However, with heavily indexed
tables, all of your updates become non-HOT and then bad things happen. So
you can either leave out indexes that you need for good query
performance, in which case you're hosed, or you can add those indexes and
then you're hosed anyway because of the bloat and write amplification
associated with non-HOT updates.

Also, write-heavy workloads that perform OK as long as there are no
long-lived snapshots often enter a death spiral when you run reporting
queries on the same system. Finer-grained snapshot tracking would help
with that problem, but it doesn't solve it: now a single long-lived
snapshot can "only" cause the database to double in size instead of
growing without bound, which is scant consolation. Nor does it change the
fundamental calculus that once a table gets bloated, it's very painful to
de-bloat it. The appeal of a design that supports in-place update is that
you don't bloat the table in the first place. You still have the bloat,
of course, but it's off in a separate data structure that is engineered
for efficient deletion.

I think the whole emphasis on whether, and to what degree, this is like
Oracle is somewhat misplaced. I would look at it a different way. We've
talked many times over the years about how PostgreSQL is optimized for
aborts, and everybody I've heard comment on that issue thinks that is a
bad thing.
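To put rough numbers on the "update every row in the table" case Greg
describes above, here is a back-of-the-envelope sketch. The constants are
made-up assumptions rather than measurements, and the extra full-page
image for hint bits only happens when checksums or wal_log_hints are
enabled:

#include <stdio.h>

/*
 * Back-of-the-envelope tally of the writes involved in updating every
 * row of a table, following the sequence described above.  The constants
 * are illustrative assumptions, not measurements.
 */
int main(void)
{
    const double blocks      = 1000000; /* ~8 GB table at the default block size */
    const double block_bytes = 8192;    /* default PostgreSQL block size */

    double writes_per_block = 0;

    writes_per_block += 1;   /* dirty the old block (mark old tuples dead) */
    writes_per_block += 1;   /* write the new row versions into other blocks */
    writes_per_block += 2;   /* full-page images of both blocks in WAL */
    writes_per_block += 1;   /* later: re-dirty the page to set hint bits */
    writes_per_block += 1;   /* possible extra full-page image for the hint
                                bits (only with checksums/wal_log_hints) */
    writes_per_block += 2;   /* vacuum: remove dead tuples, then set
                                all-visible / freeze */

    double total_gb = blocks * block_bytes * writes_per_block
                      / (1024.0 * 1024.0 * 1024.0);

    printf("~%.0fx write amplification, roughly %.0f GB of physical writes\n",
           writes_per_block, total_gb);
    return 0;
}

In other words, an update that logically touches about 8 GB of data ends
up as something like 60 GB of physical writes before the table settles
down again.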
I am proposing a design that is optimized for commits; that is, if the
transaction commits, none of the pages it modified need to be dirtied
again afterwards at any point. I think that's an extremely important
property, and it's one that we are very far from having today. It
necessarily implies that you cannot store the old row versions in the
heap, because if you do, then you are going to have to dirty the pages
again to get rid of them (unless you prefer to just leak the space
forever).

Now, there is plenty of room for argument about whether the specific
design I proposed is going to be any good, and I think that would be
quite an interesting discussion to have. But if you say that rewriting
the same page 5 or 6 times after a modification -- to set hint bits (a
few at a time), HOT-prune, set all-visible, and freeze -- isn't really a
big deal, then you must live in a world where the write bandwidth of the
I/O channel is a lot less of a problem than it is in mine. We've been
around and around on all of that, and people have come up with various
ideas to improve the situation -- some of which have been implemented --
but many of those ideas involve unpleasant trade-offs, so the core
problems remain. If we don't do something more fundamental, we're still
going to be up against the same basic issues ten years from now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company