On Wed, Jul 27, 2011 at 10:51 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> I wonder whether we could do something involving WAL properties --- the current tuple visibility logic was designed before WAL existed, so it's not exploiting that resource at all. I'm imagining that the kernel of a snapshot is just a WAL position, ie the end of WAL as of the time you take the snapshot (easy to get in O(1) time). Visibility tests then reduce to "did this transaction commit with a WAL record located before the specified position?". You'd need some index datastructure that made it reasonably cheap to find out the commit locations of recently committed transactions, where "recent" means "back to recentGlobalXmin". That seems possibly do-able, though I don't have a concrete design in mind.
>
> [discussion of why I don't think an LSN will work]
>
> But having said that an LSN can't work, I don't see why we can't just use a 64-bit counter. In fact, the predicate locking code already does something much like this, using an SLRU, for serializable transactions only. In more detail, what I'm imagining is an array with 4 billion entries, one per XID, probably broken up into files of say 16MB each with 2 million entries per file. Each entry is a 64-bit value. It is 0 if the XID has not yet started, is still running, or has aborted. Otherwise, it is the commit sequence number of the transaction.
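To make the quoted design concrete: the XID-to-CSN lookup would be roughly the sketch below. The segment size follows the 16MB / 2-million-entry figures above, but everything else (the names, the mmap-style segment access) is a placeholder rather than real code.

    /*
     * Sketch of the per-XID commit-sequence-number lookup described above.
     * 2 million 8-byte entries per segment = 16MB per file.  csn_map_segment()
     * is a hypothetical accessor returning the mapped segment for segno.
     */
    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef uint64_t CommitSeqNo;   /* 0 = not yet started, still running, or aborted */

    #define CSN_ENTRIES_PER_SEGMENT (2 * 1024 * 1024)

    extern CommitSeqNo *csn_map_segment(uint32_t segno);

    static CommitSeqNo
    CSNForXid(TransactionId xid)
    {
        uint32_t    segno = xid / CSN_ENTRIES_PER_SEGMENT;
        uint32_t    offset = xid % CSN_ENTRIES_PER_SEGMENT;

        return csn_map_segment(segno)[offset];
    }

    /*
     * A snapshot is then just the value of the commit sequence counter at the
     * time it was taken, and xid is treated as visible iff CSNForXid(xid) is
     * nonzero and does not exceed that value.
     */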
I've been giving this quite a bit more thought, and have decided to abandon the scheme described above, at least for now. It has the advantage of avoiding virtually all locking, but it's extremely inefficient in its use of memory in the presence of long-running transactions. For example, if there's an open transaction that's been sitting around for 10 million transactions or so and has an XID assigned, any new snapshot is going to need to probe into the big array for any XID in that range. At 8 bytes per entry, that means we're randomly accessing about 80MB of memory-mapped data. That seems problematic both in terms of blowing out the cache and (on small machines) possibly even blowing out RAM. Nor is that the worst-case scenario: a transaction could sit open for 100 million transactions.

Heikki has suggested a few times (and a few other people have since made somewhat similar suggestions in different words) keeping an up-to-date snapshot in shared memory, such that transactions that need a snapshot can simply copy it. I've since noted that in Hot Standby mode, that's more or less what the KnownAssignedXids stuff already does. I objected that, first, the overhead of updating the snapshot for every commit would be too great, and second, it didn't seem to do a whole lot to reduce the size of the critical section, and therefore probably wouldn't improve performance that much. But I'm coming around to the view that these might be solvable problems rather than reasons to give up on the idea altogether.

With respect to the first problem, what I'm imagining is that we not do a complete rewrite of the snapshot in shared memory on every commit. Instead, when a transaction ends, we'll decide whether to (a) write a new snapshot or (b) just record the XIDs that ended. If we do (b), then any backend that wants a snapshot will need to copy from shared memory both the most recently written snapshot and the XIDs that have subsequently ended; from there, it can figure out which XIDs are still running. Of course, if the list of recently-ended XIDs gets too long, taking a snapshot will start to get expensive, so we'll need to periodically do (a) instead. There are other ways that this could be done as well; for example, the KnownAssignedXids stuff just flags XIDs that should be ignored and then periodically compacts away the ignored entries.

I think the real trick is figuring out a design that can improve concurrency. If you keep a snapshot in shared memory and periodically overwrite it in place, I don't think you're going to gain much: everyone who wants a snapshot still needs a share lock and everyone who wants to commit still needs an exclusive lock, and while you might be able to make the critical section a bit shorter, it's still going to be hard to make big gains that way. What I'm thinking about instead is using a ring buffer with three pointers: a start pointer, a stop pointer, and a write pointer. When a transaction ends, we advance the write pointer, write the XIDs or a whole new snapshot into the buffer, and then advance the stop pointer. If we wrote a whole new snapshot, we also advance the start pointer to the beginning of the data we just wrote. Someone who wants to take a snapshot must read the data between the start and stop pointers, and must then check that the write pointer hasn't advanced so far in the meantime that the data they read might have been overwritten before they finished reading it.
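To make that a little more concrete, here's the rough shape of what I have in mind. All of the names, the fixed buffer size, the entry format, and the assumption that writers are serialized by some existing lock are just placeholders, and I'm waving my hands at memory barriers entirely:

    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 65536         /* bytes; must comfortably exceed one snapshot */

    typedef struct SnapshotRing
    {
        /* 64-bit positions that only advance; physical index = pos % RING_SIZE */
        volatile uint64_t start;    /* beginning of the newest complete snapshot */
        volatile uint64_t stop;     /* end of fully-written data */
        volatile uint64_t write;    /* end of claimed space (may be ahead of stop) */
        char        data[RING_SIZE];
    } SnapshotRing;

    /*
     * Writer side, called at transaction end while holding whatever lock
     * already serializes committers.  'buf' holds either the ended XIDs or a
     * whole new snapshot, in some format not specified here.
     */
    static void
    ring_publish(SnapshotRing *ring, const char *buf, int len, bool whole_snapshot)
    {
        uint64_t    pos = ring->write;
        int         i;

        ring->write = pos + len;    /* advance the write pointer first */
        for (i = 0; i < len; i++)
            ring->data[(pos + i) % RING_SIZE] = buf[i];
        ring->stop = pos + len;     /* then advance the stop pointer */
        if (whole_snapshot)
            ring->start = pos;      /* snapshot now begins at what we just wrote */
    }

    /*
     * Reader side, lock-free: copy everything between start and stop, then
     * check that the write pointer didn't lap us while we were copying; if it
     * did, what we copied may have been overwritten, so do the whole thing
     * over.  'dst' is assumed to be at least RING_SIZE bytes.
     */
    static int
    ring_read_snapshot(SnapshotRing *ring, char *dst, uint64_t *stop_out)
    {
        for (;;)
        {
            uint64_t    start = ring->start;
            uint64_t    stop = ring->stop;
            uint64_t    i;

            for (i = start; i < stop; i++)
                dst[i - start] = ring->data[i % RING_SIZE];

            if (ring->write <= start + RING_SIZE)
            {
                *stop_out = stop;   /* lets the caller notice later whether anything changed */
                return (int) (stop - start);
            }
        }
    }

The important property is that readers take no lock at all and writers never wait for readers; the reason for handing back the stop position will become clear below.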
Obviously, that's a little risky, since we'll have to do the whole thing over if a wraparound occurs, but if the ring buffer is large enough that shouldn't happen very often. And a typical snapshot is pretty small unless massive numbers of subxids are in use, so it seems like it might not be too bad. Of course, it's pretty hard to know for sure without coding it up and testing it.

There are a couple of reasons to think that this design might be better than what we're doing now. Writers (people trying to commit) can overlap with readers (people trying to take a snapshot), so long as any reader is guaranteed to see the most recently completed write (and no portion of a write that is still in progress). Also, this approach admits some optimizations that are hard to manage with our current scheme. For example, you can cache the data that you most recently copied out of the array. If you go to take a new snapshot and find that the stop pointer hasn't moved (I'm imagining it as a 64-bit counter that never wraps), then you don't need to copy the data out of the array again; you can just reuse what you got last time. Maybe that doesn't matter, since the contents of the ProcArray change pretty quickly, but then again I've seen quite a few profiles where GetSnapshotData() was very high up in the list, so it seems at least possible that cache line contention is making access to shared memory slow enough that there is value in trying to optimize some of it away.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company