On Thu, Mar 6, 2014 at 7:07 PM, Greg Stark <st...@mit.edu> wrote:
> On Thu, Mar 6, 2014 at 11:14 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> I've been tempted to implement a new type of hash index that allows both WAL
>>> and high concurrency, simply by disallowing bucket splits. At the index
>>> creation time you use a storage parameter to specify the number of buckets,
>>> and that is that. If you mis-planned, build a new index with more buckets,
>>> possibly concurrently, and drop the too-small one.
>>
>> Yeah, we could certainly do something like that. It sort of sucks,
>> though. I mean, it's probably pretty easy to know that starting with
>> the default 2 buckets is not going to be enough; most people will at
>> least be smart enough to start with, say, 1024. But are you going to
>> know whether you need 32768 or 1048576 or 33554432? A lot of people
>> won't, and we have more than enough reasons for performance to degrade
>> over time as it is.
>
> The other thought I had was that you can do things lazily in vacuum.
> So when you probe you need to check multiple pages until vacuum comes
> along and rehashes everything.
I was thinking about this, too. Cleaning up the old bucket after the split is pretty similar to vacuuming. And lo and behold, vacuum also needs to lock the entire bucket. AFAICT, there are two reasons for this.

First, when we resume a suspended scan, we assume that the next match, if any, will occur on the same page, at an offset greater than the one where we found the previous match. The code copes with the possibility of concurrent insertions by refinding the last item we returned and scanning forward from there, but it assumes the pointer we're refinding can't have moved to a lower offset. I think this could be easily fixed - at essentially no cost - by changing hashgettuple so that, if the forward scan fails to refind the item, it tries scanning backward rather than just giving up.

Second, vacuum compacts each bucket that it modifies using _hash_squeezebucket, which scans inward from the two ends of the bucket chain, filling any free space on the earlier pages by pulling tuples from the end of the chain. This is a little thornier. We could change vacuum so that it only removes TIDs from the individual pages, without actually trying to free up pages, but that seems undesirable. However, I think we might be able to get by with making bucket compaction less aggressive, without eliminating it altogether.

Suppose that whenever we remove items from a page, we consider consolidating the page with its immediate predecessor and successor in the bucket chain. This means our utilization will be only a little over 50% in the worst case, where we have a full page, a page with one item, another full page, another page with one item, and so on. But that's not all bad, because compacting a bucket chain to the minimum possible size when it may well be about to suffer more inserts isn't necessarily a good thing anyway. More importantly, doing it this way means that we don't need to lock out scans from the entire bucket chain in order to compact: it's sufficient to make sure that nobody is in the middle of scanning the two pages we want to merge.

That turns out not to be possible right now. When a scan is suspended, we still hold a pin on the page we're scanning, but when _hash_readnext or _hash_readprev walks the bucket chain, it drops both lock and pin on the current buffer and only then pins and locks the new buffer. That, however, seems like it could easily be changed: drop the lock on the current buffer, acquire a pin on the new buffer, drop the pin on the current buffer, and then lock the new buffer. Then, if we want to merge two pages, it's enough to lock both of their buffers for cleanup.

To avoid any risk of deadlock, and also to avoid waiting for a long-running suspended scan to wake up and complete, we can do this conditionally: if we fail to get either cleanup lock, we just don't merge the pages; instead, we take only an exclusive lock and remove whatever needs removing, leaving the rest. Merging pages is only a performance optimization, so if it fails now and then, no real harm is done.

(A side benefit of this approach is that we could opportunistically attempt to compact any page containing dead items whenever we can manage a ConditionalLockBufferForCleanup() on it, sort of like what HOT pruning does for heap blocks. We could even, after pruning away dead items, attempt to merge the page with its siblings, so that even without vacuum the bucket chains can gradually shrink as index tuples are discovered to be pointing to dead heap tuples.)
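
Concretely, I'm imagining the refind logic growing a backward pass along these lines. This is just an untested sketch, and _hash_refind_offset is a made-up helper name, not something that exists today:

#include "postgres.h"

#include "access/itup.h"
#include "storage/bufpage.h"
#include "storage/itemptr.h"
#include "storage/off.h"

/*
 * Untested sketch.  Refind the previously-returned item on "page",
 * starting at the offset where we last saw it.  The forward pass is what
 * hashgettuple effectively does today; the backward pass is the proposed
 * addition, so that concurrent compaction is allowed to move the item to
 * a lower offset.  Returns InvalidOffsetNumber if the item is gone.
 */
static OffsetNumber
_hash_refind_offset(Page page, OffsetNumber oldoff, ItemPointer heaptid)
{
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    OffsetNumber off;

    /* existing behavior: look at oldoff and everything after it */
    for (off = oldoff; off <= maxoff; off = OffsetNumberNext(off))
    {
        IndexTuple  itup = (IndexTuple) PageGetItem(page,
                                                    PageGetItemId(page, off));

        if (ItemPointerEquals(heaptid, &itup->t_tid))
            return off;
    }

    /* proposed fallback: the item may have moved to a lower offset */
    for (off = Min(oldoff, maxoff); off >= FirstOffsetNumber;
         off = OffsetNumberPrev(off))
    {
        IndexTuple  itup = (IndexTuple) PageGetItem(page,
                                                    PageGetItemId(page, off));

        if (ItemPointerEquals(heaptid, &itup->t_tid))
            return off;
    }

    return InvalidOffsetNumber; /* caller errors out, as it does today */
}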
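
The _hash_readnext/_hash_readprev change would then amount to reordering the buffer hand-off, roughly as below. Again this is only a sketch: I've left out the page sanity checks that _hash_getbuf normally does, the CHECK_FOR_INTERRUPTS() call, and the _hash_readprev case, which is symmetric.

#include "postgres.h"

#include "access/hash.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Untested sketch of a reordered _hash_readnext: keep a pin on either the
 * old or the new page at all times, so that holding a cleanup lock on a
 * page is enough to know that no suspended scan is positioned on it.  The
 * existing code drops both lock and pin before pinning the next page.
 */
static void
_hash_readnext_pinned(Relation rel, Buffer *bufp, Page *pagep,
                      HashPageOpaque *opaquep)
{
    BlockNumber blkno = (*opaquep)->hasho_nextblkno;
    Buffer      nextbuf;

    /* give up the content lock, but hang on to our pin for now */
    LockBuffer(*bufp, BUFFER_LOCK_UNLOCK);

    if (!BlockNumberIsValid(blkno))
    {
        /* end of the bucket chain: now it's safe to let go entirely */
        ReleaseBuffer(*bufp);
        *bufp = InvalidBuffer;
        return;
    }

    /* pin the next page before releasing the pin on the current one */
    nextbuf = ReadBuffer(rel, blkno);
    ReleaseBuffer(*bufp);
    LockBuffer(nextbuf, BUFFER_LOCK_SHARE);

    *bufp = nextbuf;
    *pagep = BufferGetPage(nextbuf);
    *opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
}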
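
And the conditional merge itself would be little more than a pair of ConditionalLockBufferForCleanup() calls. Another untested sketch, with _hash_merge_pages an invented name and all of the actual tuple-moving, chain-unlinking, and eventual WAL-logging hand-waved:

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Untested sketch.  Try to fold the contents of "nextbuf", the successor
 * page in the bucket chain, into "buf".  The caller holds pins on both
 * pages but no content locks.  We insist on a cleanup lock on each page,
 * taken conditionally, so a suspended scan pinned on either page makes
 * us skip the merge rather than wait.
 */
static bool
_hash_try_merge_with_next(Relation rel, Buffer buf, Buffer nextbuf)
{
    if (!ConditionalLockBufferForCleanup(buf))
        return false;

    if (!ConditionalLockBufferForCleanup(nextbuf))
    {
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        return false;
    }

    /*
     * Here we'd verify that everything on nextbuf fits on buf, move the
     * tuples over, unlink nextbuf from the bucket chain, and give it back
     * to the free-space management code -- all hand-waved here, and all
     * of it needing WAL once hash indexes are WAL-logged:
     *
     *      _hash_merge_pages(rel, buf, nextbuf);   -- invented helper
     */

    LockBuffer(nextbuf, BUFFER_LOCK_UNLOCK);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    return true;
}

The point of taking both locks conditionally is exactly the no-deadlock, no-waiting property described above: if we lose the race, the worst that happens is the chain stays one page longer than it needed to be until the next opportunity.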
With the above changes, vacuuming a bucket can happen without taking the heavyweight lock in exclusive mode, leaving bucket splitting as the only operation that requires it. And we could perhaps use basically the same algorithm to clean the tuples out of the old bucket after a split. The problem is that, when we're vacuuming, we know that no scan currently in progress can still care about any of the tuples we're removing. (I suppose a SnapshotAny scan might care, but we don't support that for hash indexes, and I don't think we need to - at least not concurrently.) When we're cleaning up after a bucket split, however, we don't know whether there are any pre-split scans still in flight.

Given the locking changes described above, my notion was that we could solve that problem by starting a new scan of the bucket ourselves, locking each buffer for cleanup as we go. If, contrary to present practice, each scan always kept a pin somewhere in the chain, then by the time we reached the end of the bucket chain we'd know that any remaining scans had started after ours, and thus had the latest information. There are at least two problems with this idea. First, it might make post-split cleanup take a really long time if somebody has a cursor paused in the middle of a bucket chain somewhere. Second, hash scans can be backed up, which makes this either an undetected-deadlock hazard or a returning-wrong-answers hazard, depending on exactly how you implement it.

So I'm still a bit stumped about how to implement the "wait out scans" portion of the design I sketched out previously. In essence, that's what the heavyweight locking implements today, for both vacuum and bucket splits. Shuffling the work around between those two code paths in some way may be a good idea, but the core issue is that if we're not going to use heavyweight locks to wait out scans, we need some other mechanism.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company