On Wed, Sep 7, 2016 at 11:49 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>
>> I have fixed all other issues you have raised.  Updated patch is
>> attached with this mail.
>
> I am finding the comments (particularly README) quite hard to follow.  There
> are many references to an "overflow bucket", or similar phrases.  I think
> these should be "overflow pages".  A bucket is a conceptual thing consisting
> of a primary page for that bucket and zero or more overflow pages for the
> same bucket.  There are no overflow buckets, unless you are referring to the
> new bucket to which things are being moved.
>
Hmm.  I think "page" or "block" is a database-system concept, whereas
"bucket" is a general concept from hashing; the distinction there is
between primary buckets and overflow buckets.  I have checked how they
are referred to on the Wikipedia page for linear hashing [1] (search
for "overflow" there).  That said, I agree we shouldn't be inconsistent
in using them, and I will unify the usage once we settle on the better
way to refer to overflow space, based on what you and others think.

> Was maintaining on-disk compatibility a major concern for this patch? Would
> you do things differently if that were not a concern?

I would not have done much differently from what it is now.  One thing
I did consider during development, however, was to change the hash
index tuple structure as below to mark index tuples as moved-by-split:

typedef struct
{
    IndexTuple  entry;          /* tuple to insert */
    bool        moved_by_split;
} HashEntryData;

The other alternative was to use the (unused) bit in
IndexTupleData->t_info, and that is the approach I have chosen.  One
could certainly argue that this is the last available bit in IndexTuple
and that spending it on hash indexes might or might not be the best use
of it.  However, I also don't think it is advisable to break on-disk
compatibility when an existing bit can serve the purpose.  In any case,
the same question will arise whenever anyone wants to use that bit for
some other purpose.
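For concreteness, here is a minimal sketch of how that spare bit can be
used; the mask name and the helper functions are mine for illustration
and are not necessarily what the patch calls them:

    #include "postgres.h"
    #include "access/itup.h"    /* IndexTuple; t_info layout */

    /*
     * In IndexTupleData, t_info packs the tuple size into bits 0-12
     * (INDEX_SIZE_MASK) and flags into the two high bits
     * (INDEX_VAR_MASK, INDEX_NULL_MASK), which leaves bit 0x2000 as
     * the only unused one.  Claiming it does not change the on-disk
     * tuple layout.  The name below is hypothetical.
     */
    #define HASH_MOVED_BY_SPLIT_MASK    0x2000

    /* Mark a tuple as moved to the new bucket by a split. */
    static inline void
    hash_mark_moved_by_split(IndexTuple itup)
    {
        itup->t_info |= HASH_MOVED_BY_SPLIT_MASK;
    }

    /* Did a (possibly incomplete) split move this tuple? */
    static inline bool
    hash_tuple_moved_by_split(IndexTuple itup)
    {
        return (itup->t_info & HASH_MOVED_BY_SPLIT_MASK) != 0;
    }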
> In particular, I am thinking about the need for every insert to
> exclusive-content-lock the meta page to increment the index-wide tuple
> count.

This is not something this patch has changed.  The main purpose of this
patch is to replace the heavy-weight locking with light-weight locking
and to provide a way to handle incomplete splits, both of which are
required before we can sensibly write WAL for hash indexes.  Having
said that, I agree that the insertion logic can be improved so that we
don't need to write-lock the meta page on each insert.  During this
work I have noticed some other potential improvements in hash indexes
as well: caching the meta page; reducing lock/unlock calls when
retrieving tuples from a page by making hash index scans work a page at
a time, as btree scans do; the kill_prior_tuple mechanism, which is
currently quite naive; and, biggest of all, the create-index logic,
where we insert tuple-by-tuple whereas btree operates at the page level
and also bypasses shared buffers.  One of these improvements (caching
the meta page) is already being worked on by a colleague of mine, and
the patch [2] for it is in the CF.  The point I want to highlight is
that, apart from what this patch does, there are a number of other
areas of hash indexes that need improvement, and I think those are
better done as separate enhancements rather than in a single patch.

> I think that this is going to be a huge bottleneck on update
> intensive workloads (which I don't believe have been performance tested as
> of yet).

I have done some performance testing with this patch and found a
significant improvement over what we have now in hash indexes, even for
a read-write workload.  A better comparison would be against btree, but
in any case, even if this proves to be a bottleneck, we should try to
improve it in a separate patch rather than as part of this one.

> I was wondering if we might not want to change that so that each
> bucket keeps a local count, and sweeps that up to the meta page only when it
> exceeds a threshold.  But this would require the bucket page to have an area
> to hold such a count.  Another idea would be to keep not a count of tuples,
> but of buckets with at least one overflow page, and split when there are too
> many of those.

Both of these ideas would change the point (the tuple count) at which
we currently split, which could affect search speed and space usage.
Yet another alternative would be to widen hashm_ntuples to 64 bits and
operate on it with 64-bit atomics, or perhaps to protect it with a
separate spinlock.  However, whatever we decide to do with it, I think
it is a matter for a separate patch.

Thanks for looking into the patch.

[1] - https://en.wikipedia.org/wiki/Linear_hashing
[2] - https://commitfest.postgresql.org/10/715/

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers