> > Actually, I'd like to improve all partitioned hashes instead of
> > improve only one case.
>
> Yeah. I'm not sure that should be an LWLock rather than a spinlock,
> but we can benchmark it both ways.
I would like to share some preliminary results. I tested four
implementations:
- no locks and no element stealing from other partitions;
- a single LWLock per partitioned table;
- a single spinlock per partitioned table;
- an array of NUM_LOCK_PARTITIONS spinlocks per partitioned table (a
condensed view of the shared header in each variant is sketched below).
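To make the variants easier to compare, here is a condensed, compilable
sketch of how the shared hash header changes in each case. HashHdrSketch
and the opaque typedef are my own illustration --- the real definitions
live in dynahash.c:

#define NUM_LOCK_PARTITIONS (1 << 4)	/* as in the patched lwlock.h */

typedef struct HASHELEMENT HASHELEMENT; /* opaque here */

typedef struct HashHdrSketch
{
	/* common to all four variants: counter and freelist split per partition */
	long		nentries[NUM_LOCK_PARTITIONS];
	HASHELEMENT *freeList[NUM_LOCK_PARTITIONS];

	/*
	 * plus, depending on the variant:
	 *   no locks       : no lock field at all
	 *   lwlock         : LWLock  *lock;
	 *   spinlock       : slock_t  mutex;
	 *   spinlock array : slock_t  mutex[NUM_LOCK_PARTITIONS];
	 */
} HashHdrSketch;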
Interestingly "Shared Buffer Lookup Table" (see buf_table.c) has 128
partitions. The constant NUM_BUFFER_PARTITIONS was increased from 16 to
128 in commit 3acc10c9:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3acc10c997f916f6a741d0b4876126b7b08e3892;hp=952872698d9443fdf9b808a1376017f00c91065a
Naturally, after splitting a freelist into NUM_LOCK_PARTITIONS
partitions (and assuming that all necessary locking/unlocking is done
on the calling side), a table can't have more than NUM_LOCK_PARTITIONS
partitions, because that would cause race conditions. For this reason I
had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare
the behaviour of PostgreSQL for different values of
NUM_LOCK_PARTITIONS.
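A minimal compilable sketch (my own illustration, not part of the
patches) of the index mapping involved: PARTITION_IDX() in the patches
reuses LockHashPartition() from lock.h, so the freelist index and the
table partition lock are derived from the same hash value. The constants
below have the values from the patched lwlock.h:

#include <stdio.h>
#include <stdint.h>

#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
/* as defined in src/include/storage/lock.h */
#define LockHashPartition(hashcode) ((hashcode) % NUM_LOCK_PARTITIONS)

int
main(void)
{
	/*
	 * If a table had more than NUM_LOCK_PARTITIONS partitions, two
	 * backends holding different partition locks could still map to the
	 * same freelist slot and race on it --- hence the upper bound.
	 */
	uint32_t	hashvalue = 0xDEADBEEF;

	printf("hash %08x -> freelist/partition %u\n",
		   (unsigned) hashvalue,
		   (unsigned) LockHashPartition(hashvalue));
	return 0;
}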
So here are the results:
Core i7, pgbench -j 8 -c 8 -T 30 pgbench
(3 tests, TPS excluding connections establishing)
NUM_LOCK_ | master | no locks | lwlock | spinlock | spinlock
PARTITIONS | (99ccb2) | | | | array
-----------|----------|----------|----------|----------|----------
| 295.4 | 297.4 | 299.4 | 285.6 | 302.7
(1 << 4) | 286.1 | 300.5 | 283.4 | 300.9 | 300.4
| 300.0 | 300.0 | 302.1 | 300.7 | 300.3
-----------|----------|----------|----------|----------|----------
| | 296.7 | 299.9 | 298.8 | 298.3
(1 << 5) | ---- | 301.9 | 302.2 | 305.7 | 306.3
| | 287.7 | 301.0 | 303.0 | 304.5
-----------|----------|----------|----------|----------|----------
| | 296.4 | 300.5 | 302.9 | 304.6
(1 << 6) | ---- | 301.7 | 305.6 | 306.4 | 302.3
| | 299.6 | 304.5 | 306.6 | 300.4
-----------|----------|----------|----------|----------|----------
| | 295.9 | 298.7 | 295.3 | 305.0
(1 << 7) | ---- | 299.5 | 300.5 | 299.0 | 310.2
| | 287.8 | 285.9 | 300.2 | 302.2
Core i7, pgbench -j 8 -c 8 -f big_table.sql -T 30 my_database
(3 tests, TPS excluding connections establishing)
NUM_LOCK_ | master | no locks | lwlock | spinlock | spinlock
PARTITIONS | (99ccb2) | | | | array
-----------|----------|----------|----------|----------|----------
| 505.1 | 521.3 | 511.1 | 524.4 | 501.6
(1 << 4) | 452.4 | 467.4 | 509.2 | 472.3 | 453.7
| 435.2 | 462.4 | 445.8 | 467.9 | 467.0
-----------|----------|----------|----------|----------|----------
| | 514.8 | 476.3 | 507.9 | 510.6
(1 << 5) | ---- | 457.5 | 491.2 | 464.6 | 431.7
| | 442.2 | 457.0 | 495.5 | 448.2
-----------|----------|----------|----------|----------|----------
| | 516.4 | 502.5 | 468.0 | 521.3
(1 << 6) | ---- | 463.6 | 438.7 | 488.8 | 455.4
| | 434.2 | 468.1 | 484.7 | 433.5
-----------|----------|----------|----------|----------|----------
| | 513.6 | 459.4 | 519.6 | 510.3
(1 << 7) | ---- | 470.1 | 454.6 | 445.5 | 415.9
| | 459.4 | 489.7 | 457.1 | 452.8
60-core server, pgbench -j 64 -c 64 -T 30 pgbench
(3 tests, TPS excluding connections establishing)
NUM_LOCK_ | master | no locks | lwlock | spinlock | spinlock
PARTITIONS | (99ccb2) | | | | array
-----------|----------|----------|----------|----------|----------
| 3156.2 | 3157.9 | 3542.0 | 3444.3 | 3472.4
(1 << 4) | 3268.5 | 3444.7 | 3485.7 | 3486.0 | 3500.5
| 3251.2 | 3482.3 | 3398.7 | 3587.1 | 3557.7
-----------|----------|----------|----------|----------|----------
| | 3352.7 | 3556.0 | 3543.3 | 3526.8
(1 << 5) | ---- | 3465.0 | 3475.2 | 3486.9 | 3528.4
| | 3410.0 | 3482.0 | 3493.7 | 3444.9
-----------|----------|----------|----------|----------|----------
| | 3437.8 | 3413.1 | 3445.8 | 3481.6
(1 << 6) | ---- | 3470.1 | 3478.4 | 3538.5 | 3579.9
| | 3450.8 | 3431.1 | 3509.0 | 3512.5
-----------|----------|----------|----------|----------|----------
| | 3425.4 | 3534.6 | 3414.7 | 3517.1
(1 << 7) | ---- | 3436.5 | 3430.0 | 3428.0 | 3536.4
| | 3455.6 | 3479.7 | 3573.4 | 3543.0
60-core server, pgbench -j 64 -c 64 -f big_table.sql -T 30 my_database
(3 tests, TPS excluding connections establishing)
NUM_LOCK_ | master | no locks | lwlock | spinlock | spinlock
PARTITIONS | (99ccb2) | | | | array
-----------|----------|----------|----------|----------|----------
| 661.1 | 4639.6 | 1435.2 | 445.9 | 1589.6
(1 << 4) | 642.9 | 4566.7 | 1410.3 | 457.1 | 1601.7
| 643.9 | 4621.8 | 1404.8 | 489.0 | 1592.6
-----------|----------|----------|----------|----------|----------
| | 4721.9 | 1543.1 | 499.1 | 1596.9
(1 << 5) | ---- | 4506.8 | 1513.0 | 528.3 | 1594.7
| | 4744.7 | 1540.3 | 524.0 | 1593.0
-----------|----------|----------|----------|----------|----------
| | 4649.1 | 1564.5 | 475.9 | 1580.1
(1 << 6) | ---- | 4671.0 | 1560.5 | 485.6 | 1589.1
| | 4751.0 | 1557.4 | 505.1 | 1580.3
-----------|----------|----------|----------|----------|----------
| | 4657.7 | 1551.8 | 534.7 | 1585.1
(1 << 7) | ---- | 4616.8 | 1546.8 | 495.8 | 1623.4
| | 4779.2 | 1538.5 | 537.4 | 1588.5
All four implementations (W.I.P. quality --- dirty code, no comments,
etc.) are attached to this message. The schema of my_database and the
big_table.sql file are attached to the first message of this thread.
The large spread of TPS on the Core i7 is due to the fact that it's
actually my laptop, with other applications running besides PostgreSQL.
Still, we can see that all solutions perform equally well on this CPU
and there is no performance degradation.
Now regarding the 60-core server:
- one spinlock per hash table doesn't scale --- I was personally
expecting this;
- LWLocks and the array of spinlocks do scale on NUMA, up to a certain
point;
- the best results are shown by "no locks".
I believe that "no locks" implementation is the best one since it is at
least 3 times faster on NUMA then any other implementation. Also it is
simpler and doesn't have stealing-from-other-freelists logic that
executes rarely and therefore is a likely source of bugs. Regarding ~16
elements of freelists which in some corner cases could but wouldn't be
used --- as I mentioned before I believe its not such a big problem.
Also its a small price to pay for 3 times more TPS.
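The "no locks" fast path is worth spelling out. Here it is, condensed
from the second attached patch (HTAB, HASHHDR, HASHBUCKET and
element_alloc() are as in dynahash.c; the comments are mine):

static HASHBUCKET
get_hash_entry(HTAB *hashp, int partition_idx)
{
	HASHHDR    *hctl = hashp->hctl;
	HASHBUCKET	newElement;

	for (;;)
	{
		/*
		 * No spinlock or LWLock here: the caller already holds the lock
		 * for this table partition, and each partition has a private
		 * freelist, so no other backend can touch
		 * freeList[partition_idx].
		 */
		newElement = hctl->freeList[partition_idx];
		if (newElement != NULL)
			break;

		/* freelist is empty: allocate another chunk for this partition */
		if (!element_alloc(hashp, hctl->nelem_alloc, partition_idx))
			return NULL;		/* out of memory */
	}

	/* unlink the entry and bump this partition's entry count */
	hctl->freeList[partition_idx] = newElement->link;
	hctl->nentries[partition_idx]++;
	return newElement;
}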
Regarding NUM_LOCK_PARTITIONS (and NUM_BUFFER_PARTITIONS) I have some
doubts. Robert surely had a good reason for committing 3acc10c9;
unfortunately, I'm not familiar with the story behind this commit. What
do you think?
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 78f15f0..91fcc05 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -265,7 +265,7 @@ InitShmemIndex(void)
*/
HTAB *
ShmemInitHash(const char *name, /* table string name for shmem index */
- long init_size, /* initial table size */
+ long init_size, /* initial table size */ // AALEKSEEV: is ignored, refactor!
long max_size, /* max size of the table */
HASHCTL *infoP, /* info about key and bucket size */
int hash_flags) /* info about infoP */
@@ -299,7 +299,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
/* Pass location of hashtable header to hash_create */
infoP->hctl = (HASHHDR *) location;
- return hash_create(name, init_size, infoP, hash_flags);
+ return hash_create(name, max_size, infoP, hash_flags);
}
/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d43fb61..d43c7ca 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -359,6 +359,15 @@ NumLWLocks(void)
/* slot.c needs one for each slot */
numLocks += max_replication_slots;
+ /* buf_table.c needs one for partitioned hash table "Shared Buffer Lookup Table" */
+ numLocks += 1;
+
+ /* lock.c needs two for partitioned hash tables "LOCK hash" and "PROCLOCK hash" */
+ numLocks += 2;
+
+ /* predicate.c need two for partitioned hash tables "PREDICATELOCKTARGET hash" and "PREDICATELOCK hash" */
+ numLocks += 2;
+
/*
* Add any requested by loadable modules; for backwards-compatibility
* reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..1460345 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -87,6 +87,8 @@
#include "access/xact.h"
#include "storage/shmem.h"
#include "storage/spin.h"
+#include "storage/lock.h"
+#include "storage/lwlock.h"
#include "utils/dynahash.h"
#include "utils/memutils.h"
@@ -129,11 +131,16 @@ typedef HASHBUCKET *HASHSEGMENT;
struct HASHHDR
{
/* In a partitioned table, take this lock to touch nentries or freeList */
- slock_t mutex; /* unused if not partitioned table */
+ //slock_t mutex; /* unused if not partitioned table */
+
+ // AALEKSEEV: fix comments
+ LWLock* lock; // only for partitioned hash tables
/* These fields change during entry addition/deletion */
- long nentries; /* number of entries in hash table */
- HASHELEMENT *freeList; /* linked list of free elements */
+ /* number of entries in hash table */
+ long nentries[NUM_LOCK_PARTITIONS];
+ /* linked list of free elements */
+ HASHELEMENT *freeList[NUM_LOCK_PARTITIONS];
/* These fields can change, but not in a partitioned table */
/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +173,9 @@ struct HASHHDR
#define IS_PARTITIONED(hctl) ((hctl)->num_partitions != 0)
+// AALEKSEEV: add comment
+#define PARTITION_IDX(hctl, hashcode) (IS_PARTITIONED(hctl) ? LockHashPartition(hashcode) : 0)
+
/*
* Top control structure for a hashtable --- in a shared table, each backend
* has its own copy (OK since no fields change at runtime)
@@ -219,10 +229,10 @@ static long hash_accesses,
*/
static void *DynaHashAlloc(Size size);
static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem);
+static bool element_alloc(HTAB *hashp, int nelem, int partition_idx);
static bool dir_realloc(HTAB *hashp);
static bool expand_table(HTAB *hashp);
-static HASHBUCKET get_hash_entry(HTAB *hashp);
+static HASHBUCKET get_hash_entry(HTAB *hashp, int partition_idx);
static void hdefault(HTAB *hashp);
static int choose_nelem_alloc(Size entrysize);
static bool init_htab(HTAB *hashp, long nelem);
@@ -282,6 +292,7 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
{
HTAB *hashp;
HASHHDR *hctl;
+ int i, partitions_number, nelem_alloc;
/*
* For shared hash tables, we have a local hash header (HTAB struct) that
@@ -408,7 +419,7 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
if (!hashp->hctl)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ errmsg("out of memory (3)"))); // AALEKSEEV: fix string
}
hashp->frozen = false;
@@ -482,10 +493,20 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
if ((flags & HASH_SHARED_MEM) ||
nelem < hctl->nelem_alloc)
{
- if (!element_alloc(hashp, (int) nelem))
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ if(IS_PARTITIONED(hashp->hctl))
+ partitions_number = NUM_LOCK_PARTITIONS;
+ else
+ partitions_number = 1;
+
+ nelem_alloc = ((int) nelem) / partitions_number;
+ if(nelem_alloc == 0)
+ nelem_alloc = 1;
+
+ for(i = 0; i < partitions_number; i++)
+ if (!element_alloc(hashp, nelem_alloc, i))
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory (1)"))); // AALEKSEEV: fix string
}
if (flags & HASH_FIXED_SIZE)
@@ -503,8 +524,11 @@ hdefault(HTAB *hashp)
MemSet(hctl, 0, sizeof(HASHHDR));
- hctl->nentries = 0;
- hctl->freeList = NULL;
+ // AALEKSEEV: redundant!
+ // hctl->nentries = 0;
+ // hctl->freeList = NULL;
+
+ hctl->lock = NULL;
hctl->dsize = DEF_DIRSIZE;
hctl->nsegs = 0;
@@ -576,8 +600,12 @@ init_htab(HTAB *hashp, long nelem)
/*
* initialize mutex if it's a partitioned table
*/
- if (IS_PARTITIONED(hctl))
- SpinLockInit(&hctl->mutex);
+ // AALEKSEEV: remove
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockInit(&hctl->mutex);
+
+ if(IS_PARTITIONED(hctl))
+ hctl->lock = LWLockAssign();
/*
* Divide number of elements by the fill factor to determine a desired
@@ -648,7 +676,8 @@ init_htab(HTAB *hashp, long nelem)
"HIGH MASK ", hctl->high_mask,
"LOW MASK ", hctl->low_mask,
"NSEGS ", hctl->nsegs,
- "NENTRIES ", hctl->nentries);
+ // AALEKSEEV: fix this
+ "NENTRIES ", hctl->nentries[0]);
#endif
return true;
}
@@ -769,7 +798,8 @@ hash_stats(const char *where, HTAB *hashp)
where, hashp->hctl->accesses, hashp->hctl->collisions);
fprintf(stderr, "hash_stats: entries %ld keysize %ld maxp %u segmentcount %ld\n",
- hashp->hctl->nentries, (long) hashp->hctl->keysize,
+ // AALEKSEEV: fix this
+ hashp->hctl->nentries[0], (long) hashp->hctl->keysize,
hashp->hctl->max_bucket, hashp->hctl->nsegs);
fprintf(stderr, "%s: total accesses %ld total collisions %ld\n",
where, hash_accesses, hash_collisions);
@@ -863,6 +893,7 @@ hash_search_with_hash_value(HTAB *hashp,
HASHBUCKET currBucket;
HASHBUCKET *prevBucketPtr;
HashCompareFunc match;
+ int partition_idx = PARTITION_IDX(hctl, hashvalue);
#if HASH_STATISTICS
hash_accesses++;
@@ -885,7 +916,7 @@ hash_search_with_hash_value(HTAB *hashp,
* order of these tests is to try to check cheaper conditions first.
*/
if (!IS_PARTITIONED(hctl) && !hashp->frozen &&
- hctl->nentries / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
+ hctl->nentries[0] / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
!has_seq_scans(hashp))
(void) expand_table(hashp);
}
@@ -942,21 +973,26 @@ hash_search_with_hash_value(HTAB *hashp,
if (currBucket != NULL)
{
/* if partitioned, must lock to touch nentries and freeList */
+ // AALEKSEEV: remove this
if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ LWLockAcquire(hctl->lock, LW_SHARED);
+ // SpinLockAcquire(&hctl->mutex);
- Assert(hctl->nentries > 0);
- hctl->nentries--;
+ Assert(hctl->nentries[partition_idx] > 0);
+ hctl->nentries[partition_idx]--;
/* remove record from hash bucket's chain. */
*prevBucketPtr = currBucket->link;
/* add the record to the freelist for this table. */
- currBucket->link = hctl->freeList;
- hctl->freeList = currBucket;
+ currBucket->link = hctl->freeList[partition_idx];
+ hctl->freeList[partition_idx] = currBucket;
+ // AALEKSEEV: remove this
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockRelease(&hctl->mutex);
if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ LWLockRelease(hctl->lock);
/*
* better hope the caller is synchronizing access to this
@@ -982,7 +1018,7 @@ hash_search_with_hash_value(HTAB *hashp,
elog(ERROR, "cannot insert into frozen hashtable \"%s\"",
hashp->tabname);
- currBucket = get_hash_entry(hashp);
+ currBucket = get_hash_entry(hashp, partition_idx);
if (currBucket == NULL)
{
/* out of memory */
@@ -996,7 +1032,7 @@ hash_search_with_hash_value(HTAB *hashp,
else
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ errmsg("out of memory (2)"))); // AALEKSEEV: fix string
}
/* link into hashbucket chain */
@@ -1175,41 +1211,63 @@ hash_update_hash_key(HTAB *hashp,
* create a new entry if possible
*/
static HASHBUCKET
-get_hash_entry(HTAB *hashp)
+get_hash_entry(HTAB *hashp, int partition_idx)
{
HASHHDR *hctl = hashp->hctl;
HASHBUCKET newElement;
+ int steal_from_idx;
for (;;)
{
/* if partitioned, must lock to touch nentries and freeList */
if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ LWLockAcquire(hctl->lock, LW_SHARED);
/* try to get an entry from the freelist */
- newElement = hctl->freeList;
+ newElement = hctl->freeList[partition_idx];
+
if (newElement != NULL)
- break;
+ {
+ /* remove entry from freelist, bump nentries */
+ hctl->freeList[partition_idx] = newElement->link;
+ hctl->nentries[partition_idx]++;
+ if (IS_PARTITIONED(hctl))
+ LWLockRelease(hctl->lock);
+
+ return newElement;
+ }
- /* no free elements. allocate another chunk of buckets */
if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ LWLockRelease(hctl->lock);
- if (!element_alloc(hashp, hctl->nelem_alloc))
+ /* no free elements. allocate another chunk of buckets */
+ if (!element_alloc(hashp, hctl->nelem_alloc, partition_idx))
{
- /* out of memory */
- return NULL;
- }
- }
+ if (!IS_PARTITIONED(hctl))
+ return NULL; /* out of memory */
- /* remove entry from freelist, bump nentries */
- hctl->freeList = newElement->link;
- hctl->nentries++;
-
- if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ /* try to "steal" element from another partition */
+ LWLockAcquire(hctl->lock, LW_EXCLUSIVE);
+ steal_from_idx = partition_idx;
+ for(;;)
+ {
+ steal_from_idx = (steal_from_idx + 1) % NUM_LOCK_PARTITIONS;
+ if(steal_from_idx == partition_idx)
+ break;
+
+ newElement = hctl->freeList[steal_from_idx];
+ if(newElement != NULL)
+ {
+ hctl->freeList[steal_from_idx] = newElement->link;
+ hctl->nentries[partition_idx]++;
+ break;
+ }
+ }
- return newElement;
+ LWLockRelease(hctl->lock);
+ return newElement;
+ }
+ }
}
/*
@@ -1218,11 +1276,21 @@ get_hash_entry(HTAB *hashp)
long
hash_get_num_entries(HTAB *hashp)
{
+ int i;
+ long sum = hashp->hctl->nentries[0];
+
/*
* We currently don't bother with the mutex; it's only sensible to call
* this function if you've got lock on all partitions of the table.
*/
- return hashp->hctl->nentries;
+
+ if(!IS_PARTITIONED(hashp->hctl))
+ return sum;
+
+ for(i = 1; i < NUM_LOCK_PARTITIONS; i++)
+ sum += hashp->hctl->nentries[i];
+
+ return sum;
}
/*
@@ -1530,7 +1598,7 @@ seg_alloc(HTAB *hashp)
* allocate some new elements and link them into the free list
*/
static bool
-element_alloc(HTAB *hashp, int nelem)
+element_alloc(HTAB *hashp, int nelem, int partition_idx)
{
HASHHDR *hctl = hashp->hctl;
Size elementSize;
@@ -1562,15 +1630,21 @@ element_alloc(HTAB *hashp, int nelem)
}
/* if partitioned, must lock to touch freeList */
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockAcquire(&hctl->mutex);
+
if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ LWLockAcquire(hctl->lock, LW_SHARED);
/* freelist could be nonempty if two backends did this concurrently */
- firstElement->link = hctl->freeList;
- hctl->freeList = prevElement;
+ firstElement->link = hctl->freeList[partition_idx];
+ hctl->freeList[partition_idx] = prevElement;
if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ LWLockRelease(hctl->lock);
+
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockRelease(&hctl->mutex);
return true;
}
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ff34529..767f20b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -128,13 +128,14 @@ extern char *MainLWLockNames[];
* having this file include lock.h or bufmgr.h would be backwards.
*/
-/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 128
-
/* Number of partitions the shared lock tables are divided into */
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
+ /* Number of partitions of the shared buffer mapping hashtable */
+ // AALEKSEEV: refactor
+#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
+
/* Number of partitions the shared predicate lock tables are divided into */
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 78f15f0..91fcc05 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -265,7 +265,7 @@ InitShmemIndex(void)
*/
HTAB *
ShmemInitHash(const char *name, /* table string name for shmem index */
- long init_size, /* initial table size */
+ long init_size, /* initial table size */ // AALEKSEEV: is ignored, refactor!
long max_size, /* max size of the table */
HASHCTL *infoP, /* info about key and bucket size */
int hash_flags) /* info about infoP */
@@ -299,7 +299,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
/* Pass location of hashtable header to hash_create */
infoP->hctl = (HASHHDR *) location;
- return hash_create(name, init_size, infoP, hash_flags);
+ return hash_create(name, max_size, infoP, hash_flags);
}
/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..2de9ec4 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -87,6 +87,8 @@
#include "access/xact.h"
#include "storage/shmem.h"
#include "storage/spin.h"
+#include "storage/lock.h"
+#include "storage/lwlock.h"
#include "utils/dynahash.h"
#include "utils/memutils.h"
@@ -129,11 +131,15 @@ typedef HASHBUCKET *HASHSEGMENT;
struct HASHHDR
{
/* In a partitioned table, take this lock to touch nentries or freeList */
- slock_t mutex; /* unused if not partitioned table */
+ //slock_t mutex; /* unused if not partitioned table */
+
+ // AALEKSEEV: fix comments
/* These fields change during entry addition/deletion */
- long nentries; /* number of entries in hash table */
- HASHELEMENT *freeList; /* linked list of free elements */
+ /* number of entries in hash table */
+ long nentries[NUM_LOCK_PARTITIONS];
+ /* linked list of free elements */
+ HASHELEMENT *freeList[NUM_LOCK_PARTITIONS];
/* These fields can change, but not in a partitioned table */
/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +172,9 @@ struct HASHHDR
#define IS_PARTITIONED(hctl) ((hctl)->num_partitions != 0)
+// AALEKSEEV: add comment
+#define PARTITION_IDX(hctl, hashcode) (IS_PARTITIONED(hctl) ? LockHashPartition(hashcode) : 0)
+
/*
* Top control structure for a hashtable --- in a shared table, each backend
* has its own copy (OK since no fields change at runtime)
@@ -219,10 +228,10 @@ static long hash_accesses,
*/
static void *DynaHashAlloc(Size size);
static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem);
+static bool element_alloc(HTAB *hashp, int nelem, int partition_idx);
static bool dir_realloc(HTAB *hashp);
static bool expand_table(HTAB *hashp);
-static HASHBUCKET get_hash_entry(HTAB *hashp);
+static HASHBUCKET get_hash_entry(HTAB *hashp, int partition_idx);
static void hdefault(HTAB *hashp);
static int choose_nelem_alloc(Size entrysize);
static bool init_htab(HTAB *hashp, long nelem);
@@ -282,6 +291,7 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
{
HTAB *hashp;
HASHHDR *hctl;
+ int i, partitions_number, nelem_alloc;
/*
* For shared hash tables, we have a local hash header (HTAB struct) that
@@ -408,7 +418,7 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
if (!hashp->hctl)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ errmsg("out of memory (3)"))); // AALEKSEEV: fix string
}
hashp->frozen = false;
@@ -482,10 +492,20 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
if ((flags & HASH_SHARED_MEM) ||
nelem < hctl->nelem_alloc)
{
- if (!element_alloc(hashp, (int) nelem))
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ if(IS_PARTITIONED(hashp->hctl))
+ partitions_number = NUM_LOCK_PARTITIONS;
+ else
+ partitions_number = 1;
+
+ nelem_alloc = ((int) nelem) / partitions_number;
+ if(nelem_alloc == 0)
+ nelem_alloc = 1;
+
+ for(i = 0; i < partitions_number; i++)
+ if (!element_alloc(hashp, nelem_alloc, i))
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory (1)"))); // AALEKSEEV: fix string
}
if (flags & HASH_FIXED_SIZE)
@@ -503,8 +523,9 @@ hdefault(HTAB *hashp)
MemSet(hctl, 0, sizeof(HASHHDR));
- hctl->nentries = 0;
- hctl->freeList = NULL;
+ // AALEKSEEV: redundant!
+ // hctl->nentries = 0;
+ // hctl->freeList = NULL;
hctl->dsize = DEF_DIRSIZE;
hctl->nsegs = 0;
@@ -576,8 +597,9 @@ init_htab(HTAB *hashp, long nelem)
/*
* initialize mutex if it's a partitioned table
*/
- if (IS_PARTITIONED(hctl))
- SpinLockInit(&hctl->mutex);
+ // AALEKSEEV: remove
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockInit(&hctl->mutex);
/*
* Divide number of elements by the fill factor to determine a desired
@@ -648,7 +670,8 @@ init_htab(HTAB *hashp, long nelem)
"HIGH MASK ", hctl->high_mask,
"LOW MASK ", hctl->low_mask,
"NSEGS ", hctl->nsegs,
- "NENTRIES ", hctl->nentries);
+ // AALEKSEEV: fix this
+ "NENTRIES ", hctl->nentries[0]);
#endif
return true;
}
@@ -769,7 +792,8 @@ hash_stats(const char *where, HTAB *hashp)
where, hashp->hctl->accesses, hashp->hctl->collisions);
fprintf(stderr, "hash_stats: entries %ld keysize %ld maxp %u segmentcount %ld\n",
- hashp->hctl->nentries, (long) hashp->hctl->keysize,
+ // AALEKSEEV: fix this
+ hashp->hctl->nentries[0], (long) hashp->hctl->keysize,
hashp->hctl->max_bucket, hashp->hctl->nsegs);
fprintf(stderr, "%s: total accesses %ld total collisions %ld\n",
where, hash_accesses, hash_collisions);
@@ -863,6 +887,7 @@ hash_search_with_hash_value(HTAB *hashp,
HASHBUCKET currBucket;
HASHBUCKET *prevBucketPtr;
HashCompareFunc match;
+ int partition_idx = PARTITION_IDX(hctl, hashvalue);
#if HASH_STATISTICS
hash_accesses++;
@@ -885,7 +910,7 @@ hash_search_with_hash_value(HTAB *hashp,
* order of these tests is to try to check cheaper conditions first.
*/
if (!IS_PARTITIONED(hctl) && !hashp->frozen &&
- hctl->nentries / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
+ hctl->nentries[0] / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
!has_seq_scans(hashp))
(void) expand_table(hashp);
}
@@ -942,21 +967,23 @@ hash_search_with_hash_value(HTAB *hashp,
if (currBucket != NULL)
{
/* if partitioned, must lock to touch nentries and freeList */
- if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ // AALEKSEEV: remove this
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockAcquire(&hctl->mutex);
- Assert(hctl->nentries > 0);
- hctl->nentries--;
+ Assert(hctl->nentries[partition_idx] > 0);
+ hctl->nentries[partition_idx]--;
/* remove record from hash bucket's chain. */
*prevBucketPtr = currBucket->link;
/* add the record to the freelist for this table. */
- currBucket->link = hctl->freeList;
- hctl->freeList = currBucket;
+ currBucket->link = hctl->freeList[partition_idx];
+ hctl->freeList[partition_idx] = currBucket;
- if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ // AALEKSEEV: remove this
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockRelease(&hctl->mutex);
/*
* better hope the caller is synchronizing access to this
@@ -982,7 +1009,7 @@ hash_search_with_hash_value(HTAB *hashp,
elog(ERROR, "cannot insert into frozen hashtable \"%s\"",
hashp->tabname);
- currBucket = get_hash_entry(hashp);
+ currBucket = get_hash_entry(hashp, partition_idx);
if (currBucket == NULL)
{
/* out of memory */
@@ -996,7 +1023,7 @@ hash_search_with_hash_value(HTAB *hashp,
else
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ errmsg("out of memory (2)"))); // AALEKSEEV: fix string
}
/* link into hashbucket chain */
@@ -1175,7 +1202,7 @@ hash_update_hash_key(HTAB *hashp,
* create a new entry if possible
*/
static HASHBUCKET
-get_hash_entry(HTAB *hashp)
+get_hash_entry(HTAB *hashp, int partition_idx)
{
HASHHDR *hctl = hashp->hctl;
HASHBUCKET newElement;
@@ -1183,19 +1210,19 @@ get_hash_entry(HTAB *hashp)
for (;;)
{
/* if partitioned, must lock to touch nentries and freeList */
- if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockAcquire(&hctl->mutex);
/* try to get an entry from the freelist */
- newElement = hctl->freeList;
+ newElement = hctl->freeList[partition_idx];
if (newElement != NULL)
break;
/* no free elements. allocate another chunk of buckets */
- if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockRelease(&hctl->mutex);
- if (!element_alloc(hashp, hctl->nelem_alloc))
+ if (!element_alloc(hashp, hctl->nelem_alloc, partition_idx))
{
/* out of memory */
return NULL;
@@ -1203,11 +1230,11 @@ get_hash_entry(HTAB *hashp)
}
/* remove entry from freelist, bump nentries */
- hctl->freeList = newElement->link;
- hctl->nentries++;
+ hctl->freeList[partition_idx] = newElement->link;
+ hctl->nentries[partition_idx]++;
- if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockRelease(&hctl->mutex);
return newElement;
}
@@ -1218,11 +1245,21 @@ get_hash_entry(HTAB *hashp)
long
hash_get_num_entries(HTAB *hashp)
{
+ int i;
+ long sum = hashp->hctl->nentries[0];
+
/*
* We currently don't bother with the mutex; it's only sensible to call
* this function if you've got lock on all partitions of the table.
*/
- return hashp->hctl->nentries;
+
+ if(!IS_PARTITIONED(hashp->hctl))
+ return sum;
+
+ for(i = 1; i < NUM_LOCK_PARTITIONS; i++)
+ sum += hashp->hctl->nentries[i];
+
+ return sum;
}
/*
@@ -1530,7 +1567,7 @@ seg_alloc(HTAB *hashp)
* allocate some new elements and link them into the free list
*/
static bool
-element_alloc(HTAB *hashp, int nelem)
+element_alloc(HTAB *hashp, int nelem, int partition_idx)
{
HASHHDR *hctl = hashp->hctl;
Size elementSize;
@@ -1562,15 +1599,15 @@ element_alloc(HTAB *hashp, int nelem)
}
/* if partitioned, must lock to touch freeList */
- if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockAcquire(&hctl->mutex);
/* freelist could be nonempty if two backends did this concurrently */
- firstElement->link = hctl->freeList;
- hctl->freeList = prevElement;
+ firstElement->link = hctl->freeList[partition_idx];
+ hctl->freeList[partition_idx] = prevElement;
- if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockRelease(&hctl->mutex);
return true;
}
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ff34529..051f032 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -128,13 +128,14 @@ extern char *MainLWLockNames[];
* having this file include lock.h or bufmgr.h would be backwards.
*/
-/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 128
-
/* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS 4
+#define LOG2_NUM_LOCK_PARTITIONS 4 // 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
+ /* Number of partitions of the shared buffer mapping hashtable */
+ // AALEKSEEV: refactor
+#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
+
/* Number of partitions the shared predicate lock tables are divided into */
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 78f15f0..91fcc05 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -265,7 +265,7 @@ InitShmemIndex(void)
*/
HTAB *
ShmemInitHash(const char *name, /* table string name for shmem index */
- long init_size, /* initial table size */
+ long init_size, /* initial table size */ // AALEKSEEV: is ignored, refactor!
long max_size, /* max size of the table */
HASHCTL *infoP, /* info about key and bucket size */
int hash_flags) /* info about infoP */
@@ -299,7 +299,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
/* Pass location of hashtable header to hash_create */
infoP->hctl = (HASHHDR *) location;
- return hash_create(name, init_size, infoP, hash_flags);
+ return hash_create(name, max_size, infoP, hash_flags);
}
/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d43fb61..9d66359 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -359,6 +359,16 @@ NumLWLocks(void)
/* slot.c needs one for each slot */
numLocks += max_replication_slots;
+ // AALEKSEEV: refactor!
+ /* buf_table.c needs one for partitioned hash table "Shared Buffer Lookup Table" */
+ // numLocks += 1;
+
+ /* lock.c needs two for partitioned hash tables "LOCK hash" and "PROCLOCK hash" */
+ // numLocks += 2;
+
+ /* predicate.c need two for partitioned hash tables "PREDICATELOCKTARGET hash" and "PREDICATELOCK hash" */
+ // numLocks += 2;
+
/*
* Add any requested by loadable modules; for backwards-compatibility
* reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..cccb453 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -87,6 +87,8 @@
#include "access/xact.h"
#include "storage/shmem.h"
#include "storage/spin.h"
+#include "storage/lock.h"
+#include "storage/lwlock.h"
#include "utils/dynahash.h"
#include "utils/memutils.h"
@@ -129,11 +131,17 @@ typedef HASHBUCKET *HASHSEGMENT;
struct HASHHDR
{
/* In a partitioned table, take this lock to touch nentries or freeList */
- slock_t mutex; /* unused if not partitioned table */
+ //slock_t mutex; /* unused if not partitioned table */
+
+ // AALEKSEEV: fix comments
+ //LWLock* lock; // only for partitioned hash tables
+ slock_t mutex;
/* These fields change during entry addition/deletion */
- long nentries; /* number of entries in hash table */
- HASHELEMENT *freeList; /* linked list of free elements */
+ /* number of entries in hash table */
+ long nentries[NUM_LOCK_PARTITIONS];
+ /* linked list of free elements */
+ HASHELEMENT *freeList[NUM_LOCK_PARTITIONS];
/* These fields can change, but not in a partitioned table */
/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +174,9 @@ struct HASHHDR
#define IS_PARTITIONED(hctl) ((hctl)->num_partitions != 0)
+// AALEKSEEV: add comment
+#define PARTITION_IDX(hctl, hashcode) (IS_PARTITIONED(hctl) ? LockHashPartition(hashcode) : 0)
+
/*
* Top control structure for a hashtable --- in a shared table, each backend
* has its own copy (OK since no fields change at runtime)
@@ -219,10 +230,10 @@ static long hash_accesses,
*/
static void *DynaHashAlloc(Size size);
static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem);
+static bool element_alloc(HTAB *hashp, int nelem, int partition_idx);
static bool dir_realloc(HTAB *hashp);
static bool expand_table(HTAB *hashp);
-static HASHBUCKET get_hash_entry(HTAB *hashp);
+static HASHBUCKET get_hash_entry(HTAB *hashp, int partition_idx);
static void hdefault(HTAB *hashp);
static int choose_nelem_alloc(Size entrysize);
static bool init_htab(HTAB *hashp, long nelem);
@@ -282,6 +293,7 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
{
HTAB *hashp;
HASHHDR *hctl;
+ int i, partitions_number, nelem_alloc;
/*
* For shared hash tables, we have a local hash header (HTAB struct) that
@@ -408,7 +420,7 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
if (!hashp->hctl)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ errmsg("out of memory (3)"))); // AALEKSEEV: fix string
}
hashp->frozen = false;
@@ -482,10 +494,20 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
if ((flags & HASH_SHARED_MEM) ||
nelem < hctl->nelem_alloc)
{
- if (!element_alloc(hashp, (int) nelem))
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ if(IS_PARTITIONED(hashp->hctl))
+ partitions_number = NUM_LOCK_PARTITIONS;
+ else
+ partitions_number = 1;
+
+ nelem_alloc = ((int) nelem) / partitions_number;
+ if(nelem_alloc == 0)
+ nelem_alloc = 1;
+
+ for(i = 0; i < partitions_number; i++)
+ if (!element_alloc(hashp, nelem_alloc, i))
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory (1)"))); // AALEKSEEV: fix string
}
if (flags & HASH_FIXED_SIZE)
@@ -503,8 +525,11 @@ hdefault(HTAB *hashp)
MemSet(hctl, 0, sizeof(HASHHDR));
- hctl->nentries = 0;
- hctl->freeList = NULL;
+ // AALEKSEEV: redundant!
+ // hctl->nentries = 0;
+ // hctl->freeList = NULL;
+
+ // hctl->lock = NULL;
hctl->dsize = DEF_DIRSIZE;
hctl->nsegs = 0;
@@ -576,8 +601,12 @@ init_htab(HTAB *hashp, long nelem)
/*
* initialize mutex if it's a partitioned table
*/
- if (IS_PARTITIONED(hctl))
- SpinLockInit(&hctl->mutex);
+ if (IS_PARTITIONED(hctl))
+ SpinLockInit(&hctl->mutex);
+
+// AALEKSEEV: remove
+ //if(IS_PARTITIONED(hctl))
+ //hctl->lock = LWLockAssign();
/*
* Divide number of elements by the fill factor to determine a desired
@@ -648,7 +677,8 @@ init_htab(HTAB *hashp, long nelem)
"HIGH MASK ", hctl->high_mask,
"LOW MASK ", hctl->low_mask,
"NSEGS ", hctl->nsegs,
- "NENTRIES ", hctl->nentries);
+ // AALEKSEEV: fix this
+ "NENTRIES ", hctl->nentries[0]);
#endif
return true;
}
@@ -769,7 +799,8 @@ hash_stats(const char *where, HTAB *hashp)
where, hashp->hctl->accesses, hashp->hctl->collisions);
fprintf(stderr, "hash_stats: entries %ld keysize %ld maxp %u segmentcount %ld\n",
- hashp->hctl->nentries, (long) hashp->hctl->keysize,
+ // AALEKSEEV: fix this
+ hashp->hctl->nentries[0], (long) hashp->hctl->keysize,
hashp->hctl->max_bucket, hashp->hctl->nsegs);
fprintf(stderr, "%s: total accesses %ld total collisions %ld\n",
where, hash_accesses, hash_collisions);
@@ -863,6 +894,7 @@ hash_search_with_hash_value(HTAB *hashp,
HASHBUCKET currBucket;
HASHBUCKET *prevBucketPtr;
HashCompareFunc match;
+ int partition_idx = PARTITION_IDX(hctl, hashvalue);
#if HASH_STATISTICS
hash_accesses++;
@@ -885,7 +917,7 @@ hash_search_with_hash_value(HTAB *hashp,
* order of these tests is to try to check cheaper conditions first.
*/
if (!IS_PARTITIONED(hctl) && !hashp->frozen &&
- hctl->nentries / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
+ hctl->nentries[0] / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
!has_seq_scans(hashp))
(void) expand_table(hashp);
}
@@ -942,20 +974,24 @@ hash_search_with_hash_value(HTAB *hashp,
if (currBucket != NULL)
{
/* if partitioned, must lock to touch nentries and freeList */
+ // AALEKSEEV: remove this
if (IS_PARTITIONED(hctl))
+ //LWLockAcquire(hctl->lock, LW_SHARED);
SpinLockAcquire(&hctl->mutex);
- Assert(hctl->nentries > 0);
- hctl->nentries--;
+ Assert(hctl->nentries[partition_idx] > 0);
+ hctl->nentries[partition_idx]--;
/* remove record from hash bucket's chain. */
*prevBucketPtr = currBucket->link;
/* add the record to the freelist for this table. */
- currBucket->link = hctl->freeList;
- hctl->freeList = currBucket;
+ currBucket->link = hctl->freeList[partition_idx];
+ hctl->freeList[partition_idx] = currBucket;
+ // AALEKSEEV: remove this
if (IS_PARTITIONED(hctl))
+ // LWLockRelease(hctl->lock);
SpinLockRelease(&hctl->mutex);
/*
@@ -982,7 +1018,7 @@ hash_search_with_hash_value(HTAB *hashp,
elog(ERROR, "cannot insert into frozen hashtable \"%s\"",
hashp->tabname);
- currBucket = get_hash_entry(hashp);
+ currBucket = get_hash_entry(hashp, partition_idx);
if (currBucket == NULL)
{
/* out of memory */
@@ -996,7 +1032,7 @@ hash_search_with_hash_value(HTAB *hashp,
else
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ errmsg("out of memory (2)"))); // AALEKSEEV: fix string
}
/* link into hashbucket chain */
@@ -1175,41 +1211,68 @@ hash_update_hash_key(HTAB *hashp,
* create a new entry if possible
*/
static HASHBUCKET
-get_hash_entry(HTAB *hashp)
+get_hash_entry(HTAB *hashp, int partition_idx)
{
HASHHDR *hctl = hashp->hctl;
HASHBUCKET newElement;
+ int steal_from_idx;
for (;;)
{
/* if partitioned, must lock to touch nentries and freeList */
if (IS_PARTITIONED(hctl))
+ // LWLockAcquire(hctl->lock, LW_SHARED);
SpinLockAcquire(&hctl->mutex);
/* try to get an entry from the freelist */
- newElement = hctl->freeList;
+ newElement = hctl->freeList[partition_idx];
+
if (newElement != NULL)
- break;
+ {
+ /* remove entry from freelist, bump nentries */
+ hctl->freeList[partition_idx] = newElement->link;
+ hctl->nentries[partition_idx]++;
+ if (IS_PARTITIONED(hctl))
+ // LWLockRelease(hctl->lock);
+ SpinLockRelease(&hctl->mutex);
+
+ return newElement;
+ }
- /* no free elements. allocate another chunk of buckets */
if (IS_PARTITIONED(hctl))
SpinLockRelease(&hctl->mutex);
+ // LWLockRelease(hctl->lock);
- if (!element_alloc(hashp, hctl->nelem_alloc))
+ /* no free elements. allocate another chunk of buckets */
+ if (!element_alloc(hashp, hctl->nelem_alloc, partition_idx))
{
- /* out of memory */
- return NULL;
- }
- }
+ if (!IS_PARTITIONED(hctl))
+ return NULL; /* out of memory */
- /* remove entry from freelist, bump nentries */
- hctl->freeList = newElement->link;
- hctl->nentries++;
-
- if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ /* try to "steal" element from another partition */
+ // LWLockAcquire(hctl->lock, LW_EXCLUSIVE);
+ SpinLockAcquire(&hctl->mutex);
+ steal_from_idx = partition_idx;
+ for(;;)
+ {
+ steal_from_idx = (steal_from_idx + 1) % NUM_LOCK_PARTITIONS;
+ if(steal_from_idx == partition_idx)
+ break;
+
+ newElement = hctl->freeList[steal_from_idx];
+ if(newElement != NULL)
+ {
+ hctl->freeList[steal_from_idx] = newElement->link;
+ hctl->nentries[partition_idx]++;
+ break;
+ }
+ }
- return newElement;
+ // LWLockRelease(hctl->lock);
+ SpinLockRelease(&hctl->mutex);
+ return newElement;
+ }
+ }
}
/*
@@ -1218,11 +1281,21 @@ get_hash_entry(HTAB *hashp)
long
hash_get_num_entries(HTAB *hashp)
{
+ int i;
+ long sum = hashp->hctl->nentries[0];
+
/*
* We currently don't bother with the mutex; it's only sensible to call
* this function if you've got lock on all partitions of the table.
*/
- return hashp->hctl->nentries;
+
+ if(!IS_PARTITIONED(hashp->hctl))
+ return sum;
+
+ for(i = 1; i < NUM_LOCK_PARTITIONS; i++)
+ sum += hashp->hctl->nentries[i];
+
+ return sum;
}
/*
@@ -1530,7 +1603,7 @@ seg_alloc(HTAB *hashp)
* allocate some new elements and link them into the free list
*/
static bool
-element_alloc(HTAB *hashp, int nelem)
+element_alloc(HTAB *hashp, int nelem, int partition_idx)
{
HASHHDR *hctl = hashp->hctl;
Size elementSize;
@@ -1562,16 +1635,24 @@ element_alloc(HTAB *hashp, int nelem)
}
/* if partitioned, must lock to touch freeList */
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockAcquire(&hctl->mutex);
+
if (IS_PARTITIONED(hctl))
+ // LWLockAcquire(hctl->lock, LW_SHARED);
SpinLockAcquire(&hctl->mutex);
/* freelist could be nonempty if two backends did this concurrently */
- firstElement->link = hctl->freeList;
- hctl->freeList = prevElement;
+ firstElement->link = hctl->freeList[partition_idx];
+ hctl->freeList[partition_idx] = prevElement;
if (IS_PARTITIONED(hctl))
+ // LWLockRelease(hctl->lock);
SpinLockRelease(&hctl->mutex);
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockRelease(&hctl->mutex);
+
return true;
}
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ff34529..767f20b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -128,13 +128,14 @@ extern char *MainLWLockNames[];
* having this file include lock.h or bufmgr.h would be backwards.
*/
-/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 128
-
/* Number of partitions the shared lock tables are divided into */
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
+ /* Number of partitions of the shared buffer mapping hashtable */
+ // AALEKSEEV: refactor
+#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
+
/* Number of partitions the shared predicate lock tables are divided into */
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 78f15f0..91fcc05 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -265,7 +265,7 @@ InitShmemIndex(void)
*/
HTAB *
ShmemInitHash(const char *name, /* table string name for shmem index */
- long init_size, /* initial table size */
+ long init_size, /* initial table size */ // AALEKSEEV: is ignored, refactor!
long max_size, /* max size of the table */
HASHCTL *infoP, /* info about key and bucket size */
int hash_flags) /* info about infoP */
@@ -299,7 +299,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
/* Pass location of hashtable header to hash_create */
infoP->hctl = (HASHHDR *) location;
- return hash_create(name, init_size, infoP, hash_flags);
+ return hash_create(name, max_size, infoP, hash_flags);
}
/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d43fb61..9d66359 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -359,6 +359,16 @@ NumLWLocks(void)
/* slot.c needs one for each slot */
numLocks += max_replication_slots;
+ // AALEKSEEV: refactor!
+ /* buf_table.c needs one for partitioned hash table "Shared Buffer Lookup Table" */
+ // numLocks += 1;
+
+ /* lock.c needs two for partitioned hash tables "LOCK hash" and "PROCLOCK hash" */
+ // numLocks += 2;
+
+ /* predicate.c need two for partitioned hash tables "PREDICATELOCKTARGET hash" and "PREDICATELOCK hash" */
+ // numLocks += 2;
+
/*
* Add any requested by loadable modules; for backwards-compatibility
* reasons, allocate at least NUM_USER_DEFINED_LWLOCKS of them even if
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index eacffc4..bc4fa4a 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -87,6 +87,8 @@
#include "access/xact.h"
#include "storage/shmem.h"
#include "storage/spin.h"
+#include "storage/lock.h"
+#include "storage/lwlock.h"
#include "utils/dynahash.h"
#include "utils/memutils.h"
@@ -129,11 +131,17 @@ typedef HASHBUCKET *HASHSEGMENT;
struct HASHHDR
{
/* In a partitioned table, take this lock to touch nentries or freeList */
- slock_t mutex; /* unused if not partitioned table */
+ //slock_t mutex; /* unused if not partitioned table */
+
+ // AALEKSEEV: fix comments
+ //LWLock* lock; // only for partitioned hash tables
+ slock_t mutex[NUM_LOCK_PARTITIONS];
/* These fields change during entry addition/deletion */
- long nentries; /* number of entries in hash table */
- HASHELEMENT *freeList; /* linked list of free elements */
+ /* number of entries in hash table */
+ long nentries[NUM_LOCK_PARTITIONS];
+ /* linked list of free elements */
+ HASHELEMENT *freeList[NUM_LOCK_PARTITIONS];
/* These fields can change, but not in a partitioned table */
/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +174,9 @@ struct HASHHDR
#define IS_PARTITIONED(hctl) ((hctl)->num_partitions != 0)
+// AALEKSEEV: add comment
+#define PARTITION_IDX(hctl, hashcode) (IS_PARTITIONED(hctl) ? LockHashPartition(hashcode) : 0)
+
/*
* Top control structure for a hashtable --- in a shared table, each backend
* has its own copy (OK since no fields change at runtime)
@@ -219,10 +230,10 @@ static long hash_accesses,
*/
static void *DynaHashAlloc(Size size);
static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem);
+static bool element_alloc(HTAB *hashp, int nelem, int partition_idx);
static bool dir_realloc(HTAB *hashp);
static bool expand_table(HTAB *hashp);
-static HASHBUCKET get_hash_entry(HTAB *hashp);
+static HASHBUCKET get_hash_entry(HTAB *hashp, int partition_idx);
static void hdefault(HTAB *hashp);
static int choose_nelem_alloc(Size entrysize);
static bool init_htab(HTAB *hashp, long nelem);
@@ -282,6 +293,7 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
{
HTAB *hashp;
HASHHDR *hctl;
+ int i, partitions_number, nelem_alloc;
/*
* For shared hash tables, we have a local hash header (HTAB struct) that
@@ -408,7 +420,7 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
if (!hashp->hctl)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ errmsg("out of memory (3)"))); // AALEKSEEV: fix string
}
hashp->frozen = false;
@@ -482,10 +494,20 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
if ((flags & HASH_SHARED_MEM) ||
nelem < hctl->nelem_alloc)
{
- if (!element_alloc(hashp, (int) nelem))
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ if(IS_PARTITIONED(hashp->hctl))
+ partitions_number = NUM_LOCK_PARTITIONS;
+ else
+ partitions_number = 1;
+
+ nelem_alloc = ((int) nelem) / partitions_number;
+ if(nelem_alloc == 0)
+ nelem_alloc = 1;
+
+ for(i = 0; i < partitions_number; i++)
+ if (!element_alloc(hashp, nelem_alloc, i))
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory (1)"))); // AALEKSEEV: fix string
}
if (flags & HASH_FIXED_SIZE)
@@ -503,8 +525,11 @@ hdefault(HTAB *hashp)
MemSet(hctl, 0, sizeof(HASHHDR));
- hctl->nentries = 0;
- hctl->freeList = NULL;
+ // AALEKSEEV: redundant!
+ // hctl->nentries = 0;
+ // hctl->freeList = NULL;
+
+ // hctl->lock = NULL;
hctl->dsize = DEF_DIRSIZE;
hctl->nsegs = 0;
@@ -572,12 +597,19 @@ init_htab(HTAB *hashp, long nelem)
HASHSEGMENT *segp;
int nbuckets;
int nsegs;
-
+ int i;
/*
* initialize mutex if it's a partitioned table
*/
- if (IS_PARTITIONED(hctl))
- SpinLockInit(&hctl->mutex);
+ if (IS_PARTITIONED(hctl))
+ {
+ for(i = 0; i < NUM_LOCK_PARTITIONS; i++)
+ SpinLockInit(&(hctl->mutex[i]));
+ }
+
+// AALEKSEEV: remove
+ //if(IS_PARTITIONED(hctl))
+ //hctl->lock = LWLockAssign();
/*
* Divide number of elements by the fill factor to determine a desired
@@ -648,7 +680,8 @@ init_htab(HTAB *hashp, long nelem)
"HIGH MASK ", hctl->high_mask,
"LOW MASK ", hctl->low_mask,
"NSEGS ", hctl->nsegs,
- "NENTRIES ", hctl->nentries);
+ // AALEKSEEV: fix this
+ "NENTRIES ", hctl->nentries[0]);
#endif
return true;
}
@@ -769,7 +802,8 @@ hash_stats(const char *where, HTAB *hashp)
where, hashp->hctl->accesses, hashp->hctl->collisions);
fprintf(stderr, "hash_stats: entries %ld keysize %ld maxp %u segmentcount %ld\n",
- hashp->hctl->nentries, (long) hashp->hctl->keysize,
+ // AALEKSEEV: fix this
+ hashp->hctl->nentries[0], (long) hashp->hctl->keysize,
hashp->hctl->max_bucket, hashp->hctl->nsegs);
fprintf(stderr, "%s: total accesses %ld total collisions %ld\n",
where, hash_accesses, hash_collisions);
@@ -863,6 +897,7 @@ hash_search_with_hash_value(HTAB *hashp,
HASHBUCKET currBucket;
HASHBUCKET *prevBucketPtr;
HashCompareFunc match;
+ int partition_idx = PARTITION_IDX(hctl, hashvalue);
#if HASH_STATISTICS
hash_accesses++;
@@ -885,7 +920,7 @@ hash_search_with_hash_value(HTAB *hashp,
* order of these tests is to try to check cheaper conditions first.
*/
if (!IS_PARTITIONED(hctl) && !hashp->frozen &&
- hctl->nentries / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
+ hctl->nentries[0] / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
!has_seq_scans(hashp))
(void) expand_table(hashp);
}
@@ -942,21 +977,25 @@ hash_search_with_hash_value(HTAB *hashp,
if (currBucket != NULL)
{
/* if partitioned, must lock to touch nentries and freeList */
+ // AALEKSEEV: remove this
if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ //LWLockAcquire(hctl->lock, LW_SHARED);
+ SpinLockAcquire(&(hctl->mutex[partition_idx]));
- Assert(hctl->nentries > 0);
- hctl->nentries--;
+ Assert(hctl->nentries[partition_idx] > 0);
+ hctl->nentries[partition_idx]--;
/* remove record from hash bucket's chain. */
*prevBucketPtr = currBucket->link;
/* add the record to the freelist for this table. */
- currBucket->link = hctl->freeList;
- hctl->freeList = currBucket;
+ currBucket->link = hctl->freeList[partition_idx];
+ hctl->freeList[partition_idx] = currBucket;
+ // AALEKSEEV: remove this
if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ // LWLockRelease(hctl->lock);
+ SpinLockRelease(&hctl->mutex[partition_idx]);
/*
* better hope the caller is synchronizing access to this
@@ -982,7 +1021,7 @@ hash_search_with_hash_value(HTAB *hashp,
elog(ERROR, "cannot insert into frozen hashtable \"%s\"",
hashp->tabname);
- currBucket = get_hash_entry(hashp);
+ currBucket = get_hash_entry(hashp, partition_idx);
if (currBucket == NULL)
{
/* out of memory */
@@ -996,7 +1035,7 @@ hash_search_with_hash_value(HTAB *hashp,
else
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ errmsg("out of memory (2)"))); // AALEKSEEV: fix string
}
/* link into hashbucket chain */
@@ -1175,41 +1214,81 @@ hash_update_hash_key(HTAB *hashp,
* create a new entry if possible
*/
static HASHBUCKET
-get_hash_entry(HTAB *hashp)
+get_hash_entry(HTAB *hashp, int partition_idx)
{
HASHHDR *hctl = hashp->hctl;
HASHBUCKET newElement;
+ int i, steal_from_idx;
for (;;)
{
/* if partitioned, must lock to touch nentries and freeList */
if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ // LWLockAcquire(hctl->lock, LW_SHARED);
+ SpinLockAcquire(&hctl->mutex[partition_idx]);
/* try to get an entry from the freelist */
- newElement = hctl->freeList;
+ newElement = hctl->freeList[partition_idx];
+
if (newElement != NULL)
- break;
+ {
+ /* remove entry from freelist, bump nentries */
+ hctl->freeList[partition_idx] = newElement->link;
+ hctl->nentries[partition_idx]++;
+ if (IS_PARTITIONED(hctl))
+ // LWLockRelease(hctl->lock);
+ SpinLockRelease(&hctl->mutex[partition_idx]);
+
+ return newElement;
+ }
- /* no free elements. allocate another chunk of buckets */
if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ SpinLockRelease(&hctl->mutex[partition_idx]);
+ // LWLockRelease(hctl->lock);
- if (!element_alloc(hashp, hctl->nelem_alloc))
+ /* no free elements. allocate another chunk of buckets */
+ if (!element_alloc(hashp, hctl->nelem_alloc, partition_idx))
{
- /* out of memory */
- return NULL;
- }
- }
+ if (!IS_PARTITIONED(hctl))
+ return NULL; /* out of memory */
- /* remove entry from freelist, bump nentries */
- hctl->freeList = newElement->link;
- hctl->nentries++;
+ /* try to "steal" element from another partition */
+ // LWLockAcquire(hctl->lock, LW_EXCLUSIVE);
- if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ // for(i = 0; i < NUM_LOCK_PARTITIONS; i++)
+ // SpinLockAcquire(&(hctl->mutex[i]));
+
+ steal_from_idx = partition_idx;
+ for(;;)
+ {
+ steal_from_idx = (steal_from_idx + 1) % NUM_LOCK_PARTITIONS;
+ if(steal_from_idx == partition_idx)
+ break;
+
+ SpinLockAcquire(&(hctl->mutex[steal_from_idx]));
+ newElement = hctl->freeList[steal_from_idx];
+
+ if(newElement != NULL)
+ {
+ hctl->freeList[steal_from_idx] = newElement->link;
+ SpinLockRelease(&(hctl->mutex[steal_from_idx]));
+
+ SpinLockAcquire(&hctl->mutex[partition_idx]);
+ hctl->nentries[partition_idx]++;
+ SpinLockRelease(&hctl->mutex[partition_idx]);
+
+ break;
+ }
- return newElement;
+ SpinLockRelease(&(hctl->mutex[steal_from_idx]));
+ }
+
+ // LWLockRelease(hctl->lock);
+ // for(i = 0; i < NUM_LOCK_PARTITIONS; i++)
+ // SpinLockRelease(&(hctl->mutex[i]));
+ return newElement;
+ }
+ }
}
/*
@@ -1218,11 +1297,21 @@ get_hash_entry(HTAB *hashp)
long
hash_get_num_entries(HTAB *hashp)
{
+ int i;
+ long sum = hashp->hctl->nentries[0];
+
/*
* We currently don't bother with the mutex; it's only sensible to call
* this function if you've got lock on all partitions of the table.
*/
- return hashp->hctl->nentries;
+
+ if(!IS_PARTITIONED(hashp->hctl))
+ return sum;
+
+ for(i = 1; i < NUM_LOCK_PARTITIONS; i++)
+ sum += hashp->hctl->nentries[i];
+
+ return sum;
}
/*
@@ -1530,7 +1619,7 @@ seg_alloc(HTAB *hashp)
* allocate some new elements and link them into the free list
*/
static bool
-element_alloc(HTAB *hashp, int nelem)
+element_alloc(HTAB *hashp, int nelem, int partition_idx)
{
HASHHDR *hctl = hashp->hctl;
Size elementSize;
@@ -1562,15 +1651,23 @@ element_alloc(HTAB *hashp, int nelem)
}
/* if partitioned, must lock to touch freeList */
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockAcquire(&hctl->mutex);
+
if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ // LWLockAcquire(hctl->lock, LW_SHARED);
+ SpinLockAcquire(&hctl->mutex[partition_idx]);
/* freelist could be nonempty if two backends did this concurrently */
- firstElement->link = hctl->freeList;
- hctl->freeList = prevElement;
+ firstElement->link = hctl->freeList[partition_idx];
+ hctl->freeList[partition_idx] = prevElement;
if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ // LWLockRelease(hctl->lock);
+ SpinLockRelease(&hctl->mutex[partition_idx]);
+
+ // if (IS_PARTITIONED(hctl))
+ // SpinLockRelease(&hctl->mutex);
return true;
}
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ff34529..767f20b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -128,13 +128,14 @@ extern char *MainLWLockNames[];
* having this file include lock.h or bufmgr.h would be backwards.
*/
-/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 128
-
/* Number of partitions the shared lock tables are divided into */
#define LOG2_NUM_LOCK_PARTITIONS 4
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
+ /* Number of partitions of the shared buffer mapping hashtable */
+ // AALEKSEEV: refactor
+#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
+
/* Number of partitions the shared predicate lock tables are divided into */
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)