On Tue, Sep 20, 2022 at 3:19 PM Masahiko Sawada <sawada.m...@gmail.com> wrote: > > On Fri, Sep 16, 2022 at 4:54 PM John Naylor > <john.nay...@enterprisedb.com> wrote:
> > Here again, I'd rather put this off and focus on getting the "large > > details" in good enough shape so we can got towards integrating with > > vacuum. > > Thank you for the comments! These above comments are addressed by > Nathan in a newly derived thread. I'll work on the patch. I still seem to be out-voted on when to tackle this particular optimization, so I've extended the v6 benchmark code with a hackish function that populates a fixed number of keys, but with different fanouts. (diff attached as a text file) I didn't take particular care to make this scientific, but the following seems pretty reproducible. Note what happens to load and search performance when node16 has 15 entries versus 16: fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms --------+--------+------------------+------------+-------------- 15 | 327680 | 3776512 | 39 | 20 (1 row) num_keys = 327680, height = 4, n4 = 1, n16 = 23408, n32 = 0, n128 = 0, n256 = 0 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms --------+--------+------------------+------------+-------------- 16 | 327680 | 3514368 | 25 | 11 (1 row) num_keys = 327680, height = 4, n4 = 0, n16 = 21846, n32 = 0, n128 = 0, n256 = 0 In trying to wrap the SIMD code behind layers of abstraction, the latest patch (and Nathan's cleanup) threw it away in almost all cases. To explain, we need to talk about how vectorized code deals with the "tail" that is too small for the register: 1. Use a one-by-one algorithm, like we do for the pg_lfind* variants. 2. Read some junk into the register and mask off false positives from the result. There are advantages to both depending on the situation. Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15 elements or less, it will iterate over them one-by-one exactly like a node4. Only when full with 16 will the vector path be taken. When another entry is added, the elements are copied to the next bigger node, so there's a *small* window where it's fast. In short, this code needs to be lower level so that we still have full control while being portable. I will work on this, and also the related code for node dispatch. Since v6 has some good infrastructure to do low-level benchmarking, I also want to do some experiments with memory management. (I have further comments about the code, but I will put that off until later) > I'll consider how to integrate with vacuum as the next step. One > concern for me is how to limit the memory usage to > maintenance_work_mem. Unlike using a flat array, memory space for > adding one TID varies depending on the situation. If we want strictly > not to allow using memory more than maintenance_work_mem, probably we > need to estimate the memory consumption in a conservative way. +1 -- John Naylor EDB: http://www.enterprisedb.com
commit 18407962e96ccec6c9aeeba97412edd762a5a4fe Author: John Naylor <john.nay...@postgresql.org> Date: Wed Sep 21 11:44:43 2022 +0700 Add special benchmark function to test effect of fanout diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile index b8f70e12d1..952bb0ceae 100644 --- a/contrib/bench_radix_tree/Makefile +++ b/contrib/bench_radix_tree/Makefile @@ -7,7 +7,7 @@ OBJS = \ EXTENSION = bench_radix_tree DATA = bench_radix_tree--1.0.sql -REGRESS = bench +REGRESS = bench_fixed_height ifdef USE_PGXS PG_CONFIG = pg_config diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql index 6663abe6a4..f2fee15b17 100644 --- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql +++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql @@ -40,3 +40,15 @@ OUT load_ms int8) returns record as 'MODULE_PATHNAME' LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE; + +create function bench_fixed_height_search( +fanout int4, +OUT fanout int4, +OUT nkeys int8, +OUT rt_mem_allocated int8, +OUT rt_load_ms int8, +OUT rt_search_ms int8 +) +returns record +as 'MODULE_PATHNAME' +LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE; diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c index 5806ef7519..0778da2d7b 100644 --- a/contrib/bench_radix_tree/bench_radix_tree.c +++ b/contrib/bench_radix_tree/bench_radix_tree.c @@ -13,6 +13,7 @@ #include "fmgr.h" #include "funcapi.h" #include "lib/radixtree.h" +#include <math.h> #include "miscadmin.h" #include "utils/timestamp.h" @@ -24,6 +25,7 @@ PG_MODULE_MAGIC; PG_FUNCTION_INFO_V1(bench_seq_search); PG_FUNCTION_INFO_V1(bench_shuffle_search); PG_FUNCTION_INFO_V1(bench_load_random_int); +PG_FUNCTION_INFO_V1(bench_fixed_height_search); static radix_tree *rt = NULL; static ItemPointer itemptrs = NULL; @@ -299,3 +301,108 @@ bench_load_random_int(PG_FUNCTION_ARGS) rt_free(rt); PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls))); } + +Datum +bench_fixed_height_search(PG_FUNCTION_ARGS) +{ + int fanout = PG_GETARG_INT32(0); + radix_tree *rt; + TupleDesc tupdesc; + TimestampTz start_time, end_time; + long secs; + int usecs; + int64 rt_load_ms, rt_search_ms; + Datum values[5]; + bool nulls[5]; + + /* test boundary between vector and iteration */ + const int n_keys = 5 * 16 * 16 * 16 * 16; + uint64 r, h, i, j, k; + int key_id; + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + rt = rt_create(CurrentMemoryContext); + + start_time = GetCurrentTimestamp(); + + key_id = 0; + /* lower nodes have limited fanout, the top is only limited by bits-per-byte */ + for (r=1;;r++) + { + for (h = 1; h <= fanout; h++) + { + for (i = 1; i <= fanout; i++) + { + for (j = 1; j <= fanout; j++) + { + for (k = 1; k <= fanout; k++) + { + uint64 key; + key = (r<<32) | (h<<24) | (i<<16) | (j<<8) | (k); + + CHECK_FOR_INTERRUPTS(); + + key_id++; + if (key_id > n_keys) + goto finish_set; + + rt_set(rt, key, key_id); + } + } + } + } + } +finish_set: + end_time = GetCurrentTimestamp(); + TimestampDifference(start_time, end_time, &secs, &usecs); + rt_load_ms = secs * 1000 + usecs / 1000; + + rt_stats(rt); + + /* meaure the search time of the radix tree */ + start_time = GetCurrentTimestamp(); + + key_id = 0; + for (r=1;;r++) + { + for (h = 1; h <= fanout; h++) + { + for (i = 1; i <= fanout; i++) + { + for (j = 1; j <= fanout; j++) + { + for (k = 1; k <= fanout; k++) + { + uint64 key, val; + key = (r<<32) | (h<<24) | (i<<16) | (j<<8) | (k); + + CHECK_FOR_INTERRUPTS(); + + key_id++; + if (key_id > n_keys) + goto finish_search; + + rt_search(rt, key, &val); + } + } + } + } + } +finish_search: + end_time = GetCurrentTimestamp(); + TimestampDifference(start_time, end_time, &secs, &usecs); + rt_search_ms = secs * 1000 + usecs / 1000; + + MemSet(nulls, false, sizeof(nulls)); + values[0] = Int32GetDatum(fanout); + values[1] = Int64GetDatum(rt_num_entries(rt)); + values[2] = Int64GetDatum(rt_memory_usage(rt)); + values[3] = Int64GetDatum(rt_load_ms); + values[4] = Int64GetDatum(rt_search_ms); + + rt_free(rt); + PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls))); +} diff --git a/contrib/bench_radix_tree/expected/bench_fixed_height.out b/contrib/bench_radix_tree/expected/bench_fixed_height.out new file mode 100644 index 0000000000..c4995afc13 --- /dev/null +++ b/contrib/bench_radix_tree/expected/bench_fixed_height.out @@ -0,0 +1,6 @@ +create extension bench_radix_tree; +\o fixed_height_search.data +begin; +select * from bench_fixed_height_search(15); +select * from bench_fixed_height_search(16); +commit; diff --git a/contrib/bench_radix_tree/sql/bench_fixed_height.sql b/contrib/bench_radix_tree/sql/bench_fixed_height.sql new file mode 100644 index 0000000000..0c06570e9a --- /dev/null +++ b/contrib/bench_radix_tree/sql/bench_fixed_height.sql @@ -0,0 +1,7 @@ +create extension bench_radix_tree; + +\o fixed_height_search.data +begin; +select * from bench_fixed_height_search(15); +select * from bench_fixed_height_search(16); +commit; diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c index b163eac480..4ce8e9ad9d 100644 --- a/src/backend/lib/radixtree.c +++ b/src/backend/lib/radixtree.c @@ -1980,7 +1980,7 @@ rt_verify_node(rt_node *node) void rt_stats(radix_tree *tree) { - fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u,n32 = %u, n128 = %u, n256 = %u", + fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u,n32 = %u, n128 = %u, n256 = %u\n", tree->num_keys, tree->root->shift / RT_NODE_SPAN, tree->cnt[0], diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h index 38cc6abf4c..6016d593ee 100644 --- a/src/include/lib/radixtree.h +++ b/src/include/lib/radixtree.h @@ -15,7 +15,7 @@ #include "postgres.h" -/* #define RT_DEBUG 1 */ +#define RT_DEBUG 1 typedef struct radix_tree radix_tree; typedef struct rt_iter rt_iter;