Re: [PoC] Improve dead tuple storage for lazy vacuum

Masahiko Sawada Wed, 08 Feb 2023 23:08:57 -0800

Hi,

On Tue, Feb 7, 2023 at 6:25 PM John Naylor <john.nay...@enterprisedb.com> wrote:
>
>
> On Tue, Jan 31, 2023 at 9:43 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> > I've attached v24 patches. The locking support patch is separated
> > (0005 patch). Also I kept the updates for TidStore and the vacuum
> > integration from v23 separate.
>
> Okay, that's a lot more simple, and closer to what I imagined. For v25, I 
> squashed v24's additions and added a couple of my own. I've kept the CF 
> status at "needs review" because no specific action is required at the moment.
>
> I did start to review the TID store some more, but that's on hold because 
> something else came up: On a lark I decided to re-run some benchmarks to see 
> if anything got lost in converting to a template, and that led me down a 
> rabbit hole -- some good and bad news on that below.
>
> 0001:
>
> I removed the uint64 case, as discussed. There is now a brief commit message, 
> but needs to be fleshed out a bit. I took another look at the Arm 
> optimization that Nathan found some month ago, for forming the highbit mask, 
> but that doesn't play nicely with how node32 uses it, so I decided against 
> it. I added a comment to describe the reasoning in case someone else gets a 
> similar idea.
>
> I briefly looked into "separate-commit TODO: move non-SIMD fallbacks to their 
> own header to clean up the #ifdef maze.", but decided it wasn't such a clear 
> win to justify starting the work now. It's still in the back of my mind, but 
> I removed the reminder from the commit message.


The changes make sense to me.

>
> 0003:
>
> The template now requires the value to be passed as a pointer. That was a 
> pretty trivial change, but affected multiple other patches, so not sent 
> separately. Also adds a forgotten RT_ prefix to the bitmap macros and adds a 
> top comment to the *_impl.h headers. There are some comment fixes. The 
> changes were either trivial or discussed earlier, so also not sent separately.

Great.

>
> 0004/5: I wanted to measure the load time as well as search time in 
> bench_search_random_nodes(). That's kept separate to make it easier to test 
> other patch versions.
>
> The bad news is that the speed of loading TIDs in bench_seq/shuffle_search() 
> has regressed noticeably. I can't reproduce this in any other bench function 
> and was the reason for writing 0005 to begin with. More confusingly, my 
> efforts to fix this improved *other* functions, but the former didn't budge 
> at all. First the patches:
>
> 0006 adds and removes some "inline" declarations (where it made sense), and 
> added some for "pg_noinline" based on Andres' advice some months ago.

Agreed.

>
> 0007 removes some dead code. RT_NODE_INSERT_INNER is only called during 
> RT_SET_EXTEND, so it can't possibly find an existing key. This kind of change 
> is much easier with the inner/node cases handled together in a template, as 
> far as being sure of how those cases are different. I thought about trying 
> the search in assert builds and verifying it doesn't exist, but thought yet 
> another #ifdef would be too messy.

Agreed.

>
> v25-addendum-try-no-maintain-order.txt -- It makes optional keeping the key 
> chunks in order for the linear-search nodes. I believe the TID store no 
> longer cares about the ordering, but this is a text file for now because I 
> don't want to clutter the CI with a behavior change. Also, the second ART 
> paper (on concurrency) mentioned that some locking schemes don't allow these 
> arrays to be shifted. So it might make sense to give up entirely on 
> guaranteeing ordered iteration, or at least make it optional as in the patch.

I think it's still important for lazy vacuum that an iteration over a
TID store returns TIDs in ascending order, because otherwise a heap
vacuum does random writes. That being said, we can have
RT_ITERATE_NEXT() return key-value pairs in an order regardless of how
the key chunks are stored in a node.

> ========================================
> psql -c "select rt_load_ms, rt_search_ms from bench_seq_search(0, 1 * 1000 * 
> 1000)"
> (min load time of three)
>
> v15:
>  rt_load_ms | rt_search_ms
> ------------+--------------
>         113 |          455
>
> v25-0005:
>  rt_load_ms | rt_search_ms
> ------------+--------------
>         135 |          456
>
> v25-0006 (inlining or not):
>  rt_load_ms | rt_search_ms
> ------------+--------------
>         136 |          455
>
> v25-0007 (remove dead code):
>  rt_load_ms | rt_search_ms
> ------------+--------------
>         135 |          455
>
> v25-addendum...txt (no ordering):
>  rt_load_ms | rt_search_ms
> ------------+--------------
>         134 |          455
>
> Note: The regression seems to have started in v17, which is the first with a 
> full template.
>
> Nothing so far has helped here, and previous experience has shown that trying 
> to profile 100ms will not be useful. Instead of putting more effort into 
> diving deeper, it seems a better use of time to write a benchmark that calls 
> the tid store itself. That's more realistic, since this function was intended 
> to test load and search of tids, but the tid store doesn't quite operate so 
> simply anymore. What do you think, Masahiko?

Yeah, that's more realistic. TidStore now encodes TIDs slightly
differently from the benchmark test.

I've attached the patch that adds a simple benchmark test using
TidStore. With this test, I got similar trends of results to yours
with gcc, but I've not analyzed them in depth yet.

query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)

v15:
 load_ms
---------
     816

v25-0007 (remove dead code):
load_ms
---------
     839

v25-addendum...txt (no ordering):
 load_ms
---------
     820

BTW it would be better to remove the RT_DEBUG macro from bench_radix_tree.c.

>
> I'm inclined to keep 0006, because it might give a slight boost, and 0007 
> because it's never a bad idea to remove dead code.

Yeah, these two changes make sense to me too.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

From e056133360436e115a434a8a21685a99602a5b5d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.m...@gmail.com>
Date: Wed, 8 Feb 2023 15:53:14 +0900
Subject: [PATCH] Add bench_tidstore_load()

---
 .../bench_radix_tree--1.0.sql                 | 10 ++++
 contrib/bench_radix_tree/bench_radix_tree.c   | 46 +++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql 
b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 95eedbbe10..fbf51c1086 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -75,3 +75,13 @@ OUT rt_sparseload_ms int8
 returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c 
b/contrib/bench_radix_tree/bench_radix_tree.c
index 7d1e2eee57..3c2caa3b90 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -9,6 +9,7 @@
  */
 #include "postgres.h"
 
+#include "access/tidstore.h"
 #include "common/pg_prng.h"
 #include "fmgr.h"
 #include "funcapi.h"
@@ -54,6 +55,7 @@ PG_FUNCTION_INFO_V1(bench_load_random_int);
 PG_FUNCTION_INFO_V1(bench_fixed_height_search);
 PG_FUNCTION_INFO_V1(bench_search_random_nodes);
 PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
 
 static uint64
 tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -168,6 +170,50 @@ vac_cmp_itemptr(const void *left, const void *right)
 }
 #endif
 
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+       BlockNumber minblk = PG_GETARG_INT32(0);
+       BlockNumber maxblk = PG_GETARG_INT32(1);
+       TidStore        *ts;
+       OffsetNumber *offs;
+       TimestampTz start_time,
+                               end_time;
+       long            secs;
+       int                     usecs;
+       int64           load_ms;
+       TupleDesc       tupdesc;
+       Datum           values[2];
+       bool            nulls[2] = {false};
+
+       /* Build a tuple descriptor for our result type */
+       if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+               elog(ERROR, "return type must be a row type");
+
+       offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+       for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+               offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+       ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, 
NULL);
+
+       elog(NOTICE, "sleeping for 2 seconds...");
+       pg_usleep(2 * 1000000L);
+
+       /* load tids */
+       start_time = GetCurrentTimestamp();
+       for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+               tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+       end_time = GetCurrentTimestamp();
+       TimestampDifference(start_time, end_time, &secs, &usecs);
+       load_ms = secs * 1000 + usecs / 1000;
+
+       values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+       values[1] = Int64GetDatum(load_ms);
+
+       tidstore_destroy(ts);
+       PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, 
nulls)));
+}
+
 static Datum
 bench_search(FunctionCallInfo fcinfo, bool shuffle)
 {
-- 
2.31.1

Re: [PoC] Improve dead tuple storage for lazy vacuum

Reply via email to