On Wed, 25 May 2022 at 04:02, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Here's a draft patch for the other way of doing it.  I'd first tried
> to make the side-effects completely local to generation.c, but that
> fails in view of code like
>
>     MemoryContextAlloc(GetMemoryChunkContext(x), ...)
>
> Thus we pretty much have to have some explicit awareness of this scheme
> in GetMemoryChunkContext().  There's more than one way it could be
> done, but I thought a clean way is to invent a separate NodeTag type
> to identify the indirection case.
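(For anyone else following the thread, my reading of the indirection is roughly the sketch below. The field and tag names here are invented, not necessarily what the patch uses; the real details are in invent-MemoryContextLink-1.patch.)

/*
 * Sketch only.  The idea, as I understand it: the pointer stored just
 * before a chunk can aim either directly at the owning MemoryContext or
 * at an intermediate link node, and GetMemoryChunkContext() tells the
 * two cases apart by NodeTag.  generation.c could then point all chunks
 * in a block at a single link embedded in that block.
 */
typedef struct MemoryContextLink
{
	NodeTag		type;			/* hypothetical T_MemoryContextLink tag */
	MemoryContext owner;		/* the context that owns the chunks */
} MemoryContextLink;

static inline MemoryContext
GetMemoryChunkContext(void *pointer)
{
	/* the context (or link) pointer is stored just before the chunk */
	MemoryContext context = *(MemoryContext *) ((char *) pointer - sizeof(void *));

	/* follow the indirection when the header points at a link node */
	if (((Node *) context)->type == T_MemoryContextLink)
		context = ((MemoryContextLink *) context)->owner;

	return context;
}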
Thanks for coding that up. This seems like a much better idea than mine.

I ran the same benchmark as I did in the blog against your patch and got the attached sort_bench.png. You can see the patch fixes the 64MB work_mem performance regression, as we'd expect.

To get an idea of the overhead, I came up with the attached allocate_performance_function.patch, which adds a function named pg_allocate_generation_memory(). You pass it a chunk_size for the number of bytes to allocate at once, a keep_memory to tell the function how much memory to keep around before it starts to pfree the oldest allocations, and a final parameter for the total amount of memory to allocate.

The attached script calls the function with chunk sizes from 8 bytes up to 2048, doubling each step. It keeps 1MB of memory allocated and performs a total of 1GB of allocations.

I ran the script against today's master and master + invent-MemoryContextLink-1.patch and got the attached tps_results.txt. The worst case is the 8-byte allocation size, where performance drops around 7%. For the larger chunk sizes the drop is much smaller, mostly in the <1% to ~6% range. For the 2048-byte chunks, performance actually seems to improve (?).

Obviously, the test is pretty demanding as far as palloc and pfree go; I imagine nothing in the actual code comes close to that allocation density. The test was just aimed at giving us an idea of the overhead, so it might not be enough information to judge whether we should be concerned about more realistic palloc/pfree workloads.

I didn't test the performance of an aset.c context. I imagine the relative overhead there is likely to be smaller, since aset.c is generally slower to begin with, having to jump through a few more hoops during palloc/pfree.

David
diff --git a/src/backend/utils/adt/mcxtfuncs.c b/src/backend/utils/adt/mcxtfuncs.c
index bb7cc94024..d11495b716 100644
--- a/src/backend/utils/adt/mcxtfuncs.c
+++ b/src/backend/utils/adt/mcxtfuncs.c
@@ -193,3 +193,122 @@ pg_log_backend_memory_contexts(PG_FUNCTION_ARGS)
 
 	PG_RETURN_BOOL(true);
 }
+
+typedef struct AllocateTestNext
+{
+	struct AllocateTestNext *next;	/* ptr to the next allocation */
+} AllocateTestNext;
+
+/* #define ALLOCATE_TEST_DEBUG */
+
+/*
+ * pg_allocate_generation_memory
+ *		Used to test the performance of a generation memory context
+ */
+Datum
+pg_allocate_generation_memory(PG_FUNCTION_ARGS)
+{
+	int32		chunk_size = PG_GETARG_INT32(0);
+	int64		keep_memory = PG_GETARG_INT64(1);
+	int64		total_alloc = PG_GETARG_INT64(2);
+	int64		curr_memory_use = 0;
+	int64		remaining_alloc_bytes = total_alloc;
+	MemoryContext context;
+	MemoryContext oldContext;
+	AllocateTestNext *next_free_ptr = NULL;
+	AllocateTestNext *last_alloc = NULL;
+
+	if (chunk_size < sizeof(AllocateTestNext))
+		elog(ERROR, "chunk_size (%d) must be at least %ld bytes", chunk_size,
+			 sizeof(AllocateTestNext));
+	if (keep_memory > total_alloc)
+		elog(ERROR, "keep_memory (" INT64_FORMAT ") must not exceed total_alloc (" INT64_FORMAT ")",
+			 keep_memory, total_alloc);
+
+	context = GenerationContextCreate(CurrentMemoryContext,
+									  "pg_allocate_generation_memory",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldContext = MemoryContextSwitchTo(context);
+
+	while (remaining_alloc_bytes > 0)
+	{
+		AllocateTestNext *curr_alloc;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Allocate the memory and update the counters */
+		curr_alloc = (AllocateTestNext *) palloc(chunk_size);
+		remaining_alloc_bytes -= chunk_size;
+		curr_memory_use += chunk_size;
+
+#ifdef ALLOCATE_TEST_DEBUG
+		elog(NOTICE, "alloc %p (curr_memory_use " INT64_FORMAT " bytes, remaining_alloc_bytes " INT64_FORMAT ")", curr_alloc, curr_memory_use, remaining_alloc_bytes);
+#endif
+
+		/*
+		 * This might be the last allocation, so point its ->next to NULL so
+		 * that we know when to stop in the cleanup loop below.
+		 */
+		curr_alloc->next = NULL;
+
+		/*
+		 * Store a pointer to curr_alloc in the memory we allocated in the
+		 * last iteration.  This allows us to use the memory we're allocating
+		 * to store a pointer to the next allocation.  Do this before any
+		 * pfree below so that a chunk cannot be freed before its ->next
+		 * pointer has been filled in.
+		 */
+		if (last_alloc != NULL)
+			last_alloc->next = curr_alloc;
+
+		/* remember this allocation so we can link to it in the next loop */
+		last_alloc = curr_alloc;
+
+		/*
+		 * If the currently allocated memory has reached or exceeded the
+		 * amount of memory we want to keep allocated at once then we'd
+		 * better free some.  Since all allocations are the same size we
+		 * only need to free one allocation per loop.
+		 */
+		if (next_free_ptr != NULL && curr_memory_use >= keep_memory)
+		{
+			AllocateTestNext *next = next_free_ptr->next;
+
+			/* free the memory and update the current memory usage */
+			pfree(next_free_ptr);
+			curr_memory_use -= chunk_size;
+
+#ifdef ALLOCATE_TEST_DEBUG
+			elog(NOTICE, "free %p (curr_memory_use " INT64_FORMAT " bytes, remaining_alloc_bytes " INT64_FORMAT ")", next_free_ptr, curr_memory_use, remaining_alloc_bytes);
+#endif
+
+			/* get the next chunk to free */
+			next_free_ptr = next;
+		}
+		else if (next_free_ptr == NULL)
+		{
+			/*
+			 * Remember the first chunk to free.  We will follow the ->next
+			 * pointers to find each subsequent chunk to free.
+			 */
+			next_free_ptr = curr_alloc;
+		}
+	}
+
+	/* cleanup loop -- pfree remaining memory */
+	while (next_free_ptr != NULL)
+	{
+		AllocateTestNext *next = next_free_ptr->next;
+
+		/* free the memory and update the current memory usage */
+		pfree(next_free_ptr);
+		curr_memory_use -= chunk_size;
+
+#ifdef ALLOCATE_TEST_DEBUG
+		elog(NOTICE, "free %p (curr_memory_use " INT64_FORMAT " bytes, remaining_alloc_bytes " INT64_FORMAT ")", next_free_ptr, curr_memory_use, remaining_alloc_bytes);
+#endif
+
+		next_free_ptr = next;
+	}
+
+	MemoryContextSwitchTo(oldContext);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 87aa571a33..12ed69408b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8093,6 +8093,12 @@
   prorettype => 'bool', proargtypes => 'int4',
   prosrc => 'pg_log_backend_memory_contexts' },
 
+# just for testing memory context allocation speed
+{ oid => '9319', descr => 'for testing performance of generation allocation and freeing',
+  proname => 'pg_allocate_generation_memory', provolatile => 'v',
+  prorettype => 'void', proargtypes => 'int4 int8 int8',
+  prosrc => 'pg_allocate_generation_memory' },
+
 # non-persistent series generator
 { oid => '1066', descr => 'non-persistent series generator',
   proname => 'generate_series', prorows => '1000',
#!/bin/bash

dbname=postgres
sec=20

for bytes in 8 16 32 64 128 256 512 1024 2048
do
	for sql in "select pg_allocate_generation_memory($bytes, 1024*1024, 1024*1024*1024);"
	do
		echo "$sql" > bench.sql
		pgbench -n -f bench.sql -T $sec -M prepared $dbname
	done
done
alloc size  MemoryContextLink (tps)  master (tps)  compare
8           1.16201                  1.24689       93.19%
16          2.43831                  2.45643       99.26%
32          4.72368                  4.87890       96.82%
64          8.90167                  9.01659       98.73%
128         17.64886                 16.97465      103.97%
256         31.71387                 33.66108      94.22%
512         64.51953                 65.44095      98.59%
1024        119.99962                123.36183     97.27%
2048        248.73356                214.43335     116.00%
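(To put the worst-case row in per-operation terms, back-of-the-envelope from the numbers above: at the 8-byte chunk size each transaction performs 1GB / 8 bytes = 134,217,728 palloc/pfree pairs, so

    master:            1 / 1.24689 tps = 0.802 s/xact ~= 6.0 ns per palloc/pfree pair
    MemoryContextLink: 1 / 1.16201 tps = 0.861 s/xact ~= 6.4 ns per pair

i.e. the ~7% regression works out to somewhere under half a nanosecond of extra work per allocate/free cycle.)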