Re: Do not lock temp relations
On Mon, 30 Sept 2024 at 18:05, Tom Lane wrote:
> Yes. Our implementation restrictions preclude access to the contents
> of another session's temp tables, but it is not forbidden to do DDL
> on them so long as no content access is required. (Without this,
> it'd be problematic for example to clean out a crashed session's temp
> tables. See the "orphan temporary tables" logic in autovacuum.c.)

Indeed, a potentially dangerous situation may arise when both the autovacuum and a client process attempt to delete the contents of a temporary namespace. But there is a patch (6faca9ae2878c8f642a2e5748d2dbb2b91341bec) that protects us from a race condition in this case.

--
Best regards,
Daniil Davydov

On Tue, Oct 22, 2024 at 5:55 PM Tom Lane wrote:
> Maxim Orlov writes:
> > But for the second one: do we really need any lock for temp relations?
>
> Yes. Our implementation restrictions preclude access to the contents
> of another session's temp tables, but it is not forbidden to do DDL
> on them so long as no content access is required. (Without this,
> it'd be problematic for example to clean out a crashed session's temp
> tables. See the "orphan temporary tables" logic in autovacuum.c.)
> You need fairly realistic locking to ensure that's OK.
>
> regards, tom lane
Forbid to DROP temp tables of other sessions
Hi,

I noticed that TRUNCATE and ALTER commands on temporary tables of other sessions produce an error "cannot truncate/alter temporary tables of other sessions". But are there any reasons to allow DROP on such tables? It seems to me that the only case where we may need it is the removal of orphan tables, but autovacuum is responsible for that and uses different machinery for it. I'm wondering if there are any other cases. If not, can we just handle it, for example, in ExecDropStmt and produce an error like "cannot drop temporary tables of other sessions"?

--
Best regards,
Daniil Davydov
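For illustration only, the kind of check being proposed might look roughly like this (the exact placement, whether in ExecDropStmt or in the drop-relation callback, is the open question; the message wording simply mirrors the existing TRUNCATE error):

***
/* Hypothetical guard in the DROP path, once the relation has been
 * resolved and locked.  RELATION_IS_OTHER_TEMP() is the existing macro
 * already used by the TRUNCATE and ALTER TABLE checks. */
if (RELATION_IS_OTHER_TEMP(rel))
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("cannot drop temporary tables of other sessions")));
***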
Re: Forbid to DROP temp tables of other sessions
Hi, Thanks for your comments, I appreciate them. As I continued to deal with the topic of working with temp tables of other sessions, I noticed something like a bug. For example (REL_17_STABLE): Session 1: =# CREATE TEMP TABLE test(id int); Session 2: =# INSERT INTO pg_temp_0.test VALUES (1); =# INSERT INTO pg_temp_0.test VALUES (2); Second INSERT command will end with an error "cannot access temporary tables of other sessions". I checked why this is happening and found errors in several places. So, I attach two files to this email : 1) Isolation test, that shows an error in REL_17_STABLE (iso_1.patch) 2) Patch that fixes code that mistakenly considered temporary tables to be permanent (I will be glad to receive feedback on these fixes) + isolation test, which shows that now any action with temp table of other session leads to error (temp_tbl_fix.patch) Tests look kinda ugly, but I think it's inevitable, given that we don't know exactly what the name of the temporary schema of other session will be. -- Best regards, Daniil Davydov From 995002793b0af57b660565e6027615c083d8ce5e Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 28 Oct 2024 17:19:07 +0700 Subject: [PATCH] Add isolaton test for work with other session's temp table --- src/test/isolation/expected/temp-tables.out | 75 +++ src/test/isolation/isolation_schedule | 1 + src/test/isolation/specs/temp-tables.spec | 100 3 files changed, 176 insertions(+) create mode 100644 src/test/isolation/expected/temp-tables.out create mode 100644 src/test/isolation/specs/temp-tables.spec diff --git a/src/test/isolation/expected/temp-tables.out b/src/test/isolation/expected/temp-tables.out new file mode 100644 index 00..dbd6ffc2e5 --- /dev/null +++ b/src/test/isolation/expected/temp-tables.out @@ -0,0 +1,75 @@ +Parsed test spec with 2 sessions + +starting permutation: s1_st1 s1_st2 s2_st1 s1_st3 s2_st2 +step s1_st1: CREATE TEMP TABLE test_tmp(id int); +step s1_st2: INSERT INTO temp_schema_id SELECT pg_my_temp_schema(); +s2: NOTICE: cannot access temporary tables of other sessions +s2: NOTICE: cannot access temporary tables of other sessions +step s2_st1: +DO $$ +DECLARE +schema_name text; +BEGIN +-- Find out name of temporary schema of first session +SELECT nspname INTO schema_name + FROM pg_namespace + WHERE oid = (SELECT oid FROM temp_schema_id LIMIT 1); + +-- Execute few insert operations. We expect that behavior to be the +-- same (both will complete successfully or both will fail). 
Since +-- we have RELATION_IS_OTHER_TEMP() check in PrefetchBuffer and +-- ReadBufferExtended functions (src/backend/storage/buffer/bufmgr.c), +-- let's assume that both operations must fail (this is reflected +-- in expected file) +BEGIN +EXECUTE format('INSERT INTO %I.test_tmp VALUES (1);', schema_name); +EXCEPTION +WHEN feature_not_supported +THEN RAISE NOTICE 'cannot access temporary tables of other sessions'; +END; + +BEGIN +EXECUTE format('INSERT INTO %I.test_tmp VALUES (2);', schema_name); +EXCEPTION +WHEN feature_not_supported +THEN RAISE NOTICE 'cannot access temporary tables of other sessions'; +END; +END +$$; + +step s1_st3: INSERT INTO test_tmp VALUES (3); +s2: NOTICE: (3) +s2: NOTICE: cannot access temporary tables of other sessions +s2: NOTICE: cannot access temporary tables of other sessions +step s2_st2: +DO $$ +DECLARE +schema_name text; +result RECORD; +BEGIN +-- Find out name of temporary schema of first session +SELECT nspname INTO schema_name + FROM pg_namespace + WHERE oid = (SELECT oid FROM temp_schema_id LIMIT 1); + +-- Before this step call, first session inserted few tuples into +-- test_tmp table. Let's assume that SELECT result must contain all +-- of these tuples (based on current logic) +FOR result IN + EXECUTE format('SELECT * FROM %I.test_tmp;', schema_name) + LOOP + RAISE NOTICE '%', result; + END LOOP; + +-- Now lets try to update or delete tuples from test_tmp. If we +-- cannot insert into this table, lets assume that both UPDATE and +-- DELETE operations must return same error as INSERT +BEGIN +EXECUTE format('UPDATE %I.test_tmp SET id = 100 WHERE id = 3;', schema_name); +EXCEPTION +WHEN feature_not_supporte
Re: Forbid to DROP temp tables of other sessions
On Wed, Oct 30, 2024 at 7:32 PM Rafia Sabih wrote: > Good catch. I agree with this being an unwarranted behaviour. > A minor comment from my end is the wording of the error message. > Based on the Postgresql error message style huide, something like this could > be better, > "could not access temporary relations of other sessions". > -- > Regards, > Rafia Sabih > CYBERTEC PostgreSQL International GmbH > Thanks for your comment. I attach a patch with a fixed error message. Also you can find it in commit fest (https://commitfest.postgresql.org/51/5379/) -- Best Regards, Daniil Davydov From 2c2e1b2fd75d38d43fbec5023691b93dceb18dbb Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Sun, 27 Oct 2024 16:15:50 +0700 Subject: [PATCH] Add fixes so now we cannot access to temporary tables of other sessions --- src/backend/catalog/namespace.c | 52 -- src/backend/nodes/makefuncs.c | 6 +- src/backend/parser/gram.y | 11 ++- src/test/isolation/expected/temp-tables.out | 92 + src/test/isolation/isolation_schedule | 1 + src/test/isolation/specs/temp-tables.spec | 103 6 files changed, 231 insertions(+), 34 deletions(-) create mode 100644 src/test/isolation/expected/temp-tables.out create mode 100644 src/test/isolation/specs/temp-tables.spec diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index 30807f9190..1a8e8c8844 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -499,28 +499,19 @@ RangeVarGetRelidExtended(const RangeVar *relation, LOCKMODE lockmode, */ if (relation->relpersistence == RELPERSISTENCE_TEMP) { - if (!OidIsValid(myTempNamespace)) -relId = InvalidOid; /* this probably can't happen? */ - else + if (relation->schemaname) { -if (relation->schemaname) -{ - Oid namespaceId; +Oid namespaceId; - namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); - - /* - * For missing_ok, allow a non-existent schema name to - * return InvalidOid. - */ - if (namespaceId != myTempNamespace) - ereport(ERROR, -(errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("temporary tables cannot specify a schema name"))); -} - -relId = get_relname_relid(relation->relname, myTempNamespace); +namespaceId = LookupExplicitNamespace(relation->schemaname, + missing_ok); +if (namespaceId != myTempNamespace) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("could not access temporary relations of other sessions"))); } + + relId = get_relname_relid(relation->relname, myTempNamespace); } else if (relation->schemaname) { @@ -3387,17 +3378,14 @@ LookupExplicitNamespace(const char *nspname, bool missing_ok) Oid namespaceId; AclResult aclresult; - /* check for pg_temp alias */ if (strcmp(nspname, "pg_temp") == 0) { - if (OidIsValid(myTempNamespace)) - return myTempNamespace; - /* - * Since this is used only for looking up existing objects, there is - * no point in trying to initialize the temp namespace here; and doing - * so might create problems for some callers --- just fall through. 
+ * If namespace specified in 'pg_temp' format + * (without owner proc number specification), we assume that this is + * our temporary namespace */ + return myTempNamespace; } namespaceId = get_namespace_oid(nspname, missing_ok); @@ -3553,21 +3541,19 @@ get_namespace_oid(const char *nspname, bool missing_ok) RangeVar * makeRangeVarFromNameList(const List *names) { - RangeVar *rel = makeRangeVar(NULL, NULL, -1); + RangeVar *rel; switch (list_length(names)) { case 1: - rel->relname = strVal(linitial(names)); + rel = makeRangeVar(NULL, strVal(linitial(names)), -1); break; case 2: - rel->schemaname = strVal(linitial(names)); - rel->relname = strVal(lsecond(names)); + rel = makeRangeVar(strVal(linitial(names)), strVal(lsecond(names)), -1); break; case 3: + rel = makeRangeVar(strVal(lsecond(names)), strVal(lthird(names)), -1); rel->catalogname = strVal(linitial(names)); - rel->schemaname = strVal(lsecond(names)); - rel->relname = strVal(lthird(names)); break; default: ereport(ERROR, @@ -3774,6 +3760,8 @@ GetTempNamespaceProcNumber(Oid namespaceId) return INVALID_PROC_NUMBER; /* no such namespace? */ if (strncmp(nspname, "pg_temp_", 8) == 0) result = atoi(nspname + 8); + else if (strncmp(nspname, "pg_temp", 7) == 0) + result = MyProcNumber; else if (strncmp(nspname, "pg_toast_temp_", 14) == 0) result = atoi(nspname + 14); else diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c index 61ac172a85..e60044c9d5 10
[BUG] Assert always fails while checking whether local buffer is dirty or is exclusively locked
Hi,

Commit 00d7fb5e2e39198387ae00af8dd18b787b6a4d63 adds the BufferIsExclusiveLocked and BufferIsDirty functions. They take into account that they can also be called for temporary tables, but they both contain an assert that will always fail for temporary tables:

***
Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE));
***

They fail because the LockBuffer function simply skips acquiring the lock for temporary tables:

***
if (BufferIsLocal(buffer)) return;
***

I suppose the fix can be done as in the patch attached to this email (for REL_17_STABLE).

--
Best regards,
Daniil Davydov

From ee418040d8fb1988fdb2c79ae018e5ee9266c0e9 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Thu, 5 Dec 2024 15:07:14 +0700 Subject: [PATCH] Fix isDirty and isLocked for temp tables --- src/backend/storage/buffer/bufmgr.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 6181673095..51e90f6d23 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -2465,11 +2465,12 @@ BufferIsExclusiveLocked(Buffer buffer) int bufid = -buffer - 1; bufHdr = GetLocalBufferDescriptor(bufid); + Assert(BufferIsPinned(buffer)); + + return true; } - else - { - bufHdr = GetBufferDescriptor(buffer - 1); - } + + bufHdr = GetBufferDescriptor(buffer - 1); Assert(BufferIsPinned(buffer)); return LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), @@ -2498,11 +2499,11 @@ BufferIsDirty(Buffer buffer) else { bufHdr = GetBufferDescriptor(buffer - 1); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), + LW_EXCLUSIVE)); } Assert(BufferIsPinned(buffer)); - Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), -LW_EXCLUSIVE)); return pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY; } -- 2.43.0
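For readability (the attached diff above is flattened), here is a sketch of the resulting shape of BufferIsDirty() with the assertion applied only to shared buffers; this restates the idea of the attached patch rather than any committed fix:

***
bool
BufferIsDirty(Buffer buffer)
{
    BufferDesc *bufHdr;

    if (BufferIsLocal(buffer))
    {
        /* No content lock exists for local (temp-table) buffers, so the
         * LWLock assertion below must only be applied to shared buffers. */
        bufHdr = GetLocalBufferDescriptor(-buffer - 1);
    }
    else
    {
        bufHdr = GetBufferDescriptor(buffer - 1);
        Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
                                    LW_EXCLUSIVE));
    }

    Assert(BufferIsPinned(buffer));

    return pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY;
}
***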
Improve DDL related functionality in table AM
Hi,

The table AM has a callback for assigning a new file locator to a table (and, accordingly, creating its own storage). That is, as far as I understand, the design implies that the author of the access method has full control over the table's storage. However, there is, for example, the "ALTER COLUMN ... TYPE ..." command, inside which the storage of two relations is swapped (the source table and a temporary one). Wouldn't it be better to have a separate callback for such an operation?

I am writing my own access method, and for each table I store some information in the local memory of the process. This information is keyed by the oid of the table. During the ALTER, a temporary table is created, filled with data, and then deleted. And I have no way of knowing that the storage of this table (which I should be able to manage myself) now belongs to another table (with the exception of the POST_ALTER object access hook, but that looks more like a hack).

--
Best regards,
Daniil Davydov
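As a purely illustrative sketch of the proposal (no such callback exists in core; the name and signature below are invented for discussion), the idea is an extra table AM entry point that fires when two relations exchange storage during a rewrite:

***
#include "postgres.h"

/*
 * Hypothetical table AM callback (illustrative only): invoked from the
 * table-rewrite path, e.g. around swap_relation_files(), when the storage
 * of relid1 and relid2 is exchanged.  An AM that keeps per-relation state
 * keyed by table OID could then move that state explicitly instead of
 * inferring the swap from the POST_ALTER object access hook.
 */
typedef void (*relation_swap_storage_callback) (Oid relid1, Oid relid2);
***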
[BUG] temp_table_max_size parameter may entail an error within ReadBuffer function
Hi,

I noticed that this sequence of actions (for a temporary table) leads to an error if the temp_table_max_size parameter is set:

***
LockRelationForExtension(relation, ExclusiveLock);
buf = ReadBuffer(relation, P_NEW);
***

The error occurs during the calculation of the total temporary table size (the find_total_temp_relation_size function, which can only be called if the temp_table_max_size parameter is > 0): we try to open all of the table's indexes and add their sizes to the total size of the temporary table. But inside the relation_open function (called for an index) we hit this assert (and, of course, fail on it):

***
/*
 * We don't acquire any other heavyweight lock while holding the relation
 * extension lock. We do allow to acquire the same relation extension
 * lock more than once but that case won't reach here.
 */
Assert(!IsRelationExtensionLockHeld);
***

I suppose that the simplest way out of the situation would be to skip locking temporary tables for extension.

--
Best regards,
Daniil Davydov

From da84d916e568be0d2414303f4a1e1d01b0bc6abd Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Fri, 6 Dec 2024 16:41:53 +0700 Subject: [PATCH] Fix bug with locking temp relation for extension --- src/backend/storage/lmgr/lmgr.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c index 094522acb4..0a91109d03 100644 --- a/src/backend/storage/lmgr/lmgr.c +++ b/src/backend/storage/lmgr/lmgr.c @@ -421,6 +421,9 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode) { LOCKTAG tag; + if (relation->rd_islocaltemp) + return; + SET_LOCKTAG_RELATION_EXTEND(tag, relation->rd_lockInfo.lockRelId.dbId, relation->rd_lockInfo.lockRelId.relId); -- 2.43.0
SLRU_PAGES_PER_SEGMENT as configure parameter
Hi, The constant SLRU_PAGES_PER_SEGMENT defines the size of the SLRU segments. It is currently hardcoded in slru.h. It would be nice to be able to set this parameter during configuration (for example, by analogy with --with-segsize-blocks). What will it give us: 1) The ability to test pg_upgrade more extensively, since without it you need to generate too much data to fill a significant number of segments. 2) The number of segments is arbitrary: 32. However, I have not heard about any estimates in terms of performance; with such a patch (for REL_17_STABLE), it will be realistic to do this. -- Best regards, Daniil Davydov From d962b1a9b5e2852706e256abbb6d9ca5155842a7 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Thu, 6 Feb 2025 14:34:36 +0700 Subject: [PATCH] SLRU_PAGES_PER_SEGMENT as configure parameter --- configure | 52 src/backend/utils/misc/guc_tables.c | 12 ++ src/include/access/slru.h | 15 --- src/include/pg_config.h.in| 13 ++ .../modules/test_slru/expected/test_slru.out | 123 ++ src/test/modules/test_slru/sql/test_slru.sql | 122 +++-- src/test/modules/test_slru/test_slru.c| 6 - 7 files changed, 228 insertions(+), 115 deletions(-) diff --git a/configure b/configure index 97996b7f6b..514418e589 100755 --- a/configure +++ b/configure @@ -845,6 +845,7 @@ enable_dtrace enable_tap_tests enable_injection_points with_blocksize +with_slru_pps with_segsize with_segsize_blocks with_wal_blocksize @@ -1559,6 +1560,7 @@ Optional Packages: set table segment size in blocks [0] --with-wal-blocksize=BLOCKSIZE set WAL block size in kB [8] + --with-slru-pps=SLRUPPS set slru segment size in blocks --with-llvm build with LLVM based JIT support --without-icu build without ICU support --with-tcl build Tcl modules (PL/Tcl) @@ -3762,6 +3764,56 @@ cat >>confdefs.h <<_ACEOF _ACEOF +# +# SLRU pages per segment +# +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for slru pages per segment" >&5 +$as_echo_n "checking for slru pages per segment... " >&6; } + + + +# Check whether --with-slru-pps was given. +if test "${with_slru_pps+set}" = set; then : + withval=$with_slru_pps; + case $withval in +yes) + as_fn_error $? "argument required for --with_slru_pps option" "$LINENO" 5 + ;; +no) + as_fn_error $? "argument required for --with_slru_pps option" "$LINENO" 5 + ;; +*) + slru_pages_per_segment=$withval + ;; + esac + +else + slru_pages_per_segment=32 +fi + + +INT_MAX=2147483647 +if test "${slru_pages_per_segment}" -lt 32; then + as_fn_error $? "Invalid slru_pps value. It cannot be less, then 32." "$LINENO" 5 + exit 1 +elif test "${slru_pages_per_segment}" -gt "${INT_MAX}"; then + as_fn_error $? "Invalid slru_pps value. It cannot be greater, then INT_MAX." "$LINENO" 5 + exit 1 +elif test $((${slru_pages_per_segment} % 2)) -ne 0; then + as_fn_error $? "Invalid slru_pps value. It must be a power of 2." 
"$LINENO" 5 + exit 1 +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: ${slru_pages_per_segment}" >&5 +$as_echo "${slru_pages_per_segment}" >&6; } + + +cat >>confdefs.h <<_ACEOF +#define SLRU_PAGES_PER_SEGMENT ${slru_pages_per_segment} +_ACEOF + + + + # # Relation segment size # diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index c42fccf3fe..bf1aa3a56c 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -596,6 +596,7 @@ static int max_function_args; static int max_index_keys; static int max_identifier_length; static int block_size; +static int slru_pages_per_segment; static int segment_size; static int shared_memory_size_mb; static int shared_memory_size_in_huge_pages; @@ -3266,6 +3267,17 @@ struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"slru_pps", PGC_INTERNAL, PRESET_OPTIONS, + gettext_noop("Shows the number of slru pages per segment."), + NULL, + GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE + }, + &slru_pages_per_segment, + SLRU_PAGES_PER_SEGMENT, SLRU_PAGES_PER_SEGMENT, SLRU_PAGES_PER_SEGMENT, + NULL, NULL, NULL + }, + { {"segment_size", PGC_INTERNAL, PRESET_OPTIONS, gettext_noop("Shows the number of pages per disk file."), diff --git a/src/include/access/slru.h b/src/include/access/slru.h index 02fcb3bca5..c75fc215d1 100644 --- a/src/include/access/slru.h +++ b/src/include/access/slru.h @@ -23,21 +23,6 @@ */ #define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ) -/* - * Define SLRU segment size.
Re: SLRU_PAGES_PER_SEGMENT as configure parameter
On Thu, Feb 6, 2025 at 8:23 PM Daniel Gustafsson wrote: > > You should include the configure.ac changes and not just the generated code in > configure (which is fine to include for review, but the committer will > re-generate it regardless). Please also include the corresponding Meson > change > to keep the buildsystems in sync. Thank you for your comment. I'm applying a patch that also includes configure.ac, meson_options.txt and meson.build changes. It would also be interesting to know what you think about the idea of allowing such a parameter to be configured. -- Best regards, Daniil Davydov From 836d648297b8ede905115f4e823953ebfa273642 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Thu, 6 Feb 2025 14:34:36 +0700 Subject: [PATCH] SLRU_PAGES_PER_SEGMENT as configure parameter --- configure | 53 configure.ac | 36 + meson.build | 17 +++ meson_options.txt | 3 + src/backend/utils/misc/guc_tables.c | 12 ++ src/include/access/slru.h | 15 --- src/include/pg_config.h.in| 15 +++ .../modules/test_slru/expected/test_slru.out | 123 ++ src/test/modules/test_slru/sql/test_slru.sql | 122 +++-- src/test/modules/test_slru/test_slru.c| 6 - 10 files changed, 287 insertions(+), 115 deletions(-) diff --git a/configure b/configure index 97996b7f6b..ff15660f5f 100755 --- a/configure +++ b/configure @@ -845,6 +845,7 @@ enable_dtrace enable_tap_tests enable_injection_points with_blocksize +with_slru_pps with_segsize with_segsize_blocks with_wal_blocksize @@ -1554,6 +1555,7 @@ Optional Packages: --with-pgport=PORTNUM set default port number [5432] --with-blocksize=BLOCKSIZE set table block size in kB [8] + --with-slru-pps=SLRUPPS set slru segment size in blocks --with-segsize=SEGSIZE set table segment size in GB [1] --with-segsize-blocks=SEGSIZE_BLOCKS set table segment size in blocks [0] @@ -3762,6 +3764,57 @@ cat >>confdefs.h <<_ACEOF _ACEOF +# +# SLRU pages per segment +# +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for slru pages per segment" >&5 +$as_echo_n "checking for slru pages per segment... " >&6; } + + + +# Check whether --with-slru-pps was given. +if test "${with_slru_pps+set}" = set; then : + withval=$with_slru_pps; + case $withval in +yes) + as_fn_error $? "argument required for --with-slru-pps option" "$LINENO" 5 + ;; +no) + as_fn_error $? "argument required for --with-slru-pps option" "$LINENO" 5 + ;; +*) + slru_pages_per_segment=$withval + ;; + esac + +else + slru_pages_per_segment=32 +fi + + + +INT_MAX=2147483647 +if test "${slru_pages_per_segment}" -lt 32; then : + as_fn_error $? "Invalid slru_pps value. It cannot be less than 32." "$LINENO" 5 +fi + +if test "${slru_pages_per_segment}" -gt "${INT_MAX}"; then : + as_fn_error $? "Invalid slru_pps value. It cannot be greater than INT_MAX." "$LINENO" 5 +fi + +if test $(expr ${slru_pages_per_segment} % 2) -ne 0; then : + as_fn_error $? "Invalid slru_pps value. It must be a power of 2." "$LINENO" 5 +fi + +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: ${slru_pages_per_segment} pages" >&5 +$as_echo "${slru_pages_per_segment} pages" >&6; } + + +cat >>confdefs.h <<_ACEOF +#define SLRU_PAGES_PER_SEGMENT ${slru_pages_per_segment} +_ACEOF + + # # Relation segment size # diff --git a/configure.ac b/configure.ac index 3c76c9ebc8..2fd0e331c0 100644 --- a/configure.ac +++ b/configure.ac @@ -288,6 +288,42 @@ AC_DEFINE_UNQUOTED([BLCKSZ], ${BLCKSZ}, [ Changing BLCKSZ requires an initdb. 
]) +# +# SLRU pages per segment +# +AC_MSG_CHECKING([for slru pages per segment]) +PGAC_ARG_REQ(with, slru-pps, [SLRUPPS], [set slru segment size in blocks], + [slru_pages_per_segment=$withval], + [slru_pages_per_segment=32]) + +INT_MAX=2147483647 +AS_IF([test "${slru_pages_per_segment}" -lt 32], + [AC_MSG_ERROR([Invalid slru_pps value. It cannot be less than 32.])]) + +AS_IF([test "${slru_pages_per_segment}" -gt "${INT_MAX}"], + [AC_MSG_ERROR([Invalid slru_pps value. It cannot be greater than INT_MAX.])]) + +AS_IF([test $(expr ${slru_pages_per_segment} % 2) -ne 0], + [AC_MSG_ERROR([Invalid slru_pps value. It must be a power of 2.])]) + +AC_MSG_RESULT([${slru_pages_per_segment} pages]) + +AC_DEFINE_UNQUOTED([SLRU_PAGES_PER_SEGMENT], ${slru_pages_per_segment}, [ + SLRU segment size in pages. A page is the same BLCKSZ as is used everywhere + else in Postgres. The segmen
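To put the pg_upgrade-testing argument from the first message in perspective, a quick back-of-the-envelope calculation (assuming the default 8 kB BLCKSZ and the fact that pg_xact stores two status bits, i.e. four transactions, per byte):

***
/* With the hardcoded SLRU_PAGES_PER_SEGMENT = 32, one pg_xact segment
 * covers 32 * 8192 * 4 = 1,048,576 transactions, so a pg_upgrade test
 * has to consume roughly a million transactions for every additional
 * segment it wants to exercise. */
#define BLCKSZ                  8192    /* default block size */
#define CLOG_XACTS_PER_BYTE     4       /* 2 status bits per transaction */
#define SLRU_PAGES_PER_SEGMENT  32      /* current hardcoded value */
#define XACTS_PER_CLOG_SEGMENT \
    (SLRU_PAGES_PER_SEGMENT * BLCKSZ * CLOG_XACTS_PER_BYTE)
***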
[BUG] Possible occurrence of segfault in ecpg test
Hi, The src/interfaces/ecpg/test/sql/bytea.pgc file contains the following code : *** init(); exec sql truncate test; exec sql insert into test values(:send_buf[0], :send_buf[1]); exec sql insert into test values(:send_buf[0], :send_buf[1]); exec sql select data1 into :recv_vlen_buf from test; dump_binary(recv_vlen_buf[0].arr, recv_vlen_buf[0].len, 0); dump_binary(recv_vlen_buf[1].arr, recv_vlen_buf[1].len, 0); free(recv_vlen_buf); *** recv_vlen_buf is initialized in the following way : *** bytea recv_vlen_buf[][DATA_SIZE]; recv_vlen_buf = NULL *** Thus, if the program behaves in an unexpected way and the transaction is aborted before it executes the "select data1 into :recv_vlen_buf from test" query, dump_binary will refer to a null pointer. So, instead of an error message, the user will see a segfault. I think that in all such cases it is worth adding some checks into .pgc and .c files (like in attached patch) -- Best regards, Daniil Davydov From 36b01661fd6b9083b4cd1620020ebf4f737740b9 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Tue, 10 Dec 2024 12:34:59 +0700 Subject: [PATCH] Fix refer to a null pointer --- src/interfaces/ecpg/test/expected/sql-bytea.c | 62 ++- .../ecpg/test/expected/sql-bytea.stderr | 62 +-- src/interfaces/ecpg/test/sql/bytea.pgc| 10 ++- 3 files changed, 71 insertions(+), 63 deletions(-) diff --git a/src/interfaces/ecpg/test/expected/sql-bytea.c b/src/interfaces/ecpg/test/expected/sql-bytea.c index 8338c6008d..c0cd962d90 100644 --- a/src/interfaces/ecpg/test/expected/sql-bytea.c +++ b/src/interfaces/ecpg/test/expected/sql-bytea.c @@ -254,37 +254,41 @@ if (sqlca.sqlcode < 0) sqlprint();} if (sqlca.sqlcode < 0) sqlprint();} #line 90 "bytea.pgc" - dump_binary(recv_vlen_buf[0].arr, recv_vlen_buf[0].len, 0); - dump_binary(recv_vlen_buf[1].arr, recv_vlen_buf[1].len, 0); - free(recv_vlen_buf); + + if (recv_vlen_buf != NULL) + { + dump_binary(recv_vlen_buf[0].arr, recv_vlen_buf[0].len, 0); + dump_binary(recv_vlen_buf[1].arr, recv_vlen_buf[1].len, 0); + free(recv_vlen_buf); + } /* Test for dynamic sql statement with normal host variable, indicator */ init(); { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "truncate test", ECPGt_EOIT, ECPGt_EORT); -#line 97 "bytea.pgc" +#line 101 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 97 "bytea.pgc" +#line 101 "bytea.pgc" { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_execute, "ins_stmt", ECPGt_bytea,&(send_buf[0]),(long)512,(long)1,sizeof(struct bytea_1), ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_bytea,&(send_buf[1]),(long)512,(long)1,sizeof(struct bytea_1), ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EOIT, ECPGt_EORT); -#line 98 "bytea.pgc" +#line 102 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 98 "bytea.pgc" +#line 102 "bytea.pgc" { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_execute, "sel_stmt", ECPGt_EOIT, ECPGt_bytea,&(recv_buf[0]),(long)DATA_SIZE,(long)1,sizeof(struct bytea_2), ECPGt_int,&(ind[0]),(long)1,(long)1,sizeof(int), ECPGt_bytea,&(recv_short_buf),(long)DATA_SIZE - LACK_SIZE,(long)1,sizeof(struct bytea_4), ECPGt_int,&(ind[1]),(long)1,(long)1,sizeof(int), ECPGt_EORT); -#line 99 "bytea.pgc" +#line 103 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 99 "bytea.pgc" +#line 103 "bytea.pgc" dump_binary(recv_buf[0].arr, recv_buf[0].len, ind[0]); dump_binary(recv_short_buf.arr, recv_short_buf.len, ind[1]); @@ -292,81 +296,81 @@ if (sqlca.sqlcode < 0) sqlprint();} /* Test for dynamic sql statement with sql descriptor */ init(); { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "truncate test", ECPGt_EOIT, 
ECPGt_EORT); -#line 105 "bytea.pgc" +#line 109 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 105 "bytea.pgc" +#line 109 "bytea.pgc" { ECPGset_desc(__LINE__, "idesc", 1,ECPGd_data, ECPGt_bytea,&(send_buf[0]),(long)512,(long)1,sizeof(struct bytea_1), ECPGd_EODT); -#line 106 "bytea.pgc" +#line 110 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 106 "bytea.pgc" +#line 110 "bytea.pgc" { ECPGset_desc(__LINE__, "idesc", 2,ECPGd_data, ECPGt_bytea,&(send_buf[1]),(long)512,(long)1,sizeof(struct bytea_1), ECPGd_EODT); -#line 107 "bytea.pgc" +#line 111 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 107 "bytea.pgc" +#line 111 "bytea.pgc" { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_execute, "ins_stmt", ECPGt_descriptor, "idesc", 1L, 1L, 1L, ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EOIT, ECPGt_EORT); -#line 108 "bytea.pgc" +#line 112 "bytea.
Re: Repeatable read transaction doesn't see dropped table
On Mon, Dec 23, 2024 at 3:17 PM David G. Johnston wrote: > The quoted section describes how two consecutive select queries will see the > same data. Your example shows how a single query behaves in isolation. The > “as the first query saw it” is fundamentally important since until it > successfully executes there are no locks being held restricting the changing > of non-data structural aspects of the database. In short, the snapshot > doesn’t include an object until it is requested. It’s a repeatable read, not > a frozen point-in-time read. The performance implications for the later > would be unacceptable. > > Thus, the behavior is expected and needed as-is; but I would say that the > concurrency control chapter of the documentation is one of the harder to > actually learn and understand. It is a challenging topic, so I get why. In > its defense, the commentary surrounding the regarding control record and > detail does try to make this distinction clear to the reader. YMMV as to its > effectiveness in this regard. > > David J. > Thank you for your comments. OK, I agree that there are no contradictions from the point of view of the source code. But essentially, snapshot is just a range of xids, and we must not see changes of transactions that started after the snapshot was taken. As far as I understand, documentation says that repeatable read transactions take a snapshot at first non-transaction-control statement, and this snapshot remains relevant for all statements within the transaction. My example shows that the second session sees changes that have been made by a transaction that started after snapshot creation (and it sees them only because of cache optimization). It might be unexpected behavior for users. I also agree that taking it into account will reduce performance, but maybe we can clarify this aspect in documentation (?) -- Best regards, Daniil Davydov
Repeatable read transaction doesn't see dropped table
Hi,

The documentation for PostgreSQL 17 says the following: "query in a repeatable read transaction sees a snapshot as of the start of the first non-transaction-control statement in the transaction, not as of the start of the current statement within the transaction". But I noticed this behavior (REL_17_STABLE):

***
SESSION 1: create two user tables and fill them with data
CREATE TABLE test (id INT);
CREATE TABLE test_1 (id INT);
INSERT INTO test VALUES (1);
INSERT INTO test_1 VALUES (1);

SESSION 2: begin a transaction and allow it to take a snapshot
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM test_1;

SESSION 1: drop the table that was not accessed from the second session
DROP TABLE test;

SESSION 2:
SELECT * FROM test;
***

If I'm not mistaken, the second transaction must see all the data in table test (according to the documentation), but an error occurs:

***
ERROR: relation "test" does not exist
LINE 1: select * from test;
***

We get this behavior because the second session searches for the table's oid in the cache via RelnameGetRelid, but the first session has already invalidated that entry. It seems like a bug to me, so I suggest additionally scanning pg_class in such cases. I would like to know your opinion.

--
Best regards,
Daniil Davydov
Accessing an invalid pointer in BufferManagerRelation structure
Hi, Postgres allows us to pass BufferManagerRelation structure to functions in two ways : BMR_REL and BMR_SMGR. In case we use BMR_REL, the "smgr" field of structure initialized this way : *** if (bmr.smgr == NULL) { bmr.smgr = RelationGetSmgr(bmr.rel); bmr.relpersistence = bmr.rel->rd_rel->relpersistence; } *** Thus, we set the "smgr" field only once. But in case of frequent cache invalidation (for example with debug_discard_caches parameter enabled), this pointer may become invalid (because RelationCloseSmgr will be called). I have not found any places in the current code where this could happen. But if (just for example) we add acquiring of new lock into ExtendBufferedRelLocal or ExtendBufferedRelShared, relation cache will be invalidated (inside AcceptInvalidationMessages). I would suggest adding a special macro to access the "smgr" field (check attached patch for REL_17_STABLE). What do you think about this? -- Best regards, Daniil Davydov From 828e798d2a4d42aa901f4c02e9a0c76ebb057c41 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 27 Jan 2025 18:36:10 +0700 Subject: [PATCH] Add marcos for safety access to smgr field of BufferManagerRelation structure --- src/backend/storage/buffer/bufmgr.c | 54 --- src/backend/storage/buffer/localbuf.c | 8 ++-- src/include/storage/bufmgr.h | 13 +++ 3 files changed, 41 insertions(+), 34 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 6181673095..d2c9297edb 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -886,11 +886,8 @@ ExtendBufferedRelBy(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_by > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } return ExtendBufferedRelCommon(bmr, fork, strategy, flags, extend_by, InvalidBlockNumber, @@ -922,11 +919,8 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_to != InvalidBlockNumber && extend_to > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } /* * If desired, create the file if it doesn't exist. If @@ -934,15 +928,15 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * an smgrexists call. */ if ((flags & EB_CREATE_FORK_IF_NEEDED) && - (bmr.smgr->smgr_cached_nblocks[fork] == 0 || - bmr.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) && - !smgrexists(bmr.smgr, fork)) + (BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == 0 || + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == InvalidBlockNumber) && + !smgrexists(BMR_GET_SMGR(bmr), fork)) { LockRelationForExtension(bmr.rel, ExclusiveLock); /* recheck, fork might have been created concurrently */ - if (!smgrexists(bmr.smgr, fork)) - smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY); + if (!smgrexists(BMR_GET_SMGR(bmr), fork)) + smgrcreate(BMR_GET_SMGR(bmr), fork, flags & EB_PERFORMING_RECOVERY); UnlockRelationForExtension(bmr.rel, ExclusiveLock); } @@ -952,13 +946,13 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * kernel. */ if (flags & EB_CLEAR_SIZE_CACHE) - bmr.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber; + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] = InvalidBlockNumber; /* * Estimate how many pages we'll need to extend by. This avoids acquiring * unnecessarily many victim buffers. 
*/ - current_size = smgrnblocks(bmr.smgr, fork); + current_size = smgrnblocks(BMR_GET_SMGR(bmr), fork); /* * Since no-one else can be looking at the page contents yet, there is no @@ -1002,7 +996,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, if (buffer == InvalidBuffer) { Assert(extended_by == 0); - buffer = ReadBuffer_common(bmr.rel, bmr.smgr, 0, + buffer = ReadBuffer_common(bmr.rel, BMR_GET_SMGR(bmr), 0, fork, extend_to - 1, mode, strategy); } @@ -2144,10 +2138,10 @@ ExtendBufferedRelCommon(BufferManagerRelation bmr, BlockNumber first_block; TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork, - bmr.smgr->smgr_rlocator.locator.spcOid, - bmr.smgr->smgr_rlocator.locator.dbOid, - bmr.smgr->smgr_rlocator.locator.relNumber, - bmr.smgr->smgr_rlocator.backend, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.spcOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.dbOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.relNumber, + BMR_GET_SMGR(bmr)->smgr_rlocator.backend, extend_by); if (bmr.relpersistence == RELPERSIS
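For readers skimming the flattened diff above: the core of the proposal is replacing direct use of the cached bmr.smgr pointer with an accessor. One possible definition of such an accessor (a sketch of the idea, not necessarily the exact macro in the attached patch) could be:

***
/* Re-derive the SMgrRelation from the Relation on every access, so a
 * relcache invalidation (which calls RelationCloseSmgr()) cannot leave
 * the BufferManagerRelation holding a dangling pointer.  Callers that
 * constructed the struct via BMR_SMGR() keep using their explicit smgr. */
#define BMR_GET_SMGR(bmr) \
    ((bmr).rel != NULL ? RelationGetSmgr((bmr).rel) : (bmr).smgr)
***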
Improve consistency checks in tbm_prepare_shared_iterate
Hi,

Currently, the iteration over a TIDBitmap contains this code (REL_16_STABLE):

***
while ((page = pagetable_iterate(tbm->pagetable, &i)) != NULL)
{
    idx = page - ptbase->ptentry;
    if (page->ischunk)
        ptchunks->index[nchunks++] = idx;
    else
        ptpages->index[npages++] = idx;
}
Assert(npages == tbm->npages);
Assert(nchunks == tbm->nchunks);
***

The two asserts at the end seem to come too late to me, because if (for example) nchunks > tbm->nchunks (due to some other bug), we have already accessed memory beyond the valid chunk and overwritten it. If we want to monitor the correctness of these variables, it might be better to add a few checks, as in the attached patch. I'm not sure if the comment in the error message is written correctly, but first I would like to hear your opinion.

--
Best regards,
Daniil Davydov

From 04e41969ef9ea2c96af218a5519b5efb19baeed5 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Wed, 25 Dec 2024 19:32:55 +0700 Subject: [PATCH] Add new checks for tidbitmap scan --- src/backend/nodes/tidbitmap.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c index 29a18584410..34a015e5ea7 100644 --- a/src/backend/nodes/tidbitmap.c +++ b/src/backend/nodes/tidbitmap.c @@ -827,10 +827,17 @@ tbm_prepare_shared_iterate(TIDBitmap *tbm) while ((page = pagetable_iterate(tbm->pagetable, &i)) != NULL) { idx = page - ptbase->ptentry; -if (page->ischunk) +if (page->ischunk && (nchunks + 1 <= tbm->nchunks)) ptchunks->index[nchunks++] = idx; -else +else if (npages + 1 <= tbm->npages) ptpages->index[npages++] = idx; +else + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("invalid chunks/pages num in tidbitmap : %d" + " chunks met of %d available, %d pages met " + "of %d available", nchunks, tbm->nchunks, + npages, tbm->npages))); } Assert(npages == tbm->npages); -- 2.43.0
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 1:15 PM David G. Johnston wrote: > > It’s seems like the bug “session A can read and write to session B’s tables” > has gotten lost in this thread that pertains to something related but > different. Solving the bug is not going to involve adding a new GUC. I don't think it's worth putting this in a separate discussion. Now everything is moving towards the fact that the superuser will be prohibited from changing the temporary tables of other sessions. In fact, he can't do it anyway (except for DROP command) - see [1]. But now the error for INSERT, UPDATE, DELETE and SELECT commands may not surface immediately due to errors in the code. The only question now is whether superuser should be allowed to DROP these other temp tables. Since opinions are divided, I suggested adding a GUC that will only control this feature. > If they specify the precise pg_temp schema to affect they likely didn’t make > a mistake. Yep, I agree. However, the features of the postgres kernel do not allow the superuser to correctly perform INSERT, UPDATE, DELETE, SELECT operations, because temporary table's pages cannot be stored in shared buffers. > If instead we are discussing the owner of the temporary table there is > probably a discussion to be had and decision to be documented somewhere - > maybe that central place of testing being wished for. As far as I understand, only superuser and table's owner (within session) can access the temp tables of session. We want CRUD operations to be performed only by the owner. -- Best regards, Daniil Davydov [1] https://www.postgresql.org/message-id/CAJDiXgj6TBzn=6ezx7+9bna9hpbitbu+muv-n3mhen_zs3n...@mail.gmail.com
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 8:29 PM Steven Niu wrote: > > If the (relation->relpersistence == RELPERSISTENCE_TEMP) can ensure the > myTempNamespace is always valid, then my comment can be ignored. No, even if we see a temporary table in RangeVarGetRelidExtended, MyTempNamespace still can be `InvalidOid` (I mentioned it in the previous letter). Thus, comment like "this probably can't happen?" should be removed. > Otherwise I think the modified RangeVarGetRelidExtended() should keep > check of myTempNamespace, like this: > > if (relation->relpersistence == RELPERSISTENCE_TEMP) > { >Oid namespaceId; > > if (!OidIsValid(myTempNamespace)) > relId = InvalidOid; /* this probably can't happen? */ > ... OK, let's assume that MyTempNamespace == InvalidOid and caller specified "pg_temp_N" in his query. In this case we want to throw an error, because access to the other temp tables is prohibited. If we keep code like "if (!OidIsValid(myTempNamespace)) => relId = InvalidOid", eventually the caller will get an error "relation ... does not exist". Yes, we have saved the caller from making mistakes, but we are giving the wrong error message. In my realization the caller will get the correct error message, and I still think that we should keep v3 patch logic. -- Best regards, Daniil Davydov
Re: Forbid to DROP temp tables of other sessions
Hi,

On Tue, Mar 18, 2025 at 2:36 AM David G. Johnston wrote:
>
> "I want to give a better error message" is not a good enough reason to change
> this long-standing behavior in a back-patchable bug fix.
> and I do not agree that it is an improvement worth making in HEAD.

OK, in this case I'd rather agree.

On Tue, Mar 18, 2025 at 3:30 AM David G. Johnston wrote:
>
> I sound like a broken record but this is a bug fix; introducing a GUC and
> changing unrelated behaviors is not acceptable in a back-patch.
>
> Superusers have been able to drop the temporary tables of other sessions for
> a long time - now is not the place to change that behavior.

Well, we can keep this option for the superuser, but we need to explicitly indicate that we are inside the DROP operation. Please see the v5 patch. It is a bit experimental:
1) The original behavior of the LookupExplicitNamespace function has been restored.
2) A new RVROption has been added to indicate that we are inside the DROP operation. Thus, a superuser can drop other sessions' temp tables (a rough sketch of this idea is shown at the end of this message).

> Separately:
>
> Is the general logic here that we assume r->relpersistence is
> RELPERSISTENCE_PERMANENT until and unless we can prove it to be
> RELPERSISTENCE_TEMP?
>
> I'm trying to figure out whether there is still an issue when dealing with an
> unqualified name that would be found in the temporary schema.
> We've obviously already marked it permanent since we didn't have pg_temp in
> the query to inform us otherwise.
> But if we are changing permanent to temporary later because of search_path
> resolution then why did we have to get it right in the case where pg_temp is
> specified?
> I question whether "persistence" is something that gram.y should be dealing
> with at all. Shouldn't that property be solely determined in post-parse
> analysis via catalog lookup?

Hm, let's dive into this question. Why do I consider the original behavior to be incorrect?

1) If the schema is not specified, then gram.y marks the table as PERMANENT.
1.1) If we are looking for our own temporary table, then "pg_temp_N" is present in search_path, and RelnameGetRelid will return us this table.
1.2) If we are looking for another session's temporary table, then RelnameGetRelid will return `InvalidOid` and we will get a "relation not found" error.

2) If the schema is specified, then gram.y ALSO marks the table as PERMANENT.
2.1) If we are looking for our own temporary table, then LookupExplicitNamespace will return us MyTempNamespace and we can get our table without a search_path lookup.
2.2) If we are looking for another session's temporary table, then LookupExplicitNamespace will return some valid oid, and we can get the other temp table without a search_path lookup. Then we will perform whatever operation was specified, assuming that we are working with a PERMANENT table. This is a bug.

Let's summarize. If the schema is not specified, we can safely mark the table as PERMANENT, because we can always get the table from search_path (and if the schema is not specified, then it is obvious that the user wants to refer specifically to his own table). BUT, if the schema is specified, we must know that the given relation is TEMP before calling RangeVarGetRelidExtended (otherwise, there will be the error which I wrote about in paragraph 2.2).
I hope that I managed to clarify this issue. Since the "pg_temp_" prefix is reserved by the postgres kernel, we can definitely say that we have encountered a temporary table when we see this prefix. IMO there is no problem with gram.y doing it.

> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index 271ae26cbaf..f68948d475c 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -19424,7 +19424,11 @@ makeRangeVarFromAnyName(List *names, int position,
> core_yyscan_t yyscanner)
> break;
> }
>
> - r->relpersistence = RELPERSISTENCE_PERMANENT;
> + if (r->schemaname && strncmp(r->schemaname, "pg_temp", 7) == 0)
> + r->relpersistence = RELPERSISTENCE_TEMP;
> + else
> + r->relpersistence = RELPERSISTENCE_PERMANENT;
> +
> r->location = position;
>
> The qualified name variant is fine since r->schemaname must be present by
> definition. Passing through the same if block, once with schemaname null (in
> makeRangeVar) and once with it populated (end of
> makeRangeVarFromQualifiedName) is a bit annoying.
>
> makeRangeVar has the same prob
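Since the v5 patch itself is not quoted in this thread excerpt, here is a rough sketch of the RVROption idea mentioned above (the flag name and value are illustrative only):

***
/* namespace.h: hypothetical flag for RangeVarGetRelidExtended(), next to
 * RVR_MISSING_OK, RVR_NOWAIT and RVR_SKIP_LOCKED. */
#define RVR_OTHER_TEMP_OK   (1 << 3)

/* namespace.c, in the RELPERSISTENCE_TEMP branch: only callers that pass
 * the flag (i.e. the DROP path) may resolve another backend's temp table. */
if (namespaceId != myTempNamespace && (flags & RVR_OTHER_TEMP_OK) == 0)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("could not access temporary relations of other sessions")));
***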
Re: Forbid to DROP temp tables of other sessions
Hi, On Wed, Mar 19, 2025 at 6:32 AM David G. Johnston wrote: > > I'm probably done trying to help with this one - it's beyond my ability and > desire to contribute to meaningfully. It seems to need to be escalated if > the regression has a chance of being understood and fixed. > Some good candidates for helping out are copied already so I'm just hoping > one or more of them chimes in. I'll repeat my suggestion to start a new > thread and let this one die. The answer to $subject is no. Well, the name of the thread has stopped reflecting what is being discussed here. I may start a new discussion in the near future, but as part of this discussion I would like to make sure that people agree with the statements from my previous letter. Thanks for your help. -- Best regards, Daniil Davydov
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
Hi,

Recently I took more careful measurements of the performance. I compared three branches with each other: HEAD, Patched, and Patched with tuplestore. Here are the results:

1) Test case: matview creation test attached in the email from Jingtang Zhang.
10 measurements for each branch. Result in wall clock execution time:
HEAD                     30.532 +- 0.59  seconds elapsed
Patched                  20.454 +- 0.114 seconds elapsed
Patched with tuplestore  19.653 +- 0.111 seconds elapsed

2)
-- init.sql
drop table test_insert;
vacuum;
checkpoint;
create table test_insert(i int, f float);

-- iowrite.sql
insert into test_insert select g, (g % 100) / 100.0 from generate_series(1, 100) as g;

Test case: pgbench -f iowrite.sql -n -j 4 -c 10 -T 40
5 measurements for each branch. Result in tps:
HEAD                     1.025 +- 0.009
Patched                  2.923 +- 0.032
Patched with tuplestore  2.987 +- 0.011

P.S. I cannot find a commitfest entry for this patch. Should we add it there?

--
Best regards,
Daniil Davydov
Re: Forbid to DROP temp tables of other sessions
Hi, I see that the presence of isolation tests in the patch is controversial. First things first, let's concentrate on fixing the bug. I attach a new version of patch (for `master` branch) to this letter. It contains better comments and a few small improvements. P.S. Sorry for bad formatting in previous letter (idk how to fix it in gmail client) -- Best regards, Daniil Davydov From e5f568ece877a33d307ac6480cef9dd088412aae Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 17 Mar 2025 15:43:15 +0700 Subject: [PATCH v2] Fix accessing other sessions temp tables --- src/backend/catalog/namespace.c | 71 +++-- src/backend/nodes/makefuncs.c | 6 ++- src/backend/parser/gram.y | 11 - 3 files changed, 57 insertions(+), 31 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index d97d632a7ef..688819299af 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -499,28 +499,40 @@ RangeVarGetRelidExtended(const RangeVar *relation, LOCKMODE lockmode, */ if (relation->relpersistence == RELPERSISTENCE_TEMP) { - if (!OidIsValid(myTempNamespace)) -relId = InvalidOid; /* this probably can't happen? */ - else - { -if (relation->schemaname) -{ - Oid namespaceId; + Oid namespaceId; - namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); + if (relation->schemaname) + { +namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); +if (namespaceId != myTempNamespace) +{ /* - * For missing_ok, allow a non-existent schema name to - * return InvalidOid. + * We don't allow users to access temp tables of other + * sessions (consider adding GUC for to allow to DROP such + * tables?). */ - if (namespaceId != myTempNamespace) - ereport(ERROR, -(errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("temporary tables cannot specify a schema name"))); + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("could not access temporary relations of other sessions"))); } + } + else + { +namespaceId = myTempNamespace; -relId = get_relname_relid(relation->relname, myTempNamespace); +/* + * If this table was recognized as temporary, it means that we + * found it because backend's temporary namespace was specified + * in search_path. Thus, MyTempNamespace must contain valid oid. + */ +Assert(OidIsValid(namespaceId)); } + + if (missing_ok && !OidIsValid(namespaceId)) +relId = InvalidOid; + else +relId = get_relname_relid(relation->relname, namespaceId); } else if (relation->schemaname) { @@ -3387,17 +3399,18 @@ LookupExplicitNamespace(const char *nspname, bool missing_ok) Oid namespaceId; AclResult aclresult; - /* check for pg_temp alias */ + /* + * If namespace specified in 'pg_temp' format (without owner backend id + * specification), we assume that this is our temporary namespace. + */ if (strcmp(nspname, "pg_temp") == 0) { - if (OidIsValid(myTempNamespace)) - return myTempNamespace; + if (!OidIsValid(myTempNamespace)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_SCHEMA_NAME), + errmsg("pg_temp was specified, but you have not any temporary relations"))); - /* - * Since this is used only for looking up existing objects, there is - * no point in trying to initialize the temp namespace here; and doing - * so might create problems for some callers --- just fall through. 
- */ + return myTempNamespace; } namespaceId = get_namespace_oid(nspname, missing_ok); @@ -3553,21 +3566,19 @@ get_namespace_oid(const char *nspname, bool missing_ok) RangeVar * makeRangeVarFromNameList(const List *names) { - RangeVar *rel = makeRangeVar(NULL, NULL, -1); + RangeVar *rel; switch (list_length(names)) { case 1: - rel->relname = strVal(linitial(names)); + rel = makeRangeVar(NULL, strVal(linitial(names)), -1); break; case 2: - rel->schemaname = strVal(linitial(names)); - rel->relname = strVal(lsecond(names)); + rel = makeRangeVar(strVal(linitial(names)), strVal(lsecond(names)), -1); break; case 3: + rel = makeRangeVar(strVal(lsecond(names)), strVal(lthird(names)), -1); rel->catalogname = strVal(linitial(names)); - rel->schemaname = strVal(lsecond(names)); - rel->relname = strVal(lthird(names)); break; default: ereport(ERROR, @@ -3774,6 +3785,8 @@ GetTempNamespaceProcNumber(Oid namespaceId) return INVALID_PROC_NUMBER; /* no such namespace? */ if (strncmp(nspname, "pg_temp_", 8) == 0) result = atoi(nspname + 8); + else if (strncmp(nspname, "pg_temp", 7) == 0) + result = MyProcNumber; else if (strncmp(nspname, "pg_toast_temp_", 14) == 0)
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 5:33 PM Steven Niu wrote: > > I mean RangeVarGetRelidExtended(), you deleted the following code: > > if (!OidIsValid(myTempNamespace)) > relId = InvalidOid; /* this probably can't happen? */ Hm, I got it. Let's start with the fact that the comment "this probably can't happen?" is incorrect. Even if we don't have a temporary namespace in our session, we still can specify "pg_temp_N" in the psql query. Next, if relation->schemaname is pg_temp, then we firstly call LookupExplicitNamespace, where you can find if (!OidIsValid(myTempNamespace)) check. -- Best regards, Daniil Davydov
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 4:48 PM Steven Niu wrote: > > 1. namespace.c, if relation->schemaname is pg_temp but myTempNamespace > isn't set, the error information might be misleading. Consider checking > OidIsValid(myTempNamespace) first. Could you please clarify exactly which place in the code we are talking about? I think we handle this case in the LookupExplicitNamespace call (with all appropriate error information). > > 2."you have not any temporary relations" --> "you have no any temporary > relations" I am not an English speaker, but it seems that "have not" would be more correct. Someone has to judge us :) > > 3. Regarding to the code "strncmp(nspname, "pg_temp", 7)", is it ok when > the nspname contains something like "pg_temp_1234"? I think we should > use strcmp instead of strncmp for exact matching. Great catch! I'll fix it. Please, see v3 patch. -- Best regards, Daniil Davydov From 74dc11678c55eb80ad48ad44fc77d770620ea0a7 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 17 Mar 2025 15:43:15 +0700 Subject: [PATCH v3] Fix accessing other sessions temp tables --- src/backend/catalog/namespace.c | 71 +++-- src/backend/nodes/makefuncs.c | 6 ++- src/backend/parser/gram.y | 11 - 3 files changed, 57 insertions(+), 31 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index d97d632a7ef..b19c496a0ec 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -499,28 +499,40 @@ RangeVarGetRelidExtended(const RangeVar *relation, LOCKMODE lockmode, */ if (relation->relpersistence == RELPERSISTENCE_TEMP) { - if (!OidIsValid(myTempNamespace)) -relId = InvalidOid; /* this probably can't happen? */ - else - { -if (relation->schemaname) -{ - Oid namespaceId; + Oid namespaceId; - namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); + if (relation->schemaname) + { +namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); +if (namespaceId != myTempNamespace) +{ /* - * For missing_ok, allow a non-existent schema name to - * return InvalidOid. + * We don't allow users to access temp tables of other + * sessions (consider adding GUC for to allow to DROP such + * tables?). */ - if (namespaceId != myTempNamespace) - ereport(ERROR, -(errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("temporary tables cannot specify a schema name"))); + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("could not access temporary relations of other sessions"))); } + } + else + { +namespaceId = myTempNamespace; -relId = get_relname_relid(relation->relname, myTempNamespace); +/* + * If this table was recognized as temporary, it means that we + * found it because backend's temporary namespace was specified + * in search_path. Thus, MyTempNamespace must contain valid oid. + */ +Assert(OidIsValid(namespaceId)); } + + if (missing_ok && !OidIsValid(namespaceId)) +relId = InvalidOid; + else +relId = get_relname_relid(relation->relname, namespaceId); } else if (relation->schemaname) { @@ -3387,17 +3399,18 @@ LookupExplicitNamespace(const char *nspname, bool missing_ok) Oid namespaceId; AclResult aclresult; - /* check for pg_temp alias */ + /* + * If namespace specified in 'pg_temp' format (without owner backend id + * specification), we assume that this is our temporary namespace. 
+ */ if (strcmp(nspname, "pg_temp") == 0) { - if (OidIsValid(myTempNamespace)) - return myTempNamespace; + if (!OidIsValid(myTempNamespace)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_SCHEMA_NAME), + errmsg("pg_temp was specified, but you have not any temporary relations"))); - /* - * Since this is used only for looking up existing objects, there is - * no point in trying to initialize the temp namespace here; and doing - * so might create problems for some callers --- just fall through. - */ + return myTempNamespace; } namespaceId = get_namespace_oid(nspname, missing_ok); @@ -3553,21 +3566,19 @@ get_namespace_oid(const char *nspname, bool missing_ok) RangeVar * makeRangeVarFromNameList(const List *names) { - RangeVar *rel = makeRangeVar(NULL, NULL, -1); + RangeVar *rel; switch (list_length(names)) { case 1: - rel->relname = strVal(linitial(names)); + rel = makeRangeVar(NULL, strVal(linitial(names)), -1); break; case 2: - rel->schemaname = strVal(linitial(names)); - rel->relname = strVal(lsecond(names)); + rel = makeRangeVar(strVal(linitial(names)), strVal(lsecond(names)), -1); brea
Re: Forbid to DROP temp tables of other sessions
Hi, On Sun, Mar 16, 2025 at 7:53 PM vignesh C wrote: > I noticed that the following Andrey's comment regarding the isolation > test from [1] and Andres's comment from [2] are pending. I'm changing > the commitfest entry to Waiting on Author, please provide an updated > patch and update it to Needs review. Thanks for reading it. I saw [2] and introduced a possible solution in my last email. In short: we can have a GUC variable that will permit a superuser to drop temp tables of other sessions. Thus, we have a single location in the code that will check whether we can perform operations on other sessions' temp tables. As far as I understand, this is exactly what Andres wrote about. Also, it is difficult for me to express my opinion on [1] at the moment. I can say for sure that the tests will change when we agree on the behavior of the code. Therefore, I suggest postponing the resolution of this issue. > I suggest adding a GUC that will allow superuser to do this Waiting for your feedback on this issue :) -- Best regards, Daniil Davydov
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 1:52 PM David G. Johnston wrote: > > My understanding is the limitation of an owner of a temporary relation in one > session being disallowed to alter its contents from another session is an > implementation consequence, and not some fundamental model restriction. I would say this: working with temporary tables (in the postgres kernel) is so very different from working with regular heap tables that such a limitation can be considered fundamental. We simply will not be able to organize coordinated work on the temporary table from different OS processes (for INSERT, UPDATE, DELETE). > Minimally informed thinking, associate the specific pg_temp namespace with a > procid. Where this limitation exists, which seems like middle management, > compare the proc of the namespace to the executor. Pass the role and also an > enum of action type (CRUD, drop, truncate, lock, etc…). If the procs match > all good. Superuser cannot bypass CRUD and similar as that is the limitation > being implemented here. And the owner cannot bypass anything (exceptions > could be added as desired). > > Centralizing things a bit though…maybe something like the relcache (for > namespaces…) so you cannot even get a handle on the namespace if you don’t > supply the info and pass the checks. Don’t really know enough to say > where/how to implement “if you forget to call this check all commands that > can reference tables will fail”. I'm sorry, I probably don't quite understand what this is about, so I'll just describe how it works now. If a superuser wants to access other temp table, he must specify schema_name in request (for example : INSERT INTO pg_temp_N.test .). N is the id of the owner process. Thus, postgres will call RangeVarGetRelidExtended to find the given relation's oid. It is at this point that we can step in and check whether the caller can work with the specified schema. It is elementary to understand that the schema does belong to another session. Right now, there is a bug in the code that mistakenly recognizes the table in such a case as permanent (not temporary), so we cannot do what I just described. So, we have to get rid of this bug and decide whether we reserve the right for the superuser to DROP such tables. I hope this remark will be useful. -- Best regards, Daniil Davydov
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
Hi, On Sun, Apr 6, 2025 at 8:55 PM Jingtang Zhang wrote: > > It was quite a while since I last looked at the patch. I've tested it again, > and still get regression on patched version where a table has many columns. > And it is totally CPU-bounded on tts_virtual_copyslot. > > Unpatched version: > 1 col: > Time: 8909.714 ms (00:08.910) > Time: 8803.579 ms (00:08.804) > Time: 8600.415 ms (00:08.600) > 32 cols: > Time: 12911.699 ms (00:12.912) > Time: 13543.491 ms (00:13.543) > Time: 13325.368 ms (00:13.325) > > Patched version: > 1 col: > Time: 3532.841 ms (00:03.533) > Time: 3598.223 ms (00:03.598) > Time: 3515.858 ms (00:03.516) > 32 cols: > Time: 35647.724 ms (00:35.648) > Time: 35596.233 ms (00:35.596) > Time: 35669.106 ms (00:35.669) > Hm, maybe I didn't choose the best way to measure performance. Can you please share how you do it? > I've tested your patch with tuplestore and found the regression does not exist > anymore, but I haven't look deep inside it. > > Patched version (with tuplestore): > 1 col: > Time: 3500.502 ms (00:03.501) > Time: 3486.886 ms (00:03.487) > Time: 3514.233 ms (00:03.514) > 32 cols: > Time: 10375.391 ms (00:10.375) > Time: 10248.256 ms (00:10.248) > Time: 10248.289 ms (00:10.248) > > It seems to be a good idea if there is no other issue with your patch. As far as I understand, the use of multi inserts for queries like "INSERT INTO ... SELECT FROM" is not discussed here anymore due to the fact that in such cases we will have to take into account the volatile functions and ROW triggers. I've been thinking about this for a while and made a patch as an experiment. The principles that the patch works on are listed below. 1) Since performance decreases for single INSERTs (within a multi inserts mechanism), I designed this feature as an option for the table. Thus, if the user knows that he will perform a lot of inserts on the table, he can specify "WITH (append_optimized=true)". 2) The availability of volatile functions is monitored during the construction of a subtree for a ModifyTable node. I'm not that familiar with the query plan construction mechanism, but it seems to me that this way we can track any occurrence of volatile functions. Of course, most volatile functions don't stop us from using multi inserts, but checking each such function would take a very long time, so the very fact of having a volatile function is enough for us to abandon multi-inserts. 3) Default expressions of the target table are also checked for volatile functions. The same rules apply to them as in (2). As an exception, I allowed the use of SERIAL in the column data type, since this is a fairly common use case. 4) If the target table contains any ROW triggers, we don't use multi insert. 5) Patch also contains a regression test. This is a "sandbox" where you can do some experiments with append-optimized tables. I hope that patch (targeted on 'master' branch, 2c7bd2ba507e273f2d7fe1b2f6d30775ed4f3c09) will be useful for this thread. 
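For reference, here is how the experimental reloption from the attached patch is meant to be used (a sketch that assumes the patch is applied; the table and column names are made up):

-- Opt a table into the multi-insert ("append optimized") path:
CREATE TABLE bulk_target (id int, payload text) WITH (append_optimized = true);

-- Statements of this shape are the intended beneficiaries, provided the
-- target table has no ROW triggers and no volatile functions are involved:
INSERT INTO bulk_target
SELECT g, repeat('x', 100) FROM generate_series(1, 1000000) AS g;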
-- Best regards, Daniil Davydov From 224378c11d270aabe28bdd32efacd37ed1984bd1 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 7 Apr 2025 12:55:50 +0700 Subject: [PATCH v1] Meet append optimized tables --- src/backend/access/common/reloptions.c| 11 + src/backend/access/heap/heapam.c | 205 ++ src/backend/access/heap/heapam_handler.c | 5 + src/backend/access/table/tableamapi.c | 5 + src/backend/commands/explain.c| 5 +- src/backend/executor/execExpr.c | 17 +- src/backend/executor/execProcnode.c | 9 + src/backend/executor/nodeModifyTable.c| 194 - src/backend/optimizer/plan/createplan.c | 1 + src/backend/optimizer/util/clauses.c | 28 ++- src/include/access/heapam.h | 41 src/include/access/tableam.h | 84 +++ src/include/nodes/execnodes.h | 6 + src/include/nodes/plannodes.h | 2 + src/include/optimizer/optimizer.h | 3 + src/include/utils/rel.h | 10 + .../regress/expected/append_optimized.out | 161 ++ src/test/regress/parallel_schedule| 2 + src/test/regress/sql/append_optimized.sql | 105 + 19 files changed, 879 insertions(+), 15 deletions(-) create mode 100644 src/test/regress/expected/append_optimized.out create mode 100644 src/test/regress/sql/append_optimized.sql diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 46c1dce222d..9652cf4179b 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -166,6 +166,15 @@ static relopt_bool boolRelOpts[] = }, true
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 10:09 PM David G. Johnston wrote: > > On Monday, March 17, 2025, Daniil Davydov <3daniss...@gmail.com> wrote: >> >> >> > >> > 2."you have not any temporary relations" --> "you have no any temporary >> > relations" >> I am not an English speaker, but it seems that "have not" would be >> more correct. Someone has to judge us :) >> > > Both are not good. > > “pg_temp was specified but it contains no relations” [1] That sounds reasonable. I'll fix it. Thanks! > But why are we promoting this situation to an error? It should be a relation > not found error just like any other and not its own special case. > The fact we create pg_temp only as it is needed is an implementation detail > that should not be visible to the user. 1) Error message has no mention of a non-existent schema. "Schema has no relations" => "Given relation not found in empty schema". It seems to me that these are equivalent statements. 2) Is this really the implementation detail that we want to hide from the user? User can just run "select pg_my_temp_schema();" and see that there is no temp schema in the current session. Don't get me wrong - I can agree with that, but for now it seems odd to me... Steven Niu also mentioned this issue, but IMO we must give the most accurate description of the problem - tell "relation not found" only if we have temp namespace, but not specified relation in it. Please see v4 patch (only comment change). -- Best regards, Daniil Davydov From 33e2b77607c26bb82a807711ecdec5174be64874 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 17 Mar 2025 15:43:15 +0700 Subject: [PATCH v4] Fix accessing other sessions temp tables --- src/backend/catalog/namespace.c | 71 +++-- src/backend/nodes/makefuncs.c | 6 ++- src/backend/parser/gram.y | 11 - 3 files changed, 57 insertions(+), 31 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index d97d632a7ef..9e96189920d 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -499,28 +499,40 @@ RangeVarGetRelidExtended(const RangeVar *relation, LOCKMODE lockmode, */ if (relation->relpersistence == RELPERSISTENCE_TEMP) { - if (!OidIsValid(myTempNamespace)) -relId = InvalidOid; /* this probably can't happen? */ - else - { -if (relation->schemaname) -{ - Oid namespaceId; + Oid namespaceId; - namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); + if (relation->schemaname) + { +namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); +if (namespaceId != myTempNamespace) +{ /* - * For missing_ok, allow a non-existent schema name to - * return InvalidOid. + * We don't allow users to access temp tables of other + * sessions (consider adding GUC for to allow to DROP such + * tables?). */ - if (namespaceId != myTempNamespace) - ereport(ERROR, -(errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("temporary tables cannot specify a schema name"))); + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("could not access temporary relations of other sessions"))); } + } + else + { +namespaceId = myTempNamespace; -relId = get_relname_relid(relation->relname, myTempNamespace); +/* + * If this table was recognized as temporary, it means that we + * found it because backend's temporary namespace was specified + * in search_path. Thus, MyTempNamespace must contain valid oid. 
+ */ +Assert(OidIsValid(namespaceId)); } + + if (missing_ok && !OidIsValid(namespaceId)) +relId = InvalidOid; + else +relId = get_relname_relid(relation->relname, namespaceId); } else if (relation->schemaname) { @@ -3387,17 +3399,18 @@ LookupExplicitNamespace(const char *nspname, bool missing_ok) Oid namespaceId; AclResult aclresult; - /* check for pg_temp alias */ + /* + * If namespace specified in 'pg_temp' format (without owner backend id + * specification), we assume that this is our temporary namespace. + */ if (strcmp(nspname, "pg_temp") == 0) { - if (OidIsValid(myTempNamespace)) - return myTempNamespace; + if (!OidIsValid(myTempNamespace)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_SCHEMA_NAME), + errmsg("pg_temp was specified but it contains no relations"))); - /* - * Since this is used only for looking up existing objects, there is - * no point in trying to initialize the temp namespace here; and doing - * so might create problems for some callers --- just fall through. - */ + return myTempName
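To make the visibility argument above concrete (a sketch; the error text is the one proposed in the v4 patch):

-- In a session that has not created any temporary objects yet:
SELECT pg_my_temp_schema();     -- returns 0: no temp schema exists for this backend

-- With the patch applied, an explicit pg_temp reference then fails with:
--   ERROR:  pg_temp was specified but it contains no relations
SELECT * FROM pg_temp.some_table;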
Re: Accessing an invalid pointer in BufferManagerRelation structure
Hi, On Wed, Mar 26, 2025 at 2:14 PM Stepan Neretin wrote: > > The first thing we both noticed is that the macro calls a function that won't > be available without an additional header. This seems a bit inconvenient. Well, I rebased the patch onto the latest `master` (b51f86e49a7f119004c0ce5d0be89cdf98309141) and noticed that we don't need to include `rel.h` in `localbuf.c` directly anymore, because `#include lmgr.h` was added in memutils.h I guess it solves this issue. Please, see v3 patch. > I also have a question: is the logic correct that if the relation is valid, > we should fetch it rather than the other way around? Additionally, is > checking only the `rd_isvalid` flag sufficient, or should we also consider > the flag below? > > ``` > bool rd_isvalid; /* relcache entry is valid */ > I don't think that we should check any Relation's flags here. We are checking `RelationIsValid((bmr).rel) ?` to decide whether BufferManagerRelation was created via BMR_REL or BMR_SMGR. If the `rel` field is not NULL, we can definitely say that BMR_REL was used, so we should call RelationGetSmgr in order to access smgr. -- Best regards, Daniil Davydov From a1c572652e69f13b954ec3a915ed55c1ec11c772 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 14 Apr 2025 13:07:38 +0700 Subject: [PATCH v3] Add macros for safety access to smgr --- src/backend/storage/buffer/bufmgr.c | 55 --- src/backend/storage/buffer/localbuf.c | 9 +++-- src/include/storage/bufmgr.h | 13 +++ 3 files changed, 43 insertions(+), 34 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 0b0e056eea2..f2b117088a8 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -892,11 +892,8 @@ ExtendBufferedRelBy(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_by > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } return ExtendBufferedRelCommon(bmr, fork, strategy, flags, extend_by, InvalidBlockNumber, @@ -928,11 +925,8 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_to != InvalidBlockNumber && extend_to > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } /* * If desired, create the file if it doesn't exist. If @@ -940,15 +934,15 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * an smgrexists call. */ if ((flags & EB_CREATE_FORK_IF_NEEDED) && - (bmr.smgr->smgr_cached_nblocks[fork] == 0 || - bmr.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) && - !smgrexists(bmr.smgr, fork)) + (BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == 0 || + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == InvalidBlockNumber) && + !smgrexists(BMR_GET_SMGR(bmr), fork)) { LockRelationForExtension(bmr.rel, ExclusiveLock); /* recheck, fork might have been created concurrently */ - if (!smgrexists(bmr.smgr, fork)) - smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY); + if (!smgrexists(BMR_GET_SMGR(bmr), fork)) + smgrcreate(BMR_GET_SMGR(bmr), fork, flags & EB_PERFORMING_RECOVERY); UnlockRelationForExtension(bmr.rel, ExclusiveLock); } @@ -958,13 +952,13 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * kernel. 
*/ if (flags & EB_CLEAR_SIZE_CACHE) - bmr.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber; + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] = InvalidBlockNumber; /* * Estimate how many pages we'll need to extend by. This avoids acquiring * unnecessarily many victim buffers. */ - current_size = smgrnblocks(bmr.smgr, fork); + current_size = smgrnblocks(BMR_GET_SMGR(bmr), fork); /* * Since no-one else can be looking at the page contents yet, there is no @@ -1008,7 +1002,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, if (buffer == InvalidBuffer) { Assert(extended_by == 0); - buffer = ReadBuffer_common(bmr.rel, bmr.smgr, bmr.relpersistence, + buffer = ReadBuffer_common(bmr.rel, BMR_GET_SMGR(bmr), bmr.relpersistence, fork, extend_to - 1, mode, strategy); } @@ -2568,10 +2562,10 @@ ExtendBufferedRelCommon(BufferManagerRelation bmr, BlockNumber first_block; TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork, - bmr.smgr->smgr_rlocator.locator.spcOid, - bmr.smgr->smgr_rlocator.locator.dbOid, - bmr.smgr->smgr_rlocator.locator.relNumber, - bmr.smgr->smgr_rlocator.backend, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.spcOid, + BMR_GET_SMGR(bmr)->smg
Fix bug with accessing to temporary tables of other sessions
Hi, During previous commitfest this topic already has been discussed within the "Forbid to DROP temp tables of other sessions" thread [1]. Unfortunately its name doesn't reflect the real problem, so I decided to start a new thread, as David G. Johnston advised. Here are the summary results of the discussion [1] : The superuser is only allowed to DROP temporary relations of other sessions. Other commands (like SELECT, INSERT, UPDATE, DELETE ...) must be forbidden to him. Error message for this case will look like this : `could not access temporary relations of other sessions`. For now, superuser still can specify such operations because of a bug in the code that mistakenly recognizes other session's temp table as permanent (I've covered this topic in more detail in [2]). Attached patch fixes this bug (targeted on b51f86e49a7f119004c0ce5d0be89cdf98309141). Opened issue: Not everyone liked the idea that table's persistence can be assigned to table during makeRangeVarXXX calls (inside gram.y). My opinion - `As far as "pg_temp_" prefix is reserved by the postgres kernel, we can definitely say that we have encountered a temporary table when we see this prefix.` I will be glad to hear your opinion. -- Best regards, Daniil Davydov [1] https://www.postgresql.org/message-id/CAJDiXgj72Axj0d4ojKdRWG_rnkfs4uWY414NL%3D15sCvh7-9rwg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAJDiXgj%2B5UKLWSUT5605rJhuw438NmEKecvhFAF2nnrMsgGK3w%40mail.gmail.com From c1415e457edd8e1cf38fd78ac55c93dbd8617d55 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 14 Apr 2025 11:01:56 +0700 Subject: [PATCH v5] Fix accessing other sessions temp tables --- src/backend/catalog/namespace.c | 56 src/backend/commands/tablecmds.c | 3 +- src/backend/nodes/makefuncs.c| 6 +++- src/backend/parser/gram.y| 11 ++- src/include/catalog/namespace.h | 2 ++ 5 files changed, 55 insertions(+), 23 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index d97d632a7ef..f407efd9447 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -499,28 +499,44 @@ RangeVarGetRelidExtended(const RangeVar *relation, LOCKMODE lockmode, */ if (relation->relpersistence == RELPERSISTENCE_TEMP) { - if (!OidIsValid(myTempNamespace)) -relId = InvalidOid; /* this probably can't happen? */ - else - { -if (relation->schemaname) -{ - Oid namespaceId; + Oid namespaceId; - namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); + if (relation->schemaname) + { +namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); +/* + * If the user has specified an existing temporary schema + * owned by another user. + */ +if (OidIsValid(namespaceId) && namespaceId != myTempNamespace) +{ /* - * For missing_ok, allow a non-existent schema name to - * return InvalidOid. + * We don't allow users to access temp tables of other + * sessions except for the case of dropping tables. */ - if (namespaceId != myTempNamespace) + if (!(flags & RVR_OTHER_TEMP_OK)) ereport(ERROR, -(errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("temporary tables cannot specify a schema name"))); +(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("could not access temporary relations of other sessions"))); } + } + else + { +namespaceId = myTempNamespace; -relId = get_relname_relid(relation->relname, myTempNamespace); +/* + * If this table was recognized as temporary, it means that we + * found it because backend's temporary namespace was specified + * in search_path. 
Thus, MyTempNamespace must contain valid oid. + */ +Assert(OidIsValid(namespaceId)); } + + if (missing_ok && !OidIsValid(namespaceId)) +relId = InvalidOid; + else +relId = get_relname_relid(relation->relname, namespaceId); } else if (relation->schemaname) { @@ -3553,21 +3569,19 @@ get_namespace_oid(const char *nspname, bool missing_ok) RangeVar * makeRangeVarFromNameList(const List *names) { - RangeVar *rel = makeRangeVar(NULL, NULL, -1); + RangeVar *rel; switch (list_length(names)) { case 1: - rel->relname = strVal(linitial(names)); + rel = makeRangeVar(NULL, strVal(linitial(names)), -1); break; case 2: - rel->schemaname = strVal(linitial(names)); - rel->relname = strVal(lsecond(names)); + rel = makeRangeVar(strVal(linitial(names)), strVal(lsecond(names)), -1); break; case 3: + rel = makeRangeVar(strVal(lsecond(names)), strVal(lthird(names)), -1); rel->catalogname = strVal(linitial(names)); - rel->schemaname = strVal(lsecond(names)); - rel->relname = str
Re: POC: Parallel processing of indexes in autovacuum
On Tue, May 6, 2025 at 6:57 AM Masahiko Sawada wrote: > > What I roughly imagined is that we don't need to change the entire > autovacuum scheduling, but would like autovacuum workers to decides > whether or not to use parallel vacuum during its vacuum operation > based on GUC parameters (having a global effect) or storage parameters > (having an effect on the particular table). The criteria of triggering > parallel vacuum in autovacuum might need to be somewhat pessimistic so > that we don't unnecessarily use parallel vacuum on many tables. > +1, I think about it in the same way. I will expand on this topic in more detail in response to Sami's letter [1], so as not to repeat myself. > > Here are my thoughts on this. A/v worker has a very simple role - it > > is born after the launcher's request and must do exactly one 'task' - > > vacuum table or participate in parallel index vacuum. > > We also have a dedicated 'launcher' role, meaning the whole design > > implies that only the launcher is able to launch processes. > > > > If we allow a/v worker to use bgworkers, then : > > 1) A/v worker will go far beyond his responsibility. > > 2) Its functionality will overlap with the functionality of the launcher. > > While I agree that the launcher process is responsible for launching > autovacuum worker processes but I'm not sure it should be for > launching everything related autovacuums. It's quite possible that we > have parallel heap vacuum and processing the particular index with > parallel workers in the future. The code could get more complex if we > have the autovacuum launcher process launch such parallel workers too. > I believe it's more straightforward to divide the responsibility like > in a way that the autovacuum launcher is responsible for launching > autovacuum workers and autovacuum workers are responsible for > vacuuming tables no matter how to do that. It sounds very tempting. At the very beginning I did exactly that (to make sure that nothing would break in a parallel autovacuum). Only later it was decided to abandon the use of bgworkers. For now both approaches look fair for me. What do you think - will others agree that we can provide more responsibility to a/v workers? > > 3) Resource consumption can jump dramatically, which is unexpected for > > the user. > > What extra resources could be used if we use background workers > instead of autovacuum workers? I meant that more processes are starting to participate in the autovacuum than indicated in autovacuum_max_workers. And if a/v worker will use additional bgworkers => other operations cannot get these resources. > > Autovacuum will also be dependent on other resources > > (bgworkers pool). The current design does not imply this. > > I see your point but I think it doesn't necessarily need to reflect it > at the infrastructure layer. For example, we can internally allocate > extra background worker slots for parallel vacuum workers based on > max_parallel_index_autovac_workers in addition to > max_worker_processes. Anyway we might need something to check or > validate max_worker_processes value to make sure that every autovacuum > worker can use the specified number of parallel workers for parallel > vacuum. I don't think that we can provide all supportive workers for each parallel index vacuuming request. But I got your point - always keep several bgworkers that only a/v workers can use if needed and the size of this additional pool (depending on max_worker_processes) must be user-configurable. 
> > I wanted to create a patch that would fit into the existing mechanism > > without drastic innovations. But if you think that the above is not so > > important, then we can reuse VACUUM PARALLEL code and it would > > simplify the final implementation) > > I'd suggest using the existing infrastructure if we can achieve the > goal with it. If we find out there are some technical difficulties to > implement it without new infrastructure, we can revisit this approach. OK, in the near future I'll implement it and send a new patch to this thread. I'll be glad if you take a look at it :) [1] https://www.postgresql.org/message-id/CAA5RZ0vfBg%3Dc_0Sa1Tpxv8tueeBk8C5qTf9TrxKBbXUqPc99Ag%40mail.gmail.com -- Best regards, Daniil Davydov
Re: POC: Parallel processing of indexes in autovacuum
On Tue, May 6, 2025 at 7:21 AM Sami Imseih wrote: > > Perhaps we should only provide a reloption, therefore only tables specified > by the user via the reloption can be autovacuumed in parallel? After your comments (earlier in this thread) I decided to do just that. For now we have a reloption, so the user can decide which tables are "important" for parallel index vacuuming. We also set lower bounds (hardcoded) on the number of indexes and the number of dead tuples. For example, there is no need to use a parallel vacuum if the table has only one index. The situation is more complicated with the number of dead tuples - we need tests that would show the optimal minimum value. This issue is still being worked out. > This gives a targeted approach. Of course if multiple of these allowed tables > are to be autovacuumed at the same time, some may not get all the workers, > But that’s not different from if you are to manually vacuum in parallel the > tables > at the same time. I fully agree. Recently, the v2 patch has been supplemented with a new feature [1] - multiple tables in a cluster can be processed in parallel during autovacuum. And of course, not every a/v worker can get enough supportive processes, but this is considered normal behavior. The maximum number of supportive workers is limited by a GUC variable. [1] I guess I'll send it within the v3 patch, which will also contain the logic discussed in the email above - using bgworkers instead of additional a/v workers. BTW, what do you think about this idea? -- Best regards, Daniil Davydov
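As a concrete sketch of the knobs being discussed (this assumes the patch from this thread is applied; the GUC below is one the patch adds, while the per-table reloption name is purely hypothetical, since its spelling has not been shown in this thread):

-- Cap on the number of parallel workers used for one table's index
-- vacuuming during autovacuum (GUC added by the patch):
ALTER SYSTEM SET max_parallel_index_autovac_workers = 4;
SELECT pg_reload_conf();

-- Hypothetical per-table opt-in that marks the table as "important"
-- for parallel index vacuuming:
ALTER TABLE big_table SET (parallel_idx_autovac_enabled = true);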
Re: POC: Parallel processing of indexes in autovacuum
On Thu, May 1, 2025 at 8:03 AM Masahiko Sawada wrote: > > As I understand it, we initially disabled parallel vacuum for > autovacuum because their objectives are somewhat contradictory. > Parallel vacuum aims to accelerate the process by utilizing additional > resources, while autovacuum is designed to perform cleaning operations > with minimal impact on foreground transaction processing (e.g., > through vacuum delay). > Yep, we also decided that we must not create more a/v workers for index processing. In current implementation, the leader process sends a signal to the a/v launcher, and the launcher tries to launch all requested workers. But the number of workers never exceeds `autovacuum_max_workers`. Thus, we will never have more a/v workers than in the standard case (without this feature). > Nevertheless, I see your point about the potential benefits of using > parallel vacuum within autovacuum in specific scenarios. The crucial > consideration is determining appropriate criteria for triggering > parallel vacuum in autovacuum. Given that we currently support only > parallel index processing, suitable candidates might be autovacuum > operations on large tables that have a substantial number of > sufficiently large indexes and a high volume of garbage tuples. > > Although the actual number of parallel workers ultimately depends on > the number of eligible indexes, it might be beneficial to introduce a > storage parameter, say parallel_vacuum_workers, that allows control > over the number of parallel vacuum workers on a per-table basis. > For now, we have three GUC variables for this purpose: max_parallel_index_autovac_workers, autovac_idx_parallel_min_rows, autovac_idx_parallel_min_indexes. That is, everything is as you said. But we are still conducting research on this issue. I would like to get rid of some of these parameters. > Regarding implementation: I notice the WIP patch implements its own > parallel vacuum mechanism for autovacuum. Have you considered simply > setting at_params.nworkers to a value greater than zero? > About `at_params.nworkers = N` - that's exactly what we're doing (you can see it in the `vacuum_rel` function). But we cannot fully reuse code of VACUUM PARALLEL, because it creates its own processes via dynamic bgworkers machinery. As I said above - we don't want to consume additional resources. Also we don't want to complicate communication between processes (the idea is that a/v workers can only send signals to the a/v launcher). As a result, we created our own implementation of parallel index processing control - see changes in vacuumparallel.c and autovacuum.c. -- Best regards, Daniil Davydov
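For reference, the manual command whose behaviour this mirrors already exists (standard syntax, not part of the patch); the thread is essentially about letting autovacuum workers reach a comparable parallel index-processing path on their own:

-- Explicit parallel index vacuuming, available today:
VACUUM (PARALLEL 4, VERBOSE) big_table;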
Re: POC: Parallel processing of indexes in autovacuum
On Fri, May 2, 2025 at 11:58 PM Sami Imseih wrote: > > I am generally -1 on the idea of autovacuum performing parallel > index vacuum, because I always felt that the parallel option should > be employed in a targeted manner for a specific table. if you have a bunch > of large tables, some more important than others, a/c may end > up using parallel resources on the least important tables and you > will have to adjust a/v settings per table, etc to get the right table > to be parallel index vacuumed by a/v. Hm, this is a good point. I think I should clarify one moment - in practice, there is a common situation when users have one huge table among all databases (with 80+ indexes created on it). But, of course, in general there may be few such tables. But we can still adjust the autovac_idx_parallel_min_rows parameter. If a table has a lot of dead tuples => it is actively used => table is important (?). Also, if the user can really determine the "importance" of each of the tables - we can provide an appropriate table option. Tables with this option set will be processed in parallel in priority order. What do you think about such an idea? > > Also, with the TIDStore improvements for index cleanup, and the practical > elimination of multi-pass index vacuums, I see this being even less > convincing as something to add to a/v. If I understood correctly, then we are talking about the fact that TIDStore can store so many tuples that in fact a second pass is never needed. But the number of passes does not affect the presented optimization in any way. We must think about a large number of indexes that must be processed. Even within a single pass we can have a 40% increase in speed. > > Now, If I am going to allocate extra workers to run vacuum in parallel, why > not just provide more autovacuum workers instead so I can get more tables > vacuumed within a span of time? For now, only one process can clean up indexes, so I don't see how increasing the number of a/v workers will help in the situation that I mentioned above. Also, we don't consume additional resources during autovacuum in this patch - total number of a/v workers always <= autovacuum_max_workers. BTW, see v2 patch, attached to this letter (bug fixes) :-) -- Best regards, Daniil Davydov From 1c93a729b844a1dfe109e8d9e54d5cc0a941d061 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Sat, 3 May 2025 00:27:45 +0700 Subject: [PATCH v2] WIP Allow autovacuum to process indexes of single table in parallel --- src/backend/commands/vacuum.c | 27 + src/backend/commands/vacuumparallel.c | 289 +- src/backend/postmaster/autovacuum.c | 906 +- src/backend/utils/misc/guc_tables.c | 30 + src/backend/utils/misc/postgresql.conf.sample | 6 + src/include/postmaster/autovacuum.h | 23 + src/test/modules/autovacuum/.gitignore| 1 + src/test/modules/autovacuum/Makefile | 14 + .../autovacuum/t/001_autovac_parallel.pl | 137 +++ 9 files changed, 1387 insertions(+), 46 deletions(-) create mode 100644 src/test/modules/autovacuum/.gitignore create mode 100644 src/test/modules/autovacuum/Makefile create mode 100644 src/test/modules/autovacuum/t/001_autovac_parallel.pl diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 33a33bf6b1c..a5ef5319ccc 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -2234,6 +2234,33 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, else toast_relid = InvalidOid; + /* + * Decide whether we need to process table with given oid in parallel mode + * during autovacuum. 
+ */ + if (AmAutoVacuumWorkerProcess() && + params->index_cleanup != VACOPTVALUE_DISABLED) + { + PgStat_StatTabEntry *tabentry; + + /* fetch the pgstat table entry */ + tabentry = pgstat_fetch_stat_tabentry_ext(rel->rd_rel->relisshared, + rel->rd_id); + if (tabentry && tabentry->dead_tuples >= autovac_idx_parallel_min_rows) + { + List *indexes = RelationGetIndexList(rel); + int num_indexes = list_length(indexes); + + list_free(indexes); + + if (num_indexes >= autovac_idx_parallel_min_indexes && +max_parallel_index_autovac_workers > 0) + { +params->nworkers = max_parallel_index_autovac_workers; + } + } + } + /* * Switch to the table owner's userid, so that any index functions are run * as that user. Also lock down security-restricted operations and diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c index 2b9d548cdeb..cb4b7c23010 100644 --- a/src/backend/commands/vacuumparallel.c +++ b/src/backend/commands/vacuumparallel.c @@ -1,20 +1,23 @@ /*- * * vacuumparallel.c - * Support routines fo
Re: POC: Parallel processing of indexes in autovacuum
On Sat, May 3, 2025 at 5:28 AM Masahiko Sawada wrote: > > > In current implementation, the leader process sends a signal to the > > a/v launcher, and the launcher tries to launch all requested workers. > > But the number of workers never exceeds `autovacuum_max_workers`. > > Thus, we will never have more a/v workers than in the standard case > > (without this feature). > > I have concerns about this design. When autovacuuming on a single > table consumes all available autovacuum_max_workers slots with > parallel vacuum workers, the system becomes incapable of processing > other tables. This means that when determining the appropriate > autovacuum_max_workers value, users must consider not only the number > of tables to be processed concurrently but also the potential number > of parallel workers that might be launched. I think it would more make > sense to maintain the existing autovacuum_max_workers parameter while > introducing a new parameter that would either control the maximum > number of parallel vacuum workers per autovacuum worker or set a > system-wide cap on the total number of parallel vacuum workers. > For now we have max_parallel_index_autovac_workers - this GUC limits the number of parallel a/v workers that can process a single table. I agree that the scenario you provided is problematic. The proposal to limit the total number of supportive a/v workers seems attractive to me (I'll implement it as an experiment). It seems to me that this question is becoming a key one. First we need to determine the role of the user in the whole scheduling mechanism. Should we allow users to determine priority? Will this priority affect only within a single vacuuming cycle, or it will be more 'global'? I guess I don't have enough expertise to determine this alone. I will be glad to receive any suggestions. > > About `at_params.nworkers = N` - that's exactly what we're doing (you > > can see it in the `vacuum_rel` function). But we cannot fully reuse > > code of VACUUM PARALLEL, because it creates its own processes via > > dynamic bgworkers machinery. > > As I said above - we don't want to consume additional resources. Also > > we don't want to complicate communication between processes (the idea > > is that a/v workers can only send signals to the a/v launcher). > > Could you elaborate on the reasons why you don't want to use > background workers and avoid complicated communication between > processes? I'm not sure whether these concerns provide sufficient > justification for implementing its own parallel index processing. > Here are my thoughts on this. A/v worker has a very simple role - it is born after the launcher's request and must do exactly one 'task' - vacuum table or participate in parallel index vacuum. We also have a dedicated 'launcher' role, meaning the whole design implies that only the launcher is able to launch processes. If we allow a/v worker to use bgworkers, then : 1) A/v worker will go far beyond his responsibility. 2) Its functionality will overlap with the functionality of the launcher. 3) Resource consumption can jump dramatically, which is unexpected for the user. Autovacuum will also be dependent on other resources (bgworkers pool). The current design does not imply this. I wanted to create a patch that would fit into the existing mechanism without drastic innovations. But if you think that the above is not so important, then we can reuse VACUUM PARALLEL code and it would simplify the final implementation) -- Best regards, Daniil Davydov
Re: POC: Parallel processing of indexes in autovacuum
On Sat, May 3, 2025 at 5:59 AM Sami Imseih wrote: > > > I think it would more make > > sense to maintain the existing autovacuum_max_workers parameter while > > introducing a new parameter that would either control the maximum > > number of parallel vacuum workers per autovacuum worker or set a > > system-wide cap on the total number of parallel vacuum workers. > > +1, and would it make sense for parallel workers to come from > max_parallel_maintenance_workers? This is capped by > max_parallel_workers and max_worker_processes, so increasing > the defaults for all 3 will be needed as well. I may be wrong, but the `max_parallel_maintenance_workers` parameter is only used for commands that are explicitly run by the user. We already have `autovacuum_max_workers` and I think the code will be more consistent if we adapt this particular parameter (perhaps with the addition of a new one, as I wrote in the previous email). -- Best regards, Daniil Davydov
Re: POC: Parallel processing of indexes in autovacuum
On Sat, May 3, 2025 at 3:17 AM Sami Imseih wrote: > > I think in most cases, the user will want to determine the priority of > a table getting parallel vacuum cycles rather than having the autovacuum > determine the priority. I also see users wanting to stagger > vacuums of large tables with many indexes through some time period, > and give the > tables the full amount of parallel workers they can afford at these > specific periods > of time. A/V currently does not really allow for this type of > scheduling, and if we > give some kind of GUC to prioritize tables, I think users will constantly have > to be modifying this priority. If the user wants to determine the priority himself, we still need to introduce some parameter (a GUC or table option) that will give us a hint about how we should schedule a/v work. Do you think we should design a more comprehensive behavior for such a parameter (so that the user doesn't have to change it often)? I will be glad to hear your thoughts. > > If I understood correctly, then we are talking about the fact that > > TIDStore can store so many tuples that in fact a second pass is never > > needed. > > But the number of passes does not affect the presented optimization in > > any way. We must think about a large number of indexes that must be > > processed. Even within a single pass we can have a 40% increase in > > speed. > I am not discounting that a single table vacuum with many indexes will > maybe perform better with parallel index scan, I am merely saying that > the TIDStore optimization now makes index vacuums better and perhaps > there is less of an incentive to use parallel. I still insist that this does not affect the parallel index vacuum, because we don't get an advantage in repeated passes. We get the same speed increase whether we have this optimization or not. Although it's even possible that the opposite is true - the situation will be better with the new TIDStore, but I can't say for sure. > > > Now, If I am going to allocate extra workers to run vacuum in parallel, > > > why > > > not just provide more autovacuum workers instead so I can get more tables > > > vacuumed within a span of time? > > > > For now, only one process can clean up indexes, so I don't see how > > increasing the number of a/v workers will help in the situation that I > > mentioned above. > > Also, we don't consume additional resources during autovacuum in this > > patch - total number of a/v workers always <= autovacuum_max_workers. > > Increasing a/v workers will not help speed up a specific table, what I > am suggesting is that instead of speeding up one table, let's just allow > other tables to not be starved of a/v cycles due to lack of a/v workers. OK, I got it. But what if vacuuming a single table takes (for example) 60% of all the time? This is still a possible situation, and vacuuming all the other tables quickly will not help us. -- Best regards, Daniil Davydov
Re: Forbid to DROP temp tables of other sessions
Hi, I'm sorry for the long lull. Considering that it is still important for some users to be able to clean out other sessions' temporary namespaces, I suggest adding a GUC that will allow a superuser to do this. We keep only one option - dropping tables - because only that feature works properly in postgres right now. The ability to read and modify other sessions' temp tables is broken (I mean INSERT, UPDATE, DELETE and SELECT queries) and needs to be disabled. I wrote about broken INSERTs here: https://www.postgresql.org/message-id/CAJDiXgj6TBzn%3D6Ezx7%2B9BNa9HpBitBU%2BMuv-N3mHeN_Zs3NBDw%40mail.gmail.com What do you think about this idea? -- Best regards, Daniil Davydov
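A sketch of what this would look like from the user's side (the GUC name below is purely hypothetical - none has been agreed on in this thread - and pg_temp_3 stands for whatever schema the other session actually owns):

-- Superuser opts in to cleaning up another session's temp schema:
SET allow_superuser_drop_other_temp = on;      -- hypothetical GUC
DROP TABLE pg_temp_3.leftover;                 -- allowed while the GUC is on

-- Content access remains forbidden regardless of the setting:
INSERT INTO pg_temp_3.leftover VALUES (1);
-- ERROR:  cannot access temporary tables of other sessions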
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
Hi, A few days ago I came up with an idea to implement multi insert optimization wherever possible. I prepared a raw patch and it showed a great performance gain (up to 4 times during INSERT ... INTO ... in the best case). Then I was very happy to find this thread. You did a great job and I want to help you to bring the matter to an end. On Thu, Oct 31, 2024 at 11:17 AM Jingtang Zhang wrote: > I did some performance test these days, and I have some findings. > HEAD: > 12.29% postgres[.] pg_checksum_block >6.33% postgres[.] GetPrivateRefCountEntry >5.40% postgres[.] pg_comp_crc32c_sse42 >4.54% [kernel][k] copy_user_enhanced_fast_string >2.69% postgres[.] BufferIsValid >1.52% postgres[.] XLogRecordAssemble > > Patched: > 11.75% postgres[.] tts_virtual_materialize >8.87% postgres[.] pg_checksum_block >8.17% postgres[.] slot_deform_heap_tuple >8.09% postgres[.] heap_compute_data_size >6.17% postgres[.] fill_val >3.81% postgres[.] heap_fill_tuple >3.37% postgres[.] tts_virtual_copyslot >2.62% [kernel][k] copy_user_enhanced_fast_string I applied v25 patches on the master branch and made some measurements to find out what is the bottleneck in this case. The 'time' utility showed that without a patch, this query will run 1.5 times slower. I also made a few flamegraphs for this test. Most of the time is spent calling these two functions : tts_virtual_copyslot and heap_form_tuple. All tests were run in virtual machine with these CPU characteristics: Architecture: x86_64 CPU(s): 2 On-line CPU(s) list:0,1 Virtualization features: Virtualization: AMD-V Hypervisor vendor: KVM Virtualization type:full Caches (sum of all): L1d:128 KiB (2 instances) L1i:128 KiB (2 instances) L2: 1 MiB (2 instances) L3: 32 MiB (2 instances) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0,1 In my implementation, I used Tuplestore functionality to store tuples. In order to get rid of getting stuck in the above mentioned functions, I crossed it with the current implementation (v25 patches) and got a 10% increase in performance (for the test above). I also set up v22 patches to compare performance (with/without tuplestore) for INSERT ... INTO ... queries (with -j 4 -c 10 parameters for pgbech), and there also was an increase in TPS (about 3-4%). I attach a patch that adds Tuplestore to v25. What do you think about this idea? -- Best regards, Daniil Davydov From a59cfcbb05bb07c94a4c0ad6531baa5e531629ae Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Sun, 9 Mar 2025 16:37:44 +0700 Subject: [PATCH] Replace holding tuples in virtual slots with tuplestorage During performance testing, it was found out that in the current implementation a lot of the program's time is spent calling two functions : tts_virtual_copyslot and heap_fill_tuple. Calls to these functions are related to the fact that tuples are stored in virtual_tts, so I propose to replace this logic with Tuplestore functionality. 
Discussion: https://www.postgresql.org/message-id/9F9326B4-8AD9-4858-B1C1-559FC64E6E93%40gmail.com --- src/backend/access/heap/heapam.c | 67 +++- src/include/access/heapam.h | 9 - 2 files changed, 38 insertions(+), 38 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index acdce1a4b4..276480213a 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2665,7 +2665,6 @@ void heap_modify_buffer_insert(TableModifyState *state, TupleTableSlot *slot) { - TupleTableSlot *dstslot; HeapInsertState *istate; HeapMultiInsertState *mistate; MemoryContext oldcontext; @@ -2682,8 +2681,10 @@ heap_modify_buffer_insert(TableModifyState *state, mistate = (HeapMultiInsertState *) palloc(sizeof(HeapMultiInsertState)); mistate->slots = - (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS); - mistate->cur_slots = 0; + (TupleTableSlot **) palloc0(sizeof(void *) * HEAP_MAX_BUFFERED_SLOTS); + mistate->tstore = tuplestore_begin_heap(false, false, work_mem); + mistate->nused = 0; + istate->mistate = mistate; /* @@ -2702,36 +2703,11 @@ heap_modify_buffer_insert(TableModifyState *state, istate = (HeapInsertState *) state->data; Assert(istate->mistate != NULL); mistate = istate->mistate; - dstslot = mistate->slots[mistate->cur_slots]; - - if (dstslot == NULL) - { - /* - * We use virtual tuple slots buffered slots for leveraging the - * optimization it provides to minimize physical data copying. The -
Add arbitrary xid and mxid to pg_resetwal
Hi, I prepared a patch that will allow us to set arbitrary values in -m and -x options for pg_resetwal. For now, it is not possible to specify a value that does not fit into existing SLRU segments, and main idea of this patch (for REL_17_STABLE) is to create such segments inside pg_resetwal's main() function. In my opinion, this will be useful primarily to simplify testing, since at the moment you have to create segments manually (as in this article <https://www.cybertec-postgresql.com/en/transaction-id-wraparound-a-walk-on-the-wild-side/> ). Patch also contains additional tests for pg_resetwal (regression is called to make sure that all postgres components are working correctly, but maybe it can be replaced with something more "compact"). What do you think about the idea of adding such functionality? -- Best regards, Daniil Davydov From 2993b376674cc0b565abd42a7d85eae8c8856428 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Wed, 5 Mar 2025 14:07:06 +0700 Subject: [PATCH] Allow to set any value for -m and -x options --- src/bin/pg_resetwal/Makefile | 4 + src/bin/pg_resetwal/pg_resetwal.c| 145 +++ src/bin/pg_resetwal/t/003_advance.pl | 135 + 3 files changed, 284 insertions(+) create mode 100644 src/bin/pg_resetwal/t/003_advance.pl diff --git a/src/bin/pg_resetwal/Makefile b/src/bin/pg_resetwal/Makefile index 4228a5a772..c890b1c5c6 100644 --- a/src/bin/pg_resetwal/Makefile +++ b/src/bin/pg_resetwal/Makefile @@ -17,6 +17,10 @@ include $(top_builddir)/src/Makefile.global LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils +# required for 03_advance.pl +REGRESS_SHLIB=$(top_builddir)/src/test/regress/regress$(DLSUFFIX) +export REGRESS_SHLIB + OBJS = \ $(WIN32RES) \ pg_resetwal.o diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c index e9dcb5a6d8..50f6b9ca2f 100644 --- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -76,6 +76,9 @@ static XLogSegNo minXlogSegNo = 0; static int WalSegSz; static int set_wal_segsize; +static void AdvanceNextXid(FullTransactionId oldval, FullTransactionId newval); +static void AdvanceNextMultiXid(MultiXactId oldval, MultiXactId newval); + static void CheckDataVersion(void); static bool read_controlfile(void); static void GuessControlValues(void); @@ -90,6 +93,29 @@ static void WriteEmptyXLOG(void); static void usage(void); +typedef struct CommitTimestampEntry +{ + TimestampTz time; + RepOriginId nodeid; +} CommitTimestampEntry; + +// typedef uint64 MultiXactOffset; + +/* + * Note: these macros are copied from clog.c, commit_ts.c and subtrans.c and + * should be kept in sync. 
+ */ +#define CLOG_XACTS_PER_BYTE 4 +#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE) +#define SUBTRANS_XACTS_PER_PAGE (BLCKSZ / sizeof(TransactionId)) +#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, nodeid) + \ + sizeof(RepOriginId)) +#define COMMIT_TS_XACTS_PER_PAGE \ + (BLCKSZ / SizeOfCommitTimestampEntry) +#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset)) + +#define SLRU_PAGES_PER_SEGMENT 32 + int main(int argc, char *argv[]) { @@ -113,6 +139,7 @@ main(int argc, char *argv[]) bool force = false; bool noupdate = false; MultiXactId set_oldestmxid = 0; + FullTransactionId current_fxid = {0}; char *endptr; char *endptr2; char *DataDir = NULL; @@ -406,6 +433,11 @@ main(int argc, char *argv[]) if ((guessed && !force) || noupdate) PrintControlValues(guessed); + /* + * Remember full id of next free transaction + */ + current_fxid = ControlFile.checkPointCopy.nextXid; + /* * Adjust fields if required by switches. (Do this now so that printout, * if any, includes these values.) @@ -422,10 +454,18 @@ main(int argc, char *argv[]) } if (set_xid != 0) + { ControlFile.checkPointCopy.nextXid = FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(ControlFile.checkPointCopy.nextXid), set_xid); + if (FullTransactionIdPrecedes(current_fxid, ControlFile.checkPointCopy.nextXid) && + !noupdate) + { + AdvanceNextXid(current_fxid, ControlFile.checkPointCopy.nextXid); + } + } + if (set_oldest_commit_ts_xid != 0) ControlFile.checkPointCopy.oldestCommitTsXid = set_oldest_commit_ts_xid; if (set_newest_commit_ts_xid != 0) @@ -436,12 +476,19 @@ main(int argc, char *argv[]) if (set_mxid != 0) { + MultiXactId current_mxid = ControlFile.checkPointCopy.nextMulti; ControlFile.checkPointCopy.nextMulti = set_mxid; ControlFile.checkPointCopy.oldestMulti = set_oldestmxid; if (ControlFile.checkPointCopy.oldestMulti < FirstMultiXactId) ControlFile.checkPointCopy.oldestMulti += FirstMultiXactId; ControlFile.checkPointCopy.oldestMultiDB = InvalidOid; + + /* + * If current_mxid precedes set_mxid + */ + if (((int32) (current_mxid - set_mxid) < 0) && !noupd
Re: Add arbitrary xid and mxid to pg_resetwal
Hi, On Fri, Mar 7, 2025 at 9:35 PM Aleksander Alekseev wrote: > Your patch should target the `master` branch. Also please add a > corresponding entry to the nearest open commitfest [1]. OK, thanks for the notice! I attach the v2 patch for the `master` branch to this letter. Now you can also find it in commitfest in System Administration topic. On Fri, Mar 7, 2025 at 9:35 PM Aleksander Alekseev wrote: > > In my opinion, this will be useful primarily to simplify testing, since at > > the moment you have to create segments manually (as in this article). > > In this case you should add a test or two that demonstrate this. As > separate commits perhaps. Well, I just saw that people have a request for such functionality. More specifically, I think it's worth taking a look at the src/test/modules/xid_wraparound. There, the transaction counter is just jumping forward to check the autovacuum and postgres wraparound limits. These tests are using the `xid_wraparound` extension, but it can be replaced with new pg_resetwal feature. Measurements on my machine showed that the test execution time (with advancing xid up to 10 billions with 100 millions values by step) takes 40% less time to complete. P.S. v2 patch looks a bit scary, because it contains a lot of raw I/O operations. At the moment, it is the best I came up with, but the logic of the code (I hope) is correct. -- Best regards, Daniil Davydov From bcca199a1787399c58c959a4357c2daefb8335f5 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Tue, 11 Mar 2025 10:34:12 +0700 Subject: [PATCH v2] Arbitrary xid and mxid for resetwal --- src/bin/pg_resetwal/Makefile | 4 + src/bin/pg_resetwal/pg_resetwal.c| 338 ++- src/bin/pg_resetwal/t/003_advance.pl | 137 +++ 3 files changed, 476 insertions(+), 3 deletions(-) create mode 100644 src/bin/pg_resetwal/t/003_advance.pl diff --git a/src/bin/pg_resetwal/Makefile b/src/bin/pg_resetwal/Makefile index 82bea06dee5..1bd76180864 100644 --- a/src/bin/pg_resetwal/Makefile +++ b/src/bin/pg_resetwal/Makefile @@ -17,6 +17,10 @@ include $(top_builddir)/src/Makefile.global LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils +# required for 03_advance.pl +REGRESS_SHLIB=$(top_builddir)/src/test/regress/regress$(DLSUFFIX) +export REGRESS_SHLIB + OBJS = \ $(WIN32RES) \ pg_resetwal.o diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c index 31bc0abff16..18649127446 100644 --- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -65,7 +65,7 @@ static bool guessed = false; /* T if we had to guess at any values */ static const char *progname; static uint32 set_xid_epoch = (uint32) -1; static TransactionId set_oldest_xid = 0; -static TransactionId set_xid = 0; +static uint64 set_xid = 0; static TransactionId set_oldest_commit_ts_xid = 0; static TransactionId set_newest_commit_ts_xid = 0; static Oid set_oid = 0; @@ -89,7 +89,41 @@ static void KillExistingArchiveStatus(void); static void KillExistingWALSummaries(void); static void WriteEmptyXLOG(void); static void usage(void); +static void AdvanceNextXid(TransactionId oldval, TransactionId newval); +static void AdvanceNextMultiXid(MultiXactId oldval, MultiXactId newval); +/* + * Note: this structure is copied from commit_ts.c and should be kept in sync. + */ +typedef struct CommitTimestampEntry +{ + TimestampTz time; + RepOriginId nodeid; +} CommitTimestampEntry; + +/* + * Note: these macros are copied from clog.c, commit_ts.c and subtrans.c and + * should be kept in sync. 
+ */ +#define CLOG_BITS_PER_XACT 2 +#define CLOG_XACTS_PER_BYTE 4 +#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE) + +#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) +#define TransactionIdToByte(xid) (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE) +#define TransactionIdToBIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE) + +#define SUBTRANS_XACTS_PER_PAGE (BLCKSZ / sizeof(TransactionId)) + +#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, nodeid) + \ + sizeof(RepOriginId)) + +#define COMMIT_TS_XACTS_PER_PAGE \ + (BLCKSZ / SizeOfCommitTimestampEntry) + +#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset)) + +#define SLRU_PAGES_PER_SEGMENT 32 int main(int argc, char *argv[]) @@ -441,9 +475,47 @@ main(int argc, char *argv[]) } if (set_xid != 0) + { + FullTransactionId current_fxid = ControlFile.checkPointCopy.nextXid; + FullTransactionId full_datfrozenxid; + uint32 current_epoch; + + full_datfrozenxid = + FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(current_fxid), + ControlFile.checkPointCopy.oldestXid); + + if (set_xid > full_datfrozenxid.value && + (set_xid - full_datfrozenxid.value) > INT32_MAX) + { + /* + * Cannot advance transaction ID in this case, because all unfrozen + *
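For reference, the conversion between the 64-bit value that the patched -x option accepts and the epoch/XID pair recorded in pg_control is just a bit split. Below is a minimal standalone sketch (plain C, not taken from the patch; the 10-billion figure only mirrors the test mentioned above):
***
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* advance the next transaction ID to 10 billion, as in the test above */
	uint64_t	full_xid = UINT64_C(10000000000);

	/* the upper 32 bits become the epoch, the lower 32 bits the plain XID */
	uint32_t	epoch = (uint32_t) (full_xid >> 32);
	uint32_t	xid = (uint32_t) (full_xid & 0xFFFFFFFF);

	printf("epoch = %u, xid = %u\n", epoch, xid);
	return 0;
}
***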
Re: Accessing an invalid pointer in BufferManagerRelation structure
Hi, I am posting the new v2 patch, which is now rebased on the `master` branch. Waiting for your feedback :) -- Best regards, Daniil Davydov From 2e43a1411ebcb37b2c0c1d6bac758e48799d2c4e Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Tue, 11 Mar 2025 17:11:16 +0700 Subject: [PATCH v2] Add macros for safety access to smgr --- src/backend/storage/buffer/bufmgr.c | 55 --- src/backend/storage/buffer/localbuf.c | 10 +++-- src/include/storage/bufmgr.h | 14 +++ 3 files changed, 45 insertions(+), 34 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 7915ed624c1..c55228f298a 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -887,11 +887,8 @@ ExtendBufferedRelBy(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_by > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } return ExtendBufferedRelCommon(bmr, fork, strategy, flags, extend_by, InvalidBlockNumber, @@ -923,11 +920,8 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_to != InvalidBlockNumber && extend_to > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } /* * If desired, create the file if it doesn't exist. If @@ -935,15 +929,15 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * an smgrexists call. */ if ((flags & EB_CREATE_FORK_IF_NEEDED) && - (bmr.smgr->smgr_cached_nblocks[fork] == 0 || - bmr.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) && - !smgrexists(bmr.smgr, fork)) + (BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == 0 || + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == InvalidBlockNumber) && + !smgrexists(BMR_GET_SMGR(bmr), fork)) { LockRelationForExtension(bmr.rel, ExclusiveLock); /* recheck, fork might have been created concurrently */ - if (!smgrexists(bmr.smgr, fork)) - smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY); + if (!smgrexists(BMR_GET_SMGR(bmr), fork)) + smgrcreate(BMR_GET_SMGR(bmr), fork, flags & EB_PERFORMING_RECOVERY); UnlockRelationForExtension(bmr.rel, ExclusiveLock); } @@ -953,13 +947,13 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * kernel. */ if (flags & EB_CLEAR_SIZE_CACHE) - bmr.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber; + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] = InvalidBlockNumber; /* * Estimate how many pages we'll need to extend by. This avoids acquiring * unnecessarily many victim buffers. 
*/ - current_size = smgrnblocks(bmr.smgr, fork); + current_size = smgrnblocks(BMR_GET_SMGR(bmr), fork); /* * Since no-one else can be looking at the page contents yet, there is no @@ -1003,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, if (buffer == InvalidBuffer) { Assert(extended_by == 0); - buffer = ReadBuffer_common(bmr.rel, bmr.smgr, bmr.relpersistence, + buffer = ReadBuffer_common(bmr.rel, BMR_GET_SMGR(bmr), bmr.relpersistence, fork, extend_to - 1, mode, strategy); } @@ -2153,10 +2147,10 @@ ExtendBufferedRelCommon(BufferManagerRelation bmr, BlockNumber first_block; TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork, - bmr.smgr->smgr_rlocator.locator.spcOid, - bmr.smgr->smgr_rlocator.locator.dbOid, - bmr.smgr->smgr_rlocator.locator.relNumber, - bmr.smgr->smgr_rlocator.backend, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.spcOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.dbOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.relNumber, + BMR_GET_SMGR(bmr)->smgr_rlocator.backend, extend_by); if (bmr.relpersistence == RELPERSISTENCE_TEMP) @@ -2170,10 +2164,10 @@ ExtendBufferedRelCommon(BufferManagerRelation bmr, *extended_by = extend_by; TRACE_POSTGRESQL_BUFFER_EXTEND_DONE(fork, - bmr.smgr->smgr_rlocator.locator.spcOid, - bmr.smgr->smgr_rlocator.locator.dbOid, - bmr.smgr->smgr_rlocator.locator.relNumber, - bmr.smgr->smgr_rlocator.backend, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.spcOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.dbOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.relNumber, + BMR_GET_SMGR(bmr)->smgr_rlocator.backend, *extended_by, first_block); @@ -2239,9 +2233,9 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, * kernel. */ if (flags & EB_CLEAR_SIZE_CACHE) - bmr.smgr->smgr_cached_nblocks[fork] = InvalidBlockNum
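The bufmgr.h hunk that actually defines the new macro is cut off above, so here is a sketch of what a BMR_GET_SMGR() helper along these lines could look like. This is an assumption for readability, not the patch's definition; it assumes the usual storage/bufmgr.h and utils/rel.h declarations.
***
/* Hypothetical sketch: re-derive the SMgrRelation on every access instead of
 * caching a pointer in the BufferManagerRelation struct. */
#define BMR_GET_SMGR(bmr) \
	((bmr).rel != NULL ? RelationGetSmgr((bmr).rel) : (bmr).smgr)
***
The reason to go through RelationGetSmgr() each time is that the smgr pointer hanging off a Relation can be closed and reopened by cache invalidations, so a copy stored in bmr.smgr can end up dangling - which is exactly the invalid-pointer problem this thread is about.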
Re: POC: Parallel processing of indexes in autovacuum
Hi, As I promised - meet parallel index autovacuum with bgworkers (Parallel-index-autovacuum-with-bgworkers.patch). This is pretty simple implementation : 1) Added new table option `parallel_idx_autovac_enabled` that must be set to `true` if user wants autovacuum to process table in parallel. 2) Added new GUC variable `autovacuum_reserved_workers_num`. This is number of parallel workers from bgworkers pool that can be used only by autovacuum workers. The `autovacuum_reserved_workers_num` parameter actually reserves a requested part of the processes, the total number of which is equal to `max_worker_processes`. 3) When an autovacuum worker decides to process some table in parallel, it just sets `VacuumParams->nworkers` to appropriate value (> 0) and then the code is executed as if it were a regular VACUUM PARALLEL. 4) I kept test/modules/autovacuum as sandbox where you can play with parallel index autovacuum a bit. What do you think about this implementation? P.S. I also improved "self-managed" parallel autovacuum implementation (Self-managed-parallel-index-autovacuum.patch). For now it needs a lot of refactoring, but all features are working good. Both patches are targeting on master branch (bc35adee8d7ad38e7bef40052f196be55decddec) -- Best regards, Daniil Davydov From cfb7e675d9a1b05aef0cdaeeca5f6edd4bcd3b70 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Sat, 10 May 2025 01:07:42 +0700 Subject: [PATCH v1] Parallel index autovacuum with bgworkers --- src/backend/access/common/reloptions.c| 11 ++ src/backend/commands/vacuum.c | 55 src/backend/commands/vacuumparallel.c | 46 --- src/backend/postmaster/autovacuum.c | 9 ++ src/backend/postmaster/bgworker.c | 33 - src/backend/utils/init/globals.c | 1 + src/backend/utils/misc/guc_tables.c | 13 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/miscadmin.h | 1 + src/include/utils/guc_hooks.h | 2 + src/include/utils/rel.h | 10 ++ src/test/modules/autovacuum/.gitignore| 1 + src/test/modules/autovacuum/Makefile | 14 ++ .../autovacuum/t/001_autovac_parallel.pl | 129 ++ 14 files changed, 307 insertions(+), 19 deletions(-) create mode 100644 src/test/modules/autovacuum/.gitignore create mode 100644 src/test/modules/autovacuum/Makefile create mode 100644 src/test/modules/autovacuum/t/001_autovac_parallel.pl diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 46c1dce222d..ccf59208783 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -166,6 +166,15 @@ static relopt_bool boolRelOpts[] = }, true }, + { + { + "parallel_idx_autovac_enabled", + "Allows autovacuum to process indexes of this table in parallel mode", + RELOPT_KIND_HEAP, + ShareUpdateExclusiveLock + }, + false + }, /* list terminator */ {{NULL}} }; @@ -1863,6 +1872,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind) {"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)}, {"autovacuum_enabled", RELOPT_TYPE_BOOL, offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)}, + {"parallel_idx_autovac_enabled", RELOPT_TYPE_BOOL, + offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, parallel_idx_autovac_enabled)}, {"autovacuum_vacuum_threshold", RELOPT_TYPE_INT, offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_threshold)}, {"autovacuum_vacuum_max_threshold", RELOPT_TYPE_INT, diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 33a33bf6b1c..f7667f14147 100644 --- a/src/backend/commands/vacuum.c 
+++ b/src/backend/commands/vacuum.c @@ -57,9 +57,21 @@ #include "utils/guc.h" #include "utils/guc_hooks.h" #include "utils/memutils.h" +#include "utils/rel.h" #include "utils/snapmgr.h" #include "utils/syscache.h" +/* + * Minimum number of dead tuples required for the table's indexes to be + * processed in parallel during autovacuum. + */ +#define AV_PARALLEL_DEADTUP_THRESHOLD 1024 + +/* + * How many indexes should process each parallel worker during autovacuum. + */ +#define NUM_INDEXES_PER_PARALLEL_WORKER 30 + /* * Minimum interval for cost-based vacuum delay reports from a parallel worker. * This aims to avoid sending too many messages and waking up the leader too @@ -2234,6 +2246,49 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, else toast_relid = InvalidOid; + /* + * If we are running autovacuum - decide whether we need to process indexes + * of table with given oid in parallel. + */ + if (AmAutoVacuumWorkerProcess() && + params->index_clea
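The hunk above is cut off in the middle of the new check in vacuum_rel(), so here is a rough sketch of the overall shape that points 1)-3) describe. The pg_stat lookup and the compute_parallel_degree() helper are illustrative assumptions, not the exact patch code.
***
	/* Sketch only: decide whether this autovacuum run should vacuum the
	 * table's indexes in parallel. */
	if (AmAutoVacuumWorkerProcess() &&
		params->index_cleanup != VACOPTVALUE_DISABLED &&
		RelationAllowsParallelIdxAutovac(rel))
	{
		PgStat_StatTabEntry *tabentry = pgstat_fetch_stat_tabentry(relid);

		if (tabentry != NULL &&
			tabentry->dead_tuples >= AV_PARALLEL_DEADTUP_THRESHOLD)
		{
			/* e.g. one worker per NUM_INDEXES_PER_PARALLEL_WORKER indexes */
			params->nworkers = compute_parallel_degree(rel);
		}
	}
***
Once params->nworkers is set to a positive value, the existing vacuumparallel.c machinery runs as it would for a manual VACUUM (PARALLEL n).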
Re: [BUG] Skipped initialization of some xl_xact_parsed_prepare fields
Hi, On Wed, May 21, 2025 at 9:59 AM Fujii Masao wrote: > I've pushed the patch. Thanks! > Glad to hear it, thank you! -- Best regards, Daniil Davydov
Re: POC: Parallel processing of indexes in autovacuum
Hi, On Fri, May 23, 2025 at 6:12 AM Masahiko Sawada wrote: > > On Thu, May 22, 2025 at 12:44 AM Daniil Davydov <3daniss...@gmail.com> wrote: > > > > On Wed, May 21, 2025 at 5:30 AM Masahiko Sawada > > wrote: > > > > > > I find that the name "autovacuum_reserved_workers_num" is generic. It > > > would be better to have a more specific name for parallel vacuum such > > > as autovacuum_max_parallel_workers. This parameter is related to > > > neither autovacuum_worker_slots nor autovacuum_max_workers, which > > > seems fine to me. Also, max_parallel_maintenance_workers doesn't > > > affect this parameter. > > > > This was my headache when I created names for variables. Autovacuum > > initially implies parallelism, because we have several parallel a/v > > workers. > > I'm not sure if it's parallelism. We can have multiple autovacuum > workers simultaneously working on different tables, which seems not > parallelism to me. Hm, I didn't thought about the 'parallelism' definition in this way. But I see your point - the next v4 patch will contain the naming that you suggest. > > > So I think that parameter like > > `autovacuum_max_parallel_workers` will confuse somebody. > > If we want to have a more specific name, I would prefer > > `max_parallel_index_autovacuum_workers`. > > It's better not to use 'index' as we're trying to extend parallel > vacuum to heap scanning/vacuuming as well[1]. OK, I'll fix it. > > > + /* > > > +* If we are running autovacuum - decide whether we need to process > > > indexes > > > +* of table with given oid in parallel. > > > +*/ > > > + if (AmAutoVacuumWorkerProcess() && > > > + params->index_cleanup != VACOPTVALUE_DISABLED && > > > + RelationAllowsParallelIdxAutovac(rel)) > > > > > > I think that this should be done in autovacuum code. > > > > We need params->index cleanup variable to decide whether we need to > > use parallel index a/v. In autovacuum.c we have this code : > > *** > > /* > > * index_cleanup and truncate are unspecified at first in autovacuum. > > * They will be filled in with usable values using their reloptions > > * (or reloption defaults) later. > > */ > > tab->at_params.index_cleanup = VACOPTVALUE_UNSPECIFIED; > > tab->at_params.truncate = VACOPTVALUE_UNSPECIFIED; > > *** > > This variable is filled in inside the `vacuum_rel` function, so I > > think we should keep the above logic in vacuum.c. > > I guess that we can specify the parallel degree even if index_cleanup > is still UNSPECIFIED. vacuum_rel() would then decide whether to use > index vacuuming and vacuumlazy.c would decide whether to use parallel > vacuum based on the specified parallel degree and index_cleanup value. > > > > > > +#define AV_PARALLEL_DEADTUP_THRESHOLD 1024 > > > > > > These fixed values really useful in common cases? I think we already > > > have an optimization where we skip vacuum indexes if the table has > > > fewer dead tuples (see BYPASS_THRESHOLD_PAGES). > > > > When we allocate dead items (and optionally init parallel autocuum) we > > don't have sane value for `vacrel->lpdead_item_pages` (which should be > > compared with BYPASS_THRESHOLD_PAGES). > > The only criterion that we can focus on is the number of dead tuples > > indicated in the PgStat_StatTabEntry. > > My point is that this criterion might not be useful. We have the > bypass optimization for index vacuuming and having many dead tuples > doesn't necessarily mean index vacuuming taking a long time. 
For > example, even if the table has a few dead tuples, index vacuuming > could take a very long time and parallel index vacuuming would help > the situation, if the table is very large and has many indexes. That sounds reasonable. I'll fix it. > > But autovacuum (as I think) should work as stable as possible and > > `unnoticed` by other processes. Thus, we must : > > 1) Compute resources (such as the number of parallel workers for a > > single table's indexes vacuuming) as efficiently as possible. > > 2) Provide a guarantee that as many tables as possible (among > > requested) will be processed in parallel. > > > > (1) can be achieved by calculating the parameters on the fly. > > NUM_INDEXES_PER_PARALLEL_WORKER is a rough mock. I can provide more > > accurate value in the near future. > > I think it requires more things than the number of indexes
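For context on the bypass optimization referenced in this exchange: in vacuumlazy.c, index vacuuming is skipped when only a tiny fraction of heap pages contain LP_DEAD items. The following is a simplified paraphrase (not the actual code; it assumes PostgreSQL's BlockNumber typedef, and BYPASS_THRESHOLD_PAGES is the 2% figure used there):
***
/* Simplified paraphrase of the existing bypass heuristic. */
static bool
index_vacuum_looks_worthwhile(BlockNumber rel_pages, BlockNumber lpdead_item_pages)
{
	double	threshold = rel_pages * 0.02;	/* BYPASS_THRESHOLD_PAGES */

	return lpdead_item_pages >= (BlockNumber) threshold;
}
***
This illustrates why a raw dead-tuple count is a weak predictor of index vacuuming cost: the cost is driven mostly by index size and the number of indexes, so even a table with few dead tuples can benefit from parallel index vacuuming if it is large and has many indexes.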
Re: POC: Parallel processing of indexes in autovacuum
Hi, On Fri, May 16, 2025 at 4:06 AM Matheus Alcantara wrote: > I've reviewed the v1-0001 patch, the build on MacOS using meson+ninja is > failing: > ❯❯❯ ninja -C build install > ninja: Entering directory `build' > [1/126] Compiling C object > src/backend/postgres_lib.a.p/utils_misc_guc_tables.c.o > FAILED: src/backend/postgres_lib.a.p/utils_misc_guc_tables.c.o > ../src/backend/utils/misc/guc_tables.c:3613:4: error: incompatible > pointer to integer conversion initializing 'int' with an expression of > type 'void *' [-Wint-conversion] > 3613 | NULL, > | ^~~~ > Thank you for reviewing this patch! > It seems that the "autovacuum_reserved_workers_num" declaration on > guc_tables.c has an extra gettext_noop() call? Good catch, I fixed this warning in the v2 version. > > One other point is that as you've added TAP tests for the autovacuum I > think you also need to create a meson.build file as you already create > the Makefile. > > You also need to update the src/test/modules/meson.build and > src/test/modules/Makefile to include the new test/modules/autovacuum > path. > OK, I should clarify this moment : modules/autovacuum is not a normal test but a sandbox - just an example of how we can trigger parallel index autovacuum. Also it may be used for debugging purposes. In fact, 001_autovac_parallel.pl is not verifying anything. I'll do as you asked (add all meson and Make stuff), but please don't focus on it. The creation of the real test is still in progress. (I'll try to complete it as soon as possible). In this letter I will divide the patch into 2 parts : implementation and sandbox. What do you think about implementation? -- Best regards, Daniil Davydov From 5a25535f5f4212ca756b9c67bcecf3a271ceb215 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Fri, 16 May 2025 11:59:03 +0700 Subject: [PATCH v2 2/2] Sandbox for parallel index autovacuum --- src/test/modules/Makefile | 1 + src/test/modules/autovacuum/.gitignore| 1 + src/test/modules/autovacuum/Makefile | 14 ++ src/test/modules/autovacuum/meson.build | 12 ++ .../autovacuum/t/001_autovac_parallel.pl | 129 ++ src/test/modules/meson.build | 1 + 6 files changed, 158 insertions(+) create mode 100644 src/test/modules/autovacuum/.gitignore create mode 100644 src/test/modules/autovacuum/Makefile create mode 100644 src/test/modules/autovacuum/meson.build create mode 100644 src/test/modules/autovacuum/t/001_autovac_parallel.pl diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile index aa1d27bbed3..b7f3e342e82 100644 --- a/src/test/modules/Makefile +++ b/src/test/modules/Makefile @@ -5,6 +5,7 @@ top_builddir = ../../.. include $(top_builddir)/src/Makefile.global SUBDIRS = \ + autovacuum \ brin \ commit_ts \ delay_execution \ diff --git a/src/test/modules/autovacuum/.gitignore b/src/test/modules/autovacuum/.gitignore new file mode 100644 index 000..0b54641bceb --- /dev/null +++ b/src/test/modules/autovacuum/.gitignore @@ -0,0 +1 @@ +/tmp_check/ \ No newline at end of file diff --git a/src/test/modules/autovacuum/Makefile b/src/test/modules/autovacuum/Makefile new file mode 100644 index 000..90c00ff350b --- /dev/null +++ b/src/test/modules/autovacuum/Makefile @@ -0,0 +1,14 @@ +# src/test/modules/autovacuum/Makefile + +TAP_TESTS = 1 + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = src/test/modules/autovacuum +top_builddir = ../../../.. 
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif \ No newline at end of file diff --git a/src/test/modules/autovacuum/meson.build b/src/test/modules/autovacuum/meson.build new file mode 100644 index 000..f91c1a14d2b --- /dev/null +++ b/src/test/modules/autovacuum/meson.build @@ -0,0 +1,12 @@ +# Copyright (c) 2022-2025, PostgreSQL Global Development Group + +tests += { + 'name': 'autovacuum', + 'sd': meson.current_source_dir(), + 'bd': meson.current_build_dir(), + 'tap': { +'tests': [ + 't/001_autovac_parallel.pl', +], + }, +} diff --git a/src/test/modules/autovacuum/t/001_autovac_parallel.pl b/src/test/modules/autovacuum/t/001_autovac_parallel.pl new file mode 100644 index 000..a44cbebe0fd --- /dev/null +++ b/src/test/modules/autovacuum/t/001_autovac_parallel.pl @@ -0,0 +1,129 @@ +use warnings FATAL => 'all'; +use PostgreSQL::Test::Cluster; +use PostgreSQL::Test::Utils; +use Test::More; + +my $psql_out; + +my $node = PostgreSQL::Test::Cluster->new('node1'); +$node->init; +$node->append_conf('postgresql.conf', qq{ + autovacuum = off + max
[BUG] Skipped initialization of some xl_xact_parsed_prepare fields
Hi, I noticed that inside ParsePrepareRecord function we are missing initialization of `nstats` and `nabortstats` fields of xl_xact_parsed_prepare structure: *** (current code in master) memset(parsed, 0, sizeof(*parsed)); parsed->xact_time = xlrec->prepared_at; parsed->origin_lsn = xlrec->origin_lsn; parsed->origin_timestamp = xlrec->origin_timestamp; parsed->twophase_xid = xlrec->xid; parsed->dbId = xlrec->database; parsed->nsubxacts = xlrec->nsubxacts; parsed->nrels = xlrec->ncommitrels; parsed->nabortrels = xlrec->nabortrels; parsed->nmsgs = xlrec->ninvalmsgs; *** Thus, they will always be 0 after calling the ParsePrepareRecord function, but `stats` and `abortstats` pointers are set correctly: *** (current code in master) parsed->stats = (xl_xact_stats_item *) bufptr; bufptr += MAXALIGN(xlrec->ncommitstats * sizeof(xl_xact_stats_item)); parsed->abortstats = (xl_xact_stats_item *) bufptr; bufptr += MAXALIGN(xlrec->nabortstats * sizeof(xl_xact_stats_item)); *** If we look (for example) on parsing of a commit record, we could see that both `nstats` and `stats` fields are set inside the ParseCommitRecord function. For now, zeroed number of stats lead to invalid output of `xact_desc_prepare`, because it relies on values of `nstats` and `nabortstats` fields: *** (current code in master) xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats); xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats); *** I suppose to add initialization of `nstats` and `nabortstats` to ParsePrepareRecord (see attached patch). -- Best regards, Daniil Davydov From 6338b32aeb170dfd23ae6d313ab2fb77058e0645 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Thu, 15 May 2025 12:19:03 +0700 Subject: [PATCH v1] Add initialization of nstats and nabortstats fields inside ParsePrepareRecord --- src/backend/access/rmgrdesc/xactdesc.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c index 715cc1f7bad..305598e2865 100644 --- a/src/backend/access/rmgrdesc/xactdesc.c +++ b/src/backend/access/rmgrdesc/xactdesc.c @@ -252,6 +252,8 @@ ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *p parsed->nsubxacts = xlrec->nsubxacts; parsed->nrels = xlrec->ncommitrels; parsed->nabortrels = xlrec->nabortrels; + parsed->nstats = xlrec->ncommitstats; + parsed->nabortstats = xlrec->nabortstats; parsed->nmsgs = xlrec->ninvalmsgs; strncpy(parsed->twophase_gid, bufptr, xlrec->gidlen); -- 2.43.0
Re: POC: Parallel processing of indexes in autovacuum
r so > that the user can specify the number of parallel vacuum workers for > the table. For example, we can have a reloption > autovacuum_parallel_workers. Setting 0 (by default) means to disable > parallel vacuum during autovacuum, and setting special value -1 means > to let PostgreSQL calculate the parallel degree for the table (same as > the default VACUUM command behavior). > ... > The patch includes the changes to bgworker.c so that we can reserve > some slots for autovacuums. I guess that this change is not > necessarily necessary because if the user sets the related GUC > parameters correctly the autovacuum workers can use parallel vacuum as > expected. Even if we need this change, I would suggest implementing > it as a separate patch. > .. > +#define AV_PARALLEL_DEADTUP_THRESHOLD 1024 > +#define NUM_INDEXES_PER_PARALLEL_WORKER 30 > > These fixed values really useful in common cases? Given that we rely on > users' heuristics which table needs to use parallel vacuum during > autovacuum, I think we don't need to apply these conditions. > .. I grouped these comments together because they all relate to a single question: how much freedom will we give to the user? Your opinion (as far as I understand) is that we should allow users to specify any number of parallel workers for a table, and that it is the user's responsibility to configure the related GUC variables appropriately, so that autovacuum can always process indexes in parallel; we don't need to think about thresholds. Even if the table has a small number of indexes and dead rows, if the user has set the table option, we must do parallel index a/v with the requested number of parallel workers. Please correct me if I messed something up. I think that this logic is well suited for the `VACUUM (PARALLEL)` SQL command, which is called manually by the user. But autovacuum (in my opinion) should work as stably as possible and go unnoticed by other processes. Thus, we must: 1) Compute resources (such as the number of parallel workers for a single table's index vacuuming) as efficiently as possible. 2) Guarantee that as many tables as possible (among those requested) will be processed in parallel. (1) can be achieved by calculating the parameters on the fly. NUM_INDEXES_PER_PARALLEL_WORKER is a rough mock; I can provide a more accurate value in the near future. (2) can be achieved by reserving workers - we know that N workers (from the bgworkers pool) are *always* at our disposal. When we use such workers, we do not depend on other operations in the cluster and we do not take resources away from them. If we give the user too much freedom in tuning parallel index a/v, all of these requirements may be violated. This is only my opinion, and I can agree with yours. Maybe we need another person to judge us? Please see the v3 patches, which contain the changes related to the GUC parameter and the table option (no changes in the global logic yet).
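To make point (2) a bit more concrete, here is a rough sketch of the reservation idea (the identifier shapes are mine, not the patch code): the bgworker slot accounting simply refuses to hand the last autovacuum_reserved_workers_num slots to anything other than an autovacuum request.
***
/* Sketch only: gate dynamic bgworker slot usage so that a fixed number of
 * slots always remains available to autovacuum. */
static bool
slot_may_be_used(bool for_autovacuum, int slots_in_use)
{
	int		usable = max_worker_processes;

	if (!for_autovacuum)
		usable -= autovacuum_reserved_workers_num;

	return slots_in_use < usable;
}
***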
-- Best regards, Daniil Davydov From 2223da7a9b2ef8c8d71780ad72b24eaf6d6c1141 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Fri, 16 May 2025 11:58:40 +0700 Subject: [PATCH v3 1/2] Parallel index autovacuum with bgworkers --- src/backend/access/common/reloptions.c| 11 src/backend/commands/vacuum.c | 55 +++ src/backend/commands/vacuumparallel.c | 46 ++-- src/backend/postmaster/autovacuum.c | 14 - src/backend/postmaster/bgworker.c | 33 ++- src/backend/utils/init/globals.c | 1 + src/backend/utils/misc/guc_tables.c | 12 src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/miscadmin.h | 1 + src/include/utils/guc_hooks.h | 2 + src/include/utils/rel.h | 10 11 files changed, 166 insertions(+), 20 deletions(-) diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 46c1dce222d..730096002b1 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -166,6 +166,15 @@ static relopt_bool boolRelOpts[] = }, true }, + { + { + "parallel_index_autovacuum_enabled", + "Allows autovacuum to process indexes of this table in parallel mode", + RELOPT_KIND_HEAP, + ShareUpdateExclusiveLock + }, + false + }, /* list terminator */ {{NULL}} }; @@ -1863,6 +1872,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind) {"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)}, {"autovacuum_enabled", RELOPT_TYPE_BOOL, offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)}, + {"parallel_index_autovacuum_enabled", RELOPT_TYPE_BOOL, + offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, p
Speedup truncations of temporary relation forks
Hi, For now we fully scan local buffers for each fork of the temporary relation that we want to truncate (in order to drop its buffers). It happens in the function "DropRelationBuffers". There used to be the same problem for regular tables (i.e. shared buffers) and it was fixed in commit [1] and now shared buffers are scanned only one time for those three relation forks. I suggest making the same fix for temporary relations. See attached patch. [1] 6d05086c0a79e50d8e91ed953626ec7280cd2481 BTW, I see that we call "DropRelationBuffers" separately for relation, toast table and indexes. What if we collect all this information in advance and iterate over the local/shared buffers only once? I understand that it will look kinda ugly, but it will increase performance for sure. -- Best regards, Daniil Davydov From 7b3c3e816b502a94d1ccc6889d86e07d2e1555fd Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Fri, 30 May 2025 17:50:47 +0700 Subject: [PATCH v1] Speedup temp table truncation --- src/backend/storage/buffer/bufmgr.c | 8 +++- src/backend/storage/buffer/localbuf.c | 20 +--- src/include/storage/buf_internals.h | 4 ++-- 3 files changed, 18 insertions(+), 14 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index f93131a645e..bd4c549eb14 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4550,11 +4550,9 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, if (RelFileLocatorBackendIsTemp(rlocator)) { if (rlocator.backend == MyProcNumber) - { - for (j = 0; j < nforks; j++) -DropRelationLocalBuffers(rlocator.locator, forkNum[j], - firstDelBlock[j]); - } + DropRelationLocalBuffers(rlocator.locator, forkNum, nforks, + firstDelBlock); + return; } diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 63101d56a07..7f51288546f 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -660,24 +660,30 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced) * See DropRelationBuffers in bufmgr.c for more notes. 
*/ void -DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber forkNum, - BlockNumber firstDelBlock) +DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber *forkNum, + int nforks, BlockNumber *firstDelBlock) { int i; + int j; for (i = 0; i < NLocBuffer; i++) { BufferDesc *bufHdr = GetLocalBufferDescriptor(i); uint32 buf_state; + if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator)) + continue; + buf_state = pg_atomic_read_u32(&bufHdr->state); - if ((buf_state & BM_TAG_VALID) && - BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator) && - BufTagGetForkNum(&bufHdr->tag) == forkNum && - bufHdr->tag.blockNum >= firstDelBlock) + for (j = 0; j < nforks; j++) { - InvalidateLocalBuffer(bufHdr, true); + if ((buf_state & BM_TAG_VALID) && +BufTagGetForkNum(&bufHdr->tag) == forkNum[j] && +bufHdr->tag.blockNum >= firstDelBlock[j]) + { +InvalidateLocalBuffer(bufHdr, true); + } } } } diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h index 0dec7d93b3b..52a71b138f7 100644 --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -486,8 +486,8 @@ extern bool StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait); extern void FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln); extern void InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced); extern void DropRelationLocalBuffers(RelFileLocator rlocator, - ForkNumber forkNum, - BlockNumber firstDelBlock); + ForkNumber *forkNum, int nforks, + BlockNumber *firstDelBlock); extern void DropRelationAllLocalBuffers(RelFileLocator rlocator); extern void AtEOXact_LocalBuffers(bool isCommit); -- 2.43.0
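As for the question at the end of the mail above (dropping buffers for a relation, its TOAST table and its indexes in a single pass), a very rough sketch of the shape it could take is below. This is not part of the attached patch; the function name and the idea of passing an array of locators are assumptions, and it ignores the per-fork truncation points for simplicity.
***
/* Sketch only: invalidate local buffers of several relations in one scan. */
static void
DropManyRelationsLocalBuffers(const RelFileLocator *rlocators, int nrels)
{
	for (int i = 0; i < NLocBuffer; i++)
	{
		BufferDesc *bufHdr = GetLocalBufferDescriptor(i);

		if (!(pg_atomic_read_u32(&bufHdr->state) & BM_TAG_VALID))
			continue;

		for (int j = 0; j < nrels; j++)
		{
			if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocators[j]))
			{
				InvalidateLocalBuffer(bufHdr, true);
				break;
			}
		}
	}
}
***
Whether the extra bookkeeping pays off is exactly what the measurements discussed later in this thread would have to show.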
Re: Speedup truncations of temporary relation forks
Hi, On Sun, Jun 1, 2025 at 5:31 PM Dilip Kumar wrote: > > On Sun, Jun 1, 2025 at 7:52 AM Michael Paquier wrote: > > > > I doubt that it would be a good idea to apply a patch "just" because > > it looks like a good idea. It is important to prove that something is > > a good idea first. > > I think it makes sense to do the optimization for temporary tables as > well, I tried testing with the below test case[1] and I can see ~18% > improvement with the patch. > > On head it is taking ~78 ms to truncate whereas with patch it is just > taking 66ms. > > [1] > set temp_buffers ='8GB'; > show temp_buffers; > BEGIN; > CREATE TEMPORARY TABLE test(a int, b varchar); > INSERT INTO test select i, repeat('a', 100) from > generate_series(1,100) as i; > ANALYZE ; > select relpages from pg_class where relname='test'; > TRUNCATE TABLE test; > ROLLBACK; Thank you very much for your help! I have also done some performance measurements:
set temp_buffers = '1GB';
BEGIN;
CREATE TEMP TABLE test (id INT) ON COMMIT DELETE ROWS;
INSERT INTO test SELECT generate_series(1, 3000);
DELETE FROM test WHERE id % 1000 = 0; -- force postgres to create fsm
ANALYZE test;
COMMIT;
*postgres was running on a ramdisk with swap disabled*
Thus, we are creating a 1 GB table, so that the local buffers are completely full and contain only the pages of this table. To measure the time, I hardcoded calls to GetCurrentTimestamp and TimestampDifference. I got ~7% improvement with the patch. Note that the table had only 2 forks - main and fsm (I haven't figured out how to force postgres to create a visibility map for a temp table within the transaction block). -- Best regards, Daniil Davydov
Re: Speedup truncations of temporary relation forks
Hi, On Mon, Jun 2, 2025 at 11:14 AM Dilip Kumar wrote: > > > (I haven't figured out how to force postgres to > > create a visibility map for temp table within the transaction block). > > I haven't tested this, but I think if you do bulk copy into a table > which should mark pages all visible and after that if you delete some > tuple from pages logically it should try to update the status to not > all visible in vm? > I found a reliable way to create all three forks:
begin;
create temp table test (id int) on commit delete rows;
copy test from 'data.csv' with (freeze);
insert into test select generate_series(200, 300);
delete from test where id % 50 = 0;
commit;
-- Best regards, Daniil Davydov
Re: Speedup truncations of temporary relation forks
Hi, On Sat, May 31, 2025 at 7:49 AM Michael Paquier wrote: > > On Fri, May 30, 2025 at 06:01:16PM +0700, Daniil Davydov wrote: > > For now we fully scan local buffers for each fork of the temporary > > relation that we want to truncate (in order to drop its buffers). It > > happens in the function "DropRelationBuffers". > > There used to be the same problem for regular tables (i.e. shared > > buffers) and it was fixed in commit [1] and now shared buffers are > > scanned only one time for those three relation forks. > > I suggest making the same fix for temporary relations. See attached patch. > > Applying the same kind of optimization for local buffers makes sense > here, even if the impact is more limited than regular relations. > Thanks for looking into it! > > BTW, I see that we call "DropRelationBuffers" separately for relation, > > toast table and indexes. What if we collect all this information in > > advance and iterate over the local/shared buffers only once? > > I understand that it will look kinda ugly, but it will increase > > performance for sure. > > I guess it does. Do you have numbers to share with a test case? > Not yet. I proceed from the assumption that if the temp_buffers parameter is set to a large value (some users set it to more than a gigabyte), then the vast majority of time is spent iterating through the local buffers. Thus, if we reduce the number of iterations from N to (for example) N/10, we can get a 10x increase in performance. Of course, this is a super rough assumption, but I think you understand my point. In the near future I will prepare a patch for the idea above and try to do some measurements. If there is a significant difference, I will definitely let you know. Anyway, first I suggest committing the current patch. > Please make sure to add this patch to the next commit fest. OK, already created. -- Best regards, Daniil Davydov
Re: Speedup truncations of temporary relation forks
Hi, On Sat, May 31, 2025 at 7:41 PM Fujii Masao wrote: > > Here are a few review comments on the patch: > > + for (j = 0; j < nforks; j++) > { > - InvalidateLocalBuffer(bufHdr, true); > + if ((buf_state & BM_TAG_VALID) && > + BufTagGetForkNum(&bufHdr->tag) == forkNum[j] > && > + bufHdr->tag.blockNum >= firstDelBlock[j]) > + { > + InvalidateLocalBuffer(bufHdr, true); > + } > > It looks like the "buf_state & BM_TAG_VALID" check can be moved > outside the loop, along with the BufTagMatchesRelFileLocator() check. > That would avoid unnecessary looping. > > Also, should we add a "break" right after calling InvalidateLocalBuffer()? > Since the buffer has already been invalidated, continuing the loop > may not be necessary. Thanks for the review! I'll fix both remarks. Please see the v2 patch. -- Best regards, Daniil Davydov From 203be6d23be59db2d3b9c1ace501ae94c61a4c27 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Fri, 30 May 2025 17:50:47 +0700 Subject: [PATCH v2] Speedup temp table truncation --- src/backend/storage/buffer/bufmgr.c | 8 +++- src/backend/storage/buffer/localbuf.c | 21 ++--- src/include/storage/buf_internals.h | 4 ++-- 3 files changed, 19 insertions(+), 14 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index f93131a645e..bd4c549eb14 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4550,11 +4550,9 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, if (RelFileLocatorBackendIsTemp(rlocator)) { if (rlocator.backend == MyProcNumber) - { - for (j = 0; j < nforks; j++) -DropRelationLocalBuffers(rlocator.locator, forkNum[j], - firstDelBlock[j]); - } + DropRelationLocalBuffers(rlocator.locator, forkNum, nforks, + firstDelBlock); + return; } diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 63101d56a07..275dc756f60 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -660,10 +660,11 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced) * See DropRelationBuffers in bufmgr.c for more notes. 
*/ void -DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber forkNum, - BlockNumber firstDelBlock) +DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber *forkNum, + int nforks, BlockNumber *firstDelBlock) { int i; + int j; for (i = 0; i < NLocBuffer; i++) { @@ -672,12 +673,18 @@ DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber forkNum, buf_state = pg_atomic_read_u32(&bufHdr->state); - if ((buf_state & BM_TAG_VALID) && - BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator) && - BufTagGetForkNum(&bufHdr->tag) == forkNum && - bufHdr->tag.blockNum >= firstDelBlock) + if (!(buf_state & BM_TAG_VALID) || + !BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator)) + continue; + + for (j = 0; j < nforks; j++) { - InvalidateLocalBuffer(bufHdr, true); + if (BufTagGetForkNum(&bufHdr->tag) == forkNum[j] && +bufHdr->tag.blockNum >= firstDelBlock[j]) + { +InvalidateLocalBuffer(bufHdr, true); +break; + } } } } diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h index 0dec7d93b3b..52a71b138f7 100644 --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -486,8 +486,8 @@ extern bool StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait); extern void FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln); extern void InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced); extern void DropRelationLocalBuffers(RelFileLocator rlocator, - ForkNumber forkNum, - BlockNumber firstDelBlock); + ForkNumber *forkNum, int nforks, + BlockNumber *firstDelBlock); extern void DropRelationAllLocalBuffers(RelFileLocator rlocator); extern void AtEOXact_LocalBuffers(bool isCommit); -- 2.43.0
Re: [BUG] Cannot flush buffers of temp table deleted from other session
> > BTW, there are other bugs that can occur during interacting with other > session temp tables. I described them and suggested corrections in > this thread [1]. > > [1] > https://www.postgresql.org/message-id/flat/CAJDiXgj72Axj0d4ojKdRWG_rnkfs4uWY414NL%3D15sCvh7-9rwg%40mail.gmail.com Sorry, this is an outdated thread. I planned to continue discussing this topic here: https://www.postgresql.org/message-id/CAJDiXghdFcZ8=nh4g69te7irr3q0ufyxxb3zdg09_gtnzxw...@mail.gmail.com -- Best regards, Daniil Davydov
[BUG] Cannot flush buffers of temp table deleted from other session
Hi, Ability to do a DROP temporary table of other session leads to error : session 1 : create temp table test (id int); -- let's assume that 'test' will be placed in schema 'pg_temp_0' and has relfilepath 'base/5/t0_16390' insert into test select generate_series(1, 1000); -- make few buffers dirty session 2 : drop table pg_temp_0.test; -- drop temp table of session 1 session 1: create temp table test (id int); -- create another temp table -- make dirty too many buffers (> temp_buffers), so postgres will try to flush some of them insert into test select generate_series(1, 25); ERROR: could not open file "base/5/t0_16390": No such file or directory Thus, we were trying to flush buffers of an already deleted temp table. Error occurs here : GetLocalVictimBuffer -> FlushLocalBuffer -> smgropen(BufTagGetRelFileLocator(&bufHdr->tag), MyProcNumber). To be honest, I don't see a good way to handle this error, because there may be a lot of reasons why flushing failed (up to the point that the buffer tag may be corrupted). I attach a patch (targeted on the master branch) that contains a sample idea of error handling - don't throw an error if we can ensure that the relation has been deleted. Patch contains a very rough assumption, that relation's relfilenode == relation's oid. What do you think about it? BTW, there are other bugs that can occur during interacting with other session temp tables. I described them and suggested corrections in this thread [1]. [1] https://www.postgresql.org/message-id/flat/CAJDiXgj72Axj0d4ojKdRWG_rnkfs4uWY414NL%3D15sCvh7-9rwg%40mail.gmail.com -- Best regards, Daniil Davydov From 7775fc7afbefceebf88d8be1cd94da939807035c Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Thu, 12 Jun 2025 02:11:43 +0700 Subject: [PATCH] Handle error with flushing dirty buffer of deleted temp table --- src/backend/storage/buffer/localbuf.c | 90 +-- 1 file changed, 84 insertions(+), 6 deletions(-) diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index ba26627f7b0..8b9da95c095 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -26,6 +26,7 @@ #include "utils/memdebug.h" #include "utils/memutils.h" #include "utils/resowner.h" +#include "utils/syscache.h" /*#define LBDEBUG*/ @@ -225,6 +226,7 @@ GetLocalVictimBuffer(void) int victim_bufid; int trycounter; BufferDesc *bufHdr; + MemoryContext old_mcxt = CurrentMemoryContext; ResourceOwnerEnlarge(CurrentResourceOwner); @@ -281,12 +283,88 @@ GetLocalVictimBuffer(void) LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage(); } - /* - * this buffer is not referenced but it might still be dirty. if that's - * the case, write it out before reusing it! - */ - if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY) - FlushLocalBuffer(bufHdr, NULL); + PG_TRY(); + { + /* + * this buffer is not referenced but it might still be dirty. if that's + * the case, write it out before reusing it! + */ + if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY) + FlushLocalBuffer(bufHdr, NULL); + } + PG_CATCH(); + { + MemoryContext error_mctx; + ErrorData *error; + bool can_handle_error = false; + + /* Switch to the original context & copy edata */ + error_mctx = MemoryContextSwitchTo(old_mcxt); + error = CopyErrorData(); + + /* + * If we failed during file access, there can be two cases : + * 1) Buffer belongs to relation deleted from other session. + * 2) Any other error. + * + * In first case we still can continue without throwing error - just + * don't try to flush this buffer. 
+ */ + errcode_for_file_access(); + if (error->sqlerrcode == ERRCODE_UNDEFINED_FILE) + { + HeapTuple tuple; + Oid estimated_oid; + + /* + * Buffer tag contains only relfilenode, that don't necessary + * matches relation's oid. For lack of other information, just + * assume that oid == relfilenumber. + */ + estimated_oid = BufTagGetRelFileLocator(&bufHdr->tag).relNumber; + + /* + * Try to find relcache entry for relation. If we fail - that means + * that relation was deleted (e.g. first case). + */ + tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(estimated_oid)); + if (!HeapTupleIsValid(tuple)) + { +uint32 buf_state; +RelPathStr path; +SMgrRelation reln; + +can_handle_error = true; + +reln = smgropen(BufTagGetRelFileLocator(&bufHdr->tag), +MyProcNumber); + +path = relpath(reln->smgr_rlocator, + BufTagGetForkNum(&bufHdr->tag)); + +ereport(WARNING, + (errmsg("encountered dirty local buffer that belongs to" +"to deleted relation \"%s\"", path.str), + errhint("if this relation had not
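One possible way to avoid the relfilenode == oid assumption made in the sketch above: if I remember correctly, the backend already has a helper that maps a relfilenumber back to a relation OID (RelidByRelfilenumber() in relfilenumbermap.c), so the lookup could be done roughly as below. This is a sketch under that assumption - the exact signature and the tablespace normalization should be double-checked, and it still requires catalog access from within a transaction.
***
/* Sketch only: resolve the buffer's relation OID without assuming that the
 * relfilenumber equals the pg_class OID. */
static Oid
relid_for_local_buffer(const BufferDesc *bufHdr)
{
	RelFileLocator rlocator = BufTagGetRelFileLocator(&bufHdr->tag);
	Oid			spcOid = rlocator.spcOid;

	/* pg_class.reltablespace stores 0 for the database's default tablespace */
	if (spcOid == MyDatabaseTableSpace)
		spcOid = InvalidOid;

	return RelidByRelfilenumber(spcOid, rlocator.relNumber);
}
***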
Re: [BUG] Cannot flush buffers of temp table deleted from other session
Hi, On Thu, Jun 12, 2025 at 2:40 AM Tom Lane wrote: > > Daniil Davydov <3daniss...@gmail.com> writes: > > Ability to do a DROP temporary table of other session leads to error : > > [ shrug... ] Don't do that. A superuser is capable of doing lots > of things that will break the database, and this one hardly seems > like one of the worse ones. OK, I see your point, thanks. IMHO, dropping another session's temporary tables and (for example) deleting rows from pg_class are not really comparable in how obvious the bad consequences are. But as far as I understand, we are not going to spell out such details in the documentation. > > > BTW, there are other bugs that can occur during interacting with other > > session temp tables. I described them and suggested corrections in > > this thread [1]. > > I doubt I'd agree that any such thing is a bug we need to fix. > You should not be attempting to touch other sessions' temp tables. > (If there's any path by which a non-superuser can do that, we > probably need to tighten some permission check somewhere.) Let's forget about DROP for a while. By design, Postgres does not allow even a superuser to run commands like SELECT/UPDATE/DELETE/TRUNCATE ... on another session's temp tables. You can find the corresponding error messages, for example, in bufmgr.c: *** if (RELATION_IS_OTHER_TEMP(reln)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot access temporary tables of other sessions"))); *** But currently the query parser does not recognize other sessions' temp tables as temporary, so we can get weird query results instead of the error that the design prescribes (which is even worse than an unexpected error). I am sure that this is a bug and we should fix this behavior. -- Best regards, Daniil Davydov
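For reference, the check quoted above hinges on this macro (cited from memory of utils/rel.h, so double-check the exact wording there): a relation counts as "other temp" only when it is temporary but was not created by this backend.
***
/* As defined (approximately) in utils/rel.h */
#define RELATION_IS_OTHER_TEMP(relation) \
	((relation)->rd_rel->relpersistence == RELPERSISTENCE_TEMP && \
	 !(relation)->rd_islocaltemp)
***
If a code path never reaches such a relation-level check because the table was not recognized as temporary in the first place, the session gets the weird query results described above instead of the intended error.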