Re: Do not lock temp relations
On Mon, 30 Sept 2024 at 18:05, Tom Lane wrote:
> Yes. Our implementation restrictions preclude access to the contents
> of another session's temp tables, but it is not forbidden to do DDL
> on them so long as no content access is required. (Without this,
> it'd be problematic for example to clean out a crashed session's temp
> tables. See the "orphan temporary tables" logic in autovacuum.c.)

Indeed, a potentially dangerous situation may arise when both the autovacuum and a client process attempt to delete the contents of a temporary namespace. But there is a patch (6faca9ae2878c8f642a2e5748d2dbb2b91341bec) that protects us from a race condition in this case.

--
Best regards,
Daniil Davydov

On Tue, Oct 22, 2024 at 5:55 PM Tom Lane wrote:
> Maxim Orlov writes:
> > But for the second one: do we really need any lock for temp relations?
>
> Yes. Our implementation restrictions preclude access to the contents
> of another session's temp tables, but it is not forbidden to do DDL
> on them so long as no content access is required. (Without this,
> it'd be problematic for example to clean out a crashed session's temp
> tables. See the "orphan temporary tables" logic in autovacuum.c.)
> You need fairly realistic locking to ensure that's OK.
>
> regards, tom lane
Forbid to DROP temp tables of other sessions
Hi,

I noticed that TRUNCATE and ALTER commands on temporary tables of other sessions produce an error "cannot truncate/alter temporary tables of other sessions". But are there any reasons to allow DROP on such tables? It seems to me that the only case where we may need it is the removal of orphan tables, but autovacuum is responsible for that and uses different machinery for it. I'm wondering if there are any other cases. If not, can we just handle it, for example, in ExecDropStmt and produce an error like "cannot drop temporary tables of other sessions"?

--
Best regards,
Daniil Davydov
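For illustration only, the kind of check being proposed might look roughly like this (the exact placement, whether in ExecDropStmt or in the drop-relation callback, is the open question; the message wording simply mirrors the existing TRUNCATE error):

***
/* Hypothetical guard in the DROP path, once the relation has been
 * resolved and locked.  RELATION_IS_OTHER_TEMP() is the existing macro
 * already used by the TRUNCATE and ALTER TABLE checks. */
if (RELATION_IS_OTHER_TEMP(rel))
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("cannot drop temporary tables of other sessions")));
***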
Re: Forbid to DROP temp tables of other sessions
Hi, Thanks for your comments, I appreciate them. As I continued to deal with the topic of working with temp tables of other sessions, I noticed something like a bug. For example (REL_17_STABLE): Session 1: =# CREATE TEMP TABLE test(id int); Session 2: =# INSERT INTO pg_temp_0.test VALUES (1); =# INSERT INTO pg_temp_0.test VALUES (2); Second INSERT command will end with an error "cannot access temporary tables of other sessions". I checked why this is happening and found errors in several places. So, I attach two files to this email : 1) Isolation test, that shows an error in REL_17_STABLE (iso_1.patch) 2) Patch that fixes code that mistakenly considered temporary tables to be permanent (I will be glad to receive feedback on these fixes) + isolation test, which shows that now any action with temp table of other session leads to error (temp_tbl_fix.patch) Tests look kinda ugly, but I think it's inevitable, given that we don't know exactly what the name of the temporary schema of other session will be. -- Best regards, Daniil Davydov From 995002793b0af57b660565e6027615c083d8ce5e Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 28 Oct 2024 17:19:07 +0700 Subject: [PATCH] Add isolaton test for work with other session's temp table --- src/test/isolation/expected/temp-tables.out | 75 +++ src/test/isolation/isolation_schedule | 1 + src/test/isolation/specs/temp-tables.spec | 100 3 files changed, 176 insertions(+) create mode 100644 src/test/isolation/expected/temp-tables.out create mode 100644 src/test/isolation/specs/temp-tables.spec diff --git a/src/test/isolation/expected/temp-tables.out b/src/test/isolation/expected/temp-tables.out new file mode 100644 index 00..dbd6ffc2e5 --- /dev/null +++ b/src/test/isolation/expected/temp-tables.out @@ -0,0 +1,75 @@ +Parsed test spec with 2 sessions + +starting permutation: s1_st1 s1_st2 s2_st1 s1_st3 s2_st2 +step s1_st1: CREATE TEMP TABLE test_tmp(id int); +step s1_st2: INSERT INTO temp_schema_id SELECT pg_my_temp_schema(); +s2: NOTICE: cannot access temporary tables of other sessions +s2: NOTICE: cannot access temporary tables of other sessions +step s2_st1: +DO $$ +DECLARE +schema_name text; +BEGIN +-- Find out name of temporary schema of first session +SELECT nspname INTO schema_name + FROM pg_namespace + WHERE oid = (SELECT oid FROM temp_schema_id LIMIT 1); + +-- Execute few insert operations. We expect that behavior to be the +-- same (both will complete successfully or both will fail). 
Since +-- we have RELATION_IS_OTHER_TEMP() check in PrefetchBuffer and +-- ReadBufferExtended functions (src/backend/storage/buffer/bufmgr.c), +-- let's assume that both operations must fail (this is reflected +-- in expected file) +BEGIN +EXECUTE format('INSERT INTO %I.test_tmp VALUES (1);', schema_name); +EXCEPTION +WHEN feature_not_supported +THEN RAISE NOTICE 'cannot access temporary tables of other sessions'; +END; + +BEGIN +EXECUTE format('INSERT INTO %I.test_tmp VALUES (2);', schema_name); +EXCEPTION +WHEN feature_not_supported +THEN RAISE NOTICE 'cannot access temporary tables of other sessions'; +END; +END +$$; + +step s1_st3: INSERT INTO test_tmp VALUES (3); +s2: NOTICE: (3) +s2: NOTICE: cannot access temporary tables of other sessions +s2: NOTICE: cannot access temporary tables of other sessions +step s2_st2: +DO $$ +DECLARE +schema_name text; +result RECORD; +BEGIN +-- Find out name of temporary schema of first session +SELECT nspname INTO schema_name + FROM pg_namespace + WHERE oid = (SELECT oid FROM temp_schema_id LIMIT 1); + +-- Before this step call, first session inserted few tuples into +-- test_tmp table. Let's assume that SELECT result must contain all +-- of these tuples (based on current logic) +FOR result IN + EXECUTE format('SELECT * FROM %I.test_tmp;', schema_name) + LOOP + RAISE NOTICE '%', result; + END LOOP; + +-- Now lets try to update or delete tuples from test_tmp. If we +-- cannot insert into this table, lets assume that both UPDATE and +-- DELETE operations must return same error as INSERT +BEGIN +EXECUTE format('UPDATE %I.test_tmp SET id = 100 WHERE id = 3;', schema_name); +EXCEPTION +WHEN feature_not_supporte
Re: Forbid to DROP temp tables of other sessions
On Wed, Oct 30, 2024 at 7:32 PM Rafia Sabih wrote: > Good catch. I agree with this being an unwarranted behaviour. > A minor comment from my end is the wording of the error message. > Based on the Postgresql error message style huide, something like this could > be better, > "could not access temporary relations of other sessions". > -- > Regards, > Rafia Sabih > CYBERTEC PostgreSQL International GmbH > Thanks for your comment. I attach a patch with a fixed error message. Also you can find it in commit fest (https://commitfest.postgresql.org/51/5379/) -- Best Regards, Daniil Davydov From 2c2e1b2fd75d38d43fbec5023691b93dceb18dbb Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Sun, 27 Oct 2024 16:15:50 +0700 Subject: [PATCH] Add fixes so now we cannot access to temporary tables of other sessions --- src/backend/catalog/namespace.c | 52 -- src/backend/nodes/makefuncs.c | 6 +- src/backend/parser/gram.y | 11 ++- src/test/isolation/expected/temp-tables.out | 92 + src/test/isolation/isolation_schedule | 1 + src/test/isolation/specs/temp-tables.spec | 103 6 files changed, 231 insertions(+), 34 deletions(-) create mode 100644 src/test/isolation/expected/temp-tables.out create mode 100644 src/test/isolation/specs/temp-tables.spec diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index 30807f9190..1a8e8c8844 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -499,28 +499,19 @@ RangeVarGetRelidExtended(const RangeVar *relation, LOCKMODE lockmode, */ if (relation->relpersistence == RELPERSISTENCE_TEMP) { - if (!OidIsValid(myTempNamespace)) -relId = InvalidOid; /* this probably can't happen? */ - else + if (relation->schemaname) { -if (relation->schemaname) -{ - Oid namespaceId; +Oid namespaceId; - namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); - - /* - * For missing_ok, allow a non-existent schema name to - * return InvalidOid. - */ - if (namespaceId != myTempNamespace) - ereport(ERROR, -(errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("temporary tables cannot specify a schema name"))); -} - -relId = get_relname_relid(relation->relname, myTempNamespace); +namespaceId = LookupExplicitNamespace(relation->schemaname, + missing_ok); +if (namespaceId != myTempNamespace) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("could not access temporary relations of other sessions"))); } + + relId = get_relname_relid(relation->relname, myTempNamespace); } else if (relation->schemaname) { @@ -3387,17 +3378,14 @@ LookupExplicitNamespace(const char *nspname, bool missing_ok) Oid namespaceId; AclResult aclresult; - /* check for pg_temp alias */ if (strcmp(nspname, "pg_temp") == 0) { - if (OidIsValid(myTempNamespace)) - return myTempNamespace; - /* - * Since this is used only for looking up existing objects, there is - * no point in trying to initialize the temp namespace here; and doing - * so might create problems for some callers --- just fall through. 
+ * If namespace specified in 'pg_temp' format + * (without owner proc number specification), we assume that this is + * our temporary namespace */ + return myTempNamespace; } namespaceId = get_namespace_oid(nspname, missing_ok); @@ -3553,21 +3541,19 @@ get_namespace_oid(const char *nspname, bool missing_ok) RangeVar * makeRangeVarFromNameList(const List *names) { - RangeVar *rel = makeRangeVar(NULL, NULL, -1); + RangeVar *rel; switch (list_length(names)) { case 1: - rel->relname = strVal(linitial(names)); + rel = makeRangeVar(NULL, strVal(linitial(names)), -1); break; case 2: - rel->schemaname = strVal(linitial(names)); - rel->relname = strVal(lsecond(names)); + rel = makeRangeVar(strVal(linitial(names)), strVal(lsecond(names)), -1); break; case 3: + rel = makeRangeVar(strVal(lsecond(names)), strVal(lthird(names)), -1); rel->catalogname = strVal(linitial(names)); - rel->schemaname = strVal(lsecond(names)); - rel->relname = strVal(lthird(names)); break; default: ereport(ERROR, @@ -3774,6 +3760,8 @@ GetTempNamespaceProcNumber(Oid namespaceId) return INVALID_PROC_NUMBER; /* no such namespace? */ if (strncmp(nspname, "pg_temp_", 8) == 0) result = atoi(nspname + 8); + else if (strncmp(nspname, "pg_temp", 7) == 0) + result = MyProcNumber; else if (strncmp(nspname, "pg_toast_temp_", 14) == 0) result = atoi(nspname + 14); else diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c index 61ac172a85..e60044c9d5 10
[BUG] Assert always fails while checking whether local buffer is dirty or is exclusively locked
Hi,

Commit 00d7fb5e2e39198387ae00af8dd18b787b6a4d63 adds the BufferIsExclusiveLocked and BufferIsDirty functions. They take into account that they can also be called for temporary tables, but they both contain an assert that will always fail for temporary tables:

***
Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE));
***

They fail because the LockBuffer function simply skips acquiring the lock for temporary tables:

***
if (BufferIsLocal(buffer)) return;
***

I suppose the fix can be done as in the patch attached to this email (for REL_17_STABLE).

--
Best regards,
Daniil Davydov

From ee418040d8fb1988fdb2c79ae018e5ee9266c0e9 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Thu, 5 Dec 2024 15:07:14 +0700 Subject: [PATCH] Fix isDirty and isLocked for temp tables --- src/backend/storage/buffer/bufmgr.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 6181673095..51e90f6d23 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -2465,11 +2465,12 @@ BufferIsExclusiveLocked(Buffer buffer) int bufid = -buffer - 1; bufHdr = GetLocalBufferDescriptor(bufid); + Assert(BufferIsPinned(buffer)); + + return true; } - else - { - bufHdr = GetBufferDescriptor(buffer - 1); - } + + bufHdr = GetBufferDescriptor(buffer - 1); Assert(BufferIsPinned(buffer)); return LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), @@ -2498,11 +2499,11 @@ BufferIsDirty(Buffer buffer) else { bufHdr = GetBufferDescriptor(buffer - 1); + Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), + LW_EXCLUSIVE)); } Assert(BufferIsPinned(buffer)); - Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), -LW_EXCLUSIVE)); return pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY; } -- 2.43.0
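For readability (the attached diff above is flattened), here is a sketch of the resulting shape of BufferIsDirty() with the assertion applied only to shared buffers; this restates the idea of the attached patch rather than any committed fix:

***
bool
BufferIsDirty(Buffer buffer)
{
    BufferDesc *bufHdr;

    if (BufferIsLocal(buffer))
    {
        /* No content lock exists for local (temp-table) buffers, so the
         * LWLock assertion below must only be applied to shared buffers. */
        bufHdr = GetLocalBufferDescriptor(-buffer - 1);
    }
    else
    {
        bufHdr = GetBufferDescriptor(buffer - 1);
        Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
                                    LW_EXCLUSIVE));
    }

    Assert(BufferIsPinned(buffer));

    return pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY;
}
***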
Improve DDL related functionality in table AM
Hi,

The table AM has a callback for assigning a new file locator to a table (and, accordingly, creating its own storage). That is, as far as I understand, the design implies that the author of the access method has full control over the table's storage. However, there is, for example, the "ALTER COLUMN ... TYPE ..." command, inside which the storage of two relations is swapped (the source table and a temporary one). Wouldn't it be better to have a separate callback for such an operation?

I am writing my own access method, and for each table I store some information in the local memory of the process. This information is keyed by the oid of the table. During the ALTER, a temporary table is created, filled with data, and then deleted. And I have no way of knowing that the storage of this table (which I should be able to manage myself) now belongs to another table (with the exception of the POST_ALTER object access hook, but that looks more like a hack).

--
Best regards,
Daniil Davydov
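As a purely illustrative sketch of the proposal (no such callback exists in core; the name and signature below are invented for discussion), the idea is an extra table AM entry point that fires when two relations exchange storage during a rewrite:

***
#include "postgres.h"

/*
 * Hypothetical table AM callback (illustrative only): invoked from the
 * table-rewrite path, e.g. around swap_relation_files(), when the storage
 * of relid1 and relid2 is exchanged.  An AM that keeps per-relation state
 * keyed by table OID could then move that state explicitly instead of
 * inferring the swap from the POST_ALTER object access hook.
 */
typedef void (*relation_swap_storage_callback) (Oid relid1, Oid relid2);
***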
[BUG] temp_table_max_size parameter may entail an error within ReadBuffer function
Hi,

I noticed that this sequence of actions (for a temporary table) leads to an error if the temp_table_max_size parameter is set:

***
LockRelationForExtension(relation, ExclusiveLock);
buf = ReadBuffer(relation, P_NEW);
***

The error occurs during the calculation of the total temporary table size (the find_total_temp_relation_size function, which can only be called if the temp_table_max_size parameter is > 0): we try to open all of the table's indexes and add their sizes to the total size of the temporary table. But inside the relation_open function (called for an index) we hit this assert (and, of course, fail on it):

***
/*
 * We don't acquire any other heavyweight lock while holding the relation
 * extension lock. We do allow to acquire the same relation extension
 * lock more than once but that case won't reach here.
 */
Assert(!IsRelationExtensionLockHeld);
***

I suppose that the simplest way out of the situation would be to skip locking temporary tables for extension.

--
Best regards,
Daniil Davydov

From da84d916e568be0d2414303f4a1e1d01b0bc6abd Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Fri, 6 Dec 2024 16:41:53 +0700 Subject: [PATCH] Fix bug with locking temp relation for extension --- src/backend/storage/lmgr/lmgr.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c index 094522acb4..0a91109d03 100644 --- a/src/backend/storage/lmgr/lmgr.c +++ b/src/backend/storage/lmgr/lmgr.c @@ -421,6 +421,9 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode) { LOCKTAG tag; + if (relation->rd_islocaltemp) + return; + SET_LOCKTAG_RELATION_EXTEND(tag, relation->rd_lockInfo.lockRelId.dbId, relation->rd_lockInfo.lockRelId.relId); -- 2.43.0
SLRU_PAGES_PER_SEGMENT as configure parameter
Hi, The constant SLRU_PAGES_PER_SEGMENT defines the size of the SLRU segments. It is currently hardcoded in slru.h. It would be nice to be able to set this parameter during configuration (for example, by analogy with --with-segsize-blocks). What will it give us: 1) The ability to test pg_upgrade more extensively, since without it you need to generate too much data to fill a significant number of segments. 2) The number of segments is arbitrary: 32. However, I have not heard about any estimates in terms of performance; with such a patch (for REL_17_STABLE), it will be realistic to do this. -- Best regards, Daniil Davydov From d962b1a9b5e2852706e256abbb6d9ca5155842a7 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Thu, 6 Feb 2025 14:34:36 +0700 Subject: [PATCH] SLRU_PAGES_PER_SEGMENT as configure parameter --- configure | 52 src/backend/utils/misc/guc_tables.c | 12 ++ src/include/access/slru.h | 15 --- src/include/pg_config.h.in| 13 ++ .../modules/test_slru/expected/test_slru.out | 123 ++ src/test/modules/test_slru/sql/test_slru.sql | 122 +++-- src/test/modules/test_slru/test_slru.c| 6 - 7 files changed, 228 insertions(+), 115 deletions(-) diff --git a/configure b/configure index 97996b7f6b..514418e589 100755 --- a/configure +++ b/configure @@ -845,6 +845,7 @@ enable_dtrace enable_tap_tests enable_injection_points with_blocksize +with_slru_pps with_segsize with_segsize_blocks with_wal_blocksize @@ -1559,6 +1560,7 @@ Optional Packages: set table segment size in blocks [0] --with-wal-blocksize=BLOCKSIZE set WAL block size in kB [8] + --with-slru-pps=SLRUPPS set slru segment size in blocks --with-llvm build with LLVM based JIT support --without-icu build without ICU support --with-tcl build Tcl modules (PL/Tcl) @@ -3762,6 +3764,56 @@ cat >>confdefs.h <<_ACEOF _ACEOF +# +# SLRU pages per segment +# +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for slru pages per segment" >&5 +$as_echo_n "checking for slru pages per segment... " >&6; } + + + +# Check whether --with-slru-pps was given. +if test "${with_slru_pps+set}" = set; then : + withval=$with_slru_pps; + case $withval in +yes) + as_fn_error $? "argument required for --with_slru_pps option" "$LINENO" 5 + ;; +no) + as_fn_error $? "argument required for --with_slru_pps option" "$LINENO" 5 + ;; +*) + slru_pages_per_segment=$withval + ;; + esac + +else + slru_pages_per_segment=32 +fi + + +INT_MAX=2147483647 +if test "${slru_pages_per_segment}" -lt 32; then + as_fn_error $? "Invalid slru_pps value. It cannot be less, then 32." "$LINENO" 5 + exit 1 +elif test "${slru_pages_per_segment}" -gt "${INT_MAX}"; then + as_fn_error $? "Invalid slru_pps value. It cannot be greater, then INT_MAX." "$LINENO" 5 + exit 1 +elif test $((${slru_pages_per_segment} % 2)) -ne 0; then + as_fn_error $? "Invalid slru_pps value. It must be a power of 2." 
"$LINENO" 5 + exit 1 +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: ${slru_pages_per_segment}" >&5 +$as_echo "${slru_pages_per_segment}" >&6; } + + +cat >>confdefs.h <<_ACEOF +#define SLRU_PAGES_PER_SEGMENT ${slru_pages_per_segment} +_ACEOF + + + + # # Relation segment size # diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index c42fccf3fe..bf1aa3a56c 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -596,6 +596,7 @@ static int max_function_args; static int max_index_keys; static int max_identifier_length; static int block_size; +static int slru_pages_per_segment; static int segment_size; static int shared_memory_size_mb; static int shared_memory_size_in_huge_pages; @@ -3266,6 +3267,17 @@ struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"slru_pps", PGC_INTERNAL, PRESET_OPTIONS, + gettext_noop("Shows the number of slru pages per segment."), + NULL, + GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE + }, + &slru_pages_per_segment, + SLRU_PAGES_PER_SEGMENT, SLRU_PAGES_PER_SEGMENT, SLRU_PAGES_PER_SEGMENT, + NULL, NULL, NULL + }, + { {"segment_size", PGC_INTERNAL, PRESET_OPTIONS, gettext_noop("Shows the number of pages per disk file."), diff --git a/src/include/access/slru.h b/src/include/access/slru.h index 02fcb3bca5..c75fc215d1 100644 --- a/src/include/access/slru.h +++ b/src/include/access/slru.h @@ -23,21 +23,6 @@ */ #define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ) -/* - * Define SLRU segment size.
Re: SLRU_PAGES_PER_SEGMENT as configure parameter
On Thu, Feb 6, 2025 at 8:23 PM Daniel Gustafsson wrote: > > You should include the configure.ac changes and not just the generated code in > configure (which is fine to include for review, but the committer will > re-generate it regardless). Please also include the corresponding Meson > change > to keep the buildsystems in sync. Thank you for your comment. I'm applying a patch that also includes configure.ac, meson_options.txt and meson.build changes. It would also be interesting to know what you think about the idea of allowing such a parameter to be configured. -- Best regards, Daniil Davydov From 836d648297b8ede905115f4e823953ebfa273642 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Thu, 6 Feb 2025 14:34:36 +0700 Subject: [PATCH] SLRU_PAGES_PER_SEGMENT as configure parameter --- configure | 53 configure.ac | 36 + meson.build | 17 +++ meson_options.txt | 3 + src/backend/utils/misc/guc_tables.c | 12 ++ src/include/access/slru.h | 15 --- src/include/pg_config.h.in| 15 +++ .../modules/test_slru/expected/test_slru.out | 123 ++ src/test/modules/test_slru/sql/test_slru.sql | 122 +++-- src/test/modules/test_slru/test_slru.c| 6 - 10 files changed, 287 insertions(+), 115 deletions(-) diff --git a/configure b/configure index 97996b7f6b..ff15660f5f 100755 --- a/configure +++ b/configure @@ -845,6 +845,7 @@ enable_dtrace enable_tap_tests enable_injection_points with_blocksize +with_slru_pps with_segsize with_segsize_blocks with_wal_blocksize @@ -1554,6 +1555,7 @@ Optional Packages: --with-pgport=PORTNUM set default port number [5432] --with-blocksize=BLOCKSIZE set table block size in kB [8] + --with-slru-pps=SLRUPPS set slru segment size in blocks --with-segsize=SEGSIZE set table segment size in GB [1] --with-segsize-blocks=SEGSIZE_BLOCKS set table segment size in blocks [0] @@ -3762,6 +3764,57 @@ cat >>confdefs.h <<_ACEOF _ACEOF +# +# SLRU pages per segment +# +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for slru pages per segment" >&5 +$as_echo_n "checking for slru pages per segment... " >&6; } + + + +# Check whether --with-slru-pps was given. +if test "${with_slru_pps+set}" = set; then : + withval=$with_slru_pps; + case $withval in +yes) + as_fn_error $? "argument required for --with-slru-pps option" "$LINENO" 5 + ;; +no) + as_fn_error $? "argument required for --with-slru-pps option" "$LINENO" 5 + ;; +*) + slru_pages_per_segment=$withval + ;; + esac + +else + slru_pages_per_segment=32 +fi + + + +INT_MAX=2147483647 +if test "${slru_pages_per_segment}" -lt 32; then : + as_fn_error $? "Invalid slru_pps value. It cannot be less than 32." "$LINENO" 5 +fi + +if test "${slru_pages_per_segment}" -gt "${INT_MAX}"; then : + as_fn_error $? "Invalid slru_pps value. It cannot be greater than INT_MAX." "$LINENO" 5 +fi + +if test $(expr ${slru_pages_per_segment} % 2) -ne 0; then : + as_fn_error $? "Invalid slru_pps value. It must be a power of 2." "$LINENO" 5 +fi + +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: ${slru_pages_per_segment} pages" >&5 +$as_echo "${slru_pages_per_segment} pages" >&6; } + + +cat >>confdefs.h <<_ACEOF +#define SLRU_PAGES_PER_SEGMENT ${slru_pages_per_segment} +_ACEOF + + # # Relation segment size # diff --git a/configure.ac b/configure.ac index 3c76c9ebc8..2fd0e331c0 100644 --- a/configure.ac +++ b/configure.ac @@ -288,6 +288,42 @@ AC_DEFINE_UNQUOTED([BLCKSZ], ${BLCKSZ}, [ Changing BLCKSZ requires an initdb. 
]) +# +# SLRU pages per segment +# +AC_MSG_CHECKING([for slru pages per segment]) +PGAC_ARG_REQ(with, slru-pps, [SLRUPPS], [set slru segment size in blocks], + [slru_pages_per_segment=$withval], + [slru_pages_per_segment=32]) + +INT_MAX=2147483647 +AS_IF([test "${slru_pages_per_segment}" -lt 32], + [AC_MSG_ERROR([Invalid slru_pps value. It cannot be less than 32.])]) + +AS_IF([test "${slru_pages_per_segment}" -gt "${INT_MAX}"], + [AC_MSG_ERROR([Invalid slru_pps value. It cannot be greater than INT_MAX.])]) + +AS_IF([test $(expr ${slru_pages_per_segment} % 2) -ne 0], + [AC_MSG_ERROR([Invalid slru_pps value. It must be a power of 2.])]) + +AC_MSG_RESULT([${slru_pages_per_segment} pages]) + +AC_DEFINE_UNQUOTED([SLRU_PAGES_PER_SEGMENT], ${slru_pages_per_segment}, [ + SLRU segment size in pages. A page is the same BLCKSZ as is used everywhere + else in Postgres. The segmen
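To put the pg_upgrade-testing argument from the first message in perspective, a quick back-of-the-envelope calculation (assuming the default 8 kB BLCKSZ and the fact that pg_xact stores two status bits, i.e. four transactions, per byte):

***
/* With the hardcoded SLRU_PAGES_PER_SEGMENT = 32, one pg_xact segment
 * covers 32 * 8192 * 4 = 1,048,576 transactions, so a pg_upgrade test
 * has to consume roughly a million transactions for every additional
 * segment it wants to exercise. */
#define BLCKSZ                  8192    /* default block size */
#define CLOG_XACTS_PER_BYTE     4       /* 2 status bits per transaction */
#define SLRU_PAGES_PER_SEGMENT  32      /* current hardcoded value */
#define XACTS_PER_CLOG_SEGMENT \
    (SLRU_PAGES_PER_SEGMENT * BLCKSZ * CLOG_XACTS_PER_BYTE)
***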
[BUG] Possible occurrence of segfault in ecpg test
Hi, The src/interfaces/ecpg/test/sql/bytea.pgc file contains the following code : *** init(); exec sql truncate test; exec sql insert into test values(:send_buf[0], :send_buf[1]); exec sql insert into test values(:send_buf[0], :send_buf[1]); exec sql select data1 into :recv_vlen_buf from test; dump_binary(recv_vlen_buf[0].arr, recv_vlen_buf[0].len, 0); dump_binary(recv_vlen_buf[1].arr, recv_vlen_buf[1].len, 0); free(recv_vlen_buf); *** recv_vlen_buf is initialized in the following way : *** bytea recv_vlen_buf[][DATA_SIZE]; recv_vlen_buf = NULL *** Thus, if the program behaves in an unexpected way and the transaction is aborted before it executes the "select data1 into :recv_vlen_buf from test" query, dump_binary will refer to a null pointer. So, instead of an error message, the user will see a segfault. I think that in all such cases it is worth adding some checks into .pgc and .c files (like in attached patch) -- Best regards, Daniil Davydov From 36b01661fd6b9083b4cd1620020ebf4f737740b9 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Tue, 10 Dec 2024 12:34:59 +0700 Subject: [PATCH] Fix refer to a null pointer --- src/interfaces/ecpg/test/expected/sql-bytea.c | 62 ++- .../ecpg/test/expected/sql-bytea.stderr | 62 +-- src/interfaces/ecpg/test/sql/bytea.pgc| 10 ++- 3 files changed, 71 insertions(+), 63 deletions(-) diff --git a/src/interfaces/ecpg/test/expected/sql-bytea.c b/src/interfaces/ecpg/test/expected/sql-bytea.c index 8338c6008d..c0cd962d90 100644 --- a/src/interfaces/ecpg/test/expected/sql-bytea.c +++ b/src/interfaces/ecpg/test/expected/sql-bytea.c @@ -254,37 +254,41 @@ if (sqlca.sqlcode < 0) sqlprint();} if (sqlca.sqlcode < 0) sqlprint();} #line 90 "bytea.pgc" - dump_binary(recv_vlen_buf[0].arr, recv_vlen_buf[0].len, 0); - dump_binary(recv_vlen_buf[1].arr, recv_vlen_buf[1].len, 0); - free(recv_vlen_buf); + + if (recv_vlen_buf != NULL) + { + dump_binary(recv_vlen_buf[0].arr, recv_vlen_buf[0].len, 0); + dump_binary(recv_vlen_buf[1].arr, recv_vlen_buf[1].len, 0); + free(recv_vlen_buf); + } /* Test for dynamic sql statement with normal host variable, indicator */ init(); { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "truncate test", ECPGt_EOIT, ECPGt_EORT); -#line 97 "bytea.pgc" +#line 101 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 97 "bytea.pgc" +#line 101 "bytea.pgc" { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_execute, "ins_stmt", ECPGt_bytea,&(send_buf[0]),(long)512,(long)1,sizeof(struct bytea_1), ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_bytea,&(send_buf[1]),(long)512,(long)1,sizeof(struct bytea_1), ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EOIT, ECPGt_EORT); -#line 98 "bytea.pgc" +#line 102 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 98 "bytea.pgc" +#line 102 "bytea.pgc" { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_execute, "sel_stmt", ECPGt_EOIT, ECPGt_bytea,&(recv_buf[0]),(long)DATA_SIZE,(long)1,sizeof(struct bytea_2), ECPGt_int,&(ind[0]),(long)1,(long)1,sizeof(int), ECPGt_bytea,&(recv_short_buf),(long)DATA_SIZE - LACK_SIZE,(long)1,sizeof(struct bytea_4), ECPGt_int,&(ind[1]),(long)1,(long)1,sizeof(int), ECPGt_EORT); -#line 99 "bytea.pgc" +#line 103 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 99 "bytea.pgc" +#line 103 "bytea.pgc" dump_binary(recv_buf[0].arr, recv_buf[0].len, ind[0]); dump_binary(recv_short_buf.arr, recv_short_buf.len, ind[1]); @@ -292,81 +296,81 @@ if (sqlca.sqlcode < 0) sqlprint();} /* Test for dynamic sql statement with sql descriptor */ init(); { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "truncate test", ECPGt_EOIT, 
ECPGt_EORT); -#line 105 "bytea.pgc" +#line 109 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 105 "bytea.pgc" +#line 109 "bytea.pgc" { ECPGset_desc(__LINE__, "idesc", 1,ECPGd_data, ECPGt_bytea,&(send_buf[0]),(long)512,(long)1,sizeof(struct bytea_1), ECPGd_EODT); -#line 106 "bytea.pgc" +#line 110 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 106 "bytea.pgc" +#line 110 "bytea.pgc" { ECPGset_desc(__LINE__, "idesc", 2,ECPGd_data, ECPGt_bytea,&(send_buf[1]),(long)512,(long)1,sizeof(struct bytea_1), ECPGd_EODT); -#line 107 "bytea.pgc" +#line 111 "bytea.pgc" if (sqlca.sqlcode < 0) sqlprint();} -#line 107 "bytea.pgc" +#line 111 "bytea.pgc" { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_execute, "ins_stmt", ECPGt_descriptor, "idesc", 1L, 1L, 1L, ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EOIT, ECPGt_EORT); -#line 108 "bytea.pgc" +#line 112 "bytea.
Re: Repeatable read transaction doesn't see dropped table
On Mon, Dec 23, 2024 at 3:17 PM David G. Johnston wrote: > The quoted section describes how two consecutive select queries will see the > same data. Your example shows how a single query behaves in isolation. The > “as the first query saw it” is fundamentally important since until it > successfully executes there are no locks being held restricting the changing > of non-data structural aspects of the database. In short, the snapshot > doesn’t include an object until it is requested. It’s a repeatable read, not > a frozen point-in-time read. The performance implications for the later > would be unacceptable. > > Thus, the behavior is expected and needed as-is; but I would say that the > concurrency control chapter of the documentation is one of the harder to > actually learn and understand. It is a challenging topic, so I get why. In > its defense, the commentary surrounding the regarding control record and > detail does try to make this distinction clear to the reader. YMMV as to its > effectiveness in this regard. > > David J. > Thank you for your comments. OK, I agree that there are no contradictions from the point of view of the source code. But essentially, snapshot is just a range of xids, and we must not see changes of transactions that started after the snapshot was taken. As far as I understand, documentation says that repeatable read transactions take a snapshot at first non-transaction-control statement, and this snapshot remains relevant for all statements within the transaction. My example shows that the second session sees changes that have been made by a transaction that started after snapshot creation (and it sees them only because of cache optimization). It might be unexpected behavior for users. I also agree that taking it into account will reduce performance, but maybe we can clarify this aspect in documentation (?) -- Best regards, Daniil Davydov
Repeatable read transaction doesn't see dropped table
Hi,

The documentation for PostgreSQL 17 says the following: "query in a repeatable read transaction sees a snapshot as of the start of the first non-transaction-control statement in the transaction, not as of the start of the current statement within the transaction". But I noticed this behavior (REL_17_STABLE):

***
SESSION 1: create two user tables and fill them with data
CREATE TABLE test (id INT);
CREATE TABLE test_1 (id INT);
INSERT INTO test VALUES (1);
INSERT INTO test_1 VALUES (1);

SESSION 2: begin a transaction and allow it to take a snapshot
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM test_1;

SESSION 1: drop the table that was not accessed from the second session
DROP TABLE test;

SESSION 2:
SELECT * FROM test;
***

If I'm not mistaken, the second transaction must see all the data in table test (according to the documentation), but an error occurs:

***
ERROR: relation "test" does not exist
LINE 1: select * from test;
***

We get this behavior because the second session searches for the table's oid in the cache via RelnameGetRelid, but the first session has already invalidated that entry. It seems like a bug to me, so I suggest additionally scanning pg_class in such cases. I would like to know your opinion.

--
Best regards,
Daniil Davydov
Accessing an invalid pointer in BufferManagerRelation structure
Hi, Postgres allows us to pass BufferManagerRelation structure to functions in two ways : BMR_REL and BMR_SMGR. In case we use BMR_REL, the "smgr" field of structure initialized this way : *** if (bmr.smgr == NULL) { bmr.smgr = RelationGetSmgr(bmr.rel); bmr.relpersistence = bmr.rel->rd_rel->relpersistence; } *** Thus, we set the "smgr" field only once. But in case of frequent cache invalidation (for example with debug_discard_caches parameter enabled), this pointer may become invalid (because RelationCloseSmgr will be called). I have not found any places in the current code where this could happen. But if (just for example) we add acquiring of new lock into ExtendBufferedRelLocal or ExtendBufferedRelShared, relation cache will be invalidated (inside AcceptInvalidationMessages). I would suggest adding a special macro to access the "smgr" field (check attached patch for REL_17_STABLE). What do you think about this? -- Best regards, Daniil Davydov From 828e798d2a4d42aa901f4c02e9a0c76ebb057c41 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 27 Jan 2025 18:36:10 +0700 Subject: [PATCH] Add marcos for safety access to smgr field of BufferManagerRelation structure --- src/backend/storage/buffer/bufmgr.c | 54 --- src/backend/storage/buffer/localbuf.c | 8 ++-- src/include/storage/bufmgr.h | 13 +++ 3 files changed, 41 insertions(+), 34 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 6181673095..d2c9297edb 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -886,11 +886,8 @@ ExtendBufferedRelBy(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_by > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } return ExtendBufferedRelCommon(bmr, fork, strategy, flags, extend_by, InvalidBlockNumber, @@ -922,11 +919,8 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_to != InvalidBlockNumber && extend_to > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } /* * If desired, create the file if it doesn't exist. If @@ -934,15 +928,15 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * an smgrexists call. */ if ((flags & EB_CREATE_FORK_IF_NEEDED) && - (bmr.smgr->smgr_cached_nblocks[fork] == 0 || - bmr.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) && - !smgrexists(bmr.smgr, fork)) + (BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == 0 || + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == InvalidBlockNumber) && + !smgrexists(BMR_GET_SMGR(bmr), fork)) { LockRelationForExtension(bmr.rel, ExclusiveLock); /* recheck, fork might have been created concurrently */ - if (!smgrexists(bmr.smgr, fork)) - smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY); + if (!smgrexists(BMR_GET_SMGR(bmr), fork)) + smgrcreate(BMR_GET_SMGR(bmr), fork, flags & EB_PERFORMING_RECOVERY); UnlockRelationForExtension(bmr.rel, ExclusiveLock); } @@ -952,13 +946,13 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * kernel. */ if (flags & EB_CLEAR_SIZE_CACHE) - bmr.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber; + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] = InvalidBlockNumber; /* * Estimate how many pages we'll need to extend by. This avoids acquiring * unnecessarily many victim buffers. 
*/ - current_size = smgrnblocks(bmr.smgr, fork); + current_size = smgrnblocks(BMR_GET_SMGR(bmr), fork); /* * Since no-one else can be looking at the page contents yet, there is no @@ -1002,7 +996,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, if (buffer == InvalidBuffer) { Assert(extended_by == 0); - buffer = ReadBuffer_common(bmr.rel, bmr.smgr, 0, + buffer = ReadBuffer_common(bmr.rel, BMR_GET_SMGR(bmr), 0, fork, extend_to - 1, mode, strategy); } @@ -2144,10 +2138,10 @@ ExtendBufferedRelCommon(BufferManagerRelation bmr, BlockNumber first_block; TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork, - bmr.smgr->smgr_rlocator.locator.spcOid, - bmr.smgr->smgr_rlocator.locator.dbOid, - bmr.smgr->smgr_rlocator.locator.relNumber, - bmr.smgr->smgr_rlocator.backend, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.spcOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.dbOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.relNumber, + BMR_GET_SMGR(bmr)->smgr_rlocator.backend, extend_by); if (bmr.relpersistence == RELPERSIS
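For readers skimming the flattened diff above: the core of the proposal is replacing direct use of the cached bmr.smgr pointer with an accessor. One possible definition of such an accessor (a sketch of the idea, not necessarily the exact macro in the attached patch) could be:

***
/* Re-derive the SMgrRelation from the Relation on every access, so a
 * relcache invalidation (which calls RelationCloseSmgr()) cannot leave
 * the BufferManagerRelation holding a dangling pointer.  Callers that
 * constructed the struct via BMR_SMGR() keep using their explicit smgr. */
#define BMR_GET_SMGR(bmr) \
    ((bmr).rel != NULL ? RelationGetSmgr((bmr).rel) : (bmr).smgr)
***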
Improve consistency checks in tbm_prepare_shared_iterate
Hi,

Currently, the iteration over a TIDBitmap contains this code (REL_16_STABLE):

***
while ((page = pagetable_iterate(tbm->pagetable, &i)) != NULL)
{
    idx = page - ptbase->ptentry;
    if (page->ischunk)
        ptchunks->index[nchunks++] = idx;
    else
        ptpages->index[npages++] = idx;
}
Assert(npages == tbm->npages);
Assert(nchunks == tbm->nchunks);
***

The two asserts at the end seem to come too late to me, because if (for example) nchunks > tbm->nchunks (due to some other bug), we have already accessed memory beyond the valid chunk and overwritten it. If we want to monitor the correctness of these variables, it might be better to add a few checks, as in the attached patch. I'm not sure if the comment in the error message is written correctly, but first I would like to hear your opinion.

--
Best regards,
Daniil Davydov

From 04e41969ef9ea2c96af218a5519b5efb19baeed5 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Wed, 25 Dec 2024 19:32:55 +0700 Subject: [PATCH] Add new checks for tidbitmap scan --- src/backend/nodes/tidbitmap.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c index 29a18584410..34a015e5ea7 100644 --- a/src/backend/nodes/tidbitmap.c +++ b/src/backend/nodes/tidbitmap.c @@ -827,10 +827,17 @@ tbm_prepare_shared_iterate(TIDBitmap *tbm) while ((page = pagetable_iterate(tbm->pagetable, &i)) != NULL) { idx = page - ptbase->ptentry; -if (page->ischunk) +if (page->ischunk && (nchunks + 1 <= tbm->nchunks)) ptchunks->index[nchunks++] = idx; -else +else if (npages + 1 <= tbm->npages) ptpages->index[npages++] = idx; +else + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("invalid chunks/pages num in tidbitmap : %d" + " chunks met of %d available, %d pages met " + "of %d available", nchunks, tbm->nchunks, + npages, tbm->npages))); } Assert(npages == tbm->npages); -- 2.43.0
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 1:15 PM David G. Johnston wrote: > > It’s seems like the bug “session A can read and write to session B’s tables” > has gotten lost in this thread that pertains to something related but > different. Solving the bug is not going to involve adding a new GUC. I don't think it's worth putting this in a separate discussion. Now everything is moving towards the fact that the superuser will be prohibited from changing the temporary tables of other sessions. In fact, he can't do it anyway (except for DROP command) - see [1]. But now the error for INSERT, UPDATE, DELETE and SELECT commands may not surface immediately due to errors in the code. The only question now is whether superuser should be allowed to DROP these other temp tables. Since opinions are divided, I suggested adding a GUC that will only control this feature. > If they specify the precise pg_temp schema to affect they likely didn’t make > a mistake. Yep, I agree. However, the features of the postgres kernel do not allow the superuser to correctly perform INSERT, UPDATE, DELETE, SELECT operations, because temporary table's pages cannot be stored in shared buffers. > If instead we are discussing the owner of the temporary table there is > probably a discussion to be had and decision to be documented somewhere - > maybe that central place of testing being wished for. As far as I understand, only superuser and table's owner (within session) can access the temp tables of session. We want CRUD operations to be performed only by the owner. -- Best regards, Daniil Davydov [1] https://www.postgresql.org/message-id/CAJDiXgj6TBzn=6ezx7+9bna9hpbitbu+muv-n3mhen_zs3n...@mail.gmail.com
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 8:29 PM Steven Niu wrote: > > If the (relation->relpersistence == RELPERSISTENCE_TEMP) can ensure the > myTempNamespace is always valid, then my comment can be ignored. No, even if we see a temporary table in RangeVarGetRelidExtended, MyTempNamespace still can be `InvalidOid` (I mentioned it in the previous letter). Thus, comment like "this probably can't happen?" should be removed. > Otherwise I think the modified RangeVarGetRelidExtended() should keep > check of myTempNamespace, like this: > > if (relation->relpersistence == RELPERSISTENCE_TEMP) > { >Oid namespaceId; > > if (!OidIsValid(myTempNamespace)) > relId = InvalidOid; /* this probably can't happen? */ > ... OK, let's assume that MyTempNamespace == InvalidOid and caller specified "pg_temp_N" in his query. In this case we want to throw an error, because access to the other temp tables is prohibited. If we keep code like "if (!OidIsValid(myTempNamespace)) => relId = InvalidOid", eventually the caller will get an error "relation ... does not exist". Yes, we have saved the caller from making mistakes, but we are giving the wrong error message. In my realization the caller will get the correct error message, and I still think that we should keep v3 patch logic. -- Best regards, Daniil Davydov
Re: Forbid to DROP temp tables of other sessions
Hi,

On Tue, Mar 18, 2025 at 2:36 AM David G. Johnston wrote:
>
> "I want to give a better error message" is not a good enough reason to change
> this long-standing behavior in a back-patchable bug fix.
> and I do not agree that it is an improvement worth making in HEAD.

OK, in this case I'd rather agree.

On Tue, Mar 18, 2025 at 3:30 AM David G. Johnston wrote:
>
> I sound like a broken record but this is a bug fix; introducing a GUC and
> changing unrelated behaviors is not acceptable in a back-patch.
>
> Superusers have been able to drop the temporary tables of other sessions for
> a long time - now is not the place to change that behavior.

Well, we can keep this option for the superuser, but we need to explicitly indicate that we are inside the DROP operation. Please see the v5 patch. It is a bit experimental:
1) The original behavior of the LookupExplicitNamespace function has been restored.
2) A new RVROption has been added to indicate that we are inside the DROP operation. Thus, a superuser can drop other sessions' temp tables (a rough sketch of this idea is shown at the end of this message).

> Separately:
>
> Is the general logic here that we assume r->relpersistence is
> RELPERSISTENCE_PERMANENT until and unless we can prove it to be
> RELPERSISTENCE_TEMP?
>
> I'm trying to figure out whether there is still an issue when dealing with an
> unqualified name that would be found in the temporary schema.
> We've obviously already marked it permanent since we didn't have pg_temp in
> the query to inform us otherwise.
> But if we are changing permanent to temporary later because of search_path
> resolution then why did we have to get it right in the case where pg_temp is
> specified?
> I question whether "persistence" is something that gram.y should be dealing
> with at all. Shouldn't that property be solely determined in post-parse
> analysis via catalog lookup?

Hm, let's dive into this question. Why do I consider the original behavior to be incorrect?

1) If the schema is not specified, then gram.y marks the table as PERMANENT.
1.1) If we are looking for our own temporary table, then "pg_temp_N" is present in search_path, and RelnameGetRelid will return us this table.
1.2) If we are looking for another session's temporary table, then RelnameGetRelid will return `InvalidOid` and we will get a "relation not found" error.

2) If the schema is specified, then gram.y ALSO marks the table as PERMANENT.
2.1) If we are looking for our own temporary table, then LookupExplicitNamespace will return us MyTempNamespace and we can get our table without a search_path lookup.
2.2) If we are looking for another session's temporary table, then LookupExplicitNamespace will return some valid oid, and we can get the other temp table without a search_path lookup. Then we will perform whatever operation was specified, assuming that we are working with a PERMANENT table. This is a bug.

Let's summarize. If the schema is not specified, we can safely mark the table as PERMANENT, because we can always get the table from search_path (and if the schema is not specified, then it is obvious that the user wants to refer specifically to his own table). BUT, if the schema is specified, we must know that the given relation is TEMP before calling RangeVarGetRelidExtended (otherwise, there will be the error which I wrote about in paragraph 2.2).
I hope that I managed to clarify this issue. Since the "pg_temp_" prefix is reserved by the postgres kernel, we can definitely say that we have encountered a temporary table when we see this prefix. IMO there is no problem with gram.y doing it.

> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index 271ae26cbaf..f68948d475c 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -19424,7 +19424,11 @@ makeRangeVarFromAnyName(List *names, int position,
> core_yyscan_t yyscanner)
> break;
> }
>
> - r->relpersistence = RELPERSISTENCE_PERMANENT;
> + if (r->schemaname && strncmp(r->schemaname, "pg_temp", 7) == 0)
> + r->relpersistence = RELPERSISTENCE_TEMP;
> + else
> + r->relpersistence = RELPERSISTENCE_PERMANENT;
> +
> r->location = position;
>
> The qualified name variant is fine since r->schemaname must be present by
> definition. Passing through the same if block, once with schemaname null (in
> makeRangeVar) and once with it populated (end of
> makeRangeVarFromQualifiedName) is a bit annoying.
>
> makeRangeVar has the same prob
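Since the v5 patch itself is not quoted in this thread excerpt, here is a rough sketch of the RVROption idea mentioned above (the flag name and value are illustrative only):

***
/* namespace.h: hypothetical flag for RangeVarGetRelidExtended(), next to
 * RVR_MISSING_OK, RVR_NOWAIT and RVR_SKIP_LOCKED. */
#define RVR_OTHER_TEMP_OK   (1 << 3)

/* namespace.c, in the RELPERSISTENCE_TEMP branch: only callers that pass
 * the flag (i.e. the DROP path) may resolve another backend's temp table. */
if (namespaceId != myTempNamespace && (flags & RVR_OTHER_TEMP_OK) == 0)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("could not access temporary relations of other sessions")));
***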
Re: Forbid to DROP temp tables of other sessions
Hi, On Wed, Mar 19, 2025 at 6:32 AM David G. Johnston wrote: > > I'm probably done trying to help with this one - it's beyond my ability and > desire to contribute to meaningfully. It seems to need to be escalated if > the regression has a chance of being understood and fixed. > Some good candidates for helping out are copied already so I'm just hoping > one or more of them chimes in. I'll repeat my suggestion to start a new > thread and let this one die. The answer to $subject is no. Well, the name of the thread has stopped reflecting what is being discussed here. I may start a new discussion in the near future, but as part of this discussion I would like to make sure that people agree with the statements from my previous letter. Thanks for your help. -- Best regards, Daniil Davydov
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
Hi,

Recently I took more careful measurements of the performance. I compared three branches with each other: HEAD, Patched, and Patched with tuplestore. Here are the results:

1) Test case: matview creation test attached in the email from Jingtang Zhang.
10 measurements for each branch. Result in wall clock execution time:
HEAD                     30.532 +- 0.59  seconds elapsed
Patched                  20.454 +- 0.114 seconds elapsed
Patched with tuplestore  19.653 +- 0.111 seconds elapsed

2)
-- init.sql
drop table test_insert;
vacuum;
checkpoint;
create table test_insert(i int, f float);

-- iowrite.sql
insert into test_insert select g, (g % 100) / 100.0 from generate_series(1, 100) as g;

Test case: pgbench -f iowrite.sql -n -j 4 -c 10 -T 40
5 measurements for each branch. Result in tps:
HEAD                     1.025 +- 0.009
Patched                  2.923 +- 0.032
Patched with tuplestore  2.987 +- 0.011

P.S. I cannot find a commitfest entry for this patch. Should we add it there?

--
Best regards,
Daniil Davydov
Re: Forbid to DROP temp tables of other sessions
Hi, I see that the presence of isolation tests in the patch is controversial. First things first, let's concentrate on fixing the bug. I attach a new version of patch (for `master` branch) to this letter. It contains better comments and a few small improvements. P.S. Sorry for bad formatting in previous letter (idk how to fix it in gmail client) -- Best regards, Daniil Davydov From e5f568ece877a33d307ac6480cef9dd088412aae Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 17 Mar 2025 15:43:15 +0700 Subject: [PATCH v2] Fix accessing other sessions temp tables --- src/backend/catalog/namespace.c | 71 +++-- src/backend/nodes/makefuncs.c | 6 ++- src/backend/parser/gram.y | 11 - 3 files changed, 57 insertions(+), 31 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index d97d632a7ef..688819299af 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -499,28 +499,40 @@ RangeVarGetRelidExtended(const RangeVar *relation, LOCKMODE lockmode, */ if (relation->relpersistence == RELPERSISTENCE_TEMP) { - if (!OidIsValid(myTempNamespace)) -relId = InvalidOid; /* this probably can't happen? */ - else - { -if (relation->schemaname) -{ - Oid namespaceId; + Oid namespaceId; - namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); + if (relation->schemaname) + { +namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); +if (namespaceId != myTempNamespace) +{ /* - * For missing_ok, allow a non-existent schema name to - * return InvalidOid. + * We don't allow users to access temp tables of other + * sessions (consider adding GUC for to allow to DROP such + * tables?). */ - if (namespaceId != myTempNamespace) - ereport(ERROR, -(errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("temporary tables cannot specify a schema name"))); + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("could not access temporary relations of other sessions"))); } + } + else + { +namespaceId = myTempNamespace; -relId = get_relname_relid(relation->relname, myTempNamespace); +/* + * If this table was recognized as temporary, it means that we + * found it because backend's temporary namespace was specified + * in search_path. Thus, MyTempNamespace must contain valid oid. + */ +Assert(OidIsValid(namespaceId)); } + + if (missing_ok && !OidIsValid(namespaceId)) +relId = InvalidOid; + else +relId = get_relname_relid(relation->relname, namespaceId); } else if (relation->schemaname) { @@ -3387,17 +3399,18 @@ LookupExplicitNamespace(const char *nspname, bool missing_ok) Oid namespaceId; AclResult aclresult; - /* check for pg_temp alias */ + /* + * If namespace specified in 'pg_temp' format (without owner backend id + * specification), we assume that this is our temporary namespace. + */ if (strcmp(nspname, "pg_temp") == 0) { - if (OidIsValid(myTempNamespace)) - return myTempNamespace; + if (!OidIsValid(myTempNamespace)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_SCHEMA_NAME), + errmsg("pg_temp was specified, but you have not any temporary relations"))); - /* - * Since this is used only for looking up existing objects, there is - * no point in trying to initialize the temp namespace here; and doing - * so might create problems for some callers --- just fall through. 
- */ + return myTempNamespace; } namespaceId = get_namespace_oid(nspname, missing_ok); @@ -3553,21 +3566,19 @@ get_namespace_oid(const char *nspname, bool missing_ok) RangeVar * makeRangeVarFromNameList(const List *names) { - RangeVar *rel = makeRangeVar(NULL, NULL, -1); + RangeVar *rel; switch (list_length(names)) { case 1: - rel->relname = strVal(linitial(names)); + rel = makeRangeVar(NULL, strVal(linitial(names)), -1); break; case 2: - rel->schemaname = strVal(linitial(names)); - rel->relname = strVal(lsecond(names)); + rel = makeRangeVar(strVal(linitial(names)), strVal(lsecond(names)), -1); break; case 3: + rel = makeRangeVar(strVal(lsecond(names)), strVal(lthird(names)), -1); rel->catalogname = strVal(linitial(names)); - rel->schemaname = strVal(lsecond(names)); - rel->relname = strVal(lthird(names)); break; default: ereport(ERROR, @@ -3774,6 +3785,8 @@ GetTempNamespaceProcNumber(Oid namespaceId) return INVALID_PROC_NUMBER; /* no such namespace? */ if (strncmp(nspname, "pg_temp_", 8) == 0) result = atoi(nspname + 8); + else if (strncmp(nspname, "pg_temp", 7) == 0) + result = MyProcNumber; else if (strncmp(nspname, "pg_toast_temp_", 14) == 0)
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 5:33 PM Steven Niu wrote: > > I mean RangeVarGetRelidExtended(), you deleted the following code: > > if (!OidIsValid(myTempNamespace)) > relId = InvalidOid; /* this probably can't happen? */ Hm, I got it. Let's start with the fact that the comment "this probably can't happen?" is incorrect. Even if we don't have a temporary namespace in our session, we still can specify "pg_temp_N" in the psql query. Next, if relation->schemaname is pg_temp, then we firstly call LookupExplicitNamespace, where you can find if (!OidIsValid(myTempNamespace)) check. -- Best regards, Daniil Davydov
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 4:48 PM Steven Niu wrote: > > 1. namespace.c, if relation->schemaname is pg_temp but myTempNamespace > isn't set, the error information might be misleading. Consider checking > OidIsValid(myTempNamespace) first. Could you please clarify exactly which place in the code we are talking about? I think we handle this case in the LookupExplicitNamespace call (with all appropriate error information). > > 2."you have not any temporary relations" --> "you have no any temporary > relations" I am not an English speaker, but it seems that "have not" would be more correct. Someone has to judge us :) > > 3. Regarding to the code "strncmp(nspname, "pg_temp", 7)", is it ok when > the nspname contains something like "pg_temp_1234"? I think we should > use strcmp instead of strncmp for exact matching. Great catch! I'll fix it. Please, see v3 patch. -- Best regards, Daniil Davydov From 74dc11678c55eb80ad48ad44fc77d770620ea0a7 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 17 Mar 2025 15:43:15 +0700 Subject: [PATCH v3] Fix accessing other sessions temp tables --- src/backend/catalog/namespace.c | 71 +++-- src/backend/nodes/makefuncs.c | 6 ++- src/backend/parser/gram.y | 11 - 3 files changed, 57 insertions(+), 31 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index d97d632a7ef..b19c496a0ec 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -499,28 +499,40 @@ RangeVarGetRelidExtended(const RangeVar *relation, LOCKMODE lockmode, */ if (relation->relpersistence == RELPERSISTENCE_TEMP) { - if (!OidIsValid(myTempNamespace)) -relId = InvalidOid; /* this probably can't happen? */ - else - { -if (relation->schemaname) -{ - Oid namespaceId; + Oid namespaceId; - namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); + if (relation->schemaname) + { +namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); +if (namespaceId != myTempNamespace) +{ /* - * For missing_ok, allow a non-existent schema name to - * return InvalidOid. + * We don't allow users to access temp tables of other + * sessions (consider adding GUC for to allow to DROP such + * tables?). */ - if (namespaceId != myTempNamespace) - ereport(ERROR, -(errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("temporary tables cannot specify a schema name"))); + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("could not access temporary relations of other sessions"))); } + } + else + { +namespaceId = myTempNamespace; -relId = get_relname_relid(relation->relname, myTempNamespace); +/* + * If this table was recognized as temporary, it means that we + * found it because backend's temporary namespace was specified + * in search_path. Thus, MyTempNamespace must contain valid oid. + */ +Assert(OidIsValid(namespaceId)); } + + if (missing_ok && !OidIsValid(namespaceId)) +relId = InvalidOid; + else +relId = get_relname_relid(relation->relname, namespaceId); } else if (relation->schemaname) { @@ -3387,17 +3399,18 @@ LookupExplicitNamespace(const char *nspname, bool missing_ok) Oid namespaceId; AclResult aclresult; - /* check for pg_temp alias */ + /* + * If namespace specified in 'pg_temp' format (without owner backend id + * specification), we assume that this is our temporary namespace. 
+ */ if (strcmp(nspname, "pg_temp") == 0) { - if (OidIsValid(myTempNamespace)) - return myTempNamespace; + if (!OidIsValid(myTempNamespace)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_SCHEMA_NAME), + errmsg("pg_temp was specified, but you have not any temporary relations"))); - /* - * Since this is used only for looking up existing objects, there is - * no point in trying to initialize the temp namespace here; and doing - * so might create problems for some callers --- just fall through. - */ + return myTempNamespace; } namespaceId = get_namespace_oid(nspname, missing_ok); @@ -3553,21 +3566,19 @@ get_namespace_oid(const char *nspname, bool missing_ok) RangeVar * makeRangeVarFromNameList(const List *names) { - RangeVar *rel = makeRangeVar(NULL, NULL, -1); + RangeVar *rel; switch (list_length(names)) { case 1: - rel->relname = strVal(linitial(names)); + rel = makeRangeVar(NULL, strVal(linitial(names)), -1); break; case 2: - rel->schemaname = strVal(linitial(names)); - rel->relname = strVal(lsecond(names)); + rel = makeRangeVar(strVal(linitial(names)), strVal(lsecond(names)), -1); brea
Re: Forbid to DROP temp tables of other sessions
Hi, On Sun, Mar 16, 2025 at 7:53 PM vignesh C wrote: > I noticed that the following Andrey's comment regarding the isolation > test from [1] and Andres's comment from [2] are pending. I'm changing > the commitfest entry to Waiting on Author, please provide an updated > patch and update it to Needs review. Thanks for reading it. I saw [2] and introduced a possible solution in my last email. In short: we can have a GUC variable that will permit a superuser to drop temp tables of other sessions. Thus, we have a single location in the code that will check whether we can perform operations on other sessions' temp tables. As far as I understand, this is exactly what Andres wrote about. Also, it is difficult for me to express my opinion on [1] at the moment. I can say for sure that the tests will change when we agree on the behavior of the code. Therefore, I suggest postponing the resolution of this issue. > I suggest adding a GUC that will allow superuser to do this Waiting for your feedback on this issue :) -- Best regards, Daniil Davydov
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 1:52 PM David G. Johnston wrote: > > My understanding is the limitation of an owner of a temporary relation in one > session being disallowed to alter its contents from another session is an > implementation consequence, and not some fundamental model restriction. I would say this: working with temporary tables (in the postgres kernel) is so very different from working with regular heap tables that such a limitation can be considered fundamental. We simply will not be able to organize coordinated work on the temporary table from different OS processes (for INSERT, UPDATE, DELETE). > Minimally informed thinking, associate the specific pg_temp namespace with a > procid. Where this limitation exists, which seems like middle management, > compare the proc of the namespace to the executor. Pass the role and also an > enum of action type (CRUD, drop, truncate, lock, etc…). If the procs match > all good. Superuser cannot bypass CRUD and similar as that is the limitation > being implemented here. And the owner cannot bypass anything (exceptions > could be added as desired). > > Centralizing things a bit though…maybe something like the relcache (for > namespaces…) so you cannot even get a handle on the namespace if you don’t > supply the info and pass the checks. Don’t really know enough to say > where/how to implement “if you forget to call this check all commands that > can reference tables will fail”. I'm sorry, I probably don't quite understand what this is about, so I'll just describe how it works now. If a superuser wants to access other temp table, he must specify schema_name in request (for example : INSERT INTO pg_temp_N.test .). N is the id of the owner process. Thus, postgres will call RangeVarGetRelidExtended to find the given relation's oid. It is at this point that we can step in and check whether the caller can work with the specified schema. It is elementary to understand that the schema does belong to another session. Right now, there is a bug in the code that mistakenly recognizes the table in such a case as permanent (not temporary), so we cannot do what I just described. So, we have to get rid of this bug and decide whether we reserve the right for the superuser to DROP such tables. I hope this remark will be useful. -- Best regards, Daniil Davydov
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
Hi, On Sun, Apr 6, 2025 at 8:55 PM Jingtang Zhang wrote: > > It was quite a while since I last looked at the patch. I've tested it again, > and still get regression on patched version where a table has many columns. > And it is totally CPU-bounded on tts_virtual_copyslot. > > Unpatched version: > 1 col: > Time: 8909.714 ms (00:08.910) > Time: 8803.579 ms (00:08.804) > Time: 8600.415 ms (00:08.600) > 32 cols: > Time: 12911.699 ms (00:12.912) > Time: 13543.491 ms (00:13.543) > Time: 13325.368 ms (00:13.325) > > Patched version: > 1 col: > Time: 3532.841 ms (00:03.533) > Time: 3598.223 ms (00:03.598) > Time: 3515.858 ms (00:03.516) > 32 cols: > Time: 35647.724 ms (00:35.648) > Time: 35596.233 ms (00:35.596) > Time: 35669.106 ms (00:35.669) > Hm, maybe I didn't choose the best way to measure performance. Can you please share how you do it? > I've tested your patch with tuplestore and found the regression does not exist > anymore, but I haven't look deep inside it. > > Patched version (with tuplestore): > 1 col: > Time: 3500.502 ms (00:03.501) > Time: 3486.886 ms (00:03.487) > Time: 3514.233 ms (00:03.514) > 32 cols: > Time: 10375.391 ms (00:10.375) > Time: 10248.256 ms (00:10.248) > Time: 10248.289 ms (00:10.248) > > It seems to be a good idea if there is no other issue with your patch. As far as I understand, the use of multi inserts for queries like "INSERT INTO ... SELECT FROM" is not discussed here anymore due to the fact that in such cases we will have to take into account the volatile functions and ROW triggers. I've been thinking about this for a while and made a patch as an experiment. The principles that the patch works on are listed below. 1) Since performance decreases for single INSERTs (within a multi inserts mechanism), I designed this feature as an option for the table. Thus, if the user knows that he will perform a lot of inserts on the table, he can specify "WITH (append_optimized=true)". 2) The availability of volatile functions is monitored during the construction of a subtree for a ModifyTable node. I'm not that familiar with the query plan construction mechanism, but it seems to me that this way we can track any occurrence of volatile functions. Of course, most volatile functions don't stop us from using multi inserts, but checking each such function would take a very long time, so the very fact of having a volatile function is enough for us to abandon multi-inserts. 3) Default expressions of the target table are also checked for volatile functions. The same rules apply to them as in (2). As an exception, I allowed the use of SERIAL in the column data type, since this is a fairly common use case. 4) If the target table contains any ROW triggers, we don't use multi insert. 5) Patch also contains a regression test. This is a "sandbox" where you can do some experiments with append-optimized tables. I hope that patch (targeted on 'master' branch, 2c7bd2ba507e273f2d7fe1b2f6d30775ed4f3c09) will be useful for this thread. 
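For reference, here is how the experimental reloption from the attached patch is meant to be used (a sketch that assumes the patch is applied; the table and column names are made up):

-- Opt a table into the multi-insert ("append optimized") path:
CREATE TABLE bulk_target (id int, payload text) WITH (append_optimized = true);

-- Statements of this shape are the intended beneficiaries, provided the
-- target table has no ROW triggers and no volatile functions are involved:
INSERT INTO bulk_target
SELECT g, repeat('x', 100) FROM generate_series(1, 1000000) AS g;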
-- Best regards, Daniil Davydov From 224378c11d270aabe28bdd32efacd37ed1984bd1 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 7 Apr 2025 12:55:50 +0700 Subject: [PATCH v1] Meet append optimized tables --- src/backend/access/common/reloptions.c| 11 + src/backend/access/heap/heapam.c | 205 ++ src/backend/access/heap/heapam_handler.c | 5 + src/backend/access/table/tableamapi.c | 5 + src/backend/commands/explain.c| 5 +- src/backend/executor/execExpr.c | 17 +- src/backend/executor/execProcnode.c | 9 + src/backend/executor/nodeModifyTable.c| 194 - src/backend/optimizer/plan/createplan.c | 1 + src/backend/optimizer/util/clauses.c | 28 ++- src/include/access/heapam.h | 41 src/include/access/tableam.h | 84 +++ src/include/nodes/execnodes.h | 6 + src/include/nodes/plannodes.h | 2 + src/include/optimizer/optimizer.h | 3 + src/include/utils/rel.h | 10 + .../regress/expected/append_optimized.out | 161 ++ src/test/regress/parallel_schedule| 2 + src/test/regress/sql/append_optimized.sql | 105 + 19 files changed, 879 insertions(+), 15 deletions(-) create mode 100644 src/test/regress/expected/append_optimized.out create mode 100644 src/test/regress/sql/append_optimized.sql diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 46c1dce222d..9652cf4179b 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -166,6 +166,15 @@ static relopt_bool boolRelOpts[] = }, true
Re: Forbid to DROP temp tables of other sessions
Hi, On Mon, Mar 17, 2025 at 10:09 PM David G. Johnston wrote: > > On Monday, March 17, 2025, Daniil Davydov <3daniss...@gmail.com> wrote: >> >> >> > >> > 2."you have not any temporary relations" --> "you have no any temporary >> > relations" >> I am not an English speaker, but it seems that "have not" would be >> more correct. Someone has to judge us :) >> > > Both are not good. > > “pg_temp was specified but it contains no relations” [1] That sounds reasonable. I'll fix it. Thanks! > But why are we promoting this situation to an error? It should be a relation > not found error just like any other and not its own special case. > The fact we create pg_temp only as it is needed is an implementation detail > that should not be visible to the user. 1) Error message has no mention of a non-existent schema. "Schema has no relations" => "Given relation not found in empty schema". It seems to me that these are equivalent statements. 2) Is this really the implementation detail that we want to hide from the user? User can just run "select pg_my_temp_schema();" and see that there is no temp schema in the current session. Don't get me wrong - I can agree with that, but for now it seems odd to me... Steven Niu also mentioned this issue, but IMO we must give the most accurate description of the problem - tell "relation not found" only if we have temp namespace, but not specified relation in it. Please see v4 patch (only comment change). -- Best regards, Daniil Davydov From 33e2b77607c26bb82a807711ecdec5174be64874 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 17 Mar 2025 15:43:15 +0700 Subject: [PATCH v4] Fix accessing other sessions temp tables --- src/backend/catalog/namespace.c | 71 +++-- src/backend/nodes/makefuncs.c | 6 ++- src/backend/parser/gram.y | 11 - 3 files changed, 57 insertions(+), 31 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index d97d632a7ef..9e96189920d 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -499,28 +499,40 @@ RangeVarGetRelidExtended(const RangeVar *relation, LOCKMODE lockmode, */ if (relation->relpersistence == RELPERSISTENCE_TEMP) { - if (!OidIsValid(myTempNamespace)) -relId = InvalidOid; /* this probably can't happen? */ - else - { -if (relation->schemaname) -{ - Oid namespaceId; + Oid namespaceId; - namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); + if (relation->schemaname) + { +namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); +if (namespaceId != myTempNamespace) +{ /* - * For missing_ok, allow a non-existent schema name to - * return InvalidOid. + * We don't allow users to access temp tables of other + * sessions (consider adding GUC for to allow to DROP such + * tables?). */ - if (namespaceId != myTempNamespace) - ereport(ERROR, -(errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("temporary tables cannot specify a schema name"))); + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("could not access temporary relations of other sessions"))); } + } + else + { +namespaceId = myTempNamespace; -relId = get_relname_relid(relation->relname, myTempNamespace); +/* + * If this table was recognized as temporary, it means that we + * found it because backend's temporary namespace was specified + * in search_path. Thus, MyTempNamespace must contain valid oid. 
+ */ +Assert(OidIsValid(namespaceId)); } + + if (missing_ok && !OidIsValid(namespaceId)) +relId = InvalidOid; + else +relId = get_relname_relid(relation->relname, namespaceId); } else if (relation->schemaname) { @@ -3387,17 +3399,18 @@ LookupExplicitNamespace(const char *nspname, bool missing_ok) Oid namespaceId; AclResult aclresult; - /* check for pg_temp alias */ + /* + * If namespace specified in 'pg_temp' format (without owner backend id + * specification), we assume that this is our temporary namespace. + */ if (strcmp(nspname, "pg_temp") == 0) { - if (OidIsValid(myTempNamespace)) - return myTempNamespace; + if (!OidIsValid(myTempNamespace)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_SCHEMA_NAME), + errmsg("pg_temp was specified but it contains no relations"))); - /* - * Since this is used only for looking up existing objects, there is - * no point in trying to initialize the temp namespace here; and doing - * so might create problems for some callers --- just fall through. - */ + return myTempName
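To make the visibility argument above concrete (a sketch; the error text is the one proposed in the v4 patch):

-- In a session that has not created any temporary objects yet:
SELECT pg_my_temp_schema();     -- returns 0: no temp schema exists for this backend

-- With the patch applied, an explicit pg_temp reference then fails with:
--   ERROR:  pg_temp was specified but it contains no relations
SELECT * FROM pg_temp.some_table;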
Re: Accessing an invalid pointer in BufferManagerRelation structure
Hi, On Wed, Mar 26, 2025 at 2:14 PM Stepan Neretin wrote: > > The first thing we both noticed is that the macro calls a function that won't > be available without an additional header. This seems a bit inconvenient. Well, I rebased the patch onto the latest `master` (b51f86e49a7f119004c0ce5d0be89cdf98309141) and noticed that we don't need to include `rel.h` in `localbuf.c` directly anymore, because `#include lmgr.h` was added in memutils.h I guess it solves this issue. Please, see v3 patch. > I also have a question: is the logic correct that if the relation is valid, > we should fetch it rather than the other way around? Additionally, is > checking only the `rd_isvalid` flag sufficient, or should we also consider > the flag below? > > ``` > bool rd_isvalid; /* relcache entry is valid */ > I don't think that we should check any Relation's flags here. We are checking `RelationIsValid((bmr).rel) ?` to decide whether BufferManagerRelation was created via BMR_REL or BMR_SMGR. If the `rel` field is not NULL, we can definitely say that BMR_REL was used, so we should call RelationGetSmgr in order to access smgr. -- Best regards, Daniil Davydov From a1c572652e69f13b954ec3a915ed55c1ec11c772 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 14 Apr 2025 13:07:38 +0700 Subject: [PATCH v3] Add macros for safety access to smgr --- src/backend/storage/buffer/bufmgr.c | 55 --- src/backend/storage/buffer/localbuf.c | 9 +++-- src/include/storage/bufmgr.h | 13 +++ 3 files changed, 43 insertions(+), 34 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 0b0e056eea2..f2b117088a8 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -892,11 +892,8 @@ ExtendBufferedRelBy(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_by > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } return ExtendBufferedRelCommon(bmr, fork, strategy, flags, extend_by, InvalidBlockNumber, @@ -928,11 +925,8 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_to != InvalidBlockNumber && extend_to > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } /* * If desired, create the file if it doesn't exist. If @@ -940,15 +934,15 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * an smgrexists call. */ if ((flags & EB_CREATE_FORK_IF_NEEDED) && - (bmr.smgr->smgr_cached_nblocks[fork] == 0 || - bmr.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) && - !smgrexists(bmr.smgr, fork)) + (BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == 0 || + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == InvalidBlockNumber) && + !smgrexists(BMR_GET_SMGR(bmr), fork)) { LockRelationForExtension(bmr.rel, ExclusiveLock); /* recheck, fork might have been created concurrently */ - if (!smgrexists(bmr.smgr, fork)) - smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY); + if (!smgrexists(BMR_GET_SMGR(bmr), fork)) + smgrcreate(BMR_GET_SMGR(bmr), fork, flags & EB_PERFORMING_RECOVERY); UnlockRelationForExtension(bmr.rel, ExclusiveLock); } @@ -958,13 +952,13 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * kernel. 
*/ if (flags & EB_CLEAR_SIZE_CACHE) - bmr.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber; + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] = InvalidBlockNumber; /* * Estimate how many pages we'll need to extend by. This avoids acquiring * unnecessarily many victim buffers. */ - current_size = smgrnblocks(bmr.smgr, fork); + current_size = smgrnblocks(BMR_GET_SMGR(bmr), fork); /* * Since no-one else can be looking at the page contents yet, there is no @@ -1008,7 +1002,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, if (buffer == InvalidBuffer) { Assert(extended_by == 0); - buffer = ReadBuffer_common(bmr.rel, bmr.smgr, bmr.relpersistence, + buffer = ReadBuffer_common(bmr.rel, BMR_GET_SMGR(bmr), bmr.relpersistence, fork, extend_to - 1, mode, strategy); } @@ -2568,10 +2562,10 @@ ExtendBufferedRelCommon(BufferManagerRelation bmr, BlockNumber first_block; TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork, - bmr.smgr->smgr_rlocator.locator.spcOid, - bmr.smgr->smgr_rlocator.locator.dbOid, - bmr.smgr->smgr_rlocator.locator.relNumber, - bmr.smgr->smgr_rlocator.backend, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.spcOid, + BMR_GET_SMGR(bmr)->smg
Fix bug with accessing to temporary tables of other sessions
Hi, During previous commitfest this topic already has been discussed within the "Forbid to DROP temp tables of other sessions" thread [1]. Unfortunately its name doesn't reflect the real problem, so I decided to start a new thread, as David G. Johnston advised. Here are the summary results of the discussion [1] : The superuser is only allowed to DROP temporary relations of other sessions. Other commands (like SELECT, INSERT, UPDATE, DELETE ...) must be forbidden to him. Error message for this case will look like this : `could not access temporary relations of other sessions`. For now, superuser still can specify such operations because of a bug in the code that mistakenly recognizes other session's temp table as permanent (I've covered this topic in more detail in [2]). Attached patch fixes this bug (targeted on b51f86e49a7f119004c0ce5d0be89cdf98309141). Opened issue: Not everyone liked the idea that table's persistence can be assigned to table during makeRangeVarXXX calls (inside gram.y). My opinion - `As far as "pg_temp_" prefix is reserved by the postgres kernel, we can definitely say that we have encountered a temporary table when we see this prefix.` I will be glad to hear your opinion. -- Best regards, Daniil Davydov [1] https://www.postgresql.org/message-id/CAJDiXgj72Axj0d4ojKdRWG_rnkfs4uWY414NL%3D15sCvh7-9rwg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAJDiXgj%2B5UKLWSUT5605rJhuw438NmEKecvhFAF2nnrMsgGK3w%40mail.gmail.com From c1415e457edd8e1cf38fd78ac55c93dbd8617d55 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Mon, 14 Apr 2025 11:01:56 +0700 Subject: [PATCH v5] Fix accessing other sessions temp tables --- src/backend/catalog/namespace.c | 56 src/backend/commands/tablecmds.c | 3 +- src/backend/nodes/makefuncs.c| 6 +++- src/backend/parser/gram.y| 11 ++- src/include/catalog/namespace.h | 2 ++ 5 files changed, 55 insertions(+), 23 deletions(-) diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index d97d632a7ef..f407efd9447 100644 --- a/src/backend/catalog/namespace.c +++ b/src/backend/catalog/namespace.c @@ -499,28 +499,44 @@ RangeVarGetRelidExtended(const RangeVar *relation, LOCKMODE lockmode, */ if (relation->relpersistence == RELPERSISTENCE_TEMP) { - if (!OidIsValid(myTempNamespace)) -relId = InvalidOid; /* this probably can't happen? */ - else - { -if (relation->schemaname) -{ - Oid namespaceId; + Oid namespaceId; - namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); + if (relation->schemaname) + { +namespaceId = LookupExplicitNamespace(relation->schemaname, missing_ok); +/* + * If the user has specified an existing temporary schema + * owned by another user. + */ +if (OidIsValid(namespaceId) && namespaceId != myTempNamespace) +{ /* - * For missing_ok, allow a non-existent schema name to - * return InvalidOid. + * We don't allow users to access temp tables of other + * sessions except for the case of dropping tables. */ - if (namespaceId != myTempNamespace) + if (!(flags & RVR_OTHER_TEMP_OK)) ereport(ERROR, -(errcode(ERRCODE_INVALID_TABLE_DEFINITION), - errmsg("temporary tables cannot specify a schema name"))); +(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("could not access temporary relations of other sessions"))); } + } + else + { +namespaceId = myTempNamespace; -relId = get_relname_relid(relation->relname, myTempNamespace); +/* + * If this table was recognized as temporary, it means that we + * found it because backend's temporary namespace was specified + * in search_path. 
Thus, MyTempNamespace must contain valid oid. + */ +Assert(OidIsValid(namespaceId)); } + + if (missing_ok && !OidIsValid(namespaceId)) +relId = InvalidOid; + else +relId = get_relname_relid(relation->relname, namespaceId); } else if (relation->schemaname) { @@ -3553,21 +3569,19 @@ get_namespace_oid(const char *nspname, bool missing_ok) RangeVar * makeRangeVarFromNameList(const List *names) { - RangeVar *rel = makeRangeVar(NULL, NULL, -1); + RangeVar *rel; switch (list_length(names)) { case 1: - rel->relname = strVal(linitial(names)); + rel = makeRangeVar(NULL, strVal(linitial(names)), -1); break; case 2: - rel->schemaname = strVal(linitial(names)); - rel->relname = strVal(lsecond(names)); + rel = makeRangeVar(strVal(linitial(names)), strVal(lsecond(names)), -1); break; case 3: + rel = makeRangeVar(strVal(lsecond(names)), strVal(lthird(names)), -1); rel->catalogname = strVal(linitial(names)); - rel->schemaname = strVal(lsecond(names)); - rel->relname = str
Re: POC: Parallel processing of indexes in autovacuum
On Tue, May 6, 2025 at 6:57 AM Masahiko Sawada wrote: > > What I roughly imagined is that we don't need to change the entire > autovacuum scheduling, but would like autovacuum workers to decides > whether or not to use parallel vacuum during its vacuum operation > based on GUC parameters (having a global effect) or storage parameters > (having an effect on the particular table). The criteria of triggering > parallel vacuum in autovacuum might need to be somewhat pessimistic so > that we don't unnecessarily use parallel vacuum on many tables. > +1, I think about it in the same way. I will expand on this topic in more detail in response to Sami's letter [1], so as not to repeat myself. > > Here are my thoughts on this. A/v worker has a very simple role - it > > is born after the launcher's request and must do exactly one 'task' - > > vacuum table or participate in parallel index vacuum. > > We also have a dedicated 'launcher' role, meaning the whole design > > implies that only the launcher is able to launch processes. > > > > If we allow a/v worker to use bgworkers, then : > > 1) A/v worker will go far beyond his responsibility. > > 2) Its functionality will overlap with the functionality of the launcher. > > While I agree that the launcher process is responsible for launching > autovacuum worker processes but I'm not sure it should be for > launching everything related autovacuums. It's quite possible that we > have parallel heap vacuum and processing the particular index with > parallel workers in the future. The code could get more complex if we > have the autovacuum launcher process launch such parallel workers too. > I believe it's more straightforward to divide the responsibility like > in a way that the autovacuum launcher is responsible for launching > autovacuum workers and autovacuum workers are responsible for > vacuuming tables no matter how to do that. It sounds very tempting. At the very beginning I did exactly that (to make sure that nothing would break in a parallel autovacuum). Only later it was decided to abandon the use of bgworkers. For now both approaches look fair for me. What do you think - will others agree that we can provide more responsibility to a/v workers? > > 3) Resource consumption can jump dramatically, which is unexpected for > > the user. > > What extra resources could be used if we use background workers > instead of autovacuum workers? I meant that more processes are starting to participate in the autovacuum than indicated in autovacuum_max_workers. And if a/v worker will use additional bgworkers => other operations cannot get these resources. > > Autovacuum will also be dependent on other resources > > (bgworkers pool). The current design does not imply this. > > I see your point but I think it doesn't necessarily need to reflect it > at the infrastructure layer. For example, we can internally allocate > extra background worker slots for parallel vacuum workers based on > max_parallel_index_autovac_workers in addition to > max_worker_processes. Anyway we might need something to check or > validate max_worker_processes value to make sure that every autovacuum > worker can use the specified number of parallel workers for parallel > vacuum. I don't think that we can provide all supportive workers for each parallel index vacuuming request. But I got your point - always keep several bgworkers that only a/v workers can use if needed and the size of this additional pool (depending on max_worker_processes) must be user-configurable. 
> > I wanted to create a patch that would fit into the existing mechanism > > without drastic innovations. But if you think that the above is not so > > important, then we can reuse VACUUM PARALLEL code and it would > > simplify the final implementation) > > I'd suggest using the existing infrastructure if we can achieve the > goal with it. If we find out there are some technical difficulties to > implement it without new infrastructure, we can revisit this approach. OK, in the near future I'll implement it and send a new patch to this thread. I'll be glad if you take a look at it :) [1] https://www.postgresql.org/message-id/CAA5RZ0vfBg%3Dc_0Sa1Tpxv8tueeBk8C5qTf9TrxKBbXUqPc99Ag%40mail.gmail.com -- Best regards, Daniil Davydov
Re: POC: Parallel processing of indexes in autovacuum
On Tue, May 6, 2025 at 7:21 AM Sami Imseih wrote: > > Perhaps we should only provide a reloption, therefore only tables specified > by the user via the reloption can be autovacuumed in parallel? After your comments (earlier in this thread) I decided to do just that. For now we have a reloption, so the user can decide which tables are "important" for parallel index vacuuming. We also set lower bounds (hardcoded) on the number of indexes and the number of dead tuples. For example, there is no need to use a parallel vacuum if the table has only one index. The situation is more complicated with the number of dead tuples - we need tests that would show the optimal minimum value. This issue is still being worked out. > This gives a targeted approach. Of course if multiple of these allowed tables > are to be autovacuumed at the same time, some may not get all the workers, > But that’s not different from if you are to manually vacuum in parallel the > tables > at the same time. I fully agree. Recently, the v2 patch has been supplemented with a new feature [1] - multiple tables in a cluster can be processed in parallel during autovacuum. And of course, not every a/v worker can get enough supportive processes, but this is considered normal behavior. The maximum number of supportive workers is limited by a GUC variable. [1] I guess I'll send it within the v3 patch, which will also contain the logic discussed in the email above - using bgworkers instead of additional a/v workers. BTW, what do you think about this idea? -- Best regards, Daniil Davydov
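As a concrete sketch of the knobs being discussed (this assumes the patch from this thread is applied; the GUC below is one the patch adds, while the per-table reloption name is purely hypothetical, since its spelling has not been shown in this thread):

-- Cap on the number of parallel workers used for one table's index
-- vacuuming during autovacuum (GUC added by the patch):
ALTER SYSTEM SET max_parallel_index_autovac_workers = 4;
SELECT pg_reload_conf();

-- Hypothetical per-table opt-in that marks the table as "important"
-- for parallel index vacuuming:
ALTER TABLE big_table SET (parallel_idx_autovac_enabled = true);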
Re: POC: Parallel processing of indexes in autovacuum
On Thu, May 1, 2025 at 8:03 AM Masahiko Sawada wrote: > > As I understand it, we initially disabled parallel vacuum for > autovacuum because their objectives are somewhat contradictory. > Parallel vacuum aims to accelerate the process by utilizing additional > resources, while autovacuum is designed to perform cleaning operations > with minimal impact on foreground transaction processing (e.g., > through vacuum delay). > Yep, we also decided that we must not create more a/v workers for index processing. In current implementation, the leader process sends a signal to the a/v launcher, and the launcher tries to launch all requested workers. But the number of workers never exceeds `autovacuum_max_workers`. Thus, we will never have more a/v workers than in the standard case (without this feature). > Nevertheless, I see your point about the potential benefits of using > parallel vacuum within autovacuum in specific scenarios. The crucial > consideration is determining appropriate criteria for triggering > parallel vacuum in autovacuum. Given that we currently support only > parallel index processing, suitable candidates might be autovacuum > operations on large tables that have a substantial number of > sufficiently large indexes and a high volume of garbage tuples. > > Although the actual number of parallel workers ultimately depends on > the number of eligible indexes, it might be beneficial to introduce a > storage parameter, say parallel_vacuum_workers, that allows control > over the number of parallel vacuum workers on a per-table basis. > For now, we have three GUC variables for this purpose: max_parallel_index_autovac_workers, autovac_idx_parallel_min_rows, autovac_idx_parallel_min_indexes. That is, everything is as you said. But we are still conducting research on this issue. I would like to get rid of some of these parameters. > Regarding implementation: I notice the WIP patch implements its own > parallel vacuum mechanism for autovacuum. Have you considered simply > setting at_params.nworkers to a value greater than zero? > About `at_params.nworkers = N` - that's exactly what we're doing (you can see it in the `vacuum_rel` function). But we cannot fully reuse code of VACUUM PARALLEL, because it creates its own processes via dynamic bgworkers machinery. As I said above - we don't want to consume additional resources. Also we don't want to complicate communication between processes (the idea is that a/v workers can only send signals to the a/v launcher). As a result, we created our own implementation of parallel index processing control - see changes in vacuumparallel.c and autovacuum.c. -- Best regards, Daniil Davydov
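For reference, the manual command whose behaviour this mirrors already exists (standard syntax, not part of the patch); the thread is essentially about letting autovacuum workers reach a comparable parallel index-processing path on their own:

-- Explicit parallel index vacuuming, available today:
VACUUM (PARALLEL 4, VERBOSE) big_table;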
Re: POC: Parallel processing of indexes in autovacuum
On Fri, May 2, 2025 at 11:58 PM Sami Imseih wrote: > > I am generally -1 on the idea of autovacuum performing parallel > index vacuum, because I always felt that the parallel option should > be employed in a targeted manner for a specific table. if you have a bunch > of large tables, some more important than others, a/c may end > up using parallel resources on the least important tables and you > will have to adjust a/v settings per table, etc to get the right table > to be parallel index vacuumed by a/v. Hm, this is a good point. I think I should clarify one moment - in practice, there is a common situation when users have one huge table among all databases (with 80+ indexes created on it). But, of course, in general there may be few such tables. But we can still adjust the autovac_idx_parallel_min_rows parameter. If a table has a lot of dead tuples => it is actively used => table is important (?). Also, if the user can really determine the "importance" of each of the tables - we can provide an appropriate table option. Tables with this option set will be processed in parallel in priority order. What do you think about such an idea? > > Also, with the TIDStore improvements for index cleanup, and the practical > elimination of multi-pass index vacuums, I see this being even less > convincing as something to add to a/v. If I understood correctly, then we are talking about the fact that TIDStore can store so many tuples that in fact a second pass is never needed. But the number of passes does not affect the presented optimization in any way. We must think about a large number of indexes that must be processed. Even within a single pass we can have a 40% increase in speed. > > Now, If I am going to allocate extra workers to run vacuum in parallel, why > not just provide more autovacuum workers instead so I can get more tables > vacuumed within a span of time? For now, only one process can clean up indexes, so I don't see how increasing the number of a/v workers will help in the situation that I mentioned above. Also, we don't consume additional resources during autovacuum in this patch - total number of a/v workers always <= autovacuum_max_workers. BTW, see v2 patch, attached to this letter (bug fixes) :-) -- Best regards, Daniil Davydov From 1c93a729b844a1dfe109e8d9e54d5cc0a941d061 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Sat, 3 May 2025 00:27:45 +0700 Subject: [PATCH v2] WIP Allow autovacuum to process indexes of single table in parallel --- src/backend/commands/vacuum.c | 27 + src/backend/commands/vacuumparallel.c | 289 +- src/backend/postmaster/autovacuum.c | 906 +- src/backend/utils/misc/guc_tables.c | 30 + src/backend/utils/misc/postgresql.conf.sample | 6 + src/include/postmaster/autovacuum.h | 23 + src/test/modules/autovacuum/.gitignore| 1 + src/test/modules/autovacuum/Makefile | 14 + .../autovacuum/t/001_autovac_parallel.pl | 137 +++ 9 files changed, 1387 insertions(+), 46 deletions(-) create mode 100644 src/test/modules/autovacuum/.gitignore create mode 100644 src/test/modules/autovacuum/Makefile create mode 100644 src/test/modules/autovacuum/t/001_autovac_parallel.pl diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 33a33bf6b1c..a5ef5319ccc 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -2234,6 +2234,33 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, else toast_relid = InvalidOid; + /* + * Decide whether we need to process table with given oid in parallel mode + * during autovacuum. 
+ */ + if (AmAutoVacuumWorkerProcess() && + params->index_cleanup != VACOPTVALUE_DISABLED) + { + PgStat_StatTabEntry *tabentry; + + /* fetch the pgstat table entry */ + tabentry = pgstat_fetch_stat_tabentry_ext(rel->rd_rel->relisshared, + rel->rd_id); + if (tabentry && tabentry->dead_tuples >= autovac_idx_parallel_min_rows) + { + List *indexes = RelationGetIndexList(rel); + int num_indexes = list_length(indexes); + + list_free(indexes); + + if (num_indexes >= autovac_idx_parallel_min_indexes && +max_parallel_index_autovac_workers > 0) + { +params->nworkers = max_parallel_index_autovac_workers; + } + } + } + /* * Switch to the table owner's userid, so that any index functions are run * as that user. Also lock down security-restricted operations and diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c index 2b9d548cdeb..cb4b7c23010 100644 --- a/src/backend/commands/vacuumparallel.c +++ b/src/backend/commands/vacuumparallel.c @@ -1,20 +1,23 @@ /*- * * vacuumparallel.c - * Support routines fo
Re: POC: Parallel processing of indexes in autovacuum
On Sat, May 3, 2025 at 5:28 AM Masahiko Sawada wrote: > > > In current implementation, the leader process sends a signal to the > > a/v launcher, and the launcher tries to launch all requested workers. > > But the number of workers never exceeds `autovacuum_max_workers`. > > Thus, we will never have more a/v workers than in the standard case > > (without this feature). > > I have concerns about this design. When autovacuuming on a single > table consumes all available autovacuum_max_workers slots with > parallel vacuum workers, the system becomes incapable of processing > other tables. This means that when determining the appropriate > autovacuum_max_workers value, users must consider not only the number > of tables to be processed concurrently but also the potential number > of parallel workers that might be launched. I think it would more make > sense to maintain the existing autovacuum_max_workers parameter while > introducing a new parameter that would either control the maximum > number of parallel vacuum workers per autovacuum worker or set a > system-wide cap on the total number of parallel vacuum workers. > For now we have max_parallel_index_autovac_workers - this GUC limits the number of parallel a/v workers that can process a single table. I agree that the scenario you provided is problematic. The proposal to limit the total number of supportive a/v workers seems attractive to me (I'll implement it as an experiment). It seems to me that this question is becoming a key one. First we need to determine the role of the user in the whole scheduling mechanism. Should we allow users to determine priority? Will this priority affect only within a single vacuuming cycle, or it will be more 'global'? I guess I don't have enough expertise to determine this alone. I will be glad to receive any suggestions. > > About `at_params.nworkers = N` - that's exactly what we're doing (you > > can see it in the `vacuum_rel` function). But we cannot fully reuse > > code of VACUUM PARALLEL, because it creates its own processes via > > dynamic bgworkers machinery. > > As I said above - we don't want to consume additional resources. Also > > we don't want to complicate communication between processes (the idea > > is that a/v workers can only send signals to the a/v launcher). > > Could you elaborate on the reasons why you don't want to use > background workers and avoid complicated communication between > processes? I'm not sure whether these concerns provide sufficient > justification for implementing its own parallel index processing. > Here are my thoughts on this. A/v worker has a very simple role - it is born after the launcher's request and must do exactly one 'task' - vacuum table or participate in parallel index vacuum. We also have a dedicated 'launcher' role, meaning the whole design implies that only the launcher is able to launch processes. If we allow a/v worker to use bgworkers, then : 1) A/v worker will go far beyond his responsibility. 2) Its functionality will overlap with the functionality of the launcher. 3) Resource consumption can jump dramatically, which is unexpected for the user. Autovacuum will also be dependent on other resources (bgworkers pool). The current design does not imply this. I wanted to create a patch that would fit into the existing mechanism without drastic innovations. But if you think that the above is not so important, then we can reuse VACUUM PARALLEL code and it would simplify the final implementation) -- Best regards, Daniil Davydov
Re: POC: Parallel processing of indexes in autovacuum
On Sat, May 3, 2025 at 5:59 AM Sami Imseih wrote: > > > I think it would more make > > sense to maintain the existing autovacuum_max_workers parameter while > > introducing a new parameter that would either control the maximum > > number of parallel vacuum workers per autovacuum worker or set a > > system-wide cap on the total number of parallel vacuum workers. > > +1, and would it make sense for parallel workers to come from > max_parallel_maintenance_workers? This is capped by > max_parallel_workers and max_worker_processes, so increasing > the defaults for all 3 will be needed as well. I may be wrong, but the `max_parallel_maintenance_workers` parameter is only used for commands that are explicitly run by the user. We already have `autovacuum_max_workers` and I think the code will be more consistent if we adapt this particular parameter (perhaps with the addition of a new one, as I wrote in the previous email). -- Best regards, Daniil Davydov
Re: POC: Parallel processing of indexes in autovacuum
On Sat, May 3, 2025 at 3:17 AM Sami Imseih wrote: > > I think in most cases, the user will want to determine the priority of > a table getting parallel vacuum cycles rather than having the autovacuum > determine the priority. I also see users wanting to stagger > vacuums of large tables with many indexes through some time period, > and give the > tables the full amount of parallel workers they can afford at these > specific periods > of time. A/V currently does not really allow for this type of > scheduling, and if we > give some kind of GUC to prioritize tables, I think users will constantly have > to be modifying this priority. If the user wants to determine the priority himself, we still need to introduce some parameter (a GUC or table option) that will give us a hint about how we should schedule a/v work. Do you think we should design a more comprehensive behavior for such a parameter (so that the user doesn't have to change it often)? I will be glad to hear your thoughts. > > If I understood correctly, then we are talking about the fact that > > TIDStore can store so many tuples that in fact a second pass is never > > needed. > > But the number of passes does not affect the presented optimization in > > any way. We must think about a large number of indexes that must be > > processed. Even within a single pass we can have a 40% increase in > > speed. > I am not discounting that a single table vacuum with many indexes will > maybe perform better with parallel index scan, I am merely saying that > the TIDStore optimization now makes index vacuums better and perhaps > there is less of an incentive to use parallel. I still insist that this does not affect the parallel index vacuum, because we don't get an advantage in repeated passes. We get the same speed increase whether we have this optimization or not. Although it's even possible that the opposite is true - the situation will be better with the new TIDStore, but I can't say for sure. > > > Now, If I am going to allocate extra workers to run vacuum in parallel, > > > why > > > not just provide more autovacuum workers instead so I can get more tables > > > vacuumed within a span of time? > > > > For now, only one process can clean up indexes, so I don't see how > > increasing the number of a/v workers will help in the situation that I > > mentioned above. > > Also, we don't consume additional resources during autovacuum in this > > patch - total number of a/v workers always <= autovacuum_max_workers. > > Increasing a/v workers will not help speed up a specific table, what I > am suggesting is that instead of speeding up one table, let's just allow > other tables to not be starved of a/v cycles due to lack of a/v workers. OK, I got it. But what if vacuuming a single table takes (for example) 60% of all the time? This is still a possible situation, and vacuuming all the other tables quickly will not help us. -- Best regards, Daniil Davydov
Re: Forbid to DROP temp tables of other sessions
Hi, I'm sorry for the long lull. Considering that it is still important for some users to be able to clean out other sessions' temporary namespaces, I suggest adding a GUC that will allow a superuser to do this. We keep only one option - dropping tables - because only that feature works properly in postgres right now. The ability to read and modify other sessions' temp tables is broken (I mean INSERT, UPDATE, DELETE and SELECT queries) and needs to be disabled. I wrote about broken INSERTs here: https://www.postgresql.org/message-id/CAJDiXgj6TBzn%3D6Ezx7%2B9BNa9HpBitBU%2BMuv-N3mHeN_Zs3NBDw%40mail.gmail.com What do you think about this idea? -- Best regards, Daniil Davydov
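A sketch of what this would look like from the user's side (the GUC name below is purely hypothetical - none has been agreed on in this thread - and pg_temp_3 stands for whatever schema the other session actually owns):

-- Superuser opts in to cleaning up another session's temp schema:
SET allow_superuser_drop_other_temp = on;      -- hypothetical GUC
DROP TABLE pg_temp_3.leftover;                 -- allowed while the GUC is on

-- Content access remains forbidden regardless of the setting:
INSERT INTO pg_temp_3.leftover VALUES (1);
-- ERROR:  cannot access temporary tables of other sessions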
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
Hi, A few days ago I came up with an idea to implement multi insert optimization wherever possible. I prepared a raw patch and it showed a great performance gain (up to 4 times during INSERT ... INTO ... in the best case). Then I was very happy to find this thread. You did a great job and I want to help you to bring the matter to an end. On Thu, Oct 31, 2024 at 11:17 AM Jingtang Zhang wrote: > I did some performance test these days, and I have some findings. > HEAD: > 12.29% postgres[.] pg_checksum_block >6.33% postgres[.] GetPrivateRefCountEntry >5.40% postgres[.] pg_comp_crc32c_sse42 >4.54% [kernel][k] copy_user_enhanced_fast_string >2.69% postgres[.] BufferIsValid >1.52% postgres[.] XLogRecordAssemble > > Patched: > 11.75% postgres[.] tts_virtual_materialize >8.87% postgres[.] pg_checksum_block >8.17% postgres[.] slot_deform_heap_tuple >8.09% postgres[.] heap_compute_data_size >6.17% postgres[.] fill_val >3.81% postgres[.] heap_fill_tuple >3.37% postgres[.] tts_virtual_copyslot >2.62% [kernel][k] copy_user_enhanced_fast_string I applied v25 patches on the master branch and made some measurements to find out what is the bottleneck in this case. The 'time' utility showed that without a patch, this query will run 1.5 times slower. I also made a few flamegraphs for this test. Most of the time is spent calling these two functions : tts_virtual_copyslot and heap_form_tuple. All tests were run in virtual machine with these CPU characteristics: Architecture: x86_64 CPU(s): 2 On-line CPU(s) list:0,1 Virtualization features: Virtualization: AMD-V Hypervisor vendor: KVM Virtualization type:full Caches (sum of all): L1d:128 KiB (2 instances) L1i:128 KiB (2 instances) L2: 1 MiB (2 instances) L3: 32 MiB (2 instances) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0,1 In my implementation, I used Tuplestore functionality to store tuples. In order to get rid of getting stuck in the above mentioned functions, I crossed it with the current implementation (v25 patches) and got a 10% increase in performance (for the test above). I also set up v22 patches to compare performance (with/without tuplestore) for INSERT ... INTO ... queries (with -j 4 -c 10 parameters for pgbech), and there also was an increase in TPS (about 3-4%). I attach a patch that adds Tuplestore to v25. What do you think about this idea? -- Best regards, Daniil Davydov From a59cfcbb05bb07c94a4c0ad6531baa5e531629ae Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Sun, 9 Mar 2025 16:37:44 +0700 Subject: [PATCH] Replace holding tuples in virtual slots with tuplestorage During performance testing, it was found out that in the current implementation a lot of the program's time is spent calling two functions : tts_virtual_copyslot and heap_fill_tuple. Calls to these functions are related to the fact that tuples are stored in virtual_tts, so I propose to replace this logic with Tuplestore functionality. 
Discussion: https://www.postgresql.org/message-id/9F9326B4-8AD9-4858-B1C1-559FC64E6E93%40gmail.com --- src/backend/access/heap/heapam.c | 67 +++- src/include/access/heapam.h | 9 - 2 files changed, 38 insertions(+), 38 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index acdce1a4b4..276480213a 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2665,7 +2665,6 @@ void heap_modify_buffer_insert(TableModifyState *state, TupleTableSlot *slot) { - TupleTableSlot *dstslot; HeapInsertState *istate; HeapMultiInsertState *mistate; MemoryContext oldcontext; @@ -2682,8 +2681,10 @@ heap_modify_buffer_insert(TableModifyState *state, mistate = (HeapMultiInsertState *) palloc(sizeof(HeapMultiInsertState)); mistate->slots = - (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS); - mistate->cur_slots = 0; + (TupleTableSlot **) palloc0(sizeof(void *) * HEAP_MAX_BUFFERED_SLOTS); + mistate->tstore = tuplestore_begin_heap(false, false, work_mem); + mistate->nused = 0; + istate->mistate = mistate; /* @@ -2702,36 +2703,11 @@ heap_modify_buffer_insert(TableModifyState *state, istate = (HeapInsertState *) state->data; Assert(istate->mistate != NULL); mistate = istate->mistate; - dstslot = mistate->slots[mistate->cur_slots]; - - if (dstslot == NULL) - { - /* - * We use virtual tuple slots buffered slots for leveraging the - * optimization it provides to minimize physical data copying. The -
Add arbitrary xid and mxid to pg_resetwal
Hi, I prepared a patch that will allow us to set arbitrary values in -m and -x options for pg_resetwal. For now, it is not possible to specify a value that does not fit into existing SLRU segments, and main idea of this patch (for REL_17_STABLE) is to create such segments inside pg_resetwal's main() function. In my opinion, this will be useful primarily to simplify testing, since at the moment you have to create segments manually (as in this article <https://www.cybertec-postgresql.com/en/transaction-id-wraparound-a-walk-on-the-wild-side/> ). Patch also contains additional tests for pg_resetwal (regression is called to make sure that all postgres components are working correctly, but maybe it can be replaced with something more "compact"). What do you think about the idea of adding such functionality? -- Best regards, Daniil Davydov From 2993b376674cc0b565abd42a7d85eae8c8856428 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Wed, 5 Mar 2025 14:07:06 +0700 Subject: [PATCH] Allow to set any value for -m and -x options --- src/bin/pg_resetwal/Makefile | 4 + src/bin/pg_resetwal/pg_resetwal.c| 145 +++ src/bin/pg_resetwal/t/003_advance.pl | 135 + 3 files changed, 284 insertions(+) create mode 100644 src/bin/pg_resetwal/t/003_advance.pl diff --git a/src/bin/pg_resetwal/Makefile b/src/bin/pg_resetwal/Makefile index 4228a5a772..c890b1c5c6 100644 --- a/src/bin/pg_resetwal/Makefile +++ b/src/bin/pg_resetwal/Makefile @@ -17,6 +17,10 @@ include $(top_builddir)/src/Makefile.global LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils +# required for 03_advance.pl +REGRESS_SHLIB=$(top_builddir)/src/test/regress/regress$(DLSUFFIX) +export REGRESS_SHLIB + OBJS = \ $(WIN32RES) \ pg_resetwal.o diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c index e9dcb5a6d8..50f6b9ca2f 100644 --- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -76,6 +76,9 @@ static XLogSegNo minXlogSegNo = 0; static int WalSegSz; static int set_wal_segsize; +static void AdvanceNextXid(FullTransactionId oldval, FullTransactionId newval); +static void AdvanceNextMultiXid(MultiXactId oldval, MultiXactId newval); + static void CheckDataVersion(void); static bool read_controlfile(void); static void GuessControlValues(void); @@ -90,6 +93,29 @@ static void WriteEmptyXLOG(void); static void usage(void); +typedef struct CommitTimestampEntry +{ + TimestampTz time; + RepOriginId nodeid; +} CommitTimestampEntry; + +// typedef uint64 MultiXactOffset; + +/* + * Note: these macros are copied from clog.c, commit_ts.c and subtrans.c and + * should be kept in sync. 
+ */ +#define CLOG_XACTS_PER_BYTE 4 +#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE) +#define SUBTRANS_XACTS_PER_PAGE (BLCKSZ / sizeof(TransactionId)) +#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, nodeid) + \ + sizeof(RepOriginId)) +#define COMMIT_TS_XACTS_PER_PAGE \ + (BLCKSZ / SizeOfCommitTimestampEntry) +#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset)) + +#define SLRU_PAGES_PER_SEGMENT 32 + int main(int argc, char *argv[]) { @@ -113,6 +139,7 @@ main(int argc, char *argv[]) bool force = false; bool noupdate = false; MultiXactId set_oldestmxid = 0; + FullTransactionId current_fxid = {0}; char *endptr; char *endptr2; char *DataDir = NULL; @@ -406,6 +433,11 @@ main(int argc, char *argv[]) if ((guessed && !force) || noupdate) PrintControlValues(guessed); + /* + * Remember full id of next free transaction + */ + current_fxid = ControlFile.checkPointCopy.nextXid; + /* * Adjust fields if required by switches. (Do this now so that printout, * if any, includes these values.) @@ -422,10 +454,18 @@ main(int argc, char *argv[]) } if (set_xid != 0) + { ControlFile.checkPointCopy.nextXid = FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(ControlFile.checkPointCopy.nextXid), set_xid); + if (FullTransactionIdPrecedes(current_fxid, ControlFile.checkPointCopy.nextXid) && + !noupdate) + { + AdvanceNextXid(current_fxid, ControlFile.checkPointCopy.nextXid); + } + } + if (set_oldest_commit_ts_xid != 0) ControlFile.checkPointCopy.oldestCommitTsXid = set_oldest_commit_ts_xid; if (set_newest_commit_ts_xid != 0) @@ -436,12 +476,19 @@ main(int argc, char *argv[]) if (set_mxid != 0) { + MultiXactId current_mxid = ControlFile.checkPointCopy.nextMulti; ControlFile.checkPointCopy.nextMulti = set_mxid; ControlFile.checkPointCopy.oldestMulti = set_oldestmxid; if (ControlFile.checkPointCopy.oldestMulti < FirstMultiXactId) ControlFile.checkPointCopy.oldestMulti += FirstMultiXactId; ControlFile.checkPointCopy.oldestMultiDB = InvalidOid; + + /* + * If current_mxid precedes set_mxid + */ + if (((int32) (current_mxid - set_mxid) < 0) && !noupd
Re: Add arbitrary xid and mxid to pg_resetwal
Hi, On Fri, Mar 7, 2025 at 9:35 PM Aleksander Alekseev wrote: > Your patch should target the `master` branch. Also please add a > corresponding entry to the nearest open commitfest [1]. OK, thanks for the notice! I attach the v2 patch for the `master` branch to this letter. Now you can also find it in commitfest in System Administration topic. On Fri, Mar 7, 2025 at 9:35 PM Aleksander Alekseev wrote: > > In my opinion, this will be useful primarily to simplify testing, since at > > the moment you have to create segments manually (as in this article). > > In this case you should add a test or two that demonstrate this. As > separate commits perhaps. Well, I just saw that people have a request for such functionality. More specifically, I think it's worth taking a look at the src/test/modules/xid_wraparound. There, the transaction counter is just jumping forward to check the autovacuum and postgres wraparound limits. These tests are using the `xid_wraparound` extension, but it can be replaced with new pg_resetwal feature. Measurements on my machine showed that the test execution time (with advancing xid up to 10 billions with 100 millions values by step) takes 40% less time to complete. P.S. v2 patch looks a bit scary, because it contains a lot of raw I/O operations. At the moment, it is the best I came up with, but the logic of the code (I hope) is correct. -- Best regards, Daniil Davydov From bcca199a1787399c58c959a4357c2daefb8335f5 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Tue, 11 Mar 2025 10:34:12 +0700 Subject: [PATCH v2] Arbitrary xid and mxid for resetwal --- src/bin/pg_resetwal/Makefile | 4 + src/bin/pg_resetwal/pg_resetwal.c| 338 ++- src/bin/pg_resetwal/t/003_advance.pl | 137 +++ 3 files changed, 476 insertions(+), 3 deletions(-) create mode 100644 src/bin/pg_resetwal/t/003_advance.pl diff --git a/src/bin/pg_resetwal/Makefile b/src/bin/pg_resetwal/Makefile index 82bea06dee5..1bd76180864 100644 --- a/src/bin/pg_resetwal/Makefile +++ b/src/bin/pg_resetwal/Makefile @@ -17,6 +17,10 @@ include $(top_builddir)/src/Makefile.global LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils +# required for 03_advance.pl +REGRESS_SHLIB=$(top_builddir)/src/test/regress/regress$(DLSUFFIX) +export REGRESS_SHLIB + OBJS = \ $(WIN32RES) \ pg_resetwal.o diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c index 31bc0abff16..18649127446 100644 --- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -65,7 +65,7 @@ static bool guessed = false; /* T if we had to guess at any values */ static const char *progname; static uint32 set_xid_epoch = (uint32) -1; static TransactionId set_oldest_xid = 0; -static TransactionId set_xid = 0; +static uint64 set_xid = 0; static TransactionId set_oldest_commit_ts_xid = 0; static TransactionId set_newest_commit_ts_xid = 0; static Oid set_oid = 0; @@ -89,7 +89,41 @@ static void KillExistingArchiveStatus(void); static void KillExistingWALSummaries(void); static void WriteEmptyXLOG(void); static void usage(void); +static void AdvanceNextXid(TransactionId oldval, TransactionId newval); +static void AdvanceNextMultiXid(MultiXactId oldval, MultiXactId newval); +/* + * Note: this structure is copied from commit_ts.c and should be kept in sync. + */ +typedef struct CommitTimestampEntry +{ + TimestampTz time; + RepOriginId nodeid; +} CommitTimestampEntry; + +/* + * Note: these macros are copied from clog.c, commit_ts.c and subtrans.c and + * should be kept in sync. 
+ */ +#define CLOG_BITS_PER_XACT 2 +#define CLOG_XACTS_PER_BYTE 4 +#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE) + +#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) +#define TransactionIdToByte(xid) (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE) +#define TransactionIdToBIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE) + +#define SUBTRANS_XACTS_PER_PAGE (BLCKSZ / sizeof(TransactionId)) + +#define SizeOfCommitTimestampEntry (offsetof(CommitTimestampEntry, nodeid) + \ + sizeof(RepOriginId)) + +#define COMMIT_TS_XACTS_PER_PAGE \ + (BLCKSZ / SizeOfCommitTimestampEntry) + +#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset)) + +#define SLRU_PAGES_PER_SEGMENT 32 int main(int argc, char *argv[]) @@ -441,9 +475,47 @@ main(int argc, char *argv[]) } if (set_xid != 0) + { + FullTransactionId current_fxid = ControlFile.checkPointCopy.nextXid; + FullTransactionId full_datfrozenxid; + uint32 current_epoch; + + full_datfrozenxid = + FullTransactionIdFromEpochAndXid(EpochFromFullTransactionId(current_fxid), + ControlFile.checkPointCopy.oldestXid); + + if (set_xid > full_datfrozenxid.value && + (set_xid - full_datfrozenxid.value) > INT32_MAX) + { + /* + * Cannot advance transaction ID in this case, because all unfrozen + *
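For reference, the conversion between the 64-bit value that the patched -x option accepts and the epoch/XID pair recorded in pg_control is just a bit split. Below is a minimal standalone sketch (plain C, not taken from the patch; the 10-billion figure only mirrors the test mentioned above):
***
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* advance the next transaction ID to 10 billion, as in the test above */
	uint64_t	full_xid = UINT64_C(10000000000);

	/* the upper 32 bits become the epoch, the lower 32 bits the plain XID */
	uint32_t	epoch = (uint32_t) (full_xid >> 32);
	uint32_t	xid = (uint32_t) (full_xid & 0xFFFFFFFF);

	printf("epoch = %u, xid = %u\n", epoch, xid);
	return 0;
}
***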
Re: Accessing an invalid pointer in BufferManagerRelation structure
Hi, I am posting the new v2 patch, which is now rebased on the `master` branch. Waiting for your feedback :) -- Best regards, Daniil Davydov From 2e43a1411ebcb37b2c0c1d6bac758e48799d2c4e Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Tue, 11 Mar 2025 17:11:16 +0700 Subject: [PATCH v2] Add macros for safety access to smgr --- src/backend/storage/buffer/bufmgr.c | 55 --- src/backend/storage/buffer/localbuf.c | 10 +++-- src/include/storage/bufmgr.h | 14 +++ 3 files changed, 45 insertions(+), 34 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 7915ed624c1..c55228f298a 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -887,11 +887,8 @@ ExtendBufferedRelBy(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_by > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } return ExtendBufferedRelCommon(bmr, fork, strategy, flags, extend_by, InvalidBlockNumber, @@ -923,11 +920,8 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, Assert(bmr.smgr == NULL || bmr.relpersistence != 0); Assert(extend_to != InvalidBlockNumber && extend_to > 0); - if (bmr.smgr == NULL) - { - bmr.smgr = RelationGetSmgr(bmr.rel); + if (bmr.relpersistence == 0) bmr.relpersistence = bmr.rel->rd_rel->relpersistence; - } /* * If desired, create the file if it doesn't exist. If @@ -935,15 +929,15 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * an smgrexists call. */ if ((flags & EB_CREATE_FORK_IF_NEEDED) && - (bmr.smgr->smgr_cached_nblocks[fork] == 0 || - bmr.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) && - !smgrexists(bmr.smgr, fork)) + (BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == 0 || + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] == InvalidBlockNumber) && + !smgrexists(BMR_GET_SMGR(bmr), fork)) { LockRelationForExtension(bmr.rel, ExclusiveLock); /* recheck, fork might have been created concurrently */ - if (!smgrexists(bmr.smgr, fork)) - smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY); + if (!smgrexists(BMR_GET_SMGR(bmr), fork)) + smgrcreate(BMR_GET_SMGR(bmr), fork, flags & EB_PERFORMING_RECOVERY); UnlockRelationForExtension(bmr.rel, ExclusiveLock); } @@ -953,13 +947,13 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, * kernel. */ if (flags & EB_CLEAR_SIZE_CACHE) - bmr.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber; + BMR_GET_SMGR(bmr)->smgr_cached_nblocks[fork] = InvalidBlockNumber; /* * Estimate how many pages we'll need to extend by. This avoids acquiring * unnecessarily many victim buffers. 
*/ - current_size = smgrnblocks(bmr.smgr, fork); + current_size = smgrnblocks(BMR_GET_SMGR(bmr), fork); /* * Since no-one else can be looking at the page contents yet, there is no @@ -1003,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr, if (buffer == InvalidBuffer) { Assert(extended_by == 0); - buffer = ReadBuffer_common(bmr.rel, bmr.smgr, bmr.relpersistence, + buffer = ReadBuffer_common(bmr.rel, BMR_GET_SMGR(bmr), bmr.relpersistence, fork, extend_to - 1, mode, strategy); } @@ -2153,10 +2147,10 @@ ExtendBufferedRelCommon(BufferManagerRelation bmr, BlockNumber first_block; TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork, - bmr.smgr->smgr_rlocator.locator.spcOid, - bmr.smgr->smgr_rlocator.locator.dbOid, - bmr.smgr->smgr_rlocator.locator.relNumber, - bmr.smgr->smgr_rlocator.backend, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.spcOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.dbOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.relNumber, + BMR_GET_SMGR(bmr)->smgr_rlocator.backend, extend_by); if (bmr.relpersistence == RELPERSISTENCE_TEMP) @@ -2170,10 +2164,10 @@ ExtendBufferedRelCommon(BufferManagerRelation bmr, *extended_by = extend_by; TRACE_POSTGRESQL_BUFFER_EXTEND_DONE(fork, - bmr.smgr->smgr_rlocator.locator.spcOid, - bmr.smgr->smgr_rlocator.locator.dbOid, - bmr.smgr->smgr_rlocator.locator.relNumber, - bmr.smgr->smgr_rlocator.backend, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.spcOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.dbOid, + BMR_GET_SMGR(bmr)->smgr_rlocator.locator.relNumber, + BMR_GET_SMGR(bmr)->smgr_rlocator.backend, *extended_by, first_block); @@ -2239,9 +2233,9 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, * kernel. */ if (flags & EB_CLEAR_SIZE_CACHE) - bmr.smgr->smgr_cached_nblocks[fork] = InvalidBlockNum
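The bufmgr.h hunk that actually defines the new macro is cut off above, so here is a sketch of what a BMR_GET_SMGR() helper along these lines could look like. This is an assumption for readability, not the patch's definition; it assumes the usual storage/bufmgr.h and utils/rel.h declarations.
***
/* Hypothetical sketch: re-derive the SMgrRelation on every access instead of
 * caching a pointer in the BufferManagerRelation struct. */
#define BMR_GET_SMGR(bmr) \
	((bmr).rel != NULL ? RelationGetSmgr((bmr).rel) : (bmr).smgr)
***
The reason to go through RelationGetSmgr() each time is that the smgr pointer hanging off a Relation can be closed and reopened by cache invalidations, so a copy stored in bmr.smgr can end up dangling - which is exactly the invalid-pointer problem this thread is about.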
Re: POC: Parallel processing of indexes in autovacuum
Hi, As I promised - meet parallel index autovacuum with bgworkers (Parallel-index-autovacuum-with-bgworkers.patch). This is pretty simple implementation : 1) Added new table option `parallel_idx_autovac_enabled` that must be set to `true` if user wants autovacuum to process table in parallel. 2) Added new GUC variable `autovacuum_reserved_workers_num`. This is number of parallel workers from bgworkers pool that can be used only by autovacuum workers. The `autovacuum_reserved_workers_num` parameter actually reserves a requested part of the processes, the total number of which is equal to `max_worker_processes`. 3) When an autovacuum worker decides to process some table in parallel, it just sets `VacuumParams->nworkers` to appropriate value (> 0) and then the code is executed as if it were a regular VACUUM PARALLEL. 4) I kept test/modules/autovacuum as sandbox where you can play with parallel index autovacuum a bit. What do you think about this implementation? P.S. I also improved "self-managed" parallel autovacuum implementation (Self-managed-parallel-index-autovacuum.patch). For now it needs a lot of refactoring, but all features are working good. Both patches are targeting on master branch (bc35adee8d7ad38e7bef40052f196be55decddec) -- Best regards, Daniil Davydov From cfb7e675d9a1b05aef0cdaeeca5f6edd4bcd3b70 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Sat, 10 May 2025 01:07:42 +0700 Subject: [PATCH v1] Parallel index autovacuum with bgworkers --- src/backend/access/common/reloptions.c| 11 ++ src/backend/commands/vacuum.c | 55 src/backend/commands/vacuumparallel.c | 46 --- src/backend/postmaster/autovacuum.c | 9 ++ src/backend/postmaster/bgworker.c | 33 - src/backend/utils/init/globals.c | 1 + src/backend/utils/misc/guc_tables.c | 13 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/miscadmin.h | 1 + src/include/utils/guc_hooks.h | 2 + src/include/utils/rel.h | 10 ++ src/test/modules/autovacuum/.gitignore| 1 + src/test/modules/autovacuum/Makefile | 14 ++ .../autovacuum/t/001_autovac_parallel.pl | 129 ++ 14 files changed, 307 insertions(+), 19 deletions(-) create mode 100644 src/test/modules/autovacuum/.gitignore create mode 100644 src/test/modules/autovacuum/Makefile create mode 100644 src/test/modules/autovacuum/t/001_autovac_parallel.pl diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 46c1dce222d..ccf59208783 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -166,6 +166,15 @@ static relopt_bool boolRelOpts[] = }, true }, + { + { + "parallel_idx_autovac_enabled", + "Allows autovacuum to process indexes of this table in parallel mode", + RELOPT_KIND_HEAP, + ShareUpdateExclusiveLock + }, + false + }, /* list terminator */ {{NULL}} }; @@ -1863,6 +1872,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind) {"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)}, {"autovacuum_enabled", RELOPT_TYPE_BOOL, offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)}, + {"parallel_idx_autovac_enabled", RELOPT_TYPE_BOOL, + offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, parallel_idx_autovac_enabled)}, {"autovacuum_vacuum_threshold", RELOPT_TYPE_INT, offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_threshold)}, {"autovacuum_vacuum_max_threshold", RELOPT_TYPE_INT, diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 33a33bf6b1c..f7667f14147 100644 --- a/src/backend/commands/vacuum.c 
+++ b/src/backend/commands/vacuum.c @@ -57,9 +57,21 @@ #include "utils/guc.h" #include "utils/guc_hooks.h" #include "utils/memutils.h" +#include "utils/rel.h" #include "utils/snapmgr.h" #include "utils/syscache.h" +/* + * Minimum number of dead tuples required for the table's indexes to be + * processed in parallel during autovacuum. + */ +#define AV_PARALLEL_DEADTUP_THRESHOLD 1024 + +/* + * How many indexes should process each parallel worker during autovacuum. + */ +#define NUM_INDEXES_PER_PARALLEL_WORKER 30 + /* * Minimum interval for cost-based vacuum delay reports from a parallel worker. * This aims to avoid sending too many messages and waking up the leader too @@ -2234,6 +2246,49 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, else toast_relid = InvalidOid; + /* + * If we are running autovacuum - decide whether we need to process indexes + * of table with given oid in parallel. + */ + if (AmAutoVacuumWorkerProcess() && + params->index_clea
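The hunk above is cut off in the middle of the new check in vacuum_rel(), so here is a rough sketch of the overall shape that points 1)-3) describe. The pg_stat lookup and the compute_parallel_degree() helper are illustrative assumptions, not the exact patch code.
***
	/* Sketch only: decide whether this autovacuum run should vacuum the
	 * table's indexes in parallel. */
	if (AmAutoVacuumWorkerProcess() &&
		params->index_cleanup != VACOPTVALUE_DISABLED &&
		RelationAllowsParallelIdxAutovac(rel))
	{
		PgStat_StatTabEntry *tabentry = pgstat_fetch_stat_tabentry(relid);

		if (tabentry != NULL &&
			tabentry->dead_tuples >= AV_PARALLEL_DEADTUP_THRESHOLD)
		{
			/* e.g. one worker per NUM_INDEXES_PER_PARALLEL_WORKER indexes */
			params->nworkers = compute_parallel_degree(rel);
		}
	}
***
Once params->nworkers is set to a positive value, the existing vacuumparallel.c machinery runs as it would for a manual VACUUM (PARALLEL n).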
Re: [BUG] Skipped initialization of some xl_xact_parsed_prepare fields
Hi, On Wed, May 21, 2025 at 9:59 AM Fujii Masao wrote: > I've pushed the patch. Thanks! > Glad to hear it, thank you! -- Best regards, Daniil Davydov
Re: POC: Parallel processing of indexes in autovacuum
Hi, On Fri, May 23, 2025 at 6:12 AM Masahiko Sawada wrote: > > On Thu, May 22, 2025 at 12:44 AM Daniil Davydov <3daniss...@gmail.com> wrote: > > > > On Wed, May 21, 2025 at 5:30 AM Masahiko Sawada > > wrote: > > > > > > I find that the name "autovacuum_reserved_workers_num" is generic. It > > > would be better to have a more specific name for parallel vacuum such > > > as autovacuum_max_parallel_workers. This parameter is related to > > > neither autovacuum_worker_slots nor autovacuum_max_workers, which > > > seems fine to me. Also, max_parallel_maintenance_workers doesn't > > > affect this parameter. > > > > This was my headache when I created names for variables. Autovacuum > > initially implies parallelism, because we have several parallel a/v > > workers. > > I'm not sure if it's parallelism. We can have multiple autovacuum > workers simultaneously working on different tables, which seems not > parallelism to me. Hm, I didn't thought about the 'parallelism' definition in this way. But I see your point - the next v4 patch will contain the naming that you suggest. > > > So I think that parameter like > > `autovacuum_max_parallel_workers` will confuse somebody. > > If we want to have a more specific name, I would prefer > > `max_parallel_index_autovacuum_workers`. > > It's better not to use 'index' as we're trying to extend parallel > vacuum to heap scanning/vacuuming as well[1]. OK, I'll fix it. > > > + /* > > > +* If we are running autovacuum - decide whether we need to process > > > indexes > > > +* of table with given oid in parallel. > > > +*/ > > > + if (AmAutoVacuumWorkerProcess() && > > > + params->index_cleanup != VACOPTVALUE_DISABLED && > > > + RelationAllowsParallelIdxAutovac(rel)) > > > > > > I think that this should be done in autovacuum code. > > > > We need params->index cleanup variable to decide whether we need to > > use parallel index a/v. In autovacuum.c we have this code : > > *** > > /* > > * index_cleanup and truncate are unspecified at first in autovacuum. > > * They will be filled in with usable values using their reloptions > > * (or reloption defaults) later. > > */ > > tab->at_params.index_cleanup = VACOPTVALUE_UNSPECIFIED; > > tab->at_params.truncate = VACOPTVALUE_UNSPECIFIED; > > *** > > This variable is filled in inside the `vacuum_rel` function, so I > > think we should keep the above logic in vacuum.c. > > I guess that we can specify the parallel degree even if index_cleanup > is still UNSPECIFIED. vacuum_rel() would then decide whether to use > index vacuuming and vacuumlazy.c would decide whether to use parallel > vacuum based on the specified parallel degree and index_cleanup value. > > > > > > +#define AV_PARALLEL_DEADTUP_THRESHOLD 1024 > > > > > > These fixed values really useful in common cases? I think we already > > > have an optimization where we skip vacuum indexes if the table has > > > fewer dead tuples (see BYPASS_THRESHOLD_PAGES). > > > > When we allocate dead items (and optionally init parallel autocuum) we > > don't have sane value for `vacrel->lpdead_item_pages` (which should be > > compared with BYPASS_THRESHOLD_PAGES). > > The only criterion that we can focus on is the number of dead tuples > > indicated in the PgStat_StatTabEntry. > > My point is that this criterion might not be useful. We have the > bypass optimization for index vacuuming and having many dead tuples > doesn't necessarily mean index vacuuming taking a long time. 
For > example, even if the table has a few dead tuples, index vacuuming > could take a very long time and parallel index vacuuming would help > the situation, if the table is very large and has many indexes. That sounds reasonable. I'll fix it. > > But autovacuum (as I think) should work as stable as possible and > > `unnoticed` by other processes. Thus, we must : > > 1) Compute resources (such as the number of parallel workers for a > > single table's indexes vacuuming) as efficiently as possible. > > 2) Provide a guarantee that as many tables as possible (among > > requested) will be processed in parallel. > > > > (1) can be achieved by calculating the parameters on the fly. > > NUM_INDEXES_PER_PARALLEL_WORKER is a rough mock. I can provide more > > accurate value in the near future. > > I think it requires more things than the number of indexes
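For context on the bypass optimization referenced in this exchange: in vacuumlazy.c, index vacuuming is skipped when only a tiny fraction of heap pages contain LP_DEAD items. The following is a simplified paraphrase (not the actual code; it assumes PostgreSQL's BlockNumber typedef, and BYPASS_THRESHOLD_PAGES is the 2% figure used there):
***
/* Simplified paraphrase of the existing bypass heuristic. */
static bool
index_vacuum_looks_worthwhile(BlockNumber rel_pages, BlockNumber lpdead_item_pages)
{
	double	threshold = rel_pages * 0.02;	/* BYPASS_THRESHOLD_PAGES */

	return lpdead_item_pages >= (BlockNumber) threshold;
}
***
This illustrates why a raw dead-tuple count is a weak predictor of index vacuuming cost: the cost is driven mostly by index size and the number of indexes, so even a table with few dead tuples can benefit from parallel index vacuuming if it is large and has many indexes.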
Re: POC: Parallel processing of indexes in autovacuum
Hi, On Fri, May 16, 2025 at 4:06 AM Matheus Alcantara wrote: > I've reviewed the v1-0001 patch, the build on MacOS using meson+ninja is > failing: > ❯❯❯ ninja -C build install > ninja: Entering directory `build' > [1/126] Compiling C object > src/backend/postgres_lib.a.p/utils_misc_guc_tables.c.o > FAILED: src/backend/postgres_lib.a.p/utils_misc_guc_tables.c.o > ../src/backend/utils/misc/guc_tables.c:3613:4: error: incompatible > pointer to integer conversion initializing 'int' with an expression of > type 'void *' [-Wint-conversion] > 3613 | NULL, > | ^~~~ > Thank you for reviewing this patch! > It seems that the "autovacuum_reserved_workers_num" declaration on > guc_tables.c has an extra gettext_noop() call? Good catch, I fixed this warning in the v2 version. > > One other point is that as you've added TAP tests for the autovacuum I > think you also need to create a meson.build file as you already create > the Makefile. > > You also need to update the src/test/modules/meson.build and > src/test/modules/Makefile to include the new test/modules/autovacuum > path. > OK, I should clarify this moment : modules/autovacuum is not a normal test but a sandbox - just an example of how we can trigger parallel index autovacuum. Also it may be used for debugging purposes. In fact, 001_autovac_parallel.pl is not verifying anything. I'll do as you asked (add all meson and Make stuff), but please don't focus on it. The creation of the real test is still in progress. (I'll try to complete it as soon as possible). In this letter I will divide the patch into 2 parts : implementation and sandbox. What do you think about implementation? -- Best regards, Daniil Davydov From 5a25535f5f4212ca756b9c67bcecf3a271ceb215 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Fri, 16 May 2025 11:59:03 +0700 Subject: [PATCH v2 2/2] Sandbox for parallel index autovacuum --- src/test/modules/Makefile | 1 + src/test/modules/autovacuum/.gitignore| 1 + src/test/modules/autovacuum/Makefile | 14 ++ src/test/modules/autovacuum/meson.build | 12 ++ .../autovacuum/t/001_autovac_parallel.pl | 129 ++ src/test/modules/meson.build | 1 + 6 files changed, 158 insertions(+) create mode 100644 src/test/modules/autovacuum/.gitignore create mode 100644 src/test/modules/autovacuum/Makefile create mode 100644 src/test/modules/autovacuum/meson.build create mode 100644 src/test/modules/autovacuum/t/001_autovac_parallel.pl diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile index aa1d27bbed3..b7f3e342e82 100644 --- a/src/test/modules/Makefile +++ b/src/test/modules/Makefile @@ -5,6 +5,7 @@ top_builddir = ../../.. include $(top_builddir)/src/Makefile.global SUBDIRS = \ + autovacuum \ brin \ commit_ts \ delay_execution \ diff --git a/src/test/modules/autovacuum/.gitignore b/src/test/modules/autovacuum/.gitignore new file mode 100644 index 000..0b54641bceb --- /dev/null +++ b/src/test/modules/autovacuum/.gitignore @@ -0,0 +1 @@ +/tmp_check/ \ No newline at end of file diff --git a/src/test/modules/autovacuum/Makefile b/src/test/modules/autovacuum/Makefile new file mode 100644 index 000..90c00ff350b --- /dev/null +++ b/src/test/modules/autovacuum/Makefile @@ -0,0 +1,14 @@ +# src/test/modules/autovacuum/Makefile + +TAP_TESTS = 1 + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = src/test/modules/autovacuum +top_builddir = ../../../.. 
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif \ No newline at end of file diff --git a/src/test/modules/autovacuum/meson.build b/src/test/modules/autovacuum/meson.build new file mode 100644 index 000..f91c1a14d2b --- /dev/null +++ b/src/test/modules/autovacuum/meson.build @@ -0,0 +1,12 @@ +# Copyright (c) 2022-2025, PostgreSQL Global Development Group + +tests += { + 'name': 'autovacuum', + 'sd': meson.current_source_dir(), + 'bd': meson.current_build_dir(), + 'tap': { +'tests': [ + 't/001_autovac_parallel.pl', +], + }, +} diff --git a/src/test/modules/autovacuum/t/001_autovac_parallel.pl b/src/test/modules/autovacuum/t/001_autovac_parallel.pl new file mode 100644 index 000..a44cbebe0fd --- /dev/null +++ b/src/test/modules/autovacuum/t/001_autovac_parallel.pl @@ -0,0 +1,129 @@ +use warnings FATAL => 'all'; +use PostgreSQL::Test::Cluster; +use PostgreSQL::Test::Utils; +use Test::More; + +my $psql_out; + +my $node = PostgreSQL::Test::Cluster->new('node1'); +$node->init; +$node->append_conf('postgresql.conf', qq{ + autovacuum = off + max
[BUG] Skipped initialization of some xl_xact_parsed_prepare fields
Hi, I noticed that inside ParsePrepareRecord function we are missing initialization of `nstats` and `nabortstats` fields of xl_xact_parsed_prepare structure: *** (current code in master) memset(parsed, 0, sizeof(*parsed)); parsed->xact_time = xlrec->prepared_at; parsed->origin_lsn = xlrec->origin_lsn; parsed->origin_timestamp = xlrec->origin_timestamp; parsed->twophase_xid = xlrec->xid; parsed->dbId = xlrec->database; parsed->nsubxacts = xlrec->nsubxacts; parsed->nrels = xlrec->ncommitrels; parsed->nabortrels = xlrec->nabortrels; parsed->nmsgs = xlrec->ninvalmsgs; *** Thus, they will always be 0 after calling the ParsePrepareRecord function, but `stats` and `abortstats` pointers are set correctly: *** (current code in master) parsed->stats = (xl_xact_stats_item *) bufptr; bufptr += MAXALIGN(xlrec->ncommitstats * sizeof(xl_xact_stats_item)); parsed->abortstats = (xl_xact_stats_item *) bufptr; bufptr += MAXALIGN(xlrec->nabortstats * sizeof(xl_xact_stats_item)); *** If we look (for example) on parsing of a commit record, we could see that both `nstats` and `stats` fields are set inside the ParseCommitRecord function. For now, zeroed number of stats lead to invalid output of `xact_desc_prepare`, because it relies on values of `nstats` and `nabortstats` fields: *** (current code in master) xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats); xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats); *** I suppose to add initialization of `nstats` and `nabortstats` to ParsePrepareRecord (see attached patch). -- Best regards, Daniil Davydov From 6338b32aeb170dfd23ae6d313ab2fb77058e0645 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Thu, 15 May 2025 12:19:03 +0700 Subject: [PATCH v1] Add initialization of nstats and nabortstats fields inside ParsePrepareRecord --- src/backend/access/rmgrdesc/xactdesc.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c index 715cc1f7bad..305598e2865 100644 --- a/src/backend/access/rmgrdesc/xactdesc.c +++ b/src/backend/access/rmgrdesc/xactdesc.c @@ -252,6 +252,8 @@ ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *p parsed->nsubxacts = xlrec->nsubxacts; parsed->nrels = xlrec->ncommitrels; parsed->nabortrels = xlrec->nabortrels; + parsed->nstats = xlrec->ncommitstats; + parsed->nabortstats = xlrec->nabortstats; parsed->nmsgs = xlrec->ninvalmsgs; strncpy(parsed->twophase_gid, bufptr, xlrec->gidlen); -- 2.43.0
Re: POC: Parallel processing of indexes in autovacuum
r so > that the user can specify the number of parallel vacuum workers for > the table. For example, we can have a reloption > autovacuum_parallel_workers. Setting 0 (by default) means to disable > parallel vacuum during autovacuum, and setting special value -1 means > to let PostgreSQL calculate the parallel degree for the table (same as > the default VACUUM command behavior). > ... > The patch includes the changes to bgworker.c so that we can reserve > some slots for autovacuums. I guess that this change is not > necessarily necessary because if the user sets the related GUC > parameters correctly the autovacuum workers can use parallel vacuum as > expected. Even if we need this change, I would suggest implementing > it as a separate patch. > .. > +#define AV_PARALLEL_DEADTUP_THRESHOLD 1024 > +#define NUM_INDEXES_PER_PARALLEL_WORKER 30 > > These fixed values really useful in common cases? Given that we rely on > users' heuristics which table needs to use parallel vacuum during > autovacuum, I think we don't need to apply these conditions. > .. I grouped these comments together because they all relate to a single question: how much freedom will we give to the user? Your opinion (as far as I understand) is that we should allow users to specify any number of parallel workers for a table, and that it is the user's responsibility to configure the related GUC variables appropriately, so that autovacuum can always process indexes in parallel; we don't need to think about thresholds. Even if the table has a small number of indexes and dead rows, if the user has set the table option, we must do parallel index a/v with the requested number of parallel workers. Please correct me if I messed something up. I think that this logic is well suited for the `VACUUM (PARALLEL)` SQL command, which is called manually by the user. But autovacuum (in my opinion) should work as stably as possible and go unnoticed by other processes. Thus, we must: 1) Compute resources (such as the number of parallel workers for a single table's index vacuuming) as efficiently as possible. 2) Guarantee that as many tables as possible (among those requested) will be processed in parallel. (1) can be achieved by calculating the parameters on the fly. NUM_INDEXES_PER_PARALLEL_WORKER is a rough mock; I can provide a more accurate value in the near future. (2) can be achieved by reserving workers - we know that N workers (from the bgworkers pool) are *always* at our disposal. When we use such workers, we do not depend on other operations in the cluster and we do not take resources away from them. If we give the user too much freedom in tuning parallel index a/v, all of these requirements may be violated. This is only my opinion, and I can agree with yours. Maybe we need another person to judge us? Please see the v3 patches, which contain the changes related to the GUC parameter and the table option (no changes in the global logic yet).
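To make point (2) a bit more concrete, here is a rough sketch of the reservation idea (the identifier shapes are mine, not the patch code): the bgworker slot accounting simply refuses to hand the last autovacuum_reserved_workers_num slots to anything other than an autovacuum request.
***
/* Sketch only: gate dynamic bgworker slot usage so that a fixed number of
 * slots always remains available to autovacuum. */
static bool
slot_may_be_used(bool for_autovacuum, int slots_in_use)
{
	int		usable = max_worker_processes;

	if (!for_autovacuum)
		usable -= autovacuum_reserved_workers_num;

	return slots_in_use < usable;
}
***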
-- Best regards, Daniil Davydov From 2223da7a9b2ef8c8d71780ad72b24eaf6d6c1141 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Fri, 16 May 2025 11:58:40 +0700 Subject: [PATCH v3 1/2] Parallel index autovacuum with bgworkers --- src/backend/access/common/reloptions.c| 11 src/backend/commands/vacuum.c | 55 +++ src/backend/commands/vacuumparallel.c | 46 ++-- src/backend/postmaster/autovacuum.c | 14 - src/backend/postmaster/bgworker.c | 33 ++- src/backend/utils/init/globals.c | 1 + src/backend/utils/misc/guc_tables.c | 12 src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/miscadmin.h | 1 + src/include/utils/guc_hooks.h | 2 + src/include/utils/rel.h | 10 11 files changed, 166 insertions(+), 20 deletions(-) diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 46c1dce222d..730096002b1 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -166,6 +166,15 @@ static relopt_bool boolRelOpts[] = }, true }, + { + { + "parallel_index_autovacuum_enabled", + "Allows autovacuum to process indexes of this table in parallel mode", + RELOPT_KIND_HEAP, + ShareUpdateExclusiveLock + }, + false + }, /* list terminator */ {{NULL}} }; @@ -1863,6 +1872,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind) {"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)}, {"autovacuum_enabled", RELOPT_TYPE_BOOL, offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)}, + {"parallel_index_autovacuum_enabled", RELOPT_TYPE_BOOL, + offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, p
Speedup truncations of temporary relation forks
Hi, For now we fully scan local buffers for each fork of the temporary relation that we want to truncate (in order to drop its buffers). It happens in the function "DropRelationBuffers". There used to be the same problem for regular tables (i.e. shared buffers) and it was fixed in commit [1] and now shared buffers are scanned only one time for those three relation forks. I suggest making the same fix for temporary relations. See attached patch. [1] 6d05086c0a79e50d8e91ed953626ec7280cd2481 BTW, I see that we call "DropRelationBuffers" separately for relation, toast table and indexes. What if we collect all this information in advance and iterate over the local/shared buffers only once? I understand that it will look kinda ugly, but it will increase performance for sure. -- Best regards, Daniil Davydov From 7b3c3e816b502a94d1ccc6889d86e07d2e1555fd Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Fri, 30 May 2025 17:50:47 +0700 Subject: [PATCH v1] Speedup temp table truncation --- src/backend/storage/buffer/bufmgr.c | 8 +++- src/backend/storage/buffer/localbuf.c | 20 +--- src/include/storage/buf_internals.h | 4 ++-- 3 files changed, 18 insertions(+), 14 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index f93131a645e..bd4c549eb14 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4550,11 +4550,9 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, if (RelFileLocatorBackendIsTemp(rlocator)) { if (rlocator.backend == MyProcNumber) - { - for (j = 0; j < nforks; j++) -DropRelationLocalBuffers(rlocator.locator, forkNum[j], - firstDelBlock[j]); - } + DropRelationLocalBuffers(rlocator.locator, forkNum, nforks, + firstDelBlock); + return; } diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 63101d56a07..7f51288546f 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -660,24 +660,30 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced) * See DropRelationBuffers in bufmgr.c for more notes. 
*/ void -DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber forkNum, - BlockNumber firstDelBlock) +DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber *forkNum, + int nforks, BlockNumber *firstDelBlock) { int i; + int j; for (i = 0; i < NLocBuffer; i++) { BufferDesc *bufHdr = GetLocalBufferDescriptor(i); uint32 buf_state; + if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator)) + continue; + buf_state = pg_atomic_read_u32(&bufHdr->state); - if ((buf_state & BM_TAG_VALID) && - BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator) && - BufTagGetForkNum(&bufHdr->tag) == forkNum && - bufHdr->tag.blockNum >= firstDelBlock) + for (j = 0; j < nforks; j++) { - InvalidateLocalBuffer(bufHdr, true); + if ((buf_state & BM_TAG_VALID) && +BufTagGetForkNum(&bufHdr->tag) == forkNum[j] && +bufHdr->tag.blockNum >= firstDelBlock[j]) + { +InvalidateLocalBuffer(bufHdr, true); + } } } } diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h index 0dec7d93b3b..52a71b138f7 100644 --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -486,8 +486,8 @@ extern bool StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait); extern void FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln); extern void InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced); extern void DropRelationLocalBuffers(RelFileLocator rlocator, - ForkNumber forkNum, - BlockNumber firstDelBlock); + ForkNumber *forkNum, int nforks, + BlockNumber *firstDelBlock); extern void DropRelationAllLocalBuffers(RelFileLocator rlocator); extern void AtEOXact_LocalBuffers(bool isCommit); -- 2.43.0
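As for the question at the end of the mail above (dropping buffers for a relation, its TOAST table and its indexes in a single pass), a very rough sketch of the shape it could take is below. This is not part of the attached patch; the function name and the idea of passing an array of locators are assumptions, and it ignores the per-fork truncation points for simplicity.
***
/* Sketch only: invalidate local buffers of several relations in one scan. */
static void
DropManyRelationsLocalBuffers(const RelFileLocator *rlocators, int nrels)
{
	for (int i = 0; i < NLocBuffer; i++)
	{
		BufferDesc *bufHdr = GetLocalBufferDescriptor(i);

		if (!(pg_atomic_read_u32(&bufHdr->state) & BM_TAG_VALID))
			continue;

		for (int j = 0; j < nrels; j++)
		{
			if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocators[j]))
			{
				InvalidateLocalBuffer(bufHdr, true);
				break;
			}
		}
	}
}
***
Whether the extra bookkeeping pays off is exactly what the measurements discussed later in this thread would have to show.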
Re: Speedup truncations of temporary relation forks
Hi, On Sun, Jun 1, 2025 at 5:31 PM Dilip Kumar wrote: > > On Sun, Jun 1, 2025 at 7:52 AM Michael Paquier wrote: > > > > I doubt that it would be a good idea to apply a patch "just" because > > it looks like a good idea. It is important to prove that something is > > a good idea first. > > I think it makes sense to do the optimization for temporary tables as > well, I tried testing with the below test case[1] and I can see ~18% > improvement with the patch. > > On head it is taking ~78 ms to truncate whereas with patch it is just > taking 66ms. > > [1] > set temp_buffers ='8GB'; > show temp_buffers; > BEGIN; > CREATE TEMPORARY TABLE test(a int, b varchar); > INSERT INTO test select i, repeat('a', 100) from > generate_series(1,100) as i; > ANALYZE ; > select relpages from pg_class where relname='test'; > TRUNCATE TABLE test; > ROLLBACK; Thank you very much for your help! I have also done some performance measurements:
set temp_buffers = '1GB';
BEGIN;
CREATE TEMP TABLE test (id INT) ON COMMIT DELETE ROWS;
INSERT INTO test SELECT generate_series(1, 3000);
DELETE FROM test WHERE id % 1000 = 0; -- force postgres to create fsm
ANALYZE test;
COMMIT;
*postgres was running on a ramdisk with swap disabled*
Thus, we are creating a 1 GB table, so that the local buffers are completely full and contain only the pages of this table. To measure the time, I hardcoded calls to GetCurrentTimestamp and TimestampDifference. I got ~7% improvement with the patch. Note that the table had only 2 forks - main and fsm (I haven't figured out how to force postgres to create a visibility map for a temp table within the transaction block). -- Best regards, Daniil Davydov
Re: Speedup truncations of temporary relation forks
Hi, On Mon, Jun 2, 2025 at 11:14 AM Dilip Kumar wrote: > > > (I haven't figured out how to force postgres to > > create a visibility map for temp table within the transaction block). > > I haven't tested this, but I think if you do bulk copy into a table > which should mark pages all visible and after that if you delete some > tuple from pages logically it should try to update the status to not > all visible in vm? > I found a reliable way to create all three forks:
begin;
create temp table test (id int) on commit delete rows;
copy test from 'data.csv' with (freeze);
insert into test select generate_series(200, 300);
delete from test where id % 50 = 0;
commit;
-- Best regards, Daniil Davydov
Re: Speedup truncations of temporary relation forks
Hi, On Sat, May 31, 2025 at 7:49 AM Michael Paquier wrote: > > On Fri, May 30, 2025 at 06:01:16PM +0700, Daniil Davydov wrote: > > For now we fully scan local buffers for each fork of the temporary > > relation that we want to truncate (in order to drop its buffers). It > > happens in the function "DropRelationBuffers". > > There used to be the same problem for regular tables (i.e. shared > > buffers) and it was fixed in commit [1] and now shared buffers are > > scanned only one time for those three relation forks. > > I suggest making the same fix for temporary relations. See attached patch. > > Applying the same kind of optimization for local buffers makes sense > here, even if the impact is more limited than regular relations. > Thanks for looking into it! > > BTW, I see that we call "DropRelationBuffers" separately for relation, > > toast table and indexes. What if we collect all this information in > > advance and iterate over the local/shared buffers only once? > > I understand that it will look kinda ugly, but it will increase > > performance for sure. > > I guess it does. Do you have numbers to share with a test case? > Not yet. I proceed from the assumption that if the temp_buffers parameter is set to a large value (some users set it to more than a gigabyte), then the vast majority of time is spent iterating through the local buffers. Thus, if we reduce the number of iterations from N to (for example) N/10, we can get a 10x increase in performance. Of course, this is a super rough assumption, but I think you understand my point. In the near future I will prepare a patch for the idea above and try to do some measurements. If there is a significant difference, I will definitely let you know. Anyway, first I suggest committing the current patch. > Please make sure to add this patch to the next commit fest. OK, already created. -- Best regards, Daniil Davydov
Re: Speedup truncations of temporary relation forks
Hi, On Sat, May 31, 2025 at 7:41 PM Fujii Masao wrote: > > Here are a few review comments on the patch: > > + for (j = 0; j < nforks; j++) > { > - InvalidateLocalBuffer(bufHdr, true); > + if ((buf_state & BM_TAG_VALID) && > + BufTagGetForkNum(&bufHdr->tag) == forkNum[j] > && > + bufHdr->tag.blockNum >= firstDelBlock[j]) > + { > + InvalidateLocalBuffer(bufHdr, true); > + } > > It looks like the "buf_state & BM_TAG_VALID" check can be moved > outside the loop, along with the BufTagMatchesRelFileLocator() check. > That would avoid unnecessary looping. > > Also, should we add a "break" right after calling InvalidateLocalBuffer()? > Since the buffer has already been invalidated, continuing the loop > may not be necessary. Thanks for the review! I'll fix both remarks. Please see the v2 patch. -- Best regards, Daniil Davydov From 203be6d23be59db2d3b9c1ace501ae94c61a4c27 Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Fri, 30 May 2025 17:50:47 +0700 Subject: [PATCH v2] Speedup temp table truncation --- src/backend/storage/buffer/bufmgr.c | 8 +++- src/backend/storage/buffer/localbuf.c | 21 ++--- src/include/storage/buf_internals.h | 4 ++-- 3 files changed, 19 insertions(+), 14 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index f93131a645e..bd4c549eb14 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -4550,11 +4550,9 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, if (RelFileLocatorBackendIsTemp(rlocator)) { if (rlocator.backend == MyProcNumber) - { - for (j = 0; j < nforks; j++) -DropRelationLocalBuffers(rlocator.locator, forkNum[j], - firstDelBlock[j]); - } + DropRelationLocalBuffers(rlocator.locator, forkNum, nforks, + firstDelBlock); + return; } diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 63101d56a07..275dc756f60 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -660,10 +660,11 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced) * See DropRelationBuffers in bufmgr.c for more notes. 
*/ void -DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber forkNum, - BlockNumber firstDelBlock) +DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber *forkNum, + int nforks, BlockNumber *firstDelBlock) { int i; + int j; for (i = 0; i < NLocBuffer; i++) { @@ -672,12 +673,18 @@ DropRelationLocalBuffers(RelFileLocator rlocator, ForkNumber forkNum, buf_state = pg_atomic_read_u32(&bufHdr->state); - if ((buf_state & BM_TAG_VALID) && - BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator) && - BufTagGetForkNum(&bufHdr->tag) == forkNum && - bufHdr->tag.blockNum >= firstDelBlock) + if (!(buf_state & BM_TAG_VALID) || + !BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator)) + continue; + + for (j = 0; j < nforks; j++) { - InvalidateLocalBuffer(bufHdr, true); + if (BufTagGetForkNum(&bufHdr->tag) == forkNum[j] && +bufHdr->tag.blockNum >= firstDelBlock[j]) + { +InvalidateLocalBuffer(bufHdr, true); +break; + } } } } diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h index 0dec7d93b3b..52a71b138f7 100644 --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -486,8 +486,8 @@ extern bool StartLocalBufferIO(BufferDesc *bufHdr, bool forInput, bool nowait); extern void FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln); extern void InvalidateLocalBuffer(BufferDesc *bufHdr, bool check_unreferenced); extern void DropRelationLocalBuffers(RelFileLocator rlocator, - ForkNumber forkNum, - BlockNumber firstDelBlock); + ForkNumber *forkNum, int nforks, + BlockNumber *firstDelBlock); extern void DropRelationAllLocalBuffers(RelFileLocator rlocator); extern void AtEOXact_LocalBuffers(bool isCommit); -- 2.43.0
Re: [BUG] Cannot flush buffers of temp table deleted from other session
> > BTW, there are other bugs that can occur during interacting with other > session temp tables. I described them and suggested corrections in > this thread [1]. > > [1] > https://www.postgresql.org/message-id/flat/CAJDiXgj72Axj0d4ojKdRWG_rnkfs4uWY414NL%3D15sCvh7-9rwg%40mail.gmail.com Sorry, this is an outdated thread. I planned to continue discussing this topic here: https://www.postgresql.org/message-id/CAJDiXghdFcZ8=nh4g69te7irr3q0ufyxxb3zdg09_gtnzxw...@mail.gmail.com -- Best regards, Daniil Davydov
[BUG] Cannot flush buffers of temp table deleted from other session
Hi, Ability to do a DROP temporary table of other session leads to error : session 1 : create temp table test (id int); -- let's assume that 'test' will be placed in schema 'pg_temp_0' and has relfilepath 'base/5/t0_16390' insert into test select generate_series(1, 1000); -- make few buffers dirty session 2 : drop table pg_temp_0.test; -- drop temp table of session 1 session 1: create temp table test (id int); -- create another temp table -- make dirty too many buffers (> temp_buffers), so postgres will try to flush some of them insert into test select generate_series(1, 25); ERROR: could not open file "base/5/t0_16390": No such file or directory Thus, we were trying to flush buffers of an already deleted temp table. Error occurs here : GetLocalVictimBuffer -> FlushLocalBuffer -> smgropen(BufTagGetRelFileLocator(&bufHdr->tag), MyProcNumber). To be honest, I don't see a good way to handle this error, because there may be a lot of reasons why flushing failed (up to the point that the buffer tag may be corrupted). I attach a patch (targeted on the master branch) that contains a sample idea of error handling - don't throw an error if we can ensure that the relation has been deleted. Patch contains a very rough assumption, that relation's relfilenode == relation's oid. What do you think about it? BTW, there are other bugs that can occur during interacting with other session temp tables. I described them and suggested corrections in this thread [1]. [1] https://www.postgresql.org/message-id/flat/CAJDiXgj72Axj0d4ojKdRWG_rnkfs4uWY414NL%3D15sCvh7-9rwg%40mail.gmail.com -- Best regards, Daniil Davydov From 7775fc7afbefceebf88d8be1cd94da939807035c Mon Sep 17 00:00:00 2001 From: Daniil Davidov Date: Thu, 12 Jun 2025 02:11:43 +0700 Subject: [PATCH] Handle error with flushing dirty buffer of deleted temp table --- src/backend/storage/buffer/localbuf.c | 90 +-- 1 file changed, 84 insertions(+), 6 deletions(-) diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index ba26627f7b0..8b9da95c095 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -26,6 +26,7 @@ #include "utils/memdebug.h" #include "utils/memutils.h" #include "utils/resowner.h" +#include "utils/syscache.h" /*#define LBDEBUG*/ @@ -225,6 +226,7 @@ GetLocalVictimBuffer(void) int victim_bufid; int trycounter; BufferDesc *bufHdr; + MemoryContext old_mcxt = CurrentMemoryContext; ResourceOwnerEnlarge(CurrentResourceOwner); @@ -281,12 +283,88 @@ GetLocalVictimBuffer(void) LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage(); } - /* - * this buffer is not referenced but it might still be dirty. if that's - * the case, write it out before reusing it! - */ - if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY) - FlushLocalBuffer(bufHdr, NULL); + PG_TRY(); + { + /* + * this buffer is not referenced but it might still be dirty. if that's + * the case, write it out before reusing it! + */ + if (pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY) + FlushLocalBuffer(bufHdr, NULL); + } + PG_CATCH(); + { + MemoryContext error_mctx; + ErrorData *error; + bool can_handle_error = false; + + /* Switch to the original context & copy edata */ + error_mctx = MemoryContextSwitchTo(old_mcxt); + error = CopyErrorData(); + + /* + * If we failed during file access, there can be two cases : + * 1) Buffer belongs to relation deleted from other session. + * 2) Any other error. + * + * In first case we still can continue without throwing error - just + * don't try to flush this buffer. 
+ */ + errcode_for_file_access(); + if (error->sqlerrcode == ERRCODE_UNDEFINED_FILE) + { + HeapTuple tuple; + Oid estimated_oid; + + /* + * Buffer tag contains only relfilenode, that don't necessary + * matches relation's oid. For lack of other information, just + * assume that oid == relfilenumber. + */ + estimated_oid = BufTagGetRelFileLocator(&bufHdr->tag).relNumber; + + /* + * Try to find relcache entry for relation. If we fail - that means + * that relation was deleted (e.g. first case). + */ + tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(estimated_oid)); + if (!HeapTupleIsValid(tuple)) + { +uint32 buf_state; +RelPathStr path; +SMgrRelation reln; + +can_handle_error = true; + +reln = smgropen(BufTagGetRelFileLocator(&bufHdr->tag), +MyProcNumber); + +path = relpath(reln->smgr_rlocator, + BufTagGetForkNum(&bufHdr->tag)); + +ereport(WARNING, + (errmsg("encountered dirty local buffer that belongs to" +"to deleted relation \"%s\"", path.str), + errhint("if this relation had not
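One possible way to avoid the relfilenode == oid assumption made in the sketch above: if I remember correctly, the backend already has a helper that maps a relfilenumber back to a relation OID (RelidByRelfilenumber() in relfilenumbermap.c), so the lookup could be done roughly as below. This is a sketch under that assumption - the exact signature and the tablespace normalization should be double-checked, and it still requires catalog access from within a transaction.
***
/* Sketch only: resolve the buffer's relation OID without assuming that the
 * relfilenumber equals the pg_class OID. */
static Oid
relid_for_local_buffer(const BufferDesc *bufHdr)
{
	RelFileLocator rlocator = BufTagGetRelFileLocator(&bufHdr->tag);
	Oid			spcOid = rlocator.spcOid;

	/* pg_class.reltablespace stores 0 for the database's default tablespace */
	if (spcOid == MyDatabaseTableSpace)
		spcOid = InvalidOid;

	return RelidByRelfilenumber(spcOid, rlocator.relNumber);
}
***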
Re: [BUG] Cannot flush buffers of temp table deleted from other session
Hi, On Thu, Jun 12, 2025 at 2:40 AM Tom Lane wrote: > > Daniil Davydov <3daniss...@gmail.com> writes: > > Ability to do a DROP temporary table of other session leads to error : > > [ shrug... ] Don't do that. A superuser is capable of doing lots > of things that will break the database, and this one hardly seems > like one of the worse ones. OK, I see your point, thanks. IMHO, dropping another session's temporary tables and (for example) deleting rows from pg_class are not really comparable in how obvious the bad consequences are. But as far as I understand, we are not going to spell out such details in the documentation. > > > BTW, there are other bugs that can occur during interacting with other > > session temp tables. I described them and suggested corrections in > > this thread [1]. > > I doubt I'd agree that any such thing is a bug we need to fix. > You should not be attempting to touch other sessions' temp tables. > (If there's any path by which a non-superuser can do that, we > probably need to tighten some permission check somewhere.) Let's forget about DROP for a while. By design, Postgres does not allow even a superuser to run commands like SELECT/UPDATE/DELETE/TRUNCATE ... on another session's temp tables. You can find the corresponding error messages, for example, in bufmgr.c: *** if (RELATION_IS_OTHER_TEMP(reln)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot access temporary tables of other sessions"))); *** But currently the query parser does not recognize other sessions' temp tables as temporary, so we can get weird query results instead of the error that the design prescribes (which is even worse than an unexpected error). I am sure that this is a bug and we should fix this behavior. -- Best regards, Daniil Davydov
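For reference, the check quoted above hinges on this macro (cited from memory of utils/rel.h, so double-check the exact wording there): a relation counts as "other temp" only when it is temporary but was not created by this backend.
***
/* As defined (approximately) in utils/rel.h */
#define RELATION_IS_OTHER_TEMP(relation) \
	((relation)->rd_rel->relpersistence == RELPERSISTENCE_TEMP && \
	 !(relation)->rd_islocaltemp)
***
If a code path never reaches such a relation-level check because the table was not recognized as temporary in the first place, the session gets the weird query results described above instead of the intended error.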