Re: psql: fix variable existence tab completion

2024-05-07 Thread Anton A. Melnikov

Hi, Alexander!

On 06.05.2024 13:19, Alexander Korotkov wrote:

The attached patch fixes the 010_tab_completion.pl test in the same way as [1].


Thank you for the fix.  As I understand it, the fix teaches
010_tab_completion.pl to tolerate the invalid result of tab
completion.  Do you think we could fix it another way to make the
result of tab completion correct?


Right now I don't see any straightforward way to make the tab completion 
itself correct.
There are several similar cases in this test.
E.g., for commands such as:
 
 CREATE TABLE "mixedName" (f1 int, f2 text);

 select * from "mi ;

tab completion on Debian 10 gives:
postgres=# select * from \"mixedName\" ;

resulting in an error.
 
There is already a similar workaround in 010_tab_completion.pl, using the regex: qr/"mixedName\\?" /


I think that if there were, or will be, complaints from users about this 
behavior on Debian 10,
then it makes sense to look for a more involved solution that fixes the 
backslash substitution.
If there are no such complaints, then a workaround in the test is the better option.

With the best wishes,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: elog/ereport VS misleading backtrace_function function address

2024-05-07 Thread Jakub Wartak
Hi Tom and -hackers!

On Thu, Mar 28, 2024 at 7:36 PM Tom Lane  wrote:
>
> Jakub Wartak  writes:
> > While chasing some other bug I've learned that backtrace_functions
> > might be misleading with top elog/ereport() address.
>
> That was understood from the beginning: this type of backtrace is
> inherently pretty imprecise, and I doubt there is much that can
> be done to make it better.  IIRC the fundamental problem is it only
> looks at global symbols, so static functions inherently defeat it.
> It was argued that this is better than nothing, which is true, but
> you have to take the info with a mountain of salt.

OK, point taken, and thanks for describing the limitations. Still, I often
find backtrace_functions the best tool we have, primarily due to its
simplicity and ease of use (for customers). I found the simplest portable
way to generate a NOP (or any other instruction that makes the problem go
away):

with the reproducer, psql returns:

psql:ri-collation-bug-example.sql:48: error: ERROR:  XX000: cache
lookup failed for collation 0
LOCATION:  get_collation_isdeterministic, lsyscache.c:1062

logfile without patch:

2024-05-07 09:05:43.043 CEST [44720] ERROR:  cache lookup failed for collation 0
2024-05-07 09:05:43.043 CEST [44720] BACKTRACE:
postgres: postgres postgres [local] DELETE(+0x155883) [0x55ce5a86a883]
postgres: postgres postgres [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x55ce5ac6dfd0]
postgres: postgres postgres [local] DELETE(+0x2d488b) [0x55ce5a9e988b]
[..]

$ addr2line -e /usr/pgsql18/bin/postgres  -a -f 0x155883
0x00155883
get_constraint_type.cold << wrong: the real function
should be get_collation_isdeterministic

logfile with attached patch:

2024-05-07 09:11:06.596 CEST [51185] ERROR:  cache lookup failed for collation 0
2024-05-07 09:11:06.596 CEST [51185] BACKTRACE:
postgres: postgres postgres [local] DELETE(+0x168bf0) [0x560e1cdfabf0]
postgres: postgres postgres [local]
DELETE(RI_FKey_cascade_del+0x2b0) [0x560e1d200c00]
postgres: postgres postgres [local] DELETE(+0x2e90fb) [0x560e1cf7b0fb]
[..]

$ addr2line -e /usr/pgsql18/bin/postgres  -a -f 0x168bf0
0x00168bf0
get_collation_isdeterministic.cold <<< It's ok with the patch

NOTE: if you plan to test this, you cannot ./configure with
--enable-debug, as that prevents the compiler optimizations that produce
the ".cold" sections, which in turn cause backtrace() to return the
wrong address.

> I recall speculating about whether we could somehow invoke gdb
> to get a more comprehensive and accurate backtrace, but I don't
> really have a concrete idea how that could be made to work.

Oh no, I'm aiming for a micro-fix rather than some bigger revolution. The
patch simply adds one instruction in the macro, so that backtrace_functions
now behaves better:

   0x00773d28 <+105>:   call   0x79243f 
   0x00773d2d <+110>:   movzbl -0x12(%rbp),%eax  << this ends
up being added by the patch
   0x00773d31 <+114>:   call   0xdc1a0 

I'll register it for PG18 in the CF, but maybe it could be backpatched too
- no strong opinion on that (I don't see how that ERROR/FATAL path could
cause any performance regression).

-J.


v1-0001-Tweak-ereport-so-that-it-generates-proper-address.patch
Description: Binary data


Re: cataloguing NOT NULL constraints

2024-05-07 Thread Kyotaro Horiguchi
Hello,

At Wed, 1 May 2024 19:49:35 +0200, Alvaro Herrera  
wrote in 
> Here are two patches that I intend to push soon (hopefully tomorrow).

This commit added and edited two error messages, which ended up using
slightly different wordings, "in" and "on", for constraints on relations.

+   errmsg("cannot change NO INHERIT status of NOT NULL constraint \"%s\" on 
relation \"%s\"",
===
+   errmsg("cannot change NO INHERIT status of NOT NULL constraint \"%s\" in 
relation \"%s\"",

I think we usually use "on" in this case.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread Richard Guo
On Tue, May 7, 2024 at 1:22 PM David Rowley  wrote:

> It would be good to get some build farm coverage of this so we don't
> have to rely on manual testing.  I wonder if it's a good idea to just
> define REALLOCATE_BITMAPSETS when #ifdef CLOBBER_FREED_MEMORY... or if
> we should ask on the buildfarm-members list if someone wouldn't mind
> defining it?


+1 to have build farm coverage of REALLOCATE_BITMAPSETS.  This flag
seems quite useful.

Thanks
Richard


Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread Richard Guo
On Tue, May 7, 2024 at 1:46 PM David Rowley  wrote:

> On Tue, 7 May 2024 at 17:28, Tom Lane  wrote:
> > What I'm trying to figure out here is whether we have a live bug
> > in this area in released branches; and if so, why we've not seen
> > reports of that.
>
> We could check what portions of REALLOCATE_BITMAPSETS are
> backpatchable. It may not be applicable very far back because of v16's
> 00b41463c. The bms_del_member() would have left a zero set rather than
> doing bms_free() prior to that commit.  There could be a bug in v16.


I also think there might be a bug in v16, given that
'sjinfo->syn_lefthand' and 'rinfo->outer_relids' can reference the
same bitmapset while the contents of that bitmapset are altered through
'sjinfo->syn_lefthand' without 'rinfo->outer_relids' being aware of
the change.  I tried to compose a query that triggers this bug, but
failed.

Another thing that comes to my mind is that this issue shows that
RestrictInfo.outer_relids could contain references to removed rels and
joins, and RestrictInfo.outer_relids could be referenced after the
removal of useless left joins.  Currently we do not have a mechanism to
clean out the bits in outer_relids during outer join removal.  That is
to say, RestrictInfo.outer_relids might be referenced while including
rels that should have been removed.  I'm not sure if this is a problem.

Thanks
Richard


Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-05-07 Thread Daniel Gustafsson
> On 7 May 2024, at 01:31, Michael Paquier  wrote:
> 
> On Fri, May 03, 2024 at 10:39:15AM +0200, Daniel Gustafsson wrote:
>> They are no-ops when linking against v18, but writing an extension which
>> targets all supported versions of postgres along with their respective
>> supported OpenSSL versions make them still required, or am I missing 
>> something?
> 
> Yeah, that depends on how many versions you expect your application to
> work on.  Still it seems to me that there's value in mentioning that
> if your application does not care about anything older than OpenSSL 
> 1.1.0, like PG 18 assuming that this patch is merged, then these calls
> are pointless for HEAD.  The routine definitions would be around only
> for the .so compatibility.

Fair enough.  I've taken a stab at documenting that the functions are
deprecated, while at the same time documenting when and how they can be used
(and be useful).  The attached also removes one additional comment in the
testcode which is now obsolete (since removing 1.0.1 support really), and fixes
the spurious whitespace you detected upthread.

--
Daniel Gustafsson



v13-0002-Remove-pg_strong_random-initialization.patch
Description: Binary data


v13-0001-Remove-support-for-OpenSSL-1.0.2.patch
Description: Binary data


Skip adding row-marks for non target tables when result relation is foreign table.

2024-05-07 Thread SAIKIRAN AVULA
Hi PostgreSQL Community,

I would like to bring to your attention an observation regarding the
planner's behavior for foreign table update/delete operations. It appears
that the planner adds rowmarks (ROW_MARK_COPY) for non-target tables, which
I believe is unnecessary when using the postgres-fdw. This is because
postgres-fdw performs early locking on tuples belonging to the target
foreign table by utilizing the SELECT FOR UPDATE clause.

In an attempt to address this, I tried implementing late locking. However,
this approach still doesn't work as intended because the API assumes that
foreign table rows can be re-fetched using TID (ctid). This assumption is
invalid for partitioned tables on the foreign server. Additionally, the
commit afb9249d06f47d7a6d4a89fea0c3625fe43c5a5d, which introduced late
locking for foreign tables, mentions that the benefits of late locking
against a remote server are unclear, as the extra round trips required are
likely to outweigh any potential concurrency improvements.

To address this issue, I have taken the initiative to create a patch that
prevents the addition of rowmarks for non-target tables when the target
table is using early locking. I would greatly appreciate it if you could
review the patch and provide any feedback or insights I may be overlooking.

Example query plan with my change: (foreign scan doesn't fetch whole row
for bar).

postgres=# \d+ bar
  Foreign table "public.bar"
 Column |  Type   | Collation | Nullable | Default |FDW options |
Storage | Stats target | Description
+-+---+--+-++-+--+-
 b1 | integer |   |  | | (column_name 'b1') |
plain   |  |
 b2 | integer |   |  | | (column_name 'b2') |
plain   |  |
Server: postgres_1
FDW options: (schema_name 'public', table_name 'bar')

router=# \d+ foo
  Foreign table "public.foo"
 Column |  Type   | Collation | Nullable | Default |FDW options |
Storage | Stats target | Description
+-+---+--+-++-+--+-
 f1 | integer |   |  | | (column_name 'f1') |
plain   |  |
 f2 | integer |   |  | | (column_name 'f2') |
plain   |  |
Server: postgres_2
FDW options: (schema_name 'public', table_name 'foo')

postgres=# explain verbose update foo set f1 = b1 from bar where f2=b2;
   QUERY PLAN


 Update on public.foo  (cost=200.00..48713.72 rows=0 width=0)
   Remote SQL: UPDATE public.foo SET f1 = $2 WHERE ctid = $1
   ->  Nested Loop  (cost=200.00..48713.72 rows=15885 width=42)
 Output: bar.b1, foo.ctid, foo.*
 Join Filter: (foo.f2 = bar.b2)
 ->  Foreign Scan on public.bar  (cost=100.00..673.20 rows=2560
width=8)
   Output: bar.b1, bar.b2
   Remote SQL: SELECT b1, b2 FROM public.bar
 ->  Materialize  (cost=100.00..389.23 rows=1241 width=42)
   Output: foo.ctid, foo.*, foo.f2
   ->  Foreign Scan on public.foo  (cost=100.00..383.02
rows=1241 width=42)
 Output: foo.ctid, foo.*, foo.f2
 Remote SQL: SELECT f1, f2, ctid FROM public.foo FOR
UPDATE
(13 rows)


Thank you for your time and consideration.


Regards
Saikiran Avula
SDE, Amazon Web Services.


v1-0001-Skip-adding-row-marks-for-not-target-relations-wh.patch
Description: Binary data


Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread Alexander Korotkov
On Tue, May 7, 2024 at 1:19 PM Richard Guo  wrote:
> On Tue, May 7, 2024 at 1:46 PM David Rowley  wrote:
>>
>> On Tue, 7 May 2024 at 17:28, Tom Lane  wrote:
>> > What I'm trying to figure out here is whether we have a live bug
>> > in this area in released branches; and if so, why we've not seen
>> > reports of that.
>>
>> We could check what portions of REALLOCATE_BITMAPSETS are
>> backpatchable. It may not be applicable very far back because of v16's
>> 00b41463c. The bms_del_member() would have left a zero set rather than
>> doing bms_free() prior to that commit.  There could be a bug in v16.
>
>
> I also think there might be a bug in v16, as long as
> 'sjinfo->syn_lefthand' and 'rinfo->outer_relids' are referencing the
> same bitmapset and the content of this bitmapset is altered through
> 'sjinfo->syn_lefthand' without 'rinfo->outer_relids' being aware of
> these changes.  I tried to compose a query that can trigger this bug but
> failed though.

Can sjinfo->syn_lefthand become an empty set after bms_del_member()?  If
so, rinfo->outer_relids will become an invalid pointer.  That would
obviously be a bug, although it still might be very hard to make it
trigger a segfault.

--
Regards,
Alexander Korotkov
Supabase




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread Alexander Korotkov
On Tue, May 7, 2024 at 8:29 AM Tom Lane  wrote:
> David Rowley  writes:
> > Yeah, before the revert, that did:
> > -   sjinf->syn_lefthand = replace_relid(sjinf->syn_lefthand, relid, 
> > subst);
> > That replace code seems to have always done a bms_copy()
>
> Hmm, not always; see e0477837c.
>
> What I'm trying to figure out here is whether we have a live bug
> in this area in released branches; and if so, why we've not seen
> reports of that.

I haven't yet spotted a particular bug.  But this place looks dangerous,
and it's very hard to prove there is no bug.  Even if there is none, it
seems very easy to introduce one here unintentionally.  Should we just
accept always doing bms_copy()?

--
Regards,
Alexander Korotkov
Supabase




Re: CREATE DATABASE with filesystem cloning

2024-05-07 Thread Nazir Bilal Yavuz
Hi,

On Wed, 6 Mar 2024 at 05:17, Thomas Munro  wrote:
>
> The main thing that is missing is support for redo.  It's mostly
> trivial I think, probably just a record type for "try cloning first"
> and then teaching that clone function to fall back to the regular copy
> path if it fails in recovery, do you agree with that idea?  Another
> approach would be to let it fail if it doesn't work on the replica, so
> you don't finish up using dramatically different amounts of disk
> space, but that seems terrible because now your replica is broken.  So
> probably fallback with logged warning (?), though I'm not sure exactly
> which errnos to give that treatment to.

Thomas and I had an off-list talk, and we thought that making this a GUC
option instead of a SQL-command-level option could solve this problem.

I am posting a new rebased version of the patch with some important changes:

* A 'createdb_file_copy_method' GUC is created.  Possible values are
'copy' and 'clone'; 'copy' is the default.  'clone' can be chosen if the
system supports it, otherwise it raises an error at startup.

* The #else part of the clone_file() function calls pg_unreachable()
instead of ereport().

* Documentation updates.

Also, what should happen when the kernel has clone support but the
file system does not?

- I tested this on Linux, and copy_file_range() silently falls back to a
normal copy when this happens. I assume the same will happen on FreeBSD,
because it uses copy_file_range() as well.

- I am not sure about macOS, since the force flag
(COPYFILE_CLONE_FORCE) is used.  I do not have macOS, so I cannot test
it, but I assume it will error out when this happens.  If that is the
case, is it a problem?  I am asking because, since this is a GUC now,
the user takes full responsibility.

Any kind of feedback would be appreciated.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 404e301dbdb252c23ea9d451b817cf6e372d0d9a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 7 May 2024 14:16:09 +0300
Subject: [PATCH v5] Use CLONE method with GUC on CREATE DATABASE ...
 STRATEGY=FILE_COPY.

createdb_file_copy_method GUC option is introduced. It can be set to
either COPY (default) or CLONE (if system supports it).

If the CLONE option is chosen, it behaves like STRATEGY=FILE_COPY, but
attempts to use efficient file-copying system calls.  The kernel then has
the opportunity to share block ranges in copy-on-write file systems, or
perhaps push down the copy to network file systems and storage devices.

Currently works on Linux, FreeBSD and macOS.  More systems could be
supported.

Author: Thomas Munro 
Author: Nazir Bilal Yavuz 
Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
---
 src/include/commands/dbcommands.h |  9 ++
 src/include/storage/copydir.h |  3 +-
 src/backend/commands/dbcommands.c | 21 -
 src/backend/storage/file/copydir.c| 83 ++-
 .../utils/activity/wait_event_names.txt   |  1 +
 src/backend/utils/misc/guc_tables.c   | 13 +++
 src/backend/utils/misc/postgresql.conf.sample |  5 ++
 doc/src/sgml/config.sgml  | 17 
 doc/src/sgml/ref/create_database.sgml | 24 --
 src/tools/pgindent/typedefs.list  |  1 +
 10 files changed, 162 insertions(+), 15 deletions(-)

diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index 92e17c71158..2e1d3565edc 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -14,6 +14,15 @@
 #ifndef DBCOMMANDS_H
 #define DBCOMMANDS_H
 
+typedef enum CreateDBFileCopyMethod
+{
+	CREATEDB_FILE_COPY_METHOD_COPY,
+	CREATEDB_FILE_COPY_METHOD_CLONE,
+} CreateDBFileCopyMethod;
+
+/* GUC parameters */
+extern PGDLLIMPORT int createdb_file_copy_method;
+
 #include "access/xlogreader.h"
 #include "catalog/objectaddress.h"
 #include "lib/stringinfo.h"
diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..9ff28f2eec9 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -13,7 +13,8 @@
 #ifndef COPYDIR_H
 #define COPYDIR_H
 
-extern void copydir(const char *fromdir, const char *todir, bool recurse);
+extern void copydir(const char *fromdir, const char *todir, bool recurse,
+	bool clone_files);
 extern void copy_file(const char *fromfile, const char *tofile);
 
 #endif			/* COPYDIR_H */
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index be629ea92cf..cf0dff65e69 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -69,6 +69,20 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 
+/* GUCs */
+int			createdb_file_copy_method = CREATEDB_FILE_COPY_METHOD_COPY;
+
+/*
+ * GUC support
+ */
+const struct config_enum_entry createdb_file_copy_method_options[] = {
+	{"copy", CREATEDB_FILE_COPY_METHOD_COPY, false},
+#if (defined(HAVE_COPYFILE) && 

Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread Andrew Dunstan


On 2024-05-07 Tu 06:05, Richard Guo wrote:


On Tue, May 7, 2024 at 1:22 PM David Rowley  wrote:

It would be good to get some build farm coverage of this so we don't
have to rely on manual testing.  I wonder if it's a good idea to just
define REALLOCATE_BITMAPSETS when #ifdef CLOBBER_FREED_MEMORY... or if
we should ask on the buildfarm-members list if someone wouldn't mind
defining it?


+1 to have build farm coverage of REALLOCATE_BITMAPSETS. This flag
seems quite useful.




I have added it to the CPPFLAGS on prion.


cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Control flow in logical replication walsender

2024-05-07 Thread Amit Kapila
On Tue, Apr 30, 2024 at 11:28 PM Christophe Pettus  wrote:
>
> I wanted to check my understanding of how control flows in a walsender doing 
> logical replication.  My understanding is that the (single) thread in each 
> walsender process, in the simplest case, loops on:
>
> 1. Pull a record out of the WAL.
> 2. Pass it to the recorder buffer code, which,
> 3. Sorts it out into the appropriate in-memory structure for that transaction 
> (spilling to disk as required), and then continues with #1, or,
> 4. If it's a commit record, it iteratively passes the transaction data one 
> change at a time to,
> 5. The logical decoding plugin, which returns the output format of that 
> plugin, and then,
> 6. The walsender sends the output from the plugin to the client. It cycles on 
> passing the data to the plugin and sending it to the client until it runs out 
> of changes in that transaction, and then resumes reading the WAL in #1.
>
> In particular, I wanted to confirm that while it is pulling the reordered 
> transaction and sending it to the plugin (and thence to the client), that 
> particular walsender is *not* reading new WAL records or putting them in the 
> reorder buffer.
>
> The specific issue I'm trying to track down is an enormous pileup of spill 
> files.  This is in a non-supported version of PostgreSQL (v11), so an upgrade 
> may fix it, but at the moment, I'm trying to find a cause and a mitigation.
>

In PG 14, we added a feature to logical replication that streams long
in-progress transactions, which should reduce spilling to a good
extent.  You might want to try that.

-- 
With Regards,
Amit Kapila.




Re: Control flow in logical replication walsender

2024-05-07 Thread Amit Kapila
On Tue, May 7, 2024 at 9:51 AM Ashutosh Bapat
 wrote:
>
> On Tue, May 7, 2024 at 12:00 AM Christophe Pettus  wrote:
>>
>> Thank you for the reply!
>>
>> > On May 1, 2024, at 02:18, Ashutosh Bapat  
>> > wrote:
>> > Is there a large transaction which is failing to be replicated repeatedly 
>> > - timeouts, crashes on upstream or downstream?
>>
>> AFAIK, no, although I am doing this somewhat by remote control (I don't have 
>> direct access to the failing system).  This did bring up one other question, 
>> though:
>>
>> Are subtransactions written to their own individual reorder buffers (and 
>> thus potentially spill files), or are they appended to the topmost 
>> transaction's reorder buffer?
>
>
> IIRC, they have their own RB,
>

Right.

> but once they commit, they are transferred to topmost transaction's RB.

I don't think they are transferred to the topmost transaction's RB.  We
perform a k-way merge between transactions/subtransactions to retrieve
the changes.  See the comment "Support for efficiently iterating over a
transaction's and its subtransactions' changes..." in reorderbuffer.c.

-- 
With Regards,
Amit Kapila.




Re: bug: copy progress reporting of backends which run multiple COPYs

2024-05-07 Thread Justin Pryzby
On Sat, Jan 21, 2023 at 02:45:40AM +0100, Matthias van de Meent wrote:
> > Would you do anything different in the master branch, with no
> > compatibility constraints ?  I think the progress reporting would still
> > be limited to one row per backend, not one per CopyFrom().
> 
> I think I would at least introduce another parameter to BeginCopyFrom
> for progress reporting (instead of relying on pstate != NULL), like
> how we have a bit in reindex_index's params->options that specifies
> whether we want progress reporting (which is unset for parallel
> workers iirc).

This didn't get fixed for v16, and it seems unlikely that it'll be
addressed in back branches.

But while I was reviewing forgotten threads, it occurred to me to raise
the issue in time to fix it for v17.

-- 
Justin




pg_restore -N loses extension comment

2024-05-07 Thread Justin Pryzby
pg_dump -Fc |pg_restore -l -N schema:

| 2; 3079 18187 EXTENSION - pg_buffercache 

Without -N schema also shows:

| 2562; 0 0 COMMENT - EXTENSION pg_buffercache 

I mean literal s-c-h-e-m-a, but I suppose anything else will work the
same.

BTW, I noticed that pg_restore -v shows that duplicate dependencies can be
stored.  We see things like this (and worse).

| 4284; 1259 191439414 VIEW public wmg_server_view telsasoft
| ;  depends on: 612 612 612 612 612 612 612 612 612 612 612 612 612 612 
612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 
612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 612 
612 612 23087 612

I see that's possible not only for views, but also for tables.
That's probably wasteful of CPU, at least.

-- 
Justin




Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-05-07 Thread Michail Nikolaev
Hello, Matthias and others!

Updated WIP in attach.

Changes are:
* Renaming; it feels better to me now
* A more reliable approach in `GlobalVisHorizonKindForRel` to make sure we
have not missed `rd_safeindexconcurrentlybuilding`, by calling
`RelationGetIndexList` if required
* An optimization to avoid any additional `RelationGetIndexList` calls if no
indexes are being built concurrently
* TOAST moved to TODO, since it looks to be out of scope - but I am not sure
yet and need to dive deeper

TODO:
* TOAST
* docs and comments
* make sure non-data tables are not affected
* Per-database scope of optimization
* Handle index building errors correctly in optimization code
* More tests: create index, multiple re-indexes, multiple tables

Thanks,
Michail.
From 63677046efc9b6a1d93f9248c6d9dce14a945a42 Mon Sep 17 00:00:00 2001
From: nkey 
Date: Tue, 7 May 2024 14:24:09 +0200
Subject: [PATCH v2] WIP: fix d9d076222f5b "VACUUM: ignore indexing operations 
 with CONCURRENTLY" which was reverted by e28bb8851969.

The issue was caused by the absence of any snapshot that actually protects the data in the relation, which is required to build the index correctly.

Introduce a new type of visibility horizon to be used for relations with concurrently built indexes (in the case of "safe" indexes).

Now `GlobalVisHorizonKindForRel` may dynamically decide which horizon to use, based on the data about safe indexes being built concurrently.

To reduce the performance impact, the counter of concurrently built indexes is kept in shared memory.
---
 src/backend/catalog/index.c  |  36 ++
 src/backend/commands/indexcmds.c |  20 +++
 src/backend/storage/ipc/ipci.c   |   2 +
 src/backend/storage/ipc/procarray.c  |  88 -
 src/backend/utils/cache/relcache.c   |  11 ++
 src/bin/pg_amcheck/t/006_concurrently.pl | 155 +++
 src/include/catalog/index.h  |   5 +
 src/include/utils/rel.h  |   1 +
 8 files changed, 311 insertions(+), 7 deletions(-)
 create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a8568c55c..3caa2bab12 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -97,6 +97,11 @@ typedef struct
 	Oid			pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
 } SerializedReindexState;
 
+typedef struct {
+	pg_atomic_uint32 numSafeConcurrentlyBuiltIndexes;
+} SafeICSharedState;
+static SafeICSharedState *SafeICStateShmem;
+
 /* non-export function prototypes */
 static bool relationHasPrimaryKey(Relation rel);
 static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
@@ -176,6 +181,37 @@ relationHasPrimaryKey(Relation rel)
 	return result;
 }
 
+
+void SafeICStateShmemInit(void)
+{
+	bool		found;
+
+	SafeICStateShmem = (SafeICSharedState *)
+			ShmemInitStruct("Safe Concurrently Build Indexes",
+			sizeof(SafeICSharedState),
+			&found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+		pg_atomic_init_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 0);
+	} else
+		Assert(found);
+}
+
+void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment)
+{
+	if (increment)
+		pg_atomic_fetch_add_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+	else
+		pg_atomic_fetch_sub_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+}
+
+bool IsAnySafeIndexBuildsConcurrently()
+{
+	return pg_atomic_read_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes) > 0;
+}
+
 /*
  * index_check_primary_key
  *		Apply special checks needed before creating a PRIMARY KEY index
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487..663450ba20 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1636,6 +1636,8 @@ DefineIndex(Oid tableId,
 	 * hold lock on the parent table.  This might need to change later.
 	 */
 	LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
+	if (safe_index && concurrent)
+		UpdateNumSafeConcurrentlyBuiltIndexes(true);
 
 	PopActiveSnapshot();
 	CommitTransactionCommand();
@@ -1804,7 +1806,15 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	/* Commit index as valid before reducing counter of safe concurrently build indexes */
+	CommitTransactionCommand();
 
+	Assert(concurrent);
+	if (safe_index)
+		UpdateNumSafeConcurrentlyBuiltIndexes(false);
+
+	/* Start a new transaction to finish process properly */
+	StartTransactionCommand();
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
 	 */
@@ -3902,6 +3912,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 indexRel->rd_indpred == NIL);
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
+		if (idx->safe)
+			UpdateNumSafeConcurrentlyBuiltIndexes(true);
 
 		/* This function shouldn't be called for tem

Re: CREATE DATABASE with filesystem cloning

2024-05-07 Thread Ranier Vilela
Hi,

Nazir Bilal Yavuz  wrote:

>Any kind of feedback would be appreciated.

I know it's coming from copy-and-paste, but
I believe the flags could be:
- dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | O_EXCL |
PG_BINARY);

The flags:
O_WRONLY | O_TRUNC

allow the OS to make some optimizations, where possible.

The O_RDWR flag indicates that the file can be read, which is not true in
this case: the destination file will only be written.

best regards,
Ranier Vilela


Re: WHERE CURRENT OF with RLS quals that are ctid conditions

2024-05-07 Thread Robert Haas
On Mon, May 6, 2024 at 7:31 PM Tom Lane  wrote:
> Robert pointed out [1] that the planner fails if we have $SUBJECT,
> because tidpath.c can seize on the RLS-derived ctid constraint
> instead of the CurrentOfExpr.  Since the executor can only handle
> CurrentOfExpr in a TidScan's tidquals, that leads to a confusing
> runtime error.
>
> Here's a patch for that.
>
> However ... along the way to testing it, I found that you can only
> get such an RLS qual to work if it accepts "(InvalidBlockNumber,0)",
> because that's what the ctid field will look like in a
> not-yet-stored-to-disk tuple.  That's sufficiently weird, and so
> unduly in bed with undocumented implementation details, that I can't
> imagine anyone is actually using such an RLS condition or ever will.
> So maybe this is not really worth fixing.  Thoughts?

Hmm, I thought the RLS condition needed to accept the old and new
TIDs, but not (InvalidBlockNumber,0). I might well have misunderstood,
though.

As to whether this is worth fixing, I think it is, but it might not be
worth back-patching the fix. Also, I'd really like to get disable_cost
out of the picture here. That would require more code reorganization
than you've done here, but I think it would be worthwhile. I suppose
that could also be done as a separate patch, but I wonder if that
doesn't just amount to changing approximately the same code twice.

Or maybe it doesn't, and this is worth doing on its own. I'm not sure;
I haven't coded what I have in mind yet.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_restore -N loses extension comment

2024-05-07 Thread Tom Lane
Justin Pryzby  writes:
> pg_dump -Fc |pg_restore -l -N schema:
> | 2; 3079 18187 EXTENSION - pg_buffercache 
> Without -N schema also shows:
> | 2562; 0 0 COMMENT - EXTENSION pg_buffercache 

Hmm, but what happens if you actually do the restore?

I think this may be a bug in -l mode: ProcessArchiveRestoreOptions
saves the result of _tocEntryRequired in te->reqs, but PrintTOCSummary
doesn't, and that will bollix its subsequent _tocEntryRequired checks
for "dependent" TOC entries.

regards, tom lane




Re: WHERE CURRENT OF with RLS quals that are ctid conditions

2024-05-07 Thread Robert Haas
On Tue, May 7, 2024 at 9:47 AM Robert Haas  wrote:
> As to whether this is worth fixing, I think it is, but it might not be
> worth back-patching the fix. Also, I'd really like to get disable_cost
> out of the picture here. That would require more code reorganization
> than you've done here, but I think it would be worthwhile. I suppose
> that could also be done as a separate patch, but I wonder if that
> doesn't just amount to changing approximately the same code twice.
>
> Or maybe it doesn't, and this is worth doing on its own. I'm not sure;
> I haven't coded what I have in mind yet.

Never mind all this. I think what I have in mind requires doing what
you did first. So if you're happy with what you've got, I'd go for it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WHERE CURRENT OF with RLS quals that are ctid conditions

2024-05-07 Thread Tom Lane
Robert Haas  writes:
> On Mon, May 6, 2024 at 7:31 PM Tom Lane  wrote:
>> So maybe this is not really worth fixing.  Thoughts?

> Hmm, I thought the RLS condition needed to accept the old and new
> TIDs, but not (InvalidBlockNumber,0). I might well have misunderstood,
> though.

If you leave the (InvalidBlockNumber,0) alternative out of the RLS
condition, my patch's test case fails because the row "doesn't
satisfy the RLS condition" (I forget the exact error message, but
it was more or less that).

> As to whether this is worth fixing, I think it is, but it might not be
> worth back-patching the fix. Also, I'd really like to get disable_cost
> out of the picture here. That would require more code reorganization
> than you've done here, but I think it would be worthwhile. I suppose
> that could also be done as a separate patch, but I wonder if that
> doesn't just amount to changing approximately the same code twice.

No, because the disable_cost stuff is nowhere near here.  In any case,
what we were talking about was suppressing creation of competing
non-TIDScan paths.  It's still going to be incumbent on tidpath.c to
create a correct path, and as things stand it won't for this case.

regards, tom lane




Re: WHERE CURRENT OF with RLS quals that are ctid conditions

2024-05-07 Thread Tom Lane
Robert Haas  writes:
> Never mind all this. I think what I have in mind requires doing what
> you did first. So if you're happy with what you've got, I'd go for it.

OK.  HEAD-only sounds like a good compromise.  Barring objections,
I'll do that later today.

regards, tom lane




Re: New GUC autovacuum_max_threshold ?

2024-05-07 Thread Robert Haas
On Wed, May 1, 2024 at 10:03 PM David Rowley  wrote:
> Here are some of the problems that I know about:
>
> 1. Autovacuum has exactly zero forward vision and operates reactively
> rather than proactively.  This "blind operating" causes tables to
> either not need vacuumed or suddenly need vacuumed without any
> consideration of how busy autovacuum is at that current moment.
> 2. There is no prioritisation for the order in which tables are autovacuumed.
> 3. With the default scale factor, the larger a table becomes, the more
> infrequent the autovacuums.
> 4. Autovacuum is more likely to trigger when the system is busy
> because more transaction IDs are being consumed and there is more DML
> occurring. This results in autovacuum having less work to do during
> quiet periods when there are more free resources to be doing the
> vacuum work.

I agree with all of these points. For a while now, I've been thinking
that we really needed a prioritization scheme, so that we don't waste
our time on low-priority tasks when there are high-priority tasks that
need to be completed. But lately I've started to think that what
matters most is the rate at which autovacuum work is happening
overall. I feel like prioritization is mostly going to matter when
we're not keeping up, and I think the primary goal should be to keep
up. I think we could use the same data to make both decisions -- if
autovacuum were proactive rather than reactive, that would mean that
we know something about what is going to happen in the future, and I
think that data could be used both to decide whether we're keeping up,
and also to prioritize. But if I had to pick a first target, I'd
forget about trying to make things happen in the right order and just
try to make sure we get all the things done.

> I think we need at least 1a) before we can give autovacuum more work
> to do, especially if we do something like multiply its workload by
> 1024x, per your comment above.

I guess I view it differently. It seems to me that right now, we're
not vacuuming large tables often enough. We should fix that,
independently of anything else. If the result is that small and medium
sized tables get vacuumed less often, then that just means there were
never enough resources to go around in the first place. We haven't
taken a system that was working fine and broken it: we've just moved
the problem from one category of tables (the big ones) to a different
category of tables. If the user wants to solve that problem, they need
to bump up the cost limit or add hardware. I don't see that we have
any particular reason to believe such users will be worse off on
average than they are today. On the other hand, users who do have a
sufficiently high cost limit and enough hardware will be better off,
because we'll start doing all the vacuuming work that needs to be done
instead of only some of it.

Now, if we start vacuuming any class of table whatsoever 1024x as
often as we do today, we are going to lose. But that would still be
true even if we did everything on your list. Large tables need to be
vacuumed more frequently than we now do, but not THAT much more
frequently. Any system that produces that result is just using a wrong
algorithm, or wrong constants, or something. Even if all the necessary
resources are available, nobody is going to thank us for vacuuming
gigantic tables in a tight loop. The problem with such a large
increase is not that we don't have prioritization, but that such a
large increase is fundamentally the wrong thing to do. On the other
hand, I think a more modest increase is the right thing to do, and I
think it's the right thing to do whether we have prioritization or
not.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_restore -N loses extension comment

2024-05-07 Thread Tom Lane
I wrote:
> I think this may be a bug in -l mode: ProcessArchiveRestoreOptions
> saves the result of _tocEntryRequired in te->reqs, but PrintTOCSummary
> doesn't, and that will bollix its subsequent _tocEntryRequired checks
> for "dependent" TOC entries.

Yeah ... the attached seems to fix it.

regards, tom lane

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index c6c101c118..56e0688154 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1319,10 +1319,13 @@ PrintTOCSummary(Archive *AHX)
 	curSection = SECTION_PRE_DATA;
 	for (te = AH->toc->next; te != AH->toc; te = te->next)
 	{
+		/* This bit must match ProcessArchiveRestoreOptions' marking logic */
 		if (te->section != SECTION_NONE)
 			curSection = te->section;
+		te->reqs = _tocEntryRequired(te, curSection, AH);
+		/* Now, should we print it? */
 		if (ropt->verbose ||
-			(_tocEntryRequired(te, curSection, AH) & (REQ_SCHEMA | REQ_DATA)) != 0)
+			(te->reqs & (REQ_SCHEMA | REQ_DATA)) != 0)
 		{
 			char	   *sanitized_name;
 			char	   *sanitized_schema;


PERIOD foreign key feature

2024-05-07 Thread Bruce Momjian
In this commit:

commit 34768ee3616
Author: Peter Eisentraut 
Date:   Sun Mar 24 07:37:13 2024 +0100

Add temporal FOREIGN KEY contraints

Add PERIOD clause to foreign key constraint definitions.  This is
supported for range and multirange types.  Temporal foreign keys
check for range containment instead of equality.

This feature matches the behavior of the SQL standard temporal
foreign keys, but it works on PostgreSQL's native ranges instead of
SQL's "periods", which don't exist in PostgreSQL (yet).

Reference actions ON {UPDATE,DELETE} {CASCADE,SET NULL,SET DEFAULT}
are not supported yet.

Author: Paul A. Jungwirth 
Reviewed-by: Peter Eisentraut 
Reviewed-by: jian he 
Discussion: 
https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mdhcy4_qq0+noc...@mail.gmail.com

this text was added to create_table.sgml:

In addition, the referenced table must have a primary
key or unique constraint declared with WITHOUT
--> OVERLAPS.  Finally, if one side of the foreign key
--> uses PERIOD, the other side must too.  If the
refcolumn list is
omitted, the WITHOUT OVERLAPS part of the
primary key is treated as if marked with PERIOD.

In the two marked lines, it says "if one side of the foreign key uses
PERIOD, the other side must too."  However, looking at the example
queries, it seems like if the foreign side has PERIOD, the primary side
must have WITHOUT OVERLAPS, not PERIOD.

Does this doc text need correcting?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: PERIOD foreign key feature

2024-05-07 Thread David G. Johnston
On Tue, May 7, 2024 at 7:54 AM Bruce Momjian  wrote:

> In this commit:
>
> commit 34768ee3616
> Author: Peter Eisentraut 
> Date:   Sun Mar 24 07:37:13 2024 +0100
>
> Add temporal FOREIGN KEY contraints
>
> Add PERIOD clause to foreign key constraint definitions.  This is
> supported for range and multirange types.  Temporal foreign keys
> check for range containment instead of equality.
>
> This feature matches the behavior of the SQL standard temporal
> foreign keys, but it works on PostgreSQL's native ranges instead of
> SQL's "periods", which don't exist in PostgreSQL (yet).
>
> Reference actions ON {UPDATE,DELETE} {CASCADE,SET NULL,SET DEFAULT}
> are not supported yet.
>
> Author: Paul A. Jungwirth 
> Reviewed-by: Peter Eisentraut 
> Reviewed-by: jian he 
> Discussion:
> https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mdhcy4_qq0+noc...@mail.gmail.com
>
> this text was added to create_table.sgml:
>
> In addition, the referenced table must have a primary
> key or unique constraint declared with WITHOUT
> --> OVERLAPS.  Finally, if one side of the foreign key
> --> uses PERIOD, the other side must too.  If the
> refcolumn list is
> omitted, the WITHOUT OVERLAPS part of the
> primary key is treated as if marked with PERIOD.
>
> In the two marked lines, it says "if one side of the foreign key uses
> PERIOD, the other side must too."  However, looking at the example
> queries, it seems like if the foreign side has PERIOD, the primary side
> must have WITHOUT OVERLAPS, not PERIOD.
>
> Does this doc text need correcting?
>
>
The text is factually correct, though a bit hard to parse.

"the other side" refers to the part after "REFERENCES":

FOREIGN KEY ( column_name [, ... ] [, PERIOD column_name ] ) REFERENCES
reftable [ ( refcolumn [, ... ] [, PERIOD column_name ] ) ]

***(shouldn't the second occurrence be [, PERIOD refcolum] ?)

The text is pointing out that since the refcolumn specification is optional
you may very well not see a second PERIOD keyword in the clause.  Instead
it will be inferred from the PK.

Maybe:

Finally, if the foreign key has a PERIOD column_name specification the
corresponding refcolumn, if present, must also be marked PERIOD.  If the
refcolumn clause is omitted, and thus the reftable's primary key constraint
chosen, the primary key must have its final column marked WITHOUT OVERLAPS.

David J.


Re: allow changing autovacuum_max_workers without restarting

2024-05-07 Thread Nathan Bossart
On Fri, May 03, 2024 at 12:57:18PM +, Imseih (AWS), Sami wrote:
>> That's true, but using a hard-coded limit means we no longer need to add a
>> new GUC. Always allocating, say, 256 slots might require a few additional
>> kilobytes of shared memory, most of which will go unused, but that seems
>> unlikely to be a problem for the systems that will run Postgres v18.
> 
> I agree with this.

Here's what this might look like.  I chose an upper limit of 1024, which
seems like it "ought to be enough for anybody," at least for now.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 72e0496294ef0390c77cef8031ae51c1a44ebde8 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 7 May 2024 10:59:24 -0500
Subject: [PATCH v3 1/1] allow changing autovacuum_max_workers without
 restarting

---
 doc/src/sgml/config.sgml  |  3 +-
 doc/src/sgml/runtime.sgml | 15 ---
 src/backend/access/transam/xlog.c |  2 +-
 src/backend/postmaster/autovacuum.c   | 44 ---
 src/backend/postmaster/postmaster.c   |  2 +-
 src/backend/storage/lmgr/proc.c   |  9 ++--
 src/backend/utils/init/postinit.c | 20 ++---
 src/backend/utils/misc/guc_tables.c   |  7 ++-
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/postmaster/autovacuum.h   |  8 
 src/include/utils/guc_hooks.h |  2 -
 11 files changed, 58 insertions(+), 55 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e93208b2e6..8e2a1d6902 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8528,7 +8528,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;

 Specifies the maximum number of autovacuum processes (other than the
 autovacuum launcher) that may be running at any one time.  The default
-is three.  This parameter can only be set at server start.
+is three.  This parameter can only be set in the
+postgresql.conf file or on the server command line.

   
  
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 6047b8171d..8a672a8383 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -781,13 +781,13 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such

 SEMMNI
 Maximum number of semaphore identifiers (i.e., sets)
-at least ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 5) / 16) plus room for other applications
+at least ceil((max_connections + max_wal_senders + max_worker_processes + 1029) / 16) plus room for other applications

 

 SEMMNS
 Maximum number of semaphores system-wide
-ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 5) / 16) * 17 plus room for other applications
+ceil((max_connections + max_wal_senders + max_worker_processes + 1029) / 16) * 17 plus room for other applications

 

@@ -838,7 +838,7 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such
 When using System V semaphores,
 PostgreSQL uses one semaphore per allowed connection
 (), allowed autovacuum worker process
-() and allowed background
+(1024) and allowed background
 process (), in sets of 16.
 Each such set will
 also contain a 17th semaphore which contains a magic
@@ -846,13 +846,14 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such
 other applications. The maximum number of semaphores in the system
 is set by SEMMNS, which consequently must be at least
 as high as max_connections plus
-autovacuum_max_workers plus max_wal_senders,
-plus max_worker_processes, plus one extra for each 16
+max_wal_senders,
+plus max_worker_processes, plus 1024 for autovacuum
+worker processes, plus one extra for each 16
 allowed connections plus workers (see the formula in ).  The parameter SEMMNI
 determines the limit on the number of semaphore sets that can
 exist on the system at one time.  Hence this parameter must be at
-least ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 5) / 16).
+least ceil((max_connections + max_wal_senders + max_worker_processes + 1029) / 16).
 Lowering the number
 of allowed connections is a temporary workaround for failures,
 which are usually confusingly worded No space
@@ -883,7 +884,7 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such
 When using POSIX semaphores, the number of semaphores needed is the
 same as for System V, that is one semaphore per allowed connection
 (), allowed autovacuum worker process
-() and allowed background
+(1024) and allowed background
 pro

Re: PERIOD foreign key feature

2024-05-07 Thread Paul Jungwirth

On 5/7/24 08:23, David G. Johnston wrote:

On Tue, May 7, 2024 at 7:54 AM Bruce Momjian  wrote:
In the two marked lines, it says "if one side of the foreign key uses
PERIOD, the other side must too."  However, looking at the example
queries, it seems like if the foreign side has PERIOD, the primary side
must have WITHOUT OVERLAPS, not PERIOD.

Does this doc text need correcting?


The text is factually correct, though a bit hard to parse.

"the other side" refers to the part after "REFERENCES":

FOREIGN KEY ( column_name [, ... ] [, PERIOD column_name ] ) REFERENCES reftable [ ( refcolumn [, ... ] [, PERIOD column_name ] ) ]


***(shouldn't the second occurrence be [, PERIOD refcolum] ?)

The text is pointing out that since the refcolumn specification is optional you may very well not 
see a second PERIOD keyword in the clause.  Instead it will be inferred from the PK.


Maybe:

Finally, if the foreign key has a PERIOD column_name specification the corresponding refcolumn, if 
present, must also be marked PERIOD.  If the refcolumn clause is omitted, and thus the reftable's 
primary key constraint chosen, the primary key must have its final column marked WITHOUT OVERLAPS.


Yes, David is correct here on all points. I like his suggestion to clarify the language here also. 
If you need a patch from me let me know, but I assume it's something a committer can just make happen?


Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com




Re: pg_sequence_last_value() for unlogged sequences on standbys

2024-05-07 Thread Nathan Bossart
On Sat, May 04, 2024 at 06:45:32PM +0900, Michael Paquier wrote:
> On Fri, May 03, 2024 at 05:22:06PM -0400, Tom Lane wrote:
>> Nathan Bossart  writes:
>>> IIUC this would cause other sessions' temporary sequences to appear in the
>>> view.  Is that desirable?
>> 
>> I assume Michael meant to move the test into the C code, not drop
>> it entirely --- I agree we don't want that.
> 
> Yup.  I meant to remove it from the script and keep only something in
> the C code to avoid the duplication, but you're right that the temp
> sequences would create more noise than now.
> 
>> Moving it has some attraction, but pg_is_other_temp_schema() is also
>> used in a lot of information_schema views, so we couldn't get rid of
>> it without a lot of further hacking.  Not sure we want to relocate
>> that filter responsibility in just one view.
> 
> Okay.

Okay, so are we okay to back-patch something like v1?  Or should we also
return NULL for other sessions' temporary schemas on primaries?  That would
change the condition to something like

char relpersist = seqrel->rd_rel->relpersistence;

if (relpersist == RELPERSISTENCE_PERMANENT ||
    (relpersist == RELPERSISTENCE_UNLOGGED && !RecoveryInProgress()) ||
    !RELATION_IS_OTHER_TEMP(seqrel))
{
...
}

I personally think that would be fine to back-patch since pg_sequences
already filters it out anyway.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_sequence_last_value() for unlogged sequences on standbys

2024-05-07 Thread Tom Lane
Nathan Bossart  writes:
> Okay, so are we okay to back-patch something like v1?  Or should we also
> return NULL for other sessions' temporary schemas on primaries?  That would
> change the condition to something like

>   char relpersist = seqrel->rd_rel->relpersistence;

>   if (relpersist == RELPERSISTENCE_PERMANENT ||
>       (relpersist == RELPERSISTENCE_UNLOGGED && !RecoveryInProgress()) ||
>       !RELATION_IS_OTHER_TEMP(seqrel))
>   {
>   ...
>   }

Should be AND'ing not OR'ing the !TEMP condition, no?  Also I liked
your other formulation of the persistence check better.

> I personally think that would be fine to back-patch since pg_sequences
> already filters it out anyway.

+1 to include that, as it offers a defense if someone invokes this
function directly.  In HEAD we could then rip out the test in the
view.

BTW, I think you also need something like

-   int64   result;
+   int64   result = 0;

Your compiler may not complain about result being possibly
uninitialized, but IME others will.

regards, tom lane




Re: Use pgstat_kind_infos to read fixed shared stats structs

2024-05-07 Thread Tristan Partin

On Tue May 7, 2024 at 1:01 AM CDT, Michael Paquier wrote:

On Tue, May 07, 2024 at 12:44:51AM -0500, Tristan Partin wrote:
> Thanks for the feedback Michael. Should I just throw the patch in the next
> commitfest so it doesn't get left behind?

Better to do so, yes.  I have noted this thread in my TODO list, but
we're a couple of weeks away from the next CF and things tend to get
easily forgotten.


Added here: https://commitfest.postgresql.org/48/4978/

--
Tristan Partin
Neon (https://neon.tech)




Re: Control flow in logical replication walsender

2024-05-07 Thread Christophe Pettus



> On May 7, 2024, at 05:02, Amit Kapila  wrote:
> 
> 
> In PG-14, we have added a feature in logical replication to stream
> long in-progress transactions which should reduce spilling to a good
> extent. You might want to try that.

That's been my principal recommendation (since that would also allow 
controlling the amount of logical replication working memory).  Thank you!



Re: 2024-05-09 release announcement draft

2024-05-07 Thread Noah Misch
On Mon, May 06, 2024 at 11:09:24PM -0400, Jonathan S. Katz wrote:
> * Avoid leaking a query result from 
> [`psql`](https://www.postgresql.org/docs/current/app-psql.html)
> after the query is cancelled.

I'd delete this item about a psql-lifespan memory leak, because (a) it's so
minor and (b) there are other reasonable readings of "leak" that would make
the change look more important.




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread Tom Lane
Andrew Dunstan  writes:
> On 2024-05-07 Tu 06:05, Richard Guo wrote:
>> +1 to have build farm coverage of REALLOCATE_BITMAPSETS. This flag
>> seems quite useful.

> I have added it to the CPPFLAGS on prion.

... and as expected, prion fell over.

I find that Richard's proposed fix makes the core regression tests
pass, but we still fail check-world.  So I'm afraid we need something
more aggressive, like the attached which makes make_restrictinfo
copy all its input bitmapsets.  Without that, we still have sharing
of bitmapsets across different RestrictInfos, which seems pretty
scary given what we now see about the effects of 00b41463c.  This
seems annoyingly expensive, but maybe there's little choice?

Given this, we could remove ad-hoc bms_copy calls from the callers
of make_restrictinfo, distribute_quals_to_rels, etc.  I didn't go
looking for possible wins of that sort; there's unlikely to be a
lot of them.

regards, tom lane

diff --git a/src/backend/optimizer/util/restrictinfo.c b/src/backend/optimizer/util/restrictinfo.c
index 0b406e9334..e81861bc8b 100644
--- a/src/backend/optimizer/util/restrictinfo.c
+++ b/src/backend/optimizer/util/restrictinfo.c
@@ -132,8 +132,8 @@ make_restrictinfo_internal(PlannerInfo *root,
 	restrictinfo->is_clone = is_clone;
 	restrictinfo->can_join = false; /* may get set below */
 	restrictinfo->security_level = security_level;
-	restrictinfo->incompatible_relids = incompatible_relids;
-	restrictinfo->outer_relids = outer_relids;
+	restrictinfo->incompatible_relids = bms_copy(incompatible_relids);
+	restrictinfo->outer_relids = bms_copy(outer_relids);
 
 	/*
 	 * If it's potentially delayable by lower-level security quals, figure out
@@ -191,7 +191,7 @@ make_restrictinfo_internal(PlannerInfo *root,
 
 	/* required_relids defaults to clause_relids */
 	if (required_relids != NULL)
-		restrictinfo->required_relids = required_relids;
+		restrictinfo->required_relids = bms_copy(required_relids);
 	else
 		restrictinfo->required_relids = restrictinfo->clause_relids;
 


Re: Use pgstat_kind_infos to read fixed shared stats structs

2024-05-07 Thread Andres Freund
Hi,

On 2024-05-06 14:07:53 -0500, Tristan Partin wrote:
> Instead of needing to be explicit, we can just iterate the
> pgstat_kind_infos array to find the memory locations to read into.

> This was originally thought of by Andres in
> 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.
> 
> Not a fix, per se, but it does remove a comment. Perhaps the discussion will
> just lead to someone deleting the comment, and keeping the code as is.
> Either way, a win :).

I think it's a good idea. I'd really like to allow extensions to register new
types of stats eventually. Stuff like pg_stat_statements having its own,
fairly ... mediocre, stats storage shouldn't be necessary.

Do we need to increase the stats version, I didn't check if the order we
currently store things in and the numerical order of the stats IDs are the
same.

Greetings,

Andres Freund




Re: pg_sequence_last_value() for unlogged sequences on standbys

2024-05-07 Thread Nathan Bossart
On Tue, May 07, 2024 at 01:44:16PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>>  char relpersist = seqrel->rd_rel->relpersistence;
> 
>>  if (relpersist == RELPERSISTENCE_PERMANENT ||
>>      (relpersist == RELPERSISTENCE_UNLOGGED && !RecoveryInProgress()) ||
>>      !RELATION_IS_OTHER_TEMP(seqrel))
>>  {
>>  ...
>>  }
> 
> Should be AND'ing not OR'ing the !TEMP condition, no?  Also I liked
> your other formulation of the persistence check better.

Yes, that's a silly mistake on my part.  I changed it to

if ((RelationIsPermanent(seqrel) || !RecoveryInProgress()) &&
    !RELATION_IS_OTHER_TEMP(seqrel))
{
...
}

in the attached v2.

>> I personally think that would be fine to back-patch since pg_sequences
>> already filters it out anyway.
> 
> +1 to include that, as it offers a defense if someone invokes this
> function directly.  In HEAD we could then rip out the test in the
> view.

I apologize for belaboring this point, but I don't see how we would be
comfortable removing that check unless we are okay with other sessions'
temporary sequences appearing in the view, albeit with a NULL last_value.
This check lives in the WHERE clause today, so if we remove it, we'd no
longer exclude those sequences.  Michael and you seem united on this, so I
have a sinking feeling that I'm missing something terribly obvious.

> BTW, I think you also need something like
> 
> - int64   result;
> + int64   result = 0;
> 
> Your compiler may not complain about result being possibly
> uninitialized, but IME others will.

Good call.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 974f56896add92983b664c11fd25010ef29ac42c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 30 Apr 2024 20:54:51 -0500
Subject: [PATCH v2 1/1] Fix pg_sequence_last_value() for non-permanent
 sequences on standbys.

---
 src/backend/commands/sequence.c   | 22 --
 src/test/recovery/t/001_stream_rep.pl |  8 
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 46103561c3..9d7468d7bb 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -1780,8 +1780,8 @@ pg_sequence_last_value(PG_FUNCTION_ARGS)
 	Buffer		buf;
 	HeapTupleData seqtuple;
 	Form_pg_sequence_data seq;
-	bool		is_called;
-	int64		result;
+	bool		is_called = false;
+	int64		result = 0;
 
 	/* open and lock sequence */
 	init_sequence(relid, &elm, &seqrel);
@@ -1792,12 +1792,22 @@ pg_sequence_last_value(PG_FUNCTION_ARGS)
  errmsg("permission denied for sequence %s",
 				RelationGetRelationName(seqrel))));
 
-	seq = read_seq_tuple(seqrel, &buf, &seqtuple);
+	/*
+	 * For the benefit of the pg_sequences system view, we return NULL for
+	 * temporary and unlogged sequences on standbys instead of throwing an
+	 * error.  We also always return NULL for other sessions' temporary
+	 * sequences.
+	 */
+	if ((RelationIsPermanent(seqrel) || !RecoveryInProgress()) &&
+		!RELATION_IS_OTHER_TEMP(seqrel))
+	{
+		seq = read_seq_tuple(seqrel, &buf, &seqtuple);
 
-	is_called = seq->is_called;
-	result = seq->last_value;
+		is_called = seq->is_called;
+		result = seq->last_value;
 
-	UnlockReleaseBuffer(buf);
+		UnlockReleaseBuffer(buf);
+	}
 	sequence_close(seqrel, NoLock);
 
 	if (is_called)
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 5311ade509..4c698b5ce1 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -95,6 +95,14 @@ $result = $node_standby_2->safe_psql('postgres', "SELECT * FROM seq1");
 print "standby 2: $result\n";
 is($result, qq(33|0|t), 'check streamed sequence content on standby 2');
 
+# Check pg_sequence_last_value() returns NULL for unlogged sequence on standby
+$node_primary->safe_psql('postgres',
+	"CREATE UNLOGGED SEQUENCE ulseq; SELECT nextval('ulseq')");
+$node_primary->wait_for_replay_catchup($node_standby_1);
+is($node_standby_1->safe_psql('postgres',
+	"SELECT pg_sequence_last_value('ulseq'::regclass) IS NULL"),
+	't', 'pg_sequence_last_value() on unlogged sequence on standby 1');
+
 # Check that only READ-only queries can run on standbys
 is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 1');
-- 
2.25.1



Re: Weird test mixup

2024-05-07 Thread Noah Misch
On Tue, May 07, 2024 at 10:17:49AM +0900, Michael Paquier wrote:
> On Mon, May 06, 2024 at 02:23:24PM -0700, Noah Misch wrote:
> > Here's how I've patched it locally.  It does avoid changing the 
> > backend-side,
> > which has some attraction.  Shall I just push this?
> 
> It looks like you did not rebase on top of HEAD

Yes, the base was 713cfaf (Sunday).

> A side effect is that this causes the conditions to pile
> up on a running server when running installcheck, and assuming that
> many test suites are run on a server left running this could cause
> spurious failures when failing to find a new slot.

Yes, we'd be raising INJ_MAX_CONDITION more often under this approach.

> Always resetting
> condition->name when detaching a point is a simpler flow and saner
> IMO.
> 
> Overall, this switches from one detach behavior to a different one,

Can you say more about that?  The only behavior change known to me is that a
given injection point workload uses more of INJ_MAX_CONDITION.  If there's
another behavior change, it was likely unintended.

> which may or may not be intuitive depending on what one is looking
> for.  FWIW, I see InjectionPointCondition as something that should be
> around as long as its injection point exists, with the condition
> entirely gone once the point is detached because it should not exist
> anymore on the server running, with no information left in shmem.
> 
> Through your patch, you make conditions have a different meaning, with
> a mix of "local" definition, but it is kind of permanent as it keeps a
> trace of the point's name in shmem.  I find the behavior of the patch
> less intuitive.  Perhaps it would be interesting to see first the bug
> and/or problem you are trying to tackle with this different behavior
> as I feel like we could do something even with the module as-is.  As
> far as I understand, the implementation of the module on HEAD allows
> one to emulate a breakpoint with a wait/wake, which can avoid the
> window mentioned in step 2.  Even if a wait point is detached
> concurrently, it can be awaken with its traces in shmem removed.

The problem I'm trying to tackle in this thread is to make
src/test/modules/gin installcheck-safe.  $SUBJECT's commit 5105c90 started
that work, having seen the intarray test suite break when run concurrently
with the injection_points test suite.  That combination still does break at
the exit-time race condition.  To reproduce, apply this attachment to add
sleeps, and run:

make -C src/test/modules/gin installcheck USE_MODULE_DB=1 & sleep 2; make -C 
contrib/intarray installcheck USE_MODULE_DB=1

Separately, I see injection_points_attach() populates InjectionPointCondition
after InjectionPointAttach().  Shouldn't InjectionPointAttach() come last, to
avoid the same sort of race?  I've not tried to reproduce that one.
diff --git a/src/test/modules/gin/expected/gin_incomplete_splits.out b/src/test/modules/gin/expected/gin_incomplete_splits.out
index 15574e5..4bc5ef1 100644
--- a/src/test/modules/gin/expected/gin_incomplete_splits.out
+++ b/src/test/modules/gin/expected/gin_incomplete_splits.out
@@ -126,6 +126,12 @@ SELECT injection_points_attach('gin-leave-leaf-split-incomplete', 'error');
 select insert_until_fail(:next_i) as next_i
 \gset
NOTICE:  failed with: error triggered for injection point gin-leave-leaf-split-incomplete
+SELECT pg_sleep(10);
+ pg_sleep 
+--
+ 
+(1 row)
+
 SELECT injection_points_detach('gin-leave-leaf-split-incomplete');
  injection_points_detach 
 -
diff --git a/src/test/modules/gin/sql/gin_incomplete_splits.sql b/src/test/modules/gin/sql/gin_incomplete_splits.sql
index ebf0f62..fb66bac 100644
--- a/src/test/modules/gin/sql/gin_incomplete_splits.sql
+++ b/src/test/modules/gin/sql/gin_incomplete_splits.sql
@@ -118,6 +118,7 @@ select verify(:next_i);
 SELECT injection_points_attach('gin-leave-leaf-split-incomplete', 'error');
 select insert_until_fail(:next_i) as next_i
 \gset
+SELECT pg_sleep(10);
 SELECT injection_points_detach('gin-leave-leaf-split-incomplete');
 
 -- Verify that a scan works even though there's an incomplete split
diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c
index a74a4a2..f9e8a1f 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -18,6 +18,7 @@
 #include "postgres.h"
 
 #include "fmgr.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "storage/condition_variable.h"
 #include "storage/dsm_registry.h"
@@ -213,6 +214,14 @@ injection_points_cleanup(int code, Datum arg)
 void
 injection_error(const char *name)
 {
+   if (strcmp(name, "gin-leave-leaf-split-incomplete") == 0 &&
+   !injection_point_allowed(name))
+   {
+   sigprocmask(SIG_SETMASK, &BlockSig, NULL);
+   pg_usleep(20 * USECS_PER_SEC); /* let other suite detach */
+   sigprocmask

Re: backend stuck in DataFileExtend

2024-05-07 Thread Justin Pryzby
On Tue, May 07, 2024 at 10:55:28AM +1200, Thomas Munro wrote:
> On Tue, May 7, 2024 at 6:21 AM Justin Pryzby  wrote:
> > FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org.
> ...
> > Yes, they're running centos7 with the indicated kernels.
> 
> So far we've got:
> 
> * spurious EIO when opening a file (your previous report)
> * hanging with CPU spinning (?) inside pwritev()
> * old kernel, bleeding edge ZFS
> 
> From an (uninformed) peek at the ZFS code, if it really is spinning
> there, it seems like a pretty low-level problem: it's finished the write,
> and now it is just trying to release (something like our unpin) and
> unlock the buffers, which involves various code paths that might touch
> various mutexes and spinlocks, and to get stuck like that I guess it's
> either corrupted itself or it is deadlocking against something else,
> but what?  Do you see any other processes (including kernel threads)
> with any stuck stacks that might be a deadlock partner?

Sorry, but even after forgetting several times, by the time I finally
remembered to go back to this issue I had already rebooted the VM, as needed
to kill the stuck process.

But .. it seems to have recurred again this AM.  Note that yesterday,
I'd taken the opportunity to upgrade to zfs-2.2.4.

These two procs are the oldest active postgres procs, and also (now)
adjacent in ps -ef --sort start_time.

-[ RECORD 1 ]----+------------------------------
backend_start| 2024-05-07 09:45:06.228528-06
application_name | 
xact_start   | 2024-05-07 09:55:38.409549-06
query_start  | 2024-05-07 09:55:38.409549-06
state_change | 2024-05-07 09:55:38.409549-06
pid  | 27449
backend_type | autovacuum worker
wait_event_type  | BufferPin
wait_event   | BufferPin
state| active
backend_xid  | 
backend_xmin | 4293757489
query_id | 
left | autovacuum: VACUUM ANALYZE child.
-[ RECORD 2 ]----+------------------------------
backend_start| 2024-05-07 09:49:24.686314-06
application_name | MasterMetricsLoader -n -m Xml
xact_start   | 2024-05-07 09:50:30.387156-06
query_start  | 2024-05-07 09:50:32.744435-06
state_change | 2024-05-07 09:50:32.744436-06
pid  | 5051
backend_type | client backend
wait_event_type  | IO
wait_event   | DataFileExtend
state| active
backend_xid  | 4293757489
backend_xmin | 4293757429
query_id | 2230046159014513529
left | PREPARE mml_0 AS INSERT INTO chil

The earlier proc is doing:
strace: Process 27449 attached
epoll_wait(11, ^Cstrace: Process 27449 detached
 

The later process is stuck, with:
[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/5051/stack 
[] 0x

For good measure:
[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/27433/stack 
[] taskq_thread+0x48e/0x4e0 [spl]
[] kthread+0xd1/0xe0
[] ret_from_fork_nospec_end+0x0/0x39
[] 0x
[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/27434/stack 
[] taskq_thread+0x48e/0x4e0 [spl]
[] kthread+0xd1/0xe0
[] ret_from_fork_nospec_end+0x0/0x39
[] 0x

[pryzbyj@telsasoft-centos7 ~]$ ps -O wchan 5051 27449
  PID === S TTY  TIME COMMAND
 5051 ?       R ?        02:14:27 postgres: telsasoft ts ::1(53708) INSERT
27449 ep_poll S ?        00:05:16 postgres: autovacuum worker ts

The interesting procs might be:

ps -eO wchan===,lstart --sort start_time
...
15632 worker_thread  Mon May  6 23:51:34 2024 S ?        00:00:00 [kworker/2:2H]
27433 taskq_thread   Tue May  7 09:35:59 2024 S ?        00:00:56 [z_wr_iss]
27434 taskq_thread   Tue May  7 09:35:59 2024 S ?        00:00:57 [z_wr_iss]
27449 ep_poll        Tue May  7 09:45:05 2024 S ?        00:05:16 postgres: autovacuum worker ts
 5051 ?              Tue May  7 09:49:23 2024 R ?        02:23:04 postgres: telsasoft ts ::1(53708) INSERT
 7861 ep_poll        Tue May  7 09:51:25 2024 S ?        00:03:04 /usr/local/bin/python3.8 -u /home/telsasoft/server/alarms/core/pr...
 7912 ep_poll        Tue May  7 09:51:27 2024 S ?        00:00:00 postgres: telsasoft ts ::1(53794) idle
24518 futex_wait_que Tue May  7 10:42:56 2024 S ?        00:00:55 java -jar /home/telsasoft/server/alarms/alcatel_lucent/jms/jms2rm...
...

> While looking around for reported issues I found your abandoned report
> against an older ZFS version from a few years ago, same old Linux
> version:
> 
> https://github.com/openzfs/zfs/issues/11641
> 
> I don't know enough to say anything useful about that but it certainly
> smells similar...

Wow - I'd completely forgotten about that problem report.
The symptoms are the same, even with a zfs version 3+ years newer.
I wish the ZFS people would do more with their problem reports.

BTW, we'll be upgrading this VM to a newer kernel, if not a newer OS
(for some reason, these projects take a very long time).  With any luck,
it'll either recur, or not.

I'm 

Re: pg_sequence_last_value() for unlogged sequences on standbys

2024-05-07 Thread Tom Lane
Nathan Bossart  writes:
> On Tue, May 07, 2024 at 01:44:16PM -0400, Tom Lane wrote:
>> +1 to include that, as it offers a defense if someone invokes this
>> function directly.  In HEAD we could then rip out the test in the
>> view.

> I apologize for belaboring this point, but I don't see how we would be
> comfortable removing that check unless we are okay with other sessions'
> temporary sequences appearing in the view, albeit with a NULL last_value.

Oh!  You're right, I'm wrong.  I was looking at the CASE filter, which
we could get rid of -- but the "WHERE NOT pg_is_other_temp_schema(N.oid)"
part has to stay.

regards, tom lane




Re: Use pgstat_kind_infos to read fixed shared stats structs

2024-05-07 Thread Tristan Partin

On Tue May 7, 2024 at 1:29 PM CDT, Andres Freund wrote:

Hi,

On 2024-05-06 14:07:53 -0500, Tristan Partin wrote:
> Instead of needing to be explicit, we can just iterate the
> pgstat_kind_infos array to find the memory locations to read into.

> This was originally thought of by Andres in
> 5891c7a8ed8f2d3d577e7eea34dacff12d7b6bbd.
> 
> Not a fix, per se, but it does remove a comment. Perhaps the discussion will

> just lead to someone deleting the comment, and keeping the code as is.
> Either way, a win :).

I think it's a good idea. I'd really like to allow extensions to register new
types of stats eventually. Stuff like pg_stat_statements having its own,
fairly ... mediocre, stats storage shouldn't be necessary.


Could be useful for Neon in the future too.


Do we need to increase the stats version, I didn't check if the order we
currently store things in and the numerical order of the stats IDs are the
same.


I checked the orders, and they looked the same.

1. Archiver
2. BgWriter
3. Checkpointer
4. IO
5. SLRU
6. WAL

--
Tristan Partin
Neon (https://neon.tech)




Re: pg_sequence_last_value() for unlogged sequences on standbys

2024-05-07 Thread Nathan Bossart
On Tue, May 07, 2024 at 03:02:01PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> On Tue, May 07, 2024 at 01:44:16PM -0400, Tom Lane wrote:
>>> +1 to include that, as it offers a defense if someone invokes this
>>> function directly.  In HEAD we could then rip out the test in the
>>> view.
> 
>> I apologize for belaboring this point, but I don't see how we would be
>> comfortable removing that check unless we are okay with other sessions'
>> temporary sequences appearing in the view, albeit with a NULL last_value.
> 
> Oh!  You're right, I'm wrong.  I was looking at the CASE filter, which
> we could get rid of -- but the "WHERE NOT pg_is_other_temp_schema(N.oid)"
> part has to stay.

Okay, phew.  We can still do something like v3-0002 for v18.  I'll give
Michael a chance to comment on 0001 before committing/back-patching that
one.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 2a37834699587eef18b50bf8d58723790bbcdde7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 30 Apr 2024 20:54:51 -0500
Subject: [PATCH v3 1/2] Fix pg_sequence_last_value() for non-permanent
 sequences on standbys.

---
 src/backend/commands/sequence.c   | 22 --
 src/test/recovery/t/001_stream_rep.pl |  8 
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 46103561c3..9d7468d7bb 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -1780,8 +1780,8 @@ pg_sequence_last_value(PG_FUNCTION_ARGS)
 	Buffer		buf;
 	HeapTupleData seqtuple;
 	Form_pg_sequence_data seq;
-	bool		is_called;
-	int64		result;
+	bool		is_called = false;
+	int64		result = 0;
 
 	/* open and lock sequence */
 	init_sequence(relid, &elm, &seqrel);
@@ -1792,12 +1792,22 @@ pg_sequence_last_value(PG_FUNCTION_ARGS)
  errmsg("permission denied for sequence %s",
 		RelationGetRelationName(seqrel))));
 
-	seq = read_seq_tuple(seqrel, &buf, &seqtuple);
+	/*
+	 * For the benefit of the pg_sequences system view, we return NULL for
+	 * temporary and unlogged sequences on standbys instead of throwing an
+	 * error.  We also always return NULL for other sessions' temporary
+	 * sequences.
+	 */
+	if ((RelationIsPermanent(seqrel) || !RecoveryInProgress()) &&
+		!RELATION_IS_OTHER_TEMP(seqrel))
+	{
+		seq = read_seq_tuple(seqrel, &buf, &seqtuple);
 
-	is_called = seq->is_called;
-	result = seq->last_value;
+		is_called = seq->is_called;
+		result = seq->last_value;
 
-	UnlockReleaseBuffer(buf);
+		UnlockReleaseBuffer(buf);
+	}
 	sequence_close(seqrel, NoLock);
 
 	if (is_called)
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 5311ade509..4c698b5ce1 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -95,6 +95,14 @@ $result = $node_standby_2->safe_psql('postgres', "SELECT * FROM seq1");
 print "standby 2: $result\n";
 is($result, qq(33|0|t), 'check streamed sequence content on standby 2');
 
+# Check pg_sequence_last_value() returns NULL for unlogged sequence on standby
+$node_primary->safe_psql('postgres',
+	"CREATE UNLOGGED SEQUENCE ulseq; SELECT nextval('ulseq')");
+$node_primary->wait_for_replay_catchup($node_standby_1);
+is($node_standby_1->safe_psql('postgres',
+	"SELECT pg_sequence_last_value('ulseq'::regclass) IS NULL"),
+	't', 'pg_sequence_last_value() on unlogged sequence on standby 1');
+
 # Check that only READ-only queries can run on standbys
 is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
 	3, 'read-only queries on standby 1');
-- 
2.25.1

>From b96d1f21f6144640561360c84b361f569a2edc48 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 7 May 2024 14:35:34 -0500
Subject: [PATCH v3 2/2] Simplify pg_sequences a bit.

XXX: NEEDS CATVERSION BUMP
---
 src/backend/catalog/system_views.sql |  6 +-
 src/backend/commands/sequence.c  | 15 +--
 src/test/regress/expected/rules.out  |  5 +
 3 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 53047cab5f..b32e5c3170 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -176,11 +176,7 @@ CREATE VIEW pg_sequences AS
 S.seqincrement AS increment_by,
 S.seqcycle AS cycle,
 S.seqcache AS cache_size,
-CASE
-WHEN has_sequence_privilege(C.oid, 'SELECT,USAGE'::text)
-THEN pg_sequence_last_value(C.oid)
-ELSE NULL
-END AS last_value
+pg_sequence_last_value(C.oid) AS last_value
 FROM pg_sequence S JOIN pg_class C ON (C.oid = S.seqrelid)
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
 WHERE NOT pg_is_other_temp_schema(N.oid)
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 9d7468d7bb..f129375915 100644
--- a/src/backend/commands/sequ

Re: On disable_cost

2024-05-07 Thread Robert Haas
On Mon, May 6, 2024 at 4:30 PM Peter Geoghegan  wrote:
> FWIW I always found those weird inconsistencies to be annoying at
> best, and confusing at worst. I speak as somebody that uses
> disable_cost a lot.
>
> I certainly wouldn't ask anybody to make it a priority for that reason
> alone -- it's not *that* bad. I've given my opinion on this because
> it's already under discussion.

Thanks, it's good to have other perspectives.

Here are some patches for discussion.

0001 gets rid of disable_cost as a mechanism for forcing a TID scan
plan to be chosen when CurrentOfExpr is present. Instead, it arranges
to generate only the valid path when that case occurs, and skip
everything else. I think this is a good cleanup, and it doesn't seem
totally impossible that it actually prevents a failure in some extreme
case.

0002 cleans up the behavior of enable_indexscan and
enable_indexonlyscan. Currently, setting enable_indexscan=false adds
disable_cost to both the cost of index scans and the cost of
index-only scans. I think that's indefensible and, in fact, a bug,
although I believe David Rowley disagrees. With this patch, we simply
don't generate index scans if enable_indexscan=false, and we don't
generate index-only scans if enable_indexonlyscan=false, which seems a
lot more consistent to me. However, I did revise one major thing from
the patch I posted before, per feedback from David Rowley and also per
my own observations: in this version, if enable_indexscan=true and
enable_indexonlyscan=false, we'll generate index-scan paths for any
cases where, with both set to true, we would have only generated
index-only scan paths. That change makes the behavior of this patch a
lot more comprehensible and intuitive: the only regression test
changes are places where somebody expected that they could disable
both index scans and index-only scans by setting
enable_indexscan=false.

0003 and 0004 extend the approach of "just don't generate the disabled
path" to bitmap scans and gather merge, respectively. I think these
are more debatable, mostly because it's not clear how far we can
really take this approach. Neither breaks any test cases, and 0003 is
closely related to the work done in 0002, which seems like a point in
its favor. 0004 was simply the only other case where it was obvious to
me that this kind of approach made sense. In my view, it makes most
sense to use this kind of approach for planner behaviors that seem
like they're sort of optional: like if you don't use gather merge, you
can still use gather, and if you don't use index scans, you can still
use sequential scans. With all these patches applied, the remaining
cases where we rely on disable_cost are:

sequential scans
sorts
hash aggregation
all 3 join types
hash joins where a bucket holding the inner MCV would exceed hash_mem

Sequential scans are clearly a last-ditch method. I find it a bit hard
to decide whether hashing or sorting is the default, especially given
the asymmetry between enable_sort - presumptively anywhere - and
enable_hashagg - specific to aggregation. As for the join types, it's
tempting to consider nested-loop the default type -- it's the only way
to satisfy parameterizations, for instance -- but the fact that it's
the only method that can't do a full join undermines that position in
my book. But, I don't want to pretend like I have all the answers
here, either; I'm just sharing some thoughts.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


0004-When-enable_gathermerge-false-don-t-generate-gather-.patch
Description: Binary data


0002-Don-t-generate-index-scan-paths-when-enable_indexsca.patch
Description: Binary data


0003-When-enable_bitmapscan-false-just-don-t-generate-bit.patch
Description: Binary data


0001-Remove-grotty-use-of-disable_cost-for-TID-scan-plans.patch
Description: Binary data


Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2024-05-07 Thread Michail Nikolaev
Hi again!

I made an error in the `GlobalVisHorizonKindForRel` logic, and it was caught
by a new test.

A fixed version is attached.
From 9a8ea366f6d2d144979e825c4ac0bdd2937bf7c1 Mon Sep 17 00:00:00 2001
From: nkey 
Date: Tue, 7 May 2024 22:10:56 +0200
Subject: [PATCH v3] WIP: fix d9d076222f5b "VACUUM: ignore indexing operations 
 with CONCURRENTLY" which was reverted by e28bb8851969.

The issue was caused by the absence of any snapshot that actually protects the
data in the relation, which is required to build the index correctly.

Introduce a new type of visibility horizon to be used for relations with
concurrently built indexes (in the case of a "safe" index).

Now `GlobalVisHorizonKindForRel` may dynamically decide which horizon to use,
based on the data about safe indexes being built concurrently.

To reduce the performance impact, the counter of concurrently built indexes is
kept in shared memory.
---
 src/backend/catalog/index.c  |  36 ++
 src/backend/commands/indexcmds.c |  20 +++
 src/backend/storage/ipc/ipci.c   |   2 +
 src/backend/storage/ipc/procarray.c  |  85 -
 src/backend/utils/cache/relcache.c   |  11 ++
 src/bin/pg_amcheck/t/006_concurrently.pl | 155 +++
 src/include/catalog/index.h  |   5 +
 src/include/utils/rel.h  |   1 +
 8 files changed, 309 insertions(+), 6 deletions(-)
 create mode 100644 src/bin/pg_amcheck/t/006_concurrently.pl

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5a8568c55c..3caa2bab12 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -97,6 +97,11 @@ typedef struct
 	Oid			pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER];
 } SerializedReindexState;
 
+typedef struct {
+	pg_atomic_uint32 numSafeConcurrentlyBuiltIndexes;
+} SafeICSharedState;
+static SafeICSharedState *SafeICStateShmem;
+
 /* non-export function prototypes */
 static bool relationHasPrimaryKey(Relation rel);
 static TupleDesc ConstructTupleDescriptor(Relation heapRelation,
@@ -176,6 +181,37 @@ relationHasPrimaryKey(Relation rel)
 	return result;
 }
 
+
+void SafeICStateShmemInit(void)
+{
+	bool		found;
+
+	SafeICStateShmem = (SafeICSharedState *)
+			ShmemInitStruct("Safe Concurrently Build Indexes",
+			sizeof(SafeICSharedState),
+			&found);
+
+	if (!IsUnderPostmaster)
+	{
+		Assert(!found);
+		pg_atomic_init_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 0);
+	} else
+		Assert(found);
+}
+
+void UpdateNumSafeConcurrentlyBuiltIndexes(bool increment)
+{
+	if (increment)
+		pg_atomic_fetch_add_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+	else
+		pg_atomic_fetch_sub_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes, 1);
+}
+
+bool IsAnySafeIndexBuildsConcurrently()
+{
+	return pg_atomic_read_u32(&SafeICStateShmem->numSafeConcurrentlyBuiltIndexes) > 0;
+}
+
 /*
  * index_check_primary_key
  *		Apply special checks needed before creating a PRIMARY KEY index
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487..663450ba20 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1636,6 +1636,8 @@ DefineIndex(Oid tableId,
 	 * hold lock on the parent table.  This might need to change later.
 	 */
 	LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
+	if (safe_index && concurrent)
+		UpdateNumSafeConcurrentlyBuiltIndexes(true);
 
 	PopActiveSnapshot();
 	CommitTransactionCommand();
@@ -1804,7 +1806,15 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	/* Commit index as valid before reducing counter of safe concurrently build indexes */
+	CommitTransactionCommand();
 
+	Assert(concurrent);
+	if (safe_index)
+		UpdateNumSafeConcurrentlyBuiltIndexes(false);
+
+	/* Start a new transaction to finish process properly */
+	StartTransactionCommand();
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
 	 */
@@ -3902,6 +3912,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 indexRel->rd_indpred == NIL);
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
+		if (idx->safe)
+			UpdateNumSafeConcurrentlyBuiltIndexes(true);
 
 		/* This function shouldn't be called for temporary relations. */
 		if (indexRel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
@@ -4345,6 +4357,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		UnlockRelationIdForSession(lockrelid, ShareUpdateExclusiveLock);
 	}
 
+	// now we may clear safe index building flags
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		if (newidx->safe)
+			UpdateNumSafeConcurrentlyBuiltIndexes(false);
+	}
+
 	/* Start a new transaction to finish process properly */
 	StartTransactionCommand();
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc

Re: partitioning and identity column

2024-05-07 Thread Peter Eisentraut

On 30.04.24 12:59, Ashutosh Bapat wrote:

PFA patch which fixes all the three problems.


I have committed this patch.  Thanks all.




Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-07 Thread Jacob Champion
On Mon, May 6, 2024 at 8:43 PM Michael Paquier  wrote:
> On Fri, May 03, 2024 at 07:05:38AM -0700, Jacob Champion wrote:
> > We could port something like that to src/common. IMO that'd be more
> > suited for an actual conversion routine, though, as opposed to a
> > parser that for the most part assumes you didn't lie about the input
> > encoding and is just trying not to crash if you're wrong. Most of the
> > time, the parser just copies bytes between delimiters around and it's
> > up to the caller to handle encodings... the exceptions to that are the
> > \u escapes and the error handling.
>
> Hmm.  That would still leave the backpatch issue at hand, which is
> kind of confusing to leave as it is.  Would it be complicated to
> truncate the entire byte sequence in the error message and just give
> up because we cannot do better if the input byte sequence is
> incomplete?

Maybe I've misunderstood, but isn't that what's being done in v2?

> > Maybe I'm missing
> > code somewhere, but I don't see a conversion routine from
> > json_errdetail() to the actual client/locale encoding. (And the parser
> > does not support multibyte input_encodings that contain ASCII in trail
> > bytes.)
>
> Referring to json_lex_string() that does UTF-8 -> ASCII -> give-up in
> its conversion for FRONTEND, I guess?  Yep.  This limitation looks
> like a problem, especially if plugging that to libpq.

Okay. How we deal with that will likely guide the "optimal" fix to
error reporting, I think...

Thanks,
--Jacob




Re: New GUC autovacuum_max_threshold ?

2024-05-07 Thread Nathan Bossart
On Tue, May 07, 2024 at 10:31:00AM -0400, Robert Haas wrote:
> On Wed, May 1, 2024 at 10:03 PM David Rowley  wrote:
>> I think we need at least 1a) before we can give autovacuum more work
>> to do, especially if we do something like multiply its workload by
>> 1024x, per your comment above.
> 
> I guess I view it differently. It seems to me that right now, we're
> not vacuuming large tables often enough. We should fix that,
> independently of anything else. If the result is that small and medium
> sized tables get vacuumed less often, then that just means there were
> never enough resources to go around in the first place. We haven't
> taken a system that was working fine and broken it: we've just moved
> the problem from one category of tables (the big ones) to a different
> category of tables. If the user wants to solve that problem, they need
> to bump up the cost limit or add hardware. I don't see that we have
> any particular reason to believe such users will be worse off on
> average than they are today. On the other hand, users who do have a
> sufficiently high cost limit and enough hardware will be better off,
> because we'll start doing all the vacuuming work that needs to be done
> instead of only some of it.
> 
> Now, if we start vacuuming any class of table whatsoever 1024x as
> often as we do today, we are going to lose. But that would still be
> true even if we did everything on your list. Large tables need to be
> vacuumed more frequently than we now do, but not THAT much more
> frequently. Any system that produces that result is just using a wrong
> algorithm, or wrong constants, or something. Even if all the necessary
> resources are available, nobody is going to thank us for vacuuming
> gigantic tables in a tight loop. The problem with such a large
> increase is not that we don't have prioritization, but that such a
> large increase is fundamentally the wrong thing to do. On the other
> hand, I think a more modest increase is the right thing to do, and I
> think it's the right thing to do whether we have prioritization or
> not.

This is about how I feel, too.  In any case, I +1'd a higher default
because I think we need to be pretty conservative with these changes, at
least until we have a better prioritization strategy.  While folks may opt
to set this value super low, I think that's more likely to lead to some
interesting secondary effects.  If the default is high, hopefully these
secondary effects will be minimized or avoided.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Weird test mixup

2024-05-07 Thread Noah Misch
On Tue, May 07, 2024 at 11:53:10AM -0700, Noah Misch wrote:
> On Tue, May 07, 2024 at 10:17:49AM +0900, Michael Paquier wrote:
> > Overall, this switches from one detach behavior to a different one,
> 
> Can you say more about that?  The only behavior change known to me is that a
> given injection point workload uses more of INJ_MAX_CONDITION.  If there's
> another behavior change, it was likely unintended.

I see patch inplace030-inj-exit-race-v1.patch does not fix the race seen with
repro-inj-exit-race-v1.patch.  I withdraw inplace030-inj-exit-race-v1.patch,
and I withdraw the above question.

> To reproduce, apply [repro-inj-exit-race-v1.patch] to add
> sleeps, and run:
> 
> make -C src/test/modules/gin installcheck USE_MODULE_DB=1 & sleep 2; make -C contrib/intarray installcheck USE_MODULE_DB=1
> 
> Separately, I see injection_points_attach() populates InjectionPointCondition
> after InjectionPointAttach().  Shouldn't InjectionPointAttach() come last, to
> avoid the same sort of race?  I've not tried to reproduce that one.




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread David Rowley
On Wed, 8 May 2024 at 06:20, Tom Lane  wrote:
> I find that Richard's proposed fix makes the core regression tests
> pass, but we still fail check-world.  So I'm afraid we need something
> more aggressive, like the attached which makes make_restrictinfo
> copy all its input bitmapsets.  Without that, we still have sharing
> of bitmapsets across different RestrictInfos, which seems pretty
> scary given what we now see about the effects of 00b41463c.  This
> seems annoyingly expensive, but maybe there's little choice?

We could make the policy copy-on-modify.  If you put bms_copy around
the bms_del_member() calls in remove_rel_from_query(), does it pass
then?

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread David Rowley
On Wed, 8 May 2024 at 10:35, David Rowley  wrote:
>
> On Wed, 8 May 2024 at 06:20, Tom Lane  wrote:
> > I find that Richard's proposed fix makes the core regression tests
> > pass, but we still fail check-world.  So I'm afraid we need something
> > more aggressive, like the attached which makes make_restrictinfo
> > copy all its input bitmapsets.  Without that, we still have sharing
> > of bitmapsets across different RestrictInfos, which seems pretty
> > scary given what we now see about the effects of 00b41463c.  This
> > seems annoyingly expensive, but maybe there's little choice?
>
> We could make the policy copy-on-modify.  If you put bms_copy around
> the bms_del_member() calls in remove_rel_from_query(), does it pass
> then?

err, I mean bms_copy() the set before passing to bms_del_member().

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread Tom Lane
David Rowley  writes:
> On Wed, 8 May 2024 at 06:20, Tom Lane  wrote:
>> I find that Richard's proposed fix makes the core regression tests
>> pass, but we still fail check-world.  So I'm afraid we need something
>> more aggressive, like the attached which makes make_restrictinfo
>> copy all its input bitmapsets.  Without that, we still have sharing
>> of bitmapsets across different RestrictInfos, which seems pretty
>> scary given what we now see about the effects of 00b41463c.  This
>> seems annoyingly expensive, but maybe there's little choice?

> We could make the policy copy-on-modify.  If you put bms_copy around
> the bms_del_member() calls in remove_rel_from_query(), does it pass
> then?

Didn't test, but that route seems awfully invasive and fragile: how
will we find all the places to modify, or ensure that the policy
is followed by future patches?

regards, tom lane




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread David Rowley
On Wed, 8 May 2024 at 10:40, Tom Lane  wrote:
>
> David Rowley  writes:
> > We could make the policy copy-on-modify.  If you put bms_copy around
> > the bms_del_member() calls in remove_rel_from_query(), does it pass
> > then?
>
> Didn't test, but that route seems awfully invasive and fragile: how
> will we find all the places to modify, or ensure that the policy
> is followed by future patches?

REALLOCATE_BITMAPSETS was invented for this and IMO, it found exactly
the problem it was invented to find.

Copy-on-modify is our policy for node mutation. Why is it ok there but
awfully fragile here?

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread Tom Lane
David Rowley  writes:
> On Wed, 8 May 2024 at 10:40, Tom Lane  wrote:
>> Didn't test, but that route seems awfully invasive and fragile: how
>> will we find all the places to modify, or ensure that the policy
>> is followed by future patches?

> REALLOCATE_BITMAPSETS was invented for this and IMO, it found exactly
> the problem it was invented to find.

Not in a way that gives me any confidence that we found *all* the
problems.  If check-world finds a problem that the core tests did not,
then there's no reason to think there aren't still more issues that
check-world happened not to trip over either.

> Copy-on-modify is our policy for node mutation. Why is it ok there but
> awfully fragile here?

It's only partly our policy: there are all those places where we don't
do it that way.  The main problem that I see for trying to be 100%
consistent in that way is that once you modify a sub-node, full
copy-on-modify dictates replacing every ancestor node all the way to
the top of the tree.  That's clearly impractical in the planner data
structures.  So where are we going to stop exactly?

regards, tom lane




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread David Rowley
On Wed, 8 May 2024 at 10:55, Tom Lane  wrote:
>
> David Rowley  writes:
> > REALLOCATE_BITMAPSETS was invented for this and IMO, it found exactly
> > the problem it was invented to find.
>
> Not in a way that gives me any confidence that we found *all* the
> problems.

Here are some statements I believe to be true:
1. If REALLOCATE_BITMAPSETS is defined then modifications to a
Bitmapset will make a copy and free the original.
2. If a query runs successfully without REALLOCATE_BITMAPSETS and
Assert fails due to an invalid Bitmapset when REALLOCATE_BITMAPSETS is
defined, then we have > 1 pointer pointing to the same set and not all
of them are being updated when the members are added/removed.

Given the above, I can't see what Bitmapset sharing problems we won't
find with REALLOCATE_BITMAPSETS.
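For illustration, the copy-and-free behaviour in statement 1 can be sketched with a toy single-word stand-in for Bitmapset (this is not the real bms_* API, just a model of why a stale alias becomes a dangling pointer):

```c
#include <assert.h>
#include <stdlib.h>

/* toy stand-in for Bitmapset: one machine word of members */
typedef struct
{
	unsigned long words;
} ToyBitmapset;

static ToyBitmapset *
toy_make(unsigned long members)
{
	ToyBitmapset *b = malloc(sizeof(ToyBitmapset));

	b->words = members;
	return b;
}

/*
 * REALLOCATE_BITMAPSETS-style delete: allocate a copy with the member
 * removed and free the original, so any other pointer still aliasing
 * the old set now points at freed memory and gets caught by an Assert
 * or by valgrind, instead of silently seeing the in-place change.
 */
static ToyBitmapset *
toy_del_member(ToyBitmapset *b, int member)
{
	ToyBitmapset *copy = toy_make(b->words & ~(1UL << member));

	free(b);
	return copy;
}
```

The caller that holds the only pointer is fine; a second pointer to the same set is exactly the sharing this build is meant to expose.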

Can you share the exact scenario you're worried that we won't find so
I can understand your concern?

David




Re: Revert: Remove useless self-joins *and* -DREALLOCATE_BITMAPSETS make server crash, regress test fail.

2024-05-07 Thread Tom Lane
David Rowley  writes:
> On Wed, 8 May 2024 at 10:55, Tom Lane  wrote:
>> Not in a way that gives me any confidence that we found *all* the
>> problems.

> Here are some statements I believe to be true:
> 1. If REALLOCATE_BITMAPSETS is defined then modifications to a
> Bitmapset will make a copy and free the original.
> 2. If a query runs successfully without REALLOCATE_BITMAPSETS and
> Assert fails due to an invalid Bitmapset when REALLOCATE_BITMAPSETS is
> defined, then we have > 1 pointer pointing to the same set and not all
> of them are being updated when the members are added/removed.

> Given the above, I can't see what Bitmapset sharing problems we won't
> find with REALLOCATE_BITMAPSETS.

Anything where the trouble spots are in a code path we fail to
exercise with our available test suites.  If you think there are
no such code paths, I'm sorry to disillusion you.

I spent a little bit of time wondering if we could find problems in a
more static way by marking bitmapset fields as "const", but I fear
that would create a huge number of false positives.

regards, tom lane




Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-07 Thread Peter Smith
Hi, Here are some review comments for patch v6-0001

==
Commit message

1.
This patch allows user to alter two_phase option

/allows user/allows the user/

/to alter two_phase option/to alter the 'two_phase' option/

==
doc/src/sgml/ref/alter_subscription.sgml

2.
two_phase can be altered only for disabled subscription.

SUGGEST
The two_phase parameter can only be altered when
the subscription is disabled.

==
src/backend/access/transam/twophase.c

3. checkGid
+
+/*
+ * checkGid
+ */
+static bool
+checkGid(char *gid, Oid subid)
+{
+ int ret;
+ Oid subid_written,
+ xid;
+
+ ret = sscanf(gid, "pg_gid_%u_%u", &subid_written, &xid);
+
+ if (ret != 2 || subid != subid_written)
+ return false;
+
+ return true;
+}

3a.
The function comment should give more explanation of what it does. I
think this function is the counterpart of the TwoPhaseTransactionGid()
function of worker.c so the comment can say that too.

~

3b.
Indeed, perhaps the function name should be similar to
TwoPhaseTransactionGid. e.g. call it like
IsTwoPhaseTransactionGidForSubid?

~

3c.
Probably 'xid' should be TransactionId instead of Oid.

~

3d.
Why not have a single return?

SUGGESTION
return (ret == 2 && subid == subid_written);

~

3e.
I am wondering if the existing TwoPhaseTransactionGid function
currently in worker.c should be moved here because IMO these 2
functions belong together and twophase.c seems the right place to put
them.
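To make the round trip in 3a-3d concrete, here is a self-contained sketch of a builder (mirroring what worker.c's TwoPhaseTransactionGid() formats) and the matching checker; the function names and the fixed-size buffer are assumptions for the example, not the patch's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Oid;
typedef unsigned int TransactionId;

/* build a GID the way worker.c's TwoPhaseTransactionGid() formats it */
static void
build_gid(char *gid, size_t szgid, Oid subid, TransactionId xid)
{
	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}

/* counterpart check: does this GID belong to the given subscription? */
static bool
gid_matches_subid(const char *gid, Oid subid)
{
	Oid			subid_written;
	TransactionId xid;
	int			ret = sscanf(gid, "pg_gid_%u_%u", &subid_written, &xid);

	return (ret == 2 && subid == subid_written);
}
```

A GID prepared by some other application simply fails the sscanf match, so the checker returns false rather than misattributing it to a subscription.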

~~~

4.
+LookupGXactBySubid(Oid subid)
+{
+ bool found = false;
+
+ LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+ for (int i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+
+ /* Ignore not-yet-valid GIDs. */
+ if (gxact->valid && checkGid(gxact->gid, subid))
+ {
+ found = true;
+ break;
+ }
+ }
+ LWLockRelease(TwoPhaseStateLock);
+ return found;
+}

AFAIK the gxact also has the 'xid' available, so won't it be better to
pass BOTH the 'xid' and 'subid' to the checkGid so you can do a full
comparison instead of comparing only the subid part of the gid?

==
src/backend/commands/subscriptioncmds.c

5. AlterSubscription

+ /* XXX */
+ if (IsSet(opts.specified_opts, SUBOPT_TWOPHASE_COMMIT))
+ {

The "XXX" comment looks like it is meant to say something more...

~~~

6.
+ /*
+ * two_phase can be only changed for disabled
+ * subscriptions
+ */
+ if (form->subenabled)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot set %s for enabled subscription",
+ "two_phase")));

6a.
Should this have a more comprehensive comment giving the reason like
the 'failover' option has?

~~~

6b.
Maybe this should include a "translator" comment to say don't
translate the option name.

~~~

7.
+ /* Check whether the number of prepared transactions */
+ if (!opts.twophase &&
+ form->subtwophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED &&
+ LookupGXactBySubid(subid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot disable two_phase when uncommitted prepared
transactions present")));
+

7a.
The first comment seems to be an incomplete sentence. I think it
should say something a bit like:
two_phase cannot be disabled if there are any uncommitted prepared
transactions present.

~

7b.
Also, if ereport occurs what is the user supposed to do about it?
Shouldn't the ereport include some errhint with appropriate advice?

~~~

8.
+ /*
+ * The changed failover option of the slot can't be rolled
+ * back.
+ */
+ PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... SET
(two_phase)");
+
+ /* Change system catalog accordingly */
+ values[Anum_pg_subscription_subtwophasestate - 1] =
+ CharGetDatum(opts.twophase ?
+ LOGICALREP_TWOPHASE_STATE_PENDING :
+ LOGICALREP_TWOPHASE_STATE_DISABLED);
+ replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
+ }

Typo I think: /failover option/two_phase option/

==
.../libpqwalreceiver/libpqwalreceiver.c

9.
 static void
 libpqrcv_alter_slot(WalReceiverConn *conn, const char *slotname,
- bool failover)
+ bool two_phase, bool failover)

Same comment as mentioned elsewhere (#15), IMO the new 'two_phase'
parameter should be last.

==
src/backend/replication/logical/launcher.c

10.
+/*
+ * Stop all the subscription workers.
+ */
+void
+logicalrep_workers_stop(Oid subid)
+{
+ List*subworkers;
+ ListCell   *lc;
+
+ LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+ subworkers = logicalrep_workers_find(subid, false);
+ LWLockRelease(LogicalRepWorkerLock);
+ foreach(lc, subworkers)
+ {
+ LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc);
+
+ logicalrep_worker_stop(w->subid, w->relid);
+ }
+ list_free(subworkers);
+}

I was confused by the logicalrep_workers_find(subid, false). IIUC the
'false' means everything (instead of 'only_running') but then I don't
know why we want to "stop" anything that is NOT running. OTOH I see
that this code was extracted from where it was previously inlined in
subscriptioncmds.c, so maybe the 'false' is necessary for ano

Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-05-07 Thread Michael Paquier
On Tue, May 07, 2024 at 12:36:24PM +0200, Daniel Gustafsson wrote:
> Fair enough.  I've taken a stab at documenting that the functions are
> deprecated, while at the same time documenting when and how they can be used
> (and be useful).  The attached also removes one additional comment in the
> testcode which is now obsolete (since removing 1.0.1 support really), and
> fixes the spurious whitespace you detected upthread.

+   This function is deprecated and only present for backwards 
compatibility,
+   it does nothing.
[...]
+and 
+   are maintained for backwards compatibility, but are no longer required
+   since PostgreSQL 18.

LGTM, thanks for doing this!
--
Michael




Re: bug: copy progress reporting of backends which run multiple COPYs

2024-05-07 Thread Michael Paquier
On Tue, May 07, 2024 at 07:27:54AM -0500, Justin Pryzby wrote:
> This didn't get fixed for v16, and it seems unlikely that it'll be
> addressed in back branches.
> 
> But while I was reviewing forgotten threads, it occurred to me to raise
> the issue in time to fix it for v17.

Thanks for the reminder.

FWIW, I'm rather annoyed by the fact that we rely on the ParseState to
decide if reporting should happen or not.  file_fdw tells, even if
that's accidental, that status reporting can be useful if working on a
single table.  So, shutting down the whole reporting if a caller of
BeginCopyFrom() does not give a ParseState is too heavy-handed, IMO.

The addition of report_progress in the COPY FROM state data is a good
idea, though isn't that something we should give as an argument of
BeginCopyFrom() instead if sticking this knowledge in COPY FROM?

Different idea: could it be something worth controlling with a
query-level option?  It would then be possible to provide the same
level of control for COPY TO which has reporting paths, given the
application control over the reporting even with file_fdw, and store
the value controlling the reporting in CopyFormatOptions.  I am
wondering if there would be a case where someone wants to do a COPY
but hide entirely the reporting done.

The query-level option has some appeal.
--
Michael




Re: backend stuck in DataFileExtend

2024-05-07 Thread Thomas Munro
On Wed, May 8, 2024 at 6:54 AM Justin Pryzby  wrote:
> On Tue, May 07, 2024 at 10:55:28AM +1200, Thomas Munro wrote:
> > https://github.com/openzfs/zfs/issues/11641
> >
> > I don't know enough to say anything useful about that but it certainly
> > smells similar...
>
> Wow - I'd completely forgotten about that problem report.
> The symptoms are the same, even with a zfs version 3+ years newer.
> I wish the ZFS people would do more with their problem reports.

If I had to guess, my first idea would be that your 1MB or ginormous
16MB recordsize (a relatively new option) combined with PostgreSQL's
8KB block-at-a-time random order I/O patterns are tickling strange
corners and finding a bug that no one has seen before.  I would
imagine that almost everyone in the galaxy who uses very large records
does so with 'settled' data that gets streamed out once sequentially
(for example I think some of the OpenZFS maintainers are at Lawrence
Livermore National Lab where I guess they might pump around petabytes
of data produced by particle physics research or whatever it might be,
probably why they are also adding direct I/O to avoid caches
completely...).  But for us, if we have lots of backends reading,
writing and extending random 8KB fragments of a 16MB page concurrently
(2048 pages per record!), maybe we hit some broken edge...  I'd be
sure to include that sort of detail in any future reports.

Normally I suppress urges to blame problems on kernels, file systems
etc and in the past accusations that ZFS was buggy turned out to be
bugs in PostgreSQL IIRC, but user space sure seems to be off the hook
for this one...




Re: Use pgstat_kind_infos to read fixed shared stats structs

2024-05-07 Thread Michael Paquier
On Tue, May 07, 2024 at 02:07:42PM -0500, Tristan Partin wrote:
> On Tue May 7, 2024 at 1:29 PM CDT, Andres Freund wrote:
>> I think it's a good idea. I'd really like to allow extensions to register new
>> types of stats eventually. Stuff like pg_stat_statements having its own,
>> fairly ... mediocre, stats storage shouldn't be necessary.
> 
> Could be useful for Neon in the future too.

Interesting thing to consider.  If you do that, I'm wondering if you
could, actually, lift the restriction on pg_stat_statements.max and
make it a SIGHUP so that it could be increased on the fly.  Hmm, just a
thought in passing.

>> Do we need to increase the stats version, I didn't check if the order we
>> currently store things in and the numerical order of the stats IDs are the
>> same.
> 
> I checked the orders, and they looked the same.
> 
> 1. Archiver
> 2. BgWriter
> 3. Checkpointer
> 4. IO
> 5. SLRU
> 6. WAL

Yup, I've looked at that yesterday and the read order remains the same
so I don't see a need for a version bump.
--
Michael




Re: pg_sequence_last_value() for unlogged sequences on standbys

2024-05-07 Thread Michael Paquier
On Tue, May 07, 2024 at 02:39:42PM -0500, Nathan Bossart wrote:
> Okay, phew.  We can still do something like v3-0002 for v18.  I'll give
> Michael a chance to comment on 0001 before committing/back-patching that
> one.

What you are doing in 0001, and 0002 for v18 sounds fine to me.
--
Michael




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-05-07 Thread Masahiko Sawada
On Wed, May 1, 2024 at 4:29 PM John Naylor  wrote:
>
> On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada  wrote:
> >
> > On Mon, Apr 15, 2024 at 6:12 PM John Naylor  wrote:
>
> > > - RT_KEY_GET_SHIFT is not covered for key=0:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803
> > >
> > > That should be fairly simple to add to the tests.
> >
> > There are two paths to call RT_KEY_GET_SHIFT():
> >
> > 1. RT_SET() -> RT_KEY_GET_SHIFT()
> > 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT()
> >
> > In both cases, it's called when key > tree->ctl->max_val. Since the
> > minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called
> > when key=0.
>
> Ah, right, so it is dead code. Nothing to worry about, but it does
> point the way to some simplifications, which I've put together in the
> attached.

Thank you for the patch. It looks good to me.

+   /* compute the smallest shift that will allowing storing the key */
+   start_shift = pg_leftmost_one_pos64(key) / RT_SPAN * RT_SPAN;

The comment is moved from RT_KEY_GET_SHIFT() but I think s/will
allowing storing/will allow storing/.
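For anyone following along, assuming RT_SPAN is 8 bits, that line rounds the position of the key's highest set bit down to a multiple of the span. A standalone sketch (with a simple loop standing in for pg_leftmost_one_pos64(), which like the real function requires key > 0):

```c
#include <assert.h>
#include <stdint.h>

#define RT_SPAN 8				/* assumed bits consumed per tree level */

/* position of the highest set bit; caller must ensure x > 0 */
static int
leftmost_one_pos64(uint64_t x)
{
	int			pos = 0;

	while (x >>= 1)
		pos++;
	return pos;
}

/* smallest shift (a multiple of RT_SPAN) that will allow storing the key */
static int
start_shift(uint64_t key)
{
	return leftmost_one_pos64(key) / RT_SPAN * RT_SPAN;
}
```

So keys up to 255 need shift 0, 256 bumps to shift 8, and so on, which is why key = 0 never reaches the old RT_KEY_GET_SHIFT() path.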

>
> > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644
> > >
> > > That should be easy to add.
> >
> > Agreed. The patch is attached.
>
> LGTM
>
> > > - TidStoreCreate* has some memory clamps that are not covered:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234
> > >
> > > Maybe we could experiment with using 1MB for shared, and something
> > > smaller for local.
> >
> > I've confirmed that the local and shared tidstore with small max sizes
> > such as 4kB and 1MB worked. Currently the max size is hard-coded in
> > test_tidstore.c but if we use work_mem as the max size, we can pass
> > different max sizes for local and shared in the test script.
>
> Seems okay, do you want to try that and see how it looks?

I've attached a simple patch for this. In test_tidstore.sql, we used
to create two local tidstore and one shared tidstore. I thought of
specifying small work_mem values for these three cases but it would
remove the normal test cases. So I created separate tidstore for this
test. Also, the new test is just to check if tidstore can be created
with such a small size, but it might be a good idea to add some TIDs
to check if it really works fine.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


use_work_mem_as_max_bytes.patch
Description: Binary data


Re: Use WALReadFromBuffers in more places

2024-05-07 Thread Jingtang Zhang
Hi, Bharath. I've been testing this. It's cool. Is there any way we could
monitor the hit rate of reads served directly from WAL buffers by exporting
it in some views?

---

Regards, Jingtang


Re: SQL function which allows to distinguish a server being in point in time recovery mode and an ordinary replica

2024-05-07 Thread Michael Paquier
On Mon, May 06, 2024 at 06:55:46PM +0300, m.litsa...@postgrespro.ru wrote:
> as a first step I have introduced the `SharedRecoveryDataFlags` bitmask
> instead of three boolean SharedHotStandbyActive, SharedPromoteIsTriggered
> and SharedStandbyModeRequested flags (the last one from my previous patch)
> and made minimal updates in corresponding code based on that change.

Thanks for the patch.

 /*
- * Local copy of SharedHotStandbyActive variable. False actually means "not
+ * Local copy of XLR_HOT_STANDBY_ACTIVE flag. False actually means "not
  * known, need to check the shared state".
  */
 static bool LocalHotStandbyActive = false;
 
 /*
- * Local copy of SharedPromoteIsTriggered variable. False actually means "not
+ * Local copy of XLR_PROMOTE_IS_TRIGGERED flag. False actually means "not
  * known, need to check the shared state".
  */
 static bool LocalPromoteIsTriggered = false;

It's a bit strange to have a bitwise set of flags in shmem while we
keep these local copies as booleans.  Perhaps it would be cleaner to
merge both local variables into a single bits32 store?

+   uint32  SharedRecoveryDataFlags;

I'd switch to bits32 for flags here.
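A minimal sketch of that arrangement, with a single bits32 local cache replacing the two booleans (the flag names are from the patch, but the bit values and helper names here are assumptions for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t bits32;

/* hypothetical bit assignments for the shared recovery flags */
#define XLR_HOT_STANDBY_ACTIVE		(1U << 0)
#define XLR_PROMOTE_IS_TRIGGERED	(1U << 1)
#define XLR_STANDBY_MODE_REQUESTED	(1U << 2)

/* one local copy of the shared flag word, instead of two booleans */
static bits32 LocalRecoveryDataFlags = 0;

/* refresh the local cache from the (locked) shared state */
static void
refresh_local_flags(bits32 shared_flags)
{
	LocalRecoveryDataFlags = shared_flags;
}

static bool
local_hot_standby_active(void)
{
	return (LocalRecoveryDataFlags & XLR_HOT_STANDBY_ACTIVE) != 0;
}

static bool
local_promote_is_triggered(void)
{
	return (LocalRecoveryDataFlags & XLR_PROMOTE_IS_TRIGGERED) != 0;
}
```

One refresh then serves all the "local copy" checks, which is the cleanliness argument for merging the booleans.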

+bool
+StandbyModeIsRequested(void)
+{
+   /*
+* Spinlock is not needed here because XLR_STANDBY_MODE_REQUESTED flag
+* can only be read after startup process is done.
+*/
+   return (XLogRecoveryCtl->SharedRecoveryDataFlags & 
XLR_STANDBY_MODE_REQUESTED) != 0;
+}

How about introducing a single wrapper function that returns the whole
value SharedRecoveryDataFlags, with the flags published in a header?
Sure, XLR_HOT_STANDBY_ACTIVE is not really exciting because being able
to query a standby implies it, but XLR_PROMOTE_IS_TRIGGERED could be
interesting?  Then this could be used with a function that returns a
text[] array with all the states retrieved?

The refactoring pieces and the function pieces should be split, for
clarity.
--
Michael




Re: [PATCH] json_lex_string: don't overread on bad UTF8

2024-05-07 Thread Michael Paquier
On Tue, May 07, 2024 at 02:06:10PM -0700, Jacob Champion wrote:
> Maybe I've misunderstood, but isn't that what's being done in v2?

Something a bit different..  I was wondering if it could be possible
to tweak this code to truncate the data in the generated error string
so as the incomplete multi-byte sequence is entirely cut out, which
would come to setting token_terminator to "s" (last byte before the
incomplete byte sequence) rather than "term" (last byte available,
even if incomplete):
#define FAIL_AT_CHAR_END(code) \
do { \
   char   *term = s + pg_encoding_mblen(lex->input_encoding, s); \
   lex->token_terminator = (term <= end) ? term : s; \
   return code; \
} while (0)
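The clamp in that macro can be exercised on its own; the sketch below substitutes a hand-rolled UTF-8 lead-byte length function for pg_encoding_mblen() (an assumption for the example) and returns s whenever the character starting there would run past end:

```c
#include <assert.h>

/* simplified stand-in for pg_encoding_mblen() for UTF-8 lead bytes */
static int
utf8_mblen(unsigned char c)
{
	if (c < 0x80)
		return 1;
	if ((c & 0xE0) == 0xC0)
		return 2;
	if ((c & 0xF0) == 0xE0)
		return 3;
	return 4;
}

/*
 * Mirror FAIL_AT_CHAR_END's terminator choice: stop at the end of the
 * character starting at s, unless that would read past end, in which
 * case stop at s itself and never overread the buffer.
 */
static const char *
clamp_terminator(const char *s, const char *end)
{
	const char *term = s + utf8_mblen((unsigned char) *s);

	return (term <= end) ? term : s;
}
```

With a buffer ending in a truncated 3-byte sequence, the terminator collapses back to s, which is exactly the overread-avoidance being discussed.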

But looking closer, I can see that in the JSON_INVALID_TOKEN case,
when !tok_done, we set token_terminator to point to the end of the
token, and that would include an incomplete byte sequence like in your
case.  :/

At the end of the day, I think that I'm OK with your patch and avoid
the overread for now in the back-branches.  This situation makes me
uncomfortable and we should put more effort in printing error messages
in a readable format, but that could always be tackled later as a
separate problem..  And I don't see something backpatchable at short
sight for v16.

Thoughts and/or objections?
--
Michael

