Re: Insertion Sort Improvements

2023-05-23 Thread Benjamin Coutu
Greetings,

I would like to revisit the discussion and concur with John's view that
incremental progress through small, successive modifications is the right
approach going forward. Therefore, I would like to propose two distinct ideas
for improving insertion sort:

1. Implementation of a more efficient variant (as described in the introductory 
part of this thread):

 OLD 

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
    for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
         pl -= ST_POINTER_STEP)
        DO_SWAP(pl, pl - ST_POINTER_STEP);

 NEW 

for (
    pm = a + ST_POINTER_STEP;
    pm < a + n * ST_POINTER_STEP;
    pm += ST_POINTER_STEP
) {
    ST_POINTER_TYPE tmp = *pm;

    for (
        pl = pm - ST_POINTER_STEP;
        pl >= a && DO_COMPARE(pl, &tmp) > 0;
        pl -= ST_POINTER_STEP
    )
        *(pl + ST_POINTER_STEP) = *pl;

    *(pl + ST_POINTER_STEP) = tmp;
}



2. It appears that there is a consensus against dropping the presorted check,
despite its questionable value. In light of this, an alternative suggestion is
to fold the presorted check into the insertion sort path by using an unbounded
insertion sort: only after a maximum number of swaps has occurred do we abandon
the sort. If insertion sort runs over the entire array without any swaps, we
can simply return. If not, and we continue with quicksort after the swap limit
has been reached, we will at least have left the array in a more sorted state,
which is likely to be advantageous for subsequent iterations.

 OLD 

if (n < 7)
{
    for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
         pm += ST_POINTER_STEP)
        for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
             pl -= ST_POINTER_STEP)
            DO_SWAP(pl, pl - ST_POINTER_STEP);
    return;
}
presorted = 1;
for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
{
    DO_CHECK_FOR_INTERRUPTS();
    if (DO_COMPARE(pm - ST_POINTER_STEP, pm) > 0)
    {
        presorted = 0;
        break;
    }
}
if (presorted)
    return;

 NEW 

#define LIMIT_SWAPS 30 /* to be determined empirically */

int swaps = 0;

for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
     pm += ST_POINTER_STEP)
    for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
         pl -= ST_POINTER_STEP)
    {
        DO_SWAP(pl, pl - ST_POINTER_STEP);

        if (++swaps == LIMIT_SWAPS)
            goto quicksort;
    }

if (swaps == 0)
    return;

quicksort:



Naturally, both modifications (with point 2 being highly speculative) are 
currently independent of each other, and it is also crucial to benchmark the 
combined variant as well as different values for LIMIT_SWAPS.
I would greatly appreciate assistance in benchmarking these proposed changes. 
Your collaboration in this matter would be invaluable.

Cheers, Ben




Re: "38.10.10. Shared Memory and LWLocks" may require a clarification

2023-05-23 Thread Aleksander Alekseev
Hi hackers,

That's me still talking to myself :)

> Thoughts?

Evidently this works differently on Windows from what I initially
thought, due to the lack of fork() on that system.

PFA the patch v3. Your feedback is most welcome.

--
Best regards,
Aleksander Alekseev


v3-0001-Clarify-the-38.10.10.-Shared-Memory-and-LWLocks-s.patch
Description: Binary data


Re: running logical replication as the subscription owner

2023-05-23 Thread Amit Kapila
On Mon, May 22, 2023 at 6:06 PM Masahiko Sawada  wrote:
>
> Thank you for updating the patch! Here are review comments:
>
> +   /*
> +* Make sure that the copy command runs as the table owner, unless
> +* the user has opted out of that behaviour.
> +*/
> +   run_as_owner = MySubscription->runasowner;
> +   if (!run_as_owner)
> +   SwitchToUntrustedUser(rel->rd_rel->relowner, &ucxt);
> +
> /* Now do the initial data copy */
> PushActiveSnapshot(GetTransactionSnapshot());
>
> I think we should switch users before the acl check in
> LogicalRepSyncTableStart().
>

Agreed, we should perform the ACL check as the user that is going to
perform operations on the target table. BTW, is it okay to perform an
operation on a system table as the switched user, as that would be
possible with your suggestion (see replorigin_create())?

>
> While this approach works, I'm not sure we really need a trigger for
> this test. I've attached a patch for discussion that doesn't use
> triggers for the regression tests. We create a new subscription owned
> by a user who doesn't have the permission to the target table. The
> test passes only if run_as_owner = true works.
>

Why in the test do you need to give additional permissions to
regress_admin2 when the actual operation has to be performed by the
table owner?

+# Because the initial data sync is working as the table owner, all
+# dat should be copied.

Typo. /dat/data

-- 
With Regards,
Amit Kapila.




Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-05-23 Thread Daniele Varrazzo
On Sat, 20 May 2023 at 00:01, Jacob Champion  wrote:

> - Some clients in the wild (psycopg2/psycopg) suppress all notifications
> during PQconnectPoll().

If there is anything we can improve in psycopg please reach out.

-- Daniele




Re: ERROR: no relation entry for relid 6

2023-05-23 Thread Richard Guo
On Tue, May 23, 2023 at 2:38 PM Richard Guo  wrote:

> I came across $subject on master and here is the query I'm running.
>
> create table t (a int unique, b int);
>
> explain (costs off)
> select 1 from t t1
> left join t t2 on true
>inner join t t3 on true
> left join t t4 on t2.a = t4.a and t2.a = t3.a;
> ERROR:  no relation entry for relid 6
>
> I looked into it and it should be caused by some problem in outer-join
> removal.  After we've decided that the join to t4 is removable, which is
> no problem, one of the things we need to do is to remove any joinquals
> referencing t4 from the joininfo lists.  In this query, there are two
> such quals, 't2.a = t4.a' and 't2.a = t3.a'.  And both quals have two
> versions, one with t1/t2 join in their nulling bitmap and one without.
> The former version would be treated as being "pushed down" because its
> required_relids exceed the scope of joinrelids, due to the t1/t2 join
> included in the qual's nulling bitmap but not included in joinrelids.
> And as a result this version of quals would be put back.  I think this
> is where the problem is.  Ideally we should not put them back.
>
> This issue seems to have existed for a while, and is revealed by the
> change in c8b881d2 recently.  I'm not sure how to fix it yet.  What I'm
> thinking is that maybe this has something to do with the loose ends we
> have in make_outerjoininfo.  Actually in this query the t1/t2 join
> cannot commute with the join to t4.  If we can know that in
> make_outerjoininfo, we'd have added t1/t2 join to the min_lefthand of
> the join to t4, and that would avoid this issue.
>

Considering that clone clauses should always be outer-join clauses not
filter clauses, I'm wondering if we can add an additional check for that
in RINFO_IS_PUSHED_DOWN, something like

 #define RINFO_IS_PUSHED_DOWN(rinfo, joinrelids) \
-   ((rinfo)->is_pushed_down || \
-!bms_is_subset((rinfo)->required_relids, joinrelids))
+   (!((rinfo)->has_clone || ((rinfo)->is_clone)) && \
+((rinfo)->is_pushed_down || \
+ !bms_is_subset((rinfo)->required_relids, joinrelids)))

This change can fix the case shown upthread.  But I doubt it's the
perfect fix we want.

Thanks
Richard


Re: running logical replication as the subscription owner

2023-05-23 Thread Ajin Cherian
On Mon, May 22, 2023 at 10:36 PM Masahiko Sawada  wrote:
>
> On Wed, May 17, 2023 at 10:10 AM Ajin Cherian  wrote:
> >
> > On Mon, May 15, 2023 at 10:47 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, May 15, 2023 at 5:44 PM Ajin Cherian  wrote:
> > > >
> > > > On Fri, May 12, 2023 at 9:55 PM Ajin Cherian  wrote:
> > > > >
> > > > > If nobody else is working on this, I can come up with a patch to fix 
> > > > > this
> > > > >
> > > >
> > > > Attaching a patch which attempts to fix this.
> > > >
> > >
> > > Thank you for the patch! I think we might want to have tests for it.
> > >
> > I have updated the patch with a test case as well.
>
> Thank you for updating the patch! Here are review comments:
>
> +   /*
> +* Make sure that the copy command runs as the table owner, unless
> +* the user has opted out of that behaviour.
> +*/
> +   run_as_owner = MySubscription->runasowner;
> +   if (!run_as_owner)
> +   SwitchToUntrustedUser(rel->rd_rel->relowner, &ucxt);
> +
> /* Now do the initial data copy */
> PushActiveSnapshot(GetTransactionSnapshot());
>
> I think we should switch users before the acl check in
> LogicalRepSyncTableStart().
>
> ---
> +# Create a trigger on table alice.unpartitioned that writes
> +# to a table that regress_alice does not have permission.
> +$node_subscriber->safe_psql(
> +   'postgres', qq(
> +SET SESSION AUTHORIZATION regress_alice;
> +CREATE OR REPLACE FUNCTION alice.alice_audit()
> +RETURNS trigger AS
> +\$\$
> +   BEGIN
> +   insert into public.admin_audit values(2);
> +   RETURN NEW;
> +   END;
> +\$\$
> +LANGUAGE 'plpgsql';
> +CREATE TRIGGER ALICE_TRIGGER AFTER INSERT ON alice.unpartitioned FOR EACH ROW
> +EXECUTE PROCEDURE alice.alice_audit();
> +ALTER TABLE alice.unpartitioned ENABLE ALWAYS TRIGGER ALICE_TRIGGER;
> +));
>
> While this approach works, I'm not sure we really need a trigger for
> this test. I've attached a patch for discussion that doesn't use
> triggers for the regression tests. We create a new subscription owned
> by a user who doesn't have the permission to the target table. The
> test passes only if run_as_owner = true works.
>
This is better, thanks. Since you are testing the run_as_owner = false
behaviour during the table copy phase, you might as well add a test case
verifying that it also behaves correctly during insert replication.

regards,
Ajin Cherian
Fujitsu Australia




Re: ResourceOwner refactoring

2023-05-23 Thread Aleksander Alekseev
Hi,

> [...]
> A-ha, it's because I didn't add the new test_resowner directory to the
> src/test/modules/meson.build file. Fixed.
>
> Thanks for the review!

You probably already noticed, but for the record: cfbot seems to be
extremely unhappy with the patch.

-- 
Best regards,
Aleksander Alekseev




Re: pgsql: Move snowball_create.sql creation into perl file

2023-05-23 Thread Christoph Berg
Re: Andres Freund
> Move snowball_create.sql creation into perl file
> 
> This is in preparation for building postgres with meson / ninja.
> 
> We already have duplicated code for this between the make and msvc
> builds. Adding a third copy seems like a bad plan, thus move the generation
> into a perl script.
> 
> As we don't want to rely on perl being available for builds from tarballs,
> generate the file during distprep.
> 
> Author: Peter Eisentraut 
> Author: Andres Freund 
> Discussion: 
> https://postgr.es/m/5e216522-ba3c-f0e6-7f97-5276d0270...@enterprisedb.com

Hi,

this seems to have broken out-of-tree builds from tarballs:

make -C backend/snowball install
make[3]: Entering directory 
'/srv/projects/postgresql/debian/16/build/src/backend/snowball'
/bin/mkdir -p 
'/srv/projects/postgresql/debian/16/build/tmp_install/usr/lib/postgresql/16/lib'
/bin/mkdir -p 
'/srv/projects/postgresql/debian/16/build/tmp_install/usr/share/postgresql/16' 
'/srv/projects/postgresql/debian/16/build/tmp_install/usr/share/postgresql/16/tsearch_data'
/usr/bin/install -c -m 755  dict_snowball.so 
'/srv/projects/postgresql/debian/16/build/tmp_install/usr/lib/postgresql/16/lib/dict_snowball.so'
/usr/bin/install -c -m 644 snowball_create.sql 
'/srv/projects/postgresql/debian/16/build/tmp_install/usr/share/postgresql/16'
/usr/bin/install: cannot stat 'snowball_create.sql': No such file or directory
make[3]: *** [Makefile:110: install] Error 1

The file is present in src/backend/snowball/ but not in 
build/src/backend/snowball/:

-rw-r--r-- 1 myon myon 44176 22. Mai 21:20 
src/backend/snowball/snowball_create.sql

Christoph




Re: pgsql: Move snowball_create.sql creation into perl file

2023-05-23 Thread Christoph Berg
Re: To Andres Freund
> this seems to have broken out-of-tree builds from tarballs:
> 
> /usr/bin/install -c -m 644 snowball_create.sql 
> '/srv/projects/postgresql/debian/16/build/tmp_install/usr/share/postgresql/16'
> /usr/bin/install: cannot stat 'snowball_create.sql': No such file or directory

Fortunately, there is an easy workaround, just delete
src/backend/snowball/snowball_create.sql before building, it will then
be recreated in the proper build directory.

Christoph




Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

2023-05-23 Thread Christoph Berg
Re: Andres Freund
> A prototype of that approach is attached. I pushed the retry handling into the
> pg_* routines where applicable.  I guess we could add pg_* routines for
> FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.
> 
> Christoph, could you verify this fixes your issue?

Hi,

I believe this issue is still open for PG16.

Christoph




Re: pgsql: Move snowball_create.sql creation into perl file

2023-05-23 Thread Tom Lane
Christoph Berg  writes:
>> this seems to have broken out-of-tree builds from tarballs:
>> 
>> /usr/bin/install -c -m 644 snowball_create.sql 
>> '/srv/projects/postgresql/debian/16/build/tmp_install/usr/share/postgresql/16'
>> /usr/bin/install: cannot stat 'snowball_create.sql': No such file or 
>> directory

I think the attached will do for a proper fix.  I'm not inclined
to re-wrap just for this.

regards, tom lane

diff --git a/src/backend/snowball/Makefile b/src/backend/snowball/Makefile
index 29076371db..4bebfa0250 100644
--- a/src/backend/snowball/Makefile
+++ b/src/backend/snowball/Makefile
@@ -106,10 +106,13 @@ $(SQLSCRIPT): snowball_create.pl snowball_func.sql.in snowball.sql.in
 
 distprep: $(SQLSCRIPT)
 
-install: all installdirs install-lib
-	$(INSTALL_DATA) $(SQLSCRIPT) '$(DESTDIR)$(datadir)'
+install: all installdirs install-lib install-script
 	$(INSTALL_DATA) $(addprefix $(srcdir)/stopwords/,$(stop_files)) '$(DESTDIR)$(datadir)/$(DICTDIR)'
 
+# $(SQLSCRIPT) might be in the srcdir or the build dir
+install-script: $(SQLSCRIPT)
+	$(INSTALL_DATA) $< '$(DESTDIR)$(datadir)'
+
 installdirs: installdirs-lib
 	$(MKDIR_P) '$(DESTDIR)$(datadir)' '$(DESTDIR)$(datadir)/$(DICTDIR)'
 


Re: pgsql: Move snowball_create.sql creation into perl file

2023-05-23 Thread Christoph Berg
Re: Tom Lane
> I think the attached will do for a proper fix.  I'm not inclined
> to re-wrap just for this.

Sure, I just posted it here in case others run into the same problem.

Thanks!

Christoph




Re: Zstandard support for toast compression

2023-05-23 Thread Aleksander Alekseev
Hi,

> Yeah - I think we had better reserve the fourth bit pattern for
> something extensible e.g. another byte or several to specify the
> actual method, so that we don't have a hard limit of 4 methods.

TWIMC there is an ongoing discussion [1] of making TOAST pointers
extendable since this is a dependency for several patches that are
currently in development.

TL;DR the consensus so far seems to be to use varattrib_1b_e.va_tag as
the marker of an alternative / extended TOAST pointer format. For
on-disk values va_tag currently stores the constant 18 (VARTAG_ONDISK),
where 18 is sizeof(varatt_external) plus 2 bytes of header, which is
not terribly useful in itself anyway. If you are interested in the topic
please consider joining the thread.
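
For readers without postgres.h at hand, this is roughly the struct in
question (a simplified copy for illustration only, comments mine):

    /* simplified from src/include/postgres.h */
    typedef struct
    {
        uint8       va_header;  /* marks the datum as a 1-byte-header external varlena */
        uint8       va_tag;     /* kind of external datum; on disk this is
                                 * currently always VARTAG_ONDISK (18) */
        char        va_data[FLEXIBLE_ARRAY_MEMBER];    /* e.g. a varatt_external */
    } varattrib_1b_e;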

[1]: 
https://postgr.es/m/CAN-LCVMq2X%3Dfhx7KLxfeDyb3P%2BBXuCkHC0g%3D9GF%2BJD4izfVa0Q%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev




Re: drop postmaster symlink

2023-05-23 Thread Hans Buschmann
When creating a private service for another PostgreSQL instance, I used the
postgresql-15.service file installed into /usr/lib/systemd/system on Fedora 38
as a template (it is provided by the PostgreSQL 15.3 packages from the PGDG
repositories).


There I noticed that the line ExecStart still uses the postmaster link.


I would propose changing it to

ExecStart=/usr/pgsql-15/bin/postgres -D ${PGDATA}


This change should also be applied to the back branches, to avoid using
deprecated links in PGDG software.

It seems to be necessary for the upcoming PG16 as well.


(BTW: where is this all documented?)


Hans Buschmann


Re: pgsql: TAP test for logical decoding on standby

2023-05-23 Thread Robert Haas
On Sat, Apr 8, 2023 at 5:26 AM Andres Freund  wrote:
> TAP test for logical decoding on standby

Small nitpicks:

1. The test names generated by check_slots_conflicting_status() start
with a capital letter, while most other test names start with a
lower-case letter.

2. The function is called 7 times, 6 with a true argument and 1 with a
false argument, but the test name only depends on whether the argument
is true or false, so we get the same test name 6 times. Maybe there's
not a reasonable way to do better, I'm not sure, but it's not ideal.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-05-23 Thread Jacob Champion
On Tue, May 23, 2023 at 4:22 AM Daniele Varrazzo
 wrote:
> On Sat, 20 May 2023 at 00:01, Jacob Champion  wrote:
> > - Some clients in the wild (psycopg2/psycopg) suppress all notifications
> > during PQconnectPoll().
>
> If there is anything we can improve in psycopg please reach out.

Will do, thank you! But in this case, I think there's nothing to
improve in psycopg -- in fact, it highlighted the problem with my
initial design, and now I think the notice processor will never be an
appropriate avenue for communicating the user code.

The biggest issue is that there's a chicken-and-egg situation: if
you're using the synchronous PQconnect* API, you can't override the
notice hooks while the handshake is in progress, because you don't
have a connection handle yet. The second problem is that there are a
bunch of parameters coming back from the server (user code,
verification URI, expiration time) that the application may choose to
display or use, and communicating those pieces in a (probably already
translated) flat text string is a pretty hostile API.

So I think we'll probably need to provide a global handler API,
similar to the passphrase hook we currently provide, that can receive
these pieces separately and assemble them however the application
desires. The hard part will be to avoid painting ourselves into a
corner, because this particular information is specific to the device
authorization flow, and if we ever want to add other flows into libpq,
we'll probably not want to add even more hooks.

Thanks,
--Jacob




memory leak in trigger handling (since PG12)

2023-05-23 Thread Tomas Vondra
Hi,

it seems there's a fairly annoying memory leak in trigger code,
introduced by

  commit fc22b6623b6b3bab3cb057ccd282c2bfad1a0b30
  Author: Peter Eisentraut 
  Date:   Sat Mar 30 08:13:09 2019 +0100

Generated columns
...

which added GetAllUpdatedColumns() and uses it in many places instead of
the original GetUpdatedColumns():

   #define GetUpdatedColumns(relinfo, estate) \
       (exec_rt_fetch((relinfo)->ri_RangeTableIndex, estate)->updatedCols)

   #define GetAllUpdatedColumns(relinfo, estate) \
       (bms_union(exec_rt_fetch((relinfo)->ri_RangeTableIndex, estate)->updatedCols, \
                  exec_rt_fetch((relinfo)->ri_RangeTableIndex, estate)->extraUpdatedCols))

Notice this creates a new bitmap on every call. That's a bit of a
problem, because we call this from:

  - ExecASUpdateTriggers
  - ExecARUpdateTriggers
  - ExecBSUpdateTriggers
  - ExecBRUpdateTriggers
  - ExecUpdateLockMode

This means that for an UPDATE with triggers, we may end up calling this
for each row, possibly creating multiple bitmaps per row. And those
bitmaps are allocated in ExecutorState, so they won't be freed until the
end of the query :-(

The bitmaps are typically fairly small (a couple of bytes), but for wider
tables they can be a couple dozen bytes. The total is primarily driven by
the number of updated rows, though.

It's easy to leak gigabytes when updating ~10M rows. I've seen cases
with a couple tens of GB leaked, though in that case it seems to have been
caused by an UPDATE ... FROM missing a join condition (so the memory leak
is in fact proportional to the number of rows in the join result, not the
number we end up updating).

Attached is a patch, restoring the pre-12 behavior for me.


While looking for other places allocating stuff in ExecutorState (for
the UPDATE case) and leaving it there, I found two more cases:

1) copy_plpgsql_datums

2) make_expanded_record_from_tupdesc
   make_expanded_record_from_exprecord

All of these are called from plpgsql_exec_trigger.

Fixing the copy_plpgsql_datums case seems fairly simple: the space
allocated for the local copies can be freed during cleanup. That's what
0002 does.

I'm not sure what to do about the expanded records. My understanding of
the expanded record lifecycle is fairly limited, so my (rather naive)
attempt to free the memory failed ...


I wonder how much we should care about these cases. On the one hand we
often leave the cleanup up to the memory context, but that assumes the
context is not unnecessarily long-lived; ExecutorState does not meet that
assumption. And leaking memory per row does not seem great either.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 85439e2587f035040c82e7303b96b887e650b01f Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 23 May 2023 17:45:47 +0200
Subject: [PATCH 1/2] Free updatedCols bitmaps

---
 src/backend/commands/trigger.c  | 22 --
 src/backend/executor/execMain.c |  9 +++--
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 4b295f8da5e..fb11203c473 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2916,6 +2916,8 @@ ExecBSUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
 	(errcode(ERRCODE_E_R_I_E_TRIGGER_PROTOCOL_VIOLATED),
 	 errmsg("BEFORE STATEMENT trigger cannot return a value")));
 	}
+
+	bms_free(updatedCols);
 }
 
 void
@@ -2928,12 +2930,18 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 	Assert(relinfo->ri_RootResultRelInfo == NULL);
 
 	if (trigdesc && trigdesc->trig_update_after_statement)
+	{
+		Bitmapset *updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
+
 		AfterTriggerSaveEvent(estate, relinfo, NULL, NULL,
 			  TRIGGER_EVENT_UPDATE,
 			  false, NULL, NULL, NIL,
-			  ExecGetAllUpdatedCols(relinfo, estate),
+			  updatedCols,
 			  transition_capture,
 			  false);
+
+		bms_free(updatedCols);
+	}
 }
 
 bool
@@ -3043,6 +3051,9 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 heap_freetuple(trigtuple);
 			if (should_free_new)
 heap_freetuple(oldtuple);
+
+			bms_free(updatedCols);
+
 			return false;		/* "do nothing" */
 		}
 		else if (newtuple != oldtuple)
@@ -3068,6 +3079,8 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 	if (should_free_trig)
 		heap_freetuple(trigtuple);
 
+	bms_free(updatedCols);
+
 	return true;
 }
 
@@ -3107,6 +3120,7 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 		 */
 		TupleTableSlot *oldslot;
 		ResultRelInfo *tupsrc;
+		Bitmapset *updatedCols;
 
 		Assert((src_partinfo != NULL && dst_partinfo != NULL) ||
 			   !is_crosspart_update);
@@ -3129,14 +3143,18 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 		else
 			ExecClearTuple(oldslot);
 
+		updatedCols = ExecGetAllUpdatedCols(relinfo, estate);
+

Re: RFI: Extending the TOAST Pointer

2023-05-23 Thread Robert Haas
On Thu, May 18, 2023 at 8:06 AM Matthias van de Meent
 wrote:
> This enum still has many options to go before it exceeds the maximum
> of the uint8 va_tag field. Therefore, I don't think we have no disk
> representations left, nor do I think we'll need to add another option
> to the ToastCompressionId enum.
> As an example, we can add another VARTAG option for dictionary-enabled
> external toast; like what the pluggable toast patch worked on. I think
> we can salvage some ideas from that patch, even if the main idea got
> stuck.

Adding another VARTAG option is somewhat different from adding another
ToastCompressionId. I think that right now we have embedded in various
places the idea that VARTAG_EXTERNAL is the only thing that shows up
on disk, and we'd need to track down all such places and adjust them
if we add other VARTAG types in the future. Depending on how it is to
be used, adding a new ToastCompressionId might be less work. However,
I don't think we can use the possibility of adding a new VARTAG value
as a reason why it's OK to use up the last possible ToastCompressionId
value for something non-extensible.
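
For reference, the reason only one value remains is that the compression
method of an on-disk TOAST value is stored in a two-bit field; a simplified
sketch (illustrative only, not a verbatim copy of the headers):

    /* simplified from src/include/access/toast_compression.h */
    typedef enum ToastCompressionId
    {
        TOAST_PGLZ_COMPRESSION_ID = 0,
        TOAST_LZ4_COMPRESSION_ID = 1,
        TOAST_INVALID_COMPRESSION_ID = 2
    } ToastCompressionId;

    /*
     * The method is kept in the top 2 bits of a 32-bit word whose low 30 bits
     * hold the raw (uncompressed) size, so at most four bit patterns can ever
     * appear on disk without changing the pointer format.
     */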

For projects like this, the details matter a lot. If the goal is to
add a new compression type that behaves like the existing compression
types, more or less, then I think we should allocate the last
ToastCompressionId bit to mean "some other compression ID" and add a
1-byte header that says which one is in use. But if the new feature
being added is enough different from the other compression methods,
then it might be better to do it in some other way e.g. a new VARTAG.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: memory leak in trigger handling (since PG12)

2023-05-23 Thread Tom Lane
Tomas Vondra  writes:
> it seems there's a fairly annoying memory leak in trigger code,
> introduced by
> ...
> Attached is a patch, restoring the pre-12 behavior for me.

> While looking for other places allocating stuff in ExecutorState (for
> the UPDATE case) and leaving it there, I found two more cases:

> 1) copy_plpgsql_datums

> 2) make_expanded_record_from_tupdesc
>make_expanded_record_from_exprecord

> All of this is calls from plpgsql_exec_trigger.

Not sure about the expanded-record case, but both of your other two
fixes feel like poor substitutes for pushing the memory into a
shorter-lived context.  In particular I'm quite surprised that
plpgsql isn't already allocating that workspace in the "procedure"
memory context.

> I wonder how much we should care about these cases. On the one hand we
> often leave the cleanup up to the memory context, but the assumption is
> the context is not unnecessarily long-lived. And ExecutorState is not
> that. And leaking memory per-row does not seem great either.

I agree per-row leaks in the ExecutorState context are not cool.

regards, tom lane




Wrong command name in writeable-CTE related error messages

2023-05-23 Thread Markus Winand
Hi!

I noticed that errors due to writable CTEs in a read-only or non-volatile
context say the offending command is SELECT.

For example, a writable CTE in an IMMUTABLE function:

 CREATE TABLE t (x INTEGER);

 CREATE FUNCTION immutable_func()
  RETURNS INTEGER
  LANGUAGE SQL
  IMMUTABLE
  AS $$
  WITH x AS (
INSERT INTO t (x) VALUES (1) RETURNING x
  ) SELECT * FROM x;
  $$;

 SELECT immutable_func();

 ERROR:  SELECT is not allowed in a non-volatile function

Or a writable CTE in a read-only transaction:

 START TRANSACTION READ ONLY;
 WITH x AS (
   INSERT INTO t (x) VALUES (1) RETURNING x
 )
 SELECT * FROM x;

 ERROR:  cannot execute SELECT in a read-only transaction

My first thought was that these error messages should mention INSERT, but after
looking into the source I'm not sure anymore. The name of the command is
obtained from CreateCommandName(). After briefly looking around, it doesn't seem
to be trivial to introduce something along the lines of
CreateModifyingCommandName().

So I started by using a different error message at those places where I think
it should differ. I've attached a patch for reference, but I'm not happy with it.
In particular I'm unsure about the SPI stuff (how to test?) and whether there are
more cases than those covered by the patch. Ultimately, getting hold of the
command name might also be beneficial for a new error message.

  A WITH clause containing a data-modifying statement is not allowed in a 
read-only transaction

It wouldn’t make me sad if somebody who touches the code more often than once 
every few years can take care of it.

-markus


WIP-writable_cte_error_message-001.patch
Description: Binary data


Re: memory leak in trigger handling (since PG12)

2023-05-23 Thread Andres Freund
Hi,

On 2023-05-23 18:23:00 +0200, Tomas Vondra wrote:
> This means that for an UPDATE with triggers, we may end up calling this
> for each row, possibly multiple bitmaps. And those bitmaps are allocated
> in ExecutorState, so won't be freed until the end of the query :-(

Ugh.


I've wondered about some form of instrumentation to detect such issues
before. It's obviously a problem that we can have fairly large leaks, like the
one you just discovered, without detecting it for a couple years. It's kinda
made worse by the memory context infrastructure, because it hides such issues.

Could it help to have a mode where the executor shutdown hook checks how much
memory is allocated in ExecutorState and warns if it's too much? There's IIRC a
few places that allocate large things directly in it, but most of those
probably should be dedicated contexts anyway.  Something similar could be
useful for some other long-lived contexts.


> The bitmaps are typically fairly small (a couple bytes), but for wider
> tables it can be a couple dozen bytes. But it's primarily driven by
> number of updated rows.

Random aside: I've been wondering whether it'd be worth introducing an
in-place representation of Bitmapset (e.g. if the low bit is set, the low 63
bits are stored in-place; if unset, it's a pointer).
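
A minimal sketch of what I mean, assuming 64-bit pointers and the usual c.h
typedefs (the names are invented, purely illustrative):

    struct Bitmapset;                   /* the existing node type */
    typedef uintptr_t BitmapRef;        /* hypothetical tagged representation */

    static inline bool
    bmref_is_inline(BitmapRef v)
    {
        return (v & 1) != 0;            /* low bit set => 63 bits stored in place */
    }

    static inline BitmapRef
    bmref_from_word(uint64 members)     /* caller ensures the value fits in 63 bits */
    {
        return (BitmapRef) ((members << 1) | 1);
    }

    static inline struct Bitmapset *
    bmref_as_pointer(BitmapRef v)
    {
        return (struct Bitmapset *) v;  /* low bit clear => ordinary palloc'd set */
    }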


> Attached is a patch, restoring the pre-12 behavior for me.

Hm. Somehow this doesn't seem quite right. Shouldn't we try to use a shorter
lived memory context instead? Otherwise we'll just end up with the same
problem in a few years.


Greetings,

Andres Freund




Re: memory leak in trigger handling (since PG12)

2023-05-23 Thread Tom Lane
Andres Freund  writes:
> I've wondered about some form of instrumentation to detect such issues
> before.

Yeah.

> Could it help to have a mode where the executor shutdown hook checks how much
> memory is allocated in ExecutorState and warns if it's too much?

It'd be very hard to set a limit for what's "too much", since the amount
of stuff created initially will depend on the plan size.  In any case
I think the important issue is not the absolute amount of space, but
whether there is per-row leakage.  I wonder if we could do something
involving checking for continued growth after the first retrieved tuple,
or something like that.

> Random aside: I've been wondering whether it'd be worth introducing an
> in-place representation of Bitmap (e.g. if the low bit is set, the low 63 bits
> are in-place, if unset, it's a pointer).

Why?  Unlike Lists, those things are already a single palloc chunk.

regards, tom lane




Use COPY for populating all pgbench tables

2023-05-23 Thread Tristan Partin
Hello,

We (Neon) have noticed that pgbench can be quite slow to populate data
over higher-latency connections, and higher scale factors exacerbate
the problem. Some employees work on a different continent from the
databases they might be benchmarking. By moving pgbench to use COPY for
populating all tables, we can reduce some of the time pgbench takes for
this particular step, as sketched below.
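
To give an idea of the shape of the change, the table-population loop goes
from one INSERT per row to a single COPY ... FROM STDIN stream per table,
roughly like this (a simplified sketch with abbreviated error handling, not
the patch itself):

    PGresult   *res = PQexec(con, "copy pgbench_branches(bid, bbalance) from stdin");

    if (PQresultStatus(res) != PGRES_COPY_IN)
        pg_fatal("unexpected copy in result");
    PQclear(res);

    for (int64 i = 0; i < nbranches * scale; i++)
    {
        char        line[64];

        snprintf(line, sizeof(line), INT64_FORMAT "\t0\n", i + 1);
        if (PQputline(con, line))
            pg_fatal("PQputline failed");
    }

    if (PQputline(con, "\\.\n"))
        pg_fatal("very last PQputline failed");
    if (PQendcopy(con))
        pg_fatal("PQendcopy failed");

One round trip per table (with the rows streamed) instead of one round trip
per INSERT is what makes the difference on high-latency links.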

I wanted to include benchmarks, but unfortunately I won't have them
until next month. I will follow up in a future email.

-- 
Tristan Partin
Neon (https://neon.tech)
From 498f5c7fddceafed3f831963dfca12256746390c Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 09:21:55 -0500
Subject: [PATCH postgres v1 1/2] Move constant into format string

If we are always going to write a 0, just put the 0 in the format
string.
---
 src/bin/pgbench/pgbench.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 70ed034e70..0fbb9a2616 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4939,8 +4939,8 @@ initGenerateDataClientSide(PGconn *con)
 
 		/* "filler" column defaults to blank padded empty string */
 		printfPQExpBuffer(&sql,
-		  INT64_FORMAT "\t" INT64_FORMAT "\t%d\t\n",
-		  j, k / naccounts + 1, 0);
+		  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+		  j, k / naccounts + 1);
 		if (PQputline(con, sql.data))
 			pg_fatal("PQputline failed");
 
-- 
-- 
Tristan Partin
Neon (https://neon.tech)

From 6a326d6e3197142ee03140679ebedc978ee1485c Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 23 May 2023 11:48:16 -0500
Subject: [PATCH postgres v1 2/2] Use COPY instead of INSERT for populating all
 tables

COPY is a better interface for bulk loading and/or high latency
connections. Previously COPY was only used for the pgbench_accounts
table because the amount of data was so much larger than the other
tables. However, as already said there are also gains to be had in the
high latency case, such as loading data across continents.
---
 src/bin/pgbench/pgbench.c | 152 ++
 1 file changed, 88 insertions(+), 64 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 0fbb9a2616..6c7f143e86 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4861,17 +4861,46 @@ initTruncateTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
-/*
- * Fill the standard tables with some data generated and sent from the client
- */
+typedef void initRow(PGconn *con, PQExpBufferData *sql, int64 curr);
+
 static void
-initGenerateDataClientSide(PGconn *con)
+initBranch(PGconn *con, PQExpBufferData *sql, int64 curr)
 {
-	PQExpBufferData sql;
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t0\t\n",
+	  curr + 1);
+}
+
+static void
+initTeller(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to NULL */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / ntellers + 1);
+}
+
+static void
+initAccount(PGconn *con, PQExpBufferData *sql, int64 curr)
+{
+	/* "filler" column defaults to blank padded empty string */
+	printfPQExpBuffer(sql,
+	  INT64_FORMAT "\t" INT64_FORMAT "\t0\t\n",
+	  curr + 1, curr / naccounts + 1);
+}
+
+static void
+initPopulateTable(PGconn *con, const char *table, int64 base, initRow *init_row)
+{
+	int 		n;
+	int64 		k;
 	PGresult   *res;
-	int			i;
-	int64		k;
-	char	   *copy_statement;
+	PQExpBufferData sql;
+	bool		printed = false;
+	char 		copy_statement[256];
+	const char *copy_statement_fmt;
+	const int64 total = base * scale;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4880,50 +4909,19 @@ initGenerateDataClientSide(PGconn *con)
 	/* Stay on the same line if reporting to a terminal */
 	char		eol = isatty(fileno(stderr)) ? '\r' : '\n';
 
-	fprintf(stderr, "generating data (client-side)...\n");
-
-	/*
-	 * we do all of this in one transaction to enable the backend's
-	 * data-loading optimizations
-	 */
-	executeStatement(con, "begin");
-
-	/* truncate away any old data */
-	initTruncateTables(con);
-
 	initPQExpBuffer(&sql);
 
-	/*
-	 * fill branches, tellers, accounts in that order in case foreign keys
-	 * already exist
-	 */
-	for (i = 0; i < nbranches * scale; i++)
-	{
-		/* "filler" column defaults to NULL */
-		printfPQExpBuffer(&sql,
-		  "insert into pgbench_branches(bid,bbalance) values(%d,0)",
-		  i + 1);
-		executeStatement(con, sql.data);
-	}
-
-	for (i = 0; i < ntellers * scale; i++)
-	{
-		/* "filler" column defaults to NULL */
-		printfPQExpBuffer(&sql,
-		  "insert into pgbench_tellers(tid,bid,tbalance) values (%d,%d,0)",
-		  i + 1, i / ntellers + 1);
-		executeStatement(con, sql.data);
-	}
-
-	/*
-	 * accounts is big enough to be worth using COPY and tracking runtime
-	 */
-
 	/* use COPY with FREEZ

Re: Wrong command name in writeable-CTE related error messages

2023-05-23 Thread Tom Lane
Markus Winand  writes:
> I noticed that errors due to writable CTEs in a read-only or non-volatile
> context say the offending command is SELECT.

Good point.

> My first thought was that these error messages should mention INSERT, but 
> after looking into the source I’m not sure anymore. The name of the command 
> is obtained from CreateCommandName(). After briefly looking around it doesn’t 
> seem to be trivial to introduce something along the line of 
> CreateModifyingCommandName().

Yeah, you would have to inspect the plan tree pretty carefully to
determine that.

Given the way the test is written, maybe it'd make sense to forget about
mentioning the command name, and instead identify the table we are
complaining about:

ERROR: table "foo" cannot be modified in a read-only transaction

I don't see any huge point in using PreventCommandIfReadOnly if we
go that way, so no refactoring is needed: just test XactReadOnly
directly.
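
Something roughly like this at the point where the target relation is known
(just a sketch, untested, and the temp-table exemption would need thought):

    if (XactReadOnly && !rel->rd_islocaltemp)
        ereport(ERROR,
                (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
                 errmsg("table \"%s\" cannot be modified in a read-only transaction",
                        RelationGetRelationName(rel))));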

regards, tom lane




Re: memory leak in trigger handling (since PG12)

2023-05-23 Thread Andres Freund
Hi,

On 2023-05-23 13:28:30 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Could it help to have a mode where the executor shutdown hook checks how 
> > much
> > memory is allocated in ExecutorState and warns if its too much?
>
> It'd be very hard to set a limit for what's "too much", since the amount
> of stuff created initially will depend on the plan size.

I was thinking of some limit that should really never be reached outside of a
leak or work_mem based allocations, say 2GB or so.


> In any case I think that the important issue is not how much absolute space,
> but is there per-row leakage.  I wonder if we could do something involving
> checking for continued growth after the first retrieved tuple, or something
> like that.

As-is, some nodes do large allocations in the query context, but are not
guaranteed to be reached when gathering the first row. So we would still have
to move such large allocations out of ExecutorState.


I think it might be best to go for a combination of these two
heuristics. Store the size of es_query_cxt after standard_ExecutorStart();
that would include the allocation of the executor tree itself. Then in
standard_ExecutorEnd(), if the difference in size of ExecutorState is bigger
than some constant *and* is bigger than the initial size by some factor, emit
a warning.

The constant size difference avoids spurious warnings in the case of a small
plan that just grows due to a few fmgr lookups or such; the factor takes care
of the plan complexity?
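
Roughly something like this, perhaps (the new EState field and the thresholds
are invented for illustration, not actual code):

    /* at the end of standard_ExecutorStart(): */
    estate->es_size_after_start =
        MemoryContextMemAllocated(estate->es_query_cxt, true);

    /* in standard_ExecutorEnd(), before the context is destroyed: */
    {
        Size        size_now = MemoryContextMemAllocated(estate->es_query_cxt, true);

        if (size_now > estate->es_size_after_start + 128 * 1024 * 1024 &&
            size_now > estate->es_size_after_start * 4)
            elog(WARNING, "ExecutorState grew from %zu to %zu bytes",
                 estate->es_size_after_start, size_now);
    }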


> > Random aside: I've been wondering whether it'd be worth introducing an
> > in-place representation of Bitmap (e.g. if the low bit is set, the low 63 
> > bits
> > are in-place, if unset, it's a pointer).
>
> Why?  Unlike Lists, those things are already a single palloc chunk.

We do a fair number of 8-byte allocations - they have quite a bit of overhead,
even after c6e0fe1f2a0. Not needing allocations for the common case of
bitmapsets with a max member < 63 seems like it could be worth it.

Greetings,

Andres Freund




Re: memory leak in trigger handling (since PG12)

2023-05-23 Thread Tom Lane
Andres Freund  writes:
> On 2023-05-23 13:28:30 -0400, Tom Lane wrote:
>> Why?  Unlike Lists, those things are already a single palloc chunk.

> We do a fair amount of 8 byte allocations - they have quite a bit of overhead,
> even after c6e0fe1f2a0. Not needing allocations for the common case of
> bitmapsets with a max member < 63 seems like it could be worth it.

Oh, now I understand what you meant: use the pointer's bits as data.
Dunno that it's a good idea though.  You'd pay for the palloc savings
by needing two or four code paths in every bitmapset function, because
the need to reserve one bit would mean you couldn't readily make the
two cases look alike at the bit-pushing level.

Another big problem is that we'd have to return to treating bitmapsets
as a special-purpose thing rather than a kind of Node.  While that's
not very deeply embedded yet, I recall that the alternatives weren't
attractive.

Also, returning to the original topic: we'd never find leaks of the
sort complained of here, because they wouldn't exist in cases with
fewer than 64 relations per query (or whatever the bitmap is
representing).

regards, tom lane




Re: pgsql: Move snowball_create.sql creation into perl file

2023-05-23 Thread Andres Freund
Hi,

On 2023-05-23 10:46:30 -0400, Tom Lane wrote:
> Christoph Berg  writes:
> >> this seems to have broken out-of-tree builds from tarballs:
> >> 
> >> /usr/bin/install -c -m 644 snowball_create.sql 
> >> '/srv/projects/postgresql/debian/16/build/tmp_install/usr/share/postgresql/16'
> >> /usr/bin/install: cannot stat 'snowball_create.sql': No such file or 
> >> directory
> 
> I think the attached will do for a proper fix.

Thanks.


> I'm not inclined to re-wrap just for this.

Agreed.


I wonder if we should add a CI task to test creating a tarball and building
from it, both inside the source directory and as a vpath build? We rebuild for
both gcc and clang, each with assertions and without, to check if there are
warnings. We could probably just switch to building from the tarball for some
of those.

I guess I need to go and check how long the "release" tarball generation
takes...

Greetings,

Andres Freund




Re: ERROR: no relation entry for relid 6

2023-05-23 Thread Tom Lane
Richard Guo  writes:
>> create table t (a int unique, b int);
>> 
>> explain (costs off)
>> select 1 from t t1
>> left join t t2 on true
>> inner join t t3 on true
>> left join t t4 on t2.a = t4.a and t2.a = t3.a;
>> ERROR:  no relation entry for relid 6

Ugh.

> Considering that clone clauses should always be outer-join clauses not
> filter clauses, I'm wondering if we can add an additional check for that
> in RINFO_IS_PUSHED_DOWN, something like

>  #define RINFO_IS_PUSHED_DOWN(rinfo, joinrelids) \
> -   ((rinfo)->is_pushed_down || \
> -!bms_is_subset((rinfo)->required_relids, joinrelids))
> +   (!((rinfo)->has_clone || ((rinfo)->is_clone)) && \
> +((rinfo)->is_pushed_down || \
> + !bms_is_subset((rinfo)->required_relids, joinrelids)))

I don't like that one bit; it seems way too likely to introduce
bad side-effects elsewhere.

I think the real issue is that "is pushed down" has never been a
conceptually accurate way of thinking about what
remove_rel_from_query needs to do here.  Using RINFO_IS_PUSHED_DOWN
happened to work up to now, but it's an abuse of that macro, and
changing the macro's behavior isn't the right way to fix it.

Having said that, I'm not sure what is a better way to think about it.
It seems like our data structure doesn't have a clear tie between
restrictinfos and their original join level, or at least, to the extent
that it did the "clone clause" mechanism has broken it.

I wonder if we could do something involving recording the
rinfo_serial numbers of all the clauses extracted from a particular
syntactic join level (we could keep this in a bitmapset attached
to each SpecialJoinInfo, perhaps) and then filtering the joinclauses
on the basis of serial numbers instead of required_relids.
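
To illustrate the filtering side of that idea (the SpecialJoinInfo field is
hypothetical, just a sketch):

    /*
     * sjinfo->joinclause_serials would be a new Bitmapset recording the
     * rinfo_serial of every clause extracted from this syntactic join level.
     */
    static bool
    clause_is_from_join_level(RestrictInfo *rinfo, SpecialJoinInfo *sjinfo)
    {
        return bms_is_member(rinfo->rinfo_serial, sjinfo->joinclause_serials);
    }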

regards, tom lane




Re: pgsql: Move snowball_create.sql creation into perl file

2023-05-23 Thread Tom Lane
[ dropping -packagers ]

Andres Freund  writes:
> I guess I need to go and check how long the "release" tarball generation
> takes...

It's quick except for the documentation-generating steps.  Maybe
we could test that part only once?

regards, tom lane




Docs: Encourage strong server verification with SCRAM

2023-05-23 Thread Jacob Champion
Hi all,

As touched on in past threads, our SCRAM implementation is slightly
nonstandard and doesn't always protect the entirety of the
authentication handshake:

- the username in the startup packet is not covered by the SCRAM
crypto and can be tampered with if channel binding is not in effect,
or by an attacker holding only the server's key

- low iteration counts accepted by the client make it easier than it
probably should be for a MITM to brute-force passwords (note that
PG16's scram_iterations GUC, being server-side, does not mitigate
this)

- by default, a SCRAM exchange can be exited by the server prior to
sending its verifier, skipping the client's server authentication step
(this is mitigated by requiring channel binding, and PG16 adds
require_auth=scram-sha-256 to help as well)

These aren't currently considered security vulnerabilities, but it'd
be good for the documentation to call them out, considering mutual
authentication is one of the design goals of the SCRAM RFC. (I'd also
like to shore up these problems, eventually, to make SCRAM-based
mutual authn viable with Postgres. But that work has stalled a bit on
my end.)

Here's a patch to explicitly warn people away from SCRAM as a form of
server authentication, and nudge them towards a combination with
verified TLS or gssenc. I've tried to keep the text version-agnostic,
to make a potential backport easier. Is this a good place for the
warning to go? Should I call out that GSS can't use channel binding,
or promote the use of TLS versus GSS for SCRAM, or just keep it
simple?

Thanks,
--Jacob
From 5b346313f1cecd3c4c79b6e104094e50bb1cfa75 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 22 May 2023 16:46:23 -0700
Subject: [PATCH] docs: encourage strong server verification with SCRAM

The server verification in libpq's SCRAM implementation can be subverted
in various ways. While mitigations for some of the issues exist, it's
probably wise for most users to combine it with strong server
authentication, to avoid entering a SCRAM exchange with an untrusted
server. Recommend that in the docs.
---
 doc/src/sgml/runtime.sgml | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index dbe23db54f..cf93d9443c 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -2021,6 +2021,21 @@ pg_dumpall -p 5432 | psql -d postgres -p 5433
 authentication with them.  The TCP client must connect
 using gssencmode=require.
   
+
+  
+
+  While the SCRAM authentication method ()
+  theoretically provides a form of server authentication without the use of
+  certificates, the current SCRAM implementation does not protect the entire
+  authentication exchange. A spoofed server (or an active attacker on the
+  network) may tamper with the startup packet or attempt to bypass the
+  client's server verification step. Additionally, an attacker on the wire
+  may use an intercepted SCRAM exchange to begin a brute-force attack
+  against the password offline. It's recommended to employ strong server
+  authentication, using one of the above anti-spoofing measures, to prevent
+  these attacks.
+
+  
  
 
  
-- 
2.25.1



Atomic ops for unlogged LSN

2023-05-23 Thread John Morris
This is a short patch which cleans up the code for unlogged LSNs. It replaces
the existing spinlock with atomic ops.  It could provide a performance benefit
for future uses of unlogged LSNs, but for now it is simply a cleaner
implementation.


0001-Use-atomic-ops-for-unloggedLSNs.patch
Description: 0001-Use-atomic-ops-for-unloggedLSNs.patch


Re: PG 16 draft release notes ready

2023-05-23 Thread David Rowley
On Mon, 22 May 2023 at 07:05, Jonathan S. Katz  wrote:
> * Parallel execution of queries that use `FULL` and `OUTER` joins

I think this should be `RIGHT` joins rather than `OUTER` joins.

LEFT joins have been parallelizable I think for a long time now.

David




Re: PG 16 draft release notes ready

2023-05-23 Thread David Rowley
On Tue, 23 May 2023 at 06:04, Bruce Momjian  wrote:
>
> On Mon, May 22, 2023 at 10:59:36AM -0700, Andres Freund wrote:
> > And here it's not just performance, but also memory usage, including steady
> > state memory usage.
>
> Understood.  I continue to need help determining which items to include.
> Can you suggest some text?  This?
>
> Improve efficiency of memory usage to allow for better scaling

Maybe something like:

* Reduce palloc() memory overhead for all memory allocations down to 8
bytes on all platforms. (Andres Freund, David Rowley)

This allows more efficient use of memory and is especially useful in
queries which perform operations (such as sorting or hashing) that
require more than work_mem.

David




Re: memory leak in trigger handling (since PG12)

2023-05-23 Thread Tomas Vondra



On 5/23/23 18:39, Tom Lane wrote:
> Tomas Vondra  writes:
>> it seems there's a fairly annoying memory leak in trigger code,
>> introduced by
>> ...
>> Attached is a patch, restoring the pre-12 behavior for me.
> 
>> While looking for other places allocating stuff in ExecutorState (for
>> the UPDATE case) and leaving it there, I found two more cases:
> 
>> 1) copy_plpgsql_datums
> 
>> 2) make_expanded_record_from_tupdesc
>>make_expanded_record_from_exprecord
> 
>> All of this is calls from plpgsql_exec_trigger.
> 
> Not sure about the expanded-record case, but both of your other two
> fixes feel like poor substitutes for pushing the memory into a
> shorter-lived context.  In particular I'm quite surprised that
> plpgsql isn't already allocating that workspace in the "procedure"
> memory context.
> 

I don't disagree, but which memory context should this use and
when/where should we switch to it?

I haven't seen any obvious memory context candidate in the code calling
ExecGetAllUpdatedCols, so I guess we'd have to pass it from above. Is
that a good idea for backbranches ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Docs: Encourage strong server verification with SCRAM

2023-05-23 Thread Stephen Frost
Greetings,

* Jacob Champion (jchamp...@timescale.com) wrote:
> As touched on in past threads, our SCRAM implementation is slightly
> nonstandard and doesn't always protect the entirety of the
> authentication handshake:
> 
> - the username in the startup packet is not covered by the SCRAM
> crypto and can be tampered with if channel binding is not in effect,
> or by an attacker holding only the server's key

Encouraging channel binding when using TLS certainly makes a lot of
sense, particularly in environments where the trust anchors are not
strongly managed (that is- if you trust the huge number of root
certificates that a system may have installed).  Perhaps explicitly
encouraging users to *not* trust every root server installed on their
client for their PG connections would also be a good idea.  We should
probably add language to that effect to this section.

We do default to having channel binding used, though.  Breaking
down exactly the cases at risk (perhaps including what version certain
things were introduced) and what direct action administrators can take
to ensure their systems are as secure as possible (and maybe mention
what things that doesn't address still, so they're aware of what risks
still exist) would be good.

> - low iteration counts accepted by the client make it easier than it
> probably should be for a MITM to brute-force passwords (note that
> PG16's scram_iterations GUC, being server-side, does not mitigate
> this)

This would be good to improve on.

> - by default, a SCRAM exchange can be exited by the server prior to
> sending its verifier, skipping the client's server authentication step
> (this is mitigated by requiring channel binding, and PG16 adds
> require_auth=scram-sha-256 to help as well)

Yeah, encouraging this would also be good and should be mentioned as
something to do when one is using SCRAM.  Clearly that would go into a
PG16 version of this.

> These aren't currently considered security vulnerabilities, but it'd
> be good for the documentation to call them out, considering mutual
> authentication is one of the design goals of the SCRAM RFC. (I'd also
> like to shore up these problems, eventually, to make SCRAM-based
> mutual authn viable with Postgres. But that work has stalled a bit on
> my end.)

Improving the documentation certainly is a good plan.

> Here's a patch to explicitly warn people away from SCRAM as a form of
> server authentication, and nudge them towards a combination with
> verified TLS or gssenc. I've tried to keep the text version-agnostic,
> to make a potential backport easier. Is this a good place for the
> warning to go? Should I call out that GSS can't use channel binding,
> or promote the use of TLS versus GSS for SCRAM, or just keep it
> simple?

Adding channel binding to GSS (which does support it, we just don't
implement it today..) when using TLS or another encryption method for
transport would be a good improvement to make, particularly right now as
we don't yet support encryption with SSPI, meaning that you need to use
TLS to get session encryption when one of the systems is on Windows.  I
do hope to add support for encryption with SSPI during this next release
cycle and would certainly welcome help from anyone else who is
interested in this.

I have to admit that the patch as presented strikes me as a warning
without really providing steps for how to address the issues mentioned
though; there's a reference to what was just covered at the bottom but
nothing really new.  The current documentation leads with the warnings
and then goes into steps to take to address the risk, and I'd say it
would make more sense to put this wording about SCRAM not being a
complete solution to this issue above the steps that one can take to
reduce the spoofing risk, similar to how the documentation currently is.

Perhaps more succinctly- maybe we should be making adjustments to the
current language instead of just adding a new paragraph.

Thanks,

Stephen




Re: memory leak in trigger handling (since PG12)

2023-05-23 Thread Tomas Vondra



On 5/23/23 19:14, Andres Freund wrote:
> Hi,
> 
> On 2023-05-23 18:23:00 +0200, Tomas Vondra wrote:
>> This means that for an UPDATE with triggers, we may end up calling this
>> for each row, possibly multiple bitmaps. And those bitmaps are allocated
>> in ExecutorState, so won't be freed until the end of the query :-(
> 
> Ugh.
> 
> 
> I've wondered about some form of instrumentation to detect such issues
> before. It's obviously a problem that we can have fairly large leaks, like the
> one you just discovered, without detecting it for a couple years. It's kinda
> made worse by the memory context infrastructure, because it hides such issues.
> 
> Could it help to have a mode where the executor shutdown hook checks how much
> memory is allocated in ExecutorState and warns if its too much? There's IIRC a
> few places that allocate large things directly in it, but most of those
> probably should be dedicated contexts anyway.  Something similar could be
> useful for some other long-lived contexts.
> 

Not sure such simple instrumentation would help much, unfortunately :-(

We only discovered this because the user reported OOM crashes, which
means the executor didn't get to the shutdown hook at all. Yeah, maybe
if we had such instrumentation it'd get triggered for milder cases, but
I'd bet the amount of noise is going to be significant.

For example, there's a nearby thread about hashjoin allocating buffiles
etc. in ExecutorState - we ended up moving that to a separate context,
but surely there are more such cases (not just for ExecutorState).

The really hard thing was determining what causes the memory leak - the
simple instrumentation doesn't help with that at all. It tells you there
might be a leak, but you don't know where the allocations came from.

What I ended up doing is a simple gdb script that sets breakpoints on
all palloc/pfree variants, and prints info (including the backtrace) for
each call on ExecutorState. And then a script that aggregates those to
identify which backtraces allocated the most chunks that were not freed.

This was super slow (definitely useless outside a development environment)
but it made it super obvious where the root cause was.

I experimented with instrumenting the code a bit (as an alternative to gdb
breakpoints) - still slow, but much faster than gdb. Perhaps useful
for a special (valgrind-like) testing mode ...

Would require testing with more data, though. I doubt we'd find much
with our regression tests, which have tiny data sets.

>  ...
>
>> Attached is a patch, restoring the pre-12 behavior for me.
> 
> Hm. Somehow this doesn't seem quite right. Shouldn't we try to use a shorter
> lived memory context instead? Otherwise we'll just end up with the same
> problem in a few years.
>

I agree using a shorter lived memory context would be more elegant, and
more in line with how we do things. But it's not clear to me why we'd
end up with the same problem in a few years with what the patch does.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: memory leak in trigger handling (since PG12)

2023-05-23 Thread Tom Lane
Tomas Vondra  writes:
> The really hard thing was determining what causes the memory leak - the
> simple instrumentation doesn't help with that at all. It tells you there
> might be a leak, but you don't know where the allocations came from.

> What I ended up doing is a simple gdb script that sets breakpoints on
> all palloc/pfree variants, and prints info (including the backtrace) for
> each call on ExecutorState. And then a script that aggregate those to
> identify which backtraces allocated most chunks that were not freed.

FWIW, I've had some success localizing palloc memory leaks with valgrind's
leak detection mode.  The trick is to ask it for a report before the
context gets destroyed.  Beats writing your own infrastructure ...
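
(Roughly, that means dropping something like this in just before the suspect
context would normally go away -- needs a USE_VALGRIND build running under
valgrind; the context variable name here is just a placeholder:)

#ifdef USE_VALGRIND
    VALGRIND_DO_LEAK_CHECK;     /* from <valgrind/memcheck.h> */
#endif
    MemoryContextDelete(suspect_context);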

> Would require testing with more data, though. I doubt we'd find much
> with our regression tests, which have tiny data sets.

Yeah, it's not clear whether we could make the still-hypothetical check
sensitive enough to find leaks using small test cases without getting an
unworkable number of false positives.  Still, might be worth trying.
It might be an acceptable tradeoff to have stricter rules for what can
be allocated in ExecutorState in order to make this sort of problem
more easily detectable.

regards, tom lane




Re: Atomic ops for unlogged LSN

2023-05-23 Thread Michael Paquier
On Tue, May 23, 2023 at 08:24:45PM +, John Morris wrote:
> This is a short patch which cleans up code for unlogged LSNs. It
> replaces the existing spinlock with atomic ops.  It could provide a
> performance benefit for future uses of unlogged LSNS,  but for now
> it is simply a cleaner implementation.

Seeing the code paths where gistGetFakeLSN() is called, I guess that
the answer will be no, but still, are you seeing a measurable difference in
some cases?

-   /* increment the unloggedLSN counter, need SpinLock */
-   SpinLockAcquire(&XLogCtl->ulsn_lck);
-   nextUnloggedLSN = XLogCtl->unloggedLSN++;
-   SpinLockRelease(&XLogCtl->ulsn_lck);
-
-   return nextUnloggedLSN;
+   return pg_atomic_fetch_add_u64(&XLogCtl->unloggedLSN, 1);

Spinlocks provide a full memory barrier, which may not be the case with
add_u64() or read_u64().  Could memory ordering be a problem in these
code paths?
--
Michael




Re: PG 16 draft release notes ready

2023-05-23 Thread Jonathan S. Katz

On 5/23/23 4:37 PM, David Rowley wrote:

On Mon, 22 May 2023 at 07:05, Jonathan S. Katz  wrote:

* Parallel execution of queries that use `FULL` and `OUTER` joins


I think this should be `RIGHT` joins rather than `OUTER` joins.

LEFT joins have been parallelizable I think for a long time now.


I had grabbed it from this line:

  Allow outer and full joins to be performed in parallel (Melanie 
Plageman, Thomas Munro)


If we want to be specific on RIGHT joins, I can update it in the release 
announcement, but it may be too late for the release notes (at least for 
beta 1).


Thanks,

Jonathan




Re: Zstandard support for toast compression

2023-05-23 Thread Michael Paquier
On Tue, May 23, 2023 at 05:56:13PM +0300, Aleksander Alekseev wrote:
> TWIMC there is an ongoing discussion [1] of making TOAST pointers
> extendable since this is a dependency for several patches that are
> currently in development.

Thanks for the ping.  I have seen and read the other thread, and yes,
that's an exciting proposal, not only for what's specific to the
thread here.
--
Michael




Re: PG 16 draft release notes ready

2023-05-23 Thread Jonathan S. Katz

On 5/22/23 4:18 PM, Robert Haas wrote:

On Sun, May 21, 2023 at 3:05 PM Jonathan S. Katz  wrote:

* Support for regular expressions for matching user names and database
names in `pg_hba.conf`, and user names in `pg_ident.conf`


I suggest that this is not a major feature.

Perhaps the work that I did to improve CREATEROLE could be considered
for inclusion in the major features list. In previous releases,
someone with CREATEROLE can hack the PG OS account. Now they can't. In
previous releases, someone with CREATEROLE can manage all
non-superuser roles, but now they can manage the roles they create (or
ones they are given explicit authority to manage). You can even
control whether or not such users automatically inherit the privileges
of roles they create, as superusers inherit all privileges. There is
certainly some argument that this is not a sufficiently significant
set of changes to justify a major feature mention, and even if it is,
it's not clear to me exactly how it would be best worded. And yet I
feel like it's very likely that if we look back on this release in 3
years, those changes will have had a significant impact on many
PostgreSQL deployments, above all in the cloud, whereas I think it
likely that the ability to have regular expressions in pg_hba.conf and
pg_ident.conf will have had very little effect by comparison.

Of course, there is always a possibility that I'm over-estimating the
impact of my own work.


In general, I'm completely fine with people advocating for their own 
features during this process, in case there's something that I missed.


For this case, while I think this work is very impactful, I don't 
know if I'd call it a major feature vs. modifying an unintended 
behavior. Additionally, folks have likely put mitigations in place for 
this through the years. I'm happy to be convinced otherwise.


The regular expression support in those files adds an ability that we didn't 
have before, and it has been a request I've heard from users with very 
large deployments. For them, it'll help simplify a lot of their 
configurations/automations for setting this up en masse. Again, I'm 
happy to be convinced otherwise.


I wanted to use the beta release to allow for us to see 1/ how people 
ultimately test these things and 2/ help better sift out what will be 
called a major feature. We could end up shuffling items in the list or 
completely rewriting it, so it's not set in stone.


Thanks,

Jonathan





Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-05-23 Thread John Naylor
I wrote:
> the current insert/delete paths are quite complex. Using bitmap heap scan
as a motivating use case, I hope to refocus complexity to where it's most
needed, and aggressively simplify where possible.

Sometime in the not-too-distant future, I will start a new thread focusing
on bitmap heap scan, but for now, I just want to share some progress on
making the radix tree usable not only for that, but hopefully a wider range
of applications, while making the code simpler and the binary smaller. The
attached patches are incomplete (e.g. no iteration) and quite a bit messy,
so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 +
0007-09 ).

0001

This combines a few concepts that I didn't bother separating out after the
fact:
- Split insert_impl.h into multiple functions for improved readability and
maintainability.
- Use single-value leaves as the basis for storing values, with the goal to
get to "combined pointer-value slots" for efficiency and flexibility.
- With the latter in mind, searching the child within a node now returns
the address of the slot. This allows the same interface whether the slot
contains a child pointer or a value.
- Starting with RT_SET, start turning some iterative algorithms into
recursive ones. This is a more natural way to traverse a tree structure,
and we already see an advantage: Previously when growing a node, we
searched within the parent to update its reference to the new node, because
we didn't know the slot we descended from. Now we can simply update a
single variable.
- Since we recursively pass the "shift" down the stack, it doesn't have to
be stored in any node -- only the "top-level" start shift is stored in the
tree control struct. This was easy to code since the node's shift value was
hardly ever accessed anyway! The node header shrinks from 5 bytes to 4.

0002

Back in v15, we tried keeping DSA/local pointers as members of a struct. I
did not like the result, but still thought it was a good idea. RT_DELETE is
a complex function and I didn't want to try rewriting it without a pointer
abstraction, so I've resurrected this idea, but in a simpler, less
intrusive way. A key difference from v15 is using a union type for the
non-shmem case.

0004

Rewrite RT_DELETE using recursion. I find this simpler than the previous
open-coded stack.

0005-06

Deletion has an inefficiency: One function searches for the child to see if
it's there, then another function searches for it again to delete it. Since
0001, a successful child search returns the address of the slot, so we can
save it. For the two smaller "linear search" node kinds we can then use a
single subtraction to compute the chunk/slot index for deletion. Also,
split RT_NODE_DELETE_INNER into separate functions, for a similar reason as
the insert case in 0001.

0007

Anticipate node shrinking: If only one node-kind needs to be freed, we can
move a branch to that one code path, rather than every place where RT_FREE
is inlined.

0009

Teach node256 how to shrink *. Since we know the number of children in a
node256 can't possibly be zero, we can use uint8 to store the count and
interpret an overflow to zero as 256 for this node. The node header shrinks
from 4 bytes to 3.

* Other nodes will follow in due time, but only after I figure out how to
do it nicely (ideas welcome!) -- currently node32's two size classes work
fine for growing, but the code should be simplified before extending to
other cases.)
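
(For the record, the uint8 count trick in 0009 amounts to something like the
below -- the helper name is made up:)

/* node256 can never be empty, so a stored count of 0 means "full", i.e. 256 */
static inline int
rt_node256_count(uint8 stored_count)
{
    return (stored_count == 0) ? 256 : stored_count;
}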

0010

Limited support for "combined pointer-value slots". At compile-time, choose
either that or "single-value leaves" based on the size of the value type
template parameter. Values that are pointer-sized or less can fit in the
last-level child slots of nominal "inner nodes" without duplicated
leaf-node code. Node256 now must act like the previous 'node256 leaf',
since zero is a valid value. Aside from that, this was a small change.
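
(The compile-time switch itself is trivial -- something along these lines,
with made-up names; the point is that the test constant-folds for a given
template instantiation, so the unused path disappears:)

/* can a value of the template's value type live directly in a child slot? */
#define RT_VALUE_FITS_IN_SLOT   (sizeof(RT_VALUE_TYPE) <= sizeof(uintptr_t))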

What I've shared here could work (in principal, since it uses uint64
values) for tidstore, possibly faster (untested) because of better code
density, but as mentioned I want to shoot for higher. For tidbitmap.c, I
want to extend this idea and branch at run-time on a per-value basis, so
that a page-table entry that fits in a pointer can go there, and if not,
it'll be a full leaf. (This technique enables more flexibility in
lossifying pages as well.) Run-time info will require e.g. an additional
bit per slot. Since the node header is now 3 bytes, we can spare one more
byte in the node3 case. In addition, we can and should also bump it back up
to node4, still keeping the metadata within 8 bytes (no struct padding).

I've started in this patchset to refer to the node kinds as "4/16/48/256",
regardless of their actual fanout. This is for readability (by matching the
language in the paper) and maintainability (should *not* ever change
again). The size classes (including multiple classes per kind) could be
determined by macros and #ifdef's. For example, in non-SIMD architectures,
it'

Re: RFI: Extending the TOAST Pointer

2023-05-23 Thread Michael Paquier
On Tue, May 23, 2023 at 12:33:50PM -0400, Robert Haas wrote:
> For projects like this, the details matter a lot. If the goal is to
> add a new compression type that behaves like the existing compression
> types, more or less, then I think we should allocate the last
> ToastCompressionId bit to mean "some other compression ID" and add a
> 1-byte header that says which one is in use. But if the new feature
> being added is enough different from the other compression methods,
> then it might be better to do it in some other way e.g. a new VARTAG.

Agreed.  While the compression argument and the possibility to add
more options to toast pointers are very appealing, FWIW, I'd like to
think that the primary target is the 4-byte OID assignment limit,
where backends loop infinitely until an OID can be found, which can be
a real pain for users with a large number of blobs or just enough
toast data to trigger it.

Saying that even if I sent the patch for zstd on toast..
--
Michael




Re: Make pgbench exit on SIGINT more reliably

2023-05-23 Thread Michael Paquier
On Mon, May 22, 2023 at 10:02:02AM -0500, Tristan Partin wrote:
> The way that pgbench handled SIGINT changed in
> 1d468b9ad81b9139b4a0b16b416c3597925af4b0. Unfortunately this had a
> couple of unintended consequences, at least from what I can tell[1].
> 
> - CTRL-C no longer stops the program unless the right point in pgbench
>   execution is hit
> - pgbench no longer exits with a non-zero exit code
> 
> An easy reproduction of these problems is to run with a large scale
> factor like: pgbench -i -s 50. Then try to CTRL-C the program.

This comes from the code path where the data is generated client-side,
where the current CancelRequested check may not be that responsive,
doesn't it?
--
Michael




Re: Large files for relations

2023-05-23 Thread Thomas Munro
Thanks all for the feedback.  It was a nice idea and it *almost*
works, but it seems like we just can't drop segmented mode.  And the
automatic transition schemes I showed don't make much sense without
that goal.

What I'm hearing is that something simple like this might be more acceptable:

* initdb --rel-segsize (cf --wal-segsize), default unchanged
* pg_upgrade would convert if source and target don't match

I would probably also leave out those Windows file API changes, too.
--rel-segsize would simply refuse larger sizes until someone does the
work on that platform, to keep the initial proposal small.

I would probably leave the experimental copy_on_write() ideas out too,
for separate discussion in a separate proposal.




Re: Handle SIGTERM in fe_utils/cancel.c

2023-05-23 Thread Michael Paquier
On Mon, May 22, 2023 at 12:26:34PM -0500, Tristan Partin wrote:
> This is a way that would solve bug #17698[1]. It just reuses the same
> handler as SIGINT (with a function rename).
> 
> This patch works best if it is combined with my previous submission[2].
> I can rebase that submission if and when this patch is pulled in.

Not sure that this is a good idea long-term.  Currently, the code
paths calling setup_cancel_handler() from cancel.c don't have custom
handling for SIGTERM, but that may not be the case forever.
--
Michael




Re: unnecessary #include "pg_getopt.h"?

2023-05-23 Thread Michael Paquier
On Mon, May 22, 2023 at 06:48:37PM +0900, torikoshia wrote:
> While working on [1], I noticed what seem to be unnecessary #include
> "pg_getopt.h" directives.
> getopt_long.h already includes pg_getopt.h, but some files include both
> pg_getopt.h and getopt_long.h.

Right, these could be removed.  I am not seeing other places in the
tree that include both.  That's always nice to clean up.
--
Michael




Re: Allow pg_archivecleanup to remove backup history files

2023-05-23 Thread Michael Paquier
On Mon, May 22, 2023 at 06:24:49PM +0900, torikoshia wrote:
> Thanks for your advice, attached patches.

0001 looks OK, thanks!

+Remove files including backup history file.

This could be reworded as "Remove backup history files.", I assume.

+ Note that when oldestkeptwalfile is a 
backup history file,
+ specified file is kept and only preceding WAL files and backup 
history files are removed.

The same thing is described at the bottom of the documentation, so it
does not seem necessary here.

-   printf(_("  -d, --debug   generate debug output (verbose 
mode)\n"));
-   printf(_("  -n, --dry-run dry run, show the names of the 
files that would be removed\n"));
-   printf(_("  -V, --version output version information, then 
exit\n"));
-   printf(_("  -x --strip-extension=EXT  clean up files if they have this 
extension\n"));
-   printf(_("  -?, --helpshow this help, then exit\n"));
+   printf(_("  -d, --debug generate debug output (verbose 
mode)\n"));
+   printf(_("  -n, --dry-run   dry run, show the names of the 
files that would be removed\n"));
+   printf(_("  -b, --clean-backup-history  clean up files including backup 
history files\n"));
+   printf(_("  -V, --version   output version information, 
then exit\n"));
+   printf(_("  -x --strip-extension=EXTclean up files if they have 
this extension\n"));
+   printf(_("  -?, --help  show this help, then exit\n"));

Perhaps this indentation had better be adjusted in 0001, no big deal
either way.

-   ok(!-f "$tempdir/$walfiles[0]",
-   "$test_name: first older WAL file was cleaned up");
+   if (grep {$_ eq '-x.gz'} @options) {
+   ok(!-f "$tempdir/$walfiles[0]",
+   "$test_name: first older WAL file with .gz was cleaned 
up");
+   } else {
+   ok(-f "$tempdir/$walfiles[0]",
+   "$test_name: first older WAL file with .gz was not 
cleaned up");
+   }
[...]
-   ok(-f "$tempdir/$walfiles[2]",
-   "$test_name: restartfile was not cleaned up");
+   if (grep {$_ eq '-b'} @options) {
+   ok(!-f "$tempdir/$walfiles[2]",
+   "$test_name: Backup history file was cleaned up");
+   } else {
+   ok(-f "$tempdir/$walfiles[2]",
+   "$test_name: Backup history file was not cleaned up");
+   }

That's over-complicated, caused by the fact that all the tests rely on
the same list of WAL files to create, aka @walfiles defined at the
beginning of the script.  Wouldn't it be cleaner to handle the fake
file creations with more than one global list, before each command
run?  The list of files to be created could just be passed down as an
argument of run_check(), for example, before cleaning up the contents
for the next command.
--
Michael




Re: Docs: Encourage strong server verification with SCRAM

2023-05-23 Thread Michael Paquier
On Tue, May 23, 2023 at 05:02:50PM -0400, Stephen Frost wrote:
> * Jacob Champion (jchamp...@timescale.com) wrote:
>> As touched on in past threads, our SCRAM implementation is slightly
>> nonstandard and doesn't always protect the entirety of the
>> authentication handshake:
>>
>> - the username in the startup packet is not covered by the SCRAM
>> crypto and can be tampered with if channel binding is not in effect,
>> or by an attacker holding only the server's key
>
> Encouraging channel binding when using TLS certainly makes a lot of
> sense, particularly in environments where the trust anchors are not
> strongly managed (that is- if you trust the huge number of root
> certificates that a system may have installed).  Perhaps explicitly
> encouraging users to *not* trust every root server installed on their
> client for their PG connections would also be a good idea.  We should
> probably add language to that effect to this section.

Currently the user name sent in the exchange is ignored by the backend,
as in auth-scram.c:

/*
 * Read username.  Note: this is ignored.  We use the username from the
 * startup message instead, still it is kept around if provided as it
 * proves to be useful for debugging purposes.
 */
state->client_username = read_attr_value(&p, 'n');

Could it make sense to cross-check that with the contents of the
startup packet instead, with or without channel binding?

>> - low iteration counts accepted by the client make it easier than it
>> probably should be for a MITM to brute-force passwords (note that
>> PG16's scram_iterations GUC, being server-side, does not mitigate
>> this)
>
> This would be good to improve on.

Hmm.  Interesting.

> > - by default, a SCRAM exchange can be exited by the server prior to
> > sending its verifier, skipping the client's server authentication step
> > (this is mitigated by requiring channel binding, and PG16 adds
> > require_auth=scram-sha-256 to help as well)
>
> Yeah, encouraging this would also be good and should be mentioned as
> something to do when one is using SCRAM.  Clearly that would go into a
> PG16 version of this.

Agreed to improve the docs about ways to mitigate any risks.
--
Michael




Re: unnecessary #include "pg_getopt.h"?

2023-05-23 Thread Andres Freund
Hi,

On 2023-05-24 09:59:18 +0900, Michael Paquier wrote:
> On Mon, May 22, 2023 at 06:48:37PM +0900, torikoshia wrote:
> > While working on [1], I noticed what seem to be unnecessary #include
> > "pg_getopt.h" directives.
> > getopt_long.h already includes pg_getopt.h, but some files include both
> > pg_getopt.h and getopt_long.h.
> 
> Right, these could be removed.  I am not seeing other places in the
> tree that include both.  That's always nice to clean up.

This feels more like a matter of taste to me than anything. At least some of
the files touched in the patch use optarg, opterr etc., which are declared in
pg_getopt.h, making it reasonable to directly include pg_getopt.h.

I don't really see a need to change anything here?

Greetings,

Andres Freund




Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

2023-05-23 Thread Michael Paquier
On Tue, May 23, 2023 at 04:25:59PM +0200, Christoph Berg wrote:
> I believe this issue is still open for PG16.

Right.  I've added an item to the list, to not forget.
--
Michael




Re: Docs: Encourage strong server verification with SCRAM

2023-05-23 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Tue, May 23, 2023 at 05:02:50PM -0400, Stephen Frost wrote:
> > * Jacob Champion (jchamp...@timescale.com) wrote:
> >> As touched on in past threads, our SCRAM implementation is slightly
> >> nonstandard and doesn't always protect the entirety of the
> >> authentication handshake:
> >>
> >> - the username in the startup packet is not covered by the SCRAM
> >> crypto and can be tampered with if channel binding is not in effect,
> >> or by an attacker holding only the server's key
> >
> > Encouraging channel binding when using TLS certainly makes a lot of
> > sense, particularly in environments where the trust anchors are not
> > strongly managed (that is- if you trust the huge number of root
> > certificates that a system may have installed).  Perhaps explicitly
> > encouraging users to *not* trust every root server installed on their
> > client for their PG connections would also be a good idea.  We should
> > probably add language to that effect to this section.
> 
> Currently the user name sent in the exchange is ignored by the backend,
> as in auth-scram.c:
> 
> /*
>  * Read username.  Note: this is ignored.  We use the username from the
>  * startup message instead, still it is kept around if provided as it
>  * proves to be useful for debugging purposes.
>  */
> state->client_username = read_attr_value(&p, 'n');
> 
> Could it make sense to cross-check that with the contents of the
> startup packet instead, with or without channel binding?

Not without breaking things we support today and for what seems like an
unclear benefit given that we've got channel binding today (though
perhaps we need to make sure there's ways to force it on both sides to
be on and to encourage everyone to do that- which is what this change is
generally about, I think).

As I recall, the reason we do it the way we do is because the SASL spec
that SCRAM is implemented under requires the username to be utf8 encoded
while we support other encodings, and I don't think we want to break
non-utf8 usage.

Thanks,

Stephen




Re: Docs: Encourage strong server verification with SCRAM

2023-05-23 Thread Michael Paquier
On Tue, May 23, 2023 at 09:46:58PM -0400, Stephen Frost wrote:
> Not without breaking things we support today and for what seems like an
> unclear benefit given that we've got channel binding today (though
> perhaps we need to make sure there's ways to force it on both sides to
> be on and to encourage everyone to do that- which is what this change is
> generally about, I think).
> 
> As I recall, the reason we do it the way we do is because the SASL spec
> that SCRAM is implemented under requires the username to be utf8 encoded
> while we support other encodings, and I don't think we want to break
> non-utf8 usage.

Yup, I remember this one; the encoding not being enforced by the
protocol was quite an issue when this was implemented.  Still, I
was wondering whether that's something that could be worth revisiting
to some degree.
--
Michael




Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

2023-05-23 Thread Kyotaro Horiguchi
At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro  wrote 
in 
> On Tue, Apr 25, 2023 at 12:16 PM Andres Freund  wrote:
> > On 2023-04-24 15:32:25 -0700, Andres Freund wrote:
> > > We obviously can add a retry loop to FileFallocate(), similar to what's
> > > already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
> > > further, and do it for all the fd.c routines where it's remotely plausible
> > > EINTR could be returned? It's a bit silly to add EINTR retries one-by-one 
> > > to
> > > the functions.
> > >
> > >
> > > The following are documented to potentially return EINTR, without fd.c 
> > > having
> > > code to retry:
> > >
> > > - FileWriteback() / pg_flush_data()
> > > - FileSync() / pg_fsync()
> > > - FileFallocate()
> > > - FileTruncate()
> > >
> > > With the first two there's the added complication that it's not entirely
> > > obvious whether it'd be better to handle this in File* or pg_*. I'd argue 
> > > the
> > > latter is a bit more sensible?
> >
> > A prototype of that approach is attached. I pushed the retry handling into 
> > the
> > pg_* routines where applicable.  I guess we could add pg_* routines for
> > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.
> 
> One problem we ran into with the the shm_open() case (which is nearly
> identical under the covers, since shm_open() just opens a file in a
> tmpfs on Linux) was that a simple retry loop like this could never
> terminate if the process was receiving a lot of signals from the
> recovery process, which is why we went with the idea of masking
> signals instead.  Eventually we should probably grow the file in
> smaller chunks with a CFI in between so that we both guarantee that we
> make progress (by masking for smaller size increases) and service
> interrupts in a timely fashion (by unmasking between loops).  I don't
> think that applies here because we're not trying to fallocate
> humongous size increases in one go, but I just want to note that we're
> making a different choice.  I think this looks reasonable for the use
> cases we actually have here.

FWIW I share the same feeling about looping by EINTR without signals
being blocked. If we just retry the same operation without processing
signals after getting EINTR, I think blocking signals is better. We
could block signals more gracefully, but I'm not sure it's worth the
complexity.
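
To be concrete, what I have in mind is simply something like the following
(plain POSIX for illustration; the real code would presumably reuse the
existing BlockSig/UnBlockSig masks, and the function name is made up):

#include <fcntl.h>
#include <signal.h>

/* sketch: avoid EINTR entirely by holding signals across the call */
static int
fallocate_with_signals_blocked(int fd, off_t offset, off_t len)
{
    sigset_t    all;
    sigset_t    save;
    int         rc;

    sigfillset(&all);
    sigprocmask(SIG_SETMASK, &all, &save);
    rc = posix_fallocate(fd, offset, len);  /* returns an errno value, not -1 */
    sigprocmask(SIG_SETMASK, &save, NULL);

    return rc;
}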

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Docs: Encourage strong server verification with SCRAM

2023-05-23 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Tue, May 23, 2023 at 09:46:58PM -0400, Stephen Frost wrote:
> > Not without breaking things we support today and for what seems like an
> > unclear benefit given that we've got channel binding today (though
> > perhaps we need to make sure there's ways to force it on both sides to
> > be on and to encourage everyone to do that- which is what this change is
> > generally about, I think).
> > 
> > As I recall, the reason we do it the way we do is because the SASL spec
> > that SCRAM is implemented under requires the username to be utf8 encoded
> > while we support other encodings, and I don't think we want to break
> > non-utf8 usage.
> 
> Yup, I remember this one; the encoding not being enforced by the
> protocol was quite an issue when this was implemented.  Still, I
> was wondering whether that's something that could be worth revisiting
> to some degree.

To the extent that there was an issue when it was implemented ... it's
now been implemented and so that was presumably overcome (though I don't
really specifically recall what the issues were there?  Seems like it
wouldn't matter at this point though), so I don't understand why we
would wish to revisit it.

Thanks,

Stephen




Re: Insertion Sort Improvements

2023-05-23 Thread John Naylor
On Tue, May 23, 2023 at 4:10 PM Benjamin Coutu  wrote:
>
> Greetings,
>
> I would like to revisit the discussion and concur with John's perspective
that incremental progress through small, successive modifications is the
appropriate approach to move forward. Therefore, I would like to propose
two distinct ideas aimed at enhancing the functionality of insertion sort:
>
> 1. Implementation of a more efficient variant (as described in the
introductory part of this thread):

That's worth trying out. It might also then be worth trying to push both
unordered values -- the big one up / the small one down. I've seen other
implementations do that, but don't remember where, or what it's called.

> 2. It appears that there is a consensus against disregarding the
presorted check, despite its questionable value. In light of this, an
alternative suggestion is to integrate the presorted check into the
insertion sort path by utilizing an unbounded insertion sort.

"Unbounded" means no bounds check on the array. That's not possible in its
current form, so I think you misunderstood something.

> Only after a maximum number of swaps have occurred should we abandon the
sorting process.

I only remember implementations tracking loop iterations, not swaps. You'd
need evidence that this is better.

> If insertion sort is executed on the entire array without any swaps, we
can simply return. If not, and we continue with quicksort after the swap
limit has been reached, we at least have left the array in a more sorted
state, which may likely be advantageous for subsequent iterations.

An important part not mentioned yet: This might only be worth doing if the
previous recursion level performed no swaps during partitioning and the
current pivot candidates are ordered. That's a bit of work and might not be
convenient now -- it'd be trivial with dual-pivot, but I've not looked at
that in a while. (Fittingly, dual-pivot requires a higher insertion sort
threshold so it goes both ways.)

> I would greatly appreciate assistance in benchmarking these proposed
changes. Your collaboration in this matter would be invaluable.

I advise looking in the archives for scripts from previous benchmarks. No
need to reinvent the wheel -- it's enough work as it is. A key thing here
for #1 is that improving insertion sort requires increasing the threshold
to show the true improvement. It's currently 7, but should really be
something like 10. A script that repeats tests for, say, 7 through 18
should show a concave-up shape in the times. The bottom of the bowl should
shift to higher values, and that minimum is what should be compared.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

2023-05-23 Thread Andres Freund
Hi,

On 2023-05-24 10:56:28 +0900, Kyotaro Horiguchi wrote:
> At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro  
> wrote in 
> > On Tue, Apr 25, 2023 at 12:16 PM Andres Freund  wrote:
> > > On 2023-04-24 15:32:25 -0700, Andres Freund wrote:
> > > > We obviously can add a retry loop to FileFallocate(), similar to what's
> > > > already present e.g. in FileRead(). But I wonder if we shouldn't go a 
> > > > bit
> > > > further, and do it for all the fd.c routines where it's remotely 
> > > > plausible
> > > > EINTR could be returned? It's a bit silly to add EINTR retries 
> > > > one-by-one to
> > > > the functions.
> > > >
> > > >
> > > > The following are documented to potentially return EINTR, without fd.c 
> > > > having
> > > > code to retry:
> > > >
> > > > - FileWriteback() / pg_flush_data()
> > > > - FileSync() / pg_fsync()
> > > > - FileFallocate()
> > > > - FileTruncate()
> > > >
> > > > With the first two there's the added complication that it's not entirely
> > > > obvious whether it'd be better to handle this in File* or pg_*. I'd 
> > > > argue the
> > > > latter is a bit more sensible?
> > >
> > > A prototype of that approach is attached. I pushed the retry handling 
> > > into the
> > > pg_* routines where applicable.  I guess we could add pg_* routines for
> > > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.
> > 
> > One problem we ran into with the the shm_open() case (which is nearly
> > identical under the covers, since shm_open() just opens a file in a
> > tmpfs on Linux) was that a simple retry loop like this could never
> > terminate if the process was receiving a lot of signals from the
> > recovery process, which is why we went with the idea of masking
> > signals instead.  Eventually we should probably grow the file in
> > smaller chunks with a CFI in between so that we both guarantee that we
> > make progress (by masking for smaller size increases) and service
> > interrupts in a timely fashion (by unmasking between loops).  I don't
> > think that applies here because we're not trying to fallocate
> > humongous size increases in one go, but I just want to note that we're
> > making a different choice.  I think this looks reasonable for the use
> > cases we actually have here.
> 
> FWIW I share the same feeling about looping by EINTR without signals
> being blocked. If we just retry the same operation without processing
> signals after getting EINTR, I think blocking signals is better. We
> could block signals more gracefully, but I'm not sure it's worth the
> complexity.

I seriously doubt it's a good path to go down in this case. As Thomas
mentioned, this case isn't really comparable to the shm_open() one, due to the
bounded vs unbounded amount of memory we're dealing with.

What would be the benefit?

Greetings,

Andres Freund




Re: createuser --memeber and PG 16

2023-05-23 Thread Nathan Bossart
On Tue, May 23, 2023 at 07:50:36AM +0900, Michael Paquier wrote:
> Seeing the precedent with --no-blobs and --blobs, yes, that should be
> enough.  You may want to wait until beta1 is stamped to apply
> something, though, as the period between the stamp and the tag is used
> to check the state of the tarballs to-be-released.

Thanks, committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-05-23 Thread Peter Smith
Hi, and thanks for the patch! It is an interesting idea.

I have not yet fully read this thread, so below are only my first
impressions after looking at patch 0001. Sorry if some of these were
already discussed earlier.

TBH the patch "reuse-workers" logic seemed more complicated than I had
imagined it might be.

1.
IIUC with patch 0001, each/every tablesync worker (a.k.a. TSW) when it
finishes dealing with one table then goes looking to find if there is
some relation that it can process next. So now every TSW has a loop
where it will fight with every other available TSW over who will get
to process the next relation.

Somehow this seems all backwards to me. Isn't it strange for the TSW
to be the one deciding what relation it would deal with next?

IMO it seems more natural to simply return the finished TSW to some
kind of "pool" of available workers and the main Apply process can
just grab a TSW from that pool instead of launching a brand new one in
the existing function process_syncing_tables_for_apply(). Or, maybe
those "available" workers can be returned to a pool maintained within
the launcher.c code, which logicalrep_worker_launch() can draw from
instead of launching a whole new process?

(I need to read the earlier posts to see if these options were already
discussed and rejected)

~~

2.
AFAIK the thing that identifies a tablesync worker is the fact that
only a TSW will have a 'relid'.

But it feels very awkward to me to have a TSW marked as "available"
and yet that LogicalRepWorker must still have some OLD relid field
value lurking (otherwise it will forget that it is a "tablesync"
worker!).

IMO perhaps it is time now to introduce some enum 'type' to the
LogicalRepWorker. Then an "in_use" type=TSW would have a valid 'relid'
whereas an "available" type=TSW would have relid == InvalidOid.

~~

3.
Maybe I am mistaken, but it seems the benchmark results posted are
only using quite small/default values for
"max_sync_workers_per_subscription", so I wondered how those results
are affected by increasing that GUC. I think having only very few
workers would cause more sequential processing, so conveniently the
effect of the patch avoiding re-launch might be seen in the best
possible light. OTOH, using more TSW in the first place might reduce
the overall tablesync time because the subscriber can do more work in
parallel.

So I'm not quite sure what the goal is here. E.g. if the user doesn't
care much about how long tablesync phase takes then there is maybe no
need for this patch at all. OTOH, I thought if a user does care about
the subscription startup time, won't those users be opting for a much
larger "max_sync_workers_per_subscription" in the first place?
Therefore shouldn't the benchmarking be using a larger number too?

==

Here are a few other random things noticed while looking at patch 0001:

1. Commit message

1a. typo /sequantially/sequentially/

1b. Saying "killed" and "killing" seemed a bit extreme and implies
somebody else is killing the process. But I think mostly tablesync is
just ending by a normal proc exit, so maybe reword this a bit.

~~~

2. It seemed odd that some -- clearly tablesync specific -- functions
are in worker.c instead of in tablesync.c.

2a. e.g. clean_sync_worker

2b. e.g. sync_worker_exit

~~~

3. process_syncing_tables_for_sync

+ /*
+ * Sync worker is cleaned at this point. It's ready to sync next table,
+ * if needed.
+ */
+ SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+ MyLogicalRepWorker->ready_to_reuse = true;
  SpinLockRelease(&MyLogicalRepWorker->relmutex);
+ }
+
+ SpinLockRelease(&MyLogicalRepWorker->relmutex);

Isn't there a double release of that mutex happening there?

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: PG 16 draft release notes ready

2023-05-23 Thread Bruce Momjian
On Wed, May 24, 2023 at 08:37:45AM +1200, David Rowley wrote:
> On Mon, 22 May 2023 at 07:05, Jonathan S. Katz  wrote:
> > * Parallel execution of queries that use `FULL` and `OUTER` joins
> 
> I think this should be `RIGHT` joins rather than `OUTER` joins.
> 
> LEFT joins have been parallelizable I think for a long time now.

Well, since we can swap left/right easily, why would we not have just
swapped the tables and done the join in the past?  I think there
are two things missing in my description.

First, I need to mention parallel _hash_ join.  Second, I think this
item is saying that the _inner_ side of a parallel hash join can be an
OUTER or FULL join.  How about?

Allow hash joins to be parallelized where the inner side is
processed as an OUTER or FULL join (Melanie Plageman, Thomas Munro)

In this case, the inner side is the hashed side.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: PG 16 draft release notes ready

2023-05-23 Thread Bruce Momjian
On Tue, May 23, 2023 at 06:27:23PM -0400, Jonathan Katz wrote:
> On 5/23/23 4:37 PM, David Rowley wrote:
> > On Mon, 22 May 2023 at 07:05, Jonathan S. Katz  wrote:
> > > * Parallel execution of queries that use `FULL` and `OUTER` joins
> > 
> > I think this should be `RIGHT` joins rather than `OUTER` joins.
> > 
> > LEFT joins have been parallelizable I think for a long time now.
> 
> I had grabbed it from this line:
> 
>   Allow outer and full joins to be performed in parallel (Melanie Plageman,
> Thomas Munro)
> 
> If we want to be specific on RIGHT joins, I can update it in the release
> announcement, but it may be too late for the release notes (at least for
> beta 1).

We will have many more edits before final so I would not worry about
adjusting the beta1 wording.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: PG 16 draft release notes ready

2023-05-23 Thread Bruce Momjian
On Tue, May 23, 2023 at 12:14:04PM +0700, John Naylor wrote:
> On Tue, May 23, 2023 at 11:26 AM Bruce Momjian  wrote:
> > > Allow xid/subxid searches to use vector operations on x86-64 architectures
> > > (Nathan Bossart)
> > >
> > > When moved to the performance section, it would be something like "improve
> > > scalability when a large number of write transactions are in progress".
> >
> > Uh, again, see above, this does not impact user behavior or choices.  
> 
> So that turns a scalability improvement into "source code"?
> 
> > I assume this is x86-64-only.
> 
> Au contraire, I said "For anything using 16-byte vectors the two architectures
> are equivalently supported". It's not clear from looking at individual commit
> messages, that's why I piped in to help.

Okay, updated text:

Allow xid/subxid searches to use vector operations (Nathan Bossart)

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: PG 16 draft release notes ready

2023-05-23 Thread David Rowley
On Wed, 24 May 2023 at 15:54, Bruce Momjian  wrote:
>
> On Wed, May 24, 2023 at 08:37:45AM +1200, David Rowley wrote:
> > On Mon, 22 May 2023 at 07:05, Jonathan S. Katz  wrote:
> > > * Parallel execution of queries that use `FULL` and `OUTER` joins
> >
> > I think this should be `RIGHT` joins rather than `OUTER` joins.
> >
> > LEFT joins have been parallelizable I think for a long time now.
>
> Well, since we can swap left/right easily, why would we not have just
> swapped the tables and done the join in the past?  I think there
> are two things missing in my description.
>
> First, I need to mention parallel _hash_ join.  Second, I think this
> item is saying that the _inner_ side of a parallel hash join can be an
> OUTER or FULL join.  How about?
>
> Allow hash joins to be parallelized where the inner side is
> processed as an OUTER or FULL join (Melanie Plageman, Thomas Munro)
>
> In this case, the inner side is the hashed side.

I think Jonathan's text is safe to swap OUTER to RIGHT as it mentions
"execution". For the release notes, maybe the mention of it can be
moved away from "E.1.3.1.1. Optimizer" and put under "E.1.3.1.2.
General Performance" and ensure we mention that we're talking about
the executor?

I'm thinking it might be confusing if we claim that this is something
that we switched on in the planner. It was a limitation in the
executor, which the planner simply went along with by not producing
such plans.

David




Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

2023-05-23 Thread Kyotaro Horiguchi
At Tue, 23 May 2023 19:28:45 -0700, Andres Freund  wrote in 
> Hi,
> 
> On 2023-05-24 10:56:28 +0900, Kyotaro Horiguchi wrote:
> > At Wed, 26 Apr 2023 11:37:55 +1200, Thomas Munro  
> > wrote in 
> > > On Tue, Apr 25, 2023 at 12:16 PM Andres Freund  wrote:
> > > > On 2023-04-24 15:32:25 -0700, Andres Freund wrote:
> > > > > We obviously can add a retry loop to FileFallocate(), similar to 
> > > > > what's
> > > > > already present e.g. in FileRead(). But I wonder if we shouldn't go a 
> > > > > bit
> > > > > further, and do it for all the fd.c routines where it's remotely 
> > > > > plausible
> > > > > EINTR could be returned? It's a bit silly to add EINTR retries 
> > > > > one-by-one to
> > > > > the functions.
> > > > >
> > > > >
> > > > > The following are documented to potentially return EINTR, without 
> > > > > fd.c having
> > > > > code to retry:
> > > > >
> > > > > - FileWriteback() / pg_flush_data()
> > > > > - FileSync() / pg_fsync()
> > > > > - FileFallocate()
> > > > > - FileTruncate()
> > > > >
> > > > > With the first two there's the added complication that it's not 
> > > > > entirely
> > > > > obvious whether it'd be better to handle this in File* or pg_*. I'd 
> > > > > argue the
> > > > > latter is a bit more sensible?
> > > >
> > > > A prototype of that approach is attached. I pushed the retry handling 
> > > > into the
> > > > pg_* routines where applicable.  I guess we could add pg_* routines for
> > > > FileFallocate(), FilePrewarm() etc as well, but I didn't do that here.
> > > 
> > > One problem we ran into with the the shm_open() case (which is nearly
> > > identical under the covers, since shm_open() just opens a file in a
> > > tmpfs on Linux) was that a simple retry loop like this could never
> > > terminate if the process was receiving a lot of signals from the
> > > recovery process, which is why we went with the idea of masking
> > > signals instead.  Eventually we should probably grow the file in
> > > smaller chunks with a CFI in between so that we both guarantee that we
> > > make progress (by masking for smaller size increases) and service
> > > interrupts in a timely fashion (by unmasking between loops).  I don't
> > > think that applies here because we're not trying to fallocate
> > > humongous size increases in one go, but I just want to note that we're
> > > making a different choice.  I think this looks reasonable for the use
> > > cases we actually have here.
> > 
> > FWIW I share the same feeling about looping by EINTR without signals
> > being blocked. If we just retry the same operation without processing
> > signals after getting EINTR, I think blocking signals is better. We
> > could block signals more gracefully, but I'm not sure it's worth the
> > complexity.
> 
> I seriously doubt it's a good path to go down in this case. As Thomas
> mentioned, this case isn't really comparable to the shm_open() one, due to the
> bounded vs unbounded amount of memory we're dealing with.
> 
> What would be the benefit?

I'm not certain what you mean by "it" here. Regarding signal blocking,
the benefit would be a lower chance of getting constantly interrupted
by a string of frequent interrupts, which can't be prevented just by
retrying in a loop. From what I gathered, Thomas meant that we don't need to
use chunking to prevent long periods of ignoring interrupts because
we're extending a file by a few blocks. However, I might have
misunderstood.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: PG 16 draft release notes ready

2023-05-23 Thread Bruce Momjian
On Tue, May 23, 2023 at 12:14:04PM +0700, John Naylor wrote:
> On Tue, May 23, 2023 at 11:26 AM Bruce Momjian  wrote:
> >
> > On Tue, May 23, 2023 at 09:58:30AM +0700, John Naylor wrote:
> > > > Allow ASCII string detection to use vector operations on x86-64 architectures
> > > (John Naylor)
> > > > Allow JSON string processing to use vector operations on x86-64 architectures
> > > (John Naylor)
> > > >
> > > > ARM?
> > >
> > > Arm as well. For anything using 16-byte vectors the two architectures are
> > > equivalently supported. For all the applications, I would just say "vector"
> > > or "SIMD".
> >
> > Okay, I kept "vector".  I don't think moving them into performance makes
> > sense because there I don't think this would impact user behavior or
> > choice, and it can't be controlled.
> 
> Well, these two items were only committed because of measurable speed
> increases, and have zero effect on how developers work with "source code", so
> that's a category error.
> 
> Whether they rise to the significance of warranting inclusion in release notes
> is debatable.

Okay, let's dissect this.  First, I am excited about these features
because I think they show innovation, particularly for high scaling, so
I want to highlight this.

Second, you might be correct that the section is wrong.  I thought of
CPU instructions as something tied to the compiler, so part of the build
process or source code, but the point we should make is that we have
this acceleration, not how it is implemented.  We can move the entire
group to the "General Performance" section, or we can split it out:

Keep in source code:

Add support for SSE2 (Streaming SIMD Extensions 2) vector operations on
x86-64 architectures (John Naylor)

Add support for Advanced SIMD (Single Instruction Multiple Data) (NEON)
instructions on ARM architectures (Nathan Bossart)

move to General Performance:

Allow xid/subxid searches to use vector operations (Nathan Bossart)

Allow ASCII string detection to use vector operations (John Naylor)

and add these to data types:

Allow JSON string parsing to use vector operations (John Naylor)

Allow array searches to use vector operations (John Naylor) 

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Docs: Encourage strong server verification with SCRAM

2023-05-23 Thread Michael Paquier
On Tue, May 23, 2023 at 10:01:03PM -0400, Stephen Frost wrote:
> To the extent that there was an issue when it was implemented ... it's
> now been implemented and so that was presumably overcome (though I don't
> really specifically recall what the issues were there?  Seems like it
> wouldn't matter at this point though), so I don't understand why we
> would wish to revisit it.

Well, there are IMO two separate issues here:
1) libpq sends an empty username, which can be an issue if attempting
to make this layer more pluggable with other things that are more
compliant than PG with the SCRAM exchange (OpenLDAP is one, there
could be others).
2) The backend ignoring the username means that it is not mixed into the
hashes.

Relying on channel binding mitigates the range of problems for 2), but
one thing is that the default sslmode does not enforce an SSL
connection either, though this default has been discussed a lot.  So
I am really wondering whether there wouldn't be some room for a more
compliant, strict HBA option with scram-sha-256 where we'd only allow
UTF-8 compliant strings.  Just some food for thought.

And we could do better about the current state of 1), anyway.
--
Michael




Re: PG 16 draft release notes ready

2023-05-23 Thread John Naylor
On Wed, May 24, 2023 at 11:19 AM Bruce Momjian  wrote:
>
> Second, you might be correct that the section is wrong.  I thought of
> CPU instructions as something tied to the compiler, so part of the build
> process or source code, but the point we should make is that we have
> this acceleration, not how it is implemented.  We can move the entire
> group to the "General Performance" section, or we can split it out:

Splitting out like that seems like a good idea to me.

> Keep in source code:
>
> Add support for SSE2 (Streaming SIMD Extensions 2) vector
operations on
> x86-64 architectures (John Naylor)
>
> Add support for Advanced SIMD (Single Instruction Multiple Data)
(NEON)
> instructions on ARM architectures (Nathan Bossart)
>
> move to General Performance:
>
> Allow xid/subxid searches to use vector operations (Nathan
Bossart)
>
> Allow ASCII string detection to use vector operations (John
Naylor)

(The ASCII part is most relevant for COPY FROM, just in case that matters.)

> and add these to data types:
>
> Allow JSON string parsing to use vector operations (John Naylor)
>
> Allow array searches to use vector operations (John Naylor)

The last one refers to new internal functions, so it could stay in source
code. (Either way, we don't want to imply that arrays of SQL types are
accelerated this way, it's so far only for internal arrays.)

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Support logical replication of DDLs

2023-05-23 Thread vignesh C
On Mon, 22 May 2023 at 11:27, shveta malik  wrote:
>
> On Wed, May 17, 2023 at 4:45 PM vignesh C  wrote:
> >
> > On Wed, 17 May 2023 at 15:41, shveta malik  wrote:
> > >
> > > On Fri, May 12, 2023 at 12:03 PM shveta malik  
> > > wrote:
> > > >
> > > > On Tue, May 9, 2023 at 4:23 PM shveta malik  
> > > > wrote:
> > > > >
> > > > > On Mon, May 8, 2023 at 4:31 PM shveta malik  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, May 8, 2023 at 3:58 PM shveta malik 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Tue, May 2, 2023 at 8:30 AM shveta malik 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Apr 28, 2023 at 5:11 PM Amit Kapila 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > Now, I think we can try to eliminate this entire ObjTree 
> > > > > > > > > machinery and
> > > > > > > > > directly from the JSON blob during deparsing. We have 
> > > > > > > > > previously also
> > > > > > > > > discussed this in an email chain at [1]. I think now the 
> > > > > > > > > functionality
> > > > > > > > > of JSONB has also been improved and we should investigate 
> > > > > > > > > whether it
> > > > > > > > > is feasible to directly use JSONB APIs to form the required 
> > > > > > > > > blob.
> > > > > > > >
> > > > > > > > +1.
> > > > > > > > I will investigate this and will share my findings.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Please find the PoC patch for create-table after object-tree 
> > > > > > > removal.
> > > > > >
> > > > >
> > > >
> > > > Please find the new set of patches attached for object-tree removal.
> > >
> > > Please find the new set of patches for object-tree Removal.  The new
> > > changes are in patch 0008 only. The new changes incorporate the
> > > object-tree removal for 'alter table' command.
> >
>
> Please find the new set of patches for object-tree Removal.  The new
> changes are in patch 0008 only. The new changes address object tree
> removal for below commands.
>
> create sequence
> alter sequence
> alter object owner to
> alter object set schema
> alter object rename
>
> In this patch 0008, ddldeparse.c is now object-tree free for all the
> table related commands. Index related commands are yet to be done.

I found a few comments while making some changes to the patch:
1) Now that objtree is removed, these comments should be modified:
 * Deparse object tree is created by using:
 * a) new_objtree("know contents") where the complete tree content is known or
 * the initial tree content is known.
 * b) new_objtree("") for the syntax where the object tree will be derived
 * based on some conditional checks.
 * c) new_objtree_VA where the complete tree can be derived using some fixed
 * content or by using the initial tree content along with some variable
 * arguments.
 *

 2) isgrant can be removed as it is not used anymore:
+/*
+ * Return the given object type as a string.
+ *
+ * If isgrant is true, then this function is called while deparsing GRANT
+ * statement and some object names are replaced.
+ */
+const char *
+stringify_objtype(ObjectType objtype, bool isgrant)
+{
+   switch (objtype)
+   {
+   case OBJECT_TABLE:
+   return "TABLE";
+   default:
+   elog(ERROR, "unsupported object type %d", objtype);
+   }
+
+   return "???";   /* keep compiler quiet */
+}

3) This statement is not handled currently; it should be implemented:
"alter table all in tablespace tbs1 set tablespace"

4) The pub_ddl column is selected as the 7th column, so it should be 7 instead of 9 here:
@@ -6405,6 +6418,8 @@ describePublications(const char *pattern)
printTableAddCell(&cont, PQgetvalue(res, i, 4), false, false);
printTableAddCell(&cont, PQgetvalue(res, i, 5), false, false);
printTableAddCell(&cont, PQgetvalue(res, i, 6), false, false);
+   if (has_pubddl)
+   printTableAddCell(&cont, PQgetvalue(res, i,
9), false, false);
if (has_pubtruncate)
printTableAddCell(&cont, PQgetvalue(res, i,
7), false, false);
if (has_pubviaroot)

Regards,
Vignesh




RE: walsender performance regression due to logical decoding on standby changes

2023-05-23 Thread Zhijie Hou (Fujitsu)
On Tuesday, May 23, 2023 1:53 AM Andres Freund  wrote:
> On 2023-05-22 12:15:07 +, Zhijie Hou (Fujitsu) wrote:
> > About "a backend doing logical decoding", do you mean the case when a
> user
> > start a backend and invoke pg_logical_slot_get_changes() to do the logical
> > decoding ? If so, it seems the logical decoding in a backend won't be waked
> up
> > by startup process because the backend won't be registered as a walsender
> so
> > the backend won't be found in WalSndWakeup().
> 
> I meant logical decoding happening inside a walsender instance.
> 
> 
> > Or do you mean the deadlock between the real logical walsender and startup
> > process ? (I might miss something) I think the logical decoding doesn't lock
> > the target user relation when decoding because it normally can get the
> > needed information from WAL.
> 
> It does lock catalog tables briefly. There's no guarantee that such locks are
> released immediately. I forgot the details, but IIRC there's some outfuncs
> (enum?) that intentionally delay releasing locks till transaction commit.

Thanks for the explanation !

I understand that the startup process can take a lock on the catalog (when
replaying records), which may conflict with the lock taken in the walsender.

But in the walsender, I think we only start a transaction after entering
ReorderBufferProcessTXN(), and that transaction, along with its locks, is
released soon after the decoded transaction's data has been processed and
output (as the comment in ReorderBufferProcessTXN() says: "all locks acquired
in here to be released, not reassigned to the parent and we do not want any
database access have persistent effects.").

Besides, while processing and outputting the decoded transaction, the
walsender does not wait to be woken up by the startup process (e.g. in
WalSndWaitForWal()); it only waits while the data is being sent to the
subscriber. So it seems the lock conflict here won't cause a deadlock for
now, although it may become a risk if we change this logic later. Sorry if I
missed something, and thanks again for your patience in explaining.

Best Regards,
Hou zj






Re: ERROR: no relation entry for relid 6

2023-05-23 Thread Richard Guo
On Wed, May 24, 2023 at 2:48 AM Tom Lane  wrote:

> Richard Guo  writes:
> > Considering that clone clauses should always be outer-join clauses not
> > filter clauses, I'm wondering if we can add an additional check for that
> > in RINFO_IS_PUSHED_DOWN, something like
>
> >  #define RINFO_IS_PUSHED_DOWN(rinfo, joinrelids) \
> > -   ((rinfo)->is_pushed_down || \
> > -!bms_is_subset((rinfo)->required_relids, joinrelids))
> > +   (!((rinfo)->has_clone || ((rinfo)->is_clone)) && \
> > +((rinfo)->is_pushed_down || \
> > + !bms_is_subset((rinfo)->required_relids, joinrelids)))
>
> I don't like that one bit; it seems way too likely to introduce
> bad side-effects elsewhere.


Agreed.  I also do not have too much confidence in it.


> I think the real issue is that "is pushed down" has never been a
> conceptually accurate way of thinking about what
> remove_rel_from_query needs to do here.  Using RINFO_IS_PUSHED_DOWN
> happened to work up to now, but it's an abuse of that macro, and
> changing the macro's behavior isn't the right way to fix it.
>
> Having said that, I'm not sure what is a better way to think about it.
> It seems like our data structure doesn't have a clear tie between
> restrictinfos and their original join level, or at least, to the extent
> that it did the "clone clause" mechanism has broken it.
>
> I wonder if we could do something involving recording the
> rinfo_serial numbers of all the clauses extracted from a particular
> syntactic join level (we could keep this in a bitmapset attached
> to each SpecialJoinInfo, perhaps) and then filtering the joinclauses
> on the basis of serial numbers instead of required_relids.


I think this is a better way to fix the issue.  I went ahead and drafted
a patch as attached.  But I am not sure the patch collects rinfo_serial
numbers from all the places it should.  Should we also collect the
rinfo_serial of quals generated in reconsider_outer_join_clauses()?  I
believe that we do not need to consider the quals from
generate_base_implied_equalities(), since they are all supposed to be
restriction clauses, not join clauses.
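
To make the direction a bit more concrete, the shape of the idea is roughly
as below (just an illustration: the SpecialJoinInfo field name is made up
here, and the details are not necessarily what the attached patch does).
While distributing the quals of a syntactic join level we would remember
their serial numbers:

    sjinfo->ojclause_serials =
        bms_add_member(sjinfo->ojclause_serials, rinfo->rinfo_serial);

and then in remove_rel_from_query() we would decide what to keep based on
those serial numbers rather than on RINFO_IS_PUSHED_DOWN:

    foreach(l, joininfos)
    {
        RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);

        remove_join_clause_from_rels(root, rinfo, rinfo->required_relids);

        if (!bms_is_member(rinfo->rinfo_serial, sjinfo->ojclause_serials))
        {
            /* not a clause of the removed join, keep it as a filter */
            distribute_restrictinfo_to_rels(root, rinfo);
        }
    }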

Thanks
Richard


v1-0001-WIP-Fix-the-outer-join-removal-problem.patch
Description: Binary data


Re: Insertion Sort Improvements

2023-05-23 Thread Benjamin Coutu
> That's worth trying out. It might also then be worth trying to push both 
> unordered values -- the big one up / the small one down. I've seen other 
> implementations do that, but don't remember where, or what it's called.

It is important that we do not do 2 compares to avoid one copy (assignment to a 
temporary), as you did in your patch earlier in this thread, because compares are 
usually pretty costly (also, the two compares are then inlined, bloating the binary).
Assigning a sort tuple to a temporary translates to pretty simple assembly 
code, so my suggested modification should outperform it. It cuts down the cost of 
the inner loop by ca. 40%, judging by the assembly. And it avoids having to 
re-read memory on each comparison, as the temporary can be held in registers.

> "Unbounded" means no bounds check on the array. That's not possible in its 
> current form, so I think you misunderstood something.

Sorry for the confusion. I didn't mean unbounded in the "array bound checking" 
sense, but in the "unrestricted number of loops" sense.

> I only remember implementations tracking loop iterations, not swaps. You'd 
> need evidence that this is better.

Well, the idea was to include the presorted check somehow. Stopping after a 
certain number of iterations is surely safer than stopping after a number 
of swaps, but we would then implicitly also stop our presort check. We could 
change that though: count loop iterations and, on bailout, continue with a pure 
presort check -- not all over again, but from the last position of the insertion 
sort -- by comparing against the maximum value seen during the insertion sort 
(which, since the prefix is kept sorted, is simply its last element). Thoughts?
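
To make this more concrete, here is a very rough, untested sketch of what I have 
in mind. The budget is checked between elements, so whenever we bail out the 
prefix below pm is still fully sorted, and the plain presorted check can simply 
carry on from there (its first comparison is against the last -- i.e. largest -- 
element of that sorted prefix). The names and the limit are placeholders:

#define LIMIT_ITERATIONS 30	/* to be determined empirically */

	int	iterations = 0;

	for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
		 pm += ST_POINTER_STEP)
	{
		if (iterations >= LIMIT_ITERATIONS)
		{
			/* bail out of sorting, but keep scanning for presortedness */
			for (; pm < a + n * ST_POINTER_STEP; pm += ST_POINTER_STEP)
				if (DO_COMPARE(pm - ST_POINTER_STEP, pm) > 0)
					goto quicksort;	/* not sorted, but more sorted than before */
			return;		/* sorted prefix plus ascending rest: we are done */
		}

		for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
			 pl -= ST_POINTER_STEP)
		{
			DO_SWAP(pl, pl - ST_POINTER_STEP);
			iterations++;
		}
	}
	return;		/* insertion sort handled the whole array */

quicksort:
	/* ... continue with the regular quicksort path ... */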

> An important part not mentioned yet: This might only be worth doing if the 
> previous recursion level performed no swaps during partitioning and the 
> current pivot candidates are ordered.

Agreed.

> It's currently 7, but should really be something like 10. A script that 
> repeats tests for, say, 7 through 18 should show a concave-up shape in the 
> times. The bottom of the bowl should shift to higher values, and that minimum 
> is what should be compared.

Yeah, as alluded to before, it should be closer to 10 nowadays.
In any case it should be changed at least from 7 to 8, because then we could 
discard the additional check for n > 7 in the quicksort code path (see 
/src/include/lib/sort_template.h#L322). Currently we check n < 7 and a few 
lines down we check for n > 7; if we check n < 8 for insertion sort, the 
second check becomes obsolete.
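
Roughly, just to illustrate the point (structure paraphrased from memory, not 
an actual patch):

-	if (n < 7)
+	if (n < 8)
 	{
 		/* ... insertion sort, then return ... */
 	}
 	/* ... presorted check ... */
 	...
-	if (n > 7)
-	{
-		/* ... median-of-three pivot selection ... */
-	}
+	/* n >= 8 is guaranteed here now, so the median-of-three pivot
+	 * selection can run unconditionally */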

Benjamin Coutu
http://www.zeyos.com




Re: Large files for relations

2023-05-23 Thread Peter Eisentraut

On 24.05.23 02:34, Thomas Munro wrote:

> Thanks all for the feedback.  It was a nice idea and it *almost*
> works, but it seems like we just can't drop segmented mode.  And the
> automatic transition schemes I showed don't make much sense without
> that goal.
>
> What I'm hearing is that something simple like this might be more acceptable:
>
> * initdb --rel-segsize (cf --wal-segsize), default unchanged


makes sense


> * pg_upgrade would convert if source and target don't match


This would be good, but it could also be an optional or later feature.

Maybe that should be a different mode, like 
--copy-and-adjust-as-necessary, so that users who unwittingly used 
incompatible initdb options would have to opt into what would presumably 
be slower than plain --copy, rather than being surprised by it.



> I would probably also leave out those Windows file API changes, too.
> --rel-segsize would simply refuse larger sizes until someone does the
> work on that platform, to keep the initial proposal small.


Those changes from off_t to pgoff_t?  Yes, it would be good to do 
without those.  Apart from the practical problems that have been brought 
up, this was a major annoyance with the proposed patch set IMO.



> I would probably leave the experimental copy_on_write() ideas out too,
> for separate discussion in a separate proposal.


right