Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE

2024-03-28 Thread Masahiko Sawada
Hi,

Thank you for the patch!

On Mon, Jul 3, 2023 at 12:12 AM vignesh C  wrote:
>
> Hi,
>
> Improved tab completion for "ALTER DEFAULT PRIVILEGES" and "ALTER TABLE":
> 1) The GRANT, REVOKE and FOR USER keywords were not displayed in tab
> completion of ALTER DEFAULT PRIVILEGES, as in the statements below:
> ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC;
> ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC;
> ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables FROM dba1;

+1

>
> 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA
> public FOR " as in the statement below:
> ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT
> ON TABLES TO PUBLIC;

Since there is no difference between FOR USER and FOR ROLE, I'm not sure we
really want to support both in tab-completion.
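
(For reference, the two spellings are interchangeable in the grammar; a
sketch, with a hypothetical role name:

ALTER DEFAULT PRIVILEGES FOR ROLE alice GRANT SELECT ON TABLES TO PUBLIC;
ALTER DEFAULT PRIVILEGES FOR USER alice GRANT SELECT ON TABLES TO PUBLIC;

Both statements set the same default ACL entry.)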

>
> 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES
> REVOKE " like in below statement:
> alter default privileges revoke grant option for select ON tables FROM dba1;

+1. But the v3 patch doesn't cover the following case:

=# alter default privileges for role masahiko revoke [tab]
ALL         CREATE      DELETE      EXECUTE     INSERT      MAINTAIN
REFERENCES  SELECT      TRIGGER     TRUNCATE    UPDATE      USAGE

And it doesn't cover MAINTAIN either:

=# alter default privileges revoke [tab]
ALL               DELETE            GRANT OPTION FOR  REFERENCES        TRIGGER           UPDATE
CREATE            EXECUTE           INSERT            SELECT            TRUNCATE          USAGE

The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE,
but we handle such cases in the GRANT and REVOKE part:

(around L3958)
/*
 * With ALTER DEFAULT PRIVILEGES, restrict completion to grantable
 * privileges (can't grant roles)
 */
if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES"))
COMPLETE_WITH("SELECT", "INSERT", "UPDATE",
  "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER",
  "CREATE", "EXECUTE", "USAGE", "MAINTAIN", "ALL");

Also, I think we can support WITH GRANT OPTION too. For example,

=# alter default privileges for role masahiko grant all on tables to
public [tab]

It's already supported in the GRANT statement.
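
For reference, the target statement such a completion would lead to looks
like this (a sketch; the grantee role name is hypothetical, since grant
options cannot be granted to PUBLIC):

ALTER DEFAULT PRIVILEGES FOR ROLE masahiko
    GRANT ALL ON TABLES TO alice WITH GRANT OPTION;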

>
> 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN
> column-name SET" like in:
> ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text;
>

+1. The patch looks good to me, so pushed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-28 Thread Thomas Munro
> v12

Hi all,

I didn't review the patch but one thing jumped out: I don't think it's
OK to hold a spinlock while (1) looping over an array of backends and
(2) making system calls (SetLatch()).
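
For the record, the usual way around this is to only collect the latch
pointers while the spinlock is held and call SetLatch() after releasing it.
A minimal sketch (the struct and array names are hypothetical):

Latch  *latches[MAX_BACKENDS];
int     nlatches;

/* Under the spinlock: copy pointers only, no system calls. */
SpinLockAcquire(&state->mutex);
for (int i = 0; i < state->nbackends; i++)
    latches[i] = state->backends[i].latch;
nlatches = state->nbackends;
SpinLockRelease(&state->mutex);

/* Now it is safe to make the SetLatch() system calls. */
for (int i = 0; i < nlatches; i++)
    if (latches[i] != NULL)
        SetLatch(latches[i]);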




Re: Add new error_action COPY ON_ERROR "log"

2024-03-28 Thread torikoshia

On 2024-03-26 17:16, Masahiko Sawada wrote:

On Tue, Mar 26, 2024 at 3:04 PM Bharath Rupireddy
 wrote:


On Tue, Mar 26, 2024 at 9:56 AM Masahiko Sawada 
 wrote:

>
> > > errmsg("data type incompatibility at line %llu for column %s: \"%s\"",
>
> > > I guess it would be better to make the log message clearer to convey
> > > what we did for the malformed row. For example, how about something
> > > like "skipping row due to data type incompatibility at line %llu for
> > > column %s: \"%s\""?
> >
> > The summary message which gets printed at the end says that "NOTICE:
> > 6 rows were skipped due to data type incompatibility". Isn't this
> > enough? If someone is using ON_ERROR 'ignore', it's quite natural that
> > such rows get skipped softly and the summary message can help them,
> > no?
>
> I think that in the main log message we should mention what happened
> (or is happening) or what we did (or are doing). If the message "data
> type incompatibility ..." was in the DETAIL message with the main
> message saying something like "skipping row at line %llu for column
> %s: ...", it would make sense to me. But the current message seems not
> to be clear to me and consistent with other NOTICE messages. Also, the
> last summary line would not be written if the user cancelled, and
> someone other than person who used ON_ERROR 'ignore' might check the
> server logs later.

Agree. I changed the NOTICE message to what you've suggested. Thanks.



Thank you for updating the patch! Looks good to me.

Please find the attached patch. I've made some changes for the
documentation and the commit message. I'll push it, barring any
objections.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


Thanks!

Attached patch fixes the doc, but I'm wondering whether it might be
better to modify the code to prohibit abbreviation of the value.


When seeing a query which abbreviates the ON_ERROR value, I feel it's not
obvious what happens, compared to other options which tolerate
abbreviation of the value, such as FREEZE or HEADER.


  COPY t1 FROM stdin WITH (ON_ERROR);

What do you think?

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Add new error_action COPY ON_ERROR "log"

2024-03-28 Thread Bharath Rupireddy
On Thu, Mar 28, 2024 at 1:43 PM torikoshia  wrote:
>
> Attached patch fixes the doc,

May I know which patch you are referring to? And, what do you mean by
"fixes the doc"?

> but I'm wondering perhaps it might be
> better to modify the codes to prohibit abbreviation of the value.

Please help me understand the meaning here.

> When seeing the query which abbreviates ON_ERROR value, I feel it's not
> obvious what happens compared to other options which tolerates
> abbreviation of the value such as FREEZE or HEADER.
>
>COPY t1 FROM stdin WITH (ON_ERROR);
>
> What do you think?

So, do you mean to prohibit ON_ERROR being specified without any value
like in COPY t1 FROM stdin WITH (ON_ERROR);? If yes, I think all the
other options do allow that [1].

Even if we were to do something like this, shall we discuss this separately?

Having said that, what do you think of the v13 patch posted upthread?

[1]
postgres=# COPY t1 FROM stdin WITH (
DEFAULT         ESCAPE          FORCE_QUOTE     HEADER          QUOTE
DELIMITER       FORCE_NOT_NULL  FORMAT          NULL
ENCODING        FORCE_NULL      FREEZE          ON_ERROR

postgres=# COPY t1 FROM stdin WITH ( QUOTE );
ERROR:  relation "t1" does not exist
postgres=# COPY t1 FROM stdin WITH ( DEFAULT );
ERROR:  relation "t1" does not exist
postgres=# COPY t1 FROM stdin WITH ( ENCODING );
ERROR:  relation "t1" does not exist

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Can't find not null constraint, but \d+ shows that

2024-03-28 Thread Tender Wang
Alvaro Herrera  于2024年3月26日周二 23:25写道:

> On 2024-Mar-26, Tender Wang wrote:
>
> > postgres=# CREATE TABLE t1(c0 int, c1 int);
> > postgres=# ALTER TABLE  t1 ADD CONSTRAINT Q PRIMARY KEY(c0, c1);
> > postgres=# ALTER TABLE  t1 DROP c1;
> >
> > postgres=# ALTER TABLE  t1  ALTER c0 DROP NOT NULL;
> > ERROR:  could not find not-null constraint on column "c0", relation "t1"
>
> Ooh, hah, what happens here is that we drop the PK constraint
> indirectly, so we only go via doDeletion rather than the tablecmds.c
> code, so we don't check the attnotnull flags that the PK was protecting.
>
> > The attached patch is my workaround solution.  Look forward your apply.
>
> Yeah, this is not a very good approach -- I think you're just guessing
> that the column is marked NOT NULL because a PK was dropped in the
> past -- but really what this catalog state is, is corrupted contents
> because the PK drop was mishandled.  At least in theory there are other
> ways to drop a constraint other than dropping one of its columns (for
> example, maybe this could happen if you drop a collation that the PK
> depends on).  The right approach is to ensure that the PK drop always
> does the dance that ATExecDropConstraint does.  A good fix probably just
> moves some code from dropconstraint_internal to RemoveConstraintById.
>

 Should RemoveConstraintById() handle recursion (e.g., for partitioned
tables)? I'm not sure now.
 If we have to handle recursion in RemoveConstraintById(), the function
will look more complicated than before.

-- 
Tender Wang
OpenPie:  https://en.openpie.com/


Re: Combine Prune and Freeze records emitted by vacuum

2024-03-28 Thread Heikki Linnakangas

On 28/03/2024 03:53, Melanie Plageman wrote:

On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote:

One change with this is that live_tuples and many of the other fields are
now again updated, even if the caller doesn't need them. It was hard to skip
them in a way that would save any cycles, with the other refactorings.


I am worried we are writing checks that are going to have to come out of
SELECT queries' bank accounts, but I'll do some benchmarking when we're
all done with major refactoring.


Sounds good, thanks.


+		 *
+		 * Whether we arrive at the dead HOT tuple first here or while
+		 * following a chain below affects whether preceding RECENTLY_DEAD
+		 * tuples in the chain can be removed or not.  Imagine that you
+		 * have a chain with two tuples: RECENTLY_DEAD -> DEAD.  If we
+		 * reach the RECENTLY_DEAD tuple first, the chain-following logic
+		 * will find the DEAD tuple and conclude that both tuples are in
+		 * fact dead and can be removed.  But if we reach the DEAD tuple
+		 * at the end of the chain first, when we reach the RECENTLY_DEAD
+		 * tuple later, we will not follow the chain because the DEAD
+		 * TUPLE is already 'marked', and will not remove the
+		 * RECENTLY_DEAD tuple.  This is not a correctness issue, and the
+		 * RECENTLY_DEAD tuple will be removed by a later VACUUM.
 		 */
 		if (!HeapTupleHeaderIsHotUpdated(htup))


Is this intentional? Like would it be correct to remove the
RECENTLY_DEAD tuple during the current vacuum?


Yes, it would be correct. And if we happen to visit the items in 
different order, the RECENTLY_DEAD tuple first, it will get removed.


(This is just based on me reading the code, I'm not sure if I've missed 
something. Would be nice to construct a test case for that and step 
through it with a debugger to really see what happens. But this is not a 
new issue, doesn't need to be part of this patch)



 From c2ee7508456d0e76de985f9997a6840450e342a8 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 28 Mar 2024 00:45:26 +0200
Subject: [PATCH v8 22/22] WIP

- Got rid of all_visible_except_removable again. We're back to roughly
   the same mechanism as on 'master', where the all_visible
   doesn't include LP_DEAD items, but at the end of
   heap_page_prune_and_freeze() when we return to the caller, we clear
   it if there were any LP_DEAD items. I considered calling the
   variable 'all_visible_except_lp_dead', which would be more accurate,
   but it's also very long.


It's not longer than all_visible_except_removable. I would be happy to keep
it more exact, but I'm also okay with just all_visible.


Ok, let's make it 'all_visible_except_lp_dead' for clarity.


What's this "cutoffs TODO"?


+ * cutoffs TODO
+ *
   * presult contains output parameters needed by callers such as the number of
   * tuples removed and the number of line pointers newly marked LP_DEAD.
   * heap_page_prune_and_freeze() is responsible for initializing it.


All the other arguments are documented in the comment, except 'cutoffs'.


@@ -728,10 +832,12 @@ htsv_get_valid_status(int status)
   * DEAD, our visibility test is just too coarse to detect it.
   *
   * In general, pruning must never leave behind a DEAD tuple that still has
- * tuple storage.  VACUUM isn't prepared to deal with that case.  That's why
+ * tuple storage.  VACUUM isn't prepared to deal with that case (FIXME: it no longer cares, right?).
+ * That's why
   * VACUUM prunes the same heap page a second time (without dropping its lock
   * in the interim) when it sees a newly DEAD tuple that we initially saw as
- * in-progress.  Retrying pruning like this can only happen when an inserting
+ * in-progress (FIXME: Really? Does it still do that?).


Yea, I think all that is no longer true. I missed this comment back
then.


Committed a patch to remove it.


@@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum,
 	 * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
 	 * but we can't advance past anything else.  We have to make sure that
 	 * we don't miss any DEAD tuples, since DEAD tuples that still have
-	 * tuple storage after pruning will confuse VACUUM.
+	 * tuple storage after pruning will confuse VACUUM (FIXME: not anymore
+	 * I think?).


Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples
with storage after pruning anymore?


I meant that it won't confuse VACUUM anymore. lazy_scan_prune() doesn't 
loop through the items on the page checking their visibility anymore.


Hmm, one confusion remains though: In the 2nd 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-28 Thread John Naylor
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada  wrote:
>
> Pushed the refactoring patch.
>
> I've attached the rebased vacuum improvement patch for cfbot. I
> mentioned in the commit message that this patch eliminates the 1GB
> limitation.
>
> I think the patch is in good shape. Do you have other comments or
> suggestions, John?

I'll do another pass tomorrow, but first I wanted to get in another
slightly-challenging in-situ test. On my humble laptop, I can still
fit a table large enough to cause PG16 to choke on multiple rounds of
index cleanup:

drop table if exists test;
create unlogged table test (a int, b uuid) with (autovacuum_enabled=false);

insert into test (a,b) select i, gen_random_uuid() from
generate_series(1,1000*1000*1000) i;

create index on test (a);
create index on test (b);

delete from test;

vacuum (verbose, truncate off, parallel 2) test;

INFO:  vacuuming "john.public.test"
INFO:  launched 1 parallel vacuum worker for index vacuuming (planned: 1)
INFO:  finished vacuuming "john.public.test": index scans: 1
pages: 0 removed, 6369427 remain, 6369427 scanned (100.00% of total)
tuples: 97174 removed, 2826 remain, 0 are dead but not yet removable
tuples missed: 2826 dead from 18 pages not removed due to cleanup lock
contention
removable cutoff: 771, which was 0 XIDs old when operation ended
new relfrozenxid: 767, which is 4 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 6369409 pages from table (100.00% of total) had
97174 dead item identifiers removed
index "test_a_idx": pages: 2741898 in total, 2741825 newly deleted,
2741825 currently deleted, 0 reusable
index "test_b_idx": pages: 3850387 in total, 3842056 newly deleted,
3842056 currently deleted, 0 reusable
avg read rate: 159.740 MB/s, avg write rate: 161.726 MB/s
buffer usage: 26367981 hits, 14958634 misses, 15144601 dirtied
WAL usage: 3 records, 1 full page images, 2050 bytes
system usage: CPU: user: 151.89 s, system: 193.54 s, elapsed: 731.59 s

Watching pg_stat_progress_vacuum, dead_tuple_bytes got up to 398458880.

About the "tuples missed" -- I didn't expect contention during this
test. I believe that's completely unrelated behavior, but wanted to
mention it anyway, since I found it confusing.




Re: Can't find not null constraint, but \d+ shows that

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Tender Wang wrote:

> Should RemoveConstraintById() handle recursion (e.g., for partitioned
> tables)? I'm not sure now.
> If we have to handle recursion in RemoveConstraintById(), the function
> will look more complicated than before.

No -- this function handles just a single constraint, as identified by
OID.  The recursion is handled by upper layers, which can be either
dependency.c or tablecmds.c.  I think the problem you found is caused by
the fact that I worked with the tablecmds.c recursion and neglected the
one in dependency.c.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No nos atrevemos a muchas cosas porque son difíciles,
pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Jelte Fennema-Nio
On Thu, 28 Mar 2024 at 01:43, David G. Johnston
 wrote:
>
> On Wed, Mar 27, 2024 at 5:17 PM Bruce Momjian  wrote:
>>
>> 

I addressed them all I think. Mostly the small changes that were
suggested, but I rewrote the sentence with "might be discarded". And I
added references to the new GUC in both places suggested by David.
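
For reference, with the GUC disabled the new behavior would look roughly
like this (a sketch, assuming allow_alter_system = off in postgresql.conf
and the error message from the patch):

postgres=# ALTER SYSTEM SET work_mem = '64MB';
ERROR:  ALTER SYSTEM is not allowed in this environment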


v10-0001-Add-allow_alter_system-GUC.patch
Description: Binary data


Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Jelte Fennema-Nio
On Wed, 27 Mar 2024 at 19:46, Alvaro Herrera  wrote:
>
> On 2024-Mar-27, Alvaro Herrera wrote:
>
> > I changed it so that the error messages are returned as translated
> > phrases, and was bothered by the fact that if errors happen repeatedly,
> > the memory for them might be leaked.  Maybe this is fine depending on
> > the caller's memory context, but since it's only at most one string each
> > time, it's quite easy to just keep track of it so that we can release it
> > on the next.
>
> (Actually this sounds clever but fails pretty obviously if the caller
> does free the string, such as in a memory context reset.  So I guess we
> have to just accept the potential leakage.)

Your changes look good, apart from the prverror stuff indeed. If you
remove the prverror stuff again I think this is ready to commit.




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Jelte Fennema-Nio
On Wed, 27 Mar 2024 at 19:27, Alvaro Herrera  wrote:
> I ended up reducing the two PG_TRY blocks to a single one.  I see no
> reason to split them up, and this way it looks more legible.

I definitely agree this looks better. Not sure why I hadn't done that,
maybe it wasn't possible in one of the earlier iterations of the API.




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Jelte Fennema-Nio
On Thu, 28 Mar 2024 at 10:24, Jelte Fennema-Nio  wrote:
> I addressed them all I think. Mostly the small changes that were
> suggested, but I rewrote the sentence with "might be discarded". And I
> added references to the new GUC in both places suggested by David.

Changed the error hint to use "external tool" too. And removed a
duplicate header definition of AllowAlterSystem (I moved it to guc.h
so it was together with other definitions a few patches ago, but
apparently forgot to remove it from guc_tables.h)


v11-0001-Add-allow_alter_system-GUC.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-28 Thread Bertrand Drouvot
Hi,

On Wed, Mar 27, 2024 at 09:00:37PM +0530, Bharath Rupireddy wrote:
> standby for sync slots. 0002 implementing inactive timeout GUC based
> invalidation mechanism.
> 
> Please have a look.

Thanks!

Regarding 0002:

Some testing:

T1 ===

When the slot is invalidated on the primary, the reason is propagated to
the sync slot (if any). That's fine, but we are losing the inactive_since on
the standby:

Primary:

postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from
pg_replication_slots where slot_name='lsub29_slot';
  slot_name  |        inactive_since         | conflicting | invalidation_reason
-------------+-------------------------------+-------------+---------------------
 lsub29_slot | 2024-03-28 08:24:51.672528+00 | f           | inactive_timeout
(1 row)

Standby:

postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from
pg_replication_slots where slot_name='lsub29_slot';
  slot_name  | inactive_since | conflicting | invalidation_reason
-------------+----------------+-------------+---------------------
 lsub29_slot |                | f           | inactive_timeout
(1 row)

I think in this case it should always reflect the value from the primary (so
that one can understand why it is invalidated).

T2 ===

And it is set to a value during promotion:

postgres=# select pg_promote();
 pg_promote
------------
 t
(1 row)

postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from
pg_replication_slots where slot_name='lsub29_slot';
  slot_name  |        inactive_since        | conflicting | invalidation_reason
-------------+------------------------------+-------------+---------------------
 lsub29_slot | 2024-03-28 08:30:11.74505+00 | f           | inactive_timeout
(1 row)

I think when it is invalidated it should always reflect the value from the
primary (so that one can understand why it is invalidated).

T3 ===

As far the slot invalidation on the primary:

postgres=# SELECT * FROM pg_logical_slot_get_changes('lsub29_slot', NULL, NULL,
'include-xids', '0');
ERROR:  cannot acquire invalidated replication slot "lsub29_slot"

Can we make the message more consistent with what can be found in
CreateDecodingContext(), for example?

T4 ===

Also, it looks like querying pg_replication_slots() does not trigger an
invalidation: I think it should if the slot is not invalidated yet (and matches
the invalidation criteria).

Code review:

CR1 ===

+Invalidate replication slots that are inactive for longer than this
+amount of time. If this value is specified without units, it is taken

s/Invalidate/Invalidates/?

Should we mention the relationship with inactive_since?

CR2 ===

+ *
+ * If check_for_invalidation is true, the slot is checked for invalidation
+ * based on replication_slot_inactive_timeout GUC and an error is raised after
+ * making the slot ours.
  */
 void
-ReplicationSlotAcquire(const char *name, bool nowait)
+ReplicationSlotAcquire(const char *name, bool nowait,
+					   bool check_for_invalidation)


s/check_for_invalidation/check_for_timeout_invalidation/?

CR3 ===

+   if (slot->inactive_since == 0 ||
+   replication_slot_inactive_timeout == 0)
+   return false;

Better to test replication_slot_inactive_timeout first? (I mean there is no
point in testing inactive_since if replication_slot_inactive_timeout == 0.)

CR4 ===

+   if (slot->inactive_since > 0 &&
+   replication_slot_inactive_timeout > 0)
+   {

Same.

So, instead of CR3 === and CR4 ===, I wonder if it wouldn't be better to do
something like:

if (replication_slot_inactive_timeout == 0)
return false;
else if (slot->inactive_since > 0)
.
.
.
.
else
return false;

That would avoid checking replication_slot_inactive_timeout and inactive_since
multiple times.
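
For example, something like this (a sketch only; the field and GUC names are
from the patch under review, the rest is hypothetical and assumes the GUC is
in seconds):

if (replication_slot_inactive_timeout == 0)
    return false;
else if (slot->inactive_since > 0)
{
    /* TimestampDifferenceExceeds() takes milliseconds */
    return TimestampDifferenceExceeds(slot->inactive_since,
                                      GetCurrentTimestamp(),
                                      replication_slot_inactive_timeout * 1000);
}
else
    return false;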

CR5 ===

+	 * held to avoid race conditions -- for example the restart_lsn could move
+	 * forward, or the slot could be dropped.

Does the restart_lsn example make sense here?

CR6 ===

+static bool
+InvalidateSlotForInactiveTimeout(ReplicationSlot *slot, bool need_locks)
+{

InvalidatePossiblyInactiveSlot() maybe?

CR7 ===

+   /* Make sure the invalidated state persists across server restart */
+   slot->just_dirtied = true;
+   slot->dirty = true;
+   SpinLockRelease(&slot->mutex);

Maybe we could create a new function, say MarkGivenReplicationSlotDirty(),
with a slot as parameter, that ReplicationSlotMarkDirty() could call too?

Then maybe we could set slot->data.invalidated = RS_INVAL_INACTIVE_TIMEOUT in
InvalidateSlotForInactiveTimeout()? (to avoid multiple
SpinLockAcquire/SpinLockRelease cycles).
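
Something like this, perhaps (a sketch; only the suggested function name is
from the review, the body is hypothetical):

static void
MarkGivenReplicationSlotDirty(ReplicationSlot *slot)
{
    SpinLockAcquire(&slot->mutex);
    slot->just_dirtied = true;
    slot->dirty = true;
    SpinLockRelease(&slot->mutex);
}

void
ReplicationSlotMarkDirty(void)
{
    MarkGivenReplicationSlotDirty(MyReplicationSlot);
}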

CR8 ===

+	if (persist_state)
+	{
+		char		path[MAXPGPATH];
+
+		sprintf(path, "pg_replslot/%s", NameStr(slot->data.name));
+		SaveSlotToPath(slot, path, ERROR);
+	}

Maybe we could create a new function say Give

Re: Table AM Interface Enhancements

2024-03-28 Thread Pavel Borisov
Hi, Alexander!
Thank you for working on these patches.
On Thu, 28 Mar 2024 at 02:14, Alexander Korotkov 
wrote:

> Hi, Pavel!
>
> Thank you for your feedback.  The revised patch set is attached.
>
> I found that vacuum.c has a lot of heap-specific code.  Thus, it
> should be OK for analyze.c to keep heap-specific code.  Therefore, I
> rolled back the movement of functions between files.  That leads to a
> smaller patch, easier to review.
>
I agree with you.
And with the changes remaining in heapam_handler.c I suppose we can also
remove the includes introduced:

#include 
#include "utils/sampling.h"
#include "utils/spccache.h"

On Wed, Mar 27, 2024 at 2:52 PM Pavel Borisov 
> wrote:
> >> The revised rest of the patchset is attached.
> >> 0001 (was 0006) – I prefer the definition of AcquireSampleRowsFunc to
> >> stay in vacuum.h.  If we move it to sampling.h then we would have to
> >> add there includes to define Relation, HeapTuple etc.  I'd like to
> >> avoid this kind of change.  Also, I've deleted
> >> table_beginscan_analyze(), because it's only called from
> >> tableam-specific AcquireSampleRowsFunc.  Also I put some comments to
> >> heapam_scan_analyze_next_block() and heapam_scan_analyze_next_tuple()
> >> given that there are now no relevant comments for them in tableam.h.
> >> I've removed some redundancies from acquire_sample_rows().  And added
> >> comments to AcquireSampleRowsFunc based on what we have in FDW docs
> >> for this function.  Did some small edits as well.  As you suggested,
> >> turned back declarations for acquire_sample_rows() and compare_rows().
> >
> >
> > In my comment in the thread I was not thinking about returning the old
> name acquire_sample_rows(), it was only about the declarations and the
> order of functions to be one code block. To me heapam_acquire_sample_rows()
> looks better for a name of heap implementation of *AcquireSampleRowsFunc().
> I suggest returning the name heapam_acquire_sample_rows() from v4. Sorry
> for the confusion in this.
>
> I found that the function name acquire_sample_rows is referenced in
> quite several places in the source code.  So, I would prefer to save
> the old name to keep the changes minimal.
>
The full list shows only a couple of changes in analyze.c and several
comments elsewhere.

contrib/postgres_fdw/postgres_fdw.c: * of the relation.  Same algorithm as in acquire_sample_rows in
src/backend/access/heap/vacuumlazy.c: * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
src/backend/access/heap/vacuumlazy.c: * The logic here is a bit simpler than acquire_sample_rows(), as
src/backend/access/heap/vacuumlazy.c: * what acquire_sample_rows() does.
src/backend/access/heap/vacuumlazy.c: * acquire_sample_rows() does, so be consistent.
src/backend/access/heap/vacuumlazy.c: * acquire_sample_rows() will recognize the same LP_DEAD items as dead
src/backend/commands/analyze.c:static int acquire_sample_rows(Relation onerel, int elevel,
src/backend/commands/analyze.c: acquirefunc = acquire_sample_rows;
src/backend/commands/analyze.c: * acquire_sample_rows -- acquire a random sample of rows from the table
src/backend/commands/analyze.c:acquire_sample_rows(Relation onerel, int elevel,
src/backend/commands/analyze.c: * This has the same API as acquire_sample_rows, except that rows are
src/backend/commands/analyze.c: acquirefunc = acquire_sample_rows;

My point for renaming is to make it clear that it's a heap implementation of
acquire_sample_rows, which could be useful if heap implementations of table
AM methods are later reworked into a separate place. But I'm also ok with
the existing naming.


> > The changed type of static function that always returned true for heap
> looks good to me:
> > static void heapam_scan_analyze_next_block
> >
> > The same is for removing the comparison of always true "block_accepted"
> in (heapam_)acquire_sample_rows()
>
> Ok!
>
> > Removing table_beginscan_analyze and call scan_begin() is not in the
> same style as other table_beginscan_* functions. Though this is not a
> change in functionality, I'd leave this part as it was in v4.
>
> With the patch, this method doesn't have usages outside of table am.
> I don't think we should keep a method, which doesn't have clear
> external usage patterns.  But I agree that starting a scan with
> heap_beginscan() and ending with table_endscan() is not correct.  Now
> ending this scan with heap_endscan().
>
Good!


> > Also, a comment about it was introduced in v5:
> >
> > src/backend/access/heap/heapam_handler.c: * with table_beginscan_analyze()
>
> Corrected.

> For comments I'd propose:
> > %s/In addition, store estimates/In addition, a function should store
> estimates/g
> > %s/zerp/zero/g
>
> Fixed.
>
> >> 0002 (was 0007) – I've turned the redundant "if", which you've pointed
> >> out, into an assert.  Also, added some comments, most n

Re: Synchronizing slots from primary to standby

2024-03-28 Thread Bertrand Drouvot
Hi,

On Thu, Mar 28, 2024 at 04:38:19AM +, Zhijie Hou (Fujitsu) wrote:
> Hi,
> 
> When analyzing one BF error[1], we found an issue with slotsync: Since we
> don't perform logical decoding for the synced slots when syncing the
> lsn/xmin of a slot, no logical snapshots will be serialized to disk. So,
> when a user starts to use these synced slots after promotion, it needs to
> re-build the consistent snapshot from the restart_lsn if the WAL
> (xl_running_xacts) at the restart_lsn position indicates that there are
> running transactions. This however could cause data from before the
> consistent point to be missed[2].

I see, nice catch and explanation, thanks!

> This issue doesn't exist on the primary because the snapshot at restart_lsn
> should have been serialized to disk (SnapBuildProcessRunningXacts ->
> SnapBuildSerialize), so even if the logical decoding restarts, it can find
> consistent snapshot immediately at restart_lsn.

Right.

> To fix this, we could use fast forward logical decoding to advance the
> synced slot's lsn/xmin when syncing these values, instead of directly
> updating the slot's info. This way, the snapshot will be serialized to
> disk when decoding. If we cannot reach the consistent point at the remote
> restart_lsn, the slot is marked as temp and will be persisted once it
> reaches the consistent point. I am still analyzing the fix and will share
> once ready.

Thanks! I'm wondering about the performance impact (even in fast_forward mode);
it might be worth keeping an eye on it.

Should we create a 17 open item [1]?

[1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Fix parallel vacuum buffer usage reporting

2024-03-28 Thread Anthonin Bonnefoy
Hi,

Thanks for the review.

On Thu, Mar 28, 2024 at 4:07 AM Masahiko Sawada 
wrote:

> As for the proposed patch, the following part should handle the temp
> tables too:
>

True, I've missed the local blocks. I've updated the patch:
- read_rate and write_rate now include local block usage
- I've added a specific output for reporting local blocks instead of
combining them in the same output.

Regards,
Anthonin


v2-0001-Fix-parallel-vacuum-buffer-usage-reporting.patch
Description: Binary data


Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Jelte Fennema-Nio wrote:

> Your changes look good, apart from the prverror stuff indeed. If you
> remove the prverror stuff again I think this is ready to commit.

Great, thanks for looking.  Pushed now, I'll be closing the commitfest
entry shortly.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine."(Julien PUYDT)




Re: Streaming I/O, vectored I/O (WIP)

2024-03-28 Thread Thomas Munro
Small bug fix: the condition in the final test at the end of
read_stream_look_ahead() wasn't quite right.  In general when looking
ahead, we don't need to start a read just because the pending read
would bring us up to stream->distance if submitted now (we'd prefer to
build it all the way up to size io_combine_limit if we can), but if
that condition is met AND we have nothing pinned yet, then there is no
chance for the read to grow bigger by a pinned buffer being consumed.
Fixed, comment updated.


v12-0001-Provide-vectored-variant-of-ReadBuffer.patch
Description: Binary data


v12-0002-Provide-API-for-streaming-relation-data.patch
Description: Binary data


v12-0003-Use-streaming-I-O-in-pg_prewarm.patch
Description: Binary data


Re: AIX support

2024-03-28 Thread Sriram RK
Hi Team,



We are setting up the build environment and trying to build the source, and
also trying to analyze the assert from the AIX point of view.

Also, we would like to know if we can access the buildfarm (Power machines)
to run any of the specific tests to hit this assert.

Thanks & regards,
Sriram.

  > From: Sriram RK 
  > Date: Thursday, 21 March 2024 at 10:05 PM
  > To: Tom Lane t...@sss.pgh.pa.us, Alvaro Herrera 

  > Cc: pgsql-hack...@postgresql.org 

  > Subject: Re: AIX support
  > Thanks, Tom and Alvaro, for the info.
  > We shall look into to details and get back.



Re: Experiments with Postgres and SSL

2024-03-28 Thread Matthias van de Meent
On Tue, 5 Mar 2024 at 15:08, Heikki Linnakangas  wrote:
>
> I hope I didn't joggle your elbow reviewing this, Jacob, but I spent
> some time rebasing it and fixing various little things:

With the recent changes to backend startup committed by you, this
patchset has gotten major apply failures.

Could you provide a new version of the patchset so that it can be
reviewed in the context of current HEAD?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Streaming I/O, vectored I/O (WIP)

2024-03-28 Thread Thomas Munro
On Fri, Mar 29, 2024 at 12:06 AM Thomas Munro  wrote:
> Small bug fix: the condition in the final test at the end of
> read_stream_look_ahead() wasn't quite right.  In general when looking
> ahead, we don't need to start a read just because the pending read
> would bring us up to stream->distance if submitted now (we'd prefer to
> build it all the way up to size io_combine_limit if we can), but if
> that condition is met AND we have nothing pinned yet, then there is no
> chance for the read to grow bigger by a pinned buffer being consumed.
> Fixed, comment updated.

Oops, I sent the wrong/unfixed version.  This version has the fix
described above.


v13-0001-Provide-vectored-variant-of-ReadBuffer.patch
Description: Binary data


v13-0002-Provide-API-for-streaming-relation-data.patch
Description: Binary data


v13-0003-Use-streaming-I-O-in-pg_prewarm.patch
Description: Binary data


Re: Fix some resources leaks (src/bin/pg_basebackup/pg_createsubscriber.c)

2024-03-28 Thread Ranier Vilela
On Wed, Mar 27, 2024, at 11:08 PM, Euler Taveira wrote:

> On Wed, Mar 27, 2024, at 8:50 PM, Ranier Vilela wrote:
>
> Coverity has some reports in the new code
> pg_createsubscriber.c
> I think that Coverity is right.
>
>
> It depends on your "right" definition.
>
I do not think so.

> If your program execution is ephemeral
> and the leak is just before exiting, do you think it's worth it?
>
I think it's worth it.
Even an ephemeral execution can allocate resources (especially on Windows)
and block those resources for other programs, and on a highly loaded server
with very few free resources, releasing resources as quickly as possible
makes a difference.


> 1.
> CID 1542682: (#1 of 1): Resource leak (RESOURCE_LEAK)
> leaked_storage: Variable buf going out of scope leaks the storage it
> points to.
>
>
> It will exit in the next instruction.
>
Yes, by exit() call function.


>
> 2.
> CID 1542704: (#1 of 1): Resource leak (RESOURCE_LEAK)
> leaked_storage: Variable conn going out of scope leaks the storage it
> points to.
>
>
> The connect_database function whose exit_on_error is false is used in 2
> routines:
>
Calling connect_database with false, for example:
conn = connect_database(dbinfo[i].pubconninfo, false);

If the connection status != CONNECTION_OK, the function returns NULL,
but does not free the connection struct, its memory and data.

In a loop over thousands of specified databases, this would quickly
run into problems, especially on Windows.
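
A minimal sketch of the fix being argued for (connect_database() is the
function in pg_createsubscriber.c; the body here is abbreviated and partly
hypothetical):

static PGconn *
connect_database(const char *conninfo, bool exit_on_error)
{
    PGconn     *conn = PQconnectdb(conninfo);

    if (PQstatus(conn) != CONNECTION_OK)
    {
        pg_log_error("connection to database failed: %s",
                     PQerrorMessage(conn));
        PQfinish(conn);         /* free the PGconn instead of leaking it */
        if (exit_on_error)
            exit(1);
        return NULL;
    }
    return conn;
}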


> * cleanup_objects_atexit: that's about to exit;
> * drop_primary_replication_slot: that will execute a few routines before
> exiting.
>
> 3.
> CID 1542691: (#1 of 1): Resource leak (RESOURCE_LEAK)
> leaked_storage: Variable str going out of scope leaks the storage it
> points to.
>
>
> It will exit in the next instruction.
>
Yes, by exit() call function.

About the exit() function:

From "deallocating-memory-when-using-exit1-c":

"exit *does not* call the destructors of any stack based objects so if
those objects have allocated any memory internally then yes that memory
will be leaked."

From reference/cstdlib/exit/:

"Note that objects with automatic storage are not destroyed by calling exit
(C++)."

I think that relying on the exit() function for cleanup is extremely
fragile, especially on Windows.


> Having said that, applying this patch is just a matter of style.
>
IMO, a better and more correct style.

best regards,
Ranier Vilela


Re: Synchronizing slots from primary to standby

2024-03-28 Thread Amit Kapila
On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu)
 wrote:
>
> When analyzing one BF error[1], we found an issue with slotsync: Since we
> don't perform logical decoding for the synced slots when syncing the
> lsn/xmin of a slot, no logical snapshots will be serialized to disk. So,
> when a user starts to use these synced slots after promotion, it needs to
> re-build the consistent snapshot from the restart_lsn if the WAL
> (xl_running_xacts) at the restart_lsn position indicates that there are
> running transactions. This however could cause data from before the
> consistent point to be missed[2].
>
> This issue doesn't exist on the primary because the snapshot at restart_lsn
> should have been serialized to disk (SnapBuildProcessRunningXacts ->
> SnapBuildSerialize), so even if the logical decoding restarts, it can find
> consistent snapshot immediately at restart_lsn.
>
> To fix this, we could use fast forward logical decoding to advance the
> synced slot's lsn/xmin when syncing these values, instead of directly
> updating the slot's info. This way, the snapshot will be serialized to
> disk when decoding. If we cannot reach the consistent point at the remote
> restart_lsn, the slot is marked as temp and will be persisted once it
> reaches the consistent point. I am still analyzing the fix and will share
> once ready.
>
>

Yes, we can use this, but one thing to note is that
CreateDecodingContext() will expect that the slot's database and the
current database are the same. I think the reason for that is that we need
to check system tables of the current database while decoding and sending
data to the output_plugin, which won't be a requirement for the
fast_forward case. So, we need to skip that check in fast_forward mode.
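
Concretely, the check in CreateDecodingContext() could be relaxed along
these lines (a sketch; the existing error text is paraphrased from
logical.c):

/*
 * Do not allow decoding a slot from another database, except when merely
 * fast-forwarding, which looks at no catalog contents.
 */
if (!fast_forward && slot->data.database != MyDatabaseId)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("replication slot \"%s\" was not created in this database",
                    NameStr(slot->data.name))));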

Next, I was thinking about the case of the first time updating the
restart and confirmed_flush LSN while syncing the slots. I think we
can keep the current logic as it is based on the following analysis.

For each logical slot, cases possible on the primary:
1. The restart_lsn doesn't have a serialized snapshot and hasn't yet
reached the consistent point.
2. The restart_lsn doesn't have a serialized snapshot but has reached
a consistent point.
3. The restart_lsn has a serialized snapshot which means it has
reached a consistent point as well.

Considering we keep the logic to reserve initial WAL positions the
same as the current (Reserve WAL for the currently active local slot
using the specified WAL location (restart_lsn). If the given WAL
location has been removed, reserve WAL using the oldest existing WAL
segment.), I could think of the below scenarios:
A. For 1, we shouldn't sync the slot as it still wouldn't have been
marked persistent on the primary.
B. For 2, we would sync the slot.
   B1. If remote_restart_lsn >= local_restart_lsn, then advance the
   slot by calling pg_logical_replication_slot_advance().
      B11. If we reach the consistent point, then it should be okay
      because after promotion as well we should reach the consistent
      point.
         B111. But again, is it possible that there is some xact that
         comes before the consistent_point on the primary and the same
         is after the consistent_point on the standby? This shouldn't
         matter as we will start decoding transactions after
         confirmed_flush_lsn, which would be the same on primary and
         standby.
      B22. If we haven't reached the consistent_point, then we won't
      mark the slot as persistent, and at the next sync we will do the
      same till it reaches the consistent_point. At that time, the
      situation will be similar to B11.
   B2. If remote_restart_lsn < local_restart_lsn, then we will wait for
   the next sync cycle and keep the slot as temporary. Once, in the next
   or some consecutive sync cycle, we reach the condition
   remote_restart_lsn >= local_restart_lsn, we will proceed to advance
   the slot and we should have the same behavior as B1.
C. For 3, we would sync the slot, but the behavior should be the same as B.

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: Vectored I/O in bulk_write.c

2024-03-28 Thread Thomas Munro
Then I would make the trivial change to respect the new
io_combine_limit GUC that I'm gearing up to commit in another thread.
As attached.
From 7993cede8939cad9172867ccc690a44ea25d1ad6 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 29 Mar 2024 00:22:53 +1300
Subject: [PATCH] fixup: respect io_combine_limit in bulk_write.c

---
 src/backend/storage/smgr/bulk_write.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
index 848c3054f5..612a9a23b3 100644
--- a/src/backend/storage/smgr/bulk_write.c
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -45,12 +45,6 @@
 
 #define MAX_PENDING_WRITES XLR_MAX_BLOCK_ID
 
-/*
- * How many blocks to send to smgrwritev() at a time.  Arbitrary value for
- * now.
- */
-#define MAX_BLOCKS_PER_WRITE ((128 * 1024) / BLCKSZ)
-
 static const PGIOAlignedBlock zero_buffer = {{0}}; /* worth BLCKSZ */
 
 typedef struct PendingWrite
@@ -232,7 +226,7 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 	for (int i = 0; i < npending; i++)
 	{
 		Page		page;
-		const void *pages[MAX_BLOCKS_PER_WRITE];
+		const void *pages[MAX_IO_COMBINE_LIMIT];
 		BlockNumber blkno;
 		int			nblocks;
 		int			max_nblocks;
@@ -266,14 +260,14 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 				 * We're overwriting.  Clamp at the existing size, because we
 				 * can't mix writing and extending in a single operation.
 				 */
-				max_nblocks = Min(lengthof(pages),
+				max_nblocks = Min(io_combine_limit,
 								  bulkstate->pages_written - blkno);
 			}
 			else
 			{
 				/* We're extending. */
 				Assert(blkno == bulkstate->pages_written);
-				max_nblocks = lengthof(pages);
+				max_nblocks = io_combine_limit;
 			}
 
 			/* Find as many consecutive blocks as we can. */
-- 
2.39.3 (Apple Git-146)



Re: Synchronizing slots from primary to standby

2024-03-28 Thread Amit Kapila
On Thu, Mar 28, 2024 at 3:34 PM Bertrand Drouvot
 wrote:
>
> On Thu, Mar 28, 2024 at 04:38:19AM +, Zhijie Hou (Fujitsu) wrote:
>
> > To fix this, we could use fast forward logical decoding to advance the
> > synced slot's lsn/xmin when syncing these values, instead of directly
> > updating the slot's info. This way, the snapshot will be serialized to
> > disk when decoding. If we cannot reach the consistent point at the
> > remote restart_lsn, the slot is marked as temp and will be persisted
> > once it reaches the consistent point. I am still analyzing the fix and
> > will share once ready.
>
> Thanks! I'm wondering about the performance impact (even in fast_forward
> mode); it might be worth keeping an eye on it.
>

True, we can consider performance but correctness should be a
priority, and can we think of a better way to fix this issue?

> Should we create a 17 open item [1]?
>
> [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
>

Yes, we can do that.

-- 
With Regards,
Amit Kapila.




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
Hm, indri failed:

ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new 
-Wendif-labels -Wmissing-format-attribute -Wcast-function-type 
-Wformat-security -fno-strict-aliasing -fwrapv 
-Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -g -O2 
-fno-common -Werror  -fvisibility=hidden -bundle -o dblink.dylib  dblink.o 
-L../../src/port -L../../src/common -L../../src/interfaces/libpq -lpq -isysroot 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk
  -L/opt/local/libexec/llvm-15/lib -L/opt/local/lib -L/opt/local/lib 
-L/opt/local/lib  -L/opt/local/lib -Wl,-dead_strip_dylibs  -Werror  
-fvisibility=hidden -bundle_loader ../../src/backend/postgres

Undefined symbols for architecture arm64:
  "_libintl_gettext", referenced from:
  _libpqsrv_cancel in dblink.o
  _libpqsrv_cancel in dblink.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[1]: *** [dblink.dylib] Error 1
make: *** [all-dblink-recurse] Error 2

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Robert Haas
On Thu, Mar 28, 2024 at 5:42 AM Jelte Fennema-Nio  wrote:
> On Thu, 28 Mar 2024 at 10:24, Jelte Fennema-Nio  wrote:
> > I addressed them all I think. Mostly the small changes that were
> > suggested, but I rewrote the sentence with "might be discarded". And I
> > added references to the new GUC in both places suggested by David.
>
> Changed the error hint to use "external tool" too. And removed a
> duplicate header definition of AllowAlterSystem (I moved it to guc.h
> so it was together with other definitions a few patches ago, but
> apparently forgot to remove it from guc_tables.h)

I disagree with a lot of these changes. I think the old version was
mostly better. But I can live with a lot of it if it makes other
people happy. However:

+Which might result in unintended behavior, such as the external tool
+discarding the change at some later point in time when it updates the
+configuration.

This is not OK from a grammatical point of view. You can't just start
a sentence with "which" like this. You could replace "Which" with
"This", though.

+ if (!AllowAlterSystem)
+ {
+
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("ALTER SYSTEM is not allowed in this environment"),
+ errhint("Global configuration changes should be made using an
external tool, not by using ALTER SYSTEM.")));
+ }

The extra blank line should go. The brackets should go. And I think
the errhint should go, too, because the errhint implies that we know
why the user chose to set allow_alter_system=false. There's no reason
for this message to be opinionated about that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Support logical replication of DDLs

2024-03-28 Thread Andrey M. Borodin



> On 18 Jul 2023, at 12:09, Zhijie Hou (Fujitsu)  wrote:
> 
> Here is the POC patch(0004) for the second approach

Hi everyone!

This thread is registered in the CF [0] but has not been active since 2023.
Is anyone working on this? I understand that this would be a nice feature.
Should we move it to the next CF or withdraw the CF entry?

Thanks!


[0] https://commitfest.postgresql.org/47/3595/



elog/ereport VS misleading backtrace_function function address

2024-03-28 Thread Jakub Wartak
Hi -hackers,

While chasing some other bug I've learned that backtrace_functions
might be misleading with top elog/ereport() address.

Reproducer:

# using Tom's reproducer on master:
wget 
https://www.postgresql.org/message-id/attachment/112394/ri-collation-bug-example.sql
echo '' >> ri-collation-bug-example.sql
echo '\errverbose' >> ri-collation-bug-example.sql
-- based on grepping the source code locations many others could go in here:
psql -p 5437 -c "alter system set backtrace_functions =
'RI_FKey_cascade_del,get_collation_isdeterministic';"
psql -p 5437 -c "select pg_reload_conf();"
dropdb -p 5437 test1 ; createdb -p 5437 test1 ; psql -p 5437 -d test1
-f ri-collation-bug-example.sql

gives (note "get_collation_isdeterministic"):
psql:ri-collation-bug-example.sql:42: ERROR:  cache lookup failed
for collation 0
psql:ri-collation-bug-example.sql:44: error: ERROR:  XX000: cache
lookup failed for collation 0
LOCATION:  get_collation_isdeterministic, lsyscache.c:1088

and in log note the 0x14c6bb:
2024-03-27 14:39:13.065 CET [12899] postgres@test1 ERROR:  cache
lookup failed for collation 0
2024-03-27 14:39:13.065 CET [12899] postgres@test1 BACKTRACE:
postgres: 16/main: postgres test1 [local] DELETE(+0x14c6bb)
[0x5565c5a9c6bb]
postgres: 16/main: postgres test1 [local]
DELETE(RI_FKey_cascade_del+0x323) [0x5565c5ec0973]
postgres: 16/main: postgres test1 [local] DELETE(+0x2d401f)
[0x5565c5c2401f]
postgres: 16/main: postgres test1 [local] DELETE(+0x2d5a04)
[0x5565c5c25a04]
postgres: 16/main: postgres test1 [local]
DELETE(AfterTriggerEndQuery+0x78) [0x5565c5c2a918]
[..]
2024-03-27 14:39:13.065 CET [12899] postgres@test1 STATEMENT:  delete
from revisions where id='5c617ce7-688d-4bea-9d66-c0f0ebc635da';

based on \errverbose it is OK (error matches the code, thanks to
Alvaro for this hint):

bool
get_collation_isdeterministic(Oid colloid)
{
	HeapTuple	tp;
	Form_pg_collation colltup;
	bool		result;

	tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(colloid));
	if (!HeapTupleIsValid(tp))
		elog(ERROR, "cache lookup failed for collation %u", colloid);	/* line 1088 */
[..]


but based on the backtrace address 0x14c6bb (!) it resolves
differently, which seems highly misleading (note the
"get_language_name.cold"):

$ addr2line -e /usr/lib/postgresql/16/bin/postgres -a -f 0x14c6bb
0x0014c6bb
get_language_name.cold
./build/src/backend/utils/cache/./build/../src/backend/utils/cache/lsyscache.c:1181

When I disassemble get_collation_isdeterministic() (and I know the
name from \errverbose):

Dump of assembler code for function get_collation_isdeterministic:
Address range 0x5c7680 to 0x5c76c1:
   0x005c7680 <+0>: push   %rbp
[..]
   0x005c7694 <+20>:call   0x5d63b0 
   0x005c7699 <+25>:test   %rax,%rax
   0x005c769c <+28>:je 0x14c686

   0x005c76a2 <+34>:mov%rax,%rdi
[..]
   0x005c76bf <+63>:leave
   0x005c76c0 <+64>:ret
Address range 0x14c686 to 0x14c6bb:
   0x0014c686 <-4698106>:   xor%esi,%esi
   0x0014c688 <-4698104>:   mov$0x15,%edi
   0x0014c68d <-4698099>:   call   0x14ec86 
   0x0014c692 <-4698094>:   mov%r12d,%esi
   0x0014c695 <-4698091>:   lea0x5028dc(%rip),%rdi
   # 0x64ef78
   0x0014c69c <-4698084>:   xor%eax,%eax
   0x0014c69e <-4698082>:   call   0x5df320 
   0x0014c6a3 <-4698077>:   lea0x6311a6(%rip),%rdx
   # 0x77d850 <__func__.34>
   0x0014c6aa <-4698070>:   mov$0x440,%esi
   0x0014c6af <-4698065>:   lea0x630c8a(%rip),%rdi
   # 0x77d340
   0x0014c6b6 <-4698058>:   call   0x5df100 
   << NOTE the exact 0x14c6bb is even missing here(!)

notice the interesting thing here: according to GDB
get_collation_isdeterministic() is @ 0x5c7680 + jump to 0x14c686
 till 0x14c6bb (but without it)
without any stack setup for that .cold. But backtrace() just captured
the elog/ereport (cold) at the end of 0x14c6bb instead, so if I take
that exact address from backtrace_functions output as it is
("DELETE(+0x14c6bb)"), it also shows WRONG function (just like
addr2line):

(gdb) disassemble 0x14c6bb
Dump of assembler code for function get_language_name:
Address range 0x5c7780 to 0x5c77ee:
[..]
   0x005c77ed <+109>:   ret
Address range 0x14c6bb to 0x14c6f0:
   0x0014c6bb <-4698309>:   xor%esi,%esi
[..]
   0x0014c6eb <-4698261>:   call   0x5df100 
End of assembler dump.

so this is wrong (as we are failing in "get_collation_isdeterministic",
not in "get_language_name").

I was very confused at the beginning with the main question being: why
are we compiling elog/ereport() so that it is incompatible with
backtrace? When looking for it I've found two threads [1] by David
that were ac

Re: Support logical replication of DDLs

2024-03-28 Thread Amit Kapila
On Thu, Mar 28, 2024 at 5:31 PM Andrey M. Borodin  wrote:
>
> This thread is registered on CF [0] but is not active since 2023. Is anyone 
> working on this? I understand that this is a nice feature. Should we move it 
> to next CF or withdraw CF entry?
>

At this stage, we should close it as either Returned with Feedback or
Withdrawn, as this requires a lot of work.

-- 
With Regards,
Amit Kapila.




Re: Can't find not null constraint, but \d+ shows that

2024-03-28 Thread Tender Wang
Alvaro Herrera  于2024年3月28日周四 17:18写道:

> On 2024-Mar-28, Tender Wang wrote:
>
> > Should RemoveConstraintById() handle recursion (e.g., for partitioned
> > tables)? I'm not sure now.
> > If we have to handle recursion in RemoveConstraintById(), the function
> > will look more complicated than before.
>
> No -- this function handles just a single constraint, as identified by
> OID.  The recursion is handled by upper layers, which can be either
> dependency.c or tablecmds.c.  I think the problem you found is caused by
> the fact that I worked with the tablecmds.c recursion and neglected the
> one in dependency.c.
>

Indeed.

create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
alter table skip_wal_skip_rewrite_index alter c type varchar(20);

The above SQL needs attnotnull to be true when re-adding the index, but
RemoveConstraintById() can hardly recognize this scenario, as far as I know.
We should re-set attnotnull to true before re-adding the index. I added a
new AT_PASS in the attached patch.
Any thoughts?
--
Tender Wang
OpenPie:  https://en.openpie.com/


v4-0001-Fix-pg_attribute-attnotnull-not-reset-when-droppe.patch
Description: Binary data


Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2024-03-28 Thread Andrey M. Borodin



> On 8 Aug 2023, at 12:31, John Naylor  wrote:
> 
> > > Also the shared counter is the cause of the slowdown, but not the reason 
> > > for the numeric limit.
> >
> > Isn't it both? typedef Oid is unsigned int = 2^32, and according to 
> > GetNewOidWithIndex() logic if we exhaust the whole OID space it will hang 
> > indefinitely which has the same semantics as "being impossible"/permanent 
> > hang (?)
> 
> Looking again, I'm thinking the OID type size is more relevant for the first 
> paragraph, and the shared/global aspect is more relevant for the second.
> 
> The last issue is how to separate the notes at the bottom, since there are 
> now two topics.

Jakub, do you have plans to address this feedback? Is the CF entry still 
relevant?

Thanks!


Best regards, Andrey Borodin.



Re: Various small doc improvements; plpgsql, schemas, permissions, oidvector

2024-03-28 Thread Andrey M. Borodin



> On 1 Dec 2023, at 19:00, Karl O. Pinc  wrote:
> 
>  I won't be able to submit
> new patches based on reviews for 2 weeks.

Hi everyone!

Is there any work going on? Maybe someone is interested in moving this forward?

Thanks!


Best regards, Andrey Borodin.



Re: Use virtual tuple slot for Unique node

2024-03-28 Thread Andrey M. Borodin



> On 29 Oct 2023, at 21:30, Denis Smirnov  wrote:
> 
> I have taken a look at this discussion and at the code, and I am confused
> about how we choose the tuple table slot (TTS) type in PG.

After an offline discussion with Denis, we decided to withdraw this patch
from the CF for now. If anyone is willing to resume work on this, please
register a new entry in the next commitfest.
Thanks!


Best regards, Andrey Borodin.



Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Alvaro Herrera wrote:

> Undefined symbols for architecture arm64:
>   "_libintl_gettext", referenced from:
>   _libpqsrv_cancel in dblink.o
>   _libpqsrv_cancel in dblink.o
> ld: symbol(s) not found for architecture arm64
> clang: error: linker command failed with exit code 1 (use -v to see 
> invocation)
> make[1]: *** [dblink.dylib] Error 1
> make: *** [all-dblink-recurse] Error 2

I just removed the _() from the new function.  There's not much point in
wasting more time on this, given that contrib doesn't have translation
support anyway, and we're not using this in libpqwalreceiver.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Crear es tan difícil como ser libre" (Elsa Triolet)




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Jelte Fennema-Nio
On Thu, 28 Mar 2024 at 12:57, Robert Haas  wrote:
> I disagree with a lot of these changes. I think the old version was
> mostly better. But I can live with a lot of it if it makes other
> people happy.

I'd have been fine with many of the previous versions of the docs too.
(I'm not a native English speaker though, so that might be part of it.)

> However:

Attached is a patch with your last bit of feedback addressed.


v12-0001-Add-allow_alter_system-GUC.patch
Description: Binary data


Re: Various small doc improvements; plpgsql, schemas, permissions, oidvector

2024-03-28 Thread Robert Treat
On Thu, Mar 28, 2024 at 8:16 AM Andrey M. Borodin  wrote:
> > On 1 Dec 2023, at 19:00, Karl O. Pinc  wrote:
> >
> >  I won't be able to submit
> > new patches based on reviews for 2 weeks.
>
> Hi everyone!
>
> Is there any work going on? Maybe someone is interested in moving this
> forward?
>
> Thanks!
>

Hey Andrey,

I spoke with Karl briefly on this and he is working on getting an
updated patch together. The work now involves incorporating feedback
and some rebasing, but hopefully we will see something in the next few
days.

Robert Treat
https://xzilla.net




Re: Add new error_action COPY ON_ERROR "log"

2024-03-28 Thread torikoshia

On 2024-03-28 17:27, Bharath Rupireddy wrote:
> On Thu, Mar 28, 2024 at 1:43 PM torikoshia  wrote:
>>
>> Attached patch fixes the doc,
>
> May I know which patch you are referring to? And, what do you mean by
> "fixes the doc"?

Ugh, I replied to the wrong thread.
Sorry for making you confused and please ignore it.

>> but I'm wondering whether it might be
>> better to modify the code to prohibit abbreviation of the value.
>
> Please help me understand the meaning here.
>
>> When seeing a query which abbreviates the ON_ERROR value, I feel it's not
>> obvious what happens compared to other options which tolerate
>> abbreviation of the value such as FREEZE or HEADER.
>>
>>    COPY t1 FROM stdin WITH (ON_ERROR);
>>
>> What do you think?
>
> So, do you mean to prohibit ON_ERROR being specified without any value
> like in COPY t1 FROM stdin WITH (ON_ERROR);? If yes, I think all the
> other options do allow that [1].
>
> Even if we were to do something like this, shall we discuss this
> separately?
>
> Having said that, what do you think of the v13 patch posted upthread?
>
> [1]
> postgres=# COPY t1 FROM stdin WITH (
> DEFAULT         ESCAPE          FORCE_QUOTE     HEADER          QUOTE
> DELIMITER       FORCE_NOT_NULL  FORMAT          NULL
> ENCODING        FORCE_NULL      FREEZE          ON_ERROR
>
> postgres=# COPY t1 FROM stdin WITH ( QUOTE );
> ERROR:  relation "t1" does not exist
> postgres=# COPY t1 FROM stdin WITH ( DEFAULT );
> ERROR:  relation "t1" does not exist
> postgres=# COPY t1 FROM stdin WITH ( ENCODING );
> ERROR:  relation "t1" does not exist



--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Synchronizing slots from primary to standby

2024-03-28 Thread Bertrand Drouvot
Hi,

On Thu, Mar 28, 2024 at 05:05:35PM +0530, Amit Kapila wrote:
> On Thu, Mar 28, 2024 at 3:34 PM Bertrand Drouvot
>  wrote:
> >
> > On Thu, Mar 28, 2024 at 04:38:19AM +, Zhijie Hou (Fujitsu) wrote:
> >
> > > To fix this, we could use the fast forward logical decoding to advance 
> > > the synced
> > > slot's lsn/xmin when syncing these values instead of directly updating the
> > > slot's info. This way, the snapshot will be serialized to disk when 
> > > decoding.
> > > If we could not reach to the consistent point at the remote restart_lsn, 
> > > the
> > > slot is marked as temp and will be persisted once it reaches the 
> > > consistent
> > > point. I am still analyzing the fix and will share once ready.
> >
> > Thanks! I'm wondering about the performance impact (even in fast_forward 
> > mode),
> > might be worth keeping an eye on it.
> >
> 
> True, we can consider performance but correctness should be a
> priority,

Yeah of course.

> and can we think of a better way to fix this issue?

I'll keep you posted if there is one that I can think of.

> > Should we create a 17 open item [1]?
> >
> > [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
> >
> 
> Yes, we can do that.

done.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Experiments with Postgres and SSL

2024-03-28 Thread Heikki Linnakangas

On 28/03/2024 13:15, Matthias van de Meent wrote:

On Tue, 5 Mar 2024 at 15:08, Heikki Linnakangas  wrote:


I hope I didn't joggle your elbow reviewing this, Jacob, but I spent
some time rebasing and fixing various little things:


With the recent changes to backend startup committed by you, this
patchset has gotten major apply failures.

Could you provide a new version of the patchset so that it can be
reviewed in the context of current HEAD?


Here you are.

--
Heikki Linnakangas
Neon (https://neon.tech)
From 83484696e470ab130bcd3038f0e28d494065071a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 5 Mar 2024 15:40:30 +0200
Subject: [PATCH v9 1/6] Move Kerberos module

So that we can reuse it in new tests.
---
 src/test/kerberos/t/001_auth.pl   | 174 ++--
 src/test/perl/PostgreSQL/Test/Kerberos.pm | 229 ++
 2 files changed, 240 insertions(+), 163 deletions(-)
 create mode 100644 src/test/perl/PostgreSQL/Test/Kerberos.pm

diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index e51e87d0a2e..d4f1ec58092 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -21,6 +21,7 @@ use strict;
 use warnings FATAL => 'all';
 use PostgreSQL::Test::Utils;
 use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Kerberos;
 use Test::More;
 use Time::HiRes qw(usleep);
 
@@ -34,177 +35,27 @@ elsif (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\bkerberos\b/)
 	  'Potentially unsafe test GSSAPI/Kerberos not enabled in PG_TEST_EXTRA';
 }
 
-my ($krb5_bin_dir, $krb5_sbin_dir);
-
-if ($^O eq 'darwin' && -d "/opt/homebrew")
-{
-	# typical paths for Homebrew on ARM
-	$krb5_bin_dir = '/opt/homebrew/opt/krb5/bin';
-	$krb5_sbin_dir = '/opt/homebrew/opt/krb5/sbin';
-}
-elsif ($^O eq 'darwin')
-{
-	# typical paths for Homebrew on Intel
-	$krb5_bin_dir = '/usr/local/opt/krb5/bin';
-	$krb5_sbin_dir = '/usr/local/opt/krb5/sbin';
-}
-elsif ($^O eq 'freebsd')
-{
-	$krb5_bin_dir = '/usr/local/bin';
-	$krb5_sbin_dir = '/usr/local/sbin';
-}
-elsif ($^O eq 'linux')
-{
-	$krb5_sbin_dir = '/usr/sbin';
-}
-
-my $krb5_config = 'krb5-config';
-my $kinit = 'kinit';
-my $klist = 'klist';
-my $kdb5_util = 'kdb5_util';
-my $kadmin_local = 'kadmin.local';
-my $krb5kdc = 'krb5kdc';
-
-if ($krb5_bin_dir && -d $krb5_bin_dir)
-{
-	$krb5_config = $krb5_bin_dir . '/' . $krb5_config;
-	$kinit = $krb5_bin_dir . '/' . $kinit;
-	$klist = $krb5_bin_dir . '/' . $klist;
-}
-if ($krb5_sbin_dir && -d $krb5_sbin_dir)
-{
-	$kdb5_util = $krb5_sbin_dir . '/' . $kdb5_util;
-	$kadmin_local = $krb5_sbin_dir . '/' . $kadmin_local;
-	$krb5kdc = $krb5_sbin_dir . '/' . $krb5kdc;
-}
-
-my $host = 'auth-test-localhost.postgresql.example.com';
-my $hostaddr = '127.0.0.1';
-my $realm = 'EXAMPLE.COM';
-
-my $krb5_conf = "${PostgreSQL::Test::Utils::tmp_check}/krb5.conf";
-my $kdc_conf = "${PostgreSQL::Test::Utils::tmp_check}/kdc.conf";
-my $krb5_cache = "${PostgreSQL::Test::Utils::tmp_check}/krb5cc";
-my $krb5_log = "${PostgreSQL::Test::Utils::log_path}/krb5libs.log";
-my $kdc_log = "${PostgreSQL::Test::Utils::log_path}/krb5kdc.log";
-my $kdc_port = PostgreSQL::Test::Cluster::get_free_port();
-my $kdc_datadir = "${PostgreSQL::Test::Utils::tmp_check}/krb5kdc";
-my $kdc_pidfile = "${PostgreSQL::Test::Utils::tmp_check}/krb5kdc.pid";
-my $keytab = "${PostgreSQL::Test::Utils::tmp_check}/krb5.keytab";
-
 my $pgpass = "${PostgreSQL::Test::Utils::tmp_check}/.pgpass";
 
 my $dbname = 'postgres';
 my $username = 'test1';
 my $application = '001_auth.pl';
 
-note "setting up Kerberos";
-
-my ($stdout, $krb5_version);
-run_log [ $krb5_config, '--version' ], '>', \$stdout
-  or BAIL_OUT("could not execute krb5-config");
-BAIL_OUT("Heimdal is not supported") if $stdout =~ m/heimdal/;
-$stdout =~ m/Kerberos 5 release ([0-9]+\.[0-9]+)/
-  or BAIL_OUT("could not get Kerberos version");
-$krb5_version = $1;
-
 # Construct a pgpass file to make sure we don't use it
 append_to_file($pgpass, '*:*:*:*:abc123');
 
 chmod 0600, $pgpass or die $!;
 
-# Build the krb5.conf to use.
-#
-# Explicitly specify the default (test) realm and the KDC for
-# that realm to avoid the Kerberos library trying to look up
-# that information in DNS, and also because we're using a
-# non-standard KDC port.
-#
-# Also explicitly disable DNS lookups since this isn't really
-# our domain and we shouldn't be causing random DNS requests
-# to be sent out (not to mention that broken DNS environments
-# can cause the tests to take an extra long time and timeout).
-#
-# Reverse DNS is explicitly disabled to avoid any issue with a
-# captive portal or other cases where the reverse DNS succeeds
-# and the Kerberos library uses that as the canonical name of
-# the host and then tries to acquire a cross-realm ticket.
-append_to_file(
-	$krb5_conf,
-	qq![logging]
-default = FILE:$krb5_log
-kdc = FILE:$kdc_log
-
-[libdefaults]
-dns_lookup_realm = false
-dns_lookup_kdc = false
-default_realm = $realm
-forwardable = false
-rdns = fals

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-28 Thread torikoshia

On 2024-03-28 10:20, Masahiko Sawada wrote:

> Hi,
>
> On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada  wrote:
>>
>> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
>>  wrote:
>> >
>> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia  wrote:
>> > > On 2024-01-18 10:10, jian he wrote:
>> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
>> > > > wrote:
>> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
>> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
>> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
>> > > >> > You will need a separate parameter anyway to specify the destination
>> > > >> > of "log", unless "none" became an illegal table name when I wasn't
>> > > >> > looking.  I don't buy that one parameter that has some special values
>> > > >> > while other values could be names will be a good design.  Moreover,
>> > > >> > what if we want to support (say) log-to-file along with log-to-table?
>> > > >> > Trying to distinguish a file name from a table name without any other
>> > > >> > context seems impossible.
>> > > >>
>> > > >> I've been thinking we can add more values to this option to log errors
>> > > >> not only to the server logs but also to the error table (not sure
>> > > >> details but I imagined an error table is created for each table on
>> > > >> error), without an additional option for the destination name. The
>> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
>> > > >>
>> > > >
>> > > > another idea:
>> > > > on_error {error|ignore|other_future_option}
>> > > > if not specified then by default ERROR.
>> > > > You can also specify ERROR or IGNORE for now.
>> > > >
>> > > > I agree, the parameter "error_action" is better than "location".
>> > >
>> > > I'm not sure whether error_action or on_error is better, but either way
>> > > "error_action error" and "on_error error" seems a bit odd to me.
>> > > I feel "stop" is better for both cases as Tom suggested.
>> >
>> > OK.  What about this?
>> > on_error {stop|ignore|other_future_option}
>> > where other_future_option might be compound like "file 'copy.log'" or
>> > "table 'copy_log'".
>>
>> +1
>>
>
> I realized that ON_ERROR syntax synopsis in the documentation is not
> correct. The option doesn't require the value to be quoted and the
> value can be omitted. The attached patch fixes it.
>
> Regards,

Thanks!

Attached patch fixes the doc, but I'm wondering whether it might be
better to modify the code to prohibit abbreviation of the value.

When seeing a query which abbreviates the ON_ERROR value, I feel it's not
obvious what happens compared to other options which tolerate
abbreviation of the value such as FREEZE or HEADER.

  COPY t1 FROM stdin WITH (ON_ERROR);

What do you think?

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Robert Haas
On Wed, Mar 27, 2024 at 6:24 PM Bruce Momjian  wrote:
> Please ignore my complaints, and my apologies.
>
> As far as the GUC change, let's just be careful since we have a bad
> history of pushing things near the end that we regret.  I am not saying
> that would be this feature, but let's be careful.

Even if what I had pushed was the patch itself, so what? This patch
has been sitting around, largely unchanged, for six months. There has
been plenty of time for wordsmithing the documentation, yet nobody got
interested in doing it until I expressed interest in committing the
patch. Meanwhile, there are over 100 other patches that no committer
is paying attention to right now, some of which could probably really
benefit from some wordsmithing of the documentation. It drives me
insane that this is the patch everyone is getting worked up about.
This is a 27-line code change that does something many people want,
and we're acting like the future of the project depends on it. Both I
and others have committed thousands of lines of new code over the last
few months that could easily be full of bugs that will eat your data
without nearly the scrutiny that this patch is getting.

To be honest, I had every intention of pushing the main patch right
after I pushed that preliminary patch, but I stopped because I saw you
had emailed the thread. I'm pretty sure that I would have missed the
fact that the documentation hadn't been properly updated for the fact
that the sense of the GUC had been inverted. That would have been
embarrassing, and I would have had to push a follow-up commit to fix
that. But no real harm would have been done, except that somebody
surely would have seized on my mistake as proof that this patch wasn't
ready to be committed and that I was being irresponsible and
inconsiderate by pushing forward with it, which is a garbage argument.
Committers make mistakes like that all the time, every week, even
every day. It doesn't mean that they're bad committers, and it doesn't
mean that the patches suck. Some of the patches that get committed do
suck, but it's not because there are a few words wrong in the
documentation.

Let's please, please stop pretending like this patch is somehow
deserving of special scrutiny. There's barely even anything to
scrutinize. It's literally if (!variable) ereport(...) plus some
boilerplate and docs. I entirely agree that, because of the risk of
someone filing a bogus CVE, the docs do need to be carefully worded.
But, I'm going to be honest: I feel completely confident in my ability
to review a patch well enough to know whether the documentation for a
single test-and-ereport has been done up to project standard. It
saddens and frustrates me that you don't seem to agree.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-28 Thread Alexander Korotkov
On Thu, Mar 28, 2024 at 9:37 AM Thomas Munro  wrote:
>
> > v12
>
> Hi all,
>
> I didn't review the patch but one thing jumped out: I don't think it's
> OK to hold a spinlock while (1) looping over an array of backends and
> (2) making system calls (SetLatch()).

Good catch, thank you.

Fixed along with other issues spotted by Alexander Lakhin.
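For the archives, the safe pattern looks roughly like this (a minimal sketch:
WaitLSNState, MAX_WAITERS and the waiters array are invented names here,
while slock_t, PGPROC, SpinLockAcquire()/SpinLockRelease() and SetLatch()
are the real primitives):

#define MAX_WAITERS 128         /* placeholder bound */

typedef struct WaitLSNState
{
    slock_t     mutex;
    int         numWaiters;
    PGPROC     *waiters[MAX_WAITERS];
} WaitLSNState;

static void
wakeup_waiters_sketch(WaitLSNState *state)
{
    Latch      *latches[MAX_WAITERS];
    int         nlatches = 0;

    /* while holding the spinlock, only remember the latches */
    SpinLockAcquire(&state->mutex);
    for (int i = 0; i < state->numWaiters; i++)
        latches[nlatches++] = &state->waiters[i]->procLatch;
    state->numWaiters = 0;
    SpinLockRelease(&state->mutex);

    /* the system calls happen only after the lock is released */
    for (int i = 0; i < nlatches; i++)
        SetLatch(latches[i]);
}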

--
Regards,
Alexander Korotkov


v13-0001-Implement-pg_wait_for_wal_replay_lsn-stored-proc.patch
Description: Binary data


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-28 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 9:38 PM torikoshia  wrote:
>
> On 2024-03-28 10:20, Masahiko Sawada wrote:
> > Hi,
> >
> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> > wrote:
> >>
> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >>  wrote:
> >> >
> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia  
> >> > wrote:
> >> > > On 2024-01-18 10:10, jian he wrote:
> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> >> > > > 
> >> > > > wrote:
> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of 
> >> > > >> > "error")?
> >> > > >> > You will need a separate parameter anyway to specify the 
> >> > > >> > destination
> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> >> > > >> > looking.  I don't buy that one parameter that has some special 
> >> > > >> > values
> >> > > >> > while other values could be names will be a good design.  
> >> > > >> > Moreover,
> >> > > >> > what if we want to support (say) log-to-file along with 
> >> > > >> > log-to-table?
> >> > > >> > Trying to distinguish a file name from a table name without any 
> >> > > >> > other
> >> > > >> > context seems impossible.
> >> > > >>
> >> > > >> I've been thinking we can add more values to this option to log 
> >> > > >> errors
> >> > > >> not only to the server logs but also to the error table (not sure
> >> > > >> details but I imagined an error table is created for each table on
> >> > > >> error), without an additional option for the destination name. The
> >> > > >> values would be like error_action 
> >> > > >> {error|ignore|save-logs|save-table}.
> >> > > >>
> >> > > >
> >> > > > another idea:
> >> > > > on_error {error|ignore|other_future_option}
> >> > > > if not specified then by default ERROR.
> >> > > > You can also specify ERROR or IGNORE for now.
> >> > > >
> >> > > > I agree, the parameter "error_action" is better than "location".
> >> > >
> >> > > I'm not sure whether error_action or on_error is better, but either way
> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >
> >> > OK.  What about this?
> >> > on_error {stop|ignore|other_future_option}
> >> > where other_future_option might be compound like "file 'copy.log'" or
> >> > "table 'copy_log'".
> >>
> >> +1
> >>
> >
> > I realized that ON_ERROR syntax synopsis in the documentation is not
> > correct. The option doesn't require the value to be quoted and the
> > value can be omitted. The attached patch fixes it.
> >
> > Regards,
>
> Thanks!
>
> Attached patch fixes the doc, but I'm wondering whether it might be
> better to modify the code to prohibit abbreviation of the value.
>
> When seeing a query which abbreviates the ON_ERROR value, I feel it's not
> obvious what happens compared to other options which tolerate
> abbreviation of the value such as FREEZE or HEADER.
>
>COPY t1 FROM stdin WITH (ON_ERROR);
>
> What do you think?

Indeed. Looking at options of other commands such as VACUUM and
EXPLAIN, I can see that we can omit a boolean value, but non-boolean
parameters require its value. The HEADER option is not a pure boolean
parameter but we can omit the value. It seems to be for backward
compatibility; it used to be a boolean parameter. I agree that the
above example would confuse users.
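A sketch of the difference at the option-parsing level (defGetBoolean() is
the existing helper in commands/define.c; the ON_ERROR branch is a
hypothetical stricter variant, not the current behavior):

static void
process_option_sketch(DefElem *defel, CopyFormatOptions *opts_out)
{
    if (strcmp(defel->defname, "freeze") == 0)
    {
        /* boolean option: defGetBoolean() maps a missing value to true */
        opts_out->freeze = defGetBoolean(defel);
    }
    else if (strcmp(defel->defname, "on_error") == 0)
    {
        /* enum-valued option: reject the omitted-value form outright */
        if (defel->arg == NULL)
            ereport(ERROR,
                    (errcode(ERRCODE_SYNTAX_ERROR),
                     errmsg("ON_ERROR requires a value")));
        /* ... then parse "stop"/"ignore" from defGetString(defel) ... */
    }
}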

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-03-28 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 5:28 PM Bharath Rupireddy
 wrote:
>
> On Thu, Mar 28, 2024 at 1:43 PM torikoshia  wrote:
> >
> > Attached patch fixes the doc,
>
> May I know which patch you are referring to? And, what do you mean by
> "fixes the doc"?
>
> > but I'm wondering whether it might be
> > better to modify the code to prohibit abbreviation of the value.
>
> Please help me understand the meaning here.
>
> > When seeing a query which abbreviates the ON_ERROR value, I feel it's not
> > obvious what happens compared to other options which tolerate
> > abbreviation of the value such as FREEZE or HEADER.
> >
> >COPY t1 FROM stdin WITH (ON_ERROR);
> >
> > What do you think?
>
> So, do you mean to prohibit ON_ERROR being specified without any value
> like in COPY t1 FROM stdin WITH (ON_ERROR);? If yes, I think all the
> other options do allow that [1].
>
> Even if we were to do something like this, shall we discuss this separately?
>
> Having said that, what do you think of the v13 patch posted upthread?
>

This topic accidentally jumped in this thread, but it made me think
that the same might be true for the LOG_VERBOSITY option. That is,
since the LOG_VERBOSITY option is an enum parameter, it might make
more sense to require the value, instead of making the value optional.
For example, the following command might not be obvious to users:

COPY test FROM stdin (ON_ERROR ignore, LOG_VERBOSITY);

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Table AM Interface Enhancements

2024-03-28 Thread Pavel Borisov
Hi, Alexander!

The other extensibility change that seems quite clear and uncontroversial to
me is 0006.

It simply shifts the decision on whether tuple inserts should invoke
inserts to the related indices to the table am level. It doesn't change the
current heap insert behavior so it's safe for the existing heap access
method. But new table access methods could redefine this (only for tables
created with these am's) and make index inserts independently
of ExecInsertIndexTuples inside their own implementations of
tuple_insert/tuple_multi_insert methods.

I'd propose changing the comment:

1405  * This function sets `*insert_indexes` to true if expects caller to
return
1406  * the relevant index tuples.  If `*insert_indexes` is set to false,
then
1407  * this function cares about indexes itself.

in the following way

Tableam implementation of tuple_insert should set `*insert_indexes` to true
if it expects the caller to insert the relevant index tuples (as in heap
 implementation). It should set `*insert_indexes` to false if it cares
about index inserts itself and doesn't want the caller to do index inserts.
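For example, a table AM that maintains its own indexes could do something
like this (all myam_* names are invented, and the parameter list only
approximates the patched tuple_insert signature):

static void
myam_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
                  int options, BulkInsertState bistate,
                  bool *insert_indexes)
{
    /* AM-specific tuple storage */
    myam_store_tuple(rel, slot, cid, options, bistate);

    /* the AM inserts the index entries itself ... */
    myam_insert_index_entries(rel, slot);

    /* ... so the caller must skip ExecInsertIndexTuples() */
    *insert_indexes = false;
}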

Maybe the commit message should also be reworded to better describe who
should do what.

I think, with rebase and correction in the comments/commit message patch
0006 is ready to be committed.

Regards,
Pavel Borisov.


Re: POC, WIP: OR-clause support for indexes

2024-03-28 Thread Alexander Korotkov
On Tue, Mar 19, 2024 at 7:17 AM Andrei Lepikhov
 wrote:
> On 14/3/2024 16:31, Alexander Korotkov wrote:
> > On Wed, Mar 13, 2024 at 2:16 PM Andrei Lepikhov
> > As you can see this case is not related to partial indexes.  Just no
> > index selective for the whole query.  However, splitting scan by the OR
> > qual lets use a combination of two selective indexes.
> I have rewritten the 0002-* patch according to your concern. A candidate
> and some thoughts are attached.
> As I see, we have a problem here: expanding each array and trying to
> apply an element to each index can result in a lengthy planning stage.
> Also, an index scan with the SAOP may potentially be more effective than
> with the list of OR clauses.
> Originally, the transformation's purpose was to reduce a query's
> complexity and the number of optimization ways to speed up planning and
> (sometimes) execution. Here, we reduce planning complexity only in the
> case of an array size larger than MAX_SAOP_ARRAY_SIZE.
> Maybe we can fall back to the previous version of the second patch,
> keeping in mind that someone who wants to get maximum profit from the
> BitmapOr scan of multiple indexes can just disable this optimization,
> enabling deep search of the most optimal scanning way?
> As a compromise solution, I propose adding one more option to the
> previous version: if an element doesn't fit any partial index, try to
> cover it with a plain index.
> In this case, we still do not guarantee the most optimal fit of elements
> to the set of indexes, but we speed up planning. Does that make sense?

Thank you for your research Andrei.  Now things get more clear on the
advantages and disadvantages of this transformation.

The current patch has a boolean guc enable_or_transformation.
However, when we have just a few ORs to be transformed, then we
should get less performance gain from the transformation and higher
chances to lose a good bitmap scan plan from that.  When there is a
huge list of ORs to be transformed, then the performance gain is
greater and it is less likely we could lose a good bitmap scan plan.

What do you think about introducing a GUC threshold value: the minimum
size of list to do OR-to-ANY transformation?
min_list_or_transformation or something.
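Such a threshold would be a cheap gate at the transformation site plus a GUC
entry, roughly like this (the name, group and defaults are placeholders
only):

/* planner side: keep short OR lists untransformed */
if (list_length(or_args) < min_list_or_transformation)
    return NULL;                /* fall back to the plain OR clauses */

/* guc_tables.c entry (sketch) */
{
    {"min_list_or_transformation", PGC_USERSET, QUERY_TUNING_OTHER,
        gettext_noop("Minimum length of an OR list to transform into ANY."),
        NULL
    },
    &min_list_or_transformation,
    5, 2, INT_MAX,
    NULL, NULL, NULL
},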

--
Regards,
Alexander Korotkov




RE: Synchronizing slots from primary to standby

2024-03-28 Thread Zhijie Hou (Fujitsu)
On Thursday, March 28, 2024 7:32 PM Amit Kapila  wrote:
> 
> On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > When analyzing one BF error[1], we find an issue of slotsync: Since we
> > don't perform logical decoding for the synced slots when syncing the
> > lsn/xmin of slot, no logical snapshots will be serialized to disk. So,
> > when user starts to use these synced slots after promotion, it needs
> > to re-build the consistent snapshot from the restart_lsn if the
> > WAL(xl_running_xacts) at restart_lsn position indicates that there are
> > running transactions. This however could cause the data that before the
> consistent point to be missed[2].
> >
> > This issue doesn't exist on the primary because the snapshot at
> > restart_lsn should have been serialized to disk
> > (SnapBuildProcessRunningXacts -> SnapBuildSerialize), so even if the
> > logical decoding restarts, it can find consistent snapshot immediately at
> restart_lsn.
> >
> > To fix this, we could use the fast forward logical decoding to advance
> > the synced slot's lsn/xmin when syncing these values instead of
> > directly updating the slot's info. This way, the snapshot will be 
> > serialized to
> disk when decoding.
> > If we could not reach to the consistent point at the remote
> > restart_lsn, the slot is marked as temp and will be persisted once it
> > reaches the consistent point. I am still analyzing the fix and will share 
> > once
> ready.
> >
> 
> Yes, we can use this but one thing to note is that
> CreateDecodingContext() will expect that the slot's and current database are
> the same. I think the reason for that is we need to check system tables of the
> current database while decoding and sending data to the output_plugin which
> won't be a requirement for the fast_forward case. So, we need to skip that
> check in fast_forward mode.

Agreed.

> 
> Next, I was thinking about the case of the first time updating the restart and
> confirmed_flush LSN while syncing the slots. I think we can keep the current
> logic as it is based on the following analysis.
> 
> For each logical slot, cases possible on the primary:
> 1. The restart_lsn doesn't have a serialized snapshot and hasn't yet reached 
> the
> consistent point.
> 2. The restart_lsn doesn't have a serialized snapshot but has reached a
> consistent point.
> 3. The restart_lsn has a serialized snapshot which means it has reached a
> consistent point as well.
> 
> Considering we keep the logic to reserve initial WAL positions the same as the
> current (Reserve WAL for the currently active local slot using the specified 
> WAL
> location (restart_lsn). If the given WAL location has been removed, reserve
> WAL using the oldest existing WAL segment.), I could think of the below
> scenarios:
> A. For 1, we shouldn't sync the slot as it still wouldn't have been marked
> persistent on the primary.
> B. For 2, we would sync the slot
>B1. If remote_restart_lsn >= local_restart_lsn, then advance the slot by 
> calling
> pg_logical_replication_slot_advance().
>B11. If we reach consistent point, then it should be okay because after
> promotion as well we should reach consistent point.
> B111. But again is it possible that there is some xact that comes
> before consistent_point on primary and the same is after consistent_point on
> standby? This shouldn't matter as we will start decoding transactions after
> confirmed_flush_lsn which would be the same on primary and standby.
>B22. If we haven't reached consistent_point, then we won't mark the 
> slot
> as persistent, and at the next sync we will do the same till it reaches
> consistent_point. At that time, the situation will be similar to B11.
>B2. If remote_restart_lsn < local_restart_lsn, then we will wait for the 
> next
> sync cycle and keep the slot as temporary. Once in the next or some
> consecutive sync cycle, we reach the condition remote_restart_lsn >=
> local_restart_lsn, we will proceed to advance the slot and we should have the
> same behavior as B1.
> C. For 3, we would sync the slot, but the behavior should be the same as B.
> 
> Thoughts?

Looks reasonable to me.

Here is the patch based on above lines.
I am also testing and verifying the patch locally.

Best Regards,
Hou zj


0001-advance-the-restart_lsn-of-synced-slots-using-logica.patch
Description:  0001-advance-the-restart_lsn-of-synced-slots-using-logica.patch


Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Robert Haas
On Wed, Mar 27, 2024 at 5:28 PM Tom Lane  wrote:
> After mulling it over for awhile, I still think the extra checking
> is appropriate, especially since this patch is enlarging the set of
> things that can happen in parallel mode.  How do you want to proceed?

I sort of assumed you were going to commit the patch as you had it.
I'm not a huge fan of that, but I don't think it's a catastrophe,
either. It pains me a bit to add CPU cycles that I consider
unnecessary to a very frequently taken code path, but as you say, it's
not a lot of CPU cycles, so maybe nobody will ever notice. I actually
really wish we could find some way of making subtransactions
significantly lighter-weight, because I think the cost of spinning up
and tearing down a trivial subtransaction is a real performance
problem, but fixing that is probably a pretty hard problem whether
this patch gets committed or not.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Tom Lane
Robert Haas  writes:
> I sort of assumed you were going to commit the patch as you had it.

OK, I will move ahead on that.

> I actually
> really wish we could find some way of making subtransactions
> significantly lighter-weight, because I think the cost of spinning up
> and tearing down a trivial subtransaction is a real performance
> problem, but fixing that is probably a pretty hard problem whether
> this patch gets committed or not.

Yeah.  The whole ResourceOwner mechanism is not exactly lightweight,
but it's hard to argue that we don't need it.  I wonder whether we
could get anywhere by deeming that a "small enough" subtransaction
doesn't need to have its resources cleaned up instantly, and
instead re-use its ResourceOwner to accumulate resources of the
next subtransaction, and the next, until there's enough to be
worth cleaning up.

Having said that, it's hard to see any regime under which tied-up
parallel workers wouldn't count as a resource worth releasing ASAP.
I started this mail with the idea of suggesting that parallel contexts
ought to become a ResourceOwner-managed resource, but maybe that
wouldn't be an improvement after all.

regards, tom lane




To what extent should tests rely on VACUUM ANALYZE?

2024-03-28 Thread Alexander Lakhin

Hello hackers,

When running multiple 027_stream_regress.pl test instances in parallel
(and with aggressive autovacuum) on a rather slow machine, I encountered
test failures due to the subselect test instability just as the following
failures on buildfarm:
1) 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-03-27%2010%3A16%3A12

--- /home/bf/bf-build/grassquit/HEAD/pgsql/src/test/regress/expected/subselect.out	2024-03-19 22:20:34.435867114 +
+++ /home/bf/bf-build/grassquit/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/subselect.out	2024-03-27 10:28:38.185776605 +

@@ -2067,16 +2067,16 @@
                  QUERY PLAN
 ---------------------------------------------
  Hash Join
-   Hash Cond: (c.odd = b.odd)
+   Hash Cond: (c.hundred = a.hundred)
    ->  Hash Join
-         Hash Cond: (a.hundred = c.hundred)
-         ->  Seq Scan on tenk1 a
+         Hash Cond: (b.odd = c.odd)
+         ->  Seq Scan on tenk2 b
          ->  Hash
                ->  HashAggregate
                      Group Key: c.odd, c.hundred
                      ->  Seq Scan on tenk2 c
    ->  Hash
-         ->  Seq Scan on tenk2 b
+         ->  Seq Scan on tenk1 a
 (11 rows)

2) 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-03-27%2009%3A49%3A38

(That query was added recently (by 9f1337639 from 2023-02-15) and the
failure evidentially depends on timing, so the number of the failures I
could find on buildfarm is moderate for now.)

With the subselect test modified as in attached, I could see what makes
the plan change:
-                      ->  Seq Scan on public.tenk2 c  (cost=0.00..445.00 rows=10000 width=8)
+                      ->  Seq Scan on public.tenk2 c  (cost=0.00..444.95 rows=9995 width=8)

  relname | relpages | reltuples | autovacuum_count | autoanalyze_count
 ---------+----------+-----------+------------------+-------------------
- tenk2   |      345 |     10000 |                0 |                 0
+ tenk2   |      345 |      9995 |                0 |                 0

Using the trick Thomas proposed in [1] (see my modification attached), I
could reproduce the failure easily on my workstation with no specific
conditions:
2024-03-28 14:05:13.792 UTC client backend[2358012] pg_regress/test_setup LOG:  !!!ConditionalLockBufferForCleanup() returning false
2024-03-28 14:05:13.792 UTC client backend[2358012] pg_regress/test_setup CONTEXT:  while scanning block 29 of relation "public.tenk2"
2024-03-28 14:05:13.792 UTC client backend[2358012] pg_regress/test_setup STATEMENT:  VACUUM ANALYZE tenk2;
...
  relname | relpages | reltuples | autovacuum_count | autoanalyze_count
 ---------+----------+-----------+------------------+-------------------
- tenk2   |      345 |     10000 |                0 |                 0
+ tenk2   |      345 |      9996 |                0 |                 0
 (1 row)

So it looks to me like a possible cause of the failure, and I wonder
whether checks for query plans should be immune to such changes or results
of VACUUM ANALYZE should be 100% stable?

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGKYNHmL_DhmVRiidHv6YLAL8jViifwwn2ABY__Y3BCphg%40mail.gmail.com

Best regards,
Alexanderdiff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 29b11f11aa..6a7bb6b7a9 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -2079,6 +2079,32 @@ ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
  ->  Seq Scan on tenk2 b
 (11 rows)
 
+explain (verbose)
+SELECT * FROM tenk1 A INNER JOIN tenk2 B
+ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
+   QUERY PLAN   
+
+ Hash Join  (cost=1087.50..13845.00 rows=100 width=488)
+   Output: a.unique1, a.unique2, a.two, a.four, a.ten, a.twenty, a.hundred, a.thousand, a.twothousand, a.fivethous, a.tenthous, a.odd, a.even, a.stringu1, a.stringu2, a.string4, b.unique1, b.unique2, b.two, b.four, b.ten, b.twenty, b.hundred, b.thousand, b.twothousand, b.fivethous, b.tenthous, b.odd, b.even, b.stringu1, b.stringu2, b.string4
+   Hash Cond: (c.odd = b.odd)
+   ->  Hash Join  (cost=517.50..2000.00 rows=1 width=248)
+ Output: a.unique1, a.unique2, a.two, a.four, a.ten, a.twenty, a

Re: Combine Prune and Freeze records emitted by vacuum

2024-03-28 Thread Melanie Plageman
On Thu, Mar 28, 2024 at 4:49 AM Heikki Linnakangas  wrote:
>
> On 28/03/2024 03:53, Melanie Plageman wrote:
> > On Thu, Mar 28, 2024 at 01:04:04AM +0200, Heikki Linnakangas wrote:
> >> One change with this is that live_tuples and many of the other fields are
> >> now again updated, even if the caller doesn't need them. It was hard to 
> >> skip
> >> them in a way that would save any cycles, with the other refactorings.
> >
> > I am worried we are writing checks that are going to have to come out of
> > SELECT queries' bank accounts, but I'll do some benchmarking when we're
> > all done with major refactoring.
>
> Sounds good, thanks.

Actually, after having looked at it again, there really are only a few
more increments of various counters, the memory consumed by revisit,
and the additional setting of offsets in marked. I think a few
carefully constructed queries which do on-access pruning could test
the impact of this (as opposed to a bigger benchmarking endeavor).

I also wonder if there would be any actual impact of marking the
various record_*() routines inline.

> >> @@ -728,10 +832,12 @@ htsv_get_valid_status(int status)
> >>* DEAD, our visibility test is just too coarse to detect it.
> >>*
> >>* In general, pruning must never leave behind a DEAD tuple that still 
> >> has
> >> - * tuple storage.  VACUUM isn't prepared to deal with that case.  That's 
> >> why
> >> + * tuple storage.  VACUUM isn't prepared to deal with that case (FIXME: 
> >> it no longer cares, right?).
> >> + * That's why
> >>* VACUUM prunes the same heap page a second time (without dropping its 
> >> lock
> >>* in the interim) when it sees a newly DEAD tuple that we initially saw 
> >> as
> >> - * in-progress.  Retrying pruning like this can only happen when an 
> >> inserting
> >> + * in-progress (FIXME: Really? Does it still do that?).
> >
> > Yea, I think all that is no longer true. I missed this comment back
> > then.
>
> Committed a patch to remove it.
>
> >> @@ -981,7 +1069,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber 
> >> rootoffnum,
> >>   * RECENTLY_DEAD tuples just in case there's a DEAD one after 
> >> them;
> >>   * but we can't advance past anything else.  We have to make 
> >> sure that
> >>   * we don't miss any DEAD tuples, since DEAD tuples that 
> >> still have
> >> - * tuple storage after pruning will confuse VACUUM.
> >> + * tuple storage after pruning will confuse VACUUM (FIXME: 
> >> not anymore
> >> + * I think?).
> >
> > Meaning, it won't confuse vacuum anymore or there won't be DEAD tuples
> > with storage after pruning anymore?
>
> I meant that it won't confuse VACUUM anymore. lazy_scan_prune() doesn't
> loop through the items on the page checking their visibility anymore.
>
> Hmm, one confusion remains though: In the 2nd phase of vacuum, we remove
> all the dead line pointers that we have now removed from the indexes.
> When we do that, we assume them all to be dead line pointers, without
> storage, rather than normal tuples that happen to be HEAPTUPLE_DEAD. So
> it's important that if pruning would leave behind HEAPTUPLE_DEAD tuples,
> they are not included in 'deadoffsets'.

It seems like master only adds items it is marking LP_DEAD to
deadoffsets. And things marked LP_DEAD have lp_len set to 0.
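IIRC the itemid.h macros make that explicit:

#define ItemIdSetDead(itemId) \
( \
    (itemId)->lp_flags = LP_DEAD, \
    (itemId)->lp_off = 0, \
    (itemId)->lp_len = 0 \
)

#define ItemIdHasStorage(itemId) \
    ((itemId)->lp_len != 0)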

> In any case, let's just make sure that pruning doesn't leave
> HEAPTUPLE_DEAD tuples. There's no reason it should.

Maybe worth adding an assert to

static void
heap_prune_record_unchanged_lp_dead(ItemId itemid, PruneState
*prstate, OffsetNumber offnum)
{
...
Assert(!ItemIdHasStorage(itemid));
prstate->deadoffsets[prstate->lpdead_items++] = offnum;
}

By the way, I wasn't quite sure about the purpose of
heap_prune_record_unchanged_lp_dead(). It is called in
heap_prune_chain() in a place where we didn't add things to
deadoffsets before, no?

/*
 * Likewise, a dead line pointer can't be part of the chain. (We
 * already eliminated the case of dead root tuple outside this
 * function.)
 */
if (ItemIdIsDead(lp))
{
/*
 * If the caller set PRUNE_DO_MARK_UNUSED_NOW, we can set dead
 * line pointers LP_UNUSED now.
 */
if (unlikely(prstate->actions & PRUNE_DO_MARK_UNUSED_NOW))
heap_prune_record_unused(prstate, offnum, false);
else
heap_prune_record_unchanged_lp_dead(lp, prstate, offnum);
break;
}

> >> @@ -1224,10 +1327,21 @@ heap_prune_record_live_or_recently_dead(Page page, 
> >> PruneState *prstate, OffsetNu
> >>   * ensure the math works out. The assumption that the transaction will
> >>   * commit and update the counters after we report is a bit shaky; but 
> >> it
> >>   * is what acquire_sample_rows() does, so we do the same to be 
> >> consistent.
> >> + *
> >> + * HEAPTUPLE_DEAD are handled by the other hea

Re: Table AM Interface Enhancements

2024-03-28 Thread Japin Li


On Thu, 28 Mar 2024 at 21:26, Alexander Korotkov  wrote:
> Hi Pavel!
>
> Revised patchset is attached.
>
> On Thu, Mar 28, 2024 at 3:12 PM Pavel Borisov  wrote:
>> The other extensibility change that seems quite clear and uncontroversial to
>> me is 0006.
>>
>> It simply shifts the decision on whether tuple inserts should invoke inserts 
>> to the related indices to the table am level. It doesn't change the current 
>> heap insert behavior so it's safe for the existing heap access method. But 
>> new table access methods could redefine this (only for tables created with 
>> these am's) and make index inserts independently of ExecInsertIndexTuples 
>> inside their own implementations of tuple_insert/tuple_multi_insert methods.
>>
>> I'd propose changing the comment:
>>
>> 1405  * This function sets `*insert_indexes` to true if expects caller to 
>> return
>> 1406  * the relevant index tuples.  If `*insert_indexes` is set to false, 
>> then
>> 1407  * this function cares about indexes itself.
>>
>> in the following way
>>
>> Tableam implementation of tuple_insert should set `*insert_indexes` to true
>> if it expects the caller to insert the relevant index tuples (as in heap
>>  implementation). It should set `*insert_indexes` to false if it cares
>> about index inserts itself and doesn't want the caller to do index inserts.
>
> Changed as you proposed.
>
>> Maybe the commit message should also be reworded to better describe who
>> should do what.
>
> Done.
>
> Also, I removed extra includes in 0001 as you proposed and edited the
> commit message in 0002.
>
>> I think, with rebase and correction in the comments/commit message patch 
>> 0006 is ready to be committed.
>
> I'm going to push 0001, 0002 and 0006 if no objections.

Thanks for updating the patches.  Here are some minor suggestions.

0003

+static inline TupleTableSlot *
+heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,

I'm not entirely certain whether the "inline" keyword has any effect.

0004

+static bytea *
+heapam_indexoptions(amoptions_function amoptions, char relkind,
+   Datum reloptions, bool validate)
+{
+   return index_reloptions(amoptions, reloptions, validate);
+}

Could you please explain why we are not verifying the relkind like
heapam_reloptions()?


-   case RELKIND_VIEW:
case RELKIND_MATVIEW:
+   case RELKIND_VIEW:
case RELKIND_PARTITIONED_TABLE:

I think this change is unnecessary.

+   {
+       Form_pg_class classForm;
+       HeapTuple   classTup;
+
+       /* fetch the relation's relcache entry */
+       if (relation->rd_index->indrelid >= FirstNormalObjectId)
+       {
+           classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
+           classForm = (Form_pg_class) GETSTRUCT(classTup);
+           if (classForm->relam >= FirstNormalObjectId)
+               tableam = GetTableAmRoutineByAmOid(classForm->relam);
+           else
+               tableam = GetHeapamTableAmRoutine();
+           heap_freetuple(classTup);
+       }
+       else
+       {
+           tableam = GetHeapamTableAmRoutine();
+       }
+       amoptsfn = relation->rd_indam->amoptions;
+   }

- We can reduce the indentation by moving the classForm and classTup into
  the if branch.
- Perhaps we could remove the braces of the else branch to maintain consistency
  with the code style.

--
Regards,
Japin Li




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Robert Haas
On Thu, Mar 28, 2024 at 10:59 AM Tom Lane  wrote:
> Yeah.  The whole ResourceOwner mechanism is not exactly lightweight,
> but it's hard to argue that we don't need it.  I wonder whether we
> could get anywhere by deeming that a "small enough" subtransaction
> doesn't need to have its resources cleaned up instantly, and
> instead re-use its ResourceOwner to accumulate resources of the
> next subtransaction, and the next, until there's enough to be
> worth cleaning up.

Hmm, I wonder if that's actually where the cycles are going. There's
an awful lot of separate function calls inside CommitSubTransaction(),
and in the common case, each one of them has to individually decide
that it doesn't need to do anything. Sure, they're all fast, but if
you have enough of them, it's still going to add up, at least a bit.
In that sense, the resource owner mechanism seems like it should, or
at least could, be better. I'm not sure this is quite the way it works
now, but if you had one single list/array/thingamabob that listed all
of the resources that needed releasing, that should in theory be
better when there's a lot of kinds of resources that you COULD hold
but only a small number of kinds of resources that you actually do
hold -- and it also shouldn't be any worse if it turns out that you
hold a whole lot of resources of many different types.
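Spelled out, that single-array idea is something like the following (a loose
sketch over the new resowner.h callback style; HeldResource and
release_all_sketch() are invented names):

typedef struct HeldResource
{
    const ResourceOwnerDesc *kind;  /* callbacks describing how to release */
    Datum       value;              /* the resource itself */
} HeldResource;

static void
release_all_sketch(HeldResource *held, int nheld)
{
    /*
     * One pass over what is actually held, in reverse order of
     * acquisition, instead of one call per possible resource type
     * that usually finds nothing to do.
     */
    for (int i = nheld - 1; i >= 0; i--)
        held[i].kind->ReleaseResource(held[i].value);
}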

But I haven't done any benchmarking of this area in a long time.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Tom Lane
Robert Haas  writes:
> Hmm, I wonder if that's actually where the cycles are going. There's
> an awful lot of separate function calls inside CommitSubTransaction(),
> and in the common case, each one of them has to individually decide
> that it doesn't need to do anything. Sure, they're all fast, but if
> you have enough of them, it's still going to add up, at least a bit.
> In that sense, the resource owner mechanism seems like it should, or
> at least could, be better.

Yeah, I was thinking about that too.  The normal case is that you
don't hold any releasable resources except locks when arriving at
CommitSubTransaction --- if you do, it's a bug and we're going to
print leak warnings.  Seems like maybe it'd be worth trying to
have a fast path for that case.  (Also, given that we probably
do need to release locks right away, this point invalidates my
earlier idea of postponing the work.)

> But I haven't done any benchmarking of this area in a long time.

Ditto.

regards, tom lane




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-26, Justin Pryzby wrote:

> Looks right.  That's how I originally wrote it, except for the
> "stmt->accessMethod != NULL" case.
> 
> I preferred my way - the grammar should refuse to set stmt->accessMethod
> for inappropriate relkinds.  And you could assert that.

Hmm, I didn't like this at first sight, because it looked convoluted and
baroque, but I compared both versions for a while and I ended up liking
yours more than mine, so I adopted it.

> I also preferred to set "accessMethodId = InvalidOid" once, rather than
> twice.

Grumble.  I don't like initialization at declare time, so far from the
code that depends on the value.  But the alternative would have been to
assign right where this block starts, an additional line.  I pushed it
like you had it.
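The resulting shape is roughly this (a paraphrased sketch, not the exact
committed code):

if (stmt->accessMethod != NULL)
{
    /* grammar only allows this for relkinds that can have an AM */
    Assert(RELKIND_HAS_TABLE_AM(relkind) ||
           relkind == RELKIND_PARTITIONED_TABLE);
    accessMethodId = get_table_am_oid(stmt->accessMethod, false);
}
else
    accessMethodId = InvalidOid;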

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL.  Most of the surprises
are of the "oh wow!  That's cool" Not the "oh shit!" kind.  :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Robert Haas
On Thu, Mar 28, 2024 at 11:50 AM Tom Lane  wrote:
> Yeah, I was thinking about that too.  The normal case is that you
> don't hold any releasable resources except locks when arriving at
> CommitSubTransaction --- if you do, it's a bug and we're going to
> print leak warnings.  Seems like maybe it'd be worth trying to
> have a fast path for that case.

Well, there's the abort case, too, which I think is almost equally important.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Tom Lane
Robert Haas  writes:
> On Thu, Mar 28, 2024 at 11:50 AM Tom Lane  wrote:
>> Yeah, I was thinking about that too.  The normal case is that you
>> don't hold any releasable resources except locks when arriving at
>> CommitSubTransaction --- if you do, it's a bug and we're going to
>> print leak warnings.  Seems like maybe it'd be worth trying to
>> have a fast path for that case.

> Well, there's the abort case, too, which I think is almost equally important.

True, but in the abort case there probably *are* resources to be
cleaned up, so I'm not seeing that the fast-path idea helps.
Although maybe the idea of batching multiple cleanups would?

regards, tom lane




Re: [PATCH] plpython function causes server panic

2024-03-28 Thread Robert Haas
On Thu, Mar 28, 2024 at 12:01 PM Tom Lane  wrote:
> > Well, there's the abort case, too, which I think is almost equally 
> > important.
>
> True, but in the abort case there probably *are* resources to be
> cleaned up, so I'm not seeing that the fast-path idea helps.
> Although maybe the idea of batching multiple cleanups would?

Yes, I think we should be trying to optimize for the case where the
(sub)transaction being cleaned up holds a small but non-zero number of
resources. I think if we just optimize the case where it's exactly
zero, there will be enough cases where the optimization doesn't apply
that we'll feel like we haven't really solved the problem. Whether the
specific idea of trying to batch the cleanups could be made to help
enough to matter, I'm not quite sure. Another idea I had at one point
was to have some kind of bitmask where each bit tells you whether or
not one particular resource type might be held, so that
{Commit,Abort}{Sub,}Transaction would end up doing a bunch of stuff
like if (ResourcesNeedingCleanup & MIGHT_HOLD_THINGY)
AtEO(Sub)Xact_Thingy(...). But I wasn't sure that would really move
the needle, either. This seems to be one of those annoying cases where
the problem is much more obvious than the solution.
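In pseudo-C, that last idea would look something like this (all names are
hypothetical; the real callees would be whatever per-type AtEOXact-style
functions we have):

#define MIGHT_HOLD_BUFFER_PINS      (1 << 0)
#define MIGHT_HOLD_RELCACHE_REFS    (1 << 1)
#define MIGHT_HOLD_FILES            (1 << 2)

static uint32 ResourcesNeedingCleanup = 0;

static void
AtEOSubXact_Cleanup_Sketch(bool isCommit)
{
    /*
     * Bits get set by whichever code path acquires each resource type,
     * so the common "holds nothing" case is a few test-and-skips.
     */
    if (ResourcesNeedingCleanup & MIGHT_HOLD_BUFFER_PINS)
        ReleaseBufferPinsSketch(isCommit);
    if (ResourcesNeedingCleanup & MIGHT_HOLD_RELCACHE_REFS)
        ReleaseRelcacheRefsSketch(isCommit);
    if (ResourcesNeedingCleanup & MIGHT_HOLD_FILES)
        ReleaseFilesSketch(isCommit);
}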

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
Eh, kestrel has also failed[1], apparently every query after the large
JOIN that this commit added as test fails with a statement timeout error.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-28%2016%3A01%3A14

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No deja de ser humillante para una persona de ingenio saber
que no hay tonto que no le pueda enseñar algo." (Jean B. Say)




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Jelte Fennema-Nio
On Thu, 28 Mar 2024 at 17:34, Alvaro Herrera  wrote:
>
> Eh, kestrel has also failed[1], apparently every query after the large
> JOIN that this commit added as test fails with a statement timeout error.
>
> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-28%2016%3A01%3A14

Ugh that's annoying, the RESET is timing out too I guess. That can
hopefully be easily fixed by changing the new test to:

BEGIN;
SET LOCAL statement_timeout = '10ms';
select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5;
-- this takes very long
ROLLBACK;




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Jelte Fennema-Nio wrote:

> Ugh that's annoying, the RESET is timing out too I guess.

Hah, you're right, I can reproduce with a smaller timeout, and using SET
LOCAL works as a fix.  If we're doing that, why not reduce the timeout
to 1ms?  We don't need to wait extra 9ms ...

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
“Cuando no hay humildad las personas se degradan” (A. Christie)




Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-28 Thread Tomas Vondra
On 3/28/24 16:00, Alexander Lakhin wrote:
> ...
>
> Using the trick Thomas proposed in [1] (see my modification attached), I
> could reproduce the failure easily on my workstation with no specific
> conditions:
> 2024-03-28 14:05:13.792 UTC client backend[2358012]
> pg_regress/test_setup LOG:  !!!ConditionalLockBufferForCleanup()
> returning false
> 2024-03-28 14:05:13.792 UTC client backend[2358012]
> pg_regress/test_setup CONTEXT:  while scanning block 29 of relation
> "public.tenk2"
> 2024-03-28 14:05:13.792 UTC client backend[2358012]
> pg_regress/test_setup STATEMENT:  VACUUM ANALYZE tenk2;
> ...
>   relname | relpages | reltuples | autovacuum_count | autoanalyze_count
>  ---------+----------+-----------+------------------+-------------------
> - tenk2   |      345 |     10000 |                0 |                 0
> + tenk2   |      345 |      9996 |                0 |                 0
>  (1 row)
> 
> So it looks to me like a possible cause of the failure, and I wonder
> whether checks for query plans should be immune to such changes or results
> of VACUUM ANALYZE should be 100% stable?
> 

Yeah. I think it's good to design the data/queries in such a way that
the behavior does not flip due to minor noise like in this case.

But I'm a bit confused - how come the estimates do change at all? The
analyze simply fetches 30k rows, and tenk only has 10k of them. So we
should have *exact* numbers, and it should be exactly the same for all
the analyze runs. So how come it changes like this?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: remaining sql/json patches

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Amit Langote wrote:

> Here's patch 1 for the time being that implements barebones
> JSON_TABLE(), that is, without NESTED paths/columns and PLAN clause.
> I've tried to shape the interfaces so that those features can be added
> in future commits without significant rewrite of the code that
> implements barebones JSON_TABLE() functionality.  I'll know whether
> that's really the case when I rebase the full patch over it.

I think this barebones patch looks much closer to something that can be
committed for pg17, given the current commitfest timeline.  Maybe we
should just slip NESTED and PLAN to pg18 to focus current efforts into
getting the basic functionality in 17.  When I looked at the JSON_TABLE
patch last month, it appeared far too large to be reviewable in
reasonable time.  The fact that this split now exists gives me hope that
we can get at least the first part of it.

(A note that PLAN seems to correspond to separate features T824+T838, so
leaving that one out would still let us claim T821 "Basic SQL/JSON query
operators" ... however, the NESTED clause does not appear to be a
separate SQL feature; in particular it does not appear to correspond to
T827, though I may be reading the standard wrong.  So if we don't have
NESTED, apparently we could not claim to support T821.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Jelte Fennema-Nio
On Thu, 28 Mar 2024 at 17:43, Alvaro Herrera  wrote:
> Hah, you're right, I can reproduce with a smaller timeout, and using SET
> LOCAL works as a fix.  If we're doing that, why not reduce the timeout
> to 1ms?  We don't need to wait extra 9ms ...

I think we don't really want to make the timeout too short. Otherwise
the query might get cancelled before we push any query down to the
FDW. I guess that means that for some slow machines even 10ms is not
enough to make the test do the intended purpose. I'd keep it at 10ms,
which seems long enough for normal systems, while still being pretty
short.




Re: To what extent should tests rely on VACUUM ANALYZE?

2024-03-28 Thread Tom Lane
Tomas Vondra  writes:
> Yeah. I think it's good to design the data/queries in such a way that
> the behavior does not flip due to minor noise like in this case.

+1

> But I'm a bit confused - how come the estimates do change at all? The
> analyze simply fetches 30k rows, and tenk only has 10k of them. So we
> should have *exact* numbers, and it should be exactly the same for all
> the analyze runs. So how come it changes like this?

It's plausible that the VACUUM ANALYZE done by test_setup fails
ConditionalLockBufferForCleanup() sometimes because of concurrent
activity like checkpointer writes.  I'm not quite sure how we
get from that to the observed symptom though.  Maybe the
VACUUM needs DISABLE_PAGE_SKIPPING?

regards, tom lane




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Tom Lane
Jelte Fennema-Nio  writes:
> On Thu, 28 Mar 2024 at 17:43, Alvaro Herrera  wrote:
>> Hah, you're right, I can reproduce with a smaller timeout, and using SET
>> LOCAL works as a fix.  If we're doing that, why not reduce the timeout
>> to 1ms?  We don't need to wait extra 9ms ...

> I think we don't really want to make the timeout too short. Otherwise
> the query might get cancelled before we push any query down to the
> FDW. I guess that means that for some slow machines even 10ms is not
> enough to make the test do the intended purpose. I'd keep it at 10ms,
> which seems long enough for normal systems, while still being pretty
> short.

If the test fails both when the machine is too slow and when it's
too fast, then there's zero hope of making it stable and we should
just remove it.

regards, tom lane




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Bruce Momjian
On Thu, Mar 28, 2024 at 08:38:24AM -0400, Robert Haas wrote:
> Let's please, please stop pretending like this patch is somehow
> deserving of special scrutiny. There's barely even anything to
> scrutinize. It's literally if (!variable) ereport(...) plus some
> boilerplate and docs. I entirely agree that, because of the risk of
> someone filing a bogus CVE, the docs do need to be carefully worded.
> But, I'm going to be honest: I feel completely confident in my ability
> to review a patch well enough to know whether the documentation for a
> single test-and-ereport has been done up to project standard. It
> saddens and frustrates me that you don't seem to agree.

The concern about this patch is not its contents but the fact that it is our
first attempt at putting limits on the superuser for an external tool.
If done improperly, this could open a flood of problems, including CVEs
and user confusion, which would reflect badly on the project.

I think the email discussion has expressed those concerns clearly, and
it is only recently that we have gotten to a stage where we are ready to
add this, and doing this near the closing of the last commitfest can be
a valid concern.  I do agree with your analysis of other patches in the
commitfest, but I just don't see them stretching our boundaries like
this patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Tom Lane wrote:

> Jelte Fennema-Nio  writes:
> 
> > I think we don't really want to make the timeout too short. Otherwise
> > the query might get cancelled before we push any query down to the
> > FDW. I guess that means that for some slow machines even 10ms is not
> > enough for the test to serve its intended purpose. I'd keep it at 10ms,
> > which seems long enough for normal systems, while still being pretty
> > short.
> 
> If the test fails both when the machine is too slow and when it's
> too fast, then there's zero hope of making it stable and we should
> just remove it.

It doesn't fail when it's too fast -- it's just that it doesn't cover
the case we want to cover.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-28 Thread Tomas Vondra
On 3/28/24 06:20, Thomas Munro wrote:
> With the unexplained but apparently somewhat systematic regression
> patterns on certain tests and settings, I wonder if they might be due
> to read_stream.c trying to form larger reads, making it a bit lazier.
> It tries to see what the next block will be before issuing the
> fadvise.  I think that means that with small I/O concurrency settings,
> there might be contrived access patterns where it loses, and needs
> effective_io_concurrency to be set one notch higher to keep up, or
> something like that.

Yes, I think we've speculated this might be the root cause before, but
IIRC we didn't manage to verify it actually is the problem.

FWIW the tests do use synthetic data, but I don't think the access
pattern is particularly contrived.

> One way to test that idea would be to run the
> tests with io_combine_limit = 1 (meaning 1 block).  It issues advise
> eagerly when io_combine_limit is reached, so I suppose it should be
> exactly as eager as master.  The only difference then should be that
> it automatically suppresses sequential fadvise calls.

Sure, I'll give that a try. What are some good values to test? Perhaps
32 and 1, i.e. the default and "no coalescing"?

If this turns out to be the problem, does that mean we would consider
using a more conservative default value? Is there some "auto tuning" we
could do? For example, could we reduce the combine limit if we start
not finding buffers in memory, or something like that?

I recognize this may not be possible with buffered I/O, due to not
having any insight into page cache. And maybe it's misguided anyway,
because how would we know if the right response is to increase or reduce
the combine limit?

Anyway, doesn't the combine limit work against the idea that
effective_io_concurrency is "prefetch distance"? With eic=32 I'd expect
we issue prefetch 32 pages ahead, i.e. if we prefetch page X, we should
then process 32 pages before we actually need X (and we expect the page
to already be in memory, thanks to the gap). But with the combine limit
set to 32, is this still true?

I've tried going through read_stream_* to determine how this will
behave, but read_stream_look_ahead/read_stream_start_pending_read does
not make this very clear. I'll have to experiment with some tracing.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-28 Thread Tom Lane
Alvaro Herrera  writes:
> On 2024-Mar-28, Tom Lane wrote:
>> If the test fails both when the machine is too slow and when it's
>> too fast, then there's zero hope of making it stable and we should
>> just remove it.

> It doesn't fail when it's too fast -- it's just that it doesn't cover
> the case we want to cover.

That's hardly better, because then you think you have test
coverage but maybe you don't.

Could we make this test bulletproof by using an injection point?
If not, I remain of the opinion that we're better off without it.

regards, tom lane




Re: elog/ereport VS misleading backtrace_function function address

2024-03-28 Thread Tom Lane
Jakub Wartak  writes:
> While chasing some other bug I've learned that backtrace_functions
> might be misleading with top elog/ereport() address.

That was understood from the beginning: this type of backtrace is
inherently pretty imprecise, and I doubt there is much that can
be done to make it better.  IIRC the fundamental problem is it only
looks at global symbols, so static functions inherently defeat it.
It was argued that this is better than nothing, which is true, but
you have to take the info with a mountain of salt.
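
To illustrate (a hand-rolled sketch with hypothetical function names): since
only global symbols are visible to the unwinder, an address inside a static
helper resolves to the nearest preceding global function.

	static void
	do_work(void)
	{
		elog(ERROR, "something broke");	/* reported as public_entry+0xNN */
	}

	void
	public_entry(void)
	{
		do_work();
	}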

I recall speculating about whether we could somehow invoke gdb
to get a more comprehensive and accurate backtrace, but I don't
really have a concrete idea how that could be made to work.

regards, tom lane




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Robert Haas
On Thu, Mar 28, 2024 at 1:46 PM Bruce Momjian  wrote:
> The concern about this patch is not its contents but the fact that it is our
> first attempt at putting limits on the superuser for an external tool.
> If done improperly, this could open a flood of problems, including CVEs
> and user confusion, which would reflect badly on the project.
>
> I think the email discussion has expressed those concerns clearly, and
> it is only recently that we have gotten to a stage where we are ready to
> add this, and doing this near the closing of the last commitfest can be
> a valid concern.  I do agree with your analysis of other patches in the
> commitfest, but I just don't see them stretching our boundaries like
> this patch.

I do understand the concern, and I'm not saying that you're wrong to
have it at some level, but I do sincerely think it's excessive. I
don't think this is even close to being the scariest patch in this
release, or even in this CommitFest. I also agree that doing things
near the end of the last CommitFest isn't great, because even if your
patch is fantastic, people start to think maybe you're only committing
it to beat the deadline, and then the conversation can get unpleasant.
However, I don't think that's really what is happening here. If this
patch gets bounced out of this release, it won't be in any better
shape a year from now than it is right now. It can't be, because the
code is completely trivial; and the documentation has already been
extensively wordsmithed. Surely we don't need another whole release
cycle to polish three paragraphs of documentation. I think it has to
be right to get this done while we're all thinking about it and the
issue is fresh in everybody's mind.

How would you like to proceed from here? I think that in addressing
all of the comments given in the last few days, the documentation has
gotten modestly worse. I think it was crisp and clear before, and now
it feels a little ... over-edited. But if you're happy with the latest
version, we can go with that. Or, do you need more time to review?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Psql meta-command conninfo+

2024-03-28 Thread Maiquel Grassi
Hi Sami!

(v21)

First and foremost, thank you very much for the review.


> 1/ I looked through other psql-meta commands and the “+” adds details but
> does not change output format. In this patch, conninfo and conninfo+
> have completely different output. The former is a string with all the details
> and the latter is a table. Should conninfo just be a table with the minimal
> info and additional details can be displayed with "+" appended?

The initial and central idea was always to keep the metacommand
"\conninfo" in its original state, that is, to preserve the string as it is.
The idea of "\conninfo+" is to expand this to include more information.
If I changed the "\conninfo" command and transformed it into a table,
I would have to discard all the translation work (message files) that has
already been done. I believe that keeping the string and its translations
is a good idea, and there is no strong reason to change that.

> 2/ GSSAPI details should show the full details, such as "principal",
> "encrypted" and "credentials_delegated".

In this new patch, I added these columns. I handled older server versions,
where the 'credentials_delegated' column is not in the view.
It turned out well. Thank you for the idea.

> 3/ A very nice to have is "Application Name", in the case one
> sets the application_name within a connection or in the connection string.

I did the same here. I liked your idea and added this column to the SQL query.

See below for how it turned out.

Example 1:

[postgres@localhost ~]$ /home/pgsql-17devel/bin/psql -x -p 5432

psql (17devel, server 15.6)
Type "help" for help.

postgres=# \conninfo
You are connected to database "postgres" as user "postgres" via socket in 
"/tmp" at port "5432".
postgres=# \conninfo+
Current Connection Information
-[ RECORD 1 ]+-
Database | postgres
Authenticated User   | postgres
System User  |
Current User | postgres
Session User | postgres
Application Name | psql
Backend PID  | 19439
Server Address   |
Server Port  | 5432
Client Address   |
Client Port  |
Socket Directory | /tmp
Host |
SSL Connection   | f
SSL Protocol |
SSL Cipher   |
SSL Compression  |
GSSAPI Authenticated | f
GSSAPI Principal |
GSSAPI Encrypted | f
GSSAPI Credentials Delegated |

Example 2:

[postgres@localhost ~]$ /home/pgsql-17devel/bin/psql -x -p 5000
psql (17devel)
Type "help" for help.

postgres=# \conninfo
You are connected to database "postgres" as user "postgres" via socket in 
"/tmp" at port "5000".
postgres=# \conninfo+
Current Connection Information
-[ RECORD 1 ]+-
Database | postgres
Authenticated User   | postgres
System User  |
Current User | postgres
Session User | postgres
Application Name | psql
Backend PID  | 19468
Server Address   |
Server Port  | 5000
Client Address   |
Client Port  |
Socket Directory | /tmp
Host |
SSL Connection   | f
SSL Protocol |
SSL Cipher   |
SSL Compression  |
GSSAPI Authenticated | f
GSSAPI Principal |
GSSAPI Encrypted | f
GSSAPI Credentials Delegated | f

Regards,
Maiquel Grassi.


v21-0001-psql-meta-command-conninfo-plus.patch
Description: v21-0001-psql-meta-command-conninfo-plus.patch


Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Bruce Momjian
On Thu, Mar 28, 2024 at 02:43:38PM -0400, Robert Haas wrote:
> How would you like to proceed from here? I think that in addressing
> all of the comments given in the last few days, the documentation has
> gotten modestly worse. I think it was crisp and clear before, and now
> it feels a little ... over-edited. But if you're happy with the latest
> version, we can go with that. Or, do you need more time to review?

I am fine with moving ahead.  I thought my later emails explaining we
have to be careful communicated that.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Properly pathify the union planner

2024-03-28 Thread Tom Lane
David Rowley  writes:
> The problem is informing the UNION child query about what it is.  I
> thought I could do root->parent_root->parse->setOperations for a UNION
> child to know what it is, but that breaks for a query such as:

Yeah, having grouping_planner poke into the parent level
doesn't seem like a great idea here.  I continue to not like
the name "PlannerContext" but I agree passing down the setop
explicitly is the way to go.

>> Perhaps "SubqueryContext" or the like would be better?  It
>> still has the conflict with memory contexts though.

> Maybe something with "Parameters" in the name?

SubqueryParameters might be OK.  Or SubqueryPlannerExtra?
Since this is a bespoke struct that will probably only ever
be used with subquery_planner, naming it after that function
seems like a good idea.  (And, given that fact and the fact
that it's not a Node, I'm not sure it belongs in pathnodes.h.
We could just declare it in planner.h.)

Some minor comments now that I've looked at 66c0185a3 a little:

* Near the head of grouping_planner is this bit:

if (parse->setOperations)
{
/*
 * If there's a top-level ORDER BY, assume we have to fetch all the
 * tuples.  This might be too simplistic given all the hackery below
 * to possibly avoid the sort; but the odds of accurate estimates here
 * are pretty low anyway.  XXX try to get rid of this in favor of
 * letting plan_set_operations generate both fast-start and
 * cheapest-total paths.
 */
if (parse->sortClause)
root->tuple_fraction = 0.0;

I'm pretty sure this comment is mine, but it's old enough that I don't
recall exactly what I had in mind.  Still, it seems like your patch
has addressed precisely the issue of generating fast-start plans for
setops.  Should we now remove this reset of tuple_fraction?

* generate_setop_child_grouplist does this:

   /* assign a tleSortGroupRef, or reuse the existing one */
   sgc->tleSortGroupRef = assignSortGroupRef(tle, targetlist);
   tle->ressortgroupref = sgc->tleSortGroupRef;

That last line is redundant and confusing.  It is not this code's
charter to change ressortgroupref.

regards, tom lane




Re: Flushing large data immediately in pqcomm

2024-03-28 Thread Melih Mutlu
On Wed, Mar 27, 2024 at 14:39 David Rowley  wrote:

> On Fri, 22 Mar 2024 at 12:46, Melih Mutlu  wrote:
> > I did all of the above changes and it seems like those resolved the
> regression issue.
>
> Thanks for adjusting the patch.   The numbers do look better, but on
> looking at your test.sh script from [1], I see:
>
> meson setup --buildtype debug -Dcassert=true
> --prefix="$DESTDIR/usr/local/pgsql" $DESTDIR && \
>
> can you confirm if the test was done in debug with casserts on?   If
> so, it would be much better to have asserts off and have
> -Dbuildtype=release.


Yes, previous numbers were with --buildtype debug -Dcassert=true. I can
share new numbers with release build and asserts off soon.

Thanks,
Melih


Re: Flushing large data immediately in pqcomm

2024-03-28 Thread Melih Mutlu
On Wed, Mar 27, 2024 at 18:54 Robert Haas  wrote:

> On Wed, Mar 27, 2024 at 7:39 AM David Rowley  wrote:
> > Robert, I understand you'd like a bit more from this patch. I'm
> > wondering if you planning on blocking another committer from going
> > ahead with this? Or if you have a reason why the current state of the
> > patch is not a meaningful enough improvement that would justify
> > possibly not getting any improvements in this area for PG17?
>
> So, I think that the first version of the patch, when it got a big
> chunk of data, would just flush whatever was already in the buffer and
> then send the rest without copying.


Correct.

> The current version, as I
> understand it, only does that if the buffer is empty; otherwise, it
> copies as much data as it can into the partially-filled buffer.


Yes, currently it should fill and flush the buffer first, if it’s not
already empty. Only then does it send the rest without copying.
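
To spell that out (an illustrative sketch, not the patch itself; it reuses
the pqcomm.c names PqSendBuffer/PqSendPointer/PqSendBufferSize inside a
hypothetical internal_putbytes(const char *data, size_t len), and omits the
partial-write and retry handling):

	if (PqSendPointer > 0)
	{
		/* top up the partially-filled buffer, then flush it */
		size_t		n = Min((size_t) (PqSendBufferSize - PqSendPointer), len);

		memcpy(PqSendBuffer + PqSendPointer, data, n);
		PqSendPointer += n;
		data += n;
		len -= n;
		internal_flush();
	}
	if (len >= (size_t) PqSendBufferSize)
	{
		/* large remainder: write it out directly, skipping the memcpy */
		secure_write(MyProcPort, (void *) data, len);
		len = 0;
	}
	/* any small remainder is copied into the (now empty) buffer as usual */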

Thanks,
Melih


Re: Properly pathify the union planner

2024-03-28 Thread Tom Lane
I wrote:
> David Rowley  writes:
>> Maybe something with "Parameters" in the name?

> SubqueryParameters might be OK.  Or SubqueryPlannerExtra?
> Since this is a bespoke struct that will probably only ever
> be used with subquery_planner, naming it after that function
> seems like a good idea.

On third thought, I'm not at all convinced that we even want to
invent this struct as compared to just adding another parameter
to subquery_planner.  The problem with a struct is what happens
the next time we need to add a parameter?  If we add yet another
function parameter, we can count on the compiler to complain
about call sites that didn't get the memo.  Adding a field
within an existing struct provokes no such warning, leading
to situations with uninitialized fields that might accidentally
work during testing, but fail the minute they get to the field.

If you do want to go this direction, a minimum safety requirement
would be to have an ironclad rule that callers memset the whole
struct to zero before filling it, so that any not-set fields
will at least have predictable values.  But I don't see the
point really.
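
Concretely, the discipline would look something like this (a sketch using
the hypothetical SubqueryPlannerExtra name from upthread, and assuming
subquery_planner() grows a final parameter for it):

	SubqueryPlannerExtra extra;

	/* zero everything so any not-set fields have predictable values */
	memset(&extra, 0, sizeof(extra));
	extra.setop = parse->setOperations;

	subroot = subquery_planner(glob, subquery, root, false,
							   tuple_fraction, &extra);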

regards, tom lane




Re: Possibility to disable `ALTER SYSTEM`

2024-03-28 Thread Bruce Momjian
On Thu, Mar 28, 2024 at 01:23:36PM +0100, Jelte Fennema-Nio wrote:
> +   
> +Turning this setting off is intended for environments where the
> +configuration of PostgreSQL is managed by
> +some external tool.
> +In such environments, a well intentioned superuser might
> +mistakenly use ALTER SYSTEM
> +to change the configuration instead of using the external tool.
> +This might result in unintended behavior, such as the external tool
> +discarding the change at some later point in time when it updates the

"discarding" -> "overwriting" ?

> +  
> +   ALTER SYSTEM can be disabled by setting
> +to off, but 
> this
> +   is no security mechanism (as explained in detail in the documentation for

"is no" -> "is not a"

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: pg_upgrade --copy-file-range

2024-03-28 Thread Robert Haas
On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra
 wrote:
> The patch I shared a couple minutes ago should fix this, effectively
> restoring the original debug behavior. I liked the approach with calling
> strategy_implementation a bit more, I wonder if it'd be better to go
> back to that for the "accelerated" copy methods, somehow.

Somehow I don't see this patch?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add pg_basetype() function to obtain a DOMAIN base type

2024-03-28 Thread Tom Lane
jian he  writes:
> trying to do it this way.
> not sure the following error message is expected.

> SELECT pg_basetype(-1);
> ERROR:  cache lookup failed for type 4294967295

Yeah, that's not really OK.  You could say it's fine for bogus input,
but we've found over the years that it's better for catalog inspection
functions like this to be forgiving of bad input.  Otherwise,
your query can blow up in unexpected ways due to race conditions
(ie somebody just dropped the type you are interested in).

A fairly common solution to that is to return NULL for bad input,
but in this case we could just have it return the OID unchanged.

Either way though, we can't use getBaseType as-is.  We could imagine
extending that function to support a "noerror"-like flag, but I
believe it's already a hot-spot and I'd rather not complicate it
further.  So what I suggest doing is just duplicating the code;
there's not very much of it.

I did a little polishing of the docs and test cases too, ending
with the v7 attached.  I think this is about ready to go unless
there are objections to the definition.

Not sure what I think about your 0002 proposal to extend \dD
with this.  Aside from the server-version-compatibility problem,
I think it's about 90% redundant because \dD already shows
the immediate base type.  The new column would only be
different in the case of nested domains, which I think are
not common.  \dD's output is already pretty wide, so on the
whole I'm inclined to leave it alone.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 93b0bc2bc6..b3687b3645 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25129,6 +25129,29 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);

   
 
+  
+   
+
+ pg_basetype
+
+pg_basetype ( regtype )
+regtype
+   
+   
+Returns the OID of the base type of a domain identified by its
+type OID.  If the argument is not the OID of a domain type,
+returns the argument as-is.  If there's a chain of domain
+dependencies, it will recurse until finding the base type.
+   
+   
+Assuming CREATE DOMAIN mytext AS text:
+   
+   
+pg_basetype('mytext'::regtype)
+text
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index d4a92d0b3f..d2b4ba8a72 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -44,6 +44,7 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/ruleutils.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 
@@ -566,6 +567,48 @@ pg_typeof(PG_FUNCTION_ARGS)
 }
 
 
+/*
+ * Return the base type of the argument.
+ *		If the given type is a domain, return its base type;
+ *		otherwise return the type's own OID.
+ *
+ * This is a SQL-callable version of getBaseType().  Unlike that function,
+ * we don't want to fail for a bogus type OID; this is helpful to keep race
+ * conditions from turning into query failures when scanning the catalogs.
+ * Hence we need our own implementation.
+ */
+Datum
+pg_basetype(PG_FUNCTION_ARGS)
+{
+	Oid			typid = PG_GETARG_OID(0);
+
+	/*
+	 * We loop to find the bottom base type in a stack of domains.
+	 */
+	for (;;)
+	{
+		HeapTuple	tup;
+		Form_pg_type typTup;
+
+		tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid));
+		if (!HeapTupleIsValid(tup))
+			break;			/* return the bogus OID as-is */
+		typTup = (Form_pg_type) GETSTRUCT(tup);
+		if (typTup->typtype != TYPTYPE_DOMAIN)
+		{
+			/* Not a domain, so done */
+			ReleaseSysCache(tup);
+			break;
+		}
+
+		typid = typTup->typbasetype;
+		ReleaseSysCache(tup);
+	}
+
+	PG_RETURN_OID(typid);
+}
+
+
 /*
  * Implementation of the COLLATE FOR expression; returns the collation
  * of the argument.
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 07023ee61d..134e3b22fd 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3889,6 +3889,9 @@
 { oid => '1619', descr => 'type of the argument',
   proname => 'pg_typeof', proisstrict => 'f', provolatile => 's',
   prorettype => 'regtype', proargtypes => 'any', prosrc => 'pg_typeof' },
+{ oid => '8312', descr => 'base type of a domain type',
+  proname => 'pg_basetype', provolatile => 's', prorettype => 'regtype',
+  proargtypes => 'regtype', prosrc => 'pg_basetype' },
 { oid => '3162',
   descr => 'collation of the argument; implementation of the COLLATION FOR expression',
   proname => 'pg_collation_for', proisstrict => 'f', provolatile => 's',
diff --git a/src/test/regress/expected/domain.out b/src/test/regress/expected/domain.out
index dc58793e3f..71d9f1952c 100644
--- a/src/test/regress/expected/domain.out
+++ b/src/test/regress/expected/domain.out
@@ -1292,3 +1292,28 @@ SELECT * FROM information_schema.check_constraints
  regression | p

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-28 Thread Thomas Munro
On Fri, Mar 29, 2024 at 7:01 AM Tomas Vondra
 wrote:
> On 3/28/24 06:20, Thomas Munro wrote:
> > With the unexplained but apparently somewhat systematic regression
> > patterns on certain tests and settings, I wonder if they might be due
> > to read_stream.c trying to form larger reads, making it a bit lazier.
> > It tries to see what the next block will be before issuing the
> > fadvise.  I think that means that with small I/O concurrency settings,
> > there might be contrived access patterns where it loses, and needs
> > effective_io_concurrency to be set one notch higher to keep up, or
> > something like that.
>
> Yes, I think we've speculated this might be the root cause before, but
> IIRC we didn't manage to verify it actually is the problem.

Another factor could be the bug in master that allows it to get out of
sync -- can it allow *more* concurrency than it intended to?  Or fewer
hints, but somehow that goes faster because of the
stepping-on-kernel-toes problem?

> > One way to test that idea would be to run the
> > tests with io_combine_limit = 1 (meaning 1 block).  It issues advise
> > eagerly when io_combine_limit is reached, so I suppose it should be
> > exactly as eager as master.  The only difference then should be that
> > it automatically suppresses sequential fadvise calls.
>
> Sure, I'll give that a try. What are some good values to test? Perhaps
> 32 and 1, i.e. the default and "no coalescing"?

Thanks!  Yeah.  The default is actually 16, computed backwards from
128kB.  (Explanation: POSIX requires 16 as minimum IOV_MAX, ie number
of vectors acceptable to writev/readv and related functions, though
actual acceptable number is usually much higher, and it also seems to
be a conservative acceptable number for hardware scatter/gather lists
in various protocols, ie if doing direct I/O, the transfer won't be
chopped up into more than one physical I/O command because the disk
and DMA engine can handle it as a single I/O in theory at least.
Actual limit on random SSDs might be more like 33, a weird number but
that's what I'm seeing; Mr Axboe wrote a nice short article[1] to get
some starting points for terminology on that topic on Linux.  Also,
just anecdotally, returns seem to diminish after that with huge
transfers of buffered I/O so it seems like an OK number if you have to
pick one; but IDK, YMMV, subject for future research as direct I/O
grows in relevance, hence GUC.)
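
Spelled out (names here are illustrative, not the actual GUC plumbing):

	/* 128kB target transfer size / 8kB default BLCKSZ = 16 blocks */
	#define TARGET_IO_BYTES				(128 * 1024)
	#define DEFAULT_IO_COMBINE_LIMIT	(TARGET_IO_BYTES / BLCKSZ)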

> If this turns out to be the problem, does that mean we would consider
> using a more conservative default value? Is there some "auto tuning" we
> could do? For example, could we reduce the value combine limit if we
> start not finding buffers in memory, or something like that?

Hmm, not sure... I like that number for seq scans.  I also like
auto-tuning.  But it seems to me that *if* the problem is that we're
not allowing ourselves as many concurrent I/Os as master BHS because
we're waiting to see if the next block is consecutive, that might
indicate that the distance needs to be higher so that we can have a
better chance to see the 'edge' (the non-contiguous next block) and
start the I/O, not that the io_combine_limit needs to be lower.  But I
could be way off, given the fuzziness on this problem so far...

> Anyway, doesn't the combine limit work against the idea that
> effective_io_concurrency is "prefetch distance"? With eic=32 I'd expect
> we issue prefetch 32 pages ahead, i.e. if we prefetch page X, we should
> then process 32 pages before we actually need X (and we expect the page
> to already be in memory, thanks to the gap). But with the combine limit
> set to 32, is this still true?

Hmm.  It's different.  (1) Master BHS has prefetch_maximum, which is
indeed directly taken from the eic setting, while read_stream.c is
prepared to look much ahead further than that (potentially as far as
max_pinned_buffers) if it's been useful recently, to find
opportunities to coalesce and start I/O.  (2) Master BHS has
prefetch_target to control the look-ahead window, which starts small
and ramps up until it hits prefetch_maximum, while read_stream.c has
distance which goes up and down according to a more complex algorithm
described at the top.
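
Roughly, the ramp-up on master looks like this (a simplified sketch, not the
exact nodeBitmapHeapscan.c code):

	/* widen the prefetch window after each page, capped at the eic value */
	if (node->prefetch_target < node->prefetch_maximum)
		node->prefetch_target = Min(2 * node->prefetch_target + 1,
									node->prefetch_maximum);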

> I've tried going through read_stream_* to determine how this will
> behave, but read_stream_look_ahead/read_stream_start_pending_read does
> not make this very clear. I'll have to experiment with some tracing.

I'm going to try to set up something like your experiment here too,
and figure out some way to visualise or trace what's going on...

The differences come from (1) multi-block I/Os, requiring two separate
numbers: how many blocks ahead we're looking, and how many I/Os are
running, and (2) being more aggressive about trying to reach the
desired I/O level.  Let me try to describe the approach again.
"distance" is the size of a window that we're searching for
opportunities to start I/Os.  read_stream_look_ahead() will keep
looking ahead until we already have max_ios I/Os running, or we

Re: Popcount optimization using AVX512

2024-03-28 Thread Nathan Bossart
Here is a v14 of the patch that I think is beginning to approach something
committable.  Besides general review and testing, there are two things that
I'd like to bring up:

* The latest patch set from Paul Amonson appeared to support MSVC in the
  meson build, but not the autoconf one.  I don't have much expertise here,
  so the v14 patch doesn't have any autoconf/meson support for MSVC, which
  I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
  can always compile the x86_64 popcount code, but I don't know whether
  that's safe for AVX512.

* I think we need to verify there isn't a huge performance regression for
  smaller arrays.  IIUC those will still require an AVX512 instruction or
  two as well as a function call, which might add some noticeable overhead.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 9b5725e36aa8cff7caeb8683e11cd09bd5bda745 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v14 1/1] AVX512 popcount support

---
 config/c-compiler.m4   |  34 +++
 configure  | 165 +
 configure.ac   |  34 +++
 meson.build|  59 
 src/Makefile.global.in |   1 +
 src/include/pg_config.h.in |   9 ++
 src/include/port/pg_bitutils.h |  20 
 src/makefiles/meson.build  |   1 +
 src/port/Makefile  |   6 ++
 src/port/meson.build   |   6 +-
 src/port/pg_bitutils.c |  56 ---
 src/port/pg_popcount_avx512.c  |  98 
 12 files changed, 451 insertions(+), 38 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..f881e7ec28 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,37 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# -
+# Check if the compiler supports the AVX512 POPCNT instructions using the
+# _mm512_setzero_si512, _mm512_loadu_si512, _mm512_popcnt_epi64,
+# _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions.
+#
+# An optional compiler flag can be passed as argument
+# (e.g., -mavx512vpopcntdq).  If the intrinsics are supported, sets
+# pgac_avx512_popcnt_intrinsics and CFLAGS_AVX512_POPCNT.
+AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [const char buf@<:@sizeof(__m512i)@:>@;
+   PG_INT64_TYPE popcnt = 0;
+   __m512i accum = _mm512_setzero_si512();
+   const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
+   const __m512i cnt = _mm512_popcnt_epi64(val);
+   accum = _mm512_add_epi64(accum, cnt);
+   popcnt = _mm512_reduce_add_epi64(accum);
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_AVX512_POPCNT="$1"
+  pgac_avx512_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX512_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 36feeafbb2..189264b86e 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,7 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+CFLAGS_AVX512_POPCNT
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17404,6 +17405,41 @@ $as_echo "#define HAVE__GET_CPUID 1" >>confdefs.h
 
 fi
 
+# Check for x86 cpuid_count instruction
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
+$as_echo_n "checking for __get_cpuid_count... " >&6; }
+if ${pgac_cv__get_cpuid_count+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <cpuid.h>
+int
+main ()
+{
+unsigned int exx[4] = {0, 0, 0, 0};
+  __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__get_cpuid_count="yes"
+else
+  pgac_cv__get_cpuid_count="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__get_cpuid_count" >&5
+$as_echo "$pgac_cv__get_cpuid_count" >&6; }
+if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
+
+$as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
+
+fi
+
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuid" >&5
 $as_echo_n "checking for __cpuid... " >&6; }
 if ${pgac_cv__cpuid+:} false; then :
@@ -17438,6 +17474,135 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
 
 fi
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+$as_echo_n "checking

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-28 Thread Tomas Vondra


On 3/27/24 20:37, Melanie Plageman wrote:
> On Mon, Mar 25, 2024 at 12:07:09PM -0400, Melanie Plageman wrote:
>> On Sun, Mar 24, 2024 at 06:37:20PM -0400, Melanie Plageman wrote:
>>> On Sun, Mar 24, 2024 at 5:59 PM Tomas Vondra
>>>  wrote:

 BTW when you say "up to 'Make table_scan_bitmap_next_block() async
 friendly'" do you mean including that patch, or that this is the first
 patch that is not one of the independently useful patches.
>>>
>>> I think the code is easier to understand with "Make
>>> table_scan_bitmap_next_block() async friendly". Prior to that commit,
>>> table_scan_bitmap_next_block() could return false even when the bitmap
>>> has more blocks and expects the caller to handle this and invoke it
>>> again. I think that interface is very confusing. The downside of the
>>> code in that state is that the code for prefetching is still in the
>>> BitmapHeapNext() code and the code for getting the current block is in
>>> the heap AM-specific code. I took a stab at fixing this in v9's 0013,
>>> but the outcome wasn't very attractive.
>>>
>>> What I will do tomorrow is reorder and group the commits such that all
>>> of the commits that are useful independent of streaming read are first
>>> (I think 0014 and 0015 are independently valuable but they are on top
>>> of some things that are only useful to streaming read because they are
>>> more recently requested changes). I think I can actually do a bit of
>>> simplification in terms of how many commits there are and what is in
>>> each. Just to be clear, v9 is still reviewable. I am just going to go
>>> back and change what is included in each commit.
>>
>> So, attached v10 does not include the new version of streaming read API.
>> I focused instead on the refactoring patches commit regrouping I
>> mentioned here.
> 
> Attached v11 has the updated Read Stream API Thomas sent this morning
> [1]. No other changes.
> 

I think there's some sort of bug, triggering this assert in heapam

  Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);

I haven't looked for the root cause, and it's not exactly deterministic,
but try this:

  create table t (a int, b text);

  insert into t select 1 * random(), md5(i::text)
    from generate_series(1,1000) s(i);

  create index on t (a);

  explain analyze select * from t where a = 200;
  explain analyze select * from t where a < 200;

and then vary the condition a bit (different values, inequalities,
etc.). For me it hits the assert in a couple tries.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

0x79ff48a73d37 in epoll_wait () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install
glibc-2.37-18.fc38.x86_64 libgcc-13.2.1-4.fc38.x86_64 libicu-72.1-2.fc38.x86_64
libstdc++-13.2.1-4.fc38.x86_64 zlib-1.2.13-3.fc38.x86_64
(gdb) c
Continuing.

Program received signal SIGABRT, Aborted.
0x79ff489ef884 in __pthread_kill_implementation () from /lib64/libc.so.6
(gdb) bt
#0  0x79ff489ef884 in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x79ff4899eafe in raise () from /lib64/libc.so.6
#2  0x79ff4898787f in abort () from /lib64/libc.so.6
#3  0x009ba563 in ExceptionalCondition (conditionName=conditionName@entry=0xa209e0 "BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno", fileName=fileName@entry=0xa205b4 "heapam_handler.c", lineNumber=lineNumber@entry=2221) at assert.c:66
#4  0x0057759f in heapam_scan_bitmap_next_block (exact_pages=<optimized out>, lossy_pages=<optimized out>, recheck=<optimized out>, scan=0x1e73548) at heapam_handler.c:2221
#5  heapam_scan_bitmap_next_tuple (scan=0x1e73548, slot=<optimized out>, recheck=<optimized out>, lossy_pages=<optimized out>, exact_pages=<optimized out>) at heapam_handler.c:2359
#6  0x006f58bb in table_scan_bitmap_next_tuple (exact_pages=<optimized out>, lossy_pages=<optimized out>, recheck=<optimized out>, slot=<optimized out>, scan=<optimized out>) at ../../../src/include/access/tableam.h:2022
#7  BitmapHeapNext (node=0x1e71fc8) at nodeBitmapHeapscan.c:202
#8  0x006e46b8 in ExecProcNodeInstr (node=0x1e71fc8) at execProcnode.c:480
#9  0x006dd6fa in ExecProcNode (node=0x1e71fc8) at ../../../src/include/executor/executor.h:274
#10 ExecutePlan (execute_once=<optimized out>, dest=0xb755e0, direction=<optimized out>, numberTuples=0, sendTuples=true, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1e71fc8, estate=0x1e71d80) at execMain.c:1644
#11 standard_ExecutorRun (queryDesc=0x1e6c500, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#12 0x0067af84 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1e6c3f0, into=into@entry=0x0, es=es@entry=0x1db3138, queryString=queryString@entry=0x1d87f60 "explain analyze select * from t where a = 200;", params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbe111ac8, bufusage=0x0, mem_counters=0x0) at explain.c:645
#13 0x0067b417 in standard_ExplainOneQuery (query=<optimized out>, cursorOptions=2048, into=0x0, es=0x1db3138, queryString=0x1d87f60 "explain analyze select * from t where a = 200;", params=0x0, que

Re: pg_upgrade --copy-file-range

2024-03-28 Thread Tomas Vondra



On 3/28/24 21:45, Robert Haas wrote:
> On Tue, Mar 26, 2024 at 2:09 PM Tomas Vondra
>  wrote:
>> The patch I shared a couple minutes ago should fix this, effectively
>> restoring the original debug behavior. I liked the approach with calling
>> strategy_implementation a bit more, I wonder if it'd be better to go
>> back to that for the "accelerated" copy methods, somehow.
> 
> Somehow I don't see this patch?
> 

It's here:

https://www.postgresql.org/message-id/90866c27-265a-4adb-89d0-18c8dd22bc19%40enterprisedb.com

I did change the subject to reflect that it's no longer about
pg_upgrade; maybe that breaks the threading for you somehow?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Popcount optimization using AVX512

2024-03-28 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 04:38:54PM -0500, Nathan Bossart wrote:
> Here is a v14 of the patch that I think is beginning to approach something
> committable.  Besides general review and testing, there are two things that
> I'd like to bring up:
> 
> * The latest patch set from Paul Amonson appeared to support MSVC in the
>   meson build, but not the autoconf one.  I don't have much expertise here,
>   so the v14 patch doesn't have any autoconf/meson support for MSVC, which
>   I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
>   can always compile the x86_64 popcount code, but I don't know whether
>   that's safe for AVX512.
> 
> * I think we need to verify there isn't a huge performance regression for
>   smaller arrays.  IIUC those will still require an AVX512 instruction or
>   two as well as a function call, which might add some noticeable overhead.

I forgot to mention that I also want to understand whether we can actually
assume availability of XGETBV when CPUID says we support AVX512:

> + /*
> +  * We also need to check that the OS has enabled support for 
> the ZMM
> +  * registers.
> +  */
> +#ifdef _MSC_VER
> + return (_xgetbv(0) & 0xe0) != 0;
> +#else
> + uint64  xcr = 0;
> + uint32  high;
> + uint32  low;
> +
> +__asm__ __volatile__(" xgetbv\n":"=a"(low), "=d"(high):"c"(xcr));
> + return (low & 0xe0) != 0;
> +#endif

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-28 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Thursday, March 28, 2024 2:39 PM
> To: Amonson, Paul D 
> 
> * The latest patch set from Paul Amonson appeared to support MSVC in the
>   meson build, but not the autoconf one.  I don't have much expertise here,
>   so the v14 patch doesn't have any autoconf/meson support for MSVC, which
>   I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
>   can always compile the x86_64 popcount code, but I don't know whether
>   that's safe for AVX512.

I also do not know how to integrate MSVC+Autoconf; the CI uses MSVC+Meson+Ninja,
so I stuck with that.
 
> * I think we need to verify there isn't a huge performance regression for
>   smaller arrays.  IIUC those will still require an AVX512 instruction or
>   two as well as a function call, which might add some noticeable overhead.

Not considering your changes, I had already tested small buffers. At less than 
512 bytes there was no measurable regression (there was one extra condition 
check) and for 512+ bytes it moved from no regression to some gains between 512 
and 4096 bytes. Assuming you introduced no extra function calls, it should be 
the same.

> I forgot to mention that I also want to understand whether we can actually 
> assume availability of XGETBV when CPUID says we support AVX512:

You cannot assume so: there are edge cases where AVX-512 was detected on one
system at compile time but is not actually enabled by the kernel on a second
system at runtime, despite the CPU actually having the hardware feature.
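
For what it's worth, the belt-and-braces sequence is to check the OSXSAVE bit
from CPUID leaf 1 before executing XGETBV at all, and only then test the XCR0
state bits.  A sketch (GCC-style; the helper name is made up):

	#include <cpuid.h>
	#include <stdbool.h>
	#include <stdint.h>

	static bool
	zmm_os_enabled(void)
	{
		unsigned int eax, ebx, ecx, edx;
		uint32_t	lo, hi;

		/* XGETBV faults unless CPUID.1:ECX.OSXSAVE (bit 27) is set */
		if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || (ecx & (1u << 27)) == 0)
			return false;

		/* XCR0: bit 1 = XMM, bit 2 = YMM, bits 5-7 = AVX-512 state */
		__asm__ __volatile__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
		return (lo & 0xe6) == 0xe6;
	}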

I will review the new patch to see if there is anything that jumps out at me.

Thanks,
Paul





Re: incorrect results and different plan with 2 very similar queries

2024-03-28 Thread Tomas Vondra



On 3/27/24 23:10, Dave Cramer wrote:
> Dave Cramer
> 
> 
> On Wed, 27 Mar 2024 at 17:57, David Rowley  wrote:
> 
>> On Thu, 28 Mar 2024 at 10:33, Dave Cramer  wrote:
>>> There is a report on the pgjdbc github JDBC Driver shows erratic
>> behavior when filtering on CURRENT_DATE · pgjdbc/pgjdbc · Discussion #3184 (
>> github.com)
>>>
>>> Here are the plans.
>>>
>>> JDBC - Nested Loop (incorrect result)
>>>
>>> Index Cond: (mutation >= ((CURRENT_DATE -
>> '1971-12-31'::date) - 28))
>>
>>> JDBC - Hash Right (correct result)
>>>
>>> Recheck Cond: (mutation >= ((CURRENT_DATE -
>> '1971-12-31'::date) - 29))
>>
>> I don't see any version details or queries, but going by the
>> conditions above, the queries don't appear to be the same, so
>> different results aren't too surprising and not a demonstration that
>> there's any sort of bug.
>>
> 
> Sorry, you are correct. Version is 12.14. Here is the query
> 
> SELECT
>   p.partseqno_i
> , p.partno
> , p.partmatch
> , pfe.average_price
> , pfe.sales_price
> , pfe.purch_price
> , pfe.average_price_2
> , pfe.avg_repair_cost
> , pfe.average_price_func
> , pfe.fsv
> , pfe.fsv_func
> , p.status
> 
> FROM part p
> LEFT JOIN part_fa_entity pfe ON (p.partno = pfe.partno)
> WHERE 1=1
> AND (p.mutation >= (CURRENT_DATE - '1971-12-31'::date)-27) ORDER BY p.partno
> 

I guess the confusing bit is that the report does not claim that those
queries are expected to produce the same result, but that the parameter
value affects which plan gets selected, and one of those plans produces
an incorrect result.

I think the simplest explanation might be that one of the indexes on
part_fa_entity is corrupted and fails to look up some rows by partno.
That would explain why the plan with seqscan works fine, but nestloop
with index scan is missing these rows.

They might try a couple things:

1) set enable_nestloop=off, see if the results get correct

2) try bt_index_check on i_39773, might notice some corruption

3) rebuild the index

If it's not this, they'll need to build a reproducer. It's really
difficult to deduce what's going on just from query plans for different
parameter values.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Popcount optimization using AVX512

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Amonson, Paul D wrote:

> > -Original Message-
> > From: Nathan Bossart 
> > Sent: Thursday, March 28, 2024 2:39 PM
> > To: Amonson, Paul D 
> > 
> > * The latest patch set from Paul Amonson appeared to support MSVC in the
> >   meson build, but not the autoconf one.  I don't have much expertise here,
> >   so the v14 patch doesn't have any autoconf/meson support for MSVC, which
> >   I thought might be okay for now.  IIUC we assume that 64-bit/MSVC builds
> >   can always compile the x86_64 popcount code, but I don't know whether
> >   that's safe for AVX512.
> 
> I also do not know how to integrate MSVC+Autoconf, the CI uses
> MSVC+Meson+Ninja so I stuck with that.

We don't do MSVC via autoconf/Make.  We used to have a special build
framework for MSVC which parsed Makefiles to produce "solution" files,
but it was removed as soon as Meson was mature enough to build.  See
commit 1301c80b2167.  If it builds with Meson, you're good.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere."(Lamar Owen)




RE: Popcount optimization using AVX512

2024-03-28 Thread Amonson, Paul D
> -Original Message-
> From: Amonson, Paul D 
> Sent: Thursday, March 28, 2024 3:03 PM
> To: Nathan Bossart 
> ...
> I will review the new patch to see if there are anything that jumps out at me.

I see in the meson.build you added the new file twice?

@@ -7,6 +7,7 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
+  'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
@@ -84,6 +85,7 @@ replace_funcs_pos = [
   ['pg_crc32c_sse42', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_sse42_choose', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_SSE42_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_popcount_avx512', 'USE_AVX512_POPCNT_WITH_RUNTIME_CHECK', 
'avx512_popcnt'],

I was putting the file with special flags ONLY in the second section and all 
seemed to work. :)

Everything else seems good to me.

Thanks,
Paul





Re: PSQL Should \sv & \ev work with materialized views?

2024-03-28 Thread Erik Wienhold
On 2023-05-15 06:32 +0200, Kirk Wolak wrote:
> Personally I would appreciate it if \sv actually showed you the DDL.
> Oftentimes I will \ev something to review it, with syntax highlighting.

+1.  I was just reviewing some matviews and was surprised that psql
lacks commands to show their definitions.

But I think that it should be separate commands \sm and \em because we
already have commands \dm and \dv that distinguish between matviews and
views.

> This should not be that difficult.  Just looking for feedback.
> Admittedly \e is questionable, because you cannot really apply the changes.
> ALTHOUGH, I would consider that I could
> BEGIN;
> DROP MATERIALIZED VIEW ...;
> CREATE MATERIALIZED VIEW ...;
> 
> Which I had to do to change the WITH DATA so it creates with data when we
> reload our objects.

I think this could even be handled by optional modifiers, e.g. \em emits
CREATE MATERIALIZED VIEW ... WITH NO DATA and \emD emits WITH DATA.
Although I wouldn't mind manually changing WITH NO DATA to WITH DATA.

-- 
Erik




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-28 Thread Alexander Korotkov
On Fri, Mar 22, 2024 at 3:45 PM Kartyshov Ivan
 wrote:
> On 2024-03-20 12:11, Alexander Korotkov wrote:
> > On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan
> >  wrote:
> >> > 4.2 With an unreasonably high future LSN, BEGIN command waits
> >> > unboundedly, shouldn't we check if the specified LSN is more than
> >> > pg_last_wal_receive_lsn() error out?
> >
> > I think limiting wait lsn by current received lsn would destroy the
> > whole value of this feature.  The value is to wait till given LSN is
> > replayed, whether it's already received or not.
>
> Ok sounds reasonable, I`ll rollback the changes.
>
> > But I don't see a problem here. On the replica, it's out of our
> > control to check which lsn is good and which is not.  We can't check
> > whether an lsn, which is in the future for the replica, has already been
> > issued by the primary.
> >
> > For the case of wrong lsn, which could cause potentially infinite
> > wait, there is the timeout and the manual query cancel.
>
> Fully agree with this take.
>
> >> > 4.3 With an unreasonably high wait time, BEGIN command waits
> >> > unboundedly, shouldn't we restrict the wait time to some max value,
> >> > say a day or so?
> >> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
> >> > BEGIN AFTER :'future_receive_lsn' WITHIN 10;
> >>
> >> Good idea, I put it 1 day. But this limit we should to discuss.
> >
> > Do you think that specifying timeout in milliseconds is suitable?  I
> > would prefer to switch to seconds (with ability to specify fraction of
> > second).  This was expressed before by Alexander Lakhin.
>
> It sounds like an interesting idea. Please review the result.
>
> >> > https://github.com/macdice/redo-bench or similar tools?
> >
> > Ivan, could you do this?
>
> Yes, test redo-bench/crash-recovery.sh
> This patch on master
> 91.327, 1.973
> 105.907, 3.338
> 98.412, 4.579
> 95.818, 4.19
>
> REL_13-STABLE
> 116.645, 3.005
> 113.212, 2.568
> 117.644, 3.183
> 111.411, 2.782
>
> master
> 124.712, 2.047
> 117.012, 1.736
> 116.328, 2.035
> 115.662, 1.797
>
> Strange behavior, patched version is faster than REL_13-STABLE and
> master.

I've run this test on my machine with v13 of the path.

patched
53.663, 0.466
53.884, 0.402
54.102, 0.441

master
55.216, 0.441
54.52, 0.464
51.479, 0.438

It seems that the difference is less than the variance.

--
Regards,
Alexander Korotkov




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-28 Thread Euler Taveira
On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:
> Fixed along with other issues spotted by Alexander Lakhin.

[I didn't read the whole thread. I'm sorry if I missed something ...]

You renamed the function in a previous version but let me suggest another one:
pg_wal_replay_wait. It uses the same pattern as the other recovery control
functions [1]. I think "for" doesn't add much to the function name and "lsn" is
used in functions that return an LSN (that's not the case here).

postgres=# \df pg_wal_replay*
List of functions
-[ RECORD 1 ]---+-
Schema  | pg_catalog
Name| pg_wal_replay_pause
Result data type| void
Argument data types | 
Type| func
-[ RECORD 2 ]---+-
Schema  | pg_catalog
Name| pg_wal_replay_resume
Result data type| void
Argument data types | 
Type| func

Regarding the arguments, I think the timeout should be bigint. There is at least
another function that implements a timeout that uses bigint. 

postgres=# \df pg_terminate_backend
List of functions
-[ RECORD 1 ]---+--
Schema  | pg_catalog
Name| pg_terminate_backend
Result data type| boolean
Argument data types | pid integer, timeout bigint DEFAULT 0
Type| func

I also suggest that the timeout unit should be milliseconds; hence, using
bigint is perfectly fine for the timeout argument.

+   
+Throws an ERROR if the target lsn was not replayed
+on standby within given timeout.  Parameter timeout
+is the time in seconds to wait for the target_lsn
+replay. When timeout value equals to zero no
+timeout is applied.
+   


[1] 
https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-RECOVERY-CONTROL


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: PSQL Should \sv & \ev work with materialized views?

2024-03-28 Thread Erik Wienhold
I wrote:
> On 2023-05-15 06:32 +0200, Kirk Wolak wrote:
> > Personally I would appreciate it if \sv actually showed you the DDL.
> > Oftentimes I will \ev something to review it, with syntax highlighting.
> 
> +1.  I was just reviewing some matviews and was surprised that psql
> lacks commands to show their definitions.
> 
> But I think that it should be separate commands \sm and \em because we
> already have commands \dm and \dv that distinguish between matviews and
> views.

Separate commands are not necessary because \ev and \sv already have a
(disabled) provision in get_create_object_cmd for when CREATE OR REPLACE
MATERIALIZED VIEW is available.  So I guess both commands should also
apply to matview.  The attached patch replaces that provision with a
transaction that drops and creates the matview.  This uses meta command
\; to put multiple statements into the query buffer without prematurely
sending those statements to the server.

Demo:

=> DROP MATERIALIZED VIEW IF EXISTS test;
DROP MATERIALIZED VIEW
=> CREATE MATERIALIZED VIEW test AS SELECT s FROM generate_series(1, 
10) s;
SELECT 10
=> \sv test
BEGIN \;
DROP MATERIALIZED VIEW public.test \;
CREATE MATERIALIZED VIEW public.test AS
 SELECT s
   FROM generate_series(1, 10) s(s)
 WITH DATA \;
COMMIT
=>

And \ev test works as well.

Of course the problem with using DROP and CREATE is that indexes and
privileges (anything else?) must also be restored.  I haven't bothered
with that yet.

-- 
Erik
>From efb5e37d90b668011307b602655f28455d700635 Mon Sep 17 00:00:00 2001
From: Erik Wienhold 
Date: Fri, 29 Mar 2024 01:08:35 +0100
Subject: [PATCH v1] psql: \ev and \sv for matviews

CREATE OR REPLACE is not available for materialized views so DROP and
CREATE them inside a transaction.  Use meta command \; to compose the
query buffer without sending it to the server.

TODO: Re-create indexes and privileges which are currently lost by
  relying on DROP and CREATE.
---
 src/bin/psql/command.c | 29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 9b0fa041f7..f40c1d7f99 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -5575,19 +5575,22 @@ get_create_object_cmd(EditableObjectType obj_type, Oid oid,
 			char	   *reloptions = PQgetvalue(res, 0, 4);
 			char	   *checkoption = PQgetvalue(res, 0, 5);
 
-			/*
-			 * If the backend ever supports CREATE OR REPLACE
-			 * MATERIALIZED VIEW, allow that here; but as of today it
-			 * does not, so editing a matview definition in this way
-			 * is impossible.
-			 */
 			switch (relkind[0])
 			{
-#ifdef NOT_USED
 				case RELKIND_MATVIEW:
-					appendPQExpBufferStr(buf, "CREATE OR REPLACE MATERIALIZED VIEW ");
+					/*
+					 * Allow editing a matview via separate DROP and
+					 * CREATE statement inside a transaction.  Use meta
+					 * command \; to write more than one statement to
+					 * the query buffer without sending it to the server.
+					 */
+					appendPQExpBufferStr(buf, "BEGIN \\;\n");
+					appendPQExpBufferStr(buf, "DROP MATERIALIZED VIEW ");
+					appendPQExpBuffer(buf, "%s.", fmtId(nspname));
+					appendPQExpBufferStr(buf, fmtId(relname));
+					appendPQExpBufferStr(buf, " \\;\n");
+					appendPQExpBufferStr(buf, "CREATE MATERIALIZED VIEW ");
 					break;
-#endif
 				case RELKIND_VIEW:
 					appendPQExpBufferStr(buf, "CREATE OR REPLACE VIEW ");
 					break;
@@ -5625,6 +5628,14 @@ get_create_object_cmd(EditableObjectType obj_type, Oid oid,
 				if (checkoption && checkoption[0] != '\0')

RE: Synchronizing slots from primary to standby

2024-03-28 Thread Zhijie Hou (Fujitsu)
On Thursday, March 28, 2024 10:02 PM Zhijie Hou (Fujitsu) 
 wrote:
> 
> On Thursday, March 28, 2024 7:32 PM Amit Kapila 
> wrote:
> >
> > On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu)
> > 
> > wrote:
> > >
> > > When analyzing one BF error[1], we find an issue of slotsync: Since
> > > we don't perform logical decoding for the synced slots when syncing
> > > the lsn/xmin of slot, no logical snapshots will be serialized to
> > > disk. So, when user starts to use these synced slots after
> > > promotion, it needs to re-build the consistent snapshot from the
> > > restart_lsn if the
> > > WAL(xl_running_xacts) at restart_lsn position indicates that there
> > > are running transactions. This however could cause the data that
> > > before the
> > consistent point to be missed[2].
> > >
> > > This issue doesn't exist on the primary because the snapshot at
> > > restart_lsn should have been serialized to disk
> > > (SnapBuildProcessRunningXacts -> SnapBuildSerialize), so even if
> > > logical decoding restarts, it can find a consistent snapshot
> > > immediately at restart_lsn.
> > >
> > > To fix this, we could use fast-forward logical decoding to advance
> > > the synced slot's lsn/xmin when syncing these values, instead of
> > > directly updating the slot's info. This way, the snapshot will be
> > > serialized to disk during decoding. If we cannot reach the
> > > consistent point at the remote restart_lsn, the slot is marked as
> > > temporary and will be persisted once it reaches the consistent
> > > point. I am still analyzing the fix and will share it once ready.
> > >
> >
> > Yes, we can use this, but one thing to note is that
> > CreateDecodingContext() will expect the slot's database and the
> > current database to be the same. I think the reason for that is that
> > we need to check system tables of the current database while decoding
> > and sending data to the output plugin, which won't be a requirement
> > for the fast_forward case. So, we need to skip that check in
> > fast_forward mode.
> 
> Agreed.
> 
> >
> > Next, I was thinking about the case of updating the restart and
> > confirmed_flush LSNs for the first time while syncing the slots. I think
> > we can keep the current logic as it is, based on the following analysis.
> >
> > For each logical slot, cases possible on the primary:
> > 1. The restart_lsn doesn't have a serialized snapshot and hasn't yet
> > reached the consistent point.
> > 2. The restart_lsn doesn't have a serialized snapshot but has reached
> > a consistent point.
> > 3. The restart_lsn has a serialized snapshot which means it has
> > reached a consistent point as well.
> >
> > Considering we keep the logic to reserve initial WAL positions the
> > same as the current (Reserve WAL for the currently active local slot
> > using the specified WAL location (restart_lsn). If the given WAL
> > location has been removed, reserve WAL using the oldest existing WAL
> > segment.), I could think of the below
> > scenarios:
> > A. For 1, we shouldn't sync the slot as it still wouldn't have been
> > marked persistent on the primary.
> > B. For 2, we would sync the slot
> >    B1. If remote_restart_lsn >= local_restart_lsn, then advance the
> > slot by calling pg_logical_replication_slot_advance().
> >B11. If we reach consistent point, then it should be okay
> > because after promotion as well we should reach consistent point.
> > B111. But again, is it possible that there is some xact
> > that comes before the consistent_point on the primary and the same is
> > after the consistent_point on the standby? This shouldn't matter, as we
> > will start decoding transactions after confirmed_flush_lsn, which would
> > be the same on the primary and standby.
> >B22. If we haven't reached consistent_point, then we won't mark
> > the slot as persistent, and at the next sync we will do the same till
> > it reaches consistent_point. At that time, the situation will be similar to 
> > B11.
> >B2. If remote_restart_lsn < local_restart_lsn, then we will wait
> > for the next sync cycle and keep the slot as temporary. Once in the
> > next or some consecutive sync cycle, we reach the condition
> > remote_restart_lsn >= local_restart_lsn, we will proceed to advance
> > the slot and we should have the same behavior as B1.
> > C. For 3, we would sync the slot, but the behavior should be the same as B.
> >
> > Thoughts?
> 
> Looks reasonable to me.
> 
> Here is the patch based on above lines.
> I am also testing and verifying the patch locally.

Attached is a new version of the patch, which fixes an uninitialized
variable issue and adds some comments. It also temporarily enables DEBUG2
for the 040 tap test so that we can analyze possible CFbot failures more
easily.
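
For context, fast-forward mode is the same decoding mode that the SQL
function pg_replication_slot_advance() uses internally for logical slots,
and it serializes logical snapshots to disk along the way, which is
exactly what was missing for the synced slots. For example, on a primary
(slot name made up):

SELECT pg_replication_slot_advance('myslot', pg_current_wal_lsn());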

Best Regards,
Hou zj


v2-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
Description:  v2-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-28 Thread Thomas Munro
On Fri, Mar 29, 2024 at 10:43 AM Tomas Vondra
 wrote:
> I think there's some sort of bug, triggering this assert in heapam
>
>   Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);

Thanks for the repro.  I can't seem to reproduce it (still trying) but
I assume this is with Melanie's v11 patch set which had
v11-0016-v10-Read-Stream-API.patch.

Would you mind removing that commit and instead applying the v13
stream_read.c patches[1]?  v10 stream_read.c was a little confused
about random I/O combining, which I fixed with a small adjustment to
the conditions for the "if" statement right at the end of
read_stream_look_ahead().  Sorry about that.  The fixed version, with
eic=4, with your test query using WHERE a < a, ends its scan with:

...
posix_fadvise(32,0x28aee000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0@4\M-5:\0\0\^D\0\M-x\^A"...,40960,0x28acc000) = 40960 (0xa000)
posix_fadvise(32,0x28af4000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0\^XC\M-6:\0\0\^D\0\M-x"...,32768,0x28ad8000) = 32768 (0x8000)
posix_fadvise(32,0x28afc000,0x4000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0\M-XQ\M-7:\0\0\^D\0\M-x"...,24576,0x28ae4000) = 24576 (0x6000)
posix_fadvise(32,0x28b02000,0x8000,POSIX_FADV_WILLNEED) = 0 (0x0)
pread(32,"\0\0\0\0\M^@3\M-8:\0\0\^D\0\M-x"...,16384,0x28aee000) = 16384 (0x4000)
pread(32,"\0\0\0\0\M-`\M-:\M-8:\0\0\^D\0"...,16384,0x28af4000) = 16384 (0x4000)
pread(32,"\0\0\0\0po\M-9:\0\0\^D\0\M-x\^A"...,16384,0x28afc000) = 16384 (0x4000)
pread(32,"\0\0\0\0\M-P\M-v\M-9:\0\0\^D\0"...,32768,0x28b02000) = 32768 (0x8000)

In other words, it's able to coalesce, but v10 was a bit b0rked in that
respect and wouldn't do as well.  Then if you set
io_combine_limit = 1, it looks more like master, e.g. lots of little
reads, but not as many fadvises as master because of sequential
access:

...
posix_fadvise(32,0x28af4000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) -+
pread(32,...,8192,0x28ae8000) = 8192 (0x2000)  |
pread(32,...,8192,0x28aee000) = 8192 (0x2000)  |
posix_fadvise(32,0x28afc000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) ---+
pread(32,...,8192,0x28af) = 8192 (0x2000)  | |
pread(32,...,8192,0x28af4000) = 8192 (0x2000) <+ |
posix_fadvise(32,0x28b02000,0x2000,POSIX_FADV_WILLNEED) = 0 (0x0) -+
pread(32,...,8192,0x28af6000) = 8192 (0x2000)| |
pread(32,...,8192,0x28afc000) = 8192 (0x2000) <--+ |
pread(32,...,8192,0x28afe000) = 8192 (0x2000)  }-- no advice   |
pread(32,...,8192,0x28b02000) = 8192 (0x2000) <+
pread(32,...,8192,0x28b04000) = 8192 (0x2000)  }
pread(32,...,8192,0x28b06000) = 8192 (0x2000)  }-- no advice
pread(32,...,8192,0x28b08000) = 8192 (0x2000)  }

It becomes slightly less eager to start I/Os as soon as
io_combine_limit > 1: once it has hit max_ios, if the average number of
blocks it can combine is bigger than 4, an arbitrary number from:

max_pinned_buffers = Max(max_ios * 4, io_combine_limit);

then it can run out of look-ahead window before it can reach
max_ios (aka eic), so that's a kind of arbitrary/bogus I/O depth
constraint, which is another way of saying what I was saying earlier:
maybe it just needs more distance.  So let's see the average combined
I/O length in your test query... for me it works out to 27,169 bytes.
But I think there must be times when it runs out of window due to
clustering.  So you could also try increasing that 4->8 to see what
happens to performance.
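
To make that concrete, assuming the default 8kB block size and
io_combine_limit = 16: with eic=4 the window is
max_pinned_buffers = Max(4 * 4, 16) = 16 buffers, and an average
combined read of 27,169 bytes is ~3.3 blocks, so only about
16 / 3.3 ~= 4.8 I/Os fit in the window, barely above max_ios.  Any
clustering that pushes the average above 4 blocks would cap the
effective I/O depth below eic.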

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2B5UofvseJWv6YqKmuc_%3Drguc7VqKcNEG1eawKh3MzHXQ%40mail.gmail.com




Re: CREATE TABLE creates a composite type corresponding to the table row, which is and is not there

2024-03-28 Thread David G. Johnston
On Thu, Mar 7, 2024 at 9:29 PM Erik Wienhold  wrote:

> I wrote:
> > The attached v2 is a simpler patch that instead modifies the existing
> > error message.
>
> Forgot to attach v2.
>
>
For consideration for the doc portion.  The existing wording is too
imprecise for my liking and just tacking on "expects...create type" is
jarring.

"""
Creates a typed table, which takes its structure from an existing
(optionally schema-qualified) stand-alone composite type (i.e., one created
using CREATE TYPE), though it still produces a new composite type as well.
The table will have a dependency on the referenced type, such that cascaded
alter and drop actions on the type will propagate to the table.

A typed table always has the same column names and data types as the type
it is derived from, and you cannot specify additional columns.  But the
CREATE TABLE command can add defaults and constraints to the table, as well
as specify storage parameters.
"""

We do use the term "stand-alone composite" in CREATE TYPE, so I'm inclined
to use it instead of "composite created with CREATE TYPE", especially in
the error messages; I'm a bit more willing to add the cross-reference to
CREATE TYPE in the user docs.

David J.


Re: AIX support

2024-03-28 Thread Noah Misch
On Thu, Mar 28, 2024 at 11:09:43AM +, Sriram RK wrote:
> We are setting up the build environment and trying to build the source, and
> also trying to analyze the assert from the AIX point of view.

The thread Alvaro and Tom cited contains an analysis.  It's a compiler bug.
You can get past it by upgrading your compiler; both ibm-clang 17.1.1.2
and gcc 13.2.0 are free of the bug.

> Also, we would like to know if we can access the buildfarm (Power machines)
> to run any of the specific tests to hit this assert.

https://portal.cfarm.net/users/new/ is the form to request access.  It lists
the eligibility criteria.




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-28 Thread torikoshia

On 2024-03-28 21:54, Masahiko Sawada wrote:
On Thu, Mar 28, 2024 at 9:38 PM torikoshia  wrote:


On 2024-03-28 10:20, Masahiko Sawada wrote:
> Hi,
>
> On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> wrote:
>>
>> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
>>  wrote:
>> >
>> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia  wrote:
>> > > On 2024-01-18 10:10, jian he wrote:
>> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
>> > > > wrote:
>> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
>> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
>> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
>> > > >> > You will need a separate parameter anyway to specify the destination
>> > > >> > of "log", unless "none" became an illegal table name when I wasn't
>> > > >> > looking.  I don't buy that one parameter that has some special values
>> > > >> > while other values could be names will be a good design.  Moreover,
>> > > >> > what if we want to support (say) log-to-file along with log-to-table?
>> > > >> > Trying to distinguish a file name from a table name without any other
>> > > >> > context seems impossible.
>> > > >>
>> > > >> I've been thinking we can add more values to this option to log errors
>> > > >> not only to the server logs but also to the error table (not sure
>> > > >> details but I imagined an error table is created for each table on
>> > > >> error), without an additional option for the destination name. The
>> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
>> > > >>
>> > > >
>> > > > another idea:
>> > > > on_error {error|ignore|other_future_option}
>> > > > if not specified then by default ERROR.
>> > > > You can also specify ERROR or IGNORE for now.
>> > > >
>> > > > I agree, the parameter "error_action" is better than "location".
>> > >
>> > > I'm not sure whether error_action or on_error is better, but either way
>> > > "error_action error" and "on_error error" seems a bit odd to me.
>> > > I feel "stop" is better for both cases as Tom suggested.
>> >
>> > OK.  What about this?
>> > on_error {stop|ignore|other_future_option}
>> > where other_future_option might be compound like "file 'copy.log'" or
>> > "table 'copy_log'".
>>
>> +1
>>
>
> I realized that the ON_ERROR syntax synopsis in the documentation is not
> correct. The option doesn't require the value to be quoted, and the
> value can be omitted. The attached patch fixes it.
>
> Regards,

Thanks!

The attached patch fixes the doc, but I'm wondering whether it might be
better to modify the code to prohibit omitting the value.

When seeing a query that omits the ON_ERROR value, I feel it's not
obvious what happens, compared to other options that tolerate omitting
the value, such as FREEZE or HEADER:

   COPY t1 FROM stdin WITH (ON_ERROR);

What do you think?


Indeed. Looking at the options of other commands such as VACUUM and
EXPLAIN, I can see that we can omit a boolean value, but non-boolean
parameters require a value. The HEADER option is not a pure boolean
parameter, yet we can omit its value; that seems to be for backward
compatibility, since it used to be a boolean parameter. I agree that
the above example would confuse users.

Regards,


Thanks for your comment!

Attached is a patch which modifies the code to prohibit omission of the
value.

I was a little unsure about adding a regression test for this, but I
have not added one, since no other COPY option tests the omission of
its value.
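
With the patch, the ambiguous form should now be rejected up front with
defGetString()'s standard message, something like:

=# COPY t1 FROM stdin WITH (ON_ERROR);
ERROR:  on_error requires a parameter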


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

From 1b4bec3c2223246ec59ffb9eb7de2f1de27315f7 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Fri, 29 Mar 2024 11:36:12 +0900
Subject: [PATCH v1] Disallow ON_ERROR option without value

Currently the ON_ERROR option of COPY allows its value to be omitted,
but the syntax synopsis in the documentation requires it.

Since non-boolean parameters usually require a value and it's not
obvious what happens when the ON_ERROR value is omitted, this patch
disallows ON_ERROR without a value.
---
 src/backend/commands/copy.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 28cf8b040a..2719bf28b7 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -392,7 +392,7 @@ defGetCopyHeaderChoice(DefElem *def, bool is_from)
 static CopyOnErrorChoice
 defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
 {
-	char	   *sval;
+	char	   *sval = defGetString(def);
 
 	if (!is_from)
 		ereport(ERROR,
@@ -400,16 +400,9 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
  errmsg("COPY ON_ERROR cannot be used with COPY TO"),
  parser_errposition(pstate, def->location)));
 
-	/*
-	 * If no parameter value given, assume the default value.
-	 */
-	if (def->arg == NULL)
-		return COPY_ON_ERROR_STOP;
-
 	/*
 	 * Allow "stop", or 
