Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada wrote: > > On Mon, Jan 8, 2024 at 8:35 PM John Naylor wrote: > > > > On Wed, Jan 3, 2024 at 9:10 PM John Naylor wrote: > > > > > > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada > > > wrote: > > > > > > > I agree that we expose RT_LOCK_* functions and have tidstore use them, > > > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" > > > > calls part. I think that even if we expose them, we will still need to > > > > do something like "if (TidStoreIsShared(ts)) > > > > shared_rt_lock_share(ts->tree.shared)", no? > > > > > > I'll come back to this topic separately. > > > > To answer your question, sure, but that "if (TidStoreIsShared(ts))" > > part would be pushed down into a function so that only one place has > > to care about it. > > > > However, I'm starting to question whether we even need that. Meaning, > > lock the tidstore separately. To "lock the tidstore" means to take a > > lock, _separate_ from the radix tree's internal lock, to control > > access to two fields in a separate "control object": > > > > +typedef struct TidStoreControl > > +{ > > + /* the number of tids in the store */ > > + int64 num_tids; > > + > > + /* the maximum bytes a TidStore can use */ > > + size_t max_bytes; > > > > I'm pretty sure max_bytes does not need to be in shared memory, and > > certainly not under a lock: Thinking of a hypothetical > > parallel-prune-phase scenario, one way would be for a leader process > > to pass out ranges of blocks to workers, and when the limit is > > exceeded, stop passing out blocks and wait for all the workers to > > finish. > > True. I agreed that it doesn't need to be under a lock anyway, as it's > read-only. > > > > > As for num_tids, vacuum previously put the similar count in > > > > @@ -176,7 +179,8 @@ struct ParallelVacuumState > > PVIndStats *indstats; > > > > /* Shared dead items space among parallel vacuum workers */ > > - VacDeadItems *dead_items; > > + TidStore *dead_items; > > > > VacDeadItems contained "num_items". What was the reason to have new > > infrastructure for that count? And it doesn't seem like access to it > > was controlled by a lock -- can you confirm? If we did get parallel > > pruning, maybe the count would belong inside PVShared? > > I thought that since the tidstore is a general-purpose data structure > the shared counter should be protected by a lock. One thing I'm > concerned about is that we might need to update both the radix tree > and the counter atomically in some cases. But that's true we don't > need it for lazy vacuum at least for now. Even given the parallel scan > phase, probably we won't need to have workers check the total number > of stored tuples during a parallel scan. > > > > > The number of tids is not that tightly bound to the tidstore's job. I > > believe tidbitmap.c (a possible future client) doesn't care about the > > global number of tids -- not only that, but AND/OR operations can > > change the number in a non-obvious way, so it would not be convenient > > to keep an accurate number anyway. But the lock would still be > > mandatory with this patch. > > Very good point. > > > > > If we can make vacuum work a bit closer to how it does now, it'd be a > > big step up in readability, I think. Namely, getting rid of all the > > locking logic inside tidstore.c and let the radix tree's locking do > > the right thing. We'd need to make that work correctly when receiving > > pointers to values upon lookup, and I already shared ideas for that. 
> > But I want to see if there is any obstacle in the way of removing the > > tidstore control object and its separate lock. > > So I agree to remove both max_bytes and num_items from the control > object. Also, as you mentioned, we can remove the tidstore control > object itself. TidStoreGetHandle() returns a radix tree handle, and we > can pass it to TidStoreAttach(). I'll try it. > I realized that if we remove the whole tidstore control object including max_bytes, processes that have attached to the shared tidstore cannot actually use TidStoreIsFull(), as it always returns true. They cannot use TidStoreReset() either, since it needs to pass max_bytes to RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it could be problematic for general use. If we remove it, we probably need a safeguard to prevent those who attached the tidstore from calling these functions. Or we can keep the control object but remove the lock and num_tids. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
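For illustration, here is a minimal sketch of pushing the "is it shared?" check down into a pair of helpers, so that only one place in tidstore.c has to care whether the underlying radix tree lives in shared memory. This is not actual code: TidStoreSharedLock() and the helper names are hypothetical, and the thread's later idea of relying purely on the radix tree's internal locking would remove even this.

```c
#include "postgres.h"
#include "storage/lwlock.h"

/* assumed to exist elsewhere in tidstore.c; shown only as prototypes */
typedef struct TidStore TidStore;
extern bool TidStoreIsShared(TidStore *ts);
extern LWLock *TidStoreSharedLock(TidStore *ts);

/* the only spot that needs to know whether the store is shared */
static inline void
tidstore_lock_exclusive(TidStore *ts)
{
    if (TidStoreIsShared(ts))
        LWLockAcquire(TidStoreSharedLock(ts), LW_EXCLUSIVE);
}

static inline void
tidstore_unlock(TidStore *ts)
{
    if (TidStoreIsShared(ts))
        LWLockRelease(TidStoreSharedLock(ts));
}
```

Callers would then simply bracket their accesses with tidstore_lock_exclusive()/tidstore_unlock(), whether the store is local or shared.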
plpgsql memory leaks
Hi,

I would like to report a very memory-expensive pattern:

CREATE OR REPLACE FUNCTION public.fx(iter integer)
 RETURNS void
 LANGUAGE plpgsql
AS $function$
declare
  c cursor(m bigint) for select distinct i from generate_series(1, m) g(i);
  t bigint;
  s bigint;
begin
  for i in 1..iter
  loop
    open c(m := i * 1);
    s := 0;
    loop
      fetch c into t;
      exit when not found;
      s := s + t;
    end loop;
    close c;
    raise notice '%=%', i, s;
  end loop;
end;
$function$;

For 100 iterations this script takes 100MB, but the rewritten version

CREATE OR REPLACE FUNCTION public.fx(iter integer)
 RETURNS void
 LANGUAGE plpgsql
AS $function$
declare
  t bigint;
  s bigint;
begin
  for i in 1..iter
  loop
    s := 0;
    for t in select ic from generate_series(1, i * 1) g(ic)
    loop
      s := s + t;
    end loop;
    raise notice '%=%', i, s;
  end loop;
end;
$function$

takes a lot of megabytes of memory too.

Regards

Pavel
Re: alter table add x wrong error position
On 2024-Jan-08, jian he wrote: > hi. > Maybe this is a small printout err_position bug. > > create table atacc2 ( test int, a int, b int) ; > success tests: > alter table atacc2 add CONSTRAINT x PRIMARY KEY (id, b ); > alter table atacc2 add CONSTRAINT x PRIMARY KEY (id, b a); > alter table atacc2 add CONSTRAINT x PRIMARY KEYa (id, b); > > tests have problem: > alter table atacc2 add constraints x unique (test, a, b); > ERROR: syntax error at or near "(" > LINE 1: alter table atacc2 add constraints x unique (test, a, b); > > ^ > ADD either following with the optional keyword "COLUMN" or > "CONSTRAINT" as the doc. > so I should expect the '^' point at "constraints"? Here you're saying to add a column called constraints, of type x; then UNIQUE is parsed by columnDef as ColQualList, which goes to the ColConstraintElem production starting with the UNIQUE keyword: | UNIQUE opt_unique_null_treatment opt_definition OptConsTableSpace so the next thing could be opt_unique_null_treatment or opt_definition or OptConsTableSpace or going back to ColQualList, but none of those start with a '(' parens. So the ( doesn't have a match and you get the syntax error. If you don't misspell CONSTRAINT as "constraints", there's no issue. I don't see any way to improve this. You can't forbid misspellings of the keyword CONSTRAINT, because they can be column names. alter table atacc2 add cnstrnt x unique (test, a, b); ERROR: error de sintaxis en o cerca de «(» LÍNEA 1: alter table atacc2 add cnstrnt x unique (test, a, b); ^ -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/
Re: plpgsql memory leaks
pá 12. 1. 2024 v 10:27 odesílatel Pavel Stehule napsal: > Hi > > I have reported very memory expensive pattern: > > CREATE OR REPLACE FUNCTION public.fx(iter integer) > RETURNS void > LANGUAGE plpgsql > AS $function$ > declare > c cursor(m bigint) for select distinct i from generate_series(1, m) g(i); > t bigint; > s bigint; > begin > for i in 1..iter > loop > open c(m := i * 1); > s := 0; > loop > fetch c into t; > exit when not found; > s := s + t; > end loop; > close c; raise notice '%=%', i, s; > end loop; > end; > $function$ > ; > > This script takes for 100 iterations 100MB > > but rewritten > > CREATE OR REPLACE FUNCTION public.fx(iter integer) > RETURNS void > LANGUAGE plpgsql > AS $function$ > declare > t bigint; > s bigint; > begin > for i in 1..iter > loop > s := 0; > for t in select ic from generate_series(1, i * 1) g(ic) > loop > s := s + t; > end loop; > raise notice '%=%', i, s; > end loop; > end; > $function$ > > takes lot of megabytes of memory too. > The megabytes leaks are related to JIT. With JIT off the memory consumption is significantly less although there are some others probably. regards Pavel > Regards > > Pavel >
Re: A failure in t/038_save_logical_slots_shutdown.pl
On Fri, Jan 12, 2024 at 9:28 AM Amit Kapila wrote: > > On Thu, Jan 11, 2024 at 10:03 PM Bharath Rupireddy > wrote: > > > > On Thu, Jan 11, 2024 at 4:35 PM vignesh C wrote: > > > > > > On further analysis, it was found that in the failing test, > > > CHECKPOINT_SHUTDOWN was started in a new page, so there was the WAL > > > page header present just before the CHECKPOINT_SHUTDOWN which was > > > causing the failure. We could alternatively reproduce the issue by > > > switching the WAL file before restarting the server like in the > > > attached test change patch. > > > There are a couple of ways to fix this issue a) one by switching the > > > WAL before the insertion of records so that the CHECKPOINT_SHUTDOWN > > > does not get inserted in a new page as in the attached test_fix.patch > > > b) by using pg_walinspect to check that the next WAL record is > > > CHECKPOINT_SHUTDOWN. I have to try this approach. > > > > > > Thanks to Bharath and Kuroda-san for offline discussions and helping > > > in getting to the root cause. > > > > IIUC, the problem the commit e0b2eed tries to solve is to ensure there > > are no left-over decodable WAL records between confirmed_flush LSN and > > a shutdown checkpoint, which is what it is expected from the > > t/038_save_logical_slots_shutdown.pl. How about we have a PG function > > returning true if there are any decodable WAL records between the > > given start_lsn and end_lsn? > > > > But, we already test this in 003_logical_slot during a successful > upgrade. Having an explicit test to do the same thing has some merits > but not sure if it is worth it. If the code added by commit e0b2eed is covered by the new upgrade test, why not remove 038_save_logical_slots_shutdown.pl altogether? > The current test tries to ensure that > during shutdown after we shutdown walsender and ensures that it sends > all the wal records and receipts an ack for the same, there is no > other WAL except shutdown_checkpoint. Vignesh's suggestion (a) makes > the test robust enough that it shouldn't show spurious failures like > the current one reported by you, so shall we try to proceed with that? Do you mean something like [1]? It ensures the test passes unless any writes are added (in future) before the publisher restarts in the test which can again make the tests flaky. How do we ensure no one adds anything in before the publisher restart 038_save_logical_slots_shutdown.pl? A note before the restart perhaps? I might be okay with a simple solution like [1] with a note before the restart instead of other complicated ones. [1] diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl index 5a4f5dc1d4..493fdbce2f 100644 --- a/src/test/recovery/t/038_save_logical_slots_shutdown.pl +++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl @@ -60,6 +60,14 @@ $node_subscriber->start; $node_publisher->safe_psql('postgres', "CREATE TABLE test_tbl (id int)"); $node_subscriber->safe_psql('postgres', "CREATE TABLE test_tbl (id int)"); +# On some machines, it was detected that the shutdown checkpoint WAL record +# that gets generated as part of the publisher restart below falls exactly in +# the new page in the WAL file. Due to this, the latest checkpoint location and +# confirmed flush check in compare_confirmed_flush() was failing. Hence, we +# advance WAL by 1 segment before generating some data so that the shutdown +# checkpoint doesn't fall exactly in the new WAL file page. 
+$node_publisher->advance_wal(1); + # Insert some data $node_publisher->safe_psql('postgres', "INSERT INTO test_tbl VALUES (generate_series(1, 5));"); -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
plperl and perl 5.38
Perl 5.38 has landed in Debian unstable, and plperl doesn't like it: diff -U3 /home/myon/projects/postgresql/pg/postgresql/src/pl/plperl/expected/plperl_elog_1.out /home/myon/projects/postgresql/pg/postgresql/build/testrun/plperl/regress/results/plperl_elog.out --- /home/myon/projects/postgresql/pg/postgresql/src/pl/plperl/expected/plperl_elog_1.out 2023-07-24 12:47:52.124583553 + +++ /home/myon/projects/postgresql/pg/postgresql/build/testrun/plperl/regress/results/plperl_elog.out 2024-01-12 10:09:51.065265341 + @@ -76,6 +76,7 @@ RETURN 1; END; $$; +WARNING: could not determine encoding for locale "C.utf8": codeset is "ANSI_X3.4-1968" select die_caller(); NOTICE: caught die die_caller diff -U3 /home/myon/projects/postgresql/pg/postgresql/src/pl/plperl/expected/plperl_call.out /home/myon/projects/postgresql/pg/postgresql/build/testrun/plperl/regress/results/plperl_call.out --- /home/myon/projects/postgresql/pg/postgresql/src/pl/plperl/expected/plperl_call.out 2023-10-17 09:40:01.365865484 + +++ /home/myon/projects/postgresql/pg/postgresql/build/testrun/plperl/regress/results/plperl_call.out 2024-01-12 10:09:51.413278511 + @@ -64,6 +64,7 @@ RAISE NOTICE '_a: %, _b: %', _a, _b; END $$; +WARNING: could not determine encoding for locale "C.utf8": codeset is "ANSI_X3.4-1968" NOTICE: a: 10, b: NOTICE: _a: 10, _b: 20 DROP PROCEDURE test_proc1; Same problem in 17devel and 16. (Did not try the older branches yet.) Christoph
RE: speed up a logical replica setup
Dear Euler, Here are comments for your v5 patch. 01. In the document, two words target/standby are used as almost the same meaning. Can you unify them? 02. ``` pg_subscriber option ``` There are some mandatory options like -D/-S/-P. It must be listed in Synopsis chapter. 03. ``` pg_subscriber takes the publisher and subscriber connection strings, a cluster directory from a standby server and a list of database names and it sets up a new logical replica using the physical recovery process. ``` I briefly checked other pages and they do not describe accepted options here. A summary of the application should be mentioned. Based on that, how about: ``` pg_subscriber creates a new subscriber from a physical standby server. This allows users to quickly set up logical replication system. ``` 04. ``` The pg_subscriber should be run at the target server. The source server (known as publisher server) should accept logical replication connections from the target server (known as subscriber server). The target server should accept local logical replication connection. ``` I'm not native speaker, but they are not just recommmendations - they are surely required. So, should we replace s/should/has to/? 05. ``` -S conninfo --subscriber-conninfo=conninfo The connection string to the subscriber. For details see . ``` I became not sure whether it is "The connection string to the subscriber.". The server is still physical standby at that time. 06. ``` * IDENTIFICATION * src/bin/pg_subscriber/pg_subscriber.c ``` The identification is not correct. 07. I felt that there were too many global variables and LogicalRepInfo should be refactored. Because... * Some info related with clusters(e.g., subscriber_dir, conninfo, ...) should be gathered in one struct. * pubconninfo/subsconninfo are stored per db, but it is not needed if we have one base_conninfo. * pubname/subname are not needed because we have fixed naming rule. * pg_ctl_path and pg_resetwal_path can be conbimed into one bindir. * num_dbs should not be alone. ... Based on above, how about using structures like below? ``` typedef struct LogicalRepPerdbInfo { Oid oid; char *dbname; boolmade_replslot; /* replication slot was created */ boolmade_publication; /* publication was created */ boolmade_subscription; /* subscription was created */ } LogicalRepPerdbInfo; typedef struct PrimaryInfo { char *base_conninfo; boolmade_transient_replslot; } PrimaryInfo; typedef struct StandbyInfo { char *base_conninfo; char *bindir; char *pgdata; char *primary_slot_name; } StandbyInfo; typedef struct LogicalRepInfo { int num_dbs; LogicalRepPerdbInfo*perdb; PrimaryInfo*primary; StandbyInfo*standby; } LogicalRepInfo; ``` 08. ``` char *subconninfo;/* subscription connection string for logical * replication */ ``` Not sure how we should notate because the target has not been subscriber yet. 09. ``` enum WaitPMResult { POSTMASTER_READY, POSTMASTER_STANDBY, POSTMASTER_STILL_STARTING, POSTMASTER_FAILED }; ``` This enum has been already defined in pg_ctl.c. Not sure we can use the same name. Can we rename to PGSWaitPMResult. or export pre-existing one? 10. ``` /* Options */ static const char *progname; ``` I think it is not an option. 11. ``` /* * Validate a connection string. Returns a base connection string that is a * connection string without a database name plus a fallback application name. * Since we might process multiple databases, each database name will be * appended to this base connection string to provide a final connection string. 
* If the second argument (dbname) is not null, returns dbname if the provided * connection string contains it. If option --database is not provided, uses * dbname as the only database to setup the logical replica. * It is the caller's responsibility to free the returned connection string and * dbname. */ static char * get_base_conninfo(char *conninfo, char *dbname, const char *noderole) ``` Just FYI - adding fallback_application_name may be too optimistic. Currently the output is used by both pg_subscriber and the subscription connection. 12. Can we add an option not to remove log files even if the operations succeeded? 13. ``` /* * Since the standby server is running, check if it is using an * existing replication slot for WAL retention purposes. This * replication slot has no use after the transformation, hence, it * will
Re: tablecmds.c/MergeAttributes() cleanup
On 2024-Jan-11, Alvaro Herrera wrote: > If you look closely at InsertPgAttributeTuples and accompanying code, it > all looks a bit archaic. They seem to be treating TupleDesc as a > glorified array of Form_pg_attribute elements in a convenient packaging. > It's probably cleaner to change these APIs so that they deal with a > Form_pg_attribute array, and not TupleDesc anymore. But we can hack on > that some other day. In addition, it also occurs to me now that maybe it would make sense to change the TupleDesc implementation to use a List of Form_pg_attribute instead of an array, and do away with ->natts. This would let us change all "for ( ... natts ...)" loops into foreach_ptr() loops ... there are only five hundred of them or so --rolls eyes--. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "El sudor es la mejor cura para un pensamiento enfermo" (Bardia)
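For what it's worth, a hypothetical before/after sketch of such a loop conversion (assuming the APIs around TupleDesc handed back a List of Form_pg_attribute; today it is an array plus natts, so the second function is purely illustrative):

```c
#include "postgres.h"
#include "access/tupdesc.h"
#include "catalog/pg_attribute.h"
#include "nodes/pg_list.h"

/* today: index-based loop over the attribute array */
void
walk_attributes_array(TupleDesc tupdesc)
{
    for (int i = 0; i < tupdesc->natts; i++)
    {
        Form_pg_attribute att = TupleDescAttr(tupdesc, i);

        if (att->attisdropped)
            continue;
        /* ... work on att ... */
    }
}

/* hypothetical: the same loop over a List of Form_pg_attribute */
void
walk_attributes_list(List *attrs)
{
    foreach_ptr(FormData_pg_attribute, att, attrs)
    {
        if (att->attisdropped)
            continue;
        /* ... work on att ... */
    }
}
```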
Re: plpgsql memory leaks
Hi, On Fri, Jan 12, 2024 at 11:02:14AM +0100, Pavel Stehule wrote: > pá 12. 1. 2024 v 10:27 odesílatel Pavel Stehule > napsal: > > > Hi > > > > I have reported very memory expensive pattern: [...] > > takes lot of megabytes of memory too. > > The megabytes leaks are related to JIT. With JIT off the memory consumption > is significantly less although there are some others probably. I cannot readily reproduce this. Which version of Postgres is this and on which platform/distribution? Did you try keep jit on but set jit_inline_above_cost to 0? The back-branches have a fix for the above case, i.e. llvmjit memleaks that can be worked-around by setting jit_inline_above_cost=0. Michael
Re: Make attstattarget nullable
On 2024-Jan-11, Peter Eisentraut wrote: > On 10.01.24 14:16, Alvaro Herrera wrote: > > Seems reasonable. Do we really need a catversion bump for this? > > Yes, this changes the order of the fields in pg_attribute. Ah, right. > > In get_attstattarget() I think we should return 0 for dropped columns > > without reading attstattarget, which is useless anyway, and if it did > > happen to return non-null, it might cause us to do stuff, which would be > > a waste. > > I ended up deciding to get rid of get_attstattarget() altogether and just do > the fetching inline in examine_attribute(). Because the previous API and > what you are discussing here is over-designed, since the only caller doesn't > call it with dropped columns or system columns anyway. This way these > issues are contained in the ANALYZE code, not in a very general place like > lsyscache.c. Sounds good. Maybe instead of having examine_attribute hand a -1 target to the analyze functions, we could just put default_statistics_target there. Analyze functions would never receive negative values, and we could remove that from the analyze functions. Maybe make VacAttrStats->attstattarget unsigned while at it. (This could be a separate patch.) > > It's annoying that the new code in index_concurrently_swap() is more > > verbose than the code being replaced, but it seems OK to me, since it > > allows us to distinguish a null value in attstattarget from actual 0 > > without complicating the get_attstattarget API (which I think we would > > have to do if we wanted to use it here.) > > Yeah, this was annoying. Originally, I had it even more complicated, > because I was trying to check if the actual (non-null) values are the same. > But then I realized the new value is never set at this point. I think what > the code is actually about is clearer now. Yeah, it's neat and the comment is clear enough. > And of course the 0003 patch gets rid of it anyway. I again didn't look at 0002 and 0003 very closely, but from 10,000 feet it looks mostly reasonable -- but I think naming the struct FormData_pg_attribute_extra is not a great idea, as it looks like there would have to be a catalog named pg_attribute_extra -- and I don't think I would make the "non-Data" pointer-to-struct typedef either. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: Make attstattarget nullable
BTW I wanted to but didn't say so explicitly, so here goes: 0001 looks ready to go in. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever" (Oliver Silfridge)
Re: doc: add LITERAL tag to RETURNING
On 2024-Jan-12, Ashutosh Bapat wrote: > On Fri, Jan 12, 2024 at 11:27 AM torikoshia > wrote: > > > > RETURNING is usually tagged with appropriate tags, such as <literal>, > > but not in the 'query' section of COPY. > The patch looks good. Good catch, pushed. It has user-visible effect, so I backpatched it. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "People get annoyed when you try to debug them." (Larry Wall)
Re: Synchronizing slots from primary to standby
On Thu, Jan 11, 2024 at 7:53 PM Amit Kapila wrote: > > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila wrote: > > > > +static bool > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > > { > > ... > > + /* Slot ready for sync, so sync it. */ > > + else > > + { > > + /* > > + * Sanity check: With hot_standby_feedback enabled and > > + * invalidations handled appropriately as above, this should never > > + * happen. > > + */ > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > > + elog(ERROR, > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > + " to remote slot's LSN(%X/%X) as synchronization" > > + " would move it backwards", remote_slot->name, > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > ... > > } > > > > I was thinking about the above code in the patch and as far as I can > > think this can only occur if the same name slot is re-created with > > prior restart_lsn after the existing slot is dropped. Normally, the > > newly created slot (with the same name) will have higher restart_lsn > > but one can mimic it by copying some older slot by using > > pg_copy_logical_replication_slot(). > > > > I don't think as mentioned in comments even if hot_standby_feedback is > > temporarily set to off, the above shouldn't happen. It can only lead > > to invalidated slots on standby. > > > > To close the above race, I could think of the following ways: > > 1. Drop and re-create the slot. > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > ahead of local_slot's LSN then we can update it; but as mentioned in > > your previous comment, we need to update all other fields as well. If > > we follow this then we probably need to have a check for catalog_xmin > > as well. > > > > The second point as mentioned is slightly misleading, so let me try to > rephrase it once again: Emit LOG/WARNING in this case and once > remote_slot's LSN moves ahead of local_slot's LSN then we can update > it; additionally, we need to update all other fields like two_phase as > well. If we follow this then we probably need to have a check for > catalog_xmin as well along remote_slot's restart_lsn. > > > Now, related to this the other case which needs some handling is what > > if the remote_slot's restart_lsn is greater than local_slot's > > restart_lsn but it is a re-created slot with the same name. In that > > case, I think the other properties like 'two_phase', 'plugin' could be > > different. So, is simply copying those sufficient or do we need to do > > something else as well? > > > > Bertrand, Dilip, Sawada-San, and others, please share your opinion on > this problem as I think it is important to handle this race condition. Is there any good use case of copying a failover slot in the first place? If it's not a normal use case and we can probably live without it, why not always disable failover during the copy? FYI we always disable two_phase on copied slots. It seems to me that copying a failover slot could lead to problems, as long as we synchronize slots based on their names. IIUC without the copy, this pass should never happen. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
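To make option (2) from Amit's list above concrete, a rough sketch of what could replace the elog(ERROR) inside synchronize_one_slot() is shown below. The surrounding function and field names are taken from the quoted patch and are assumed here, not actual committed code:

```c
/*
 * Hypothetical replacement for the elog(ERROR): log and skip this slot,
 * retrying on a later sync cycle once the remote slot has moved ahead of
 * the local one.
 */
if (remote_slot->restart_lsn < slot->data.restart_lsn ||
    TransactionIdPrecedes(remote_slot->catalog_xmin,
                          slot->data.catalog_xmin))
{
    ereport(WARNING,
            errmsg("skipping sync of slot \"%s\" for now", remote_slot->name),
            errdetail("The remote slot is behind the local slot; it may have been dropped and re-created with the same name."));
    return false;
}
```

As discussed above, this would also require refreshing the other properties (plugin, two_phase, and so on) once the remote slot catches up.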
Re: Compile warnings in dbcommands.c building with meson
On 2024-Jan-12, jian he wrote: > I saw it sometimes, sometimes not. > Now I think the reason is: > it will appear when you do `-Dbuildtype=release`. > > but it will not occur when I do: > `-Dbuildtype=debug` > > my current meson version is 1.3.1, my ninja version is 1.10.1. Hmm, but why doesn't it happen for other arguments of get_db_info that have pretty much identical code, say src_istemplate? -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "But static content is just dynamic content that isn't moving!" http://smylers.hates-software.com/2007/08/15/fe244d0c.html
Re: Make mesage at end-of-recovery less scary.
Hi, > The errors occurred in a part of the tests for end-of-WAL detection > added in the master branch. These failures were primarily due to > changes in the message contents introduced by this patch. During the > revision, I discovered an issue with the handling of empty pages that > appear in the middle of reading continuation records. In the previous > version, such empty pages were mistakenly identified as indicating a > clean end-of-WAL (that is a LOG). However, they should actually be > handled as a WARNING, since the record curently being read is broken > at the empty pages. The following changes have been made in this > version: > > 1. Adjusting the test to align with the error message changes > introduced by this patch. > > 2. Adding tests for the newly added messages. > > 3. Correcting the handling of empty pages encountered during the > reading of continuation records. (XLogReaderValidatePageHeader) > > 4. Revising code comments. > > 5. Changing the term "log segment" to "WAL > segment". (XLogReaderValidatePageHeader) > > regards. Thanks for the updated patch. ``` +p = (char *) record; +pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + +while (p < pe && *p == 0) +p++; + +if (p == pe) ``` Just as a random thought: perhaps we should make this a separate function, as a part of src/port/. It seems to me that this code could benefit from using vector instructions some day, similarly to memcmp(), memset() etc. Surprisingly there doesn't seem to be a standard C function for this. Alternatively one could argue that one cycle doesn't make much code to reuse and that the C compiler will place SIMD instructions for us. However a counter-counter argument would be that we could use a macro or even better an inline function and have the same effect except getting a slightly more readable code. ``` - * This is just a convenience subroutine to avoid duplicated code in + * This is just a convenience subroutine to avoid duplicate code in ``` This change doesn't seem to be related to the patch. Personally I don't mind it though. All in all I find v28 somewhat scary. It does much more than "making one message less scary" as it was initially intended and what bugged me personally, and accordingly touches many more places including xlogreader.c, xlogrecovery.c, etc. Particularly I have mixed feeling about this: ``` +/* + * Consider it as end-of-WAL if all subsequent bytes of this page + * are zero. We don't bother checking the subsequent pages since + * they are not zeroed in the case of recycled segments. + */ ``` If I understand correctly, if somehow several FS blocks end up being zeroed (due to OS bug, bit rot, restoring from a corrupted for whatever reason backup, hardware failures, ...) there is non-zero chance that PG will interpret this as a normal situation. To my knowledge this is not what we typically do - typically PG would report an error and ask a human to figure out what happened. Of course the possibility of such a scenario is small, but I don't think that as DBMS developers we can ignore it. Does anyone agree or maybe I'm making things up? -- Best regards, Aleksander Alekseev
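For the record, a self-contained sketch of the kind of helper suggested above (the name is made up here; as far as I know nothing like it exists in src/port today):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if the given buffer contains only zero bytes.  A plain byte
 * loop like this is usually auto-vectorized by the compiler; an explicit
 * SIMD version could be added later without changing callers.
 */
static inline bool
pg_mem_is_all_zeros(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end = p + len;

    while (p < end)
    {
        if (*p != 0)
            return false;
        p++;
    }
    return true;
}
```

The xlogreader check quoted above would then reduce to a single call with the remainder of the page as the length.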
Re: Improve the connection failure error messages
Thanks for the review. Attached v2 patch with suggested changes. Please find my response inline. On Fri, Jan 12, 2024 at 8:20 AM Peter Smith wrote: > > Thanks for the patch! Here are a couple of review comments for it. > > == > src/backend/commands/subscriptioncmds.c > > 1. > @@ -742,7 +742,7 @@ CreateSubscription(ParseState *pstate, > CreateSubscriptionStmt *stmt, > if (!wrconn) > ereport(ERROR, > (errcode(ERRCODE_CONNECTION_FAILURE), > - errmsg("could not connect to the publisher: %s", err))); > + errmsg("\"%s\" could not connect to the publisher: %s", stmt->subname, > err))); > > In practice, these commands give errors like: > > test_sub=# create subscription sub1 connection 'dbname=bogus' publication > pub1; > ERROR: could not connect to the publisher: connection to server on > socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "bogus" does not > exist > > and logs like: > > 2024-01-12 12:45:05.177 AEDT [13108] ERROR: could not connect to the > publisher: connection to server on socket "/tmp/.s.PGSQL.5432" failed: > FATAL: database "bogus" does not exist > 2024-01-12 12:45:05.177 AEDT [13108] STATEMENT: create subscription > sub1 connection 'dbname=bogus' publication pub1; > > Since the subscription name is already obvious from the statement that > caused the error I'm not sure it benefits much to add this to the > error message (but maybe it is useful if the error message was somehow > read in isolation from the statement). > > Anyway, I felt at least it should include the word "subscription" for > better consistency with the other messages in this patch: > > SUGGESTION > subscription \"%s\" could not connect to the publisher: %s Done. > == > > 2. > + appname = cluster_name[0] ? cluster_name : "walreceiver"; > + > /* Establish the connection to the primary for XLOG streaming */ > - wrconn = walrcv_connect(conninfo, false, false, > - cluster_name[0] ? cluster_name : "walreceiver", > - &err); > + wrconn = walrcv_connect(conninfo, false, false, appname, &err); > if (!wrconn) > ereport(ERROR, > (errcode(ERRCODE_CONNECTION_FAILURE), > - errmsg("could not connect to the primary server: %s", err))); > + errmsg("%s could not connect to the primary server: %s", appname, err))); > > I think your new %s should be quoted according to the guidelines at [1]. Done. > == > src/test/regress/expected/subscription.out > > 3. > Apparently, there is no existing regression test case for the ALTER > "could not connect" message because if there was, it would have > failed. Maybe a test should be added? > The ALTER SUBSCRIPTION command does not error out on the user interface if updated with a bad connection string and the connection failure error can only be seen in the respective log file. Due to this behavior, it is not possible to add a test to show the error message as it is done for CREATE SUBSCRIPTION. Let me know if you think there is another way to add this test. -- Thanks, Nisha v2-0001-Improve-the-connection-failure-error-messages.patch Description: Binary data
Re: Synchronizing slots from primary to standby
On Fri, Jan 12, 2024 at 5:30 PM Masahiko Sawada wrote: > > On Thu, Jan 11, 2024 at 7:53 PM Amit Kapila wrote: > > > > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila wrote: > > > > > > +static bool > > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > > > { > > > ... > > > + /* Slot ready for sync, so sync it. */ > > > + else > > > + { > > > + /* > > > + * Sanity check: With hot_standby_feedback enabled and > > > + * invalidations handled appropriately as above, this should never > > > + * happen. > > > + */ > > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > > > + elog(ERROR, > > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > > + " to remote slot's LSN(%X/%X) as synchronization" > > > + " would move it backwards", remote_slot->name, > > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > > ... > > > } > > > > > > I was thinking about the above code in the patch and as far as I can > > > think this can only occur if the same name slot is re-created with > > > prior restart_lsn after the existing slot is dropped. Normally, the > > > newly created slot (with the same name) will have higher restart_lsn > > > but one can mimic it by copying some older slot by using > > > pg_copy_logical_replication_slot(). > > > > > > I don't think as mentioned in comments even if hot_standby_feedback is > > > temporarily set to off, the above shouldn't happen. It can only lead > > > to invalidated slots on standby. > > > > > > To close the above race, I could think of the following ways: > > > 1. Drop and re-create the slot. > > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > > ahead of local_slot's LSN then we can update it; but as mentioned in > > > your previous comment, we need to update all other fields as well. If > > > we follow this then we probably need to have a check for catalog_xmin > > > as well. > > > > > > > The second point as mentioned is slightly misleading, so let me try to > > rephrase it once again: Emit LOG/WARNING in this case and once > > remote_slot's LSN moves ahead of local_slot's LSN then we can update > > it; additionally, we need to update all other fields like two_phase as > > well. If we follow this then we probably need to have a check for > > catalog_xmin as well along remote_slot's restart_lsn. > > > > > Now, related to this the other case which needs some handling is what > > > if the remote_slot's restart_lsn is greater than local_slot's > > > restart_lsn but it is a re-created slot with the same name. In that > > > case, I think the other properties like 'two_phase', 'plugin' could be > > > different. So, is simply copying those sufficient or do we need to do > > > something else as well? > > > > > > > Bertrand, Dilip, Sawada-San, and others, please share your opinion on > > this problem as I think it is important to handle this race condition. > > Is there any good use case of copying a failover slot in the first > place? If it's not a normal use case and we can probably live without > it, why not always disable failover during the copy? FYI we always > disable two_phase on copied slots. It seems to me that copying a > failover slot could lead to problems, as long as we synchronize slots > based on their names. IIUC without the copy, this pass should never > happen. > > Regards, > > -- > Masahiko Sawada > Amazon Web Services: https://aws.amazon.com There are multiple approaches discussed and tried when it comes to starting a slot-sync worker. 
I am summarizing all here: 1) Make slotsync worker as an Auxiliary Process (like checkpointer, walwriter, walreceiver etc). The benefit this approach provides is, it can control begin and stop in a more flexible way as each auxiliary process could have different checks before starting and can have different stop conditions. But it needs code duplication for process management(start, stop, crash handling, signals etc) and currently it does not support db-connection smoothly (none of the auxiliary process has one so far) We attempted to make slot-sync worker as an auxiliary process and faced some challenges. The slot sync worker needs db-connection and thus needs InitPostgres(). But AuxiliaryProcessMain() and InitPostgres() are not compatible as both invoke common functions and end up setting many callbacks functions twice (with different args). Also InitPostgres() does 'MyBackendId' initialization (which further triggers some stuff) which is not needed for AuxiliaryProcess and so on. And thus in order to make slot-sync worker as an auxiliary process, we need something similar to InitPostgres (trimmed down working version) which needs further detailed analysis. 2) Make slotsync worker as a 'special' process like AutoVacLauncher which is neither an Auxiliary process nor a bgworker one. It allows db-connection and also provides flexibility to have start and stop conditions for a process. But it needs a lot of code-duplication ar
Re: plpgsql memory leaks
pá 12. 1. 2024 v 11:54 odesílatel Michael Banck napsal: > Hi, > > On Fri, Jan 12, 2024 at 11:02:14AM +0100, Pavel Stehule wrote: > > pá 12. 1. 2024 v 10:27 odesílatel Pavel Stehule > > > napsal: > > > > > Hi > > > > > > I have reported very memory expensive pattern: > > [...] > > > > takes lot of megabytes of memory too. > > > > The megabytes leaks are related to JIT. With JIT off the memory > consumption > > is significantly less although there are some others probably. > > I cannot readily reproduce this. > > Which version of Postgres is this and on which platform/distribution? > It was tested on master branch (pg 17) on Fedora 39 > > Did you try keep jit on but set jit_inline_above_cost to 0? > > The back-branches have a fix for the above case, i.e. llvmjit memleaks > that can be worked-around by setting jit_inline_above_cost=0. > I'll do recheck Pavel > > > Michael >
Re: Show WAL write and fsync stats in pg_stat_io
Hi, On Thu, 11 Jan 2024 at 17:28, Melanie Plageman wrote: > > On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz wrote: > > > > On Thu, 11 Jan 2024 at 08:01, Michael Paquier wrote: > > > > > > On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote: > > > > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz > > > > wrote: > > > >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier > > > >> wrote: > > > >> Oh, I understand it now. Yes, that makes sense. > > > >> I thought removing op_bytes completely ( as you said "This patch > > > >> extends it with two more operation sizes, and there are even cases > > > >> where it may be a variable" ) from pg_stat_io view then adding > > > >> something like {read | write | extend}_bytes and {read | write | > > > >> extend}_calls could be better, so that we don't lose any information. > > > > > > > > Upthread, Michael says: > > > > > > > >> I find the use of 1 in this context a bit confusing, because when > > > >> referring to a counter at N, then it can be understood as doing N > > > >> times a operation, > > > > > > > > I didn't understand this argument, so I'm not sure if I agree or > > > > disagree with it. > > > > > > Nazir has mentioned upthread one thing: what should we do for the case > > > where a combination of (io_object,io_context) does I/O with a > > > *variable* op_bytes, because that may be the case for the WAL > > > receiver? For this case, he has mentioned that we should set op_bytes > > > to 1, but that's something I find confusing because it would mean that > > > we are doing read, writes or extends 1 byte at a time. My suggestion > > > would be to use op_bytes = -1 or NULL for the variable case instead, > > > with reads, writes and extends referring to a number of bytes rather > > > than a number of operations. > > > > I agree but we can't do this only for the *variable* cases since > > B_WAL_RECEIVER and other backends use the same > > pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean > > is, if two backends use the same pgstat_count_io_op_time() function > > call in the code; they must count the same thing (number of calls, > > bytes, etc.). It could be better to count the number of bytes for all > > the IOOBJECT_WAL IOs. > > I'm a bit confused by this. pgstat_count_io_op_time() can check > MyBackendType. In fact, you do this to ban the wal receiver already. > It is true that you would need to count all wal receiver normal > context wal object IOOps in the variable way, but I don't see how > pgstat_count_io_op_time() is the limiting factor as long as the > callsite is always doing either the number of bytes or the number of > calls. Apologies for not being clear. Let me try to explain this by giving examples: Let's assume that there are 3 different pgstat_count_io_op_time() calls in the code base and they are labeled as 1, 2 and 3. And let's' assume that: WAL receiver uses 1st and 2nd pgstat_count_io_op_time(), autovacuum uses 2nd and 3rd pgstat_count_io_op_time() and checkpointer uses 3rd pgstat_count_io_op_time() to count IOs. The 1st one is the only pgstat_count_io_op_time() call that must count the number of bytes because of the variable cases and the others count the number of calls or blocks. a) WAL receiver uses both 1st and 2nd => 1st and 2nd pgstat_count_io_op_time() must count the same thing => 2nd pgstat_count_io_op_time() must count the number of bytes as well. 
b) 2nd pgstat_count_io_op_time() started to count the number of bytes => Autovacuum will start to count the number of bytes => 2nd and 3rd both are used by autocavuum => 3rd pgstat_count_io_op_time() must count the number of bytes as well. c) 3rd pgstat_count_io_op_time() started to count the number of bytes => Checkpointer will start to count the number of bytes. The list goes on like this and if we don't have [write | read | extend]_bytes, this effect will be multiplied. > > > > I think these are the three proposals for handling WAL reads: > > > > > > > > 1) setting op_bytes to 1 and the number of reads is the number of bytes > > > > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the > > > > number of calls to pg_pread() or similar > > > > 3) setting op_bytes to NULL and the number of reads is the number of > > > > calls to pg_pread() or similar > > > > > > 3) could be a number of bytes, actually. > > > > One important point is that we can't change only reads, if we decide > > to count the number of bytes for the reads; writes and extends should > > be counted as a number of bytes as well. > > Yes, that is true. > > > > > Looking at the patch, I think it is still doing 2. > > > > > > The patch disables stats for the WAL receiver, while the startup > > > process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to > > > discard the variable case. > > > > > > > For an unpopular idea: we could add separate [IOOp]_bytes columns for > > > > all those IOOps for which it would be relevant. It kind of stinks but > > > > it wo
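As a self-contained illustration of the "count both" idea (the types and names below are invented for the example and are not the real pgstat structures), tracking calls and bytes side by side would let fixed-size block I/O and variable-size WAL receiver I/O share one accounting path:

```c
#include <stdint.h>

typedef enum SketchIOOp
{
    SK_IO_READ,
    SK_IO_WRITE,
    SK_IO_EXTEND,
    SK_IO_NUM_OPS
} SketchIOOp;

typedef struct SketchIOCounters
{
    uint64_t    calls[SK_IO_NUM_OPS];   /* number of read/write/extend calls */
    uint64_t    bytes[SK_IO_NUM_OPS];   /* bytes moved by those calls */
} SketchIOCounters;

/*
 * One code path for every caller: block-based callers pass nbytes = BLCKSZ
 * (or XLOG_BLCKSZ), while the WAL receiver passes whatever it actually
 * read or wrote.
 */
static inline void
sketch_count_io(SketchIOCounters *c, SketchIOOp op, uint64_t nbytes)
{
    c->calls[op] += 1;
    c->bytes[op] += nbytes;
}
```

A view on top of this could still derive op_bytes for the fixed-size cases (bytes / calls) while showing NULL for the variable ones.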
Re: Compile warnings in dbcommands.c building with meson
On Fri, Jan 12, 2024 at 8:03 PM Alvaro Herrera wrote: > > On 2024-Jan-12, jian he wrote: > > > I saw it sometimes, sometimes not. > > Now I think the reason is: > > it will appear when you do `-Dbuildtype=release`. > > > > but it will not occur when I do: > > `-Dbuildtype=debug` > > > > my current meson version is 1.3.1, my ninja version is 1.10.1. > > Hmm, but why doesn't it happen for other arguments of get_db_info that > have pretty much identical code, say src_istemplate? > git at commit 6780b79d5c580586ae6feb37b9c8b8bf33367886 (HEAD -> master, origin/master, origin/HEAD) the minimum setup that will generate the warning: meson setup --reconfigure ${BUILD} \ -Dprefix=${PG_PREFIX} \ -Dpgport=5462 \ -Dbuildtype=release gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Copyright (C) 2021 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Re: Improve the connection failure error messages
Hi, Thanks for the patch. > Due to this behavior, it is not possible to add a test to show the > error message as it is done for CREATE SUBSCRIPTION. > Let me know if you think there is another way to add this test. I believe it can be done with TAP tests, see for instance: contrib/auto_explain/t/001_auto_explain.pl However I wouldn't insist on including the test in scope of this particular patch. Such a test doesn't currently exist, it can be added as a separate patch, and whether this is actually a useful test (all the tests consume time after all...) is somewhat debatable. Personally I agree that it would be nice to have though. This being said... > The ALTER SUBSCRIPTION command does not error out on the user > interface if updated with a bad connection string and the connection > failure error can only be seen in the respective log file. I wonder if we should fix this. Again, not necessarily in scope of this work, but if this is not a difficult task, again, it would be nice to have. Other than that v2 looks good. -- Best regards, Aleksander Alekseev
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hi, On Fri, Jan 12, 2024 at 02:00:01PM +0300, Alexander Lakhin wrote: > Hi, > > 12.01.2024 10:15, Bertrand Drouvot wrote: > > > > For this one, the "good" news is that it looks like that we don’t see the > > "terminating" message not followed by an "obsolete" message (so the engine > > behaves correctly) anymore. > > > > There is simply nothing related to the row_removal_activeslot at all (the > > catalog_xmin > > advanced and there is no conflict). > > Yes, judging from all the failures that we see now, it looks like the > 0001-Fix-race-condition...patch works as expected. Yeah, thanks for confirming, I'll create a dedicated hackers thread for that one. > > Let's give v5 a try? (please apply > > v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch > > too). > > Unfortunately, I've got the failure again (please see logs attached). > (_primary.log can confirm that I have used exactly v5 — I see no > txid_current() calls there...) Okay ;-( Thanks for the testing. Then I can think of: 1) Michael's proposal up-thread (means tweak the test with a retry logic, retrying things if such a standby snapshot is found). 2) Don't report a test error for active slots in case its catalog_xmin advanced. I'd vote for 2) as: - this is a corner case and the vast majority of the animals don't report any issues (means the active slot conflict detection is already well covered). - even on the same animal it should be pretty rare to not have an active slot conflict detection not covered at all (and the "failing" one would be probably moving over time). - It may be possible that 1) ends up failing (as we'd need to put a limit on the retry logic anyhow). What do you think? And BTW, looking closely at wait_until_vacuum_can_remove(), I'm not sure it's fully correct, so I'll give it another look. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: plpgsql memory leaks
Hi, On Fri, Jan 12, 2024 at 01:35:24PM +0100, Pavel Stehule wrote: > pá 12. 1. 2024 v 11:54 odesílatel Michael Banck napsal: > > Which version of Postgres is this and on which platform/distribution? > > It was tested on master branch (pg 17) on Fedora 39 > > > Did you try keep jit on but set jit_inline_above_cost to 0? > > > > The back-branches have a fix for the above case, i.e. llvmjit memleaks > > that can be worked-around by setting jit_inline_above_cost=0. I got that wrong, it needs to be -1 to disable it. But if you are already running the master branch, it is probably a separate issue. Michael
Re: reorganize "Shared Memory and LWLocks" section of docs
Hi, > I recently began trying to write documentation for the dynamic shared > memory registry feature [0], and I noticed that the "Shared Memory and > LWLocks" section of the documentation might need some improvement. I know that feeling. > Thoughts? """ Any registered shmem_startup_hook will be executed shortly after each backend attaches to shared memory. """ IMO the word "each" here can give the wrong impression as if there are certain guarantees about synchronization between backends. Maybe we should change this to simply "... will be executed shortly after [the?] backend attaches..." """ should ensure that only one process allocates a new tranche_id (LWLockNewTrancheId) and initializes each new LWLock (LWLockInitialize). """ Personally I think that reminding the corresponding function name here is redundant and complicates reading just a bit. But maybe it's just me. Except for these nitpicks the patch looks good. -- Best regards, Aleksander Alekseev
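Since the documentation section under discussion is exactly about this pattern, here is a compact example of the flow being described. It is a sketch of a hypothetical extension (names like "my_extension" are invented); a real module would adapt the names and add error handling:

```c
#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

typedef struct MySharedState
{
    LWLock      lock;
    int         tranche_id;
    int64       counter;
} MySharedState;

static MySharedState *my_state = NULL;
static shmem_request_hook_type prev_shmem_request_hook = NULL;
static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

static void
my_shmem_request(void)
{
    if (prev_shmem_request_hook)
        prev_shmem_request_hook();
    RequestAddinShmemSpace(sizeof(MySharedState));
}

static void
my_shmem_startup(void)
{
    bool        found;

    if (prev_shmem_startup_hook)
        prev_shmem_startup_hook();

    LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
    my_state = ShmemInitStruct("my_extension", sizeof(MySharedState), &found);
    if (!found)
    {
        /* only the first process to get here allocates and initializes */
        my_state->tranche_id = LWLockNewTrancheId();
        LWLockInitialize(&my_state->lock, my_state->tranche_id);
        my_state->counter = 0;
    }
    LWLockRelease(AddinShmemInitLock);

    /* every backend registers a display name for the tranche */
    LWLockRegisterTranche(my_state->tranche_id, "my_extension");
}

void
_PG_init(void)
{
    prev_shmem_request_hook = shmem_request_hook;
    shmem_request_hook = my_shmem_request;
    prev_shmem_startup_hook = shmem_startup_hook;
    shmem_startup_hook = my_shmem_startup;
}
```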
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Thu, Jan 11, 2024 at 9:05 PM Melanie Plageman wrote: > Yes, this is a good point. Seems like writing maintainable code is > really about never lying and then figuring out when hiding the truth > is also lying. Darn! I think that's pretty true. In this particular case, this code has a fair number of preexisting problems of this type, IMHO. It's been touched by many hands, each person with their own design ideas, and the result isn't as coherent as if it were written by a single mind from scratch. But, because this code is also absolutely critical to the system not eating user data, any changes have to be approached with the highest level of caution. I think it's good to be really careful about this sort of refactoring anywhere, because finding out a year later that you broke something and having to go back and fix it is no fun even if the consequences are minor ... here, they might not be. > Ah! I think you are right. Good catch. I could fix this with logic like this: > > bool space_freed = vacrel->tuples_deleted > tuples_already_deleted; > if ((vacrel->nindexes == 0 && space_freed) || > (vacrel->nindexes > 0 && (space_freed || !vacrel->do_index_vacuuming))) Perhaps that would be better written as space_freed || (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming), at which point you might not need to introduce the variable. > I think I made this mistake when working on a different version of > this that combined the prune and no prune cases. I noticed that > lazy_scan_noprune() updates the FSM whenever there are no indexes. I > wonder why this is (and why we don't do it in the prune case). Yeah, this all seems distinctly under-commented. As noted above, this code has grown organically, and some things just never got written down. Looking through the git history, I see that this behavior seems to date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced lazy_scan_noprune(). The comments don't explain the reasoning, but my guess is that it was just an accident. It's not entirely evident to me whether there might ever be good reasons to update the freespace map for a page where we haven't freed up any space -- after all, the free space map isn't crash-safe, so you could always postulate that updating it will correct an existing inaccuracy. But I really doubt that there's any good reason for lazy_scan_prune() and lazy_scan_noprune() to have different ideas about whether to update the FSM or not, especially in an obscure corner case like this. -- Robert Haas EDB: http://www.enterprisedb.com
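Spelling that suggestion out, the combined condition would look roughly like the fragment below. It keeps Melanie's space_freed variable; the surrounding lazy_scan_heap() loop context (blkno, freespace) is assumed rather than shown:

```c
bool        space_freed = vacrel->tuples_deleted > tuples_already_deleted;

/*
 * Record the page's free space now if pruning freed space, or if there
 * will be no second heap pass to do it later (indexes exist but index
 * vacuuming is disabled).
 */
if (space_freed ||
    (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming))
    RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
```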
Re: cleanup patches for incremental backup
On Thu, Jan 11, 2024 at 8:00 PM Shinoda, Noriyoshi (HPE Services Japan - FSIP) wrote: > Thank you for developing the new tool. I have attached a patch that corrects > the spelling of the --individual option in the SGML file. Thanks, committed. -- Robert Haas EDB: http://www.enterprisedb.com
[PATCH] New predefined role pg_manage_extensions
Hi, I propose to add a new predefined role to Postgres, pg_manage_extensions. The idea is that it allows Superusers to delegate the rights to create, update or delete extensions to other roles, even if those extensions are not trusted or those users are not the database owner. I have attached a WIP patch for this. Thoughts? Michael >From 59497e825184f0de30a18573ffd7d331be3b233d Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Fri, 12 Jan 2024 13:56:59 +0100 Subject: [PATCH] Add new pg_manage_extensions predefined role. This allows any role that is granted this new predefined role to CREATE, UPDATE or DROP extensions, no matter whether they are trusted or not. --- doc/src/sgml/user-manag.sgml | 5 + src/backend/commands/extension.c | 11 ++- src/include/catalog/pg_authid.dat | 5 + 3 files changed, 16 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/user-manag.sgml b/doc/src/sgml/user-manag.sgml index 92a299d2d3..ebb82801ec 100644 --- a/doc/src/sgml/user-manag.sgml +++ b/doc/src/sgml/user-manag.sgml @@ -693,6 +693,11 @@ DROP ROLE doomed_role; database to issue CREATE SUBSCRIPTION. + + pg_manage_extensions + Allow creating, removing or updating extensions, even if the + extensions are untrusted or the user is not the database owner. + diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index 226f85d0e3..71481d9a73 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -882,13 +882,14 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control, ListCell *lc2; /* - * Enforce superuser-ness if appropriate. We postpone these checks until - * here so that the control flags are correctly associated with the right + * Enforce superuser-ness/membership of the pg_manage_extensions + * predefined role if appropriate. We postpone these checks until here + * so that the control flags are correctly associated with the right * script(s) if they happen to be set in secondary control files. */ if (control->superuser && !superuser()) { - if (extension_is_trusted(control)) + if (extension_is_trusted(control) || has_privs_of_role(GetUserId(), ROLE_PG_MANAGE_EXTENSIONS)) switch_to_superuser = true; else if (from_version == NULL) ereport(ERROR, @@ -897,7 +898,7 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control, control->name), control->trusted ? errhint("Must have CREATE privilege on current database to create this extension.") - : errhint("Must be superuser to create this extension."))); + : errhint("Only roles with privileges of the \"%s\" role are allowed to create this extension.", "pg_manage_extensions"))); else ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), @@ -905,7 +906,7 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control, control->name), control->trusted ? 
errhint("Must have CREATE privilege on current database to update this extension.") - : errhint("Must be superuser to update this extension."))); + : errhint("Only roles with privileges of the \"%s\" role are allowed to update this extension.", "pg_manage_extensions"))); } filename = get_extension_script_filename(control, from_version, version); diff --git a/src/include/catalog/pg_authid.dat b/src/include/catalog/pg_authid.dat index 82a2ec2862..ac70603d26 100644 --- a/src/include/catalog/pg_authid.dat +++ b/src/include/catalog/pg_authid.dat @@ -94,5 +94,10 @@ rolcreaterole => 'f', rolcreatedb => 'f', rolcanlogin => 'f', rolreplication => 'f', rolbypassrls => 'f', rolconnlimit => '-1', rolpassword => '_null_', rolvaliduntil => '_null_' }, +{ oid => '8801', oid_symbol => 'ROLE_PG_MANAGE_EXTENSIONS', + rolname => 'pg_manage_extensions', rolsuper => 'f', rolinherit => 't', + rolcreaterole => 'f', rolcreatedb => 'f', rolcanlogin => 'f', + rolreplication => 'f', rolbypassrls => 'f', rolconnlimit => '-1', + rolpassword => '_null_', rolvaliduntil => '_null_' }, ] -- 2.39.2
Re: tablecmds.c/MergeAttributes() cleanup
On Fri, Jan 12, 2024 at 5:32 AM Alvaro Herrera wrote: > On 2024-Jan-11, Alvaro Herrera wrote: > > If you look closely at InsertPgAttributeTuples and accompanying code, it > > all looks a bit archaic. They seem to be treating TupleDesc as a > > glorified array of Form_pg_attribute elements in a convenient packaging. > > It's probably cleaner to change these APIs so that they deal with a > > Form_pg_attribute array, and not TupleDesc anymore. But we can hack on > > that some other day. > > In addition, it also occurs to me now that maybe it would make sense to > change the TupleDesc implementation to use a List of Form_pg_attribute > instead of an array, and do away with ->natts. This would let us change > all "for ( ... natts ...)" loops into foreach_ptr() loops ... there are > only five hundred of them or so --rolls eyes--. Open-coding stuff like this can easily work out to a loss, and I personally think we're overly dependent on List. It's not a particularly good abstraction, IMHO, and if we do a lot of work to start using it everywhere because a list is really an array, then what happens when somebody decides that a list really ought to be a skip-list, or a hash table, or some other crazy thing? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Stack overflow issue
On 11/01/2024 19:37, Robert Haas wrote: On Wed, Jan 10, 2024 at 4:25 PM Heikki Linnakangas wrote: The problem with CommitTransactionCommand (or rather AbortCurrentTransaction() which has the same problem) and ShowTransactionStateRec is that they get called in a state where aborting can lead to a panic. If you add a "check_stack_depth()" to them and try to reproducer scripts that Egor posted, you still get a panic. Hmm, that's unfortunate. I'm not sure what to do about that. But I'd still suggest looking for a goto-free approach. Here's one goto-free attempt. It adds a local loop to where the recursion was, so that if you have a chain of subtransactions that need to be aborted in CommitTransactionCommand, they are aborted iteratively. The TBLOCK_SUBCOMMIT case already had such a loop. I added a couple of comments in the patch marked with "REVIEWER NOTE", to explain why I changed some things. They are to be removed before committing. I'm not sure if this is better than a goto. In fact, even if we commit this, I think I'd still prefer to replace the remaining recursive calls with a goto. Recursion feels a weird to me, when we're unwinding the states from the stack as we go. Of course we could use a "for (;;) { ... continue }" construct around the whole function, instead of a goto, but I don't think that's better than a goto in this case. -- Heikki Linnakangas Neon (https://neon.tech) From 44568e6feef79d604bbe168c0839ab2e03c99f8a Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 12 Jan 2024 17:07:25 +0200 Subject: [PATCH v2 1/1] Turn tail recursion into iteration in AbortCurrentTransaction() Usually the compiler will optimize away the tail recursion anyway, but if it doesn't, you can drive the function into stack overflow. For example: (n=100; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "ERROR; COMMIT;") | psql >/dev/null The usual fix would be to add a check_stack_depth() to the function, but that's not good in AbortCurrentTransaction(), as you are already trying to clean up after a failure. ereporting there leads to a panic. Fix CurrentTransactionCommand() in a similar fashion, although ereporting would be less bad there. Report by Egor Chindyaskin and Alexander Lakhin. Discussion: https://www.postgresql.org/message-id/1672760457.940462079%40f306.i.mail.ru --- src/backend/access/transam/xact.c | 141 -- 1 file changed, 75 insertions(+), 66 deletions(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 464858117e0..329a139c991 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -3194,47 +3194,62 @@ CommitTransactionCommand(void) CommitSubTransaction(); s = CurrentTransactionState; /* changed by pop */ } while (s->blockState == TBLOCK_SUBCOMMIT); + /* + * REVIEWER NOTE: We used to have duplicated code here, to handle + * the TBLOCK_END and TBLOCK_PREPARE cases exactly the same way + * that we do above. I replaced that with a recursive call to + * CommitTransactionCommand(), to make this consistent with the + * TBLOCK_SUBABORT_END and TBLOCK_SUBABORT_PENDING handling. This + * can only recurse once, so there's no risk of stack overflow. 
+ */ /* If we had a COMMIT command, finish off the main xact too */ - if (s->blockState == TBLOCK_END) - { -Assert(s->parent == NULL); -CommitTransaction(); -s->blockState = TBLOCK_DEFAULT; -if (s->chain) -{ - StartTransaction(); - s->blockState = TBLOCK_INPROGRESS; - s->chain = false; - RestoreTransactionCharacteristics(&savetc); -} - } - else if (s->blockState == TBLOCK_PREPARE) - { -Assert(s->parent == NULL); -PrepareTransaction(); -s->blockState = TBLOCK_DEFAULT; - } - else + if (s->blockState != TBLOCK_END && s->blockState != TBLOCK_PREPARE) elog(ERROR, "CommitTransactionCommand: unexpected state %s", BlockStateAsString(s->blockState)); + CommitTransactionCommand(); break; /* - * The current already-failed subtransaction is ending due to a - * ROLLBACK or ROLLBACK TO command, so pop it and recursively - * examine the parent (which could be in any of several states). + * The current subtransaction is ending due to a ROLLBACK or + * ROLLBACK TO command. Pop it, as well as any of it parents that + * are also being rolled back. */ case TBLOCK_SUBABORT_END: - CleanupSubTransaction(); - CommitTransactionCommand(); - break; + case TBLOCK_SUBABORT_PENDING: + do { +/* + * The difference between SUBABORT_END and SUBABORT_PENDING is + * whether the subtransaction has already been aborted and we + * just need to clean it up, or if we need to also abort it + * first. + */ +if (s->blockState == TBLOCK_SUBABORT_PENDING) + AbortSubTransaction(); +CleanupSubTransaction(); +s = CurrentTransa
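[For readers skimming the patch above: the transformation it performs is the classic rewrite of tail recursion into a loop so that an arbitrarily long chain of states cannot overflow the stack. A self-contained sketch of the general shape follows; "State" and do_cleanup() are invented names for illustration, not xact.c code.]

    typedef struct State
    {
        struct State *parent;
    } State;

    static void
    do_cleanup(State *s)
    {
        /* release whatever resources belong to s */
    }

    /*
     * Tail-recursive form: a long parent chain can overflow the stack if
     * the compiler does not optimize the tail call away.
     */
    static void
    cleanup_recursive(State *s)
    {
        do_cleanup(s);
        if (s->parent != NULL)
            cleanup_recursive(s->parent);
    }

    /* Iterative form: the same work, but with constant stack depth. */
    static void
    cleanup_iterative(State *s)
    {
        while (s != NULL)
        {
            do_cleanup(s);
            s = s->parent;
        }
    }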
Re: [PATCH] New predefined role pg_manage_extensions
On Fri, 12 Jan 2024 at 15:53, Michael Banck wrote: > I propose to add a new predefined role to Postgres, > pg_manage_extensions. The idea is that it allows Superusers to delegate > the rights to create, update or delete extensions to other roles, even > if those extensions are not trusted or those users are not the database > owner. I agree that extension creation is one of the main reasons people require superuser access, and I think it would be beneficial to try to reduce that. But I'm not sure that such a pg_manage_extensions role would have any fewer permissions than superuser in practice. AFAIK, many extensions that are not marked as trusted are untrusted precisely because they would allow fairly trivial privilege escalation to superuser if they were.
Re: Make all Perl warnings fatal
On 11.01.24 12:29, Bharath Rupireddy wrote: On Sat, Dec 30, 2023 at 12:57 AM Peter Eisentraut wrote: committed With the commit c5385929 converting perl warnings to FATAL, use of psql/safe_psql with timeout parameters [1] fail with the following error: Use of uninitialized value $ret in bitwise and (&) at /home/ubuntu/postgres/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2015. I think what is actually relevant is the timed_out parameter, otherwise the psql/safe_psql function ends up calling "die" and you don't get any further. Perhaps assigning a default error code to $ret instead of undef in PostgreSQL::Test::Cluster - psql() function is the solution. I would put this code my $core = $ret & 128 ? " (core dumped)" : ""; die "psql exited with signal " . ($ret & 127) . "$core: '$$stderr' while running '@psql_params'" if $ret & 127; $ret = $ret >> 8; inside a if (defined $ret) block. Then the behavior would be that the whole function returns undef on timeout, which is usefully different from returning 0 (and matches previous behavior).
Re: tablecmds.c/MergeAttributes() cleanup
Robert Haas writes: > On Fri, Jan 12, 2024 at 5:32 AM Alvaro Herrera > wrote: >> In addition, it also occurs to me now that maybe it would make sense to >> change the TupleDesc implementation to use a List of Form_pg_attribute >> instead of an array, and do away with ->natts. This would let us change >> all "for ( ... natts ...)" loops into foreach_ptr() loops ... there are >> only five hundred of them or so --rolls eyes--. > Open-coding stuff like this can easily work out to a loss, and I > personally think we're overly dependent on List. It's not a > particularly good abstraction, IMHO, and if we do a lot of work to > start using it everywhere because a list is really an array, then what > happens when somebody decides that a list really ought to be a > skip-list, or a hash table, or some other crazy thing? Without getting into opinions on whether List is a good abstraction, I'm -1 on this idea. It would be a large amount of code churn, with attendant back-patching pain, and I just don't see that there is much to be gained. regards, tom lane
Re: the s_lock_stuck on perform_spin_delay
On Wed, Jan 10, 2024 at 10:17 PM Andy Fan wrote: > fixed in v2. Timing the spinlock wait seems like a separate patch from the new sanity checks. I suspect that the new sanity checks should only be armed in assert-enabled builds. I'm doubtful that this method of tracking the wait time will be accurate. And I don't know that we can make it accurate with reasonable overhead. But I don't think we can assume that the time we tried to wait for and the time that we were actually descheduled are the same. -- Robert Haas EDB: http://www.enterprisedb.com
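[To illustrate the measurement problem Robert describes, here is a minimal sketch using the existing instr_time.h macros; it is an editor's illustration, not code from the patch. It shows the gap between the delay that was requested and the wall-clock time that actually elapsed, which is exactly the gap that makes simply summing the requested delays inaccurate, while timing every sleep like this adds its own overhead.]

    #include "portability/instr_time.h"

    /*
     * The requested delay and the elapsed time routinely differ, because
     * the scheduler may keep the process off the CPU for longer than the
     * pg_usleep() argument asks for.
     */
    static uint64
    timed_delay(long request_usecs)
    {
        instr_time  start;
        instr_time  duration;

        INSTR_TIME_SET_CURRENT(start);
        pg_usleep(request_usecs);
        INSTR_TIME_SET_CURRENT(duration);
        INSTR_TIME_SUBTRACT(duration, start);

        /* often noticeably larger than request_usecs on a busy machine */
        return INSTR_TIME_GET_MICROSEC(duration);
    }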
Re: reorganize "Shared Memory and LWLocks" section of docs
Thanks for reviewing. On Fri, Jan 12, 2024 at 05:12:28PM +0300, Aleksander Alekseev wrote: > """ > Any registered shmem_startup_hook will be executed shortly after each > backend attaches to shared memory. > """ > > IMO the word "each" here can give the wrong impression as if there are > certain guarantees about synchronization between backends. Maybe we > should change this to simply "... will be executed shortly after > [the?] backend attaches..." I see what you mean, but I don't think the problem is the word "each." I think the problem is the use of passive voice. What do you think about something like Each backend will execute the registered shmem_startup_hook shortly after it attaches to shared memory. > """ > should ensure that only one process allocates a new tranche_id > (LWLockNewTrancheId) and initializes each new LWLock > (LWLockInitialize). > """ > > Personally I think that reminding the corresponding function name here > is redundant and complicates reading just a bit. But maybe it's just > me. Yeah, I waffled on this one. I don't mind removing it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
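[For context, the pattern that quoted sentence describes looks roughly like the sketch below. The LWLock and shmem functions are the real APIs; "MyShmemState" and the string names are invented for illustration.]

    typedef struct MyShmemState
    {
        int     tranche_id;
        LWLock  lock;
    } MyShmemState;

    static MyShmemState *state = NULL;

    static void
    my_shmem_startup(void)
    {
        bool    found;

        LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
        state = ShmemInitStruct("my extension state", sizeof(MyShmemState), &found);
        if (!found)
        {
            /* only the process that gets here first does the initialization */
            state->tranche_id = LWLockNewTrancheId();
            LWLockInitialize(&state->lock, state->tranche_id);
        }
        LWLockRelease(AddinShmemInitLock);

        /* every backend registers the tranche name in its own process */
        LWLockRegisterTranche(state->tranche_id, "my_extension_tranche");
    }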
Re: tablecmds.c/MergeAttributes() cleanup
On Fri, Jan 12, 2024 at 10:09 AM Robert Haas wrote: > Open-coding stuff like this can easily work out to a loss, and I > personally think we're overly dependent on List. It's not a > particularly good abstraction, IMHO, and if we do a lot of work to > start using it everywhere because a list is really an array, then what > happens when somebody decides that a list really ought to be a > skip-list, or a hash table, or some other crazy thing? This paragraph was a bit garbled. I meant to say that open-coding can be better than relying on a canned abstraction, but it came out the other way around. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Make all Perl warnings fatal
On Fri, Jan 12, 2024 at 9:03 PM Peter Eisentraut wrote: > > On 11.01.24 12:29, Bharath Rupireddy wrote: > > On Sat, Dec 30, 2023 at 12:57 AM Peter Eisentraut > > wrote: > >> > >> committed > > > > With the commit c5385929 converting perl warnings to FATAL, use of > > psql/safe_psql with timeout parameters [1] fail with the following > > error: > > > > Use of uninitialized value $ret in bitwise and (&) at > > /home/ubuntu/postgres/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm > > line 2015. > > I think what is actually relevant is the timed_out parameter, otherwise > the psql/safe_psql function ends up calling "die" and you don't get any > further. > > > Perhaps assigning a default error code to $ret instead of undef in > > PostgreSQL::Test::Cluster - psql() function is the solution. > > I would put this code > > my $core = $ret & 128 ? " (core dumped)" : ""; > die "psql exited with signal " >. ($ret & 127) >. "$core: '$$stderr' while running '@psql_params'" >if $ret & 127; > $ret = $ret >> 8; > > inside a if (defined $ret) block. > > Then the behavior would be that the whole function returns undef on > timeout, which is usefully different from returning 0 (and matches > previous behavior). WFM. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: System username in pg_stat_activity
On Thu, Jan 11, 2024 at 5:55 PM Bertrand Drouvot wrote: > > Hi, > > On Thu, Jan 11, 2024 at 02:24:58PM +0100, Magnus Hagander wrote: > > On Wed, Jan 10, 2024 at 3:12 PM Bertrand Drouvot > > wrote: > > > > > > If we go the 2 fields way, then what about auth_identity and auth_method > > > then? > > > > > > Here is an updated patch based on this idea. > > Thanks! > > + > + > + auth_method text > + > + > + The authentication method used for authenticating the connection, or > + NULL for background processes. > + > > I'm wondering if it would make sense to populate it for parallel workers too. > I think it's doable thanks to d951052, but I'm not sure it's worth it (one > could > join based on the leader_pid though). OTOH that would be consistent with > how the SYSTEM_USER behaves with parallel workers (it's populated). I guess one could conceptually argue that "authentication happens int he leader". But we do populate it with the other user records, and it'd be weird if this one was excluded. The tricky thing is that pgstat_bestart() is called long before we deserialize the data. But from what I can tell it should be safe to change it per the attached? That should be AFAICT an extremely short window of time longer before we report it, not enough to matter. > + > + auth_identity text > + > + > + The identity (if any) that the user presented during the > authentication > + cycle before they were assigned a database role. Contains the same > + value as > > Same remark regarding the parallel workers case +: > > - Would it be better to use the `name` datatype for auth_identity? I've been going back and forth. And I think my conclusion is that it's not a postgres identifier, so it shouldn't be. See the earlier discussion, and for example that that's what we do for cert names when SSL is used. > - what about "Contains the same value as the identity part in linkend="system-user" />"? > > + /* > +* Trust doesn't set_authn_id(), but we still need to > store the > +* auth_method > +*/ > + MyClientConnectionInfo.auth_method = uaTrust; > > +1, I think it is useful here to provide "trust" and not a NULL value in the > context of this patch. Yeah, that's probably "independently correct", but actually useful here. > +# pg_stat_activity shold contain trust and empty string for trust auth > > typo: s/shold/should/ Ops. > +# Users with md5 auth should show both auth method and name in > pg_stat_activity > > what about "show both auth method and identity"? Good spot, yeah, I changed it over to identity everywhere else so it should be here as well. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/ diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c index 849a03e4b6..269ab4e586 100644 --- a/src/backend/access/transam/parallel.c +++ b/src/backend/access/transam/parallel.c @@ -1529,6 +1529,9 @@ ParallelWorkerMain(Datum main_arg) /* Attach to the leader's serializable transaction, if SERIALIZABLE. */ AttachSerializableXact(fps->serializable_xact_handle); + /* Report to the stats system that we are started */ + pgstat_bestart(); + /* * We've initialized all of our state now; nothing should change * hereafter. 
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1ad3367159..ce090f6864 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -22,6 +22,7 @@
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
+#include "access/parallel.h"
 #include "access/session.h"
 #include "access/sysattr.h"
 #include "access/tableam.h"
@@ -1235,7 +1236,7 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	process_session_preload_libraries();

 	/* report this backend in the PgBackendStatus array */
-	if (!bootstrap)
+	if (!bootstrap && !IsParallelWorker())
 		pgstat_bestart();

 	/* close the transaction we started above */
Re: Error "initial slot snapshot too large" in create replication slot
On Thu, Jan 11, 2024 at 9:47 PM Kyotaro Horiguchi wrote: > I understand. So, summirizing the current state briefly, I believe it > as follows: > > a. snapbuild lacks a means to differentiate between top and sub xids > during snapshot building. > > b. Abusing takenDuringRecovery could lead to potential issues, so it > has been rejected. > > c. Dynamic sizing of xip is likely to have a significant impact on > performance (as mentioned in the comments of GetSnapshotData). > > d. (new!) Adding a third storing method is not favored. > > As a method to satisfy these prerequisites, I think one direction > could be to split takenDuringRecovery into flags indicating the > storage method and creation timing. I present this as a last-ditch > effort. It's a rough proposal, and further validation should be > necessary. If this direction is also not acceptable, I'll proceed with > removing the CF entry. I think that the idea of evolving takenDuringRecovery could potentially work for this problem and maybe for some other things as well. I remember studying that flag before and coming to the conclusion that it had some pretty specific and surprising semantics that weren't obvious from the name -- I don't remember the details -- and so I think improving it could be useful. Also, I think that multiple storage methods could be more palatable if there were a clear way to distinguish which one was in use and good comments explaining the distinction in relevant places. However, I wonder whether this whole area is in need of a bigger rethink. There seem to be a number of situations in which the split into xip and subxip arrays is not very convenient, and also some situations where it's quite important. Sometimes we want to record what's committed, and sometimes what isn't. It's all a bit messy and inconsistent. The necessity of limiting snapshot size is annoying, too. I have no real idea what can be done about all of this, but what strikes me is that the current system has grown up incrementally: we started with a data structure designed for the original use case, and now by gradually adding new use cases things have gotten complicated. If we were designing things over from scratch, maybe we'd do it differently and end up with something less messy. And maybe someone can imagine a redesign that is likewise less messy. But on the other hand, maybe not. Perhaps we can't really do any better than what we have. Then the question becomes whether this case is important enough to justify additional code complexity. I don't think I've personally seen users run into this problem so I have no special reason to think that it's important, but if it's causing issues for other people then maybe it is. -- Robert Haas EDB: http://www.enterprisedb.com
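[For readers without snapshot.h at hand, these are the SnapshotData fields the thread keeps referring to, in a paraphrased excerpt; see the header for the authoritative definition and comments. As I understand it, a snapshot taken during recovery stores the known XIDs in subxip[] without separating top-level from subtransaction IDs, which is what item (a) above is getting at.]

    typedef struct SnapshotData
    {
        /* ... */
        TransactionId *xip;         /* in-progress top-level XIDs */
        uint32      xcnt;           /* # of entries in xip[] */
        TransactionId *subxip;      /* in-progress subtransaction XIDs */
        int32       subxcnt;        /* # of entries in subxip[] */
        bool        suboverflowed; /* subxip[] overflowed, so it is incomplete */
        bool        takenDuringRecovery;    /* taken during recovery? */
        /* ... */
    } SnapshotData;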
Re: pgsql: Add new pg_walsummary tool.
On Thu, Jan 11, 2024 at 12:56 PM Robert Haas wrote: > Add new pg_walsummary tool. culicidae is unhappy with this, but I don't yet understand why. The output is: # Failed test 'stdout shows block 0 modified' # at t/002_blocks.pl line 85. # 'TS 1663, DB 5, REL 16384, FORK main: blocks 0..1' # doesn't match '(?^m:FORK main: block 0$)' The test is expecting block 0 to be modified, but block 1 to be unmodified, but here, both blocks are modified. That would maybe make sense if this machine had a really big block size, but that doesn't seem to be the case. Or, maybe the test has erred in failing to disable autovacuum -- though it does take other precautions to try to prevent that from interfering. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pgsql: Add new pg_walsummary tool.
On Thu, Jan 11, 2024 at 1:49 PM Robert Haas wrote: > On Thu, Jan 11, 2024 at 12:56 PM Robert Haas wrote: > > Add new pg_walsummary tool. > > culicidae is unhappy with this, but I don't yet understand why. The output is: > > # Failed test 'stdout shows block 0 modified' > # at t/002_blocks.pl line 85. > # 'TS 1663, DB 5, REL 16384, FORK main: blocks 0..1' > # doesn't match '(?^m:FORK main: block 0$)' > > The test is expecting block 0 to be modified, but block 1 to be > unmodified, but here, both blocks are modified. That would maybe make > sense if this machine had a really big block size, but that doesn't > seem to be the case. Or, maybe the test has erred in failing to > disable autovacuum -- though it does take other precautions to try to > prevent that from interfering. It's not autovacuum, the test is flaky. I ran it in a loop locally until it failed, and then ran pg_waldump, finding this: rmgr: Heaplen (rec/tot): 73/ 8249, tx:738, lsn: 0/0158AEE8, prev 0/01588EB8, desc: UPDATE old_xmax: 738, old_off: 2, old_infobits: [], flags: 0x03, new_xmax: 0, new_off: 76, blkref #0: rel 1663/5/16384 blk 1 FPW, blkref #1: rel 1663/5/16384 blk 0 I'm slightly puzzled, here. I would have expected that if I inserted a bunch of records into the table and then updated one of them, the new record would have gone into a new page at the end of the table, and also that even if it didn't extend the relation, it would go into the same page every time the test was run. But here the behavior seems to be nondeterministic. -- Robert Haas EDB: http://www.enterprisedb.com
Re: index prefetching
Hi, Here's an improved version of this patch, finishing a lot of the stuff that I alluded to earlier - moving the code from indexam.c, renaming a bunch of stuff, etc. I've also squashed it into a single patch, to make it easier to review. I'll briefly go through the main changes in the patch, and then will respond in-line to Robert's points. 1) I moved the code from indexam.c to (new) execPrefetch.c. All the prototypes / typedefs now live in executor.h, with only minimal changes in execnodes.h (adding it to scan descriptors). I believe this finally moves the code to the right place - it feels much nicer and cleaner than in indexam.c. And it allowed me to hide a bunch of internal structs and improve the general API, I think. I'm sure there's stuff that could be named differently, but the layering feels about right, I think. 2) A bunch of stuff got renamed to start with IndexPrefetch... to make the naming consistent / clearer. I'm not entirely sure IndexPrefetch is the right name, though - it's still a bit misleading, as it might seem it's about prefetching index stuff, but really it's about heap pages from indexes. Maybe IndexScanPrefetch() or something like that? 3) If there's a way to make this work with the streaming I/O API, I'm not aware of it. But the overall design seems somewhat similar (based on "next" callback etc.) so hopefully that'd make it easier to adopt it. 4) I initially relied on parallelModeOK to disable prefetching, which kinda worked, but not really. Robert suggested to use the execute_once flag directly, and I think that's much better - not only is it cleaner, it also seems more appropriate (the parallel flag considers other stuff that is not quite relevant to prefetching). Thinking about this, I think it should be possible to make prefetching work even for plans with execute_once=false. In particular, when the plan changes direction it should be possible to simply "walk back" the prefetch queue, to get to the "correct" place in in the scan. But I'm not sure it's worth it, because plans that change direction often can't really benefit from prefetches anyway - they'll often visit stuff they accessed shortly before anyway. For plans that don't change direction but may pause, we don't know if the plan pauses long enough for the prefetched pages to get evicted or something. So I think it's OK that execute_once=false means no prefetching. 5) I haven't done anything about the xs_heap_continue=true case yet. 6) I went through all the comments and reworked them considerably. The main comment at execPrefetch.c start, with some overall design etc. And then there are comments for each function, explaining that bit in more detail. Or at least that's the goal - there's still work to do. There's two trivial FIXMEs, but you can ignore those - it's not that there's a bug, but that I'd like to rework something and just don't know how yet. There's also a couple of XXX comments. Some are a bit wild ideas for the future, others are somewhat "open questions" to be discussed during a review. Anyway, there should be no outright obsolete comments - if there's something I missed, let me know. Now to Robert's message ... On 1/9/24 21:31, Robert Haas wrote: > On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra > wrote: >> Here's a somewhat reworked version of the patch. 
My initial goal was to >> see if it could adopt the StreamingRead API proposed in [1], but that >> turned out to be less straight-forward than I hoped, for two reasons: > > I guess we need Thomas or Andres or maybe Melanie to comment on this. > Yeah. Or maybe Thomas if he has thoughts on how to combine this with the streaming I/O stuff. >> Perhaps a bigger change is that I decided to move this into a separate >> API on top of indexam.c. The original idea was to integrate this into >> index_getnext_tid/index_getnext_slot, so that all callers benefit from >> the prefetching automatically. Which would be nice, but it also meant >> it's need to happen in the indexam.c code, which seemed dirty. > > This patch is hard to review right now because there's a bunch of > comment updating that doesn't seem to have been done for the new > design. For instance: > > + * XXX This does not support prefetching of heap pages. When such > prefetching is > + * desirable, use index_getnext_tid(). > > But not any more. > True. And this is now even more obsolete, as the prefetching was moved from indexam.c layer to the executor. > + * XXX The prefetching may interfere with the patch allowing us to evaluate > + * conditions on the index tuple, in which case we may not need the heap > + * tuple. Maybe if there's such filter, we should prefetch only pages that > + * are not all-visible (and the same idea would also work for IOS), but > + * it also makes the indexing a bit "aware" of the visibility stuff (which > + * seems a somewhat wrong). Also, maybe we should consider the filter > selectivity > > I'm not sure whether all the problems in this
Re: psql not responding to SIGINT upon db reconnection
On Mon, Jan 8, 2024 at 1:03 AM Tristan Partin wrote: > I think the way to go is to expose some variation of libpq's > pqSocketPoll(), which I would be happy to put together a patch for. > Making frontends, psql in this case, have to reimplement the polling > logic doesn't strike me as fruitful, which is essentially what I have > done. I encourage further exploration of this line of attack. I fear that if I were to commit something like what you've posted up until now, people would complain that that code was too ugly to live, and I'd have a hard time telling them that they're wrong. -- Robert Haas EDB: http://www.enterprisedb.com
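[For context, the hand-rolled logic in question is essentially an interruptible socket wait, something like the standard-POSIX sketch below; this is an editor's illustration, not the submitted patch. Having every frontend re-implement a loop like this is what makes exposing a variant of libpq's internal pqSocketPoll() attractive.]

    #include <errno.h>
    #include <poll.h>
    #include <signal.h>
    #include <stdbool.h>

    /* set by a SIGINT handler installed elsewhere, e.g. with pqsignal() */
    static volatile sig_atomic_t cancel_pressed = 0;

    /* Wait for the socket, but give up promptly once SIGINT has arrived. */
    static bool
    wait_for_socket(int sock)
    {
        while (!cancel_pressed)
        {
            struct pollfd pfd = {.fd = sock, .events = POLLIN | POLLOUT};
            int         rc = poll(&pfd, 1, 100);    /* poll in 100 ms slices */

            if (rc > 0)
                return true;
            if (rc < 0 && errno != EINTR)
                return false;
        }
        return false;
    }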
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 9:44 AM Robert Haas wrote: > Looking through the git history, I see that this behavior seems to > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > lazy_scan_noprune(). The comments don't explain the reasoning, but my > guess is that it was just an accident. It's not entirely evident to me > whether there might ever be good reasons to update the freespace map > for a page where we haven't freed up any space -- after all, the free > space map isn't crash-safe, so you could always postulate that > updating it will correct an existing inaccuracy. But I really doubt > that there's any good reason for lazy_scan_prune() and > lazy_scan_noprune() to have different ideas about whether to update > the FSM or not, especially in an obscure corner case like this. Why do you think that lazy_scan_prune() and lazy_scan_noprune() have different ideas about whether to update the FSM or not? Barring certain failsafe edge cases, we always call PageGetHeapFreeSpace() exactly once for each scanned page. While it's true that we won't always do that in the first heap pass (in cases where VACUUM has indexes), that's only because we expect to do it in the second heap pass instead -- since we only want to do it once. It's true that lazy_scan_noprune unconditionally calls PageGetHeapFreeSpace() when vacrel->nindexes == 0. But that's the same behavior as lazy_scan_prune when vacrel-> nindexes == 0. In both cases we know that there won't be any second heap pass, and so in both cases we always call PageGetHeapFreeSpace() in the first heap pass. It's just that it's a bit harder to see that in the lazy_scan_prune case. No? -- Peter Geoghegan
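[In code form, the once-per-scanned-page work Peter is describing is just this pair of calls, issued either in the first heap pass or, when index vacuuming will happen, deferred to the second pass in lazy_vacuum_heap_rel(). This is an editor's sketch, not a quote from vacuumlazy.c.]

    Size        freespace = PageGetHeapFreeSpace(page);

    UnlockReleaseBuffer(buf);
    RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);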
Re: Report planning memory in EXPLAIN ANALYZE
I think this patch is mostly OK and decided to use it to test something else. In doing so I noticed one small thing that bothers me: if planning uses buffers, and those were requested to be reported, we get something like this at the end of the explain Planning: Buffers: shared hit=120 read=30 Planning Memory: used=67944 bytes allocated=73728 bytes Planning Time: 0.892 ms This looks a bit unpleasant. Wouldn't it be much nicer if the Planning thingies were all reported together in the Planning group? Planning: Buffers: shared hit=120 read=30 Memory: used=67944 bytes allocated=73728 bytes Time: 0.892 ms This is easier said than done. First, moving "Time" to be in the same group would break tools that are already parsing the current format. So maybe we should leave that one alone. So that means we end up with this: Planning: Buffers: shared hit=120 read=30 Memory: used=67944 bytes allocated=73728 bytes Planning Time: 0.892 ms Better than the original, I think, even if not perfect. Looking at the code, this is a bit of a mess to implement, because of the way show_buffer_usage is coded; currently it has an internal increment/decrement of indentation, so in order to report memory we would have to move the indentation change to outside show_buffer_usage (I think it does belong outside, but the determination of whether to print or not is quite messy, so we'd need a new function returning boolean). Alternatively we could move the memory reportage to within show_buffer_usage, but that seems worse. Or we could leave it as you have it, but to me that's akin to giving up on doing it nicely. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "I think my standards have lowered enough that now I think 'good design' is when the page doesn't irritate the living f*ck out of me." (JWZ)
Re: index prefetching
Not a full response, but just to address a few points: On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra wrote: > Thinking about this, I think it should be possible to make prefetching > work even for plans with execute_once=false. In particular, when the > plan changes direction it should be possible to simply "walk back" the > prefetch queue, to get to the "correct" place in in the scan. But I'm > not sure it's worth it, because plans that change direction often can't > really benefit from prefetches anyway - they'll often visit stuff they > accessed shortly before anyway. For plans that don't change direction > but may pause, we don't know if the plan pauses long enough for the > prefetched pages to get evicted or something. So I think it's OK that > execute_once=false means no prefetching. +1. > > + * XXX We do add the cache size to the request in order not to > > + * have issues with uint64 underflows. > > > > I don't know what this means. > > > > There's a check that does this: > > (x + PREFETCH_CACHE_SIZE) >= y > > it might also be done as "mathematically equivalent" > > x >= (y - PREFETCH_CACHE_SIZE) > > but if the "y" is an uint64, and the value is smaller than the constant, > this would underflow. It'd eventually disappear, once the "y" gets large > enough, ofc. The problem is, I think, that there's no particular reason that someone reading the existing code should imagine that it might have been done in that "mathematically equivalent" fashion. I imagined that you were trying to make a point about adding the cache size to the request vs. adding nothing, whereas in reality you were trying to make a point about adding from one side vs. subtracting from the other. > > + * We reach here if the index only scan is not parallel, or if we're > > + * serially executing an index only scan that was planned to be > > + * parallel. > > > > Well, this seems sad. > > Stale comment, I believe. However, I didn't see much benefits with > parallel index scan during testing. Having I/O from multiple workers > generally had the same effect, I think. Fair point, likely worth mentioning explicitly in the comment. > Yeah. I renamed all the structs and functions to IndexPrefetchSomething, > to keep it consistent. And then the constants are all capital, ofc. It'd still be nice to get table or heap in there, IMHO, but maybe we can't, and consistency is certainly a good thing regardless of the details, so thanks for that. -- Robert Haas EDB: http://www.enterprisedb.com
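[A tiny standalone illustration of the underflow being discussed; this is an editor's example, with PREFETCH_CACHE_SIZE and the variable names used as placeholders, and uint64 being PostgreSQL's unsigned 64-bit typedef.]

    #define PREFETCH_CACHE_SIZE 1024

    static bool
    within_cache_window(uint64 x, uint64 y)
    {
        /*
         * If y < PREFETCH_CACHE_SIZE, the "mathematically equivalent" form
         *     x >= (y - PREFETCH_CACHE_SIZE)
         * wraps around to a huge unsigned value and gives the wrong answer,
         * so the constant is added on the left-hand side instead.
         */
        return (x + PREFETCH_CACHE_SIZE) >= y;
    }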
Re: Built-in CTYPE provider
On Thu, 2024-01-11 at 18:02 -0800, Jeff Davis wrote: > Jeremy also raised a problem with old versions of psql connecting to > a > new server: the \l and \dO won't work. Not sure exactly what to do > there, but I could work around it by adding a new field rather than > renaming (though that's not ideal). I did a bit of research for a precedent, and the closest I could find is that \dp was broken between 14 and 15 by commit 07eee5a0dc. That is some precedent, but it's more narrow. I think that might justify breaking \dO in older clients, but \l is used frequently. It seems safer to just introduce new columns "datbuiltinlocale" and "collbuiltinlocale". They'll be nullable anyway. If we want to clean this up we can do so as a separate commit. There are other parts of the catalog representation related to collation that we might want to clean up as well. Regards, Jeff Davis
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 9:44 AM Robert Haas wrote: > > On Thu, Jan 11, 2024 at 9:05 PM Melanie Plageman > wrote: > > Ah! I think you are right. Good catch. I could fix this with logic like > > this: > > > > bool space_freed = vacrel->tuples_deleted > tuples_already_deleted; > > if ((vacrel->nindexes == 0 && space_freed) || > > (vacrel->nindexes > 0 && (space_freed || !vacrel->do_index_vacuuming))) > > Perhaps that would be better written as space_freed || > (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming), at which point > you might not need to introduce the variable. As I revisited this, I realized why I had written this logic for updating the FSM: if (vacrel->nindexes == 0 || !vacrel->do_index_vacuuming || lpdead_items == 0) It is because in master, in lazy_scan_heap(), when there are no indexes and no space has been freed, we actually keep going and hit the other logic to update the FSM at the bottom of the loop: if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) The effect is that if there are no indexes and no space is freed we always update the FSM. When combined, that means that if there are no indexes, we always update the FSM. With the logic you suggested, we fail a pageinspect test which expects the FSM to exist when it doesn't. So, adding the logic: if (space_freed || (vacrel->nindexes > 0 && !vacrel->do_index_vacuuming) || vacrel->nindexes == 0) We pass that pageinspect test. But, we fail a freespacemap test which expects some freespace reported which isn't: id| blkno | is_avail -+---+-- - freespace_tab | 0 | t + freespace_tab | 0 | f which I presume is because space_freed is not the same as !has_lpdead_items. I can't really see what logic would be right here. > > I think I made this mistake when working on a different version of > > this that combined the prune and no prune cases. I noticed that > > lazy_scan_noprune() updates the FSM whenever there are no indexes. I > > wonder why this is (and why we don't do it in the prune case). > > Yeah, this all seems distinctly under-commented. As noted above, this > code has grown organically, and some things just never got written > down. > > Looking through the git history, I see that this behavior seems to > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > lazy_scan_noprune(). The comments don't explain the reasoning, but my > guess is that it was just an accident. It's not entirely evident to me > whether there might ever be good reasons to update the freespace map > for a page where we haven't freed up any space -- after all, the free > space map isn't crash-safe, so you could always postulate that > updating it will correct an existing inaccuracy. But I really doubt > that there's any good reason for lazy_scan_prune() and > lazy_scan_noprune() to have different ideas about whether to update > the FSM or not, especially in an obscure corner case like this. All of this makes me wonder why we would update the FSM in vacuum when no space was freed. I thought that RelationAddBlocks() and RelationGetBufferForTuple() would handle making sure the FSM gets created and updated if the reason there is freespace is because the page hasn't been filled yet. - Melanie
Re: Custom explain options
On 10/21/23 14:16, Konstantin Knizhnik wrote: > Hi hackers, > > EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, > COST,...) which help to provide useful details of query execution. > In Neon we have added PREFETCH option which shows information about page > prefetching during query execution (prefetching is more critical for Neon > architecture because of separation of compute and storage, so it is > implemented not only for bitmap heap scan as in Vanilla Postgres, but > also for seqscan, indexscan and indexonly scan). Another possible > candidate for explain options is local file cache (extra caching layer > above shared buffers which is used to somehow replace file system cache > in standalone Postgres). Not quite related to this patch about EXPLAIN options, but can you share some details how you implemented prefetching for the other nodes? I'm asking because I've been working on prefetching for index scans, so I'm wondering if there's a better way to do this, or how to do it in a way that would allow neon to maybe leverage that too. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: psql not responding to SIGINT upon db reconnection
On Fri Jan 12, 2024 at 10:45 AM CST, Robert Haas wrote: On Mon, Jan 8, 2024 at 1:03 AM Tristan Partin wrote: > I think the way to go is to expose some variation of libpq's > pqSocketPoll(), which I would be happy to put together a patch for. > Making frontends, psql in this case, have to reimplement the polling > logic doesn't strike me as fruitful, which is essentially what I have > done. I encourage further exploration of this line of attack. I fear that if I were to commit something like what you've posted up until now, people would complain that that code was too ugly to live, and I'd have a hard time telling them that they're wrong. Completely agree. Let me look into this. Perhaps I can get something up next week or the week after. -- Tristan Partin Neon (https://neon.tech)
Re: introduce dynamic shared memory registry
Here's a new version of the patch set in which I've attempted to address the feedback in this thread. Note that 0001 is being tracked in a separate thread [0], but it is basically a prerequisite for adding the documentation for this feature, so that's why I've also added it here. [0] https://postgr.es/m/20240112041430.GA3557928%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 7cf22727a96757bf212ec106bd471bf55a6981b9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 11 Jan 2024 21:55:25 -0600 Subject: [PATCH v6 1/3] reorganize shared memory and lwlocks documentation --- doc/src/sgml/xfunc.sgml | 182 +--- 1 file changed, 114 insertions(+), 68 deletions(-) diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 89116ae74c..0ba52b41d4 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray -Shared Memory and LWLocks +Shared Memory - - Add-ins can reserve LWLocks and an allocation of shared memory on server - startup. The add-in's shared library must be preloaded by specifying - it in - shared_preload_libraries. - The shared library should register a shmem_request_hook - in its _PG_init function. This - shmem_request_hook can reserve LWLocks or shared memory. - Shared memory is reserved by calling: + + Requesting Shared Memory at Startup + + + Add-ins can reserve shared memory on server startup. To do so, the + add-in's shared library must be preloaded by specifying it in + shared_preload_libraries. + The shared library should also register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve shared memory by + calling: void RequestAddinShmemSpace(int size) - from your shmem_request_hook. - - - LWLocks are reserved by calling: + Each backend sould obtain a pointer to the reserved shared memory by + calling: + +void *ShmemInitStruct(const char *name, Size size, bool *foundPtr) + + If this function sets foundPtr to + false, the caller should proceed to initialize the + contents of the reserved shared memory. If foundPtr + is set to true, the shared memory was already + initialized by another backend, and the caller need not initialize + further. + + + + To avoid race conditions, each backend should use the LWLock + AddinShmemInitLock when initializing its allocation + of shared memory, as shown here: + +static mystruct *ptr = NULL; +boolfound; + +LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); +ptr = ShmemInitStruct("my struct name", size, &found); +if (!found) +{ +... initialize contents of shared memory ... +ptr->locks = GetNamedLWLockTranche("my tranche name"); +} +LWLockRelease(AddinShmemInitLock); + + shmem_startup_hook provides a convenient place for the + initialization code, but it is not strictly required that all such code + be placed in this hook. Each backend will execute the registered + shmem_startup_hook shortly after it attaches to shared + memory. Note that add-ins should still acquire + AddinShmemInitLock within this hook, as shown in the + example above. + + + + An example of a shmem_request_hook and + shmem_startup_hook can be found in + contrib/pg_stat_statements/pg_stat_statements.c in + the PostgreSQL source tree. + + + + + +LWLocks + + + Requesting LWLocks at Startup + + + Add-ins can reserve LWLocks on server startup. 
Like with shared memory, + the add-in's shared library must be preloaded by specifying it in + shared_preload_libraries, + and the shared library should register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve LWLocks by calling: void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks) - from your shmem_request_hook. This will ensure that an array of - num_lwlocks LWLocks is available under the name - tranche_name. Use GetNamedLWLockTranche - to get a pointer to this array. - - - An example of a shmem_request_hook can be found in - contrib/pg_stat_statements/pg_stat_statements.c in the - PostgreSQL source tree. - - - There is another, more flexible method of obtaining LWLocks. First, - allocate a tranche_id from a shared counter by - calling: + This ensures that an array of num_lwlocks LWLocks is + available under the name tranche_name. A pointer to + this array can be obtained by calling: -int LWLockNewTrancheId(void) +LWLockPadded *GetNamedLWLockTranche(const char *tranche_name) - Next, each individual process using the tranche_id - should associate it with
Re: Report planning memory in EXPLAIN ANALYZE
At 2024-01-12 17:52:27 +0100, alvhe...@alvh.no-ip.org wrote: > > I think this patch is mostly OK (After the last few rounds of changes, it looks fine to me too.) > Planning: >Buffers: shared hit=120 read=30 >Memory: used=67944 bytes allocated=73728 bytes > Planning Time: 0.892 ms > > […] > > Or we could leave it as you have it, but to me that's akin to giving up > on doing it nicely. For completeness, there's a third option, which is easier to write and a bit more friendly to the sort of thing that might already be parsing "Planning Time", viz., Planning Buffers: shared hit=120 read=30 Planning Memory: used=67944 bytes allocated=73728 bytes Planning Time: 0.892 ms (Those "bytes" look slightly odd to me in the midst of all the x=y pieces, but that's probably not worth thinking about.) -- Abhijit
Re: reorganize "Shared Memory and LWLocks" section of docs
On Fri, Jan 12, 2024 at 09:46:50AM -0600, Nathan Bossart wrote: > On Fri, Jan 12, 2024 at 05:12:28PM +0300, Aleksander Alekseev wrote: >> """ >> Any registered shmem_startup_hook will be executed shortly after each >> backend attaches to shared memory. >> """ >> >> IMO the word "each" here can give the wrong impression as if there are >> certain guarantees about synchronization between backends. Maybe we >> should change this to simply "... will be executed shortly after >> [the?] backend attaches..." > > I see what you mean, but I don't think the problem is the word "each." I > think the problem is the use of passive voice. What do you think about > something like > > Each backend will execute the registered shmem_startup_hook shortly > after it attaches to shared memory. > >> """ >> should ensure that only one process allocates a new tranche_id >> (LWLockNewTrancheId) and initializes each new LWLock >> (LWLockInitialize). >> """ >> >> Personally I think that reminding the corresponding function name here >> is redundant and complicates reading just a bit. But maybe it's just >> me. > > Yeah, I waffled on this one. I don't mind removing it. Here is a new version of the patch with these changes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 7cf22727a96757bf212ec106bd471bf55a6981b9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 11 Jan 2024 21:55:25 -0600 Subject: [PATCH v2 1/1] reorganize shared memory and lwlocks documentation --- doc/src/sgml/xfunc.sgml | 182 +--- 1 file changed, 114 insertions(+), 68 deletions(-) diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 89116ae74c..0ba52b41d4 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray -Shared Memory and LWLocks +Shared Memory - - Add-ins can reserve LWLocks and an allocation of shared memory on server - startup. The add-in's shared library must be preloaded by specifying - it in - shared_preload_libraries. - The shared library should register a shmem_request_hook - in its _PG_init function. This - shmem_request_hook can reserve LWLocks or shared memory. - Shared memory is reserved by calling: + + Requesting Shared Memory at Startup + + + Add-ins can reserve shared memory on server startup. To do so, the + add-in's shared library must be preloaded by specifying it in + shared_preload_libraries. + The shared library should also register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve shared memory by + calling: void RequestAddinShmemSpace(int size) - from your shmem_request_hook. - - - LWLocks are reserved by calling: + Each backend sould obtain a pointer to the reserved shared memory by + calling: + +void *ShmemInitStruct(const char *name, Size size, bool *foundPtr) + + If this function sets foundPtr to + false, the caller should proceed to initialize the + contents of the reserved shared memory. If foundPtr + is set to true, the shared memory was already + initialized by another backend, and the caller need not initialize + further. + + + + To avoid race conditions, each backend should use the LWLock + AddinShmemInitLock when initializing its allocation + of shared memory, as shown here: + +static mystruct *ptr = NULL; +boolfound; + +LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); +ptr = ShmemInitStruct("my struct name", size, &found); +if (!found) +{ +... initialize contents of shared memory ... 
+ptr->locks = GetNamedLWLockTranche("my tranche name"); +} +LWLockRelease(AddinShmemInitLock); + + shmem_startup_hook provides a convenient place for the + initialization code, but it is not strictly required that all such code + be placed in this hook. Each backend will execute the registered + shmem_startup_hook shortly after it attaches to shared + memory. Note that add-ins should still acquire + AddinShmemInitLock within this hook, as shown in the + example above. + + + + An example of a shmem_request_hook and + shmem_startup_hook can be found in + contrib/pg_stat_statements/pg_stat_statements.c in + the PostgreSQL source tree. + + + + + +LWLocks + + + Requesting LWLocks at Startup + + + Add-ins can reserve LWLocks on server startup. Like with shared memory, + the add-in's shared library must be preloaded by specifying it in + shared_preload_libraries, + and the shared library should register a + shmem_request_hook in its + _PG_init function. This + shmem_request_hook can reserve LWLocks by calling: void RequestNamedLWLockTranche(const char *tranche_name, in
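[To complement the examples in the patch above, here is the _PG_init() boilerplate that typically goes with these hooks, modeled loosely on pg_stat_statements. This is a sketch only; the my_* names and MyShmemState are invented.]

    typedef struct MyShmemState
    {
        int     something;      /* extension-defined shared state goes here */
    } MyShmemState;

    static shmem_request_hook_type prev_shmem_request_hook = NULL;
    static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

    static void
    my_shmem_request(void)
    {
        if (prev_shmem_request_hook)
            prev_shmem_request_hook();

        RequestAddinShmemSpace(sizeof(MyShmemState));
        RequestNamedLWLockTranche("my tranche name", 1);
    }

    static void
    my_shmem_startup(void)
    {
        if (prev_shmem_startup_hook)
            prev_shmem_startup_hook();

        /* ShmemInitStruct() under AddinShmemInitLock, as shown in the docs */
    }

    void
    _PG_init(void)
    {
        if (!process_shared_preload_libraries_in_progress)
            return;

        prev_shmem_request_hook = shmem_request_hook;
        shmem_request_hook = my_shmem_request;
        prev_shmem_startup_hook = shmem_startup_hook;
        shmem_startup_hook = my_shmem_startup;
    }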
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 11:50 AM Peter Geoghegan wrote: > > On Fri, Jan 12, 2024 at 9:44 AM Robert Haas wrote: > > Looking through the git history, I see that this behavior seems to > > date back to 44fa84881fff4529d68e2437a58ad2c906af5805 which introduced > > lazy_scan_noprune(). The comments don't explain the reasoning, but my > > guess is that it was just an accident. It's not entirely evident to me > > whether there might ever be good reasons to update the freespace map > > for a page where we haven't freed up any space -- after all, the free > > space map isn't crash-safe, so you could always postulate that > > updating it will correct an existing inaccuracy. But I really doubt > > that there's any good reason for lazy_scan_prune() and > > lazy_scan_noprune() to have different ideas about whether to update > > the FSM or not, especially in an obscure corner case like this. > > Why do you think that lazy_scan_prune() and lazy_scan_noprune() have > different ideas about whether to update the FSM or not? > > Barring certain failsafe edge cases, we always call > PageGetHeapFreeSpace() exactly once for each scanned page. While it's > true that we won't always do that in the first heap pass (in cases > where VACUUM has indexes), that's only because we expect to do it in > the second heap pass instead -- since we only want to do it once. > > It's true that lazy_scan_noprune unconditionally calls > PageGetHeapFreeSpace() when vacrel->nindexes == 0. But that's the same > behavior as lazy_scan_prune when vacrel-> nindexes == 0. In both cases > we know that there won't be any second heap pass, and so in both cases > we always call PageGetHeapFreeSpace() in the first heap pass. It's > just that it's a bit harder to see that in the lazy_scan_prune case. > No? So, I think this is the logic in master: Prune case, first pass - indexes == 0 && space_freed -> update FSM - indexes == 0 && (!space_freed || !index_vacuuming) -> update FSM - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM which reduces to: - indexes == 0 || !space_freed || !index_vacuuming No Prune (except aggressive vacuum), first pass: - indexes == 0 -> update FSM - indexes > 0 && !space_freed -> update FSM which reduces to: - indexes == 0 || !space_freed which is, during the first pass, if we will not do index vacuuming, either because we have no indexes, because there is nothing to vacuum, or because do_index_vacuuming is false, make sure we update the freespace map now. but what about no prune when do_index_vacuuming is false? I still don't understand why vacuum is responsible for updating the FSM per page when no line pointers have been set unused. That is how PageGetFreeSpace() figures out if there is free space, right? - Melanie
Re: introduce dynamic shared memory registry
At 2024-01-12 11:21:52 -0600, nathandboss...@gmail.com wrote: > > From: Nathan Bossart > Date: Thu, 11 Jan 2024 21:55:25 -0600 > Subject: [PATCH v6 1/3] reorganize shared memory and lwlocks documentation > > --- > doc/src/sgml/xfunc.sgml | 182 +--- > 1 file changed, 114 insertions(+), 68 deletions(-) > > diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml > index 89116ae74c..0ba52b41d4 100644 > --- a/doc/src/sgml/xfunc.sgml > +++ b/doc/src/sgml/xfunc.sgml > @@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS > anyarray > > > […] > - from your shmem_request_hook. > - > - > - LWLocks are reserved by calling: > + Each backend sould obtain a pointer to the reserved shared memory by sould → should > + Add-ins can reserve LWLocks on server startup. Like with shared > memory, (Would "As with shared memory" read better? Maybe, but then again maybe it should be left alone because you also write "Unlike with" elsewhere.) -- Abhijit
Re: Built-in CTYPE provider
Jeff Davis wrote: > > Jeremy also raised a problem with old versions of psql connecting to > > a > > new server: the \l and \dO won't work. Not sure exactly what to do > > there, but I could work around it by adding a new field rather than > > renaming (though that's not ideal). > > I did a bit of research for a precedent, and the closest I could find > is that \dp was broken between 14 and 15 by commit 07eee5a0dc. Another one is that version 12 broke \d in older psql by removing pg_class.relhasoids. ISTM that in general the behavior of old psql vs new server does not weight much against choosing optimal catalog changes. There's also that warning at start informing users about it: WARNING: psql major version X, server major version Y. Some psql features might not work. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman wrote: > So, I think this is the logic in master: > > Prune case, first pass > > ... > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM What is "space_freed"? Isn't that something from your uncommitted patch? As I said, the aim is to call PageGetHeapFreeSpace() (*not* PageGetFreeSpace(), which is only used for index pages) exactly once per heap page scanned. This is supposed to happen independently of whatever specific work was/will be required for the heap page. In general, we don't ever trust that the FSM is already up-to-date. Presumably because the FSM isn't crash safe. On master, prunestate.has_lpdead_items may be set true when our VACUUM wasn't actually the thing that performed pruning that freed tuple storage -- typically when some other backend was the one that did all required pruning at some earlier point in time, often via opportunistic pruning. For better or worse, the only thing that VACUUM aims to do is make sure that PageGetHeapFreeSpace() gets called exactly once per scanned page. > which is, during the first pass, if we will not do index vacuuming, > either because we have no indexes, because there is nothing to vacuum, > or because do_index_vacuuming is false, make sure we update the > freespace map now. > > but what about no prune when do_index_vacuuming is false? Note that do_index_vacuuming cannot actually affect the "nindexes == 0" case. We have an assertion that's kinda relevant to this point: if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) { /* * Wait until lazy_vacuum_heap_rel() to save free space. * */ Assert(vacrel->nindexes > 0); UnlockReleaseBuffer(buf); } else { } We should never get this far down in the lazy_scan_heap() loop when "nindexes == 0 && prunestate.has_lpdead_items". That's handled in the special "single heap pass" branch after lazy_scan_prune(). As for the case where we use lazy_scan_noprune() for a "nindexes == 0" VACUUM's heap page, we also can't get this far down. (Actually, lazy_scan_noprune lies if it has to in order to make sure that lazy_scan_heap never has to deal with a "nindexes == 0" VACUUM with LP_DEAD items from a no-cleanup-lock page. This is a kludge -- see comments about how this is "dishonest" inside lazy_scan_noprune().) > I still don't understand why vacuum is responsible for updating the > FSM per page when no line pointers have been set unused. That is how > PageGetFreeSpace() figures out if there is free space, right? You mean PageGetHeapFreeSpace? Not really. (Though even pruning can set line pointers unused, or heap-only tuples.) Even if pruning doesn't happen in VACUUM, that doesn't mean that the FSM is up-to-date. In short, we do these things with the free space map because it is a map of free space (which isn't crash safe) -- nothing more. I happen to agree that that general design has a lot of problems, but those seem out of scope here. -- Peter Geoghegan
Re: Built-in CTYPE provider
On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite wrote: > ISTM that in general the behavior of old psql vs new server does > not weight much against choosing optimal catalog changes. +1. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Built-in CTYPE provider
On Fri, 2024-01-12 at 19:00 +0100, Daniel Verite wrote: > Another one is that version 12 broke \d in older psql by > removing pg_class.relhasoids. > > ISTM that in general the behavior of old psql vs new server does > not weigh much against choosing optimal catalog changes. > > There's also that warning at startup informing users about it: > WARNING: psql major version X, server major version Y. > Some psql features might not work. Good point, I'll leave it as-is then. If someone complains I can rework it. Also, the output of \l changes from version to version, so if there are automated tools processing the output then they'd have to change anyway. Regards, Jeff Davis
Re: speed up a logical replica setup
Hi, On Fri, 12 Jan 2024 at 09:32, Hayato Kuroda (Fujitsu) wrote: > > Good, all warnings were removed. However, the patch failed to pass tests on > FreeBSD twice. > I'm quite not sure the ERROR, but is it related with us? The FreeBSD errors started after FreeBSD's CI image was updated [1]. I do not think the error is related to this. [1] https://cirrus-ci.com/task/4700394639589376 -- Regards, Nazir Bilal Yavuz Microsoft
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 11:50 AM Peter Geoghegan wrote: > Why do you think that lazy_scan_prune() and lazy_scan_noprune() have > different ideas about whether to update the FSM or not? When lazy_scan_prune() is called, we call RecordPageWithFreeSpace if vacrel->nindexes == 0 && prunestate.has_lpdead_items. See the code near the comment that begins "Consider the need to do page-at-a-time heap vacuuming". When lazy_scan_noprune() is called, whether we call RecordPageWithFreeSpace depends on the output parameter recordfreespace. That is set to true whenever vacrel->nindexes == 0 || lpdead_items == 0. See the code near the comment that begins "Save any LP_DEAD items". The case where I thought there was a behavior difference is when vacrel->nindexes == 0 and lpdead_items == 0 and thus prunestate.has_lpdead_items is false. But now I see (from Melanie's email) that this isn't really true, because in that case we fall through to the logic that we use when indexes are present, giving us a second chance to call RecordPageWithFreeSpace(), which we take when (prunestate.has_lpdead_items && vacrel->do_index_vacuuming) comes out false, as it always does in the scenario that I postulated. P.S. to Melanie: I'll respond to your further emails next but I wanted to respond to this one from Peter first so he didn't think I was rudely ignoring him. :-) P.P.S. to everyone: Yikes, this logic is really confusing. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan wrote: > > On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman > wrote: > > So, I think this is the logic in master: > > > > Prune case, first pass > > > > ... > > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM > > What is "space_freed"? Isn't that something from your uncommitted patch? Yes, I was mixing the two together. I just want to make sure that we agree that, on master, when lazy_scan_prune() is called, the logic for whether or not to update the FSM after the first pass is: indexes == 0 || !has_lpdead_items || !index_vacuuming and when lazy_scan_noprune() is called, the logic for whether or not to update the FSM after the first pass is: indexes == 0 || !has_lpdead_items Those seem different to me. - Melanie
Re: Confine vacuum skip logic to lazy_scan_skip
On 1/11/24 5:50 PM, Melanie Plageman wrote: On Fri, Jan 5, 2024 at 5:51 AM Nazir Bilal Yavuz wrote: On Fri, 5 Jan 2024 at 02:25, Jim Nasby wrote: On 1/4/24 2:23 PM, Andres Freund wrote: On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var nskippable_blocks I think these may lead to worse code - the compiler has to reload vacrel->rel_pages/next_unskippable_block for every loop iteration, because it can't guarantee that they're not changed within one of the external functions called in the loop body. Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder if the concept of skipping should go away in the context of vector IO? Instead of thinking about "we can skip this range of blocks", why not maintain a list of "here's the next X number of blocks that we need to vacuum"? Sorry if I misunderstood. AFAIU, with the help of the vectored IO; "the next X number of blocks that need to be vacuumed" will be prefetched by calculating the unskippable blocks ( using the lazy_scan_skip() function ) and the X will be determined by Postgres itself. Do you have something different in your mind? I think you are both right. As we gain more control of readahead from within Postgres, we will likely want to revisit this heuristic as it may not serve us anymore. But the streaming read interface/vectored I/O is also not a drop-in replacement for it. To change anything and ensure there is no regression, we will probably have to do cross-platform benchmarking, though. That being said, I would absolutely love to get rid of the skippable ranges because I find them very error-prone and confusing. Hopefully now that the skipping logic is isolated to a single function, it will be easier not to trip over it when working on lazy_scan_heap(). Yeah, arguably it's just a matter of semantics, but IMO it's a lot clearer to simply think in terms of "here's the next blocks we know we want to vacuum" instead of "we vacuum everything, but sometimes we skip some blocks". -- Jim Nasby, Data Architect, Austin TX
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan wrote: > > On Fri, Jan 12, 2024 at 12:33 PM Melanie Plageman > wrote: > > So, I think this is the logic in master: > > > > Prune case, first pass > > > > ... > > - indexes > 0 && (!space_freed || !index_vacuuming) -> update FSM > > What is "space_freed"? Isn't that something from your uncommitted patch? > > As I said, the aim is to call PageGetHeapFreeSpace() (*not* > PageGetFreeSpace(), which is only used for index pages) exactly once > per heap page scanned. This is supposed to happen independently of > whatever specific work was/will be required for the heap page. In > general, we don't ever trust that the FSM is already up-to-date. > Presumably because the FSM isn't crash safe. > > On master, prunestate.has_lpdead_items may be set true when our VACUUM > wasn't actually the thing that performed pruning that freed tuple > storage -- typically when some other backend was the one that did all > required pruning at some earlier point in time, often via > opportunistic pruning. For better or worse, the only thing that VACUUM > aims to do is make sure that PageGetHeapFreeSpace() gets called > exactly once per scanned page. ... > > I still don't understand why vacuum is responsible for updating the > > FSM per page when no line pointers have been set unused. That is how > > PageGetFreeSpace() figures out if there is free space, right? > > You mean PageGetHeapFreeSpace? Not really. (Though even pruning can > set line pointers unused, or heap-only tuples.) > > Even if pruning doesn't happen in VACUUM, that doesn't mean that the > FSM is up-to-date. > > In short, we do these things with the free space map because it is a > map of free space (which isn't crash safe) -- nothing more. I happen > to agree that that general design has a lot of problems, but those > seem out of scope here. So, there are 3 issues I am trying to understand: 1) How often should vacuum update the FSM (not vacuum as in the second pass but vacuum as in the whole thing that is happening in lazy_scan_heap())? 2) What is the exact logic in master that ensures that vacuum implements the cadence in 1)? 3) How can the logic in 2) be replicated exactly in my patch that sets would-be dead items LP_UNUSED during pruning? From what Peter is saying, I think 1) is decided and is once per page (across all passes). For 2), see my previous email. And for 3), TBD until 2) is agreed upon. - Melanie
Re: Confine vacuum skip logic to lazy_scan_skip
On Fri, Jan 12, 2024 at 2:02 PM Jim Nasby wrote: > > On 1/11/24 5:50 PM, Melanie Plageman wrote: > > On Fri, Jan 5, 2024 at 5:51 AM Nazir Bilal Yavuz wrote: > >> > >> On Fri, 5 Jan 2024 at 02:25, Jim Nasby wrote: > >>> > >>> On 1/4/24 2:23 PM, Andres Freund wrote: > >>> > >>> On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: > >>> > >>> Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var > >>> rel_pages > >>> Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var > >>> nskippable_blocks > >>> > >>> I think these may lead to worse code - the compiler has to reload > >>> vacrel->rel_pages/next_unskippable_block for every loop iteration, > >>> because it > >>> can't guarantee that they're not changed within one of the external > >>> functions > >>> called in the loop body. > >>> > >>> Admittedly I'm not up to speed on recent vacuum changes, but I have to > >>> wonder if the concept of skipping should go away in the context of vector > >>> IO? Instead of thinking about "we can skip this range of blocks", why not > >>> maintain a list of "here's the next X number of blocks that we need to > >>> vacuum"? > >> > >> Sorry if I misunderstood. AFAIU, with the help of the vectored IO; > >> "the next X number of blocks that need to be vacuumed" will be > >> prefetched by calculating the unskippable blocks ( using the > >> lazy_scan_skip() function ) and the X will be determined by Postgres > >> itself. Do you have something different in your mind? > > > > I think you are both right. As we gain more control of readahead from > > within Postgres, we will likely want to revisit this heuristic as it > > may not serve us anymore. But the streaming read interface/vectored > > I/O is also not a drop-in replacement for it. To change anything and > > ensure there is no regression, we will probably have to do > > cross-platform benchmarking, though. > > > > That being said, I would absolutely love to get rid of the skippable > > ranges because I find them very error-prone and confusing. Hopefully > > now that the skipping logic is isolated to a single function, it will > > be easier not to trip over it when working on lazy_scan_heap(). > > Yeah, arguably it's just a matter of semantics, but IMO it's a lot > clearer to simply think in terms of "here's the next blocks we know we > want to vacuum" instead of "we vacuum everything, but sometimes we skip > some blocks". Even "we vacuum some stuff, but sometimes we skip some blocks" would be okay. What we have now is "we vacuum some stuff, but sometimes we skip some blocks, but only if we would skip enough blocks, and, when we decide to do that we can't go back and actually get visibility information for those blocks we skipped because we are too cheap" - Melanie
Re: plpgsql memory leaks
On Fri, Jan 12, 2024 at 2:53 PM, Michael Banck wrote: > Hi, > > On Fri, Jan 12, 2024 at 01:35:24PM +0100, Pavel Stehule wrote: > > On Fri, Jan 12, 2024 at 11:54 AM, Michael Banck wrote: > > > Which version of Postgres is this and on which platform/distribution? > > > > It was tested on master branch (pg 17) on Fedora 39 > > > > > Did you try keeping jit on but setting jit_inline_above_cost to 0? > > > > > > The back-branches have a fix for the above case, i.e. llvmjit memleaks > > > that can be worked around by setting jit_inline_above_cost=0. > > I got that wrong, it needs to be -1 to disable it. > > But if you are already running the master branch, it is probably a > separate issue. > I tested this code: CREATE OR REPLACE FUNCTION public.fx(iter integer) RETURNS void LANGUAGE plpgsql AS $function$ declare c cursor(m bigint) for select distinct i from generate_series(1, m) g(i); t bigint; s bigint; begin for i in 1..iter loop s := 0; for r in c(i*1) loop s := s + r.i; end loop; raise notice '%=%', i, s; end loop; end; $function$ default master branch - res 190MB ram jit_inline_above_cost = -1 doesn't help disabling JIT doesn't help either, so it looks like the wrong hypothesis, and the problem is maybe somewhere else :-/ Regards Pavel > > > Michael >
Re: Custom explain options
On 12/01/2024 7:03 pm, Tomas Vondra wrote: On 10/21/23 14:16, Konstantin Knizhnik wrote: Hi hackers, EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COST,...) which help to provide useful details of query execution. In Neon we have added PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for Neon architecture because of separation of compute and storage, so it is implemented not only for bitmap heap scan as in Vanilla Postgres, but also for seqscan, indexscan and indexonly scan). Another possible candidate for explain options is local file cache (extra caching layer above shared buffers which is used to somehow replace file system cache in standalone Postgres). Not quite related to this patch about EXPLAIN options, but can you share some details of how you implemented prefetching for the other nodes? I'm asking because I've been working on prefetching for index scans, so I'm wondering if there's a better way to do this, or how to do it in a way that would allow neon to maybe leverage that too. regards Yes, I am looking at your PR. What we have implemented in Neon is more specific to the Neon architecture, where storage is separated from compute. So each page not found in shared buffers has to be downloaded from the page server. It adds quite noticeable latency because of the network roundtrip. While vanilla Postgres can rely on the OS file system cache when a page is not found in shared buffers (access to the OS file cache is certainly slower than to shared buffers because of the syscall and copying of the page, but the performance penalty is not very large - less than 15%), Neon has no local files and so has to send a request over the socket. This is why we have to perform aggressive prefetching whenever it is possible (when it is possible to predict the order of subsequent pages). Unlike vanilla Postgres, which implements prefetch only for bitmap heap scan, we have implemented it for seqscan, index scan, indexonly scan, bitmap heap scan, vacuum, pg_prewarm. The main difference between Neon prefetch and vanilla Postgres prefetch is that the first one is backend-specific. So each backend prefetches only pages which it needs. This is why we have to rewrite prefetch for bitmap heap scan, which uses `fadvise` and assumes that pages prefetched into the file cache by one backend can be used by any other backend. Concerning index scans we have implemented two different approaches: for index only scan we try to prefetch leaf pages, and for index scan we prefetch referenced heap pages. In both cases we start from prefetch distance 0 and increase it until it reaches `effective_io_concurrency` for this relation. In doing so we try to avoid prefetching useless pages and slowing down "point" lookups that return one or a few records. If you are interested, you can look at our implementation in the neon repo: all sources are available. But briefly speaking, each backend has its own prefetch ring (prefetch requests which are waiting for a response). The key idea is that we can send several prefetch requests to the page server and then receive multiple replies. It allows increasing the speed of OLAP queries up to 10 times. Heikki thinks that prefetch can be somehow combined with the async-io proposal (based on io_uring). But right now they have nothing in common.
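A rough sketch of the per-backend prefetch ring with an adaptive distance described above might look like the following. This is purely illustrative C (the names and structure are assumptions, not Neon's actual code); in stock PostgreSQL the closest analogue of sending a request to the page server is PrefetchBuffer().

    /* Illustrative sketch only -- not Neon's implementation. */
    #define PREFETCH_RING_SIZE 128

    typedef struct PrefetchRing
    {
        BlockNumber blocks[PREFETCH_RING_SIZE]; /* in-flight prefetch requests */
        int         head;           /* oldest outstanding request */
        int         tail;           /* next slot to fill */
        int         distance;       /* current prefetch distance */
        int         max_distance;   /* capped at effective_io_concurrency */
    } PrefetchRing;

    static void
    prefetch_ahead(PrefetchRing *ring, Relation rel, BlockNumber next_blkno)
    {
        /* Ramp up gradually so "point" lookups don't prefetch uselessly. */
        if (ring->distance < ring->max_distance)
            ring->distance++;

        /* Stay 'distance' blocks ahead of the scan position. */
        while (ring->tail - ring->head < ring->distance &&
               ring->tail - ring->head < PREFETCH_RING_SIZE)
        {
            /* Neon would send a request to the page server socket here. */
            PrefetchBuffer(rel, MAIN_FORKNUM, next_blkno);
            ring->blocks[ring->tail % PREFETCH_RING_SIZE] = next_blkno;
            ring->tail++;
            next_blkno++;
        }
    }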
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman wrote: > Yes, I was mixing the two together. > > I just want to make sure that we agree that, on master, when > lazy_scan_prune() is called, the logic for whether or not to update > the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items || !index_vacuuming > > and when lazy_scan_noprune() is called, the logic for whether or not > to update the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items > > Those seem different to me. This analysis seems correct to me, except that "when lazy_scan_noprune() is called" should really say "when lazy_scan_noprune() is called (and returns true)", because when it returns false we fall through and call lazy_scan_prune() afterwards. Here's a draft patch to clean up the inconsistency here. It also gets rid of recordfreespace, because ISTM that recordfreespace is adding to the confusion here rather than helping anything. -- Robert Haas EDB: http://www.enterprisedb.com v1-0001-Be-more-consistent-about-whether-to-update-the-FS.patch.nocfbot Description: Binary data
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 2:32 PM Robert Haas wrote: > On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman > wrote: > This analysis seems correct to me, except that "when > lazy_scan_noprune() is called" should really say "when > lazy_scan_noprune() is called (and returns true)", because when it > returns false we fall through and call lazy_scan_prune() afterwards. Now that I see your patch, I understand what Melanie must have meant. I agree that there is a small inconsistency here, that we could well do without. In general I am in favor of religiously eliminating such inconsistencies (between lazy_scan_prune and lazy_scan_noprune), unless there is a reason not to. Not because it's necessarily important. More because it's just too hard to be sure whether it might matter. It's usually easier to defensively assume that it matters. > Here's a draft patch to clean up the inconsistency here. It also gets > rid of recordfreespace, because ISTM that recordfreespace is adding to > the confusion here rather than helping anything. You're using "!prunestate.has_lpdead_items" as part of your test that sets "recordfreespace". But lazy_scan_noprune doesn't get passed a pointer to prunestate, so clearly you'll need to detect the same condition some other way. -- Peter Geoghegan
Re: introduce dynamic shared memory registry
On Fri, Jan 12, 2024 at 11:13:46PM +0530, Abhijit Menon-Sen wrote: > At 2024-01-12 11:21:52 -0600, nathandboss...@gmail.com wrote: >> + Each backend sould obtain a pointer to the reserved shared memory by > > sould → should D'oh. Thanks. >> + Add-ins can reserve LWLocks on server startup. Like with shared >> memory, > > (Would "As with shared memory" read better? Maybe, but then again maybe > it should be left alone because you also write "Unlike with" elsewhere.) I think "As with shared memory..." sounds better here. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 2:43 PM Peter Geoghegan wrote: > You're using "!prunestate.has_lpdead_items" as part of your test that > sets "recordfreespace". But lazy_scan_noprune doesn't get passed a > pointer to prunestate, so clearly you'll need to detect the same > condition some other way. OOPS. Thanks. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Extensible storage manager API - SMGR hook Redux
Thought I would show off what is possible with this patchset. Heikki, a couple of months ago in our internal Slack, said: [I would like] a debugging tool that checks that we're not missing any fsyncs. I bumped into a few missing fsync bugs with unlogged tables lately and a tool like that would've been very helpful. My task was to create such a tool, and off I went. I started with the storage manager extension patch that Matthias sent to the list last year[0]. Andres, in another thread[1], said: I've been thinking that we need a validation layer for fsyncs, it's too hard to get right without testing, and crash testing is not likel enough to catch problems quickly / resource intensive. My thought was that we could keep a shared hash table of all files created / dirtied at the fd layer, with the filename as key and the value the current LSN. We'd delete files from it when they're fsynced. At checkpoints we go through the list and see if there's any files from before the redo that aren't yet fsynced. All of this in assert builds only, of course. I took this idea and ran with it. I call it the fsync_checker™️. It is an extension that prints relations that haven't been fsynced prior to a CHECKPOINT. Note that this idea doesn't work in practice because relations might not be fsynced, but they might be WAL-logged, like in the case of createdb. See log_smgrcreate(). I can't think of an easy way to solve this problem looking at the codebase as it stands. Here is a description of the patches: 0001: This is essentially just the patch that Matthias posted earlier, but rebased and adjusted a little bit so storage managers can "inherit" from other storage managers. 0002: This is an extension of 0001, which allows for extensions to set a global storage manager. This is pretty hacky, and if it was going to be pulled into mainline, it would need some better protection. For instance, only one extension should be able to set the global storage manager. We wouldn't want extensions stepping over each other, etc. 0003: Adds a hook for extensions to inspect a checkpoint before it actually occurs. The purpose for the fsync_checker is so that I can iterate over all the relations the extension tracks to find files that haven't been synced prior to the completion of the checkpoint. 0004: This is the actual fsync_checker extension itself. It must be preloaded. Hopefully this is a good illustration of how the initial patch could be used, even though it isn't perfect. [0]: https://www.postgresql.org/message-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK%3DPCNWEMrA%40mail.gmail.com [1]: https://www.postgresql.org/message-id/20220127182838.ba3434dp2pe5vcia%40alap3.anarazel.de -- Tristan Partin Neon (https://neon.tech) From 5ffbc7c35bb3248501b2517d26f99afe02fb53d6 Mon Sep 17 00:00:00 2001 From: Matthias van de Meent Date: Tue, 27 Jun 2023 15:59:23 +0200 Subject: [PATCH v1 1/5] Expose f_smgr to extensions for manual implementation There are various reasons why one would want to create their own implementation of a storage manager, among which are block-level compression, encryption and offloading to cold storage. This patch is a first patch that allows extensions to register their own SMgr. Note, however, that this SMgr is not yet used - only the first SMgr to register is used, and this is currently the md.c smgr. Future commits will include facilities to select an SMgr for each tablespace. 
--- src/backend/postmaster/postmaster.c | 5 + src/backend/storage/smgr/md.c | 172 +++- src/backend/storage/smgr/smgr.c | 129 ++--- src/backend/utils/init/miscinit.c | 13 +++ src/include/miscadmin.h | 1 + src/include/storage/md.h| 4 + src/include/storage/smgr.h | 59 -- 7 files changed, 252 insertions(+), 131 deletions(-) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index feb471dd1d..a0e46fe1f2 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -1010,6 +1010,11 @@ PostmasterMain(int argc, char *argv[]) */ ApplyLauncherRegister(); + /* + * Register built-in managers that are not part of static arrays + */ + register_builtin_dynamic_managers(); + /* * process any libraries that should be preloaded at postmaster start */ diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index b1e9932a29..66a93101ab 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -87,6 +87,21 @@ typedef struct _MdfdVec } MdfdVec; static MemoryContext MdCxt; /* context for all MdfdVec objects */ +SMgrId MdSMgrId; + +typedef struct MdSMgrRelationData +{ + /* parent data */ + SMgrRelationData reln; + /* + * for md.c; per-fork arrays of the number of open segments + * (md_num_open_segs) and the segments themselves (md_seg_fds). + */ + int md_num_open_segs[MAX_FORKNUM + 1]
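The bookkeeping idea described in the message above (Andres's shared hash table of created/dirtied files, keyed by path with the current LSN as value) could be sketched roughly as follows. This is a hypothetical illustration, not the fsync_checker code from the attached patches; a real implementation would keep the table in shared memory and hook it into the fd/smgr layer.

    /* Hypothetical sketch only -- not the fsync_checker extension itself. */
    #include "postgres.h"
    #include "access/xlog.h"
    #include "utils/hsearch.h"

    typedef struct PendingSyncEntry
    {
        char        path[MAXPGPATH];    /* key: file path */
        XLogRecPtr  dirtied_lsn;        /* insert LSN when last dirtied */
    } PendingSyncEntry;

    static HTAB *pending_syncs;

    static void
    pending_syncs_init(void)
    {
        HASHCTL     ctl;

        ctl.keysize = MAXPGPATH;
        ctl.entrysize = sizeof(PendingSyncEntry);
        pending_syncs = hash_create("pending fsyncs", 1024, &ctl,
                                    HASH_ELEM | HASH_STRINGS);
    }

    /* Call whenever a file is created or dirtied. */
    static void
    track_dirty(const char *path)
    {
        PendingSyncEntry *entry;

        entry = hash_search(pending_syncs, path, HASH_ENTER, NULL);
        entry->dirtied_lsn = GetXLogInsertRecPtr();
    }

    /* Call whenever a file is fsync'd. */
    static void
    track_fsync(const char *path)
    {
        (void) hash_search(pending_syncs, path, HASH_REMOVE, NULL);
    }

    /* At checkpoint time, complain about files dirtied before the redo LSN. */
    static void
    report_unsynced(XLogRecPtr redo_lsn)
    {
        HASH_SEQ_STATUS seq;
        PendingSyncEntry *entry;

        hash_seq_init(&seq, pending_syncs);
        while ((entry = hash_seq_search(&seq)) != NULL)
        {
            if (entry->dirtied_lsn < redo_lsn)
                elog(WARNING, "file \"%s\" dirtied at %X/%X was never fsynced",
                     entry->path, LSN_FORMAT_ARGS(entry->dirtied_lsn));
        }
    }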
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 1:52 PM Melanie Plageman wrote: > On Fri, Jan 12, 2024 at 1:07 PM Peter Geoghegan wrote: > > What is "space_freed"? Isn't that something from your uncommitted patch? > > Yes, I was mixing the two together. An understandable mistake. > I just want to make sure that we agree that, on master, when > lazy_scan_prune() is called, the logic for whether or not to update > the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items || !index_vacuuming > > and when lazy_scan_noprune() is called, the logic for whether or not > to update the FSM after the first pass is: > > indexes == 0 || !has_lpdead_items > > Those seem different to me. Right. As I said to Robert just now, I can now see that they're slightly different conditions. FWIW my brain was just ignoring " || !index_vacuuming". I dismissed it as an edge-case, only relevant when the failsafe has kicked in. Which it is. But that's still no reason to allow an inconsistency that we can easily just avoid. -- Peter Geoghegan
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 2:47 PM Robert Haas wrote: > > On Fri, Jan 12, 2024 at 2:43 PM Peter Geoghegan wrote: > > You're using "!prunestate.has_lpdead_items" as part of your test that > > sets "recordfreespace". But lazy_scan_noprune doesn't get passed a > > pointer to prunestate, so clearly you'll need to detect the same > > condition some other way. > > OOPS. Thanks. Also, I think you should combine these in lazy_scan_noprune() now /* Save any LP_DEAD items found on the page in dead_items array */ if (vacrel->nindexes == 0) { /* Using one-pass strategy (since table has no indexes) */ if (lpdead_items > 0) { since we don't set recordfreespace in the outer if statement anymore. And I noticed you missed a reference to the recordfreespace output parameter in the function comment above lazy_scan_noprune(). - Melanie
Re: Recovering from detoast-related catcache invalidations
Xiaoran Wang writes: >> Maybe, but that undocumented hack in SetHintBits seems completely >> unacceptable. Isn't there a cleaner way to make this check? > Maybe we don't need to call 'HeapTupleSatisfiesVisibility' to check if the > tuple has been deleted. > As the tuple's xmin must been committed, so we just need to check if its > xmax is committed, I'm not super thrilled with that. Something I realized last night is that your proposal only works if "ntp" is pointing directly into the catalog's disk buffers. If something earlier than this code had made a local-memory copy of the catalog tuple, then it's possible that its header fields (particularly xmax) are out of date compared to shared buffers and would fail to tell us that some other process just invalidated the tuple. Now in fact, with the current implementation of syscache_getnext() the result is actually a live tuple and so we can expect to see any relevant updates. But I think we'd better add some Asserts that that's so; and that also provides us with a way to call HeapTupleSatisfiesVisibility fully legally, because we can get the buffer reference out of the scan descriptor too. This is uncomfortably much in bed with the tuple table slot code, perhaps, but I don't see a way to do it more cleanly unless we want to add some new provisions to that API. Andres, do you have any thoughts about that? Anyway, this approach gets rid of false positives, which is great for performance and bad for testing. Code coverage says that now we never hit the failure paths during regression tests, which is unsurprising, but I'm not very comfortable with leaving those paths unexercised. I tried to make an isolation test to exercise them, but there's no good way at the SQL level to get a session to block during the detoast step. LOCK TABLE on some catalog's toast table would do, but we disallow it. I thought about adding a small C function to regress.so to take out such a lock, but we have no infrastructure for referencing regress.so from isolation tests. What I ended up doing is adding a random failure about 0.1% of the time in USE_ASSERT_CHECKING builds --- that's intellectually ugly for sure, but doing better seems like way more work than it's worth. 
regards, tom lane diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index c0591cffe5..b1ce7bf513 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -15,6 +15,7 @@ #include "postgres.h" #include "access/genam.h" +#include "access/heapam.h" #include "access/heaptoast.h" #include "access/relscan.h" #include "access/sysattr.h" @@ -24,8 +25,10 @@ #include "catalog/pg_operator.h" #include "catalog/pg_type.h" #include "common/hashfn.h" +#include "common/pg_prng.h" #include "miscadmin.h" #include "port/pg_bitutils.h" +#include "storage/bufmgr.h" #ifdef CATCACHE_STATS #include "storage/ipc.h" /* for on_proc_exit */ #endif @@ -90,10 +93,10 @@ static void CatCachePrintStats(int code, Datum arg); static void CatCacheRemoveCTup(CatCache *cache, CatCTup *ct); static void CatCacheRemoveCList(CatCache *cache, CatCList *cl); static void CatalogCacheInitializeCache(CatCache *cache); -static CatCTup *CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, +static CatCTup *CatalogCacheCreateEntry(CatCache *cache, + HeapTuple ntp, SysScanDesc scandesc, Datum *arguments, - uint32 hashValue, Index hashIndex, - bool negative); + uint32 hashValue, Index hashIndex); static void ReleaseCatCacheWithOwner(HeapTuple tuple, ResourceOwner resowner); static void ReleaseCatCacheListWithOwner(CatCList *list, ResourceOwner resowner); @@ -1372,6 +1375,7 @@ SearchCatCacheMiss(CatCache *cache, SysScanDesc scandesc; HeapTuple ntp; CatCTup*ct; + bool stale; Datum arguments[CATCACHE_MAXKEYS]; /* Initialize local parameter array */ @@ -1380,16 +1384,6 @@ SearchCatCacheMiss(CatCache *cache, arguments[2] = v3; arguments[3] = v4; - /* - * Ok, need to make a lookup in the relation, copy the scankey and fill - * out any per-call fields. - */ - memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys); - cur_skey[0].sk_argument = v1; - cur_skey[1].sk_argument = v2; - cur_skey[2].sk_argument = v3; - cur_skey[3].sk_argument = v4; - /* * Tuple was not found in cache, so we have to try to retrieve it directly * from the relation. If found, we will add it to the cache; if not @@ -1404,9 +1398,28 @@ SearchCatCacheMiss(CatCache *cache, * will eventually age out of the cache, so there's no functional problem. * This case is rare enough that it's not worth expending extra cycles to * detect. + * + * Another case, which we *must* handle, is that the tuple could become + * outdated during CatalogCacheCreateEntry's attempt to detoast it (since + * AcceptInvalidationMessages can run during TOAST table access). We do + * not want to return alread
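For illustration, the assert-build-only random failure described in the message above could be injected roughly like this (a sketch of the idea, not the actual hunk from the attached patch, which uses the pg_prng facility):

    /* Sketch of injecting a rare artificial failure in assert-enabled builds,
     * so that the retry path gets exercised during regression tests. */
    #ifdef USE_ASSERT_CHECKING
        if (pg_prng_double(&pg_global_prng_state) < 0.001)
            return NULL;        /* pretend the tuple went stale under us */
    #endif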
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 3:04 PM Melanie Plageman wrote: > Also, I think you should combine these in lazy_scan_noprune() now > > /* Save any LP_DEAD items found on the page in dead_items array */ > if (vacrel->nindexes == 0) > { > /* Using one-pass strategy (since table has no indexes) */ > if (lpdead_items > 0) > { > > Since we don't set recordfreespace in the outer if statement anymore Well, maybe, but there's an else clause attached to the outer "if", so you have to be a bit careful. I didn't think it was critical to further rejigger this. > And I noticed you missed a reference to recordfreespace output > parameter in the function comment above lazy_scan_noprune(). OK. So what's the best way to solve the problem that Peter pointed out? Should we pass in the prunestate? Maybe just replace bool *recordfreespace with bool *has_lpdead_items? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Invalidate the subscription worker in cases where a user loses their superuser status
On Tue, 2023-10-17 at 14:17 +0530, Amit Kapila wrote: > > Fair enough. I'll wait till early next week (say till Monday EOD) > > to > > see if anyone thinks otherwise and push this patch to HEAD after > > some > > more testing and review. > > > > Pushed. There was a brief discussion on backporting this here: https://www.postgresql.org/message-id/CA%2BTgmob2mYpaUMT7aoFOukYTrpxt-WdgcitJnsjWhufnbDWEeg%40mail.gmail.com It looks like you opted not to backport it. Was there a reason for that? The only risky thing I see there is a change in the Subscription structure -- I suppose that could be used by extensions? (I am fine with it not being backported, but I was inclined to think it should be backported.) Regards, Jeff Davis
cfbot is failing all tests on FreeBSD/Meson builds
It looks like every recent cfbot run has failed in the FreeBSD-13-Meson build, even if it worked in other ones. The symptoms are failures in the TAP tests that try to use interactive_psql: Can't call method "slave" on an undefined value at /usr/local/lib/perl5/site_perl/IPC/Run.pm line 2889. I suspect that we are looking at some bug in IPC::Run that exists in the version that that FreeBSD release has (but not, seemingly, elsewhere in the community), and that was mostly harmless until c53859295 made all Perl warnings fatal. Not sure what we want to do about this. regards, tom lane
Re: libpq compression (part 3)
On Sun, Dec 31, 2023 at 2:32 AM Jacob Burroughs wrote: > I ended up reworking this to use a version of this option in place of > the `none` hackery, but naming the parameters `compress` and > `decompress, so to disable compression but allow decompression you > would specify `libpq_compression=gzip:compress=off`. I'm still a bit befuddled by this interface. libpq_compression=gzip:compress=off looks a lot like it's saying "do it, except don't". I guess that you're using "compress" and "decompress" to distinguish the two directions - i.e. server to client and client to server - but of course in both directions the sender compresses and the receiver decompresses, so I don't find that very clear. I wonder if we could use "upstream" and "downstream" to be clearer? Or some other terminology? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Recovering from detoast-related catcache invalidations
I wrote: > This is uncomfortably much in bed with the tuple table slot code, > perhaps, but I don't see a way to do it more cleanly unless we want > to add some new provisions to that API. Andres, do you have any > thoughts about that? Oh! After nosing around a bit more I remembered systable_recheck_tuple, which is meant for exactly this purpose. So v4 attached. regards, tom lane diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index c0591cffe5..0dcd45d2f0 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -24,6 +24,7 @@ #include "catalog/pg_operator.h" #include "catalog/pg_type.h" #include "common/hashfn.h" +#include "common/pg_prng.h" #include "miscadmin.h" #include "port/pg_bitutils.h" #ifdef CATCACHE_STATS @@ -90,10 +91,10 @@ static void CatCachePrintStats(int code, Datum arg); static void CatCacheRemoveCTup(CatCache *cache, CatCTup *ct); static void CatCacheRemoveCList(CatCache *cache, CatCList *cl); static void CatalogCacheInitializeCache(CatCache *cache); -static CatCTup *CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, +static CatCTup *CatalogCacheCreateEntry(CatCache *cache, + HeapTuple ntp, SysScanDesc scandesc, Datum *arguments, - uint32 hashValue, Index hashIndex, - bool negative); + uint32 hashValue, Index hashIndex); static void ReleaseCatCacheWithOwner(HeapTuple tuple, ResourceOwner resowner); static void ReleaseCatCacheListWithOwner(CatCList *list, ResourceOwner resowner); @@ -1372,6 +1373,7 @@ SearchCatCacheMiss(CatCache *cache, SysScanDesc scandesc; HeapTuple ntp; CatCTup*ct; + bool stale; Datum arguments[CATCACHE_MAXKEYS]; /* Initialize local parameter array */ @@ -1380,16 +1382,6 @@ SearchCatCacheMiss(CatCache *cache, arguments[2] = v3; arguments[3] = v4; - /* - * Ok, need to make a lookup in the relation, copy the scankey and fill - * out any per-call fields. - */ - memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys); - cur_skey[0].sk_argument = v1; - cur_skey[1].sk_argument = v2; - cur_skey[2].sk_argument = v3; - cur_skey[3].sk_argument = v4; - /* * Tuple was not found in cache, so we have to try to retrieve it directly * from the relation. If found, we will add it to the cache; if not @@ -1404,9 +1396,28 @@ SearchCatCacheMiss(CatCache *cache, * will eventually age out of the cache, so there's no functional problem. * This case is rare enough that it's not worth expending extra cycles to * detect. + * + * Another case, which we *must* handle, is that the tuple could become + * outdated during CatalogCacheCreateEntry's attempt to detoast it (since + * AcceptInvalidationMessages can run during TOAST table access). We do + * not want to return already-stale catcache entries, so we loop around + * and do the table scan again if that happens. */ relation = table_open(cache->cc_reloid, AccessShareLock); + do + { + /* + * Ok, need to make a lookup in the relation, copy the scankey and + * fill out any per-call fields. (We must re-do this when retrying, + * because systable_beginscan scribbles on the scankey.) 
+ */ + memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys); + cur_skey[0].sk_argument = v1; + cur_skey[1].sk_argument = v2; + cur_skey[2].sk_argument = v3; + cur_skey[3].sk_argument = v4; + scandesc = systable_beginscan(relation, cache->cc_indexoid, IndexScanOK(cache, cur_skey), @@ -1415,12 +1426,18 @@ SearchCatCacheMiss(CatCache *cache, cur_skey); ct = NULL; + stale = false; while (HeapTupleIsValid(ntp = systable_getnext(scandesc))) { - ct = CatalogCacheCreateEntry(cache, ntp, arguments, - hashValue, hashIndex, - false); + ct = CatalogCacheCreateEntry(cache, ntp, scandesc, NULL, + hashValue, hashIndex); + /* upon failure, we must start the scan over */ + if (ct == NULL) + { + stale = true; + break; + } /* immediately set the refcount to 1 */ ResourceOwnerEnlarge(CurrentResourceOwner); ct->refcount++; @@ -1429,6 +1446,7 @@ SearchCatCacheMiss(CatCache *cache, } systable_endscan(scandesc); + } while (stale); table_close(relation, AccessShareLock); @@ -1447,9 +1465,11 @@ SearchCatCacheMiss(CatCache *cache, if (IsBootstrapProcessingMode()) return NULL; - ct = CatalogCacheCreateEntry(cache, NULL, arguments, - hashValue, hashIndex, - true); + ct = CatalogCacheCreateEntry(cache, NULL, NULL, arguments, + hashValue, hashIndex); + + /* Creating a negative cache entry shouldn't fail */ + Assert(ct != NULL); CACHE_elog(DEBUG2, "SearchCatCache(%s): Contains %d/%d tuples", cache->cc_relname, cache->cc_ntup, CacheHdr->ch_ntup); @@ -1663,7 +1683,8 @@ SearchCatCacheList(CatCache *cache, * We have to bump the member refcounts temporarily to ensure they won't * get dropped from the c
Re: Stack overflow issue
On Fri, Jan 12, 2024 at 10:12 AM Heikki Linnakangas wrote: > Here's one goto-free attempt. It adds a local loop to where the > recursion was, so that if you have a chain of subtransactions that need > to be aborted in CommitTransactionCommand, they are aborted iteratively. > The TBLOCK_SUBCOMMIT case already had such a loop. > > I added a couple of comments in the patch marked with "REVIEWER NOTE", > to explain why I changed some things. They are to be removed before > committing. > > I'm not sure if this is better than a goto. In fact, even if we commit > this, I think I'd still prefer to replace the remaining recursive calls > with a goto. Recursion feels a weird to me, when we're unwinding the > states from the stack as we go. I'm not able to quickly verify whether this version is correct, but I do think the code looks nicer this way. I understand that's a question of opinion rather than fact, though. -- Robert Haas EDB: http://www.enterprisedb.com
Re: libpq compression (part 3)
> I wonder if we could use "upstream" and "downstream" to be clearer? Or > some other terminology? What about `send` and `receive`?
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Fri, Jan 12, 2024 at 3:22 PM Robert Haas wrote: > > On Fri, Jan 12, 2024 at 3:04 PM Melanie Plageman > wrote: > > So what's the best way to solve the problem that Peter pointed out? > Should we pass in the prunestate? Maybe just replace bool > *recordfreespace with bool *has_lpdead_items? Yea, that works for now. I mean, I think the way we should do it is to update the FSM in lazy_scan_noprune(), but, for the purposes of this patch, yes. The has_lpdead_items output parameter seems fine to me. - Melanie
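For clarity, the interface change being agreed on here might look roughly like this (a hypothetical sketch; the real lazy_scan_noprune() has additional parameters that are omitted, and the eventual committed shape may differ):

    /* Hypothetical sketch of the revised interface, parameter list abridged. */
    static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
                                  BlockNumber blkno, Page page,
                                  bool *has_lpdead_items);

    /* The caller would then apply the same FSM-update rule as the
     * lazy_scan_prune() path, instead of acting on a recordfreespace
     * output parameter. */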
Re: libpq compression (part 3)
On Fri, Jan 12, 2024 at 4:02 PM Jacob Burroughs wrote: > > I wonder if we could use "upstream" and "downstream" to be clearer? Or > > some other terminology? > > What about `send` and `receive`? I think that would definitely be better than "compress" and "decompress," but I was worried that it might be unclear to the user whether the parameter that they specified was from the point of view of the client or the server. Perhaps that's a dumb thing to worry about, though. -- Robert Haas EDB: http://www.enterprisedb.com
Re: plpgsql memory leaks
Pavel Stehule writes: > default master branch - res 190MB ram > jit_inline_above_cost = -1 doesn't help > disabling JIT doesn't help either, > > so it looks like the wrong hypothesis, and the problem is maybe somewhere > else :-/ I see no leak with these examples on HEAD, either with or without --enable-llvm --- the process size stays quite stable according to "top". I wonder if you are using some extension that's contributing to the problem. regards, tom lane
Re: Make NUM_XLOGINSERT_LOCKS configurable
On 1/12/24 12:32 AM, Bharath Rupireddy wrote: Test case: ./pgbench --initialize --scale=100 --username=ubuntu postgres ./pgbench --progress=10 --client=64 --time=300 --builtin=tpcb-like --username=ubuntu postgres Setup: ./configure --prefix=$PWD/inst/ CFLAGS="-ggdb3 -O3" > install.log && make -j 8 install > install.log 2>&1 & shared_buffers = '8GB' max_wal_size = '32GB' track_wal_io_timing = on Stats measured: I've used the attached patch to measure WAL Insert Lock Acquire Time (wal_insert_lock_acquire_time) and WAL Wait for In-progress Inserts to Finish Time (wal_wait_for_insert_to_finish_time). Unfortunately this leaves open the question of how frequently WaitXLogInsertionsToFinish() is being called and by whom. One possibility here is that wal_buffers is too small, so backends are constantly having to write WAL data to free up buffers. -- Jim Nasby, Data Architect, Austin TX
Re: [DOC] Add detail regarding resource consumption wrt max_connections
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed I think it is good to warn the user about the increased allocation of memory for certain parameters so that they do not abuse it by setting it to a huge number without knowing the consequences. It is true that max_connections can increase the size of the proc array and other resources, which are allocated in shared memory, which also means less memory available for regular data operations. I am sure this is not the only parameter that affects memory allocation. "max_prepared_xacts" can also affect shared memory allocation, so the same warning message applies here as well. Maybe there are other parameters with similar effects. Instead of stating that a higher max_connections results in higher allocation, it may be better to tell the user that if the value needs to be set much higher, consider increasing the "shared_buffers" setting as well. thank you --- Cary Huang Highgo Software Canada www.highgo.ca
Re: pg_upgrade failing for 200+ million Large Objects
On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote: > * I did not invent a switch to control the batching of blobs; it's > just hard-wired at 1000 blobs per group here. Probably we need some > user knob for that, but I'm unsure if we want to expose a count or > just a boolean for one vs more than one blob per batch. The point of > forcing one blob per batch would be to allow exact control during > selective restore, and I'm not sure if there's any value in random > other settings. On the other hand, selective restore of blobs has > been completely broken for the last dozen years and I can't recall any > user complaints about that; so maybe nobody cares and we could just > leave this as an internal choice. I think the argument for making this configurable is that folks might have fewer larger blobs, in which case it might make sense to lower the batch size, or they might have many smaller blobs, in which case they might want to increase it. But I'm a bit skeptical that will make any sort of tremendous difference in practice, and I'm not sure how a user would decide on the right value besides trial-and-error. In any case, at the moment I think it'd be okay to keep this an internal setting and wait for feedback from the field. > * As the patch stands, we still build a separate TOC entry for each > comment or seclabel or ACL attached to a blob. If you have a lot of > blobs with non-default properties then the TOC bloat problem comes > back again. We could do something about that, but it would take a bit > of tedious refactoring, and the most obvious way to handle it probably > re-introduces too-many-locks problems. Is this a scenario that's > worth spending a lot of time on? I'll ask around about this. > Subject: [PATCH v9 1/4] Some small preliminaries for pg_dump changes. This one looked good to me. > Subject: [PATCH v9 2/4] In dumps, group large objects into matching metadata > and data entries. I spent most of my review time reading this one. Overall, it looks pretty good to me, and I think you've called out most of the interesting design choices. > + char *cmdEnd = psprintf(" OWNER TO %s", > fmtId(te->owner)); > + > + IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", > cmdEnd); This is just a nitpick, but is there any reason not to have IssueCommandPerBlob() accept a format string and the corresponding arguments? > + while (n < 1000 && i + n < ntups) Another nitpick: it might be worth moving these hard-wired constants to macros. Without a control switch, that'd at least make it easy for anyone determined to change the value for their installation. > Subject: [PATCH v9 3/4] Move BLOBS METADATA TOC entries into SECTION_DATA. This one looks reasonable, too. > In this patch I have just hard-wired pg_upgrade to use > --transaction-size 1000. Perhaps there would be value in adding > another pg_upgrade option to allow user control of that, but I'm > unsure that it's worth the trouble; I think few users would use it, > and any who did would see not that much benefit. However, we > might need to adjust the logic to make the size be 1000 divided > by the number of parallel restore jobs allowed. I wonder if it'd be worth making this configurable for pg_upgrade as an escape hatch in case the default setting is problematic. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
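Purely as an illustration of the macro suggestion above (the macro name here is made up, not taken from the patch):

    /* Illustrative only: give the hard-wired batch size a name. */
    #define MAX_BLOBS_PER_TOC_ENTRY     1000

        while (n < MAX_BLOBS_PER_TOC_ENTRY && i + n < ntups)
        {
            /* ... accumulate the next blob OID into the current batch ... */
            n++;
        }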
Re: pg_upgrade failing for 200+ million Large Objects
On Fri, Jan 05, 2024 at 03:02:34PM -0500, Tom Lane wrote: > On further reflection, there is a very good reason why it's done like > that. Because pg_upgrade is doing schema-only dump and restore, > there's next to no opportunity for parallelism within either pg_dump > or pg_restore. There's no data-loading steps, and there's no > index-building either, so the time-consuming stuff that could be > parallelized just isn't happening in pg_upgrade's usage. > > Now it's true that my 0003 patch moves the needle a little bit: > since it makes BLOB creation (as opposed to loading) parallelizable, > there'd be some hope for parallel pg_restore doing something useful in > a database with very many blobs. But it makes no sense to remove the > existing cross-database parallelism in pursuit of that; you'd make > many more people unhappy than happy. I assume the concern is that we'd end up multiplying the effective number of workers if we parallelized both in-database and cross-database? Would it be sufficient to make those separately configurable with a note about the multiplicative effects of setting both? I think it'd be unfortunate if pg_upgrade completely missed out on this improvement. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pg_upgrade failing for 200+ million Large Objects
On Fri, Jan 12, 2024 at 04:42:27PM -0600, Nathan Bossart wrote: > On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote: >> +char *cmdEnd = psprintf(" OWNER TO %s", >> fmtId(te->owner)); >> + >> +IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", >> cmdEnd); > > This is just a nitpick, but is there any reason not to have > IssueCommandPerBlob() accept a format string and the corresponding > arguments? Eh, I guess you'd have to find some other way of specifying where the OID is supposed to go, which would probably be weird. Please disregard this one. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pg_upgrade failing for 200+ million Large Objects
Nathan Bossart writes: > On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote: >> +char *cmdEnd = psprintf(" OWNER TO %s", >> fmtId(te->owner)); >> + >> +IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", >> cmdEnd); > This is just a nitpick, but is there any reason not to have > IssueCommandPerBlob() accept a format string and the corresponding > arguments? The problem is how to combine the individual per-blob OID with the command string given by the caller. If we want the caller to also be able to inject data values, I don't really see how to combine the OID with the va_args list from the caller. If we stick with the design that the caller provides separate "front" and "back" strings, but ask to be able to inject data values into those, then we need two va_args lists which C doesn't support, or else an arbitrary decision about which one gets the va_args. (Admittedly, with only one caller that needs a non-constant string, we could make such a decision; but by the same token we'd gain little.) It'd be notationally simpler if we could have the caller supply one string that includes %u where the OID is supposed to go; but then we have problems if an owner name includes %. So on the whole I'm not seeing anything much better than what I did. Maybe I missed an idea though. > Another nitpick: it might be worth moving these hard-wired constants to > macros. Without a control switch, that'd at least make it easy for anyone > determined to change the value for their installation. OK. regards, tom lane
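A tiny standalone example (assumed names; not pg_dump code) of the hazard Tom describes, i.e. what goes wrong if caller-supplied text ends up inside the format string that is later used to substitute the per-blob OID:

    #include <stdio.h>

    int
    main(void)
    {
        char        fmt[128];

        /* A caller naively folds a quoted owner name into one format string. */
        snprintf(fmt, sizeof(fmt),
                 "ALTER LARGE OBJECT %%u OWNER TO \"%s\";", "100%_sales");

        /*
         * fmt now reads: ALTER LARGE OBJECT %u OWNER TO "100%_sales";
         * Running that back through printf-style formatting to insert the OID
         * would treat "%_s" inside the identifier as a conversion spec --
         * which is why the front/back-string interface keeps caller text out
         * of the format string and composes "front" + OID + "back" instead.
         */
        puts(fmt);
        return 0;
    }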
Fix minor memory leak in connection string validation
Introduced in commit c3afe8cf5a. Someone issuing repeated "CREATE SUBSCRIPTION" commands where the connection has no password and must_have_password is true will leak malloc'd memory in the error path. Minor issue in practice, because I suspect that a user privileged enough to create a subscription could cause bigger problems. It makes me wonder if we should use the resowner mechanism to track pointers to malloc'd memory. Then we could use a standard pattern for these kinds of cases, and it would also catch more remote issues, like if a pstrdup() fails in an error path (which can happen a few lines up if the parse fails). Patch attached; intended for 16 and 17. Regards, Jeff Davis From cbce9f5a6d1f7a7ce19473c45ac4cc74bca49c0e Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Fri, 12 Jan 2024 14:39:05 -0800 Subject: [PATCH] Fix memory leak in connection string validation. Introduced in commit c3afe8cf5a. Backpatch-through: 16 --- src/backend/replication/libpqwalreceiver/libpqwalreceiver.c | 5 + 1 file changed, 5 insertions(+) diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index ead30f87c9..201c36cb22 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -284,10 +284,15 @@ libpqrcv_check_conninfo(const char *conninfo, bool must_use_password) } if (!uses_password) + { + /* malloc'd, so we must free it explicitly */ + PQconninfoFree(opts); + ereport(ERROR, (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED), errmsg("password is required"), errdetail("Non-superusers must provide a password in the connection string."))); + } } PQconninfoFree(opts); -- 2.34.1
Re: cfbot is failing all tests on FreeBSD/Meson builds
On Sat, Jan 13, 2024 at 9:32 AM Tom Lane wrote: > It looks like every recent cfbot run has failed in the > FreeBSD-13-Meson build, even if it worked in other ones. > The symptoms are failures in the TAP tests that try to > use interactive_psql: > > Can't call method "slave" on an undefined value at > /usr/local/lib/perl5/site_perl/IPC/Run.pm line 2889. > > I suspect that we are looking at some bug in IPC::Run that exists in > the version that that FreeBSD release has (but not, seemingly, > elsewhere in the community), and that was mostly harmless until > c53859295 made all Perl warnings fatal. Not sure what we want > to do about this. Right, I see this locally on my FreeBSD box. Reverting the fatal warnings thing doesn't help, it's more broken than that. I tried to understand https://github.com/cpan-authors/IPC-Run/blob/master/lib/IPC/Run.pm https://github.com/cpan-authors/IO-Tty/blob/master/Pty.pm but I am not good at perl. I think the error means that in ## Close all those temporary filehandles that the kids needed. for my $pty ( values %{ $self->{PTYS} } ) { close $pty->slave; } the variable $pty holds undef, so then we have to find out where undef was put into $self->{PTYS}. There is a place that does: ## Just flag the pyt's existence for now. It'll be ## converted to a real IO::Pty by _open_pipes. $self->{PTYS}->{$pty_id} = undef; We can see that this started a couple of days ago: https://github.com/postgres/postgres/commits/master/ But at older commits I see the problem locally too, so, yeah, it must be coming from somewhere else. Looking at the relevant packages p5-IPC-Run and p5-IO-Tty I see that they recently moved to 20231003.0 and 1.18, respectively. Downgrading p5-IPC-Run to p5-IPC-Run-20220807.0.pkg was not enough. Downgrading p5-IO-Tty to p5-IO-Tty-1.17.pkg allowed psql/010_tab_completion to pass with either version of p5-IPC-Run. The answer may lie in the commits between those versions here: https://github.com/cpan-authors/IO-Tty/commits/master/ I see that Debian Bookworm is still using 1.17, which fits with your guess that FreeBSD is just breaking first because it is using a newer version.
Re: cfbot is failing all tests on FreeBSD/Meson builds
Thomas Munro writes: > Looking at the relevant packages > p5-IPC-Run and p5-IO-Tty I see that they recently moved to 20231003.0 > and 1.18, respectively. Downgrading p5-IPC-Run to > p5-IPC-Run-20220807.0.pkg was not enough. Downgrading p5-IO-Tty to > p5-IO-Tty-1.17.pkg allowed psql/010_tab_completion to pass with either > version of p5-IPC-Run. > The answer may lie in the commits between those versions here: > https://github.com/cpan-authors/IO-Tty/commits/master/ > I see that Debian Bookworm is still using 1.17, which fits with your > guess that FreeBSD is just breaking first because it is using a newer > version. Yeah, I have nothing newer than 1.17 here either. Time for a bug report to IO::Tty's authors, I guess. regards, tom lane
Re: cfbot is failing all tests on FreeBSD/Meson builds
On Sat, Jan 13, 2024 at 1:51 PM Tom Lane wrote: > Time for a bug report to IO::Tty's authors, I guess. Ahh, there is one: https://github.com/cpan-authors/IO-Tty/issues/38 In the meantime, will look into whether I can pin that package to 1.17 somewhere in the pipeline, hopefully later today...
Re: Fix minor memory leak in connection string validation
On Fri, Jan 12, 2024 at 03:06:26PM -0800, Jeff Davis wrote: > It makes me wonder if we should use the resowner mechanism to track > pointers to malloc'd memory. Then we could use a standard pattern for > these kinds of cases, and it would also catch more remote issues, like > if a pstrdup() fails in an error path (which can happen a few lines up > if the parse fails). That seems worth exploring. > if (!uses_password) > + { > + /* malloc'd, so we must free it explicitly */ > + PQconninfoFree(opts); > + > ereport(ERROR, > > (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED), >errmsg("password is required"), >errdetail("Non-superusers must provide > a password in the connection string."))); > + } > } > > PQconninfoFree(opts); Another option could be to surround this with PG_TRY/PG_FINALLY, but your patch seems sufficient, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
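A sketch of the PG_TRY/PG_FINALLY alternative Nathan mentions (hypothetical; the fix actually proposed upthread is the explicit PQconninfoFree before the ereport):

    /* Hypothetical alternative shape for libpqrcv_check_conninfo(). */
    opts = PQconninfoParse(conninfo, &err);
    /* ... existing handling of a NULL result ... */

    PG_TRY();
    {
        /* ... scan opts[] to see whether a password was supplied, then: */
        if (must_use_password && !uses_password)
            ereport(ERROR,
                    (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
                     errmsg("password is required"),
                     errdetail("Non-superusers must provide a password in the connection string.")));
    }
    PG_FINALLY();
    {
        PQconninfoFree(opts);
    }
    PG_END_TRY();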
Re: Fix minor memory leak in connection string validation
Nathan Bossart writes: > On Fri, Jan 12, 2024 at 03:06:26PM -0800, Jeff Davis wrote: >> It makes me wonder if we should use the resowner mechanism to track >> pointers to malloc'd memory. Then we could use a standard pattern for >> these kinds of cases, and it would also catch more remote issues, like >> if a pstrdup() fails in an error path (which can happen a few lines up >> if the parse fails). > That seems worth exploring. I'm pretty dubious about adding overhead for that, mainly because most of the direct callers of malloc in a backend are going to be code that's not under our control. Modifying the callers that we do control is not going to give a full solution, and could well be outright misleading. > Another option could be to surround this with PG_TRY/PG_FINALLY, but your > patch seems sufficient, too. Yeah, seems fine for now. If that function grows any more complexity then we could think about using PG_TRY. regards, tom lane
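As a concrete illustration of the PG_TRY/PG_FINALLY alternative mentioned above, a minimal sketch is below. It is not the committed fix: check_conn_string_has_password is an invented name and the password check is simplified. The point is only that the malloc'd libpq array gets freed on both the normal and the error path:

#include "postgres.h"

#include "libpq-fe.h"

static void
check_conn_string_has_password(const char *connstr)
{
	PQconninfoOption *opts;
	bool		uses_password = false;

	opts = PQconninfoParse(connstr, NULL);
	if (opts == NULL)
		ereport(ERROR,
				(errcode(ERRCODE_SYNTAX_ERROR),
				 errmsg("invalid connection string syntax")));

	PG_TRY();
	{
		/* look for a non-empty password option */
		for (PQconninfoOption *opt = opts; opt->keyword != NULL; opt++)
		{
			if (strcmp(opt->keyword, "password") == 0 &&
				opt->val != NULL && opt->val[0] != '\0')
			{
				uses_password = true;
				break;
			}
		}

		if (!uses_password)
			ereport(ERROR,
					(errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
					 errmsg("password is required"),
					 errdetail("Non-superusers must provide a password in the connection string.")));
	}
	PG_FINALLY();
	{
		/* malloc'd by libpq, so it must be freed explicitly on every path */
		PQconninfoFree(opts);
	}
	PG_END_TRY();
}

The resowner idea discussed above would instead register the pointer for automatic cleanup at abort, which is what makes it attractive for more remote failure paths, at the cost Tom mentions.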
Re: Synchronizing slots from primary to standby
On Fri, Jan 12, 2024 at 12:07 PM Bertrand Drouvot wrote: > > On Fri, Jan 12, 2024 at 08:42:39AM +0530, Amit Kapila wrote: > > On Thu, Jan 11, 2024 at 9:11 PM Bertrand Drouvot > > wrote: > > > > > > I'm not sure I follow here. If the remote slot is re-created then it > > > would > > > also be dropped / re-created locally, or am I missing something? > > > > > > > As our slot-syncing mechanism is asynchronous (from time to time we > > check the slot information on the primary), isn't it possible that a > > slot with the same name is dropped and recreated between the slot-sync worker's > > checks? > > > > Yeah, I should have thought harder ;-) So for this case: if we > had an easy way to detect that a remote slot has been dropped/re-created, then I > think > we would also drop and re-create it on the standby too. > > If so, I think we should then update all the fields (that we're currently > updating > in the "create locally" case) when we detect that (at least) one of the > following differs: > > - dboid > - plugin > - two_phase > Right, I think that even if any of the restart/confirmed LSNs or xmin have changed, there is also no harm in simply copying all the fields from remote_slot, as done by Hou-San in the latest patch. > Maybe the "best" approach would be to have a way to detect that a slot has > been > re-created on the primary (but that would mean relying on more than the slot name > to "identify" a slot and probably adding a new member to the struct to do so). > Right, I also thought so, but I'm not sure that further complicating the slot machinery is worth it to detect this case explicitly. If we see any problem with the idea discussed, then we may need to think along those lines. -- With Regards, Amit Kapila.
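To make the comparison Bertrand lists a bit more concrete, a purely illustrative sketch is below (SlotIdentity and slot_needs_recreate are invented names, not structures from the patch): the local synced slot would be dropped and re-created only when one of the identity-like properties differs, while changes to restart/confirmed LSNs or xmin are simply copied over.

#include "postgres.h"

/* Invented, simplified representation of a slot's identity-like fields. */
typedef struct SlotIdentity
{
	Oid			dboid;					/* database the slot belongs to */
	char		plugin[NAMEDATALEN];	/* output plugin name */
	bool		two_phase;				/* two-phase decoding enabled? */
} SlotIdentity;

/*
 * Return true if the remote slot can no longer be the "same" slot as the
 * local copy, i.e. it was presumably dropped and re-created on the primary,
 * so the standby should drop and re-create its synced slot too.
 */
static bool
slot_needs_recreate(const SlotIdentity *remote, const SlotIdentity *local)
{
	return remote->dboid != local->dboid ||
		strcmp(remote->plugin, local->plugin) != 0 ||
		remote->two_phase != local->two_phase;
}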
Re: Recovering from detoast-related catcache invalidations
Great! That's exactly what we need. The patch LGTM, +1. Tom Lane wrote on Sat, Jan 13, 2024 at 04:47: > I wrote: > > This is uncomfortably much in bed with the tuple table slot code, > > perhaps, but I don't see a way to do it more cleanly unless we want > > to add some new provisions to that API. Andres, do you have any > > thoughts about that? > > Oh! After nosing around a bit more I remembered systable_recheck_tuple, > which is meant for exactly this purpose. So v4 attached. > > regards, tom lane
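For anyone following along, a rough sketch of the pattern systable_recheck_tuple() enables is below (fetch_and_recheck is an invented wrapper, not code from the v4 patch): after work that may outlast a catalog invalidation, such as detoasting, the tuple's liveness is rechecked before the result is trusted, and the caller retries if it is no longer live.

#include "postgres.h"

#include "access/genam.h"
#include "access/htup.h"

/*
 * Fetch the next tuple from a catalog scan, do potentially slow work on it
 * (detoasting elided here), then verify it is still live.  Returning NULL
 * tells the caller to retry from scratch.
 */
static HeapTuple
fetch_and_recheck(SysScanDesc scan)
{
	HeapTuple	tup = systable_getnext(scan);

	if (!HeapTupleIsValid(tup))
		return NULL;			/* no more tuples */

	/* ... detoast out-of-line fields here; invalidations may arrive ... */

	if (!systable_recheck_tuple(scan, tup))
		return NULL;			/* tuple died under us; caller must retry */

	return tup;
}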
Re: plpgsql memory leaks
Hi, On Fri, Jan 12, 2024 at 22:25, Tom Lane wrote: > Pavel Stehule writes: > > default master branch - res 190MB ram > > jit_inline_above_cost = -1 doesn't help > > disabling JIT doesn't help either, > > > so it looks like the hypothesis was wrong, and the problem is maybe > somewhere > > else :-/ > > I see no leak with these examples on HEAD, either with or without > --enable-llvm --- the process size stays quite stable according > to "top". I wonder if you are using some extension that's > contributing to the problem. > Memory info after DO $$ BEGIN END $$: (2024-01-13 05:36:46) postgres=# do $$ begin end $$; DO (2024-01-13 05:37:16) postgres=# select meminfo(); NOTICE: Total non-mmapped bytes (arena): 1114112 NOTICE: # of free chunks (ordblks):11 NOTICE: # of free fastbin blocks (smblks): 0 NOTICE: # of mapped regions (hblks): 2 NOTICE: Bytes in mapped regions (hblkhd): 401408 NOTICE: Max. total allocated space (usmblks): 0 NOTICE: Free bytes held in fastbins (fsmblks): 0 NOTICE: Total allocated space (uordblks): 1039216 NOTICE: Total free space (fordblks): 74896 NOTICE: Topmost releasable block (keepcost): 67360 After script execution: NOTICE: ("1165 kB","1603 kB","438 kB") NOTICE: Total non-mmapped bytes (arena): 22548480 NOTICE: # of free chunks (ordblks):25 NOTICE: # of free fastbin blocks (smblks): 0 NOTICE: # of mapped regions (hblks): 2 NOTICE: Bytes in mapped regions (hblkhd): 401408 NOTICE: Max. total allocated space (usmblks): 0 NOTICE: Free bytes held in fastbins (fsmblks): 0 NOTICE: Total allocated space (uordblks): 1400224 NOTICE: Total free space (fordblks): 21148256 NOTICE: Topmost releasable block (keepcost): 20908384 So the attached memory is 20MB - but it is almost all free. The sum of the memory contexts is very stable, without leaks (used 1165kB). But when I modify the script to CREATE OR REPLACE FUNCTION public.fx(iter integer) RETURNS void LANGUAGE plpgsql AS $function$ declare c cursor(m bigint) for select distinct i from generate_series(1, m) g(i); t bigint; s bigint; begin for i in 1..iter loop open c(m := i * 1); s := 0; loop fetch c into t; exit when not found; s := s + t; end loop; raise notice '===before close'; raise notice '%', (select (pg_size_pretty(sum(used_bytes)), pg_size_pretty(sum(total_bytes)), pg_size_pretty(sum(free_bytes))) from pg_get_backend_memory_contexts()); --perform meminfo(); raise notice '---after close'; close c; raise notice '%=%', i, s; raise notice '%', (select (pg_size_pretty(sum(used_bytes)), pg_size_pretty(sum(total_bytes)), pg_size_pretty(sum(free_bytes))) from pg_get_backend_memory_contexts()); --perform meminfo(); end loop; end; $function$ (meminfo is a simple extension - see the attachment), I got interesting things: NOTICE: ===before close NOTICE: ("149 MB","154 MB","5586 kB") NOTICE: Total non-mmapped bytes (arena): 132960256 NOTICE: # of free chunks (ordblks):49 NOTICE: # of free fastbin blocks (smblks): 0 NOTICE: # of mapped regions (hblks): 4 NOTICE: Bytes in mapped regions (hblkhd): 51265536 NOTICE: Max. total allocated space (usmblks): 0 NOTICE: Free bytes held in fastbins (fsmblks): 0 NOTICE: Total allocated space (uordblks): 110730576 NOTICE: Total free space (fordblks): 9680 NOTICE: Topmost releasable block (keepcost): 133008 So this script really used megabytes of memory, but it is related to the query `select distinct i from generate_series(1, m) g(i);`. This may be correlated with my default work_mem of 64MB - when I set work_mem to 10MB, it consumes only 15MB. So I was confused because it uses only about 3x work_mem, which is not too bad.
Regards, Pavel
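The meminfo extension itself is only attached above, not quoted in the thread; purely as a guess at how numbers like those can be produced, a minimal sketch based on glibc's mallinfo2() (glibc 2.33 or later) might look like this, with the SQL declaration reduced to a comment:

#include "postgres.h"

#include <malloc.h>

#include "fmgr.h"

PG_MODULE_MAGIC;

/* CREATE FUNCTION meminfo() RETURNS void AS 'MODULE_PATHNAME' LANGUAGE C; */
PG_FUNCTION_INFO_V1(meminfo);

Datum
meminfo(PG_FUNCTION_ARGS)
{
	struct mallinfo2 mi = mallinfo2();

	/* report selected glibc allocator statistics as NOTICEs */
	ereport(NOTICE, (errmsg("Total non-mmapped bytes (arena): %zu", mi.arena)));
	ereport(NOTICE, (errmsg("# of free chunks (ordblks): %zu", mi.ordblks)));
	ereport(NOTICE, (errmsg("Bytes in mapped regions (hblkhd): %zu", mi.hblkhd)));
	ereport(NOTICE, (errmsg("Total allocated space (uordblks): %zu", mi.uordblks)));
	ereport(NOTICE, (errmsg("Total free space (fordblks): %zu", mi.fordblks)));
	ereport(NOTICE, (errmsg("Topmost releasable block (keepcost): %zu", mi.keepcost)));

	PG_RETURN_VOID();
}

The interesting number in the output quoted above is fordblks: roughly 20MB stays attached to the process but sits free inside glibc's arena, which is why the resident size looks large even though the memory-context totals stay flat.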
Re: Synchronizing slots from primary to standby
On Fri, Jan 12, 2024 at 5:50 PM shveta malik wrote: > > There are multiple approaches discussed and tried when it comes to > starting a slot-sync worker. I am summarizing them all here: > > 1) Make the slotsync worker an auxiliary process (like checkpointer, > walwriter, walreceiver etc.). The benefit this approach provides is that it > can control startup and shutdown in a more flexible way, as each auxiliary > process could have different checks before starting and can have > different stop conditions. But it needs code duplication for process > management (start, stop, crash handling, signals etc.) and currently it > does not support a db-connection smoothly (none of the auxiliary processes > has one so far). > As the slotsync worker needs to perform transactions and access the syscache, we can't make it an auxiliary process, as that doesn't initialize the required stuff like the syscache. Also, see the comment "Auxiliary processes don't run transactions ..." in AuxiliaryProcessMain(), which means this is not an option. > > 2) Make the slotsync worker a 'special' process like AutoVacLauncher, > which is neither an auxiliary process nor a bgworker. It allows > a db-connection and also provides the flexibility to have start and stop > conditions for a process. > Yeah, due to these reasons, I think this option is worth considering, and another plus point is that this allows us to make enable_syncslot a PGC_SIGHUP GUC rather than a PGC_POSTMASTER one. > > 3) Make the slotsync worker a bgworker. Here we just need to register our > process as a bgworker (RegisterBackgroundWorker()) by providing a > relevant start_time and restart_time, and then the process management > is well taken care of. It does not need any code duplication and > allows a db-connection smoothly in the registered process. The only thing it > lacks is that it does not provide the flexibility of having a > start condition, which then forces us to have 'enable_syncslot' as a > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > feel enable_syncslot is something which will not be changed frequently, > and with the benefits provided by the bgworker infra, it seems a > reasonably good option to choose this approach. > I agree, but it may be better to make it a PGC_SIGHUP parameter. > 4) Another option is to have the Logical Replication Launcher (or a new > process) launch the slot-sync worker. But going by the current design, > where we have only 1 slotsync worker, it may be overhead to maintain an > additional manager process. > I don't see any good reason to have an additional launcher process here. > > Thus, weighing the pros and cons of all these options, we have currently > implemented the bgworker approach (approach 3). Any feedback is > welcome. > I vote to go for (2) unless we face difficulties in doing so, but (3) is also okay, especially if others also think so. -- With Regards, Amit Kapila.
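To show how little code approach (3) needs, a hypothetical registration sketch is below; register_slotsync_worker and slotsync_worker_main are invented names for illustration and the actual patch may differ. Because a static registration like this has to happen at postmaster startup, the GUC controlling it naturally becomes PGC_POSTMASTER, which is exactly the trade-off discussed above.

#include "postgres.h"

#include "postmaster/bgworker.h"

/*
 * Register a single slot-sync worker.  For a built-in worker this would be
 * called during postmaster startup; an extension would instead call it from
 * _PG_init() of a shared_preload_libraries module.
 */
void
register_slotsync_worker(void)
{
	BackgroundWorker bgw;

	memset(&bgw, 0, sizeof(bgw));
	snprintf(bgw.bgw_name, BGW_MAXLEN, "slotsync worker");
	snprintf(bgw.bgw_type, BGW_MAXLEN, "slotsync worker");
	/* needs shared memory and a database connection (syscache access) */
	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
	/* start once the standby has reached a consistent state */
	bgw.bgw_start_time = BgWorkerStart_ConsistentState;
	bgw.bgw_restart_time = 60;	/* restart 60s after a crash */
	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "slotsync_worker_main");
	bgw.bgw_main_arg = (Datum) 0;
	bgw.bgw_notify_pid = 0;

	RegisterBackgroundWorker(&bgw);
}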