Re: psql tests hangs

2023-05-12 Thread Kirk Wolak
On Fri, May 12, 2023 at 2:40 AM Pavel Stehule 
wrote:

> pá 12. 5. 2023 v 8:20 odesílatel Kirk Wolak  napsal:
>
>> On Fri, May 12, 2023 at 1:46 AM Pavel Stehule 
>> wrote:
>>
>>> pá 12. 5. 2023 v 6:50 odesílatel Kirk Wolak  napsal:
>>>
 On Fri, May 12, 2023 at 12:14 AM Tom Lane  wrote:

> Kirk Wolak  writes:
> > Did you try the print statement that Andrey asked Pavel to try?
> ...


>>> The strange thing is hanging. Broken tests depending on locale are
>>> usual. But I didn't remember hanging.
>>>
>>> Regards
>>>
>>> Pavel
>>>
>>
>> So, if you do psql -c "..."
>> with both of those \watch  instructions, do either one hang? (I am now
>> guessing "no")
>>
>> I know that perl is using a special library to "remote control psql"
>> (like a pseudo terminal, I guess).
>> [I had to abort some of the perl testing in Windows because that perl
>> library didn't work with my psql in Windows]
>>
>> Next, can you detect which process is hanging? (is it perl, the library,
>> psql, ?).
>>
>
> It hangs in perl
>
> but now I found there is a dependency on the PSQL_PAGER setting
>
> it started a pager in the background; I had a lot of zombie pspg processes
>
> Unfortunately, when I unset this variable, the test still hangs
>
> here is backtrace
>
> Missing separate debuginfos, use: dnf debuginfo-install
> perl-interpreter-5.36.1-496.fc38.x86_64
> (gdb) bt
> #0  0x7fbbc1129ade in select () from /lib64/libc.so.6
> #1  0x7fbbc137363b in Perl_pp_sselect () from /lib64/libperl.so.5.36
> #2  0x7fbbc1317958 in Perl_runops_standard () from
> /lib64/libperl.so.5.36
> #3  0x7fbbc128259d in perl_run () from /lib64/libperl.so.5.36
> #4  0x56392bd9034a in main ()
>
> It is probably waiting on a read from a pipe
>
> psql is alive too, and it is also waiting
>
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 0x7f071740bc37 in wait4 () from /lib64/libc.so.6
> Missing separate debuginfos, use: dnf debuginfo-install
> glibc-2.37-4.fc38.x86_64 ncurses-libs-6.4-3.20230114.fc38.x86_64
> readline-8.2-3.fc38.x86_64
> (gdb) bt
> #0  0x7f071740bc37 in wait4 () from /lib64/libc.so.6
> #1  0x7f07173a9a10 in _IO_proc_close@@GLIBC_2.2.5 () from
> /lib64/libc.so.6
> #2  0x7f07173b51e9 in __GI__IO_file_close_it () from /lib64/libc.so.6
> #3  0x7f07173a79fb in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
> #4  0x00406be4 in do_watch (query_buf=query_buf@entry=0x5ae540,
> sleep=sleep@entry=0.01, iter=0, iter@entry=3) at command.c:5348
> #5  0x004087a5 in exec_command_watch 
> (scan_state=scan_state@entry=0x5ae490,
> active_branch=active_branch@entry=true, query_buf=query_buf@entry=0x5ae540,
> previous_buf=previous_buf@entry=0x5ae560) at command.c:2875
> #6  0x0040d4ba in exec_command (previous_buf=0x5ae560,
> query_buf=0x5ae540, cstack=0x5ae520, scan_state=0x5ae490, cmd=0x5ae9a0
> "watch") at command.c:413
> #7  HandleSlashCmds (scan_state=scan_state@entry=0x5ae490,
> cstack=cstack@entry=0x5ae520, query_buf=0x5ae540, previous_buf=0x5ae560)
> at command.c:230
>
> I am not sure; it still doesn't work, but probably there are some
> dependencies on my settings
>
> PSQL_PAGER and PSQL_WATCH_PAGER
>
> so this test fails due to my settings
>
> [pavel@localhost postgresql.master]$ set |grep PSQL
> PSQL_PAGER='pspg -X'
> PSQL_WATCH_PAGER='pspg -X --stream'
>
> Regards
>
> Pavel
>
>
Ummm... We are testing psql \watch and you potentially have a
PSQL_WATCH_PAGER that is kicking in?
By chance, does that attempt to read/process/understand the \watch?
Also, if it is interfering with the stream, that would explain it.  The
perl library is trying to "control" psql.
If it ends up talking to your pager instead, all bets are off, imo.  I don't
know enough about PSQL_WATCH_PAGER.

Now I would be curious if you changed the test from
SELECT 1 \watch c=3 0.01

to
SELECT 1 \watch 0.01

because that should work.  Then I would test
SELECT 1 \watch 0.01 c=3

If you are trying to parse the \watch arguments at all, that could break.
Then your code might be trying to "complain", and that complaint is what
is screwing up the planned interaction (just guessing).

Kirk...


Re: Assert failure of the cross-check for nullingrels

2023-05-12 Thread Richard Guo
On Fri, Mar 17, 2023 at 11:05 AM Richard Guo  wrote:

> Here is an updated patch with comments and test case.  I also change the
> code to store 'group_clause_relids' directly in RestrictInfo.
>

BTW, I've added an open item for this issue.

Thanks
Richard


Re: psql tests hangs

2023-05-12 Thread Alvaro Herrera
On 2023-May-12, Pavel Stehule wrote:

> It hangs in perl

I wonder if "hanging" actually means that it interpreted the sleep time
as a very large integer, so it's just sleeping for a long time.

About the server locale, note that the ->new() call explicitly requests
the C locale -- it's only psql that is using the Czech locale.
Supposedly, the Perl code should also be using the Czech locale, so the
sprintf('%g') should be consistent with what psql \watch expects.
However, you cannot ask the server to be consistent with that -- say, if
you hypothetically tried to use "to_char(9D99)" and \gset that to use as
\watch argument, it wouldn't work, because that'd use the server's C
locale, not Czech.  (I know because I tried.)
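
For anyone who wants to see the effect in isolation: a minimal standalone C
sketch of how LC_NUMERIC changes the radix character that %g emits (it
assumes a cs_CZ.UTF-8 locale is installed), which is exactly the kind of
mismatch a \watch interval string can trip over.

#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* Under the C locale, %g renders the interval with a '.' radix. */
    setlocale(LC_NUMERIC, "C");
    printf("C locale:     %g\n", 0.01);     /* prints 0.01 */

    /* Under a Czech locale (if installed), the radix character is ','. */
    if (setlocale(LC_NUMERIC, "cs_CZ.UTF-8") != NULL)
        printf("Czech locale: %g\n", 0.01); /* prints 0,01 */

    return 0;
}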

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"




Re: Add standard collation UNICODE

2023-05-12 Thread Peter Eisentraut

On 08.05.23 17:48, Peter Eisentraut wrote:

On 27.04.23 13:44, Daniel Verite wrote:

This collation has an empty pg_collation.collversion column, instead
of being set to the same value as "und-x-icu" to track its version.


The original patch implements this as an INSERT in which it would be
easy to fix I guess, but in current HEAD it comes as an entry in
include/catalog/pg_collation.dat:

{ oid => '963',
  descr => 'sorts using the Unicode Collation Algorithm with default settings',
  collname => 'unicode', collprovider => 'i', collencoding => '-1',
  colliculocale => 'und' },

Should it be converted back into an INSERT or better left
in this file and collversion being updated afterwards?


How about we do it with an UPDATE command?  We already do this for 
pg_database in a similar way.  See attached patch.


This has been committed.




Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE

2023-05-12 Thread Peter Eisentraut

On 10.05.23 20:04, Andres Freund wrote:

This commit adds a test

is(scalar @lp_off, $ROWCOUNT, "acquired row offsets");

*before* that skip_all call.  This appears to be invalid.  If the skip_all
happens, you get a complaint like

t/004_verify_heapam.pl (Wstat: 0 Tests: 1 Failed: 0)
   Parse errors: Bad plan.  You planned 0 tests but ran 1.

We could move the is() test after all the skip_all's.  Any thoughts?


I think the easiest fix is to just die if we can't get the offsets - it's not
like we can really continue afterwards...


This should do it:

-is(scalar @lp_off, $ROWCOUNT, "acquired row offsets");
+scalar @lp_off == $ROWCOUNT or BAIL_OUT("row offset counts mismatch");

But I'm not sure what the latest thinking on BAIL_OUT is.  It is used 
nearby in a similar way though.






Re: psql tests hangs

2023-05-12 Thread Pavel Stehule
pá 12. 5. 2023 v 9:46 odesílatel Alvaro Herrera 
napsal:

> On 2023-May-12, Pavel Stehule wrote:
>
> > It hangs in perl
>
> I wonder if "hanging" actually means that it interpreted the sleep time
> as a very large integer, so it's just sleeping for a long time.
>

There is some interaction with pspg in stream mode

The probable scenario:

psql is starting pspg due to my PSQL_WATCH_PAGER setting. pspg is waiting
for a quit command, or for the pipe to end. The quit command cannot come
because it is not on a tty, so it is a deadlock.

I can add a safeguard to pspg so that it exits quickly when it is in
stream mode and no tty is available.
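
For concreteness, a minimal sketch of such a safeguard (illustration only,
not pspg's actual code); it assumes the check is simply whether a
controlling terminal can be opened for reading commands:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
check_stream_mode_has_tty(void)
{
    int     fd = open("/dev/tty", O_RDWR);

    if (fd < 0)
    {
        fprintf(stderr,
                "stream mode can be used only in interactive mode (tty is not available)\n");
        exit(EXIT_FAILURE);
    }
    close(fd);
}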

And generally, the root perl should reset the PSQL_WATCH_PAGER env
variable before executing psql. It probably already does so for
PSQL_PAGER, and maybe for PAGER.

Regards

Pavel


>
> About the server locale, note that the ->new() call explicitly requests
> the C locale -- it's only psql that is using the Czech locale.
> Supposedly, the Perl code should also be using the Czech locale, so the
> sprintf('%g') should be consistent with what psql \watch expects.
> However, you cannot ask the server to be consistent with that -- say, if
> you hypothetically tried to use "to_char(9D99)" and \gset that to use as
> \watch argument, it wouldn't work, because that'd use the server's C
> locale, not Czech.  (I know because I tried.)
>
> --
> Álvaro Herrera               Breisgau, Deutschland  —
> https://www.EnterpriseDB.com/
> "Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"
>


Re: Allow pg_archivecleanup to remove backup history files

2023-05-12 Thread torikoshia

On 2023-05-10 17:52, Bharath Rupireddy wrote:
Thanks for your comments!


Just curious to know the driving point behind this proposal - is
pg_archivecleanup deployed in production that was unable to clean up
the history files and there were many such history files left? It will
help us know how pg_archivecleanup is being used.


Yes.


Just curious to know the driving point behind this proposal - is
pg_archivecleanup deployed in production that was unable to clean up
the history files and there were many such history files left? It will
help us know how pg_archivecleanup is being used.

I'm wondering if making -x generic with '-x' '.backup', is simpler
than adding another option?


Since according to the current semantics, deleting backup history files 
with -x demands not '-x .backup' but '-x .007C9330.backup' when the file 
name is 0001123455CD.007C9330.backup, it needs special 
treatment for backup history files, right?


I think it would be more intuitive and easier to remember than a new option.

I was a little concerned about what to do when deleting both the files 
ending in .gz and backup history files.
Is making it possible to specify both "-x .backup" and "-x .gz" the way 
to go?


I was also concerned someone might add ".backup" to WAL files, but does 
that usually not happen?



Comments on the patch:
1. Why just only the backup history files? Why not remove the timeline
history files too? Is it because there may not be as many tli switches
happening as backups?


Yeah, do you think we should also add logic for '-x .history'?


2.+sub remove_backuphistoryfile_run_check
+{
Why to invent a new function when run_check() can be made generic with
few arguments passed?


Thanks, I'm going to reconsider it.

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION




Re: Orphaned files in base/[oid]

2023-05-12 Thread Alexey Gordeev
12.05.2023, 14:17, "Andres Freund" :

> Alternatively we could do something without marker files, with some
> added complexity: Keep track of all "uncommitted new files" in memory,
> and log them every checkpoint. Commit/abort records clear elements of
> that list. Since we always start replay at the beginning of a
> checkpoint, we'd always reach a moment with such an up2date list of
> pending-action files before reaching end-of-recovery. At end-of-recovery
> we can delete all unconfirmed files. To avoid out-of-memory due to too
> many tracked relations, we'd possibly still have to have marker files...

Hi, hackers.

I'm sorry, but I want to bump this thread, because there is still no
good solution to the problem. I see there are a few threads with
undo-based approaches, which look preferable but have some pitfalls.
Is there any chance to return to the non-undo approaches partially
discussed here? What do you think about the following solutions?

1) Make `pendingDeletes` shared and let the postmaster clean up all
garbage in case of a child process dying. Cons: Does not work in case
of the postmaster dying. Care must be taken about the validity of the
`pendingDeletes` pointers.
2) Catch and store all records carrying relfilenodes during WAL replay,
and delete all orphaned nodes at the end of replay. Cons: The final
delete may use an incomplete list of nodes, as there may have been
something before the latest checkpoint. The general opacity - we remove
something without a corresponding WAL record (or possibly do it in the
wrong place in general).
3) This way is close to the one I quoted and is a combination of the two
above. `pendingDeletes` is shared. Each checkpoint creates a WAL record
with a list of open transactions and created nodes. WAL replay can use
this list as a base, adding nodes to it from newer records. The final
delete operation has a complete list of orphaned nodes. Cons:
Complexity(?). Others(?).

Can it work? Are any of these approaches still relevant?

--
Regards,
Alex Go, C developer
g...@arenadata.io, www.arenadata.tech

RE: Non-superuser subscription owners

2023-05-12 Thread Zhijie Hou (Fujitsu)
On Tuesday, April 4, 2023 1:57 AM Robert Haas  wrote:
> 
> On Sat, Apr 1, 2023 at 12:00 PM Alexander Lakhin 
> wrote:
> > I've managed to reproduce it using the following script:
> > for ((i=1;i<=10;i++)); do
> > echo "iteration $i"
> > echo "
> > CREATE ROLE sub_user;
> > CREATE SUBSCRIPTION testsub CONNECTION 'dbname=db'
> >   PUBLICATION testpub WITH (connect = false); ALTER SUBSCRIPTION
> > testsub ENABLE; DROP SUBSCRIPTION testsub; SELECT pg_sleep(0.001);
> > DROP ROLE sub_user; " | psql psql -c "ALTER SUBSCRIPTION testsub
> > DISABLE;"
> > psql -c "ALTER SUBSCRIPTION testsub SET (slot_name = NONE);"
> > psql -c "DROP SUBSCRIPTION testsub;"
> > grep 'TRAP' server.log && break
> > done
> 
> After a bit of experimentation this repro worked for me -- I needed
> -DRELCACHE_FORCE_RELEASE as well, and a bigger iteration count. I verified
> that the patch fixed it, and committed the patch with the addition of a
> comment.

Thanks for pushing!

While testing this, I found a similar problem in table sync worker,
as we also invoke superuser_arg() in table sync worker which is not in a
transaction.

LogicalRepSyncTableStart
...
/* Is the use of a password mandatory? */
must_use_password = MySubscription->passwordrequired &&
!superuser_arg(MySubscription->owner);

#0  0x7f18bb55aaff in raise () from /lib64/libc.so.6
#1  0x7f18bb52dea5 in abort () from /lib64/libc.so.6
#2  0x00b69a22 in ExceptionalCondition (conditionName=0xda4338 
"IsTransactionState()", fileName=0xda403e "catcache.c", lineNumber=1208) at 
assert.c:66
#3  0x00b4842a in SearchCatCacheInternal (cache=0x27cab80, nkeys=1, 
v1=10, v2=0, v3=0, v4=0) at catcache.c:1208
#4  0x00b48329 in SearchCatCache1 (cache=0x27cab80, v1=10) at 
catcache.c:1162
#5  0x00b630c7 in SearchSysCache1 (cacheId=11, key1=10) at 
syscache.c:825
#6  0x00b982e3 in superuser_arg (roleid=10) at superuser.c:70

I can reproduce this via gdb following similar steps in [1].

I think we need to move this call into a transaction as well and here is an 
attempt
to do that.
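
Roughly, the shape of the fix is the following (a sketch of the idea only,
not the attached patch); the point is that the syscache lookup inside
superuser_arg() must run while a transaction is in progress:

    /* in LogicalRepSyncTableStart(), for example */
    StartTransactionCommand();
    must_use_password = MySubscription->passwordrequired &&
        !superuser_arg(MySubscription->owner);
    CommitTransactionCommand();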

[1] 
https://www.postgresql.org/message-id/OS0PR01MB5716E596E4FB83DE46F592FE948C9%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best Regards,
Hou zj


0001-Fix-possible-logical-replication-table-sync-crash.patch
Description:  0001-Fix-possible-logical-replication-table-sync-crash.patch


Re: Orphaned files in base/[oid]

2023-05-12 Thread Alexey Gordeev

On 14.08.2017 23:56, Andres Freund wrote:

Alternatively we could do something without marker files, with some
added complexity: Keep track of all "uncommitted new files" in memory,
and log them every checkpoint. Commit/abort records clear elements of
that list. Since we always start replay at the beginning of a
checkpoint, we'd always reach a moment with such an up2date list of
pending-action files before reaching end-of-recovery. At end-of-recovery
we can delete all unconfirmed files.  To avoid out-of-memory due to too
many tracked relations, we'd possibly still have to have marker files...



Hi, hackers.

I'm sorry, but I want to bump this thread, because there is still no
good solution to the problem. I see there are a few threads with
undo-based approaches, which look preferable but have some pitfalls.
Is there any chance to return to the non-undo approaches partially
discussed here? What do you think about the following solutions?

1) Make `pendingDeletes` shared and let the postmaster clean up all
garbage in case of a child process dying. Cons: Does not work in case
of the postmaster dying. Care must be taken about the validity of the
`pendingDeletes` pointers.
2) Catch and store all records carrying relfilenodes during WAL replay,
and delete all orphaned nodes at the end of replay. Cons: The final
delete may use an incomplete list of nodes, as there may have been
something before the latest checkpoint. The general opacity - we remove
something without a corresponding WAL record (or possibly do it in the
wrong place in general).
3) This way is close to the one I quoted and is a combination of the two
above. `pendingDeletes` is shared. Each checkpoint creates a WAL record
with a list of open transactions and created nodes. WAL replay can use
this list as a base, adding nodes to it from newer records. The final
delete operation has a complete list of orphaned nodes. Cons:
Complexity(?). Others(?).

Can it work? Are any of these approaches still relevant?

I'm sorry for the last html-formatted message. Our web-based app is too 
smart.


--
Regards,
Alex Go, C developer
g...@arenadata.io, www.arenadata.tech




Re: Large files for relations

2023-05-12 Thread Dagfinn Ilmari Mannsåker
Thomas Munro  writes:

> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski  wrote:
>> On Mon, May 1, 2023 at 9:29 PM Thomas Munro  wrote:
>>> I am not aware of any modern/non-historic filesystem[2] that can't do
>>> large files with ease.  Anyone know of anything to worry about on that
>>> front?
>>
>> There is some trouble in the ambiguity of what we mean by "modern" and
>> "large files". There are still a large number of users of ext4 where
>> the max file size is 16TB. Switching to a single large file per
>> relation would effectively cut the max table size in half for those
>> users. How would a user with say a 20TB table running on ext4 be
>> impacted by this change?
[…]
> A less aggressive version of the plan would be that we just keep the
> segment code for the foreseeable future with no planned cut off, and
> we make all of those "piggy back" transformations that I showed in the
> patch set optional.  For example, I had it so that CLUSTER would
> quietly convert your relation to large format, if it was still in
> segmented format (might as well if you're writing all the data out
> anyway, right?), but perhaps that could depend on a GUC.  Likewise for
> base backup.  Etc.  Then someone concerned about hitting the 16TB
> limit on ext4 could opt out.  Or something like that.  It seems funny
> though, that's exactly the user who should want this feature (they
> have 16,000 relation segment files).

If we're going to have to keep the segment code for the foreseeable
future anyway, could we not get most of the benefit by increasing the
segment size to something like 1TB?  The vast majority of tables would
fit in one file, and there would be less risk of hitting filesystem
limits.
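
For a sense of scale, a little arithmetic (today's 1 GB segment comes from
BLCKSZ * RELSEG_SIZE = 8192 * 131072, and configure's --with-segsize can
already change it at build time); the 20TB table is the example from
upthread, and the 1TB and 10GB sizes below are just for comparison:

#include <stdio.h>

int main(void)
{
    const double table_gb = 20.0 * 1024;    /* the 20TB example upthread */
    const double seg_gb[] = {1, 10, 1024};  /* 1GB default, 10GB, 1TB */

    for (int i = 0; i < 3; i++)
        printf("%4.0f GB segments -> %6.0f segment files for a 20TB table\n",
               seg_gb[i], table_gb / seg_gb[i]);
    return 0;
}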

- ilmari




Re: v16 regression - wrong query results with LEFT JOINs + join removal

2023-05-12 Thread Robert Haas
On Thu, May 11, 2023 at 4:16 PM Kirk Wolak  wrote:
> Forgive the noob question...  But does this trigger a regression test to be 
> created?
> And who tracks/pushes that?

Tom included one in the commit.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: running logical replication as the subscription owner

2023-05-12 Thread Ajin Cherian
On Fri, May 12, 2023 at 1:49 PM Amit Kapila  wrote:
>
> On Fri, May 12, 2023 at 9:10 AM Masahiko Sawada  wrote:
> >
> > On Fri, May 12, 2023 at 1:12 AM Robert Haas  wrote:
> > >
> > > On Thu, May 11, 2023 at 7:38 AM Amit Kapila  
> > > wrote:
> > > > Do we want the initial sync to also respect 'run_as_owner' option? I
> > > > might be missing something but I don't see anything in the docs about
> > > > initial sync interaction with this option. In the commit a2ab9c06ea,
> > > > we did the permission checking during the initial sync so I thought we
> > > > should do it here as well.
> > >
> > > It definitely should work that way. lf it doesn't, that's a bug.
> >
> > After some tests, it seems that the initial sync worker respects
> > 'run_as_owner' during catching up but not during COPYing.
> >
>
> Yeah, I was worried during copy phase only. During catchup, the code
> is common with apply worker code, so it will work.
>

I tried the following test:


Repeat On the publisher and subscriber:
 /* Create role regress_alice with  NOSUPERUSER on
   publisher and subscriber and a table for replication */

CREATE ROLE regress_alice NOSUPERUSER LOGIN;
CREATE ROLE regress_admin SUPERUSER LOGIN;
GRANT CREATE ON DATABASE postgres TO regress_alice;
SET SESSION AUTHORIZATION regress_alice;
CREATE SCHEMA alice;
GRANT USAGE ON SCHEMA alice TO regress_admin;
CREATE TABLE alice.test (i INTEGER);
ALTER TABLE alice.test REPLICA IDENTITY FULL;

On the publisher:
postgres=> insert into alice.test values(1);
postgres=> insert into alice.test values(2);
postgres=> insert into alice.test values(3);
postgres=> CREATE PUBLICATION alice FOR TABLE alice.test
WITH (publish_via_partition_root = true);

On the subscriber: /* create table admin_audit which regress_alice
does not have access to */
SET SESSION AUTHORIZATION regress_admin;
create table admin_audit (i integer);

On the subscriber: /* Create a trigger for table alice.test which
inserts on table admin_audit which the table owner of alice.test does
not have access to */
SET SESSION AUTHORIZATION regress_alice;
CREATE OR REPLACE FUNCTION alice.alice_audit()
RETURNS trigger AS
$$
BEGIN
insert into public.admin_audit values(2);
RETURN NEW;
END;
$$
LANGUAGE 'plpgsql';
create trigger test_alice after insert on alice.test for each row
execute procedure alice.alice_audit();
alter table alice.test enable always trigger test_alice;

On the subscriber: /* Create a subscription with run_as_owner = false */
CREATE SUBSCRIPTION admin_sub CONNECTION 'dbname=postgres
host=localhost port=6972' PUBLICATION alice WITH (run_as_owner =
false);
===

What I see is that as part of tablesync, the trigger is invoked and
updates admin_audit, which it shouldn't, as the table owner
of alice.test should not have access to the
table admin_audit. This means the table copy is being run as the
subscription owner and not the table owner.

However, I see subsequent inserts fail on replication with a
permission denied error, so the apply worker correctly
applies the inserts as the table owner.

If nobody else is working on this, I can come up with a patch to fix this.

regards,
Ajin Cherian
Fujitsu Australia




Re: walsender performance regression due to logical decoding on standby changes

2023-05-12 Thread Bharath Rupireddy
On Wed, May 10, 2023 at 3:23 PM Drouvot, Bertrand
 wrote:
>
> >> My current guess is that mis-using a condition variable is the best bet. I
> >> think it should work to use ConditionVariablePrepareToSleep() before a
> >> WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use
> >> ConditionVariableSleep(). The latch set from ConditionVariableBroadcast()
> >> would still cause the necessary wakeup.
> >
> > How about something like the attached? Recovery and subscription tests
> > don't complain with the patch.
>
> I launched a full Cirrus CI test with it but it failed on one environment 
> (did not look in details,
> just sharing this here): https://cirrus-ci.com/task/6570140767092736

Yeah, v1 had ConditionVariableInit() such that the CV was getting
initialized for every backend as opposed to just once after the WAL
sender shmem was created.

> Also I have a few comments:

Indeed, v1 was a WIP patch. Please have a look at the attached v2
patch, which has comments and passing CI runs on all platforms -
https://github.com/BRupireddy/postgres/tree/optimize_walsender_wakeup_logic_v2.

On Wed, May 10, 2023 at 3:41 PM Zhijie Hou (Fujitsu)
 wrote:
>
> if (AllowCascadeReplication())
> -   WalSndWakeup(switchedTLI, true);
> +   ConditionVariableBroadcast(&WalSndCtl->cv);
>
> After the change, we wakeup physical walsender regardless of switchedTLI flag.
> Is this intentional ? if so, I think It would be better to update the 
> comments above this.

That's not the case with the attached v2 patch. Please have a look.

On Thu, May 11, 2023 at 10:27 AM Masahiko Sawada  wrote:
>
> We can have two condition variables for
> logical and physical walsenders, and selectively wake up walsenders
> sleeping on the condition variables. It should work, it seems like
> much of a hack, though.

Andres rightly put it - 'mis-using' the CV infrastructure. It is simple,
it works, and it makes WalSndWakeup() easy while solving the performance
regression.
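
In outline, the scheme looks like this (a sketch using the CV field names
from the attached v2 patch and the surrounding variables from WalSndLoop(),
not the full diff):

    /*
     * Wait side, in the walsender's sleep path: register on the CV, but
     * keep waiting via WaitEventSetWait() inside WalSndWait(), since the
     * CV machinery doesn't know about the FeBe socket events.
     */
    ConditionVariablePrepareToSleep(&WalSndCtl->physicalWALSndCV); /* or logicalWALSndCV */
    WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
    ConditionVariableCancelSleep();

    /*
     * Wake side, in WalSndWakeup(): broadcast on the relevant CV instead
     * of scanning every walsender slot under its spinlock; the broadcast's
     * SetLatch() pops the waiters out of WaitEventSetWait().
     */
    if (physical)
        ConditionVariableBroadcast(&WalSndCtl->physicalWALSndCV);
    if (logical)
        ConditionVariableBroadcast(&WalSndCtl->logicalWALSndCV);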

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 02c4770964737d1e7a3c7d579080bc6c5bc01fdc Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 12 May 2023 11:30:14 +
Subject: [PATCH v2] Optimize walsender wake up logic with Conditional
 Variables

WalSndWakeup() currently loops through all the walsenders slots,
with a spinlock acquisition and release for every iteration, just
to wake up only the waiting walsenders. WalSndWakeup() gets
costlier even when there's a single walsender available on the
standby (i.e., a cascading replica or a logical decoding client).

This wasn't a problem before e101dfac3a53c. But, for allowing
logical decoding on standby, we needed to wake up logical
walsenders after every WAL record is applied on the standby. This
really made WalSndWakeup() costly, causing performance regression.

To solve this, we use condition variable (CV) to efficiently wake
up walsenders in WalSndWakeup().

Every walsender prepares to sleep on a shared memory CV. Note that
it just prepares to sleep on the CV (i.e., adds itself to the CV's
waitlist), but not actually waits on the CV (IOW, it never calls
ConditionVariableSleep()). It still uses WaitEventSetWait() for
waiting, because CV infrastructure doesn't handle FeBe socket
events currently. The processes (startup process, walreceiver
etc.) wanting to wake up walsenders use
ConditionVariableBroadcast(), which in turn calls SetLatch(),
helping walsenders come out of WaitEventSetWait().

This approach is simple and efficient because it makes
WalSndWakeup() life easy.

When available, WaitEventSetWait() can be replaced with its CV's
counterpart.

And, we use separate shared memory CVs for physical and logical
walsenders for selective wake ups, see WalSndWakeup() for more
details.

Reported-by: Andres Freund
Suggested-by: Andres Freund
Author: Bharath Rupireddy
Reviewed-by: Drouvot, Bertrand
Reviewed-by: Zhijie Hou
Discussion: https://www.postgresql.org/message-id/20230509190247.3rrplhdgem6su6cg%40awork3.anarazel.de
---
 src/backend/replication/walsender.c | 72 ++---
 src/include/replication/walsender_private.h |  7 ++
 2 files changed, 55 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 45b8b3684f..dc4d376e7c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3309,6 +3309,9 @@ WalSndShmemInit(void)
 
 			SpinLockInit(&walsnd->mutex);
 		}
+
+		ConditionVariableInit(&WalSndCtl->physicalWALSndCV);
+		ConditionVariableInit(&WalSndCtl->logicalWALSndCV);
 	}
 }
 
@@ -3330,31 +,17 @@ WalSndShmemInit(void)
 void
 WalSndWakeup(bool physical, bool logical)
 {
-	int			i;
-
-	for (i = 0; i < max_wal_senders; i++)
-	{
-		Latch	   *latch;
-		ReplicationKind kind;
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[i];
-
-		/*
-		 * Get latch pointer with spinlock held, for the unlikely ca

Re: psql tests hangs

2023-05-12 Thread Pavel Stehule
pá 12. 5. 2023 v 10:31 odesílatel Pavel Stehule 
napsal:

>
>
> pá 12. 5. 2023 v 9:46 odesílatel Alvaro Herrera 
> napsal:
>
>> On 2023-May-12, Pavel Stehule wrote:
>>
>> > It hangs in perl
>>
>> I wonder if "hanging" actually means that it interpreted the sleep time
>> as a very large integer, so it's just sleeping for a long time.
>>
>
> There is some interaction with pspg in stream mode
>
> The probable scenario:
>
> psql is starting pspg due to my PSQL_WATCH_PAGER setting. pspg is waiting
> for a quit command, or for the pipe to end. The quit command cannot come
> because it is not on a tty, so it is a deadlock.
>
> I can add a safeguard to pspg so that it exits quickly when it is in
> stream mode and no tty is available.
>
> And generally, the root perl should reset the PSQL_WATCH_PAGER env
> variable before executing psql. It probably already does so for
> PSQL_PAGER, and maybe for PAGER.
>

With the last change in pspg, this test fails as "expected":

aster/src/bin/psql/../../../src/test/regress/pg_regress' /usr/bin/prove -I
../../../src/test/perl/ -I .  t/*.pl
# +++ tap check in src/bin/psql +++
t/001_basic.pl ... 59/?
#   Failed test '\watch with 3 iterations: no stderr'
#   at t/001_basic.pl line 356.
#  got: 'stream mode can be used only in interactive mode (tty is
not available)'
# expected: ''

#   Failed test '\watch with 3 iterations: matches'
#   at t/001_basic.pl line 356.
#   ''
# doesn't match '(?^l:1\n1\n1)'
# Looks like you failed 2 tests of 80.
t/001_basic.pl ... Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/80 subtests
t/010_tab_completion.pl .. ok
t/020_cancel.pl .. ok

Test Summary Report
---
t/001_basic.pl (Wstat: 512 (exited 2) Tests: 80 Failed: 2)
  Failed tests:  69-70
  Non-zero exit status: 2
Files=3, Tests=169,  7 wallclock secs ( 0.16 usr  0.03 sys +  3.31 cusr
 1.31 csys =  4.81 CPU)
Result: FAIL
make: *** [Makefile:87: check] Chyba 1

Regards

Pavel

>
> Regards
>
> Pavel
>
>
>>
>> About the server locale, note that the ->new() call explicitly requests
>> the C locale -- it's only psql that is using the Czech locale.
>> Supposedly, the Perl code should also be using the Czech locale, so the
>> sprintf('%g') should be consistent with what psql \watch expects.
>> However, you cannot ask the server to be consistent with that -- say, if
>> you hypothetically tried to use "to_char(9D99)" and \gset that to use as
>> \watch argument, it wouldn't work, because that'd use the server's C
>> locale, not Czech.  (I know because I tried.)
>>
>> --
>> Álvaro Herrera               Breisgau, Deutschland  —
>> https://www.EnterpriseDB.com/
>> "Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"
>>
>


Re: psql tests hangs

2023-05-12 Thread Tom Lane
Pavel Stehule  writes:
> And generally, the root perl should reset the PSQL_WATCH_PAGER env
> variable before executing psql. It probably already does so for
> PSQL_PAGER, and maybe for PAGER.

Oh!  AFAICS, we don't do any of those things, but I agree it seems like
a good idea.  Can you confirm that if you unset PSQL_WATCH_PAGER then
the test passes for you?

regards, tom lane




Re: Large files for relations

2023-05-12 Thread Jim Mlodgenski
On Thu, May 11, 2023 at 7:38 PM Thomas Munro  wrote:

> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski  wrote:
> > On Mon, May 1, 2023 at 9:29 PM Thomas Munro 
> wrote:
> >> I am not aware of any modern/non-historic filesystem[2] that can't do
> >> large files with ease.  Anyone know of anything to worry about on that
> >> front?
> >
> > There is some trouble in the ambiguity of what we mean by "modern" and
> "large files". There are still a large number of users of ext4 where the
> max file size is 16TB. Switching to a single large file per relation would
> effectively cut the max table size in half for those users. How would a
> user with say a 20TB table running on ext4 be impacted by this change?
>
> Hrmph.  Yeah, that might be a bit of a problem.  I see it discussed in
> various places that MySQL/InnoDB can't have tables bigger than 16TB on
> ext4 because of this, when it's in its default one-file-per-object
> mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
> mode like DB2, Oracle etc, in which case I think you can have multiple
> 16TB segment files and get past that ext4 limit).  It's frustrating
> because 16TB is still really, really big and you probably should be
> using partitions, or more partitions, to avoid all kinds of other
> scalability problems at that size.  But however hypothetical the
> scenario might be, it should work,
>

Agreed, it is frustrating, but it is not hypothetical. I have seen a number
of users having single tables larger than 16TB who don't use partitioning
because of the limitations we have today. The most common reason is needing
multiple unique constraints on the table that don't include the partition
key, something like a user_id and email. There are workarounds for those
cases, but usually it's easier to deal with a single large table than to
deal with the sharp edges those workarounds introduce.


Re: Large files for relations

2023-05-12 Thread Stephen Frost
Greetings,

* Dagfinn Ilmari Mannsåker (ilm...@ilmari.org) wrote:
> Thomas Munro  writes:
> > On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski  wrote:
> >> On Mon, May 1, 2023 at 9:29 PM Thomas Munro  wrote:
> >>> I am not aware of any modern/non-historic filesystem[2] that can't do
> >>> large files with ease.  Anyone know of anything to worry about on that
> >>> front?
> >>
> >> There is some trouble in the ambiguity of what we mean by "modern" and
> >> "large files". There are still a large number of users of ext4 where
> >> the max file size is 16TB. Switching to a single large file per
> >> relation would effectively cut the max table size in half for those
> >> users. How would a user with say a 20TB table running on ext4 be
> >> impacted by this change?
> […]
> > A less aggressive version of the plan would be that we just keep the
> > segment code for the foreseeable future with no planned cut off, and
> > we make all of those "piggy back" transformations that I showed in the
> > patch set optional.  For example, I had it so that CLUSTER would
> > quietly convert your relation to large format, if it was still in
> > segmented format (might as well if you're writing all the data out
> > anyway, right?), but perhaps that could depend on a GUC.  Likewise for
> > base backup.  Etc.  Then someone concerned about hitting the 16TB
> > limit on ext4 could opt out.  Or something like that.  It seems funny
> > though, that's exactly the user who should want this feature (they
> > have 16,000 relation segment files).
> 
> If we're going to have to keep the segment code for the foreseeable
> future anyway, could we not get most of the benefit by increasing the
> segment size to something like 1TB?  The vast majority of tables would
> fit in one file, and there would be less risk of hitting filesystem
> limits.

While I tend to agree that 1GB is too small, 1TB seems like it's
possibly going to end up on the too big side of things, or at least,
if we aren't getting rid of the segment code then it's possibly throwing
away the benefits we have from the smaller segments without really
giving us all that much.  Going from 1G to 10G would reduce the number
of open file descriptors by quite a lot without having much of a net
change on other things.  50G or 100G would reduce the FD handles further
but starts to make us lose out a bit more on some of the nice parts of
having multiple segments.

Just some thoughts.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: createuser --memeber and PG 16

2023-05-12 Thread Peter Eisentraut

On 11.05.23 16:07, Robert Haas wrote:

On Wed, May 10, 2023 at 1:33 PM Bruce Momjian  wrote:

This seems like it will be forever confusing to people.  I frankly don't
know why --role matching CREATE ROLE ... ROLE IN was not already
confusing in pre-PG 16.  Any new ideas for improvement?


Yeah, it's a bad situation. I think --role is basically misnamed.
Something like --add-to-group would have been clearer, but that also
has the problem of being inconsistent with the SQL command. The whole
ROLE vs. IN ROLE thing is inherently quite confusing, I think. It's
very easy to get confused about which direction the membership arrows
are pointing.


It's hard to tell that for the --member option as well.  For

createuser foo --member bar

it's not intuitive whether foo becomes a member of bar or bar becomes a 
member of foo.  Maybe something more verbose like --member-of would help?






Re: psql tests hangs

2023-05-12 Thread Pavel Stehule
pá 12. 5. 2023 v 15:26 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > And generally, the root perl should reset the PSQL_WATCH_PAGER env
> > variable before executing psql. It probably already does so for
> > PSQL_PAGER, and maybe for PAGER.
>
> Oh!  AFAICS, we don't do any of those things, but I agree it seems like
> a good idea.  Can you confirm that if you unset PSQL_WATCH_PAGER then
> the test passes for you?
>

yes, I tested it now, and unset  PSQL_WATCH_PAGER fixed this issue.

Regards

Pavel


> regards, tom lane
>


[PATCH] Slight improvement of worker_spi.c example

2023-05-12 Thread Aleksander Alekseev
Hi,

Currently the example uses the following order of calls:

StartTransactionCommand();
SPI_connect();
PushActiveSnapshot(...);

...

SPI_finish();
PopActiveSnapshot();
CommitTransactionCommand();

This could be somewhat misleading. Typically one expects something to be freed
in a reverse order compared to initialization. This creates a false impression
that PushActiveSnapshot(...) _should_ be called after SPI_connect().

The patch changes the order to:

StartTransactionCommand();
PushActiveSnapshot(...);
SPI_connect();

...

SPI_finish();
PopActiveSnapshot();
CommitTransactionCommand();

... and also clarifies that the order of PushActiveSnapshot(...) and
SPI_connect() is not important.

--
Best regards,
Aleksander Alekseev
From 9f20a508c3b52d94455d67e3bdf7872787c63255 Mon Sep 17 00:00:00 2001
From: Aleksander Alekseev 
Date: Fri, 12 May 2023 17:15:04 +0300
Subject: [PATCH v1] Slight improvement of worker_spi.c example

Previously the example used the following order of calls:

	StartTransactionCommand();
	SPI_connect();
	PushActiveSnapshot(...);

	...

	SPI_finish();
	PopActiveSnapshot();
	CommitTransactionCommand();

This could be somewhat misleading. Typically one expects something to be freed
in a reverse order comparing to initialization. This creates a false impression
that PushActiveSnapshot(...) _should_ be called after SPI_connect().

The patch changes the order to:

 	StartTransactionCommand();
	PushActiveSnapshot(...);
	SPI_connect();

	...

	SPI_finish();
	PopActiveSnapshot();
	CommitTransactionCommand();

... and also clarifies that the order of PushActiveSnapshot(...) and
SPI_connect() is not important.

Aleksander Alekseev, reviewed by TODO FIXME
Discussion: TODO FIXME
---
 src/test/modules/worker_spi/worker_spi.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index ad491d7722..8045bb3546 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -74,8 +74,8 @@ initialize_worker_spi(worktable *table)
 
 	SetCurrentStatementStartTimestamp();
 	StartTransactionCommand();
-	SPI_connect();
 	PushActiveSnapshot(GetTransactionSnapshot());
+	SPI_connect();
 	pgstat_report_activity(STATE_RUNNING, "initializing worker_spi schema");
 
 	/* XXX could we use CREATE SCHEMA IF NOT EXISTS? */
@@ -222,17 +222,19 @@ worker_spi_main(Datum main_arg)
 		 * preceded by SetCurrentStatementStartTimestamp(), so that statement
 		 * start time is always up to date.
 		 *
-		 * The SPI_connect() call lets us run queries through the SPI manager,
-		 * and the PushActiveSnapshot() call creates an "active" snapshot
+		 * The PushActiveSnapshot() call creates an "active" snapshot,
 		 * which is necessary for queries to have MVCC data to work on.
+		 * The SPI_connect() call lets us run queries through the SPI manager.
+		 * The order of PushActiveSnapshot() and SPI_connect() is not really
+		 * important.
 		 *
 		 * The pgstat_report_activity() call makes our activity visible
 		 * through the pgstat views.
 		 */
 		SetCurrentStatementStartTimestamp();
 		StartTransactionCommand();
-		SPI_connect();
 		PushActiveSnapshot(GetTransactionSnapshot());
+		SPI_connect();
 		debug_query_string = buf.data;
 		pgstat_report_activity(STATE_RUNNING, buf.data);
 
-- 
2.40.1



Re: psql tests hangs

2023-05-12 Thread Tom Lane
Pavel Stehule  writes:
> pá 12. 5. 2023 v 15:26 odesílatel Tom Lane  napsal:
>> Oh!  AFAICS, we don't do any of those things, but I agree it seems like
>> a good idea.  Can you confirm that if you unset PSQL_WATCH_PAGER then
>> the test passes for you?

> yes, I tested it now, and unset  PSQL_WATCH_PAGER fixed this issue.

OK.  So after looking at this a bit, the reason PAGER and PSQL_PAGER
don't cause us any problems in the test environment is that they are
not honored unless isatty(fileno(stdin)) && isatty(fileno(stdout)).
It seems to me that it's a bug that there is no such check before
using PSQL_WATCH_PAGER.  Is there actually any defensible reason
for that?
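
The check being contemplated would look something like this where psql
picks up the watch pager (just a sketch of the idea, not a patch):

    const char *pagerprog = NULL;

    /* honor PSQL_WATCH_PAGER only when psql is talking to a terminal */
    if (isatty(fileno(stdin)) && isatty(fileno(stdout)))
        pagerprog = getenv("PSQL_WATCH_PAGER");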

I think we do need to clear out all three variables in
Cluster::interactive_psql.  But our regular psql tests shouldn't
be at risk here.

regards, tom lane




Re: psql tests hangs

2023-05-12 Thread Pavel Stehule
pá 12. 5. 2023 v 17:50 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > pá 12. 5. 2023 v 15:26 odesílatel Tom Lane  napsal:
> >> Oh!  AFAICS, we don't do any of those things, but I agree it seems like
> >> a good idea.  Can you confirm that if you unset PSQL_WATCH_PAGER then
> >> the test passes for you?
>
> > yes, I tested it now, and unset  PSQL_WATCH_PAGER fixed this issue.
>
> OK.  So after looking at this a bit, the reason PAGER and PSQL_PAGER
> don't cause us any problems in the test environment is that they are
> not honored unless isatty(fileno(stdin)) && isatty(fileno(stdout)).
> It seems to me that it's a bug that there is no such check before
> using PSQL_WATCH_PAGER.  Is there actually any defensible reason
> for that?
>

Theoretically, we can write tests for these features, and then stdout,
stdin may not be tty.

Except for testing, using pager in non-interactive mode makes no sense.

Regards

Pavel



> I think we do need to clear out all three variables in
> Cluster::interactive_psql.  But our regular psql tests shouldn't
> be at risk here.
>
> regards, tom lane
>


Re: Large files for relations

2023-05-12 Thread MARK CALLAGHAN
Repeating what was mentioned on Twitter, because I had some experience with
the topic. With fewer files per table there will be more contention on the
per-inode mutex (which might now be the per-inode rwsem). I haven't read
filesystem source in a long time. Back in the day, and perhaps today, it
was locked for the duration of a write to storage (locked within the
kernel) and was briefly locked while setting up a read.

The workaround for writes was one of:
1) enable disk write cache or use battery-backed HW RAID to make writes
faster (yes disks, I encountered this prior to 2010)
2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't
locked for the duration of a write

I have a vague memory that filesystems have improved in this regard.


On Thu, May 11, 2023 at 4:38 PM Thomas Munro  wrote:

> On Fri, May 12, 2023 at 8:16 AM Jim Mlodgenski  wrote:
> > On Mon, May 1, 2023 at 9:29 PM Thomas Munro 
> wrote:
> >> I am not aware of any modern/non-historic filesystem[2] that can't do
> >> large files with ease.  Anyone know of anything to worry about on that
> >> front?
> >
> > There is some trouble in the ambiguity of what we mean by "modern" and
> "large files". There are still a large number of users of ext4 where the
> max file size is 16TB. Switching to a single large file per relation would
> effectively cut the max table size in half for those users. How would a
> user with say a 20TB table running on ext4 be impacted by this change?
>
> Hrmph.  Yeah, that might be a bit of a problem.  I see it discussed in
> various places that MySQL/InnoDB can't have tables bigger than 16TB on
> ext4 because of this, when it's in its default one-file-per-object
> mode (as opposed to its big-tablespace-files-to-hold-all-the-objects
> mode like DB2, Oracle etc, in which case I think you can have multiple
> 16TB segment files and get past that ext4 limit).  It's frustrating
> because 16TB is still really, really big and you probably should be
> using partitions, or more partitions, to avoid all kinds of other
> scalability problems at that size.  But however hypothetical the
> scenario might be, it should work, and this is certainly a plausible
> argument against the "aggressive" plan described above with the hard
> cut-off where we get to drop the segmented mode.
>
> Concretely, a 20TB pg_upgrade in copy mode would fail while trying to
> concatenate with the above patches, so you'd have to use link or
> reflink mode (you'd probably want to use that anyway unless due to
> sheer volume of data to copy otherwise, since ext4 is also not capable
> of block-range sharing), but then you'd be out of luck after N future
> major releases, according to that plan where we start deleting the
> code, so you'd need to organise some smaller partitions before that
> time comes.  Or pg_upgrade to a target on xfs etc.  I wonder if a
> future version of extN will increase its max file size.
>
> A less aggressive version of the plan would be that we just keep the
> segment code for the foreseeable future with no planned cut off, and
> we make all of those "piggy back" transformations that I showed in the
> patch set optional.  For example, I had it so that CLUSTER would
> quietly convert your relation to large format, if it was still in
> segmented format (might as well if you're writing all the data out
> anyway, right?), but perhaps that could depend on a GUC.  Likewise for
> base backup.  Etc.  Then someone concerned about hitting the 16TB
> limit on ext4 could opt out.  Or something like that.  It seems funny
> though, that's exactly the user who should want this feature (they
> have 16,000 relation segment files).
>
>
>

-- 
Mark Callaghan
mdcal...@gmail.com


Re: psql tests hangs

2023-05-12 Thread Tom Lane
Pavel Stehule  writes:
> pá 12. 5. 2023 v 17:50 odesílatel Tom Lane  napsal:
>> OK.  So after looking at this a bit, the reason PAGER and PSQL_PAGER
>> don't cause us any problems in the test environment is that they are
>> not honored unless isatty(fileno(stdin)) && isatty(fileno(stdout)).
>> It seems to me that it's a bug that there is no such check before
>> using PSQL_WATCH_PAGER.  Is there actually any defensible reason
>> for that?

> Theoretically, we can write tests for these features, and then stdout,
> stdin may not be tty.

Well, you'd test using pty's, so that psql thinks it's talking to a
terminal.  That's what we're doing now to test tab completion,
for example.

> Except for testing, using pager in non-interactive mode makes no sense.

Agreed.  Let's solve this by inserting isatty tests in psql, rather
than hacking the test environment.

regards, tom lane




Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing

2023-05-12 Thread Ryan Booz
Thanks for the continued work, Peter. I hate to be the guy that starts this way,
but this is my first ever response on pgsql-hackers. (insert awkward
smile face).
Hopefully I've followed etiquette well, but please forgive any
missteps, and I'm
happy for any help in making better contributions in the future.

On Thu, May 11, 2023 at 9:19 PM Peter Geoghegan  wrote:
>
> On Thu, May 4, 2023 at 3:18 PM samay sharma  wrote:
> > What do you think about the term "Exhaustion"? Maybe something like "XID 
> > allocation exhaustion" or "Exhaustion of allocatable XIDs"?
>
> I use the term "transaction ID exhaustion" in the attached revision,
> v4. Overall, v4 builds on the work that went into v2 and v3, by
> continuing to polish the overhaul of everything related to freezing,
> relfrozenxid advancement, and anti-wraparound autovacuum.

Just to say on the outset, as has been said earlier in the tread by others,
that this is herculean work. Thank you for putting the effort you have thus far.
There's a lot of good from where I sit in the modification efforts.
It's a heavy,
dense topic, so there's probably never going to be a perfect way to
get it all in,
but some of the context early on, especially, is helpful for framing.

>
> It would be nice if it was possible to add an animation/diagram a
> little like this one: https://tuple-freezing-demo.angusd.com (this is
> how I tend to think about the "transaction ID space".)

Indeed. With volunteer docs, illustrations/diagrams are hard for sure. But,
this or something akin to the "clock" image I've seen elsewhere when
describing the transaction ID space would probably be helpful if it were ever
possible. In fact, there's just a lot about the MVCC stuff in general that
would benefit from diagrams. But alas, I guess that's why we have some
good go-to community talks/slide decks. :-)

> v4 also limits use of the term "wraparound" to places that directly
> discuss anti-wraparound autovacuums (plus one place in xact.sgml,
> where discussion of "true unsigned integer wraparound" and related
> implementation details has been moved). Otherwise we use the term
> "transaction ID exhaustion", which is pretty much the user-facing name
> for "xidStopLimit". I feel that this is a huge improvement, for the
> reason given to Greg earlier. I'm flexible on the details, but I feel
> strongly that we should minimize use of the term wraparound wherever
> it might have the connotation of "the past becoming the future". This
> is not a case of inventing a new terminology for its own sake. If
> anybody is skeptical I ask that they take a look at what I came up
> with before declaring it a bad idea. I have made that as easy as
> possible, by once again attaching a prebuilt routine-vacuuming.html.

Thanks again for doing this. Really helpful for doc newbies like me that
want to help but are still working through the process. Really helpful
and appreciated.

>
>
> Other changes in v4, compared to v3:
>
> * Improved discussion of the differences between non-aggressive and
> aggressive VACUUM.

This was helpful for me and not something I've previously put much thought
into. Helpful context that is missing from the current docs.

> * Explains "catch-up freezing" performed by aggressive VACUUMs directly.
>
> "Catch-up" freezing is the really important "consequence" -- something
> that emerges from how each type of VACUUM behaves over time. It is an
> indirect consequence of the behaviors. I would like to counter the
> perception that some users have about freezing only happening during
> aggressive VACUUMs (or anti-wraparound autovacuums). But more than
> that, talking about catch-up freezing seems essential because it is
> the single most important difference.
>

Similarly, this was helpful overall context of various things
happening with freezing.

> * Much improved handling of the discussion of anti-wraparound
> autovacuum, and how it relates to aggressive VACUUMs, following
> feedback from Samay.
>
> There is now only fairly minimal overlap in the discussion of
> aggressive VACUUM and anti-wraparound autovacuuming. We finish the
> discussion of aggressive VACUUM just after we start discussing
> anti-wraparound autovacuum. This transition works well, because it
> enforces the idea that anti-wraparound autovacuum isn't really special
> compared to any other aggressive autovacuum. This was something that
> Samay expressed particularly concern about: making anti-wraparound
> autovacuums sound less scary. Though it's also a concern I had from
> the outset, based on practical experience and interactions with people
> that have much less knowledge of Postgres than I do.

Agree. This flows fairly well and helps the user understand that each
"next step"
in the vacuum/freezing process has a distinct job based on previous work.

>
> * Anti-wraparound autovacuum is now mostly discussed as something that
> happens to static or mostly-static tables
> ...This moves discussion of anti-wraparound av in the direction 

Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing

2023-05-12 Thread Ryan Booz
And, of course, I forgot that I switch to text-mode after writing most
of this, so the carriage returns were unnecessary. (facepalm... sigh)

--
Ryan

On Fri, May 12, 2023 at 1:36 PM Ryan Booz  wrote:
>
> Thanks for the continued work, Peter. I hate to be the guy that starts this 
> way,
> but this is my first ever response on pgsql-hackers. (insert awkward
> smile face).
> Hopefully I've followed etiquette well, but please forgive any
> missteps, and I'm
> happy for any help in making better contributions in the future.
>
> On Thu, May 11, 2023 at 9:19 PM Peter Geoghegan  wrote:
> >
> > On Thu, May 4, 2023 at 3:18 PM samay sharma  wrote:
> > > What do you think about the term "Exhaustion"? Maybe something like "XID 
> > > allocation exhaustion" or "Exhaustion of allocatable XIDs"?
> >
> > I use the term "transaction ID exhaustion" in the attached revision,
> > v4. Overall, v4 builds on the work that went into v2 and v3, by
> > continuing to polish the overhaul of everything related to freezing,
> > relfrozenxid advancement, and anti-wraparound autovacuum.
>
> Just to say on the outset, as has been said earlier in the tread by others,
> that this is herculean work. Thank you for putting the effort you have thus 
> far.
> There's a lot of good from where I sit in the modification efforts.
> It's a heavy,
> dense topic, so there's probably never going to be a perfect way to
> get it all in,
> but some of the context early on, especially, is helpful for framing.
>
> >
> > It would be nice if it was possible to add an animation/diagram a
> > little like this one: https://tuple-freezing-demo.angusd.com (this is
> > how I tend to think about the "transaction ID space".)
>
> Indeed. With volunteer docs, illustrations/diagrams are hard for sure. But,
> this or something akin to the "clock" image I've seen elsewhere when
> describing the transaction ID space would probably be helpful if it were ever
> possible. In fact, there's just a lot about the MVCC stuff in general that
> would benefit from diagrams. But alas, I guess that's why we have some
> good go-to community talks/slide decks. :-)
>
> > v4 also limits use of the term "wraparound" to places that directly
> > discuss anti-wraparound autovacuums (plus one place in xact.sgml,
> > where discussion of "true unsigned integer wraparound" and related
> > implementation details has been moved). Otherwise we use the term
> > "transaction ID exhaustion", which is pretty much the user-facing name
> > for "xidStopLimit". I feel that this is a huge improvement, for the
> > reason given to Greg earlier. I'm flexible on the details, but I feel
> > strongly that we should minimize use of the term wraparound wherever
> > it might have the connotation of "the past becoming the future". This
> > is not a case of inventing a new terminology for its own sake. If
> > anybody is skeptical I ask that they take a look at what I came up
> > with before declaring it a bad idea. I have made that as easy as
> > possible, by once again attaching a prebuilt routine-vacuuming.html.
>
> Thanks again for doing this. Really helpful for doc newbies like me that
> want to help but are still working through the process. Really helpful
> and appreciated.
>
> >
> >
> > Other changes in v4, compared to v3:
> >
> > * Improved discussion of the differences between non-aggressive and
> > aggressive VACUUM.
>
> This was helpful for me and not something I've previously put much thought
> into. Helpful context that is missing from the current docs.
>
> > * Explains "catch-up freezing" performed by aggressive VACUUMs directly.
> >
> > "Catch-up" freezing is the really important "consequence" -- something
> > that emerges from how each type of VACUUM behaves over time. It is an
> > indirect consequence of the behaviors. I would like to counter the
> > perception that some users have about freezing only happening during
> > aggressive VACUUMs (or anti-wraparound autovacuums). But more than
> > that, talking about catch-up freezing seems essential because it is
> > the single most important difference.
> >
>
> Similarly, this was helpful overall context of various things
> happening with freezing.
>
> > * Much improved handling of the discussion of anti-wraparound
> > autovacuum, and how it relates to aggressive VACUUMs, following
> > feedback from Samay.
> >
> > There is now only fairly minimal overlap in the discussion of
> > aggressive VACUUM and anti-wraparound autovacuuming. We finish the
> > discussion of aggressive VACUUM just after we start discussing
> > anti-wraparound autovacuum. This transition works well, because it
> reinforces the idea that anti-wraparound autovacuum isn't really special
> > compared to any other aggressive autovacuum. This was something that
> Samay expressed particular concern about: making anti-wraparound
> > autovacuums sound less scary. Though it's also a concern I had from
> > the outset, based on practical experience and interactions with people
> > that have much

Re: smgrzeroextend clarification

2023-05-12 Thread Greg Stark
On Thu, 11 May 2023 at 05:37, Peter Eisentraut
 wrote:
>
> Maybe it was never meant that way and only works accidentally?  Maybe
> hash indexes are broken?

It's explicitly documented to be this way. And I think it has to work
this way for recovery to work.

I think the reason you and Bharath and Andres are talking past each
other is that they're thinking about how the implementation works and
you're talking about the API definition.

If you read the API definition and treat the functions as a black box
I think you're right -- those two definitions sound pretty much
equivalent to me. They both extend the file, possibly multiple blocks,
and zero fill. The only difference is that smgrextend() additionally
allows you to provide data.
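For anyone skimming the thread, a minimal sketch of that black-box view
(assuming the smgr.h declarations as of this release cycle; the helper and its
caller are hypothetical, and this only compiles inside the backend):

#include "postgres.h"
#include "storage/smgr.h"

/*
 * Hypothetical helper: append one caller-supplied (here zero-filled) block
 * with smgrextend(), then eight more blocks that the smgr layer zero-fills
 * itself with smgrzeroextend().
 */
static void
extend_both_ways(SMgrRelation reln, BlockNumber curnblocks)
{
	PGAlignedBlock buf;

	memset(buf.data, 0, BLCKSZ);

	/* one block, contents provided by the caller */
	smgrextend(reln, MAIN_FORKNUM, curnblocks, buf.data, false);

	/* eight blocks, zero-filled by the smgr layer */
	smgrzeroextend(reln, MAIN_FORKNUM, curnblocks + 1, 8, false);
}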

-- 
greg




Re: smgrzeroextend clarification

2023-05-12 Thread Thomas Munro
On Sat, May 13, 2023 at 6:07 AM Greg Stark  wrote:
> On Thu, 11 May 2023 at 05:37, Peter Eisentraut
>  wrote:
> > Maybe it was never meant that way and only works accidentally?  Maybe
> > hash indexes are broken?
>
> It's explicitly documented to be this way. And I think it has to work
> this way for recovery to work.
>
> I think the reason you and Bharath and Andres are talking past each
> other is that they're thinking about how the implementation works and
> you're talking about the API definition.
>
> If you read the API definition and treat the functions as a black box
> I think you're right -- those two definitions sound pretty much
> equivalent to me. They both extend the file, possibly multiple blocks,
> and zero fill. The only difference is that smgrextend() additionally
> allows you to provide data.

Just a thought: should RelationCopyStorageUsingBuffer(), the new code
used by CREATE DATABASE with the default strategy WAL_LOG, use the
newer interface so that it creates fully allocated files instead of
sparse ones?




Re: Should CSV parsing be stricter about mid-field quotes?

2023-05-12 Thread Greg Stark
On Thu, 11 May 2023 at 10:04, Joel Jacobson  wrote:
>
> The parser currently accepts quoting within an unquoted field. This can lead 
> to
> data misinterpretation when the quote is part of the field data (e.g.,
> for inches, like in the example).

I think you're thinking about it differently than the parser. I think
the parser is treating this the way, say, the shell treats quotes.
That is, it sees a quoted "I bought this for my 6" followed by an
unquoted "a laptop but it didn't fit my 8" followed by a quoted "
tablet".

So for example, in that world you might only quote commas and newlines
so you might print something like

1,2,I bought this for my "6"" laptop
" but it "didn't" fit my "8""" laptop

The actual CSV spec https://datatracker.ietf.org/doc/html/rfc4180 only
allows fully quoted or fully unquoted fields and there can only be
escaped double-doublequote characters in quoted fields and no
doublequote characters in unquoted fields.

But it also says

  Due to lack of a single specification, there are considerable
  differences among implementations.  Implementors should "be
  conservative in what you do, be liberal in what you accept from
  others" (RFC 793 [8]) when processing CSV files.  An attempt at a
  common definition can be found in Section 2.


So the real question is are there tools out there that generate
entries like this and what are their intentions?

> I think we should throw a parsing error for unescaped mid-field quotes,
> and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
> quotes are necessary. The error message could suggest this option when it
> encounters an unescaped mid-field quote.
>
> I think the convenience of not having to use an extra option doesn't outweigh
> the risk of undetected data integrity issues.

It's also a pretty annoying experience to get a message saying "error,
turn this option on to not get an error". I get what you're saying
too, which is that the risk depends on whether turning off the error
is really the right thing most of the time or whether it just causes
data to be read incorrectly.



-- 
greg




Re: psql tests hangs

2023-05-12 Thread Tom Lane
I wrote:
> Pavel Stehule  writes:
>> Except for testing, using pager in non-interactive mode makes no sense.

> Agreed.  Let's solve this by inserting isatty tests in psql, rather
> than hacking the test environment.

Here's a proposed patch for this.  I noticed that another memo the
PSQL_WATCH_PAGER patch had not gotten was the lesson learned in
commit 18f8f784c, namely that it's a good idea to ignore empty
or all-blank settings.

regards, tom lane

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 97f7d97220..607a57715a 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -5197,14 +5197,20 @@ do_watch(PQExpBuffer query_buf, double sleep, int iter)
 
 	/*
 	 * For \watch, we ignore the size of the result and always use the pager
-	 * if PSQL_WATCH_PAGER is set.  We also ignore the regular PSQL_PAGER or
-	 * PAGER environment variables, because traditional pagers probably won't
-	 * be very useful for showing a stream of results.
+	 * as long as we're talking to a terminal and "\pset pager" is enabled.
+	 * However, we'll only use the pager identified by PSQL_WATCH_PAGER.  We
+	 * ignore the regular PSQL_PAGER or PAGER environment variables, because
+	 * traditional pagers probably won't be very useful for showing a stream
+	 * of results.
 	 */
 #ifndef WIN32
 	pagerprog = getenv("PSQL_WATCH_PAGER");
+	/* if variable is empty or all-white-space, don't use pager */
+	if (pagerprog && strspn(pagerprog, " \t\r\n") == strlen(pagerprog))
+		pagerprog = NULL;
 #endif
-	if (pagerprog && myopt.topt.pager)
+	if (pagerprog && myopt.topt.pager &&
+		isatty(fileno(stdin)) && isatty(fileno(stdout)))
 	{
 		fflush(NULL);
 		disable_sigpipe_trap();


Re: psql tests hangs

2023-05-12 Thread Pavel Stehule
On Fri, May 12, 2023 at 21:08, Tom Lane  wrote:

> I wrote:
> > Pavel Stehule  writes:
> >> Except for testing, using pager in non-interactive mode makes no sense.
>
> > Agreed.  Let's solve this by inserting isatty tests in psql, rather
> > than hacking the test environment.
>
> Here's a proposed patch for this.  I noticed that another memo the
> PSQL_WATCH_PAGER patch had not gotten was the lesson learned in
> commit 18f8f784c, namely that it's a good idea to ignore empty
> or all-blank settings.
>

+1

Pavel


> regards, tom lane
>
>


Re: Should CSV parsing be stricter about mid-field quotes?

2023-05-12 Thread Andrew Dunstan


On 2023-05-11 Th 10:03, Joel Jacobson wrote:

Hi hackers,

I've come across an unexpected behavior in our CSV parser that I'd like to
bring up for discussion.

% cat example.csv
id,rating,review
1,5,"Great product, will buy again."
2,3,"I bought this for my 6" laptop but it didn't fit my 8" tablet"

% psql
CREATE TABLE reviews (id int, rating int, review text);
\COPY reviews FROM example.csv WITH CSV HEADER;
SELECT * FROM reviews;

This gives:

 id | rating |                           review
----+--------+--------------------------------------------------------------
  1 |      5 | Great product, will buy again.
  2 |      3 | I bought this for my 6 laptop but it didn't fit my 8 tablet
(2 rows)



Maybe this is unexpected by you, but it's not by me. What other sane 
interpretation of that data could there be? And what CSV producer 
outputs such horrible content? As you've noted, ours certainly does not. 
Our rules are clear: quotes within quotes must be escaped (default 
escape is by doubling the quote char). Allowing partial fields to be 
quoted was a deliberate decision when CSV parsing was implemented, 
because examples have been seen in the wild.
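To make the doubling rule concrete, here is a standalone sketch (not the COPY
code itself) of how a producer should emit that review field, quoting the
whole field and doubling each embedded quote character:

#include <stdio.h>

/* Hypothetical helper illustrating the escaping rule. */
static void
csv_print_quoted_field(FILE *out, const char *field)
{
	fputc('"', out);
	for (const char *p = field; *p; p++)
	{
		if (*p == '"')
			fputc('"', out);	/* escape by doubling */
		fputc(*p, out);
	}
	fputc('"', out);
}

int
main(void)
{
	/* prints: "I bought this for my 6"" laptop but it didn't fit my 8"" tablet" */
	csv_print_quoted_field(stdout,
						   "I bought this for my 6\" laptop but it didn't fit my 8\" tablet");
	fputc('\n', stdout);
	return 0;
}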


So I don't think our behaviour is broken or needs fixing. As mentioned 
by Greg, this is an example of the adage about being liberal in what you 
accept.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


improve more permissions-related error messages

2023-05-12 Thread Nathan Bossart
This is intended as a follow-up to de4d456 [0].  I noticed that c3afe8c
introduced another "must have privileges" error message that I think should
be adjusted to use the new style introduced in de4d456.  I've attached a
small patch for this.

While looking around for other such error messages, I found a few dozen
"must be superuser" errors that might be improved with the new style.  If
folks feel this is worthwhile, I'll put together a patch.

[0] https://postgr.es/m/20230126002251.GA1506128%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 8b3de032ee..e8b288d01c 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -611,7 +611,9 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt,
 	if (!has_privs_of_role(owner, ROLE_PG_CREATE_SUBSCRIPTION))
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("must have privileges of pg_create_subscription to create subscriptions")));
+ errmsg("permission denied to create subscription"),
+ errdetail("Only roles with privileges of the \"%s\" role may create subscriptions.",
+		   "pg_create_subscription")));
 
 	/*
 	 * Since a subscription is a database object, we also check for CREATE
diff --git a/src/test/regress/expected/subscription.out b/src/test/regress/expected/subscription.out
index d736246259..8b5f657897 100644
--- a/src/test/regress/expected/subscription.out
+++ b/src/test/regress/expected/subscription.out
@@ -79,7 +79,8 @@ ERROR:  subscription "regress_testsub" already exists
 -- fail - must be superuser
 SET SESSION AUTHORIZATION 'regress_subscription_user2';
 CREATE SUBSCRIPTION regress_testsub2 CONNECTION 'dbname=regress_doesnotexist' PUBLICATION foo WITH (connect = false);
-ERROR:  must have privileges of pg_create_subscription to create subscriptions
+ERROR:  permission denied to create subscription
+DETAIL:  Only roles with privileges of the "pg_create_subscription" role may create subscriptions.
 SET SESSION AUTHORIZATION 'regress_subscription_user';
 -- fail - invalid option combinations
 CREATE SUBSCRIPTION regress_testsub2 CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, copy_data = true);


Re: improve more permissions-related error messages

2023-05-12 Thread Tom Lane
Nathan Bossart  writes:
> This is intended as a follow-up to de4d456 [0].  I noticed that c3afe8c
> introduced another "must have privileges" error message that I think should
> be adjusted to use the new style introduced in de4d456.  I've attached a
> small patch for this.

+1

> While looking around for other such error messages, I found a few dozen
> "must be superuser" errors that might be improved with the new style.  If
> folks feel this is worthwhile, I'll put together a patch.

The new style is better for cases where we've broken out a predefined role
that has the necessary privilege.  I'm not sure it's worth troubling
with cases that are still just "must be superuser".  It seems like
you'd mostly just be creating work for the translation team.

regards, tom lane




Re: improve more permissions-related error messages

2023-05-12 Thread Nathan Bossart
On Fri, May 12, 2023 at 04:43:08PM -0400, Tom Lane wrote:
> Nathan Bossart  writes:
>> While looking around for other such error messages, I found a few dozen
>> "must be superuser" errors that might be improved with the new style.  If
>> folks feel this is worthwhile, I'll put together a patch.
> 
> The new style is better for cases where we've broken out a predefined role
> that has the necessary privilege.  I'm not sure it's worth troubling
> with cases that are still just "must be superuser".  It seems like
> you'd mostly just be creating work for the translation team.

Makes sense, thanks.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Memory leak from ExecutorState context?

2023-05-12 Thread Melanie Plageman
Thanks for continuing to work on this.

Are you planning to modify what is displayed for memory usage in
EXPLAIN ANALYZE?

Also, since that won't help a user who OOMs, I wondered if the spillCxt
is helpful on its own or if we need some kind of logging message for
users to discover that this is what led them to running out of memory.

On Wed, May 10, 2023 at 02:24:19PM +0200, Jehan-Guillaume de Rorthais wrote:
> On Mon, 8 May 2023 11:56:48 -0400 Melanie Plageman 
>  wrote:
> 
> > > @@ -934,13 +943,16 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
> > >  hashtable, nbatch, hashtable->spaceUsed);
> > >  #endif
> > >  
> > > - oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
> > > -
> > >   if (hashtable->innerBatchFile == NULL)
> > >   {
> > > + MemoryContext oldcxt =
> > > MemoryContextSwitchTo(hashtable->fileCxt); +
> > >   /* we had no file arrays before */
> > >   hashtable->innerBatchFile = palloc0_array(BufFile *,
> > > nbatch); hashtable->outerBatchFile = palloc0_array(BufFile *, nbatch);
> > > +  
> > > + MemoryContextSwitchTo(oldcxt);
> > > +
> > >   /* time to establish the temp tablespaces, too */
> > >   PrepareTempTablespaces();
> > >   }
> > > @@ -951,8 +963,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)  
> > 
> > I don't see a reason to call repalloc0_array() in a different context
> > than the initial palloc0_array().
> 
> Unless I'm wrong, wrapping the whole if/else block in the memory context
> hashtable->fileCxt seemed useless, as repalloc() actually reallocates in
> the context where the original allocation occurred. So I only wrapped the
> palloc() calls.

My objection is less about correctness and more about the diff. The
existing memory context switch is around the whole if/else block. If you
want to move it to only wrap the if statement, I would do that in a
separate commit with a message describing the rationale. It doesn't seem
to save us much and it makes the diff a bit more confusing. I don't feel
strongly enough about this to protest much more, though.
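As an aside for readers, the property being relied on here can be sketched
like so (a hypothetical helper, assuming the palloc.h array macros already
used by the patch; backend-only code, so only the initial allocation needs
the context switch):

#include "postgres.h"
#include "storage/buffile.h"
#include "utils/memutils.h"

/*
 * Hypothetical helper, not part of the patch: repalloc0_array() keeps the
 * chunk in the context it was originally allocated in, so no switch is
 * needed on the growth path.
 */
static BufFile **
grow_batch_array(MemoryContext filecxt, BufFile **files,
				 int oldnbatch, int nbatch)
{
	if (files == NULL)
	{
		MemoryContext oldcxt = MemoryContextSwitchTo(filecxt);

		files = palloc0_array(BufFile *, nbatch);
		MemoryContextSwitchTo(oldcxt);
	}
	else
		files = repalloc0_array(files, BufFile *, oldnbatch, nbatch);

	return files;
}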

> > > diff --git a/src/include/executor/hashjoin.h
> > > b/src/include/executor/hashjoin.h index 8ee59d2c71..74867c3e40 100644
> > > --- a/src/include/executor/hashjoin.h
> > > +++ b/src/include/executor/hashjoin.h
> > > @@ -25,10 +25,14 @@
> > >   *
> > >   * Each active hashjoin has a HashJoinTable control block, which is
> > >   * palloc'd in the executor's per-query context.  All other storage 
> > > needed
> > > - * for the hashjoin is kept in private memory contexts, two for each
> > > hashjoin.
> > > + * for the hashjoin is kept in private memory contexts, three for each
> > > + * hashjoin:  
> > 
> > Maybe "hash table control block". I know the phrase "control block" is
> > used elsewhere in the comments, but it is a bit confusing. Also, I wish
> > there was a way to make it clear this is for the hashtable but relevant
> > to all batches.
> 
> I tried to reword the comment with this additional info in mind in v6. Does it
> match what you had in mind?

Review of that below.

> > So, if we are going to allocate the array of pointers to the spill files
> > in the fileCxt, we should revise this comment. As I mentioned above, I
> > agree with you that if the SharedTupleStore-related buffers are also
> > allocated in this context, perhaps it shouldn't be called the fileCxt.
> > 
> > One idea I had is calling it the spillCxt. Almost all memory allocated in 
> > this
> > context is related to needing to spill to permanent storage during 
> > execution.
> 
> Agree
> 
> > The one potential confusing part of this is batch 0 for parallel hash
> > join. I would have to dig back into it again, but from a cursory look at
> > ExecParallelHashJoinSetUpBatches() it seems like batch 0 also gets a
> > shared tuplestore with associated accessors and files even if it is a
> > single batch parallel hashjoin.
> > 
> > Are the parallel hash join read_buffer and write_chunk also used for a
> > single batch parallel hash join?
> 
> I don't think so.
> 
> For the inner side, there's various Assert() around the batchno==0 special
> case. Plus, it always has his own block when inserting in a batch, to directly
> write in shared memory calling ExecParallelHashPushTuple().
> 
> The outer side of the join actually creates all batches using shared tuple
> storage mechanism, including batch 0, **only** if the number of batch is
> greater than 1. See in ExecParallelHashJoinOuterGetTuple:
> 
>   /*
>* In the Parallel Hash case we only run the outer plan directly for
>* single-batch hash joins. Otherwise we have to go to batch files, even
>* for batch 0.
>*/
>   if (curbatch == 0 && hashtable->nbatch == 1)
>   {
>   slot = ExecProcNode(outerNode);
> 
> So, for a single batch PHJ, it seems there's no temp files involved.

spill context seems appropriate, then.

> 
> > Though, perhaps spillCxt still makes sense? It doesn't specify
> > multi-batch...
> 
> I'm not sure the see 

Re: smgrzeroextend clarification

2023-05-12 Thread Andres Freund
Hi, 

On May 11, 2023 2:37:00 AM PDT, Peter Eisentraut 
 wrote:
>On 10.05.23 20:10, Andres Freund wrote:
>>> Moreover, the text "except the relation can be extended by multiple blocks
>>> at once and the added blocks will be filled with zeroes" doesn't make much
>>> sense as a differentiation, because smgrextend() does that as well.
>> 
>> Hm? smgrextend() writes a single block, and it's filled with the caller
>> provided buffer.
>
>But there is nothing that says that the block written by smgrextend() has to 
>be the one right after the last existing block.  You can give it any block 
>number, and it will write there, and the blocks in between that are skipped 
>over will effectively be filled with zeros.  This is because of the way the 
>POSIX file system APIs work.

Sure, but that's pretty much independent of my changes. With the exception of, 
I believe, hash indexes we are quite careful to never leave holes in files. And 
not just for performance reasons - it'd make it much more likely to encounter
ENOSPC while writing back blocks. Being unable to checkpoint (because they fail
due to ENOSPC) is quite nasty.
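For anyone who wants to see that POSIX behaviour in isolation, here is a
standalone sketch (plain libc, nothing PostgreSQL-specific; error handling
omitted): writing a block past the current end of the file implicitly extends
it, and the skipped range reads back as zeros without storage necessarily
being allocated for it.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 8192

int
main(void)
{
	char	block[BLKSZ];
	char	readback[BLKSZ];
	int		fd = open("demo_hole", O_RDWR | O_CREAT | O_TRUNC, 0600);

	memset(block, 'x', sizeof(block));

	/* write "block 10" without ever writing blocks 0..9 */
	pwrite(fd, block, BLKSZ, (off_t) 10 * BLKSZ);

	/* the hole left behind reads back as zeros */
	pread(fd, readback, BLKSZ, (off_t) 3 * BLKSZ);
	printf("first byte of block 3: %d, file size: %lld\n",
		   readback[0], (long long) lseek(fd, 0, SEEK_END));

	close(fd);
	return 0;
}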


>Maybe it was never meant that way and only works accidentally?  Maybe hash 
>indexes are broken?

It's known behavior I think - but also quite bad. I think it got a good bit 
worse after WAL support for hash indexes went in. I think during replay we 
sometimes end up actually allocating the blocks one by one.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: WAL Insertion Lock Improvements

2023-05-12 Thread Michael Paquier
On Fri, May 12, 2023 at 07:35:20AM +0530, Bharath Rupireddy wrote:
> --enable-atomics=no, -T60:
> --enable-spinlocks=no, -T60:
> --enable-atomics=no --enable-spinlocks=no, -T60:

Thanks for these extra tests, I have not done these specific cases but
the profiles look similar to what I've seen myself.  If I recall
correctly the fallback implementation of atomics just uses spinlocks
internally to force the barriers required.
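For readers unfamiliar with the fallback, its general shape can be sketched
like this (a generic, standalone illustration using a pthread mutex in place
of a spinlock, not PostgreSQL's actual port/atomics code): the lock acquire
and release supply the required barriers, at the cost of contention.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Emulate an atomic fetch-add with a lock standing in for a spinlock. */
typedef struct
{
	pthread_mutex_t lock;
	uint32_t	value;
} emulated_atomic_u32;

static uint32_t
emulated_fetch_add_u32(emulated_atomic_u32 *a, uint32_t add)
{
	uint32_t	old;

	pthread_mutex_lock(&a->lock);	/* acquire barrier */
	old = a->value;
	a->value = old + add;
	pthread_mutex_unlock(&a->lock);	/* release barrier */
	return old;
}

int
main(void)
{
	emulated_atomic_u32 counter = {PTHREAD_MUTEX_INITIALIZER, 0};

	emulated_fetch_add_u32(&counter, 7);
	printf("counter is now %u\n", (unsigned) counter.value);
	return 0;
}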
--
Michael


signature.asc
Description: PGP signature


Re: Large files for relations

2023-05-12 Thread Thomas Munro
On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN  wrote:
> Repeating what was mentioned on Twitter, because I had some experience with 
> the topic. With fewer files per table there will be more contention on the 
> per-inode mutex (which might now be the per-inode rwsem). I haven't read 
> filesystem source in a long time. Back in the day, and perhaps today, it was 
> locked for the duration of a write to storage (locked within the kernel) and 
> was briefly locked while setting up a read.
>
> The workaround for writes was one of:
> 1) enable disk write cache or use battery-backed HW RAID to make writes 
> faster (yes disks, I encountered this prior to 2010)
> 2) use XFS and O_DIRECT in which case the per-inode mutex (rwsem) wasn't 
> locked for the duration of a write
>
> I have a vague memory that filesystems have improved in this regard.

(I am interpreting your "use XFS" to mean "use XFS instead of ext4".)

Right, 80s file systems like UFS (and I suspect ext and ext2, which
were probably based on similar ideas and ran on non-SMP machines?)
used coarse grained locking including vnodes/inodes level.  Then over
time various OSes and file systems have improved concurrency.  Brief
digression, as someone who got started on IRIX in the '90s and still
thinks those were probably the coolest computers: At SGI, first they
replaced SysV UFS with EFS (E for extent-based allocation) and
invented O_DIRECT to skip the buffer pool, and then blew the doors off
everything with XFS, which maximised I/O concurrency and possibly (I
guess, it's not open source so who knows?) involved a revamped VFS to
lower stuff like inode locks, motivated by monster IRIX boxes with up
to 1024 CPUs and huge storage arrays.  In the Linux ext3 era, I
remember hearing lots of reports of various kinds of large systems
going faster just by switching to XFS and there is lots of writing
about that.  ext4 certainly changed enormously.  One reason back in
those days (mid 2000s?) was the old
fsync-actually-fsyncs-everything-in-the-known-universe-and-not-just-your-file
thing, and another was the lack of write concurrency especially for
direct I/O, and probably lots more things.  But that's all ancient
history...

As for ext4, we've detected and debugged clues about the gradual
weakening of locking over time on this list: we know that concurrent
read/write to the same page of a file was previously atomic, but when
we switched to pread/pwrite for most data (ie not making use of the
current file position), it ceased to be (a concurrent reader can see a
mash-up of old and new data with visible cache line-ish stripes in it,
so there isn't even a write-lock for the page); then we noticed that
in later kernels even read/write ceased to be atomic (implicating a
change in file size/file position interlocking, I guess).  I also
vaguely recall reading on here a long time ago that lseek()
performance was dramatically improved with weaker inode interlocking,
perhaps even in response to this very program's pathological SEEK_END
call frequency (something I hope to fix, but I digress).  So I think
it's possible that the effect you mentioned is gone?

I can think of a few differences compared to those other RDBMSs.
There the discussion was about one-file-per-relation vs
one-big-file-for-everything, whereas we're talking about
one-file-per-relation vs many-files-per-relation (which doesn't change
the point much, just making clear that I'm not proposing a 42PB file
to hold everything, so you can still partition to get different
files).  We also usually call fsync in series in our checkpointer
(after first getting the writebacks started with sync_file_range()
some time sooner).  Currently our code believes that it is not safe to
call fdatasync() for files whose size might have changed.  There is no
basis for that in POSIX or in any system that I currently know of
(though I haven't looked into it seriously), but I believe there was a
historical file system that at some point in history interpreted
"non-essential meta data" (the stuff POSIX allows it not to flush to
disk) to include "the size of the file" (whereas POSIX really just
meant that you don't have to synchronise the mtime and similar), which
is probably why PostgreSQL has some code that calls fsync() on newly
created empty WAL segments to "make sure the indirect blocks are down
on disk" before allowing itself to use only fdatasync() later to
overwrite it with data.  The point being that, for the most important
kind of interactive/user facing I/O latency, namely WAL flushes, we
already use fdatasync().  It's possible that we could use it to flush
relation data too (ie the relation files in question here, usually
synchronised by the checkpointer) according to POSIX but it doesn't
immediately seem like something that should be at all hot and it's
background work.  But perhaps I lack imagination.
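As a standalone illustration of that pattern (plain POSIX, not xlog.c; error
and short-write handling omitted): pre-size the file and fsync() it once so
its metadata is durable, then rely on the cheaper fdatasync() for later
overwrites that never change the size.

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SEGSZ (16 * 1024 * 1024)

int
main(void)
{
	char   *zeros = calloc(1, SEGSZ);
	char	record[8192];
	int		fd = open("demo_segment", O_RDWR | O_CREAT | O_TRUNC, 0600);

	/* create the file at its full size and make its metadata durable */
	write(fd, zeros, SEGSZ);
	fsync(fd);

	/* later overwrites never change the size, so fdatasync() suffices */
	memset(record, 'r', sizeof(record));
	pwrite(fd, record, sizeof(record), 0);
	fdatasync(fd);

	close(fd);
	free(zeros);
	return 0;
}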

Thanks, thought-provoking stuff.




Re: smgrzeroextend clarification

2023-05-12 Thread Andres Freund
Hi, 

On May 12, 2023 11:36:23 AM PDT, Thomas Munro  wrote:
>Just a thought: should RelationCopyStorageUsingBuffer(), the new code
>used by CREATE DATABASE with the default strategy WAL_LOG, use the
>newer interface so that it creates fully allocated files instead of
>sparse ones?

I played with that, but at least on Linux with ext4 and xfs it was hard to find 
cases where it really was beneficial. That's actually how I ended up finding 
the issues I'd fixed recently-ish.

I think it might be different if we had an option to not use a strategy for the 
target database - right now we IIRC write back due to ring replacement. 
Entirely or largely in order, which I think removes most of the issues you 
could have.

One issue is that it'd be worse on platforms / filesystems without fallocate 
support, because we would write the data back twice (once with zeros, once the 
real data). Perhaps we should add a separate parameter controlling the fallback 
behaviour.
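A standalone sketch of the difference under discussion (plain POSIX, not the
smgr code): posix_fallocate() reserves the blocks up front, so later writes
into that range should not fail with ENOSPC, whereas merely extending the
size leaves an unallocated hole.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define EXTEND_BYTES (64 * 8192)

int
main(void)
{
	int		fd_alloc = open("demo_allocated", O_RDWR | O_CREAT | O_TRUNC, 0600);
	int		fd_sparse = open("demo_sparse", O_RDWR | O_CREAT | O_TRUNC, 0600);
	int		rc;

	/* fully allocated: storage reserved now, ENOSPC reported here if at all */
	rc = posix_fallocate(fd_alloc, 0, EXTEND_BYTES);
	if (rc != 0)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));

	/* sparse: same apparent size, but no blocks reserved yet */
	if (ftruncate(fd_sparse, EXTEND_BYTES) != 0)
		perror("ftruncate");

	/* "du demo_allocated demo_sparse" shows the difference on file systems
	 * that support sparse files */
	close(fd_alloc);
	close(fd_sparse);
	return 0;
}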

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [PATCHES] Post-special page storage TDE support

2023-05-12 Thread Stephen Frost
Greetings,

* David Christensen (david.christen...@crunchydata.com) wrote:
> Refreshing this with HEAD as of today, v4.

Thanks for updating this!

> Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> 
> This space is reserved for extended data on the Page structure which will be 
> ultimately used for
> encrypted data, extended checksums, and potentially other things.  This data 
> appears at the end of
> the Page, after any `pd_special` area, and will be calculated at runtime 
> based on specific
> ControlFile features.
> 
> No effort is made to ensure this is backwards-compatible with existing 
> clusters for `pg_upgrade`, as
> we will require logical replication to move data into a cluster with 
> different settings here.

This initial patch, at least, does maintain pg_upgrade as the
reserved_page_size (maybe not a great name?) is set to 0, right?
Basically this is just introducing the concept of a reserved_page_size
and adjusting all of the code that currently uses BLCKSZ or
PageGetPageSize() to account for this extra space.

Looking at the changes to bufpage.h, in particular ...

> diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h

> @@ -19,6 +19,14 @@
>  #include "storage/item.h"
>  #include "storage/off.h"
>  
> +extern PGDLLIMPORT int reserved_page_size;
> +
> +#define SizeOfPageReservedSpace() reserved_page_size
> +#define MaxSizeOfPageReservedSpace 0
> +
> +/* strict upper bound on the amount of space occupied we have reserved on
> + * pages in this cluster */

This will eventually be calculated based on what features are supported
concurrently?

> @@ -36,10 +44,10 @@
>   * |             v pd_upper                           |
>   * +-------------+------------------------------------+
>   * |             | tupleN ...                         |
> - * +-------------+------------------+-----------------+
> - * |     ... tuple3 tuple2 tuple1   | "special space" |
> - * +--------------------------------+-----------------+
> - *                                  ^ pd_special
> + * +--------------------+-----------------+------------+
> + * |  ... tuple2 tuple1 | "special space" | "reserved" |
> + * +--------------------+-----------------+------------+
> + *                      ^ pd_special      ^ reserved_page_space

Right, adds a dynamic amount of space 'post-special area'.

> @@ -73,6 +81,8 @@
>   * stored as the page trailer.  an access method should always
>   * initialize its pages with PageInit and then set its own opaque
>   * fields.
> + *
> + * XXX - update more comments here about reserved_page_space
>   */

Would be good to do. ;)

> @@ -325,7 +335,7 @@ static inline void
>  PageValidateSpecialPointer(Page page)
>  {
>   Assert(page);
> - Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> + Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
>   Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
>  }

This is just one usage ... but seems like maybe we should be using
PageGetPageSize() here instead of BLCKSZ, and that more-or-less
throughout?  Nearly everywhere we're using BLCKSZ today to give us that
compile-time advantage of a fixed block size is going to lose that
advantage anyway thanks to reserved_page_size being run-time.  Now, one
up-side to this is that it'd also get us closer to being able to support
dynamic block sizes concurrently which would be quite interesting.  That
is, a special tablespace with a 32KB block size while the rest are the
traditional 8KB.  This would likely require multiple shared buffer
pools, of course...

> diff --git a/src/backend/storage/page/bufpage.c 
> b/src/backend/storage/page/bufpage.c
> index 9a302ddc30..a93cd9df9f 100644
> --- a/src/backend/storage/page/bufpage.c
> +++ b/src/backend/storage/page/bufpage.c
> @@ -26,6 +26,8 @@
>  /* GUC variable */
>  bool ignore_checksum_failure = false;
>  
> +int  reserved_page_size = 0; /* how much page space to 
> reserve for extended unencrypted metadata */
> +
>  
>  /* 
>   *   Page support functions
> @@ -43,7 +45,7 @@ PageInit(Page page, Size pageSize, Size specialSize)
>  {
>   PageHeader  p = (PageHeader) page;
>  
> - specialSize = MAXALIGN(specialSize);
> + specialSize = MAXALIGN(specialSize) + reserved_page_size;

Rather than make it part of specialSize, I would think we'd be better
off just treating them independently.  Eg, the later pd_upper setting
would be done by:

p->pd_upper = pageSize - specialSize - reserved_page_size;

etc.

> @@ -186,7 +188,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int 
> flags)
>   *   one that is both unused and deallocated.
>   *
>   *   If flag PAI_IS_HEAP is set, we enforce that there can't be more than
> - *   MaxHeapTuplesPerPage

Re: Large files for relations

2023-05-12 Thread Thomas Munro
On Sat, May 13, 2023 at 11:01 AM Thomas Munro  wrote:
> On Sat, May 13, 2023 at 4:41 AM MARK CALLAGHAN  wrote:
> > use XFS and O_DIRECT

As for direct I/O, we're only just getting started on that.  We
currently can't produce more than one concurrent WAL write, and then
for relation data, we just got very basic direct I/O support but we
haven't yet got the asynchronous machinery to drive it properly (work
in progress, more soon).  I was just now trying to find out what the
state of parallel direct writes is in ext4, and it looks like it's
finally happening:

https://www.phoronix.com/news/Linux-6.3-EXT4




Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing

2023-05-12 Thread Peter Geoghegan
On Fri, May 12, 2023 at 10:36 AM Ryan Booz  wrote:
> Just to say at the outset, as has been said earlier in the thread by others,
> that this is herculean work. Thank you for putting in the effort you have thus
> far.

Thanks!

> > It would be nice if it was possible to add an animation/diagram a
> > little like this one: https://tuple-freezing-demo.angusd.com (this is
> > how I tend to think about the "transaction ID space".)
>
> Indeed. With volunteer docs, illustrations/diagrams are hard for sure. But,
> this or something akin to the "clock" image I've seen elsewhere when
> describing the transaction ID space would probably be helpful if it were ever
> possible. In fact, there's just a lot about the MVCC stuff in general that
> would benefit from diagrams. But alas, I guess that's why we have some
> good go-to community talks/slide decks. :-)

A picture is worth a thousand words. This particular image may be
worth even more, though.

It happens to be *exactly* what I'd have done if I was tasked with
coming up with an animation that conveys the central ideas. Obviously
I brought this image up because I think that it would be great if we
could find a way to do something like that directly (not impossible,
there are a few images already). However, there is a less obvious
reason why I brought it to your attention: it's a very intuitive way
of understanding what I actually intend to convey through words -- at
least as far as talk about the cluster-wide XID space is concerned. It
might better equip you to review the patch series.

Sure, the animation will make the general idea clearer to just about
anybody -- that's a big part of what I like about it. But it also
captures the nuance that might matter to experts (e.g., the oldest XID
moves forward in jerky discrete jumps, while the next/unallocated XID
moves forward in a smooth, continuous fashion). So it works on
multiple levels, for multiple audiences/experience levels, without any
conflicts -- which is no small thing.

Do my words make you think of something a little like the animation?
If so, good.

> Thanks again for doing this. Really helpful for doc newbies like me that
> want to help but are still working through the process. Really helpful
> and appreciated.

I think that this is the kind of thing that particularly benefits from
diversity in perspectives.

> Agree. This flows fairly well and helps the user understand that each
> "next step"
> in the vacuum/freezing process has a distinct job based on previous work.

I'm trying to make it possible to read in short bursts, and to skim.
The easiest wins in this area will come from simply having more
individual sections/headings, and a more consistent structure. The
really difficult part is coming up with prose that can sort of work
for all audiences at the same time -- without alienating anybody.

Here is an example of what I mean:

The general idea of freezing can reasonably be summarized as "a
process that VACUUM uses to make pages self-contained (no need to do
pg_xact lookups anymore), that also has a role in avoiding transaction
ID exhaustion". That is a totally reasonable beginner-level (well,
relative-beginner-level) understanding of freezing. It *isn't* dumbed
down. You, as a beginner, have a truly useful take-away. At the same
time, you have avoided learning anything that you'll need to unlearn
some day. If I can succeed in doing that, I'll feel a real sense of
accomplishment.

> > * Much improved "Truncating Transaction Status Information" subsection.
> >
> > My explanation of the ways in which autovacuum_freeze_max_age can
> > affect the storage overhead of commit/abort status in pg_xact is much
> > clearer than it was in v3 -- pg_xact truncation is now treated as
> > something loosely related to the global config of anti-wraparound
> > autovacuum, which makes most sense.
> >
> This one isn't totally sinking in with me yet. Need another read.

"Truncating Transaction Status Information" is explicitly supposed to
matter much less than the rest of the stuff on freezing. The main
benefit that the DBA can expect from understanding this content is how
to save a few GB of disk space for pg_xact, which isn't particularly
likely to be possible, and is very very unlikely to be of any real
consequence, compared to everything else. If you were reading the
revised "Routine Vacuuming" as the average DBA, what you'd probably
have ended up doing is just not reading this part at all. And that
would probably be the ideal outcome. It's roughly the opposite of what
you'll get right now, by the way (bizarrely, the current docs place a
great deal of emphasis on this).

(Of course I welcome your feedback here too. Just giving you the context.)

> > It took a great deal of effort to find a structure that covered
> > everything, and that highlighted all of the important relationships
> > without going too far, while at the same time not being a huge mess.
> > That's what I feel I've arrived at with v4.
>
> In most respec

Re: [PATCH] Use RelationClose rather than table_close in heap_create_with_catalog

2023-05-12 Thread Xiaoran Wang
Thanks for all your responses.  It seems better to change the comments on the 
code
rather than call RelationClose here.

 table_close(new_rel_desc, NoLock);  /* do not unlock till end of xact */

Do I need to create another patch to fix the comments?

Best regards, xiaoran

From: tender wang 
Sent: Thursday, May 11, 2023 3:26 PM
To: Tom Lane 
Cc: Bharath Rupireddy ; Xiaoran Wang 
; pgsql-hackers@lists.postgresql.org 

Subject: Re: [PATCH] Use RelationClose rather than table_close in 
heap_create_with_catalog


Tom Lane <t...@sss.pgh.pa.us> wrote on Thu, May 11, 2023 at 00:32:
Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> writes:
> And, the /* do not unlock till end of xact */, it looks like it's been
> there from day 1. It may be indicating that the ref count fo the new
> relation created in heap_create_with_catalog() will be decremented at
> the end of xact, but I'm not sure what it means.

Hmm, I think it's been copied-and-pasted from somewhere.  It's quite
common for us to not release locks on modified tables until end of
transaction.  However, that's not what's happening here, because we
actually *don't have any such lock* at this point, as you can easily
prove by stepping through this code and watching the contents of
pg_locks from another session.  We do acquire AccessExclusiveLock
on the new table eventually, but not till control returns to
DefineRelation.

I'm not real sure that I like the proposed code change: it's unclear
to me whether it's an unwise piercing of a couple of abstraction
layers or an okay change because those abstraction layers haven't
yet been applied to the new relation at all.  However, I think the
existing comment is actively misleading and needs to be changed.
Maybe something like

/*
 * Close the relcache entry, since we return only an OID not a
 * relcache reference.  Note that we do not yet hold any lockmanager
 * lock on the new rel, so there's nothing to release.
 */
table_close(new_rel_desc, NoLock);

/*
 * ok, the relation has been cataloged, so close catalogs and return
 * the OID of the newly created relation.
 */
table_close(pg_class_desc, RowExclusiveLock);

+1
Personally, I prefer the above code.

Given these comments, maybe changing the first call to RelationClose
would be sensible, but I'm still not quite convinced.

regards, tom lane





Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound

2023-05-12 Thread John Naylor
Attached is v9, which is mostly editing the steps for restoring normal
operation, which are in 0003 now but will be squashed into 0002. Comments
to polish the wording welcome.

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 0e9d6ea72216b196d37de4629736c0cf34e7cd5c Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Sat, 13 May 2023 11:03:40 +0700
Subject: [PATCH v9 3/3] Rough draft of complete steps to recover from (M)XID
 generation failure

TODO: squash with previous
---
 doc/src/sgml/maintenance.sgml | 61 ++-
 1 file changed, 60 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 45d6cd1815..fee88cb647 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -675,7 +675,25 @@ HINT:  Execute a database-wide VACUUM in that database.
 In this condition any transactions already in progress can continue,
 but only read-only transactions can be started. Operations that
 modify database records or truncate relations will fail.
-The VACUUM command can still be run normally to recover.
+The VACUUM command can still be run normally.
+To restore normal operation:
+
+
+ 
+  Commit or rollback any prepared transactions.
+ 
+ 
+  Terminate any sessions that might have open transactions.
+ 
+ 
+  Drop any old replication slots.
+ 
+ 
+  Ensure autovacuum is running, and execute a database-wide VACUUM.
+To reduce the time required, it is also possible to issue manual VACUUM
+commands on the tables where relfrozenxid is oldest.
+ 
+

 

@@ -761,6 +779,47 @@ HINT:  Execute a database-wide VACUUM in that database.
  have the oldest multixact-age.  Both of these kinds of aggressive
  scans will occur even if autovacuum is nominally disabled.
 
+
+   
+Similar to the XID case, if autovacuum fails to clear old MXIDs from a table, the
+system will begin to emit warning messages like this when the database's
+oldest MXIDs reach forty million transactions from the wraparound point:
+
+
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To prevent MultiXactId generation failure, execute a database-wide VACUUM in that database.
+
+
+(A manual VACUUM should fix the problem, as suggested by the
+hint; but note that the VACUUM must be performed by a
+superuser, else it will fail to process system catalogs and thus not
+be able to advance the database's datfrozenxid.)
+If these warnings are ignored, the system will refuse to generate new
+MXIDs once there are fewer than three million left until wraparound:
+
+
+ERROR:  database is not accepting commands that generate new MultiXactIds to avoid wraparound data loss in database "mydb"
+HINT:  Execute a database-wide VACUUM in that database.
+
+   
+
+   
+To restore normal operation:
+
+ 
+  Commit or rollback each prepared transaction that might appear in a multixact.
+ 
+ 
+  Resolve each transaction that might appear in a multixact.
+ 
+ 
+  Ensure autovacuum is running, and execute a database-wide VACUUM.
+To reduce the time required, it is also possible to issue manual VACUUM
+commands on the tables where relminmxid is oldest.
+ 
+
+   
+

   
 
-- 
2.39.2

From 267609882a0be2764bc33fc289c0a962d47643c4 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 28 Apr 2023 16:08:33 +0700
Subject: [PATCH v9 1/3] Correct outdated docs and messages regarding XID/MXID
 limits

Previously, when approaching xidStopLimit or xidWrapLimit, log messages
would warn against a "database shutdown", and when it reached those
limits claimed that it "is not accepting commands". This language
originated in commit 60b2444cc when the xidStopLimit was added in
2005. At that time, even a trivial SELECT would have failed.

Commit 295e63983d in 2007 introduced virtual transaction IDs, which
allowed actual XIDs to be allocated lazily when it is necessary
to do so, such as when modifying database records. Since then, the
behavior at these limits is merely to refuse to allocate new XIDs,
so read-only queries can continue to be initiated.

The "database shutdown" message was also copied to the message
warning for multiWarnLimit when it was added.

This has been wrong for a very long time, so backpatch to all
supported branches.

Aleksander Alekseev, with some editing by me

Reviewed by Pavel Borisov and Peter Geoghegan
Discussion: https://postgr.es/m/caj7c6tm2d277u2wh8x78kg8ph3tduqebv3_jcjqakyqfhcf...@mail.gmail.com
---
 doc/src/sgml/maintenance.sgml  | 22 --
 src/backend/access/transam/multixact.c |  4 ++--
 src/backend/access/transam/varsup.c| 12 ++--
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 9cf9d030a8..48d43cb339 100644

Re: cutting down the TODO list thread

2023-05-12 Thread John Naylor
On Mon, Feb 6, 2023 at 11:04 AM John Naylor 
wrote:
> I'll try to get back to culling the list items at the end of April.

I've made another pass at this. Previously, I went one or two sections at a
time, but this time I tried just skimming the whole thing and noting what
jumps out at me. Also, I've separated things into three categories: Remove,
move to "not wanted list", and revise. Comments and objections welcome, as
always.

-
1. Remove

These are either ill-specified, outdated, possibly done already, or not
enough interest. If someone proposed them again, we could consider it, so I
propose to just remove these, but not move them to the Not Wanted list.
Also some questions:

Improve UTF8 combined character handling?
-> If this is referring to normalization, we have it now. If not, what
needs improving?

Improve COPY performance
-> What's the criterion for declaring this done? There are many areas that
get performance improvements -- why does this need to be called out
separately? (There's been some work in the past couple years, so maybe at
least we need to find out the current bottlenecks.)

Improve line drawing characters
-> This links to a proposal with no responses titled "Add a setting in psql
that set the linestyle to unicode only if the client encoding was actually
UTF8". If we don't drop this, the entry text should at least give an idea
what the proposal is.

Consider improving the continuation prompt
-> A cosmetic proposal that stalled -- time to retire it?

SMP scalability improvements
-> Both threads are from 2007

Consider GnuTLS if OpenSSL license becomes a problem
-> We now have the ability to swap in another implementation during the
build process

Allow creation of universal binaries for Darwin
-> From 2008: Is this still a thing?

Allow plug-in modules to emulate features from other databases
-> This sounds like a hook or extension.

Rethink our type system
-> Way out of scope, and short on details.

Add support for polymorphic arguments and return types to languages other
than PL/PgSQL
-> Seems if someone needed this, they would say so (no thread).

Add support for OUT and INOUT parameters to languages other than PL/PgSQL
-> Ditto


2. Propose to move to the "Not Wanted list":

(Anything already at the bottom under the heading "Features We Do Not
Want", with the exception of "threads in a single process". I'll just
remove that -- if we ever decide that's worth pursuing, it'll be because we
decided we can't really avoid it anymore, and in that case we surely don't
need to put it here.)

Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM ANALYZE,
and CLUSTER
-> There are external tools that help with this kind of analysis

Allow regex operations in PL/Perl using UTF8 characters in non-UTF8 encoded
databases
-> Seems pie-in-the-sky as well as a niche problem?

Add option to print advice for people familiar with other databases
-> Doesn't seem relevant anymore?

Consider having single-page pruning update the visibility map
-> Comment from Heikki in the thread:
"I think I was worried about the possible performance impact of having to
clear the bit in visibility map again. If you're frequently updating a
tuple so that HOT and page pruning is helping you, setting the bit in
visibility map seems counter-productive; it's going to be cleared soon
again by another UPDATE. That's just a hunch, though. Maybe the overhead
is negligible."

Consider mmap()'ing entire files into a backend?
-> Isn't this a can of worms?

Consider allowing control of upper/lower case folding of unquoted
identifiers
-> Would we ever consider this?

-
Other -- need adjustment or update?

Do async I/O for faster random read-ahead of data
-> This section needs to be revised, since there is on-going work on AIO.
There are a couple other entries that should maybe be put under a different
heading?

*** The whole section on Windows has lots of old stuff -- which are still
relevant?

(ECPG) Fix nested C comments
-> What needs fixing? It should work fine.

Improve speed of tab completion
-> Is this still a problem?

Testing pgstat via pg_regress is tricky and inefficient. Consider making a
dedicated pgstat test-suite.
-> This has significantly changed recently -- how are things now?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: cutting down the TODO list thread

2023-05-12 Thread Tom Lane
John Naylor  writes:
> I've made another pass at this. Previously, I went one or two sections at a
> time, but this time I tried just skimming the whole thing and noting what
> jumps out at me. Also, I've separated things into three categories: Remove,
> move to "not wanted list", and revise. Comments and objections welcome, as
> always.

Generally agree that the items you've listed are obsolete.  Some comments:

> Allow creation of universal binaries for Darwin
> -> From 2008: Is this still a thing?

This entry might stem from 4b362c662.  It's still something that nobody
has bothered to make happen, not even after another architecture
transition on Apple's part.  And there are things partly outside our
control in the area, see d69a419e6.  I doubt it will ever happen.

> Add support for polymorphic arguments and return types to languages other
> than PL/PgSQL
> -> Seems if someone needed this, they would say so (no thread).

I think this is still an issue.  Surprised nobody has yet gotten annoyed
enough to do something about it.

> Add support for OUT and INOUT parameters to languages other than PL/PgSQL
> -> Ditto

And ditto.

> Consider allowing control of upper/lower case folding of unquoted
> identifiers
> -> Would we ever consider this?

I think that one's dead as a doornail.

> (ECPG) Fix nested C comments
> -> What needs fixing? It should work fine.

I might be mistaken, but I think 8ac5e88f9 may have fixed this.

> Improve speed of tab completion
> -> Is this still a problem?

I keep worrying that tab-complete.c will become so ungainly as to
present a human-scale performance problem.  But there's been pretty
much zero complaints so far.  Let's drop this one until some actual
issue emerges.

regards, tom lane