Re: Add the ability to limit the amount of memory that can be allocated to backends.

2024-12-27 Thread Tomas Vondra



On 12/27/24 20:14, James Hunter wrote:
> Reviving this thread, because I am thinking about something related --
> please ignore the "On Fri, Dec 27, 2024" date, this seems to be an
> artifact of me re-sending the message, from the list archive. The
> original message was from January 28, 2024.
> 
> On Fri, Dec 27, 2024 at 11:02 AM Tomas Vondra
>  wrote:
>>
>> Firstly, I agree with the goal of having a way to account for memory
>> used by the backends, and also ability to enforce some sort of limit.
>> It's difficult to track the memory at the OS level (interpreting RSS
>> values is not trivial), and work_mem is not sufficient to enforce a
>> backend-level limit, not even talking about a global limit.
>>
>> But as I said earlier, it seems quite strange to start by introducing
>> some sort of global limit, combining memory for all backends. I do
>> understand that the intent is to have such global limit in order to
>> prevent issues with the OOM killer and/or not to interfere with other
>> stuff running on the same machine. And while I'm not saying we should
>> not have such limit, every time I wished to have a memory limit it was a
>> backend-level one. Ideally a workmem-like limit that would "adjust" the
>> work_mem values used by the optimizer (but that's not what this patch
>> aims to do), or at least a backstop in case something goes wrong (say, a
>> memory leak, OLTP application issuing complex queries, etc.).
> 
> I think what Tomas suggests is the right strategy. I am thinking of
> something like:
> 
> 1. Say we have a backend_work_mem limit. Then the total amount of
> memory available on the entire system, for all queries, as work_mem,
> would be backend_work_mem * max_connections.
> 
> 2. We use this backend_work_mem to "adjust" work_mem values used by
> the executor. (I don't care about the optimizer right now -- optimizer
> just does its best to predict what will happen at runtime.)
> 
> At runtime, every node that uses work_mem currently checks its memory
> usage against the session work_mem (and possibly hash_mem_multiplier)
> GUC(s). Instead, now, every node would check against its node-specific
> "adjusted" work_mem. If it exceeds this limit, it spills, using
> existing logic.
> 
> In other words -- existing logic spills based on comparison to global
> work_mem GUC. Let's make it spill, instead, based on comparison to an
> operator-local "PlanState.work_mem" field.
> 
> And then let's set that "PlanState.work_mem" field based on a new
> "backend_work_mem" GUC. Then no queries will run OOM (at least, not
> due to work_mem -- we'll address other memory usages separately), so
> they won't need to be canceled. Instead, they'll spill.
> 
> This strategy solves the ongoing problem of how to set work_mem, if
> some queries have lots of operators and others don't -- now we just
> set backend_work_mem, as a limit on the entire query's total work_mem.
> And a bit of integration with the optimizer will allow us to
> distribute the total backend_work_mem to individual execution nodes,
> with the goal of minimizing spilling, without exceeding the
> backend_work_mem limit.
> 
> Anyway, I revived this thread to see if there's interest in this sort
> of strategy --
> 

I think there's still interest in having a memory limit of this kind,
but it's not very clear to me what you mean by "adjusted work_mem". Can
you explain how that would actually work in practice?

Whenever I've been thinking about this in the past, it wasn't clear to
me how we would know when to start adjusting work_mem, because we don't
know which nodes will actually use work_mem concurrently.

We certainly don't know that during planning - e.g. because we don't
actually know what nodes will be "above" the operation we're planning.
And AFAIK we don't have any concept of which parts of the plan may be
active at the same time ...

Let's say we have limits work_mem=4MB and query_mem=8MB (to keep it
simple). And we have a query plan with 3 separate hash joins, each
needing 4MB. I believe these things can happen:

1) The hash joins are in distant parts of the query plan, and will not
actually run concurrently - we'll run each hashjoin till completion
before starting the next one. So, no problem with query_mem.

2) The hash joins are active at the same time, initialized one by one.
But we don't know that in advance, so we'll init HJ #1 and it'll get 4MB
of memory. Then we'll init HJ #2, and it'll also get 4MB of memory. And
now we want to init HJ #3, but we've already exhausted query_mem - the
memory is already in use and we can't reclaim it.

How would the work_mem get adjusted in these cases? Ideally, we'd not
adjust work_mem in (1), because it'd force hash joins to spill data to
disk, affecting performance when there's enough memory to just run the
query plan (but maybe that's a reasonable price for the memory limit?).
While in (2) we'd need to start adjusting the memory from the beginning,
but I don't think we know that so early.

Can you explain / walk me thro
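
To make case (1) concrete, here is a small standalone sketch of the kind of
static split being discussed. All of the names (backend_work_mem,
node_work_mem, NodeSketch) are invented for illustration - none of this is
executor code - but it shows how an even split of query_mem = 8MB across
three 4MB hash joins forces every one of them to spill, even when they would
never actually run concurrently:

#include <stdio.h>
#include <stddef.h>

typedef struct NodeSketch
{
	const char *label;
	size_t		node_work_mem;	/* per-node budget, in bytes */
	size_t		mem_wanted;		/* memory the node would like to use */
} NodeSketch;

/* Naive even split of a backend-wide budget across memory-hungry nodes. */
static void
distribute_backend_work_mem(NodeSketch *nodes, int nnodes, size_t backend_work_mem)
{
	for (int i = 0; i < nnodes; i++)
		nodes[i].node_work_mem = backend_work_mem / nnodes;
}

/* The spill test: compare against the node's own limit, not a global GUC. */
static int
node_must_spill(const NodeSketch *node)
{
	return node->mem_wanted > node->node_work_mem;
}

int
main(void)
{
	NodeSketch	nodes[] = {
		{"HashJoin #1", 0, 4 * 1024 * 1024},
		{"HashJoin #2", 0, 4 * 1024 * 1024},
		{"HashJoin #3", 0, 4 * 1024 * 1024},
	};
	size_t		backend_work_mem = 8 * 1024 * 1024;	/* "query_mem" = 8MB */

	distribute_backend_work_mem(nodes, 3, backend_work_mem);
	for (int i = 0; i < 3; i++)
		printf("%s: limit %zu, wants %zu -> %s\n",
			   nodes[i].label, nodes[i].node_work_mem, nodes[i].mem_wanted,
			   node_must_spill(&nodes[i]) ? "spill" : "in memory");
	return 0;
}

Any scheme that adjusts the per-node limits up front pays this price in case
(1); adjusting lazily instead runs into case (2), where the first two joins
have already consumed the whole budget before the third is initialized.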

RE: Connection limits/permissions, slotsync workers, etc

2024-12-27 Thread Zhijie Hou (Fujitsu)
On Saturday, December 28, 2024 1:31 AM Tom Lane  wrote:
> 
> "Zhijie Hou (Fujitsu)"  writes:
> > On Thursday, December 26, 2024 3:50 AM Tom Lane 
> >> I wonder if the AV launcher and slotsync worker could be reclassified
> >> as "auxiliary processes" instead of being their own weird animal.
> 
> > It appears that the current aux processes do not run transactions as
> > stated in the comments[1], so we may need to somehow release this
> > restriction to achieve the goal.
> 
> Ah, right, I'd forgotten about that restriction.  I agree that removing it 
> wouldn't
> be very reasonable.  However, I still would rather avoid making the slotsync
> worker be its very own special snowflake, because that offers no support for
> the next person who wants to invent a new sort of specialized
> transaction-capable process.
> 
> Attached is an alternative proposal that groups the autovac launcher and
> slotsync worker into a new category of "special workers" (better name
> welcome).  I chose to put them into the existing autovacFreeProcs freelist,
> partly because the autovac launcher lives there already but mostly because I
> don't want to add another freelist in a patch we need to put into v17.  (As
> written, your patch is an ABI break.

Right, thanks for pointing it out. The new version of the patch looks good to me.

Best Regards,
Hou zj





Re: Fix bank selection logic in SLRU

2024-12-27 Thread Andrey Borodin



> On 19 Dec 2024, at 20:48, Yura Sokolov  wrote:
> 
> Here's version with type change bits16 -> uint16

Thanks! This version looks good to me. I’ll mark the CF entry as RfC.


Best regards, Andrey Borodin.



Re: RFC: Allow EXPLAIN to Output Page Fault Information

2024-12-27 Thread Jeremy Schneider
On Fri, 27 Dec 2024 15:15:40 +0100
"Jelte Fennema-Nio"  wrote:

> On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
> > torikoshia  writes:  
> >> I have attached a PoC patch that modifies EXPLAIN to include page
> >> fault information during both the planning and execution phases of
> >> a query.   
> >
> > Surely these numbers would be too unstable to be worth anything.  
> 
> What makes you think that? I'd expect them to be similarly stable to
> the numbers we get for BUFFERS. i.e. Sure they won't be completely
> stable, but I expect them to be quite helpful when debugging perf
> issues, because large numbers indicate that the query is disk-bound
> and small numbers indicate that it is not.
> 
> These numbers seem especially useful for setups where shared_buffers
> is significantly smaller than the total memory available to the
> system. In those cases the output from BUFFERS might give the
> impression that you're disk-bound, but if your working set still
> fits into OS cache then the number of page faults is likely still
> low. Thus telling you that the numbers that you get back from BUFFERS
> are not as big of a problem as they might seem.

We'd probably need to combine both pg_buffercache_evict() and
/proc/sys/vm/drop_caches to get stable numbers - which is something I
have done in the past for testing.

Another thought would be splitting out the IO timing information into
two values - IO timing for reads that triggered major faults, versus
IO timing for reads that did not.

And system views like pg_stat_database seem worth considering too.

-Jeremy




Remove support for OpenSSL *eay32 libs on Windows

2024-12-27 Thread Daniel Gustafsson
The thread in [0] which adds Meson support for *eay32 OpenSSL library names on
Windows reminded me that we should remove these from master since we no longer
support OpenSSL 1.0.2 (which was the final version with those names).  Attached
is a small (as of yet untested on Windows/autoconf) diff removing these names
from the Makefiles.

[0] cahrt6656w9onfomqthbgydcm5ckz7hcgzft8l+n0ezbzfcn...@mail.gmail.com
--
Daniel Gustafsson



v1-0001-Remove-support-for-linking-with-libeay32-and-ssle.patch
Description: Binary data


Re: RFC: Allow EXPLAIN to Output Page Fault Information

2024-12-27 Thread Jelte Fennema-Nio

On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:

torikoshia  writes:
I have attached a PoC patch that modifies EXPLAIN to include page fault 
information during both the planning and execution phases of a query. 


Surely these numbers would be too unstable to be worth anything.


What makes you think that? I'd expect them to be similarly stable to the
numbers we get for BUFFERS. i.e. Sure they won't be completely stable,
but I expect them to be quite helpful when debugging perf issues,
because large numbers indicate that the query is disk-bound and small
numbers indicate that it is not.

These numbers seem especially useful for setups where shared_buffers is
significantly smaller than the total memory available to the system. In
those cases the output from BUFFERS might give the impression that
you're disk-bound, but if your working set still fits into OS cache then
the number of page faults is likely still low. Thus telling you that the
numbers that you get back from BUFFERS are not as big of a problem as
they might seem.
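
For context, the mechanism under discussion is cheap to sketch: take a
getrusage() snapshot before and after a phase (planning, execution) and report
the major/minor page-fault deltas. The standalone program below only
illustrates that mechanism under the assumption that the PoC relies on
getrusage(RUSAGE_SELF); it is not the patch itself, and the helper names are
made up:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

typedef struct FaultCounters
{
	long		majflt;			/* faults that required disk I/O */
	long		minflt;			/* faults satisfied without disk I/O */
} FaultCounters;

static FaultCounters
snapshot_faults(void)
{
	struct rusage ru;
	FaultCounters fc = {0, 0};

	if (getrusage(RUSAGE_SELF, &ru) == 0)
	{
		fc.majflt = ru.ru_majflt;
		fc.minflt = ru.ru_minflt;
	}
	return fc;
}

int
main(void)
{
	FaultCounters before,
				after;
	size_t		sz = 64UL * 1024 * 1024;
	char	   *buf;

	before = snapshot_faults();

	/* Stand-in for "run the planner" or "run the executor": touch ~64MB. */
	buf = malloc(sz);
	if (buf != NULL)
		memset(buf, 0xab, sz);

	after = snapshot_faults();
	printf("major faults: %ld, minor faults: %ld\n",
		   after.majflt - before.majflt,
		   after.minflt - before.minflt);
	free(buf);
	return 0;
}

On a warm OS cache the major-fault delta stays near zero even when BUFFERS
reports many reads, which is exactly the distinction being argued for above.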




Re: FileFallocate misbehaving on XFS

2024-12-27 Thread Bruce Momjian
On Fri, Dec 20, 2024 at 01:25:41PM +0100, Jakub Wartak wrote:
> On Thu, Dec 19, 2024 at 7:49 AM Michael Harris  wrote:
> No one else has responded, so I'll try. My take is that we got a very limited
> number of reports (2-3) of this happening and it always seems to be >90%
> space used, yet the adoption of PG16 is rising, so we may or may not see more
> errors of that kind; but on the other side of things, its frequency is so rare

My guess is that if you are seeing this at 90% full, and not at lower
values, something is being temporarily exhausted in XFS and the
kernel errno API just doesn't allow returning enough detail to explain
what is being exhausted.  A traditional Unix file system only has limits
for the inode table and free blocks, but I am sure XFS has many more
areas of possible exhaustion.

I didn't see any mention of checking the kernel log, which might have
more details of what XFS is having problems with.

I agree trying to modify Postgres for this now makes no sense since we
don't even know the cause.  Once we find the cause, and admit it can't
be avoided or quickly fixed, we can reevaluate.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.






Re: cannot to compile extension by meson on windows

2024-12-27 Thread Pavel Stehule
On Fri, Dec 27, 2024 at 9:50 AM, Vladlen Popolitov <
v.popoli...@postgrespro.ru> wrote:

> Pavel Stehule писал(а) 2024-12-01 20:52:
> > Hi
> >
> > Did somebody test compilation of any extension on the WIN platform by
> > using meson?
> >
> > I prepared meson.build
> > https://github.com/orafce/orafce/blob/master/meson.build
> >
> > I tested it successfully on Linux.
> >
> > But it fails on Windows - a lot of compilation fails on missing
> > libintl.h
> >
> > DOCDIR = C:/PROGRA~1/POSTGR~1/16/doc
> > HTMLDIR = C:/PROGRA~1/POSTGR~1/16/doc
> > INCLUDEDIR = C:/PROGRA~1/POSTGR~1/16/include
> > PKGINCLUDEDIR = C:/PROGRA~1/POSTGR~1/16/include
> > INCLUDEDIR-SERVER = C:/PROGRA~1/POSTGR~1/16/include/server`
> >
> > it looks like msvc cannot work with just this configuration.
> >
> > I can compile orafce when I use setup described by
> > https://github.com/orafce/orafce/blob/master/README.msvc
> >
> > Regards
> >
> > Pavel
>
> Hi!
>
>   In other thread
>
> https://www.postgresql.org/message-id/TYVPR01MB1133078C93F9FE432CA466573E40E2%40TYVPR01MB11330.jpnprd01.prod.outlook.com
> Kohei Harikae makes good work to clarify meson documentation, especially
> regarding additional libraries.
>
> I suppose in your case meson did not find the gettext library, which
> provides libintl.h.
>
> You can:
> 1) install gettext, f.e by vcpkg package manager:
> vcpkg.exe install gettext:x64-windows
> --x-install-root=c:\postgres\gettext
>
> 2) You could add the gettext .pc file directory to PKG_CONFIG_PATH (a
> ;-separated list; meson uses the configurations in this order):
> SET
> PKG_CONFIG_PATH=c:\postgres\gettext\x64-windows\lib\pkgconfig;%PKG_CONFIG_PATH%
> but this file is not created for this library (you can create it
> yourself, but another solution exists - see step 4)
> 3) make all other steps to run meson, f.e. call "C:\Program
> Files\Microsoft Visual
> Studio\2022\Community\VC\Auxiliary\Build\vcvarsall.bat" x64
>
> 4) run meson setup with option -Dnls=enabled - it enables national
> languages and uses gettext.
> And you have to add options for the gettext library and include
> directories:
> meson setup c:\builddir -Dnls=enabled
>
>
> -Dextra_include_dirs=c:\postgres\gettext\x64-windows\include,c:\otherlibs\include
>   -Dextra_lib_dirs=c:\postgres\gettext\x64-windows\lib,c:\otherlibs\lib
> ...other options...
>
> extra_include_dirs and extra_lib_dirs are options defined by PostgreSQL.
> These options are comma-separated lists of directories, as long as no
> value contains a comma itself (if it does, it is better to read the
> meson get_option() documentation; the meson language syntax is not easy
> to explain briefly).
>
> Using PKG_CONFIG_PATH, meson detects the libraries and takes care of
> the include and lib paths, but if a library does not have a .pc file,
> you can define the paths in these options.
>
> If you build with -Dnls=enabled, you have to be careful with tests.
> Tests in this case run with the default system language, and some tests
> will fail, as they check the log file output and compare it with
> English answers.
>
> I hope this helps you.
>

Thank you for the interesting information. I compiled extensions already.
Maybe the problem is in pg_config

if meson.get_compiler('c').get_id() == 'msvc'
  incdir = [includedir_server / 'port/win32_msvc',
            includedir_server / 'port/win32',
            includedir_server,
            includedir]
  postgres_lib = meson.get_compiler('c').find_library(
    'postgres',
    dirs: libdir,
    static: true,
    required: true
  )
else
  incdir = [includedir_server]
  postgres_lib = ''
endif

It looks like the returned include dir is not enough for successful
compilation, and some extra magic is needed.
Maybe pg_config doesn't support the msvc-specific configuration.





>
> --
> Best regards,
>
> Vladlen Popolitov.
>


Re: Add Postgres module info

2024-12-27 Thread David E. Wheeler
On Dec 26, 2024, at 20:09, Andrei Lepikhov  wrote:

> We intentionally wrote a library, not an extension. According to user usage 
> and upgrade patterns, it works across the whole instance and in any database 
> or locally in a single backend and ends its impact at the end of its life.

The same is true for the shared libraries included in many extensions. A shared 
library is just an extension that’s available in all databases and has no 
associated SQL interface.

> Also, it doesn't maintain any object in the database and is managed by GUCs.

Sure, but this is just a semantic argument. The Postgres developers get to 
decide what terms mean. I argue it can be worthwhile to merge the idea of a 
library into extensions.

> For example, my libraries add query tree transformations/path recommendations 
> to the planner. It doesn't depend on a database and doesn't maintain DSM 
> segments and users sometimes want to use it in specific backends, not 
> databases - in a backend dedicated to analytic queries without extra overhead 
> to backends, picked out for short queries. For what reason do I need to add 
> complexity and call 'CREATE EXTENSION' here and add version info only in a 
> specific database? Just because of a formal one-directory structure?

Perhaps shared-library only extensions are not limited to a single database.

Best,

David



signature.asc
Description: Message signed with OpenPGP


Re: More reliable nbtree detection of unsatisfiable RowCompare quals involving a leading NULL key/element

2024-12-27 Thread Peter Geoghegan
On Mon, Dec 23, 2024 at 1:02 PM Peter Geoghegan  wrote:
> Attached patch fixes the problem by moving detection of RowCompare
> unsatisfiable-due-to-NULL cases into _bt_fix_scankey_strategy.

Attached is v2, which adds several new regression tests, giving
certain relevant nbtree code paths test coverage.

v2 also makes some small tweaks to _bt_check_rowcompare(), the actual
comparator used by scans with a RowCompare qual. We no longer need to
account for the possibility that the first row member from a
RowCompare scan key contains a null once the scan is underway (we know
that _bt_preprocess_keys would have recognized the qual as
unsatisfiable had the RowCompare looked like that).

-- 
Peter Geoghegan


v2-0001-Improve-nbtree-unsatisfiable-RowCompare-detection.patch
Description: Binary data


RE: Connection limits/permissions, slotsync workers, etc

2024-12-27 Thread Zhijie Hou (Fujitsu)
On Thursday, December 26, 2024 3:50 AM Tom Lane 

Hi,

> In connection with the discussion at [1], I started to look at exactly which 
> server
> processes ought to be subject to connection limits (datconnlimit,
> ACL_CONNECT, and related checks).  The current situation seems to be an
> inconsistent mess.
> 
> Looking at InitPostgres, InitializeSessionUserId, and CheckMyDatabase, we
> have several different permissions and resource-limiting checks that may be
> enforced against an incoming process:
>   ReservedConnections/SuperuserReservedConnections
>   rolcanlogin
>   rolconnlimit
>   datallowconn
>   datconnlimit
>   database's ACL_CONNECT privilege
> 
> I want to argue that ReservedConnections, rolconnlimit, and datconnlimit
> should only be applied to regular backends.  It makes no particular sense to
> enforce them against autovac workers, background workers, or wal senders,
> because each of those types of processes has its own resource-limiting
> PGPROC pool.  It's especially bad to enforce them against parallel workers,
> since that creates edge-case failures that make the use of parallelism not
> transparent to applications.  We hacked around that at 73c9f91a1, but I think
> that should be reverted in favor of not applying the check at all.
> 
> I further argue that rolcanlogin, datallowconn, and ACL_CONNECT should not
> be checked in a parallel worker, again primarily on the grounds that this 
> creates
> parallelism-specific failures (cf [1]).
> The two scenarios where this occurs are (1) permissions were revoked since
> the leader process connected, or (2) leader is currently running as a role 
> that
> wouldn't have permission to connect on its own.  We don't attempt to kick out
> the leader process when either of those things happen, so why should we
> prevent it from using parallelism?
> 
> The current situation for each of these checks is:
> 
> ReservedConnections is enforced if !am_superuser && !am_walsender, so it is
> enforced against non-superuser background workers, which is silly because
> BG workers have their own PGPROC pool; moreover, what's the argument for
> letting walsenders but not other kinds of background processes escape this?
> I propose changing it to apply only to regular backends.
> 
> rolcanlogin is enforced if IsUnderPostmaster and we reach
> InitializeSessionUserId, which basically reduces to regular backends, parallel
> workers, logrep workers, and walsenders.  Seems reasonable for logrep
> workers and walsenders which represent fresh logins, but not for parallel
> workers.  I propose fixing this by making ParallelWorkerMain pass
> BGWORKER_BYPASS_ROLELOGINCHECK.
> 
> rolconnlimit is enforced if IsUnderPostmaster and we reach
> InitializeSessionUserId and it's not a superuser.  So that applies to
> non-superuser parallel workers, logrep workers, and walsenders, and I don't
> think it's reasonable to apply it to any of them since those all come out of 
> other
> PGPROC pools.  I propose switching that to apply only to regular backends.
> 
> BTW, I kind of wonder why rolconnlimit is ineffectual for superusers, 
> especially
> when rolcanlogin does apply to them.  Not a bug exactly, but it sure seems
> inconsistent.  If you've taken the trouble to set it you'd expect it to work. 
>  Shall
> we take out the !is_superuser check?
> 
> datallowconn is enforced against all non-standalone, non-AV-worker
> processes that connect to a specific database, except bgworkers that pass
> BGWORKER_BYPASS_ALLOWCONN (nothing in core except test module).
> So again that includes parallel workers, logrep workers, and walsenders.
> Again this seems reasonable for logrep workers and walsenders but not for
> parallel workers.  I propose fixing this by making ParallelWorkerMain pass
> BGWORKER_BYPASS_ALLOWCONN.
> 
> datconnlimit is enforced against all non-superuser processes, including
> per-DB walsenders and BG workers (see above).
> This is fairly dubious given that they have their own PGPROC pools.
> I propose switching that to apply only to regular backends, too.
> 
> ACL_CONNECT is enforced against all non-superuser processes, including
> per-DB walsenders and BG workers (includes parallel workers, subscription
> apply workers, logrep workers).
> Perhaps that's OK for most, but I argue not for parallel workers; maybe skip 
> it if
> BGWORKER_BYPASS_ALLOWCONN?
> 
> Also, the enforcement of datconnlimit and rolconnlimit is inconsistent in
> another way: our counting of the pre-existing processes is pretty random.
> CountDBConnections is not consistent with either the current set of processes
> that datconnlimit is enforced against, or my proposal to enforce it only 
> against
> regular backends.  It counts anything that is not
> AmBackgroundWorkerProcess, including AV workers and per-DB walsenders.
> I think it should count only regular backends, because anything else leads to
> weird inconsistencies in whether a rejection occurs.
> 
> The same applies to CountUs

Re: Connection limits/permissions, slotsync workers, etc

2024-12-27 Thread Tom Lane
"Zhijie Hou (Fujitsu)"  writes:
> On Thursday, December 26, 2024 3:50 AM Tom Lane 
>> I wonder if the AV launcher and slotsync worker could be reclassified as 
>> "auxiliary
>> processes" instead of being their own weird animal.

> It appears that the current aux processes do not run transactions as stated 
> in the
> comments[1], so we may need to somehow release this restriction to achieve the
> goal.

Ah, right, I'd forgotten about that restriction.  I agree that
removing it wouldn't be very reasonable.  However, I still would
rather avoid making the slotsync worker be its very own special
snowflake, because that offers no support for the next person
who wants to invent a new sort of specialized transaction-capable
process.

Attached is an alternative proposal that groups the autovac launcher
and slotsync worker into a new category of "special workers" (better
name welcome).  I chose to put them into the existing autovacFreeProcs
freelist, partly because the autovac launcher lives there already
but mostly because I don't want to add another freelist in a patch
we need to put into v17.  (As written, your patch is an ABI break.
It'd probably be safe to add a new freelist at the end of the struct
in v17, but I'm a little shy about that in view of recent bugs.  In
any case, a freelist having at most two members seems rather silly.)

I was amused but not terribly surprised to notice that the comments
in InitProcGlobal were *already* out of date, in that they didn't
account for the walsender PGPROC pool.  We have a remarkably bad
track record for updating comments that are more than about two
lines away from the code they describe :-(

Thoughts?

regards, tom lane

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 720ef99ee8..10d4fb4ea1 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -197,11 +197,12 @@ InitProcGlobal(void)
 
 	/*
 	 * Create and initialize all the PGPROC structures we'll need.  There are
-	 * five separate consumers: (1) normal backends, (2) autovacuum workers
-	 * and the autovacuum launcher, (3) background workers, (4) auxiliary
-	 * processes, and (5) prepared transactions.  Each PGPROC structure is
-	 * dedicated to exactly one of these purposes, and they do not move
-	 * between groups.
+	 * six separate consumers: (1) normal backends, (2) autovacuum workers and
+	 * special workers, (3) background workers, (4) walsenders, (5) auxiliary
+	 * processes, and (6) prepared transactions.  (For largely-historical
+	 * reasons, we combine autovacuum and special workers into one category
+	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
+	 * one of these purposes, and they do not move between groups.
 	 */
 	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
@@ -269,12 +270,13 @@ InitProcGlobal(void)
 		}
 
 		/*
-		 * Newly created PGPROCs for normal backends, autovacuum and bgworkers
-		 * must be queued up on the appropriate free list.  Because there can
-		 * only ever be a small, fixed number of auxiliary processes, no free
-		 * list is used in that case; InitAuxiliaryProcess() instead uses a
-		 * linear search.   PGPROCs for prepared transactions are added to a
-		 * free list by TwoPhaseShmemInit().
+		 * Newly created PGPROCs for normal backends, autovacuum workers,
+		 * special workers, bgworkers, and walsenders must be queued up on the
+		 * appropriate free list.  Because there can only ever be a small,
+		 * fixed number of auxiliary processes, no free list is used in that
+		 * case; InitAuxiliaryProcess() instead uses a linear search.  PGPROCs
+		 * for prepared transactions are added to a free list by
+		 * TwoPhaseShmemInit().
 		 */
 		if (i < MaxConnections)
 		{
@@ -282,13 +284,13 @@ InitProcGlobal(void)
 			dlist_push_tail(&ProcGlobal->freeProcs, &proc->links);
 			proc->procgloballist = &ProcGlobal->freeProcs;
 		}
-		else if (i < MaxConnections + autovacuum_max_workers + 1)
+		else if (i < MaxConnections + autovacuum_max_workers + NUM_SPECIAL_WORKER_PROCS)
 		{
-			/* PGPROC for AV launcher/worker, add to autovacFreeProcs list */
+			/* PGPROC for AV or special worker, add to autovacFreeProcs list */
 			dlist_push_tail(&ProcGlobal->autovacFreeProcs, &proc->links);
 			proc->procgloballist = &ProcGlobal->autovacFreeProcs;
 		}
-		else if (i < MaxConnections + autovacuum_max_workers + 1 + max_worker_processes)
+		else if (i < MaxConnections + autovacuum_max_workers + NUM_SPECIAL_WORKER_PROCS + max_worker_processes)
 		{
 			/* PGPROC for bgworker, add to bgworkerFreeProcs list */
 			dlist_push_tail(&ProcGlobal->bgworkerFreeProcs, &proc->links);
@@ -358,8 +360,11 @@ InitProcess(void)
 	if (IsUnderPostmaster)
 		RegisterPostmasterChildActive();
 
-	/* Decide which list should supply our PGPROC. */
-	if (AmAutoVacuumLauncherProcess() || AmAutoVacuumWorkerProcess())
+	/*
+	 * Decide which list sh

Re: add support for the old naming libs convention on windows (ssleay32.lib and libeay32.lib)

2024-12-27 Thread Daniel Gustafsson
> On 9 Dec 2024, at 07:25, Darek Ślusarczyk  wrote:

> I've prepared another patch:
> - it prioritizes libssl and libcrypto over ssleay32 and libeay32
> - it checks ssleay32 and libeay32 on windows only
> - I tested it locally on both lnx/win enforcing various possible scenarios

I'm neither a Windows or Meson expert but this seems correct to me and matches
the autoconf stanza for OpenSSL libraries on Windows.  CC:ing one of the
project experts on Meson for a second opinion.

--
Daniel Gustafsson





Re: POC: make mxidoff 64 bits

2024-12-27 Thread Maxim Orlov
On Wed, 18 Dec 2024 at 13:21, Heikki Linnakangas  wrote:

>
> Attached is some more cleanup on top of patch set v9, removing more dead
> stuff related to wraparound. I also removed the oldestOffsetKnown
> variable and related code. It was needed to deal with clusters upgraded
> from buggy 9.3 and 9.4 era versions, but now that pg_upgrade will
> rewrite the SLRUs, it's no longer needed.
>
Yep, multixact.c looks correct to me. As for "XXX could use
SimpleLruTruncate()", yes, for sure.
Actually, xl_multixact_truncate.startTruncMemb is also no longer needed, is
it?


> Does the pg_upgrade code work though, if you have that buggy situation
> where oldestOffsetKnown == false ?
>
> >
> >   if (!TransactionIdIsValid(*xactptr))
> >   {
> >   /* Corner case 3: we must be looking at unused
> slot zero */
> >   Assert(offset == 0);
> >   continue;
> >   }
>
> After upgrade, this corner case 3 would *not* happen on offset == 0. So
> looks like we're still missing test coverage for this upgrade corner case.
>
Am I understanding correctly that you want to have a test corresponding to
the buggy 9.3 and 9.4 era versions?
Do you think we could imitate this scenario on a current master branch like
that:
1) generate a couple of offsets segments for the first table;
2) generate more segments for a second table;
3) drop first table;
4) stop pg cluster;
5) remove pg_multixact/offsets/
6) upgrade?

PFA, v10-0016-TEST-try-to-replicate-buggy-oldest-offset.patch
This test will fail now, for an obvious reason, but is this case a relevant
one?

-- 
Best regards,
Maxim Orlov.


Re: [PATCHES] Post-special page storage TDE support

2024-12-27 Thread Greg Sabino Mullane
Re-reading this thread, and this has been nagging me:

On Mon, Nov 13, 2023 at 3:03 PM David Christensen <
david.christen...@crunchydata.com> wrote:

>> - so why do we need to recompute offsets on every single page?  I'd
>> instead add a distinct offset variable for each feature.
>
> This would work iff there is a single page feature set across all pages in
> the cluster; I'm not sure we don't want more flexibility here.
>

I'm quite skeptical about the usefulness of this; it seems to be at best a
footgun to have per-page features. Also, none of the examples given so far
(e.g. encryption, better checksums) are of a non-global nature, so did you
have specific examples in mind?

Cheers,
Greg


Re: [PATCHES] Post-special page storage TDE support

2024-12-27 Thread Greg Sabino Mullane
On Fri, Dec 27, 2024 at 10:12 AM Bruce Momjian  wrote:

> The value of TDE is limited from a security value perspective, but high on
> the list of security policy requirements.  Our community is much more
> responsive to actual value vs policy compliance value.
>

True. The number of forks, though, makes me feel this is a "when", not "if"
feature. Has there been any other complex feature forked/implemented by so
many? Maybe columnar storage?

Cheers,
Greg


Re: Test to dump and restore objects left behind by regression

2024-12-27 Thread Daniel Gustafsson
> On 20 Dec 2024, at 11:01, Ashutosh Bapat  wrote:
> On Wed, Dec 18, 2024 at 7:39 PM Daniel Gustafsson  wrote:
>> 
>>> On 18 Dec 2024, at 12:28, Ashutosh Bapat  
>>> wrote:

>> +   if (   $ENV{PG_TEST_EXTRA}
>> +   && $ENV{PG_TEST_EXTRA} =~ /\bregress_dump_test\b/)
>> Should this also test that $oldnode and $newnode have matching pg_version to
>> keep this from running in a cross-version upgrade test?  While it can be 
>> argued
>> that running this in a cross-version upgrade is breaking it and getting to 
>> keep
>> both pieces, it's also not ideal to run a resource intensive test we know 
>> will
>> fail.  (It can't be done at this exact callsite, just picked to illustrate.)
> 
> You already wrote it in parenthesis. At the exact callsite $oldnode
> and $newnode can not be of different versions. In fact newnode is yet
> to be created at this point. But $oldnode has the same version as the
> server run from the code. In a cross-version upgrade this test will
> not be executed. I am confused as to what this comment is about.

Sure, it can't be checked until $newnode is created, but it seems like a cheap
test to ensure it's not executed as part of someone's cross-version tests.

>> +  my $format_spec = substr($format, 0, 1);
>> This doesn't seem great for readability, how about storing the formats and
>> specfiers in an array of Perl hashes which can be iterated over with
>> descriptive names, like $format{'name'} and $format{'spec'}?
> 
> Instead of an array of hashes, I used a single hash with format
> description as key and format spec as value. Hope that's acceptable.

LGTM.

--
Daniel Gustafsson





Bug and memory leaks with access to file links with long names (Windows, MSVS)

2024-12-27 Thread Vladlen Popolitov

Hi!

A PostgreSQL build under Windows with MS Visual Studio uses its own functions
to work with links (unlike msys2, which provides its own). If a database has a
link pointing to a location longer than 130 characters, the function
pgreadlink fails to recognise this link and cancels the query.

The reason for the error is that the buffer for the link name is too small -
MAX_PATH characters - although this buffer must have room for at least
2 * MAX_PATH: the substitution and print names of the link.

 How to reproduce:

If the build directory path is ~100 characters or longer, the
pg_rewind/004_pg_xlog_symlink test will fail (Windows, MSVS build).

 Steps to reproduce:

call "C:\Program Files\Microsoft Visual 
Studio\2022\Community\VC\Auxiliary\Build\vcvarsall.bat" x64

SET LC_MESSAGES=C

SET 
BUILDDIR=c:\postgresql-builds\length_30\length_40\length_50\length_60\length_70\length_80\length_90\length100\

meson setup %BUILDDIR%  --prefix=c:\postgresql-builds\install
c:
cd %BUILDDIR%
meson compile -C %BUILDDIR%
meson install -C %BUILDDIR%
meson test -C %BUILDDIR%

 Memory leak:
In case of error, the code branch reporting the error does not release the
Windows file handle or the Windows heap allocation for the error message text.

 Solution:
The proposed patch fixes the link information buffer size, changing it to the
documented value MAXIMUM_REPARSE_DATA_BUFFER_SIZE, and fixes the memory leaks:
the message is copied to a buffer on the stack sized for the maximal message
size of 64kB.

--
Best regards,

Vladlen Popolitov.
From 4eb8afe77064a0cf3d815aa0ffa1ebf815fd1165 Mon Sep 17 00:00:00 2001
From: Vladlen Popolitov 
Date: Fri, 27 Dec 2024 15:23:30 +0300
Subject: [PATCH v1] Fix bug with access to file links with long name (Windows,
 MSVC)

---
 src/port/dirmod.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index f98d5a7bf2..dbdd8a6e9f 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -310,7 +310,12 @@ pgreadlink(const char *path, char *buf, size_t size)
 {
 	DWORD		attr;
 	HANDLE		h;
-	char		buffer[MAX_PATH * sizeof(WCHAR) + offsetof(REPARSE_JUNCTION_DATA_BUFFER, PathBuffer)];
+/*
+ * The buffer size is described in the documentation:
+ * https://learn.microsoft.com/en-us/windows-hardware/drivers/ifs/fsctl-set-reparse-point
+ * It stores paths for SubstituteName and PrintName.
+ */
+	char		buffer[MAXIMUM_REPARSE_DATA_BUFFER_SIZE];
 	REPARSE_JUNCTION_DATA_BUFFER *reparseBuf = (REPARSE_JUNCTION_DATA_BUFFER *) buffer;
 	DWORD		len;
 	int			r;
@@ -350,6 +355,8 @@ pgreadlink(const char *path, char *buf, size_t size)
 		 NULL))
 	{
 		LPSTR		msg;
+		/* the maximal size of the message returned by FormatMessage is 64Kb */
+		char		msgBuffer[0x10000];
 
 		errno = 0;
 		FormatMessage(FORMAT_MESSAGE_ALLOCATE_BUFFER |
@@ -358,17 +365,24 @@ pgreadlink(const char *path, char *buf, size_t size)
 	  NULL, GetLastError(),
 	  MAKELANGID(LANG_ENGLISH, SUBLANG_DEFAULT),
 	  (LPSTR) &msg, 0, NULL);
+		strncpy(msgBuffer, msg, sizeof(msgBuffer));
+		if (strlen(msg) >= sizeof(msgBuffer))
+		{
+			msgBuffer[sizeof(msgBuffer) - 1] = 0;
+		}
 #ifndef FRONTEND
+		LocalFree(msg);
+		CloseHandle(h);
 		ereport(ERROR,
 (errcode_for_file_access(),
  errmsg("could not get junction for \"%s\": %s",
-		path, msg)));
+		path, msgBuffer)));
 #else
 		fprintf(stderr, _("could not get junction for \"%s\": %s\n"),
 path, msg);
-#endif
 		LocalFree(msg);
 		CloseHandle(h);
+#endif
 		errno = EINVAL;
 		return -1;
 	}
-- 
2.42.0.windows.2



Re: Exists pull-up application with JoinExpr

2024-12-27 Thread Ilia Evdokimov

Hi Alena,

Thank you for your work on subqueries with JOIN.

Have you considered the scenario where the subquery includes a qual like 
(tc.aid = 1)? When I tried executing these queries I received different 
results. In my opinion, to prevent this, we should add filters for such 
quals within the loop 'foreach (lc, all_clauses)'.


EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT * FROM ta
WHERE EXISTS (SELECT * FROM tb, tc WHERE ta.id = tb.id AND tc.aid = 1);
  QUERY PLAN
--
 Hash Join (actual rows=1 loops=1)
   Hash Cond: (ta.id = tb.id)
   Buffers: local hit=3
   ->  Seq Scan on ta (actual rows=3 loops=1)
 Buffers: local hit=1
   ->  Hash (actual rows=3 loops=1)
 Buckets: 4096  Batches: 1  Memory Usage: 33kB
 Buffers: local hit=2
 ->  HashAggregate (actual rows=3 loops=1)
   Group Key: tb.id
   Batches: 1  Memory Usage: 121kB
   Buffers: local hit=2
   ->  Nested Loop (actual rows=3 loops=1)
 Buffers: local hit=2
 ->  Seq Scan on tb (actual rows=3 loops=1)
   Buffers: local hit=1
 ->  Materialize (actual rows=1 loops=3)
   Storage: Memory  Maximum Storage: 17kB
   Buffers: local hit=1
   ->  Seq Scan on tc (actual rows=1 loops=1)
 Filter: (aid = 1)
 Rows Removed by Filter: 1
 Buffers: local hit=1
(23 rows)



EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb JOIN tc ON ta.id =
tb.id WHERE tc.aid = 1);
    QUERY PLAN
---
 Seq Scan on ta (actual rows=1 loops=1)
   Filter: EXISTS(SubPlan 1)
   Rows Removed by Filter: 2
   Buffers: local hit=6
   SubPlan 1
 ->  Nested Loop (actual rows=0 loops=3)
   Buffers: local hit=5
   ->  Index Only Scan using tb_pkey on tb (actual rows=0 loops=3)
 Index Cond: (id = ta.id)
 Heap Fetches: 1
 Buffers: local hit=4
   ->  Seq Scan on tc (actual rows=1 loops=1)
 Filter: (aid = 1)
 Buffers: local hit=1
(14 rows)

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.





Re: [Bug Fix]standby may crash when switching-over in certain special cases

2024-12-27 Thread px shi
Hi,


I inserted the following code in walsender.c:2509(v15.8) to reproduce the
issue.


{
	WalSnd	   *walsnd = MyWalSnd;

	SpinLockAcquire(&walsnd->mutex);

	if (walsnd->flush % wal_segment_size == 0 && walsnd->sentPtr == walsnd->flush &&
		walsnd->flush > wal_segment_size)
	{
		for (int i = 0; i < max_wal_senders; i++)
		{
			WalSnd	   *walsnder = &WalSndCtl->walsnds[i];

			if (walsnder->pid == walsnd->pid)
				continue;

			SpinLockAcquire(&walsnder->mutex);

			if (walsnder->pid == 0)
			{
				SpinLockRelease(&walsnder->mutex);
				continue;
			}

			if (walsnder->flush % wal_segment_size == 0 &&
				walsnder->flush > wal_segment_size)
				elog(PANIC, "simulate primary crashed, flushedlsn %X/%X",
					 LSN_FORMAT_ARGS(walsnder->flush));

			SpinLockRelease(&walsnder->mutex);
		}
	}

	SpinLockRelease(&walsnd->mutex);
}


Regards,

Pixian Shi

On Thu, Oct 10, 2024 at 15:01, px shi  wrote:

>  Although I've not tested it because I don't have good way to reproduce
>> the problem
>
>
>
> I used GDB to reproduce the issue by killing the primary node with kill -9
> when the standby’s flushedlsn was at the beginning of a WAL segment.
>
>
> However, this causes to request the primary to stream data from the
>> requested location again, even if the standby has already valid data. It
>> could be wasteful, for example, when applying WAL record is delayed due to
>> conflict with recovery or recovery_min_apply_delay.
>
>
>  I had overlooked this problem, thanks for pointing that out.
>
> I think that comparing walrcv->receiveStart with recptr would be a better
> approach. Specifically, if walrcv->receiveStart is greater than recptr, then
> reset walrcv->flushedUpto. This way, we can prevent fetching WAL data from
> the primary when the standby already has valid data. A new patch has been
> submitted.
>
>
> Regards,
>
> Pixian Shi
>
>
> On Wed, Oct 9, 2024 at 14:46, Yugo NAGATA  wrote:
>
>> On Mon, 30 Sep 2024 15:14:54 +0800
>> px shi  wrote:
>>
>> > Thanks for responding.
>> >
>> >
>> > > It is odd that the standby server crashes when
>> >
>> > replication fails because the standby would keep retrying to get the
>> >
>> > next record even in such case.
>> >
>> >
>> >  As I mentioned earlier, when replication fails, it retries to establish
>> > streaming replication. At this point, the value of *walrcv->flushedUpto
>> *is
>> > not necessarily the data actually flushed to disk. However, the startup
>> > process mistakenly believes that the latest flushed LSN is
>> > *walrcv->flushedUpto* and attempts to open the corresponding WAL file,
>> > which doesn't exist, leading to a file open failure and causing the
>> startup
>> > process to PANIC.
>>
>> Although I've not tested it because I don't have good way to reproduce
>> the problem,
>> the patch seems to fix the issue by rewinding walrcv->flushedUpto to the
>> requested
>> WAL location always when the streaming is requested to start. However,
>> this causes
>> to request the primary to stream data from the requested location again,
>> even if
>> the standby has already valid data. It could be wasteful, for example,
>> when applying WAL
>> record is delayed due to conflict with recovery or
>> recovery_min_apply_delay.
>>
>> It might be better if it could notice that the requested record is not present
>> in the primary's
>> timeline before requesting the streaming, but I don't have a good
>> solution for now.
>>
>> Regards,
>> Yugo Nagata
>>
>> > Regards,
>> > Pixian Shi
>> >
>> > Yugo NAGATA  于2024年9月30日周一 13:47写道:
>> >
>> > > On Wed, 21 Aug 2024 09:11:03 +0800
>> > > px shi  wrote:
>> > >
>> > > > On Wed, Aug 21, 2024 at 00:49, Yugo Nagata  wrote:
>> > > >
>> > > > >
>> > > > >
>> > > > > > Is s1 a cascading standby of s2? If otherwise s1 and s2 is the
>> > > standbys
>> > > > > of
>> > > > > > the primary server respectively, it is not surprising that s2
>> has
>> > > > > progressed
>> > > > > > far than s1 when the primary fails. I believe that this is the
>> case
>> > > you
>> > > > > should
>> > > > > > use pg_rewind. Even if flushedUpto is reset as proposed in your
>> > > patch,
>> > > > > s2 might
>> > > > > > already have applied a WAL record that s1 has not processed
>> yet, and
>> > > > > there
>> > > > > > would be no gurantee that subsecuent applys suceed.
>> > > > >
>> > > > >
>> > > >  Thank you for your response. In my scenario, s1 and s2 is the
>> standbys
>> > > of
>> > > > the primary server respectively, and s1 a synchronous standby and
>> s2 is
>> > > an
>> > > > asynchronous standby. You mentioned that if s2's replay progress is
>> ahead
>> > > > of s1, pg_rewind should be used. However, what I'm trying to
>> address is
>> > > an
>> > > > issue

Re: Connection limits/permissions, slotsync workers, etc

2024-12-27 Thread Tom Lane
Also, here's a patch for the rest of what I was talking about.

We'll need to back-patch this given that the CVE-2024-10978 changes
caused these sorts of problems in all branches, but I've not yet
attempted to back-patch.  It looks like it might be a bit painful
thanks to past code churn in these areas.

I didn't do anything about the idea of making rolconnlimit applicable
to superusers.  If we do that at all, it should only be in HEAD.
Also, I got a shade less enthusiastic about it after noting that this
logic is parallel to that for datconnlimit, and it does seems sensible
to allow superusers to ignore datconnlimit.  Maybe it's fine for the
two limits to operate differently, but I'm unsure.

Also, it probably would make sense to rename PGPROC.isBackgroundWorker
to isRegularBackend (inverting the sense of the boolean), but that
doesn't seem like back-patch material either, so I didn't include it
here.  I think we can get away with a subtle adjustment of which
processes that flag is set for in the back branches, but not with
renaming it.

regards, tom lane
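
As an illustration of the counting rule described in the patch below (the
limits are enforced only against regular backends, and only regular backends
are counted against them), here is a standalone model; the enum, struct and
function names are invented for this sketch and it is not the patch's code:

#include <stdio.h>

typedef enum ProcTypeSketch
{
	PROC_REGULAR_BACKEND,
	PROC_AUTOVAC_WORKER,
	PROC_BG_WORKER,
	PROC_WALSENDER
} ProcTypeSketch;

typedef struct ProcSketch
{
	ProcTypeSketch type;
	int			databaseId;
} ProcSketch;

/*
 * Count connections to a database, regular backends only; every other
 * process type draws from its own PGPROC pool and is ignored here.
 */
static int
count_db_connections(const ProcSketch *procs, int nprocs, int dbid)
{
	int			count = 0;

	for (int i = 0; i < nprocs; i++)
	{
		if (procs[i].type != PROC_REGULAR_BACKEND)
			continue;
		if (procs[i].databaseId == dbid)
			count++;
	}
	return count;
}

int
main(void)
{
	ProcSketch	procs[] = {
		{PROC_REGULAR_BACKEND, 1},
		{PROC_WALSENDER, 1},
		{PROC_AUTOVAC_WORKER, 1},
		{PROC_REGULAR_BACKEND, 1},
	};
	int			datconnlimit = 2;
	int			used = count_db_connections(procs, 4, 1);

	printf("regular backends in db 1: %d (limit %d) -> %s\n",
		   used, datconnlimit,
		   used >= datconnlimit ? "reject new regular backend" : "allow");
	return 0;
}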

From 84017bf4da912cad44416b729860baf2edfe66fd Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Fri, 27 Dec 2024 15:36:31 -0500
Subject: [PATCH v1] Exclude parallel workers from connection privilege/limit
 checks.

Cause parallel workers to not check datallowconn, rolcanlogin, and
ACL_CONNECT privileges.  The leader already checked these things
(except for rolcanlogin which might have been checked for a different
role).  Re-checking can accomplish little except to induce unexpected
failures in applications that might not even be aware that their query
has been parallelized.  We already had the principle that parallel
workers rely on their leader to pass a valid set of authorization
information, so this change just extends that a bit further.

Also, modify the ReservedConnections, datconnlimit and rolconnlimit
logic so that these limits are only enforced against regular backends,
and only regular backends are counted while checking if the limits
were already reached.  Previously, background processes that had an
assigned database or role were subject to these limits (with rather
random exclusions for autovac workers and walsenders), and the set of
existing processes that counted against each limit was quite haphazard
as well.  The point of these limits, AFAICS, is to ensure the
availability of PGPROC slots for regular backends.  Since all other
types of processes have their own separate pools of PGPROC slots, it
makes no sense either to enforce these limits against them or to count
them while enforcing the limit.

While edge-case failures of these sorts have been possible for a
long time, the problem got a good deal worse with commit 5a2fed911
(CVE-2024-10978), which caused parallel workers to make these checks
using the leader's current role instead of its AuthenticatedUserId,
thus allowing parallel queries to fail after SET ROLE.  That older
behavior was fairly accidental and I have no desire to return to it.

This patch allows reverting 73c9f91a1 which was an emergency hack
to suppress these same checks in some cases.  It wasn't complete,
as shown by a recent bug report from Laurenz Albe.  We can also
revert fd4d93d26 and 492217301, which hacked around the problems in
one regression test.

Like 5a2fed911, back-patch to supported branches (which sadly no
longer includes v12).

Discussion: https://postgr.es/m/1808397.1735156...@sss.pgh.pa.us
---
 src/backend/access/transam/parallel.c| 10 --
 src/backend/access/transam/twophase.c|  2 +-
 src/backend/postmaster/bgworker.c|  4 +--
 src/backend/storage/ipc/procarray.c  |  4 +--
 src/backend/storage/lmgr/proc.c  |  4 +--
 src/backend/utils/init/miscinit.c|  8 -
 src/backend/utils/init/postinit.c| 40 
 src/include/miscadmin.h  |  1 +
 src/include/storage/proc.h   |  2 +-
 src/test/modules/worker_spi/worker_spi.c | 10 --
 10 files changed, 37 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1..60d95037b3 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1419,10 +1419,16 @@ ParallelWorkerMain(Datum main_arg)
 			fps->session_user_is_superuser);
 	SetCurrentRoleId(fps->outer_user_id, fps->role_is_superuser);
 
-	/* Restore database connection. */
+	/*
+	 * Restore database connection.  We skip connection authorization checks,
+	 * reasoning that (a) the leader checked these things when it started, and
+	 * (b) we do not want parallel mode to cause these failures, because that
+	 * would make use of parallel query plans not transparent to applications.
+	 */
 	BackgroundWorkerInitializeConnectionByOid(fps->database_id,
 			  fps->authenticated_user_id,
-			  0);
+			  BGWORKER_BYPASS_ALLOWCONN |
+			  BGWORKER_BYPAS

Re: [PATCHES] Post-special page storage TDE support

2024-12-27 Thread Bruce Momjian
On Thu, Dec 12, 2024 at 09:15:55AM -0600, David Christensen wrote:
> On Tue, Dec 10, 2024 at 12:54 AM Michael Paquier  wrote:
> >
> > On Wed, Mar 13, 2024 at 11:26:48AM -0500, David Christensen wrote:
> > > Enclosing v4 for this patch series, rebased atop the
> > > constant-splitting series[1].  For the purposes of having cfbot happy,
> > > I am including the prerequisites as a squashed commit v4-, however
> > > this is not technically part of this series.
> >
> > The last update of this thread is from march 2024, with no replies and
> > no reviews.  Please note that this fails in the CI so I'd suggest a
> > rebase for now, and I have marked the patch as waiting on author.  If
> > there is a lack of interest, well..
> 
> I can't say there is a lack of interest from the author per se :), but
> not really seeing much in the way of community engagement makes me
> think it's largely unwanted.  I'd certainly be happy to rebase and
> reengage, but if it's not wanted at the conceptual level it doesn't
> seem worth the effort.  It's hard to interpret lack of response as
> "don't care, fine" vs "don't want" vs "haven't looked, -hackers is a
> firehose".

The value of TDE is limited from a security value perspective, but high
on the list of security policy requirements.  Our community is much more
responsive to actual value vs policy compliance value.

When I started focusing on TDE, it was going to require changes to
buffer reads/writes, WAL, and require a way to store secret keys.  I
thought those changes would be acceptable given TDE's security value. 
Once the file I/O changes were required, I think the balance tilted to
TDE requiring too many code changes given its security value (not policy
compliance value).

At least that is my analysis, and part of me wishes I was wrong.  I know
there are several commercial forks of TDE, mostly because companies are
more sensitive to policy compliance value, which translates to monetary
value for them.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.






Re: An improvement of ProcessTwoPhaseBuffer logic

2024-12-27 Thread Vitaly Davydov
On Friday, December 27, 2024 10:37 MSK, Michael Paquier  
wrote:

> So please see the attached.  You will note that RemoveTwoPhaseFile(),
> ProcessTwoPhaseBuffer() and TwoPhaseFilePath() now require a
> FullTransactionId, hence callers need to think about the epoch to use.
> That should limit future errors compared to passing the file name as
> optional argument.

In general, I like your solution to use FullTransactionId. I haven't found any 
evident problems. I would just like to propose creating a couple of new 
functions like RemoveTwoPhaseFileInCurrentEpoch and TwoPhaseFilePathInCurrentEpoch 
which accept TransactionId instead of FullTransactionId. It may make the code 
clearer and result in less boilerplate code. I did some hand testing, and it 
seems the problem is gone with the patch. Thank you!

As an idea, I would like to propose storing FullTransactionId in the global 
transaction state instead of TransactionId. I'm not sure whether it would 
consume significant additional memory, but it would make the code clearer and 
probably result in fewer locks.

With best regards,
Vitaly
From 1e74e01a2485e2b3a59f880a5eaf86af1af81cc5 Mon Sep 17 00:00:00 2001
From: Vitaly Davydov 
Date: Fri, 27 Dec 2024 18:03:28 +0300
Subject: [PATCH] Some cosmetic fixes

---
 src/backend/access/transam/twophase.c | 92 +++
 1 file changed, 37 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 01f5d728be..9418fd74ee 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -228,6 +228,7 @@ static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
 const char *gid, TimestampTz prepared_at, Oid owner,
 Oid databaseid);
 static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning);
+static void RemoveTwoPhaseFileInCurrentEpoch(TransactionId xid, bool giveWarning);
 static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
 
 /*
@@ -934,6 +935,22 @@ TwoPhaseFilePath(char *path, FullTransactionId fxid)
 	XidFromFullTransactionId(fxid));
 }
 
+static inline int
+TwoPhaseFilePathInCurrentEpoch(char *path, TransactionId xid)
+{
+	uint32		epoch;
+	FullTransactionId fxid;
+	FullTransactionId nextFullXid;
+
+	/* Use current epoch */
+	nextFullXid = ReadNextFullTransactionId();
+	epoch = EpochFromFullTransactionId(nextFullXid);
+
+	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+
+	return TwoPhaseFilePath(path, fxid);
+}
+
 /*
  * 2PC state file format:
  *
@@ -1279,16 +1296,8 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 	pg_crc32c	calc_crc,
 file_crc;
 	int			r;
-	FullTransactionId fxid;
-	FullTransactionId nextFullXid;
-	uint32		epoch;
-
-	/* get current epoch */
-	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
 
-	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-	TwoPhaseFilePath(path, fxid);
+	TwoPhaseFilePathInCurrentEpoch(path, xid);
 
 	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
@@ -1656,18 +1665,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * be from the current epoch.
 	 */
 	if (ondisk)
-	{
-		uint32		epoch;
-		FullTransactionId nextFullXid;
-		FullTransactionId fxid;
-
-		/* get current epoch */
-		nextFullXid = ReadNextFullTransactionId();
-		epoch = EpochFromFullTransactionId(nextFullXid);
-
-		fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-		RemoveTwoPhaseFile(fxid, true);
-	}
+		RemoveTwoPhaseFileInCurrentEpoch(xid, true);
 
 	MyLockedGxact = NULL;
 
@@ -1723,6 +1721,21 @@ RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning)
 	 errmsg("could not remove file \"%s\": %m", path)));
 }
 
+static void
+RemoveTwoPhaseFileInCurrentEpoch(TransactionId xid, bool giveWarning)
+{
+	uint32		epoch;
+	FullTransactionId nextFullXid;
+	FullTransactionId fxid;
+
+	/* get current epoch */
+	nextFullXid = ReadNextFullTransactionId();
+	epoch = EpochFromFullTransactionId(nextFullXid);
+
+	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+	RemoveTwoPhaseFile(fxid, true);
+}
+
 /*
  * Recreates a state file. This is used in WAL replay and during
  * checkpoint creation.
@@ -1735,21 +1748,13 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 	char		path[MAXPGPATH];
 	pg_crc32c	statefile_crc;
 	int			fd;
-	FullTransactionId fxid;
-	FullTransactionId nextFullXid;
-	uint32		epoch;
 
 	/* Recompute CRC */
 	INIT_CRC32C(statefile_crc);
 	COMP_CRC32C(statefile_crc, content, len);
 	FIN_CRC32C(statefile_crc);
 
-	/* Use current epoch */
-	nextFullXid = ReadNextFullTransactionId();
-	epoch = EpochFromFullTransactionId(nextFullXid);
-
-	fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
-	TwoPhaseFilePath(path, fxid);
+	TwoPhaseFilePathInCurrentEpoch(path, xid);
 
 	fd = OpenTransientFile(path,
 		   O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
@@ -2286,7 +2291,6 @@ ProcessTwoPhaseBuffer(Fu

Re: Re: proposal: schema variables

2024-12-27 Thread jian he
hi.

+ if (!OidIsValid(varid))
+ AcceptInvalidationMessages();
+ else if (OidIsValid(varid))
+ LockDatabaseObject(VariableRelationId, varid, 0, AccessShareLock);

we can change it to
+ if (!OidIsValid(varid))
+ AcceptInvalidationMessages();
+ else
+ LockDatabaseObject(VariableRelationId, varid, 0, AccessShareLock);

inval_count isn't explained at all, while other places do explain it.
Can we add some text explaining inval_count? (I didn't fully understand
this part, which is why I am asking.)

It seems none of these three ereport(ERROR...) calls in IdentifyVariable
have regression tests;
I think we should have them. Am I missing something?

create variable v2 as int;
let v2.a = 1;
ERROR:  type "integer" of target session variable "public.v2" is not a
composite type
LINE 1: let v2.a = 1;
^
the error messages look weird.
IMO, it should either be
"type of session variable "public.v2" is not a composite type"
or
"session variable "public.v2" don't have attribute "a"

also, the above can be as a regress test for:
+ if (attrname && !is_rowtype)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("type \"%s\" of target session variable \"%s.%s\" is not a
composite type",
+ format_type_be(typid),
+ get_namespace_name(get_session_variable_namespace(varid)),
+ get_session_variable_name(varid)),
+ parser_errposition(pstate, stmt->location)));
since we don't have coverage tests for it.


+ if (coerced_expr == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATATYPE_MISMATCH),
+ errmsg("variable \"%s.%s\" is of type %s,"
+ " but expression is of type %s",
+ get_namespace_name(get_session_variable_namespace(varid)),
+ get_session_variable_name(varid),
+ format_type_be(typid),
+ format_type_be(exprtypid)),
+ errhint("You will need to rewrite or cast the expression."),
+ parser_errposition(pstate, exprLocation((Node *) expr;

Generally, errmsg(...) should be put on one line for better grep-ability,
so we can change it to:
+errmsg("variable \"%s.%s\" is of type %s, but expression is of type %s"...)

Also, no coverage tests?
A simple test case for it:
create variable v2 as int;
let v2 = '1'::jsonb;

---<<<>>>--
+let_target:
+ ColId opt_indirection
+ {
+ $$ = list_make1(makeString($1));
+ if ($2)
+  $$ = list_concat($$,
+   check_indirection($2, yyscanner));
+ }
If you look closely, it seems the indentation level is wrong on the
line "$$ = list_concat($$,"?

---<<<>>>--
I did some refactoring of the privilege checks in session_variables.sql
to make the tests neater; please check the attached patch.
diff --git a/src/test/regress/expected/session_variables.out b/src/test/regress/expected/session_variables.out
index 9f75128c42..4873ac5438 100644
--- a/src/test/regress/expected/session_variables.out
+++ b/src/test/regress/expected/session_variables.out
@@ -115,85 +115,126 @@ ALTER DEFAULT PRIVILEGES
 SET ROLE TO regress_variable_owner;
 CREATE VARIABLE svartest.var1 AS int;
 SET ROLE TO DEFAULT;
-\dV+ svartest.var1
-List of variables
-  Schema  | Name |  Type   | Collation | Owner  |Access privileges | Description 
---+--+-+---++--+-
- svartest | var1 | integer |   | regress_variable_owner | regress_variable_owner=rw/regress_variable_owner+| 
-  |  | |   || regress_variable_reader=r/regress_variable_owner | 
+--should be ok, since ALTER DEFAULT PRIVILEGES
+--allows regress_variable_reader to have SELECT privilege
+SELECT has_session_variable_privilege('regress_variable_reader', 'svartest.var1', 'SELECT'); -- t
+ has_session_variable_privilege 
+
+ t
 (1 row)
 
 DROP VARIABLE svartest.var1;
 DROP SCHEMA svartest;
 DROP ROLE regress_variable_reader;
--- check WITH GRANT OPTION
+--begin of check GRANT WITH GRANT OPTION and REVOKE GRANTED BY
 CREATE ROLE regress_variable_r1;
 CREATE ROLE regress_variable_r2;
 SET ROLE TO regress_variable_owner;
-CREATE VARIABLE var1 AS int;
+CREATE VARIABLE var1 AS int;  --var1 will owned by regress_variable_owner
 GRANT SELECT ON VARIABLE var1 TO regress_variable_r1 WITH GRANT OPTION;
 SET ROLE TO regress_variable_r1;
 GRANT SELECT ON VARIABLE var1 TO regress_variable_r2 WITH GRANT OPTION;
 SET ROLE TO DEFAULT;
-SELECT varacl FROM pg_variable WHERE varname = 'var1';
-   varacl
--
- {regress_variable_owner=rw/regress_variable_owner,regress_variable_r1=r*/regress_variable_owner,regress_variable_r2=r*/regress_variable_r1}
+SELECT has_session_variable_privilege('regress_variable_r1', 'pu

Re: Add the ability to limit the amount of memory that can be allocated to backends.

2024-12-27 Thread James Hunter
Reviving this thread, because I am thinking about something related --
please ignore the "On Fri, Dec 27, 2024" date, this seems to be an
artifact of me re-sending the message, from the list archive. The
original message was from January 28, 2024.

On Fri, Dec 27, 2024 at 11:02 AM Tomas Vondra
 wrote:
>
> Firstly, I agree with the goal of having a way to account for memory
> used by the backends, and also ability to enforce some sort of limit.
> It's difficult to track the memory at the OS level (interpreting RSS
> values is not trivial), and work_mem is not sufficient to enforce a
> backend-level limit, not even talking about a global limit.
>
> But as I said earlier, it seems quite strange to start by introducing
> some sort of global limit, combining memory for all backends. I do
> understand that the intent is to have such global limit in order to
> prevent issues with the OOM killer and/or not to interfere with other
> stuff running on the same machine. And while I'm not saying we should
> not have such limit, every time I wished to have a memory limit it was a
> backend-level one. Ideally a workmem-like limit that would "adjust" the
> work_mem values used by the optimizer (but that's not what this patch
> aims to do), or at least a backstop in case something goes wrong (say, a
> memory leak, OLTP application issuing complex queries, etc.).

I think what Tomas suggests is the right strategy. I am thinking of
something like:

1. Say we have a backend_work_mem limit. Then the total amount of
memory available on the entire system, for all queries, as work_mem,
would be backend_work_mem * max_connections.

2. We use this backend_work_mem to "adjust" work_mem values used by
the executor. (I don't care about the optimizer right now -- optimizer
just does its best to predict what will happen at runtime.)

At runtime, every node that uses work_mem currently checks its memory
usage against the session work_mem (and possibly hash_mem_multiplier)
GUC(s). Instead, now, every node would check against its node-specific
"adjusted" work_mem. If it exceeds this limit, it spills, using
existing logic.

In other words -- existing logic spills based on comparison to global
work_mem GUC. Let's make it spill, instead, based on comparison to an
operator-local "PlanState.work_mem" field.

And then let's set that "PlanState.work_mem" field based on a new
"backend_work_mem" GUC. Then no queries will run OOM (at least, not
due to work_mem -- we'll address other memory usages separately), so
they won't need to be canceled. Instead, they'll spill.

This strategy solves the ongoing problem of how to set work_mem, if
some queries have lots of operators and others don't -- now we just
set backend_work_mem, as a limit on the entire query's total work_mem.
And a bit of integration with the optimizer will allow us to
distribute the total backend_work_mem to individual execution nodes,
with the goal of minimizing spilling, without exceeding the
backend_work_mem limit.

Anyway, I revived this thread to see if there's interest in this sort
of strategy --

Thanks,
James




Re: [PATCHES] Post-special page storage TDE support

2024-12-27 Thread Bruce Momjian
On Fri, Dec 27, 2024 at 12:25:11PM -0500, Greg Sabino Mullane wrote:
> On Fri, Dec 27, 2024 at 10:12 AM Bruce Momjian  wrote:
> 
> The value of TDE is limited from a security value perspective, but high on
> the list of security policy requirements.  Our community is much more
> responsive to actual value vs policy compliance value.
> 
> 
> True. The number of forks, though, makes me feel this is a "when", not "if"
> feature. Has there been any other complex feature forked/implemented by so
> many? Maybe columnar storage?

That is a great question.  We have TDE implementations from EDB,
Fujitsu, Percona, Cybertec, and Crunchy Data, and perhaps others, and
that is a lot of duplicated effort.

As far as parallels, I think compatibility with Oracle and MSSQL are
areas that several companies have developed that the community is
unlikely to ever develop, I think because they are pure compatibility,
not functionality.  I think TDE having primarily policy compliance value
also might make it something the community never develops.

I think this blog post is the clearest I have seen about the technical
value vs. policy compliance value of TDE:


https://www.percona.com/blog/why-postgresql-needs-transparent-database-encryption-tde/

One possible way TDE could be added to community Postgres is if the code
changes required were reduced due to an API redesign.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.






Re: Windows meson build

2024-12-27 Thread Vladlen Popolitov

Vladlen Popolitov wrote on 2024-12-27 09:57:

Kohei Harikae (Fujitsu) wrote on 2024-12-27 04:31:

Hi,

Thank you for your advice.
I added a sample setting PKG_CONFIG_PATH based on your advice.

How do you think about this?

Regards,
Kohei Harikae



Hi,
From my point of view it looks good. Actual information is added,
excessive information is removed.
I think that in the phrase
"add the directory where the .pc file resides to PKG_CONFIG_PATH."
it would be better to add wording saying that PKG_CONFIG_PATH is a list of
directories separated by the PATH separator (; on Windows, : on POSIX).
Otherwise the reader has to search for additional information about this
variable in the meson documentation.


Hi

I think it is better to also describe the other way to configure external
libraries, if they do not have a .pc file: add the meson options defined by
PostgreSQL for this purpose (gettext library example):

 -Dextra_include_dirs=c:\postgres\gettext\x64-windows\include,c:\otherlibs\include
 -Dextra_lib_dirs=c:\postgres\gettext\x64-windows\lib,c:\otherlibs\lib

extra_include_dirs and extra_lib_dirs are options defined by PostgreSQL.
These options are comma-separated lists of directories, provided that no
value contains a comma itself.




--
Best regards,

Vladlen Popolitov.




Re: pure parsers and reentrant scanners

2024-12-27 Thread Heikki Linnakangas

On 26/12/2024 20:27, Peter Eisentraut wrote:

On 22.12.24 22:43, Andreas Karlsson wrote:

On 12/19/24 9:57 PM, Peter Eisentraut wrote:
Here is an updated patch set on top of what has been committed so 
far, with all the issues you pointed out addressed.


Other than the discussion of how old versions of flex we should 
support I think this set of patches is ready to be committed. I looked 
at it again and everything looks good.


I have committed these except the plpgsql one, which was still work in 
progress.  But I have progressed on this now and also converted the 
parser and put the local state into yyextra.  This gets rid of all 
internal global state now.  The patches for this are attached.  It's a 
lot of churn, but otherwise pretty standard stuff.


Looks good to me.

Along the way I noticed that the flex documentation now recommends a 
different way to set the yyextra type.  So I have changed the ones we 
already have to that newer style.  I inspected the generated C code and 
there wasn't any significant difference, so I'm not sure, but I figure 
if we're making changes in this area we might as well use the modern style.


+1. According to the flex NEWS file, this syntax was added in flex 
2.5.34, and we already require 2.5.35.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Logical replication timeout

2024-12-27 Thread RECHTÉ Marc
> Right, because of the reason I posted [1].
> 
> I updated the patch which did the same approach. It could pass my CI.
> Could you please apply on 17.2 and test it?
> 
> [1]:
> https://www.postgresql.org/message-id/OSCPR01MB14966B646506E0C9B81B3A4CFF5022%40OSCPR01MB14966.jpnprd01.prod.outlook.com

This is a considerable improvement: the cleanup phase took less than 30s
(compared to the former 200s).

However, we are talking about a 12s transaction that takes 64s overall to be
replicated.
In this particular case, the replication system spends most of its time
creating and deleting small files.
Would it not be possible to create just one spill file for the main transaction?




RE: Logical Replication of sequences

2024-12-27 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for updating the patch! Here are my comments.

01. SyncingRelationsState

```
 * SYNC_RELATIONS_STATE_NEEDS_REBUILD The subscription relations state is no
 * longer valid and the subscription relations should be rebuilt.
```

Can we follow the style like other lines? Like:

SYNC_RELATIONS_STATE_NEEDS_REBUILD indicates that the subscription relations
state is no longer valid and the subscription relations should be rebuilt.

02. FetchRelationStates()

```
/* Fetch tables and sequences that are in non-ready state. */
rstates = GetSubscriptionRelations(MySubscription->oid, true, true,
								   false);
```

I think rstates should be list_free_deep()'d after the foreach().

03. LogicalRepSyncSequences

```
char		slotname[NAMEDATALEN];
...
snprintf(slotname, NAMEDATALEN, "pg_%u_sync_sequences_" UINT64_FORMAT,
		 subid, GetSystemIdentifier());

/*
 * Here we use the slot name instead of the subscription name as the
 * application_name, so that it is different from the leader apply worker,
 * so that synchronous replication can distinguish them.
 */
LogRepWorkerWalRcvConn =
	walrcv_connect(MySubscription->conninfo, true, true,
				   must_use_password,
				   slotname, &err);
```

Hmm, IIUC the sequence sync worker does not require any replication slots.
I feel the variable name should be "application_name" and the comment can be 
updated.

04. LogicalRepSyncSequences

```
/* Get the sequences that should be synchronized. */
sequences = GetSubscriptionRelations(subid, false, true, false);
```

I think sequences should be list_free_deep()'d after the foreach_ptr().

05. LogicalRepSyncSequences

```
/*
 * COPY FROM does not honor RLS policies.  That is not a problem for
 * subscriptions owned by roles with BYPASSRLS privilege (or
 * superuser, who has it implicitly), but other roles should not be
 * able to circumvent RLS.  Disallow logical replication into RLS
 * enabled relations for such roles.
 */
if (check_enable_rls(RelationGetRelid(sequence_rel), InvalidOid, false) == RLS_ENABLED)
	ereport(ERROR,
			errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
			errmsg("user \"%s\" cannot replicate into sequence with row-level security enabled: \"%s\"",
				   GetUserNameFromId(GetUserId(), true),
				   RelationGetRelationName(sequence_rel)));
```

Can we really set a row-level security policy on sequences? I tested it but
could not.

```
postgres=# CREATE SEQUENCE s;
CREATE SEQUENCE
postgres=# ALTER TABLE s ENABLE ROW LEVEL SECURITY ;
ERROR:  ALTER action ENABLE ROW SECURITY cannot be performed on relation "s"
DETAIL:  This operation is not supported for sequences.
```

06. copy_sequence

```
appendStringInfo(&cmd, "SELECT c.oid, c.relkind\n"
				 "FROM pg_catalog.pg_class c\n"
				 "INNER JOIN pg_catalog.pg_namespace n\n"
				 "  ON (c.relnamespace = n.oid)\n"
				 "WHERE n.nspname = %s AND c.relname = %s",
				 quote_literal_cstr(nspname),
				 quote_literal_cstr(relname));
```

I feel the function is not very efficient because it can obtain only one
tuple, i.e., the information for one sequence at a time. I think you
basically copied it from fetch_remote_table_info(), which was fine there
because the tablesync worker handles only one table.

Can you obtain all sequences at once and check whether each sequence matches
or not?
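
Just to illustrate the idea (this is a sketch only, not taken from the patch;
"seqinfos" and "SeqInfo" are made-up names for a list of schema/name pairs
gathered by the caller), the query could be built once for all sequences:

```
/* assumes lib/stringinfo.h, utils/builtins.h and nodes/pg_list.h */
StringInfoData cmd;
ListCell   *lc;
bool		first = true;

initStringInfo(&cmd);
appendStringInfoString(&cmd,
					   "SELECT n.nspname, c.relname, c.oid, c.relkind\n"
					   "FROM pg_catalog.pg_class c\n"
					   "INNER JOIN pg_catalog.pg_namespace n\n"
					   "  ON (c.relnamespace = n.oid)\n"
					   "WHERE (n.nspname, c.relname) IN (");

foreach(lc, seqinfos)
{
	SeqInfo    *seq = (SeqInfo *) lfirst(lc);	/* hypothetical struct */

	if (!first)
		appendStringInfoString(&cmd, ", ");
	appendStringInfo(&cmd, "(%s, %s)",
					 quote_literal_cstr(seq->nspname),
					 quote_literal_cstr(seq->relname));
	first = false;
}
appendStringInfoChar(&cmd, ')');
```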

07. LogicalRepSyncSequences

```
/* LOG all the sequences synchronized during current batch. */
for (int i = 0; i < curr_batch_seq; i++)
{
	SubscriptionRelState *done_seq;

	done_seq = (SubscriptionRelState *) lfirst(list_nth_cell(sequences_not_synced,
															 (curr_seq - curr_batch_seq) + i));

	ereport(DEBUG1,
			errmsg_internal("logical replication synchronization for subscription \"%s\", sequence \"%s\" has finished",
							get_subscr

RE: Memory leak in pg_logical_slot_{get,peek}_changes

2024-12-27 Thread Zhijie Hou (Fujitsu)
On Thursday, December 26, 2024 8:01 PM vignesh C  wrote:
> 
> Here is an updated version which includes registers to reset the memory
> context that is in-line with a recently committed patch at [1].

Thanks for updating the patches! They look good to me.

Just to confirm, would the other stuff (streamed_txns) that is allocated under
CacheMemoryContext also leak memory? I think it's OK to change that
separately if it does, but I would just like to confirm whether there is a risk.

> [1] -
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=cfd6cbc
> f9be078fbdf9587014231a5772039b386

Best Regards,
Hou zj


Re: Crash: invalid DSA memory alloc request

2024-12-27 Thread Andreas 'ads' Scherbaum
On Wed, Dec 18, 2024 at 2:42 AM Andreas 'ads' Scherbaum  wrote:

> On 17/12/2024 22:32, Nathan Bossart wrote:
> > Committed.
> >
>
> Thanks, I see you backpatched it all the way to 13.
> Will see how far back I can test this, will take a while.
>

Was able to test HEAD in all branches back to 13, no crash seen.

-- 
Andreas 'ads' Scherbaum
German PostgreSQL User Group
European PostgreSQL User Group - Board of Directors
Volunteer Regional Contact, Germany - PostgreSQL Project


Re: cannot to compile extension by meson on windows

2024-12-27 Thread Vladlen Popolitov

Pavel Stehule wrote on 2024-12-01 20:52:

Hi

Did somebody test compilation of any extension on the WIN platform by
using meson?

I prepared meson.build
https://github.com/orafce/orafce/blob/master/meson.build

I tested it successfully on Linux.

But it fails on Windows - a lot of compilation fails on missing
libintl.h

DOCDIR = C:/PROGRA~1/POSTGR~1/16/doc
HTMLDIR = C:/PROGRA~1/POSTGR~1/16/doc
INCLUDEDIR = C:/PROGRA~1/POSTGR~1/16/include
PKGINCLUDEDIR = C:/PROGRA~1/POSTGR~1/16/include
INCLUDEDIR-SERVER = C:/PROGRA~1/POSTGR~1/16/include/server

It looks like msvc cannot work with just this configuration.

I can compile orafce when I use setup described by
https://github.com/orafce/orafce/blob/master/README.msvc

Regards

Pavel


Hi!

In another thread,
https://www.postgresql.org/message-id/TYVPR01MB1133078C93F9FE432CA466573E40E2%40TYVPR01MB11330.jpnprd01.prod.outlook.com
Kohei Harikae is doing good work to clarify the meson documentation, especially
regarding additional libraries.

I suppose that in your case meson did not find the gettext library;
libintl.h comes from this library.


You can:
1) install gettext, e.g. with the vcpkg package manager:
vcpkg.exe install gettext:x64-windows --x-install-root=c:\postgres\gettext

2) You could add the directory of the gettext .pc file to PKG_CONFIG_PATH
(a ;-separated list; meson uses the configurations in this order):
SET PKG_CONFIG_PATH=c:\postgres\gettext\x64-windows\lib\pkgconfig;%PKG_CONFIG_PATH%
but this file is not created for this library (you can create it yourself,
but another solution exists: see step 4)

3) perform all the other steps to run meson, e.g. call "C:\Program
Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvarsall.bat" x64


4) run meson setup with the option -Dnls=enabled: it enables national
languages and uses gettext.
And you have to add the options for the gettext library and include
directories:

meson setup c:\builddir -Dnls=enabled
 -Dextra_include_dirs=c:\postgres\gettext\x64-windows\include,c:\otherlibs\include
 -Dextra_lib_dirs=c:\postgres\gettext\x64-windows\lib,c:\otherlibs\lib
 ...other options...

extra_include_dirs and extra_lib_dirs are options defined by PostgreSQL.
These options are comma-separated lists of directories, provided that no
value contains a comma itself (if one does, it is better to read the meson
get_options() documentation; the meson language syntax is not easy to
explain briefly).


Using PKG_CONFIG_PATH, meson detects the whole library and handles all the
work with the include and lib paths, but if a library does not have a .pc
file, you can define the paths via the options.

If you build with -Dnls=enabled, you have to be careful with the tests.
In this case the tests run with the default system language, and some tests
will fail, because they check the log file output and compare it with the
English answers.

I hope this helps you.

--
Best regards,

Vladlen Popolitov.




Re: Convert macros to static inline functions

2024-12-27 Thread Peter Eisentraut

On 16.05.22 10:27, Peter Eisentraut wrote:
Inspired by [0], I looked to convert more macros to inline functions. 


This is an older thread where I left something unfinished:

Note 2: Many macros in htup_details.h operate both on HeapTupleHeader 
and on MinimalTuple, so converting them to a function doesn't work in a 
straightforward way.  I have some in-progress work in that area, but I 
have not included any of that here.


Here is the patch set for this.

There are actually only two macros that operate on both HeapTupleHeader 
and MinimalTuple, so it wasn't as much as I had written above.  I just 
left those as macros.  I converted the rest to inline functions in a 
straightforward way as before.  A small amount of reordering was necessary.


But just for language-nerd fun, I'm including here an additional patch 
showing how the remaining ones could be done with C11 generic selection. 
 I'm not planning to commit that one at this time.
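
For readers unfamiliar with generic selection, here is a minimal standalone
illustration of the idea only; the helper and macro names below are invented
for this sketch and the attached patch may do it differently:

/* Dispatch one accessor over both tuple header types with C11 _Generic. */
static inline int
heap_header_natts(const HeapTupleHeaderData *tup)
{
	return tup->t_infomask2 & HEAP_NATTS_MASK;
}

static inline int
minimal_tuple_natts(const MinimalTupleData *tup)
{
	return tup->t_infomask2 & HEAP_NATTS_MASK;
}

#define TupleHeaderGetNatts(tup) \
	_Generic((tup), \
			 HeapTupleHeaderData *: heap_header_natts, \
			 const HeapTupleHeaderData *: heap_header_natts, \
			 MinimalTupleData *: minimal_tuple_natts, \
			 const MinimalTupleData *: minimal_tuple_natts)(tup)
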
From 422181163bfe1a292ee51941a007d54bc127ccec Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 27 Dec 2024 11:07:35 +0100
Subject: [PATCH 1/3] Add some const decorations (htup.h)

---
 src/backend/access/heap/heapam.c  | 6 +++---
 src/backend/utils/time/combocid.c | 6 +++---
 src/include/access/htup.h | 8 
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 329e727f80d..75171e8f2c0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7419,10 +7419,10 @@ MultiXactIdGetUpdateXid(TransactionId xmax, uint16 t_infomask)
  * checking the hint bits.
  */
 TransactionId
-HeapTupleGetUpdateXid(HeapTupleHeader tuple)
+HeapTupleGetUpdateXid(const HeapTupleHeaderData *tup)
 {
-	return MultiXactIdGetUpdateXid(HeapTupleHeaderGetRawXmax(tuple),
-								   tuple->t_infomask);
+	return MultiXactIdGetUpdateXid(HeapTupleHeaderGetRawXmax(tup),
+								   tup->t_infomask);
 }
 
 /*
diff --git a/src/backend/utils/time/combocid.c b/src/backend/utils/time/combocid.c
index f85510b74ff..1ecdc93b7e2 100644
--- a/src/backend/utils/time/combocid.c
+++ b/src/backend/utils/time/combocid.c
@@ -101,7 +101,7 @@ static CommandId GetRealCmax(CommandId combocid);
  */
 
 CommandId
-HeapTupleHeaderGetCmin(HeapTupleHeader tup)
+HeapTupleHeaderGetCmin(const HeapTupleHeaderData *tup)
 {
CommandId   cid = HeapTupleHeaderGetRawCommandId(tup);
 
@@ -115,7 +115,7 @@ HeapTupleHeaderGetCmin(HeapTupleHeader tup)
 }
 
 CommandId
-HeapTupleHeaderGetCmax(HeapTupleHeader tup)
+HeapTupleHeaderGetCmax(const HeapTupleHeaderData *tup)
 {
CommandId   cid = HeapTupleHeaderGetRawCommandId(tup);
 
@@ -150,7 +150,7 @@ HeapTupleHeaderGetCmax(HeapTupleHeader tup)
  * changes the tuple in shared buffers.
  */
 void
-HeapTupleHeaderAdjustCmax(HeapTupleHeader tup,
+HeapTupleHeaderAdjustCmax(const HeapTupleHeaderData *tup,
  CommandId *cmax,
  bool *iscombo)
 {
diff --git a/src/include/access/htup.h b/src/include/access/htup.h
index 116cb1bb273..a5f1ef2441d 100644
--- a/src/include/access/htup.h
+++ b/src/include/access/htup.h
@@ -78,12 +78,12 @@ typedef HeapTupleData *HeapTuple;
 #define HeapTupleIsValid(tuple) PointerIsValid(tuple)
 
 /* HeapTupleHeader functions implemented in utils/time/combocid.c */
-extern CommandId HeapTupleHeaderGetCmin(HeapTupleHeader tup);
-extern CommandId HeapTupleHeaderGetCmax(HeapTupleHeader tup);
-extern void HeapTupleHeaderAdjustCmax(HeapTupleHeader tup,
+extern CommandId HeapTupleHeaderGetCmin(const HeapTupleHeaderData *tup);
+extern CommandId HeapTupleHeaderGetCmax(const HeapTupleHeaderData *tup);
+extern void HeapTupleHeaderAdjustCmax(const HeapTupleHeaderData *tup,
 									  CommandId *cmax, bool *iscombo);
 
 /* Prototype for HeapTupleHeader accessors in heapam.c */
-extern TransactionId HeapTupleGetUpdateXid(HeapTupleHeader tuple);
+extern TransactionId HeapTupleGetUpdateXid(const HeapTupleHeaderData *tup);
 
 #endif /* HTUP_H */

base-commit: d85ce012f99f63249bb45a78fcecb7a2383920b1
-- 
2.47.1

From 89308e0446259a1ae619e39603324d2c2b8ed66c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 27 Dec 2024 11:07:35 +0100
Subject: [PATCH 2/3] Convert macros to static inline functions
 (htup_details.h, itup.h)

---
 contrib/pageinspect/heapfuncs.c   |  15 +-
 src/include/access/htup_details.h | 543 ++
 src/include/access/itup.h |  20 +-
 3 files changed, 352 insertions(+), 226 deletions(-)

diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 41ff597199c..4961a3c866c 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/p

Re: Proposal: add new API to stringinfo

2024-12-27 Thread Tatsuo Ishii
Michael said:
> New APIs are materials for HEAD, so recompilation needs to happen
> anyway.  Using a macro makes things slightly simpler and it would not
> break unnecessarily the compilation of existing extensions.

Ok.

David said:
> I didn't review the patch in detail, but I think "initsize" would be a
> better parameter name than "size".

Ok, will change to "initsize".

With opinions from Michael and David , what about following additional APIs?

#define STRINGINFO_DEFAULT_SIZE 1024/* default initial allocation size */
#define STRINGINFO_SMALL_SIZE   64  /* small initial allocation size */

#define makeStringInfo()	makeStringInfoExtended(STRINGINFO_DEFAULT_SIZE)
#define initStringInfo(str)	initStringInfoExtended(str, STRINGINFO_DEFAULT_SIZE)

extern StringInfo makeStringInfoExtended(int initsize);
extern void initStringInfoExtended(StringInfo str, int initsize);
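
For illustration, a minimal sketch of how the extended initializer might
look, mirroring the existing initStringInfo(); this is not taken from the
actual patch:

/* Like initStringInfo(), but with a caller-supplied initial allocation. */
void
initStringInfoExtended(StringInfo str, int initsize)
{
	Assert(initsize >= 1 && initsize <= MaxAllocSize);

	str->data = (char *) palloc(initsize);
	str->maxlen = initsize;
	resetStringInfo(str);
}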

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Should fix a comment referring to stats collector?

2024-12-27 Thread Junwang Zhao
On Tue, Oct 31, 2023 at 11:02 PM Bruce Momjian  wrote:
>
> On Mon, Aug  1, 2022 at 09:05:45PM +0900, torikoshia wrote:
> > On 2022-07-30 02:53, Alvaro Herrera wrote:
> >
> > > I don't think this refers to the statistics collector process; I
> > > understand it to refer to ANALYZE that captures the data being stored.
> >
> > Thanks for the explanation!
> >
> > > Maybe it should just say "K may be chosen at ANALYZE time".
> >
> > It seems clearer than current one.
>
> Change made in master.

I believe some redundant wording has been committed.

K may be chosen may be chosen at ANALYZE time.

Attached patch fixes it, with some line adjustments.

>
> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
>   Only you can decide what is important to you.
>
>


-- 
Regards
Junwang Zhao


v1-0001-remove-redundant-wording-in-pg_statistic.h.patch
Description: Binary data


Re: [PATCHES] Post-special page storage TDE support

2024-12-27 Thread David Christensen
On Fri, Dec 27, 2024 at 1:58 PM Bruce Momjian  wrote:
>
> On Fri, Dec 27, 2024 at 12:25:11PM -0500, Greg Sabino Mullane wrote:
> > On Fri, Dec 27, 2024 at 10:12 AM Bruce Momjian  wrote:
> >
> > The value of TDE is limited from a security value perspective, but high 
> > on
> > the list of security policy requirements.  Our community is much more
> > responsive to actual value vs policy compliance value.
> >
> >
> > True. The number of forks, though, makes me feel this is a "when", not "if"
> > feature. Has there been any other complex feature forked/implemented by so
> > many? Maybe columnar storage?
>
> That is a great question.  We have TDE implementations from EDB,
> Fujitsu, Percona, Cybertec, and Crunchy Data, and perhaps others, and
> that is a lot of duplicated effort.
>
> As far as parallels, I think compatibility with Oracle and MSSQL are
> areas that several companies have developed that the community is
> unlikely to ever develop, I think because they are pure compatibility,
> not functionality.  I think TDE having primarily policy compliance value
> also might make it something the community never develops.
>
> I think this blog post is the clearest I have seen about the technical
> value vs.policy compliance value of TDE:
>
> 
> https://www.percona.com/blog/why-postgresql-needs-transparent-database-encryption-tde/
>
> One possible way TDE could be added to community Postgres is if the code
> changes required were reduced due to an API redesign.

A couple of big pieces here could be modifying the API to add
PreparePageForWrite()/PreparePageFromRead() hooks to transform the
data page once it is read from disk or is about to be written to disk.  I
think I have a patch version (not yet rebased atop the bulk read/write API
and the various incremental backup pieces) which refactors things somewhat
to support that, basically making a single call point where we can add
things like checksums, page encryption, etc.
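
Purely as an illustration of the shape such hooks could take (the names and
signatures below are guesses for this sketch, not from any existing patch):

/* Hypothetical hook points, called just before a page is written out and
 * just after it is read in. */
typedef void (*PreparePageForWrite_hook_type) (Page page,
											   BlockNumber blkno,
											   const RelFileLocator *rlocator);
typedef void (*PreparePageFromRead_hook_type) (Page page,
											   BlockNumber blkno,
											   const RelFileLocator *rlocator);

extern PGDLLIMPORT PreparePageForWrite_hook_type PreparePageForWrite_hook;
extern PGDLLIMPORT PreparePageFromRead_hook_type PreparePageFromRead_hook;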

I think there was also a thread floating around about moving the various
arbitrary read/write file APIs (for temporary files) into a common
API; my recollection is there were something like 4 or 5
different file read abstractions used at various places in the code
base, so making a common one that could also be hooked would give us
the ability to make a read/write transient file API that we could then
"do stuff" with.  (My recollection is we could support encryption and
compression, but I don't have the thread in front of me.)

Some form of early init pluggability would be required, since we'd
need to support reading encrypted WAL before full startup is
accomplished.  This seems harder, at least given the bits I was
originally plugging into.

Obviously existing forks would want to support reading their existing
clusters, so not sure there is an all-in-one solution here.

Just some musing here...

David




Re: Logging parallel worker draught

2024-12-27 Thread Sami Imseih
Thanks for rebasing.

I took a look and have some high level comments.

1/ The way the patches stand now, different parallel commands
will result in differently formatted log lines. This may be
OK for reading the logs manually, but it will be hard to
consume via tools that parse logs, and it will get worse if we
support other parallel commands in the future.

Thinking about this further, it seems to me this logging only makes sense
for parallel query and not for maintenance commands. This is because,
in the latter case, the commands are executed manually and a user can issue
VACUUM VERBOSE (for the case of VACUUM). For CREATE INDEX, one
can enable DEBUG1. For example:

postgres=# CREATE INDEX tbl_c1 ON tbl(c1);
DEBUG:  building index "tbl_c1" on table "tbl" with request for 1
parallel workers
DEBUG:  index "tbl_c1" can safely use deduplication
CREATE INDEX

It does appear, though, that the DEBUG1 message is missing the number of workers
actually used; that could be taken up in a separate discussion.

2/ For logging parallel query workers, the log line could be simpler.

Instead of:

+ elog(LOG, "%i parallel nodes planned (%i obtained all their workers,
%i obtained none), "
+ "%i workers planned to be launched (%i workers launched)",

what about something like:

"launched %d parallel workers (planned: %)"

For example, if I have 2 gather nodes and they each plan and launch
2 workers, the log line should be as simple as

"launched 4 parallel workers (planned: 4)"

This behavior in which workers planned and launched are aggregated in the
log can be described in the documentation.
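
For illustration only, a minimal sketch of such an aggregated log call at the
end of query execution (the counter variable names here are assumptions, not
taken from the posted patch; log_parallel_query_workers is the GUC discussed
below):

if (log_parallel_query_workers && workers_planned > 0)
	ereport(LOG,
			(errmsg("launched %d parallel workers (planned: %d)",
					workers_launched, workers_planned)));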

3/ Also, log_parallel_query_workers as the name of the GUC will make more sense
if we log parallel query only.

What do you think?

Regards,

Sami Imseih
Amazon Web Services (AWS)

On Thu, Dec 26, 2024 at 8:25 AM Benoit Lobréau
 wrote:
>
> This is just a rebase.
>
> As stated before, I added some information to the error message for
> parallel queries as an experiment. It can be removed, if you re not
> convinced.
>
> ---
> Benoit Lobréau
> Consultant
> http://dalibo.com




Re: Expand applicability of aggregate's sortop optimization

2024-12-27 Thread Andrei Lepikhov

On 12/11/24 07:22, Michael Paquier wrote:

On Fri, Nov 08, 2024 at 01:05:04PM +0700, Andrei Lepikhov wrote:

Rebase onto current master and slight improvement.


Your patch is failing in the CI, please rebase.  I have it to the next
CF for now, waiting on author.

Thanks,
I overlooked the fact that regression tests can be launched in parallel. 
Therefore, adding an index to the frequently used tenk1 table was a bad 
idea.

In the attachment - the new patch rebased on the current master.

--
regards, Andrei Lepikhov

From 880ddf7170a0e01fb852172ba99a32e2494f50c6 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" 
Date: Fri, 8 Nov 2024 10:36:06 +0700
Subject: [PATCH v4] Add prosupport helpers to aggregate functions.

Following the spirit of ed1a88d, add a prosupport call for aggregates.
Here, we introduce a new support function node type to allow support functions
to be called for aggregate functions.
This code introduces only min/max trivial optimisation in the core. However, a
user can design alternative support functions on their own.
---
 src/backend/optimizer/plan/planagg.c |  76 +
 src/backend/optimizer/prep/prepagg.c |  26 +
 src/include/catalog/pg_proc.dat  |  92 +++
 src/include/nodes/supportnodes.h |   7 ++
 src/test/regress/expected/aggregates.out | 135 ++-
 src/test/regress/sql/aggregates.sql  |  62 +++
 src/tools/pgindent/typedefs.list |   1 +
 7 files changed, 352 insertions(+), 47 deletions(-)

diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index dbcd3b9a0a..845b17f01b 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -33,6 +33,7 @@
 #include "catalog/pg_type.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/supportnodes.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
@@ -43,6 +44,7 @@
 #include "parser/parse_clause.h"
 #include "parser/parsetree.h"
 #include "rewrite/rewriteManip.h"
+#include "utils/fmgrprotos.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
@@ -512,3 +514,77 @@ fetch_agg_sort_op(Oid aggfnoid)
 
 	return aggsortop;
 }
+
+Datum
+minmax_support(PG_FUNCTION_ARGS)
+{
+	Node   *rawreq = (Node *) PG_GETARG_POINTER(0);
+	Oid		aggsortop;
+
+	if (IsA(rawreq, SupportRequestAggregate))
+	{
+		SupportRequestAggregate   *req = (SupportRequestAggregate *) rawreq;
+		Aggref   *aggref = req->aggref;
+
+		if (list_length(aggref->args) != 1 || aggref->aggfilter != NULL)
+			PG_RETURN_POINTER(NULL);
+
+		aggsortop = fetch_agg_sort_op(aggref->aggfnoid);
+		if (!OidIsValid(aggsortop))
+			PG_RETURN_POINTER(NULL);		/* not a MIN/MAX aggregate */
+
+		if (aggref->aggorder != NIL)
+		{
+			SortGroupClause	   *orderClause;
+			TargetEntry		   *curTarget;
+
+			curTarget = (TargetEntry *) linitial(aggref->args);
+
+			/*
+			 * If the order clause is the same column as the one we're
+			 * aggregating, we can still use the index: It is undefined which
+			 * value is MIN() or MAX(), as well as which value is first or
+			 * last when sorted. So, we can still use the index IFF the
+			 * aggregated expression equals the expression used in the
+			 * ordering operation.
+			 */
+
+			/*
+			 * We only accept a single argument to min/max aggregates,
+			 * orderings that have more clauses won't provide correct results.
+			 */
+			Assert(list_length(aggref->aggorder) == 1);
+
+			orderClause = castNode(SortGroupClause, linitial(aggref->aggorder));
+
+			if (orderClause->tleSortGroupRef != curTarget->ressortgroupref)
+elog(ERROR, "Aggregate order clause isn't found in target list");
+
+			if (orderClause->sortop != aggsortop)
+			{
+List	   *btclasses;
+ListCell   *lc;
+
+btclasses = get_op_btree_interpretation(orderClause->sortop);
+
+foreach(lc, btclasses)
+{
+	OpBtreeInterpretation *interpretation;
+
+	interpretation = (OpBtreeInterpretation *) lfirst(lc);
+	if (op_in_opfamily(aggsortop, interpretation->opfamily_id))
+	{
+		aggref->aggorder = NIL;
+		break;
+	}
+}
+
+list_free_deep(btclasses);
+			}
+			else
+aggref->aggorder = NIL;
+		}
+	}
+
+	PG_RETURN_POINTER(NULL);
+}
diff --git a/src/backend/optimizer/prep/prepagg.c b/src/backend/optimizer/prep/prepagg.c
index e935c9d094..74347b3d7e 100644
--- a/src/backend/optimizer/prep/prepagg.c
+++ b/src/backend/optimizer/prep/prepagg.c
@@ -39,6 +39,7 @@
 #include "catalog/pg_type.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/pathnodes.h"
+#include "nodes/supportnodes.h"
 #include "optimizer/cost.h"
 #include "optimizer/optimizer.h"
 #include "optimizer/plancat.h"
@@ -64,6 +65,25 @@ static int	find_compatible_trans(PlannerInfo *root, Aggref *newagg,
   List *transnos);
 static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
 
+static void
+optimize_aggregates(Aggref *aggref)
+{
+	SupportRequestAggregate	

Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)

2024-12-27 Thread Tomas Vondra



On 12/27/24 05:00, Michael Paquier wrote:
> On Thu, Dec 26, 2024 at 06:58:11PM +0100, Tomas Vondra wrote:
>> If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not
>> make a fundamental difference ...
>>
>> Anyway, the 128MB value is rather arbitrary. I don't mind increasing the
>> limit, or possibly removing it entirely (and accepting anything the
>> system can handle).
> 
> +DefineCustomIntVariable("stats_history.size",
> +"Sets the amount of memory available for past 
> events.",
> 
> How about some time-based retention?  Data size can be hard to think
> about for the end user, while there would be a set of users that would
> want to retain data for the past week, month, etc?  If both size and
> time upper-bound are defined, then entries that match one condition or
> the other are removed.
> 

Right. In my response [1] I suggested replacing the simple memory limit
with a time-based limit, but I haven't done anything about that yet.

And the more I think about it the more I'm convinced we don't need to
keep the data about past runs in memory, a file should be enough (except
maybe for a small buffer). That would mean we don't need to worry about
dynamic shared memory, etc. I initially rejected this because it seemed
like a regression to how pgstat worked initially (sharing data through
files), but I don't think that's actually true - this data is different
(almost append-only), etc.

The one case where we may need to read the data is in response to a DROP of
a table, when we need to discard entries for that object. Or we could
handle that by recording OIDs of dropped objects ... not sure how
complex this would need to be.

We'd still want to enforce some limits, of course - a time-based limit
of the minimum amount of time to cover, and maximum amount of disk space
to use (more as a safety).

FWIW there's one "issue" with enforcing the time-based limit - we can
only do that for the "end time", because that's when we see the entry.
If you configure the limit to keep "1 day" history, and then a vacuum
completes after running for 2 days, we want to keep it, so that anyone
can actually see it.
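
A rough sketch of what end-time-based pruning could look like (the retention
GUC and the history helpers are invented names for this sketch, not part of
the patch):

TimestampTz	cutoff;

cutoff = GetCurrentTimestamp() -
	(TimestampTz) stats_history_retention_minutes * USECS_PER_MINUTE;

/* discard entries strictly by their end_time, as discussed above */
while (!history_is_empty(history) &&
	   history_oldest_entry(history)->end_time < cutoff)
	history_discard_oldest(history);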

[1]
https://www.postgresql.org/message-id/8df7cee1-31aa-4db3-bbb7-83157ca139da%40vondra.me

> +checkpoint_log_hook(
> +  CheckpointStats.ckpt_start_t,/* start_time */
> +  CheckpointStats.ckpt_end_t,/* end_time */
> +  (flags & CHECKPOINT_IS_SHUTDOWN),/* is_shutdown */
> +  (flags & CHECKPOINT_END_OF_RECOVERY),/* is_end_of_recovery */
> +  (flags & CHECKPOINT_IMMEDIATE),/* is_immediate */
> +  (flags & CHECKPOINT_FORCE),/* is_force */
> +  (flags & CHECKPOINT_WAIT),/* is_wait */
> +  (flags & CHECKPOINT_CAUSE_XLOG),/* is_wal */
> +  (flags & CHECKPOINT_CAUSE_TIME),/* is_time */
> +  (flags & CHECKPOINT_FLUSH_ALL),/* is_flush_all */
> +  CheckpointStats.ckpt_bufs_written,/* buffers_written */
> +  CheckpointStats.ckpt_slru_written,/* slru_written */
> +  CheckpointStats.ckpt_segs_added,/* segs_added */
> +  CheckpointStats.ckpt_segs_removed,/* segs_removed */
> +  CheckpointStats.ckpt_segs_recycled,/* segs_recycled */
> 
> That's a lot of arguments.  CheckpointStatsData and the various
> CHECKPOINT_* flags are exposed, why not just send these values to the
> hook?
> 
> For v1-0001 as well, I'd suggest some grouping with existing
> structures, or expose these structures so as they can be reused for
> out-of-core code via the proposed hook.  More arguments lead to more
> mistakes that could be easily avoided.

Yes, I admit the number of parameters seemed a bit annoying to me too,
and maybe we could reduce that somewhat, certainly for checkpoints,
where we already have a reasonable CheckpointStats struct and flags
wrapping most of the fields.

With vacuum it's a bit more complicated, for two reasons: (a) the
counters are simply in LVRelState, mixed with all kinds of other info,
and it seems "not great" to pass it to a "log" hook, and (b) there are
some calculated values, so I guess the hook would need to do that
calculation on its own, and it'd be easy to diverge over time.

For (a) we could introduce some "stats" struct to keep these counters
for vacuum (the nearby parallel vacuum patch actually does something
like that, I think). Not sure if (b) is actually a problem in practice,
but I guess we could add those fields to the new "stats" struct too.
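
As a sketch of that direction for the checkpoint case (illustration only, not
a proposal of exact names; CheckpointStatsData and CheckpointStats are the
existing struct and global):

/* Pass the flags and the existing stats struct instead of many scalars. */
typedef void (*checkpoint_log_hook_type) (int flags,
										  const CheckpointStatsData *stats);

extern PGDLLIMPORT checkpoint_log_hook_type checkpoint_log_hook;

/* call site, e.g. at the end of CreateCheckPoint() */
if (checkpoint_log_hook)
	(*checkpoint_log_hook) (flags, &CheckpointStats);

A vacuum counterpart would presumably pass the new vacuum "stats" struct in
the same way.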


regards

-- 
Tomas Vondra