Re: Add the ability to limit the amount of memory that can be allocated to backends.
On 12/27/24 20:14, James Hunter wrote: > Reviving this thread, because I am thinking about something related -- > please ignore the "On Fri, Dec 27, 2024" date, this seems to be an > artifact of me re-sending the message, from the list archive. The > original message was from January 28, 2024. > > On Fri, Dec 27, 2024 at 11:02 AM Tomas Vondra > wrote: >> >> Firstly, I agree with the goal of having a way to account for memory >> used by the backends, and also ability to enforce some sort of limit. >> It's difficult to track the memory at the OS level (interpreting RSS >> values is not trivial), and work_mem is not sufficient to enforce a >> backend-level limit, not even talking about a global limit. >> >> But as I said earlier, it seems quite strange to start by introducing >> some sort of global limit, combining memory for all backends. I do >> understand that the intent is to have such global limit in order to >> prevent issues with the OOM killer and/or not to interfere with other >> stuff running on the same machine. And while I'm not saying we should >> not have such limit, every time I wished to have a memory limit it was a >> backend-level one. Ideally a workmem-like limit that would "adjust" the >> work_mem values used by the optimizer (but that's not what this patch >> aims to do), or at least a backstop in case something goes wrong (say, a >> memory leak, OLTP application issuing complex queries, etc.). > > I think what Tomas suggests is the right strategy. I am thinking of > something like: > > 1. Say we have a backend_work_mem limit. Then the total amount of > memory available on the entire system, for all queries, as work_mem, > would be backend_work_mem * max_connections. > > 2. We use this backend_work_mem to "adjust" work_mem values used by > the executor. (I don't care about the optimizer right now -- optimizer > just does its best to predict what will happen at runtime.) > > At runtime, every node that uses work_mem currently checks its memory > usage against the session work_mem (and possibly hash_mem_multiplier) > GUC(s). Instead, now, every node would check against its node-specific > "adjusted" work_mem. If it exceeds this limit, it spills, using > existing logic. > > In other words -- existing logic spills based on comparison to global > work_mem GUC. Let's make it spill, instead, based on comparison to an > operator-local "PlanState.work_mem" field. > > And then let's set that "PlanState.work_mem" field based on a new > "backend_work_mem" GUC. Then no queries will run OOM (at least, not > due to work_mem -- we'll address other memory usages separately), so > they won't need to be canceled. Instead, they'll spill. > > This strategy solves the ongoing problem of how to set work_mem, if > some queries have lots of operators and others don't -- now we just > set backend_work_mem, as a limit on the entire query's total work_mem. > And a bit of integration with the optimizer will allow us to > distribute the total backend_work_mem to individual execution nodes, > with the goal of minimizing spilling, without exceeding the > backend_work_mem limit. > > Anyway, I revived this thread to see if there's interest in this sort > of strategy -- > I think there's still interest in having a memory limit of this kind, but it's not very clear to me what you mean by "adjusted work_mem". Can you explain how would that actually work in practice? 
Whenever I've been thinking about this in the past, it wasn't clear to me how we would know when to start adjusting work_mem, because we don't know which nodes will actually use work_mem concurrently. We certainly don't know that during planning - e.g. because we don't actually know what nodes will be "above" the operation we're planning. And AFAIK we don't have any concept of which parts of the plan may be active at the same time ...

Let's say we have limits work_mem=4MB and query_mem=8MB (to keep it simple). And we have a query plan with 3 separate hash joins, each needing 4MB. I believe these things can happen:

1) The hash joins are in distant parts of the query plan, and will not actually run concurrently - we'll run each hash join till completion before starting the next one. So, no problem with query_mem.

2) The hash joins are active at the same time, initialized one by one. But we don't know that in advance, so we'll init HJ #1, it'll get 4MB of memory. Then we'll init HJ #2, it'll also get 4MB of memory. And now we want to init HJ #3, but we've already exhausted query_mem - the memory is already used by the first two joins and we can't reclaim it.

How would the work_mem get adjusted in these cases? Ideally, we'd not adjust work_mem in (1), because it'd force hash joins to spill data to disk, affecting performance when there's enough memory to just run the query plan (but maybe that's a reasonable price for the memory limit?). While in (2) we'd need to start adjusting the memory from the beginning, but I don't think we know that so early. Can you explain / walk me through these two cases?
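To make the mechanism under discussion concrete, here is a minimal sketch of the executor-side change James describes. Everything in it is hypothetical: backend_work_mem (the proposed per-backend budget) and the per-node limit do not exist in core today. The point is only that each node would spill when it exceeds its own assigned limit rather than the global work_mem GUC, and that those per-node limits are carved out of a backend-wide budget. It deliberately sidesteps the hard question raised above of which nodes are active at the same time.

    /*
     * Hypothetical sketch only -- none of these names exist in core.
     * "backend_work_mem_kb" stands for the proposed per-backend budget and
     * "node_work_mem_kb" for the per-node limit that would replace the
     * global work_mem GUC in each node's spill test.
     */
    #include <stdbool.h>
    #include <stddef.h>

    /* Naive policy: split the backend budget evenly across memory-hungry nodes. */
    static int
    assign_node_work_mem_kb(int backend_work_mem_kb, int n_mem_nodes)
    {
        return backend_work_mem_kb / (n_mem_nodes > 0 ? n_mem_nodes : 1);
    }

    /* Executor-side test: a node spills when it exceeds its own limit. */
    static bool
    node_must_spill(size_t bytes_used, int node_work_mem_kb)
    {
        return bytes_used > (size_t) node_work_mem_kb * 1024;
    }

With something of that shape in place, case (2) above reduces to a policy question about assign_node_work_mem_kb(): whether to divide the budget among all memory-using nodes or only among those that can be shown to run concurrently.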
RE: Connection limits/permissions, slotsync workers, etc
On Saturday, December 28, 2024 1:31 AM Tom Lane wrote: > > "Zhijie Hou (Fujitsu)" writes: > > On Thursday, December 26, 2024 3:50 AM Tom Lane > >> I wonder if the AV launcher and slotsync worker could be reclassified > >> as "auxiliary processes" instead of being their own weird animal. > > > It appears that the current aux processes do not run transactions as > > stated in the comments[1], so we may need to somehow release this > > restriction to achieve the goal. > > Ah, right, I'd forgotten about that restriction. I agree that removing it > wouldn't > be very reasonable. However, I still would rather avoid making the slotsync > worker be its very own special snowflake, because that offers no support for > the next person who wants to invent a new sort of specialized > transaction-capable process. > > Attached is an alternative proposal that groups the autovac launcher and > slotsync worker into a new category of "special workers" (better name > welcome). I chose to put them into the existing autovacFreeProcs freelist, > partly because the autovac launcher lives there already but mostly because I > don't want to add another freelist in a patch we need to put into v17. (As > written, your patch is an ABI break. Right, thanks for pointing it out. The new version patch looks good to me. Best Regards, Hou zj
Re: Fix bank selection logic in SLRU
> On 19 Dec 2024, at 20:48, Yura Sokolov wrote: > > Here's version with type change bits16 -> uint16 Thanks! This version looks good to me. I’ll mark the CF entry as RfC. Best regards, Andrey Borodin.
Re: RFC: Allow EXPLAIN to Output Page Fault Information
On Fri, 27 Dec 2024 15:15:40 +0100 "Jelte Fennema-Nio" wrote: > On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote: > > torikoshia writes: > >> I have attached a PoC patch that modifies EXPLAIN to include page > >> fault information during both the planning and execution phases of > >> a query. > > > > Surely these numbers would be too unstable to be worth anything. > > What makes you think that? I'd expect them to be similarly stable to > the numbers we get for BUFFERS. i.e. Sure they won't be completely > stable, but I expect them to be quite helpful when debugging perf > issues, because large numbers indicate that the query is disk-bound > and small numbers indicate that it is not. > > These numbers seem especially useful for setups where shared_buffers > is significantly smaller than the total memory available to the > system. In those cases the output from BUFFERS might give the > impression that that you're disk-bound, but if your working set still > fits into OS cache then the number of page faults is likely still > low. Thus telling you that the numbers that you get back from BUFFERS > are not as big of a problem as they might seem. We'd probably need to combine both pg_buffercache_evict() and /proc/sys/vm/drop_caches to get stable numbers - which is something I have done in the past for testing. Another thought would be splitting out the IO timing information into two values - IO timing for reads that triggered major faults, versus IO timing for reads that did not. And system views like pg_stat_database seem worth considering too. -Jeremy
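As a point of reference for this subthread: the counters being discussed are the per-process fault counts the kernel already maintains, which an EXPLAIN option could sample before and after each phase (getrusage() is one way to do it) and report as deltas. The following is only a standalone illustration of that measurement, not the actual PoC patch:

    /*
     * Illustration only (not the PoC patch): measure major/minor page-fault
     * deltas around a code region using getrusage().  A major fault is one
     * that required disk I/O; a minor fault was satisfied from memory, which
     * is what makes the split useful for the OS-cache question above.
     */
    #include <stdio.h>
    #include <sys/resource.h>

    typedef struct FaultCounts
    {
        long majflt;
        long minflt;
    } FaultCounts;

    static FaultCounts
    read_faults(void)
    {
        struct rusage ru;
        FaultCounts fc = {0, 0};

        if (getrusage(RUSAGE_SELF, &ru) == 0)
        {
            fc.majflt = ru.ru_majflt;
            fc.minflt = ru.ru_minflt;
        }
        return fc;
    }

    int
    main(void)
    {
        FaultCounts before = read_faults();

        /* ... run the query phase being measured ... */

        FaultCounts after = read_faults();

        printf("major faults: %ld, minor faults: %ld\n",
               after.majflt - before.majflt,
               after.minflt - before.minflt);
        return 0;
    }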
Remove support for OpenSSL *eay32 libs on Windows
The thread in [0] which adds Meson support for *eay32 OpenSSL library names on Windows reminded me that we should remove these from master since we no longer support OpenSSL 1.0.2 (which was the final version with those names). Attached is a small (as of yet untested on Windows/autoconf) diff removing these names from the Makefiles. [0] cahrt6656w9onfomqthbgydcm5ckz7hcgzft8l+n0ezbzfcn...@mail.gmail.com -- Daniel Gustafsson v1-0001-Remove-support-for-linking-with-libeay32-and-ssle.patch Description: Binary data
Re: RFC: Allow EXPLAIN to Output Page Fault Information
On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
> torikoshia writes:
>> I have attached a PoC patch that modifies EXPLAIN to include page
>> fault information during both the planning and execution phases of
>> a query.
>
> Surely these numbers would be too unstable to be worth anything.

What makes you think that? I'd expect them to be similarly stable to the numbers we get for BUFFERS. i.e. Sure they won't be completely stable, but I expect them to be quite helpful when debugging perf issues, because large numbers indicate that the query is disk-bound and small numbers indicate that it is not.

These numbers seem especially useful for setups where shared_buffers is significantly smaller than the total memory available to the system. In those cases the output from BUFFERS might give the impression that you're disk-bound, but if your working set still fits into the OS cache then the number of page faults is likely still low. Thus telling you that the numbers you get back from BUFFERS are not as big of a problem as they might seem.
Re: FileFallocate misbehaving on XFS
On Fri, Dec 20, 2024 at 01:25:41PM +0100, Jakub Wartak wrote: > On Thu, Dec 19, 2024 at 7:49 AM Michael Harris wrote: > No one else has responded, so I'll try. My take is that we got very limited > number of reports (2-3) of this stuff happening and it always seem to be >90% > space used, yet the adoption of PG16 is rising, so we may or may not see more > errors of those kind, but on another side of things: it's frequency is so rare My guess is that if you are seeing this with 90% full, and not lesser values, that something is being temporarily exhausted in XFS and the kernel errno API just doesn't allow returning enough detail to explain what is being exhausted. A traditional Unix file system only has limits for the inode table and free blocks, but I am sure XFS has many more areas of possible exhaustion. I didn't see any mention of checking the kernel log, which might have more details of what XFS is having problems with. I agree trying to modify Postgres for this now makes no sense since we don't even know the cause. Once we find the cause, and admit it can't be avoided or quickly fixed, we can reevaluate. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
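A small aside on the errno point: where it is available, FileFallocate() ends up in posix_fallocate(), whose only reporting channel is a single errno-style return code, so an XFS-internal shortage can only ever surface as something generic like ENOSPC - the finer-grained detail really is only in the kernel log. A trivial standalone illustration (not Postgres code):

    /*
     * Trivial illustration of how little detail the fallocate path can
     * report: posix_fallocate() returns one error number (it does not set
     * errno), with no indication of which filesystem resource ran out.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    int
    extend_file(int fd, off_t offset, off_t len)
    {
        int rc = posix_fallocate(fd, offset, len);

        if (rc != 0)
            fprintf(stderr, "could not extend file by %lld bytes at offset %lld: %s\n",
                    (long long) len, (long long) offset, strerror(rc));
        return rc;
    }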
Re: cannot to compile extension by meson on windows
pá 27. 12. 2024 v 9:50 odesílatel Vladlen Popolitov < v.popoli...@postgrespro.ru> napsal: > Pavel Stehule писал(а) 2024-12-01 20:52: > > Hi > > > > Did somebody test compilation of any extension on the WIN platform by > > using meson? > > > > I prepared meson.build > > https://github.com/orafce/orafce/blob/master/meson.build > > > > I tested it successfully on Linux. > > > > But it fails on Windows - a lot of compilation fails on missing > > libintl.h > > > > DOCDIR = C:/PROGRA~1/POSTGR~1/16/doc > > HTMLDIR = C:/PROGRA~1/POSTGR~1/16/doc > > INCLUDEDIR = C:/PROGRA~1/POSTGR~1/16/include > > PKGINCLUDEDIR = C:/PROGRA~1/POSTGR~1/16/include > > INCLUDEDIR-SERVER = C:/PROGRA~1/POSTGR~1/16/include/server` > > > > looks so msvc cannot work with just this configuration. > > > > I can compile orafce when I use setup described by > > https://github.com/orafce/orafce/blob/master/README.msvc > > > > Regards > > > > Pavel > > Hi! > > In other thread > > https://www.postgresql.org/message-id/TYVPR01MB1133078C93F9FE432CA466573E40E2%40TYVPR01MB11330.jpnprd01.prod.outlook.com > Kohei Harikae makes good work to clarify meson documentation, especially > regarding additional libraries. > > I suppose in your case meson did not found gettext library, libintl.h > from this library. > > You can: > 1) install gettext, f.e by vcpkg package manager: > vcpkg.exe install gettext:x64-windows > --x-install-root=c:\postgres\gettext > > 2) You could add gettext .pc file directory to PKG_CONFIG_PATH ( ; > separated list, meson uses configarations in this order > SET > > PKG_CONFIG_PATH=c:\postgres\gettext\x64-windows\lib\pkgconfig;%PKG_CONFIG_PATH% > but this file is not created for this library (you can create it by > yourself, but exists other solution - in step 4) > 3) make all other steps to run meson, f.e. call "C:\Program > Files\Microsoft Visual > Studio\2022\Community\VC\Auxiliary\Build\vcvarsall.bat" x64 > > 4) run meson setup with option -Dnls=enabled - it enables national > languages and uses gettext. > And you have to add options for gettext libraries and includes > directories: > meson setup c:\builddir -Dnsl=enabled > > > -Dextra_include_dirs=c:\postgres\gettext\x64-windows\include,c:\otherlibs\include > -Dextra_lib_dirs=c:\postgres\gettext\x64-windows\lib,c:\otherlibs\lib > ...other options... > > extra_include_dirs and extra_lib_dirs are options defined by PostgreSQL. > This options are > comma separated lists of directories, if every value does not contain > comma itself (if contains, it better > to read meson get_options() documentation, it is not easy to explain > shortly meson language syntax). > > meson using PKG_CONFIG_PATH detects all library and makes all work with > include and lib paths, > but if library does not have .pc file, you can define paths in options. > > If you build with -Dnlas=enabled, you have to be careful with tests. > Tests in this case run > with the default system language, and some tests will fail, as they are > check log files output > and compare it with English answers. > > I hope this helps you. > Thank you for the interesting information. I compiled extensions already. 
Maybe the problem is in pg_config.

if meson.get_compiler('c').get_id() == 'msvc'
  incdir = [includedir_server / 'port/win32_msvc',
            includedir_server / 'port/win32',
            includedir_server,
            includedir]
  postgres_lib = meson.get_compiler('c').find_library(
    'postgres',
    dirs: libdir,
    static: true,
    required: true
  )
else
  incdir = [ includedir_server ]
  postgres_lib = ''
endif

It looks like the returned include dir is not enough for successful compilation, and there should be some extra magic. Maybe pg_config doesn't support the MSVC-specific configuration.

> --
> Best regards,
>
> Vladlen Popolitov.
Re: Add Postgres module info
On Dec 26, 2024, at 20:09, Andrei Lepikhov wrote:
> We intentionally wrote a library, not an extension. According to user usage
> and upgrade patterns, it works across the whole instance and in any database
> or locally in a single backend and ends its impact at the end of its life.

The same is true for the shared libraries included in many extensions. A shared library is just an extension that’s available in all databases and has no associated SQL interface.

> Also, it doesn't maintain any object in the database and is managed by GUCs.

Sure, but this is just a semantic argument. The Postgres developers get to decide what terms mean. I’d argue it can be worthwhile to merge the idea of a library into extensions.

> For example, my libraries add query tree transformations/path recommendations
> to the planner. It doesn't depend on a database and doesn't maintain DSM
> segments and users sometimes want to use it in specific backends, not
> databases - in a backend dedicated to analytic queries without extra overhead
> to backends, picked out for short queries. For what reason do I need to add
> complexity and call 'CREATE EXTENSION' here and add version info only in a
> specific database? Just because of a formal one-directory structure?

Perhaps shared-library-only extensions need not be limited to a single database.

Best,

David
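For concreteness, this is roughly what the kind of SQL-less, GUC-managed library under discussion looks like (a generic sketch, not Andrei's actual code): the magic block plus _PG_init() is all it takes, it is loaded via shared_preload_libraries or session_preload_libraries, and nothing about it is tied to a particular database.

    /*
     * Minimal sketch of a loadable library with no SQL-level objects.
     * The library name and GUC are made up for illustration.
     */
    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/guc.h"

    PG_MODULE_MAGIC;

    static bool my_feature_enabled = false;

    void
    _PG_init(void)
    {
        /* configured entirely through a GUC, no CREATE EXTENSION involved */
        DefineCustomBoolVariable("my_lib.enabled",
                                 "Enables the hypothetical planner tweak.",
                                 NULL,
                                 &my_feature_enabled,
                                 false,
                                 PGC_USERSET,
                                 0,
                                 NULL, NULL, NULL);
    }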
Re: More reliable nbtree detection of unsatisfiable RowCompare quals involving a leading NULL key/element
On Mon, Dec 23, 2024 at 1:02 PM Peter Geoghegan wrote: > Attached patch fixes the problem by moving detection of RowCompare > unsatisfiable-due-to-NULL cases into _bt_fix_scankey_strategy. Attached is v2, which adds several new regression tests, giving certain relevant nbtree code paths test coverage. v2 also makes some small tweaks to _bt_check_rowcompare(), the actual comparator used by scans with a RowCompare qual. We no longer need to account for the possibility that the first row member from a RowCompare scan key contains a null once the scan is underway (we know that _bt_preprocess_keys would have recognized the qual as unsatisfiable had the RowCompare looked like that). -- Peter Geoghegan v2-0001-Improve-nbtree-unsatisfiable-RowCompare-detection.patch Description: Binary data
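For readers less familiar with the nbtree scan-key machinery: the unsatisfiable case exists because a row comparison such as (a, b) >= (NULL, 5) evaluates to NULL for every index tuple, and NULL is never true. The following is only a rough sketch of what spotting that during preprocessing amounts to (the real handling lives in _bt_fix_scankey_strategy / _bt_preprocess_keys and covers more cases); the scan-key flags used are the existing ones from access/skey.h.

    /*
     * Rough sketch, not the actual patch: detect a RowCompare scan key whose
     * first row member compares against NULL.  Such a qual can never be
     * satisfied, because (col1, col2) op (NULL, ...) yields NULL for every
     * index tuple, and NULL is never true.
     */
    #include "postgres.h"
    #include "access/skey.h"

    static bool
    row_compare_has_leading_null(ScanKey skey)
    {
        ScanKey first_member;

        if (!(skey->sk_flags & SK_ROW_HEADER))
            return false;       /* not a RowCompare qual */

        /* the header's sk_argument points at the array of row-member keys */
        first_member = (ScanKey) DatumGetPointer(skey->sk_argument);
        Assert(first_member->sk_flags & SK_ROW_MEMBER);

        return (first_member->sk_flags & SK_ISNULL) != 0;
    }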
RE: Connection limits/permissions, slotsync workers, etc
On Thursday, December 26, 2024 3:50 AM Tom Lane Hi, > In connection with the discussion at [1], I started to look at exactly which > server > processes ought to be subject to connection limits (datconnlimit, > ACL_CONNECT, and related checks). The current situation seems to be an > inconsistent mess. > > Looking at InitPostgres, InitializeSessionUserId, and CheckMyDatabase, we > have several different permissions and resource-limiting checks that may be > enforced against an incoming process: > ReservedConnections/SuperuserReservedConnections > rolcanlogin > rolconnlimit > datallowconn > datconnlimit > database's ACL_CONNECT privilege > > I want to argue that ReservedConnections, rolconnlimit, and datconnlimit > should only be applied to regular backends. It makes no particular sense to > enforce them against autovac workers, background workers, or wal senders, > because each of those types of processes has its own resource-limiting > PGPROC pool. It's especially bad to enforce them against parallel workers, > since that creates edge-case failures that make the use of parallelism not > transparent to applications. We hacked around that at 73c9f91a1, but I think > that should be reverted in favor of not applying the check at all. > > I further argue that rolcanlogin, datallowconn, and ACL_CONNECT should not > be checked in a parallel worker, again primarily on the grounds that this > creates > parallelism-specific failures (cf [1]). > The two scenarios where this occurs are (1) permissions were revoked since > the leader process connected, or (2) leader is currently running as a role > that > wouldn't have permission to connect on its own. We don't attempt to kick out > the leader process when either of those things happen, so why should we > prevent it from using parallelism? > > The current situation for each of these checks is: > > ReservedConnections is enforced if !am_superuser && !am_walsender, so it is > enforced against non-superuser background workers, which is silly because > BG workers have their own PGPROC pool; moreover, what's the argument for > letting walsenders but not other kinds of background processes escape this? > I propose changing it to apply only to regular backends. > > rolcanlogin is enforced if IsUnderPostmaster and we reach > InitializeSessionUserId, which basically reduces to regular backends, parallel > workers, logrep workers, and walsenders. Seems reasonable for logrep > workers and walsenders which represent fresh logins, but not for parallel > workers. I propose fixing this by making ParallelWorkerMain pass > BGWORKER_BYPASS_ROLELOGINCHECK. > > rolconnlimit is enforced if IsUnderPostmaster and we reach > InitializeSessionUserId and it's not a superuser. So that applies to > non-superuser parallel workers, logrep workers, and walsenders, and I don't > think it's reasonable to apply it to any of them since those all come out of > other > PGPROC pools. I propose switching that to apply only to regular backends. > > BTW, I kind of wonder why rolconnlimit is ineffectual for superusers, > especially > when rolcanlogin does apply to them. Not a bug exactly, but it sure seems > inconsistent. If you've taken the trouble to set it you'd expect it to work. > Shall > we take out the !is_superuser check? > > datallowconn is enforced against all non-standalone, non-AV-worker > processes that connect to a specific database, except bgworkers that pass > BGWORKER_BYPASS_ALLOWCONN (nothing in core except test module). 
> So again that includes parallel workers, logrep workers, and walsenders. > Again this seems reasonable for logrep workers and walsenders but not for > parallel workers. I propose fixing this by making ParallelWorkerMain pass > BGWORKER_BYPASS_ALLOWCONN. > > datconnlimit is enforced against all non-superuser processes, including > per-DB walsenders and BG workers (see above). > This is fairly dubious given that they have their own PGPROC pools. > I propose switching that to apply only to regular backends, too. > > ACL_CONNECT is enforced against all non-superuser processes, including > per-DB walsenders and BG workers (includes parallel workers, subscription > apply workers, logrep workers). > Perhaps that's OK for most, but I argue not for parallel workers; maybe skip > it if > BGWORKER_BYPASS_ALLOWCONN? > > Also, the enforcement of datconnlimit and rolconnlimit is inconsistent in > another way: our counting of the pre-existing processes is pretty random. > CountDBConnections is not consistent with either the current set of processes > that datconnlimit is enforced against, or my proposal to enforce it only > against > regular backends. It counts anything that is not > AmBackgroundWorkerProcess, including AV workers and per-DB walsenders. > I think it should count only regular backends, because anything else leads to > weird inconsistencies in whether a rejection occurs. > > The same applies to CountUs
Re: Connection limits/permissions, slotsync workers, etc
"Zhijie Hou (Fujitsu)" writes: > On Thursday, December 26, 2024 3:50 AM Tom Lane >> I wonder if the AV launcher and slotsync worker could be reclassified as >> "auxiliary >> processes" instead of being their own weird animal. > It appears that the current aux processes do not run transactions as stated > in the > comments[1], so we may need to somehow release this restriction to achieve the > goal. Ah, right, I'd forgotten about that restriction. I agree that removing it wouldn't be very reasonable. However, I still would rather avoid making the slotsync worker be its very own special snowflake, because that offers no support for the next person who wants to invent a new sort of specialized transaction-capable process. Attached is an alternative proposal that groups the autovac launcher and slotsync worker into a new category of "special workers" (better name welcome). I chose to put them into the existing autovacFreeProcs freelist, partly because the autovac launcher lives there already but mostly because I don't want to add another freelist in a patch we need to put into v17. (As written, your patch is an ABI break. It'd probably be safe to add a new freelist at the end of the struct in v17, but I'm a little shy about that in view of recent bugs. In any case, a freelist having at most two members seems rather silly.) I was amused but not terribly surprised to notice that the comments in InitProcGlobal were *already* out of date, in that they didn't account for the walsender PGPROC pool. We have a remarkably bad track record for updating comments that are more than about two lines away from the code they describe :-( Thoughts? regards, tom lane diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index 720ef99ee8..10d4fb4ea1 100644 --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -197,11 +197,12 @@ InitProcGlobal(void) /* * Create and initialize all the PGPROC structures we'll need. There are - * five separate consumers: (1) normal backends, (2) autovacuum workers - * and the autovacuum launcher, (3) background workers, (4) auxiliary - * processes, and (5) prepared transactions. Each PGPROC structure is - * dedicated to exactly one of these purposes, and they do not move - * between groups. + * six separate consumers: (1) normal backends, (2) autovacuum workers and + * special workers, (3) background workers, (4) walsenders, (5) auxiliary + * processes, and (6) prepared transactions. (For largely-historical + * reasons, we combine autovacuum and special workers into one category + * with a single freelist.) Each PGPROC structure is dedicated to exactly + * one of these purposes, and they do not move between groups. */ procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC)); MemSet(procs, 0, TotalProcs * sizeof(PGPROC)); @@ -269,12 +270,13 @@ InitProcGlobal(void) } /* - * Newly created PGPROCs for normal backends, autovacuum and bgworkers - * must be queued up on the appropriate free list. Because there can - * only ever be a small, fixed number of auxiliary processes, no free - * list is used in that case; InitAuxiliaryProcess() instead uses a - * linear search. PGPROCs for prepared transactions are added to a - * free list by TwoPhaseShmemInit(). + * Newly created PGPROCs for normal backends, autovacuum workers, + * special workers, bgworkers, and walsenders must be queued up on the + * appropriate free list. 
Because there can only ever be a small, + * fixed number of auxiliary processes, no free list is used in that + * case; InitAuxiliaryProcess() instead uses a linear search. PGPROCs + * for prepared transactions are added to a free list by + * TwoPhaseShmemInit(). */ if (i < MaxConnections) { @@ -282,13 +284,13 @@ InitProcGlobal(void) dlist_push_tail(&ProcGlobal->freeProcs, &proc->links); proc->procgloballist = &ProcGlobal->freeProcs; } - else if (i < MaxConnections + autovacuum_max_workers + 1) + else if (i < MaxConnections + autovacuum_max_workers + NUM_SPECIAL_WORKER_PROCS) { - /* PGPROC for AV launcher/worker, add to autovacFreeProcs list */ + /* PGPROC for AV or special worker, add to autovacFreeProcs list */ dlist_push_tail(&ProcGlobal->autovacFreeProcs, &proc->links); proc->procgloballist = &ProcGlobal->autovacFreeProcs; } - else if (i < MaxConnections + autovacuum_max_workers + 1 + max_worker_processes) + else if (i < MaxConnections + autovacuum_max_workers + NUM_SPECIAL_WORKER_PROCS + max_worker_processes) { /* PGPROC for bgworker, add to bgworkerFreeProcs list */ dlist_push_tail(&ProcGlobal->bgworkerFreeProcs, &proc->links); @@ -358,8 +360,11 @@ InitProcess(void) if (IsUnderPostmaster) RegisterPostmasterChildActive(); - /* Decide which list should supply our PGPROC. */ - if (AmAutoVacuumLauncherProcess() || AmAutoVacuumWorkerProcess()) + /* + * Decide which list sh
Re: add support for the old naming libs convention on windows (ssleay32.lib and libeay32.lib)
> On 9 Dec 2024, at 07:25, Darek Ślusarczyk wrote: > I've prepared another patch: > - it prioritizes libssl and libcrypto over ssleay32 and libeay32 > - it checks ssleay32 and libeay32 on windows only > - I tested it locally on both lnx/win enforcing various possible scenarios I'm neither a Windows or Meson expert but this seems correct to me and matches the autoconf stanza for OpenSSL libraries on Windows. CC:ing one of the project experts on Meson for a second opinion. -- Daniel Gustafsson
Re: POC: make mxidoff 64 bits
On Wed, 18 Dec 2024 at 13:21, Heikki Linnakangas wrote: > > Attached is some more cleanup on top of patch set v9, removing more dead > stuff related to wraparound. I also removed the oldestOffsetKnown > variable and related code. It was needed to deal with clusters upgraded > from buggy 9.3 and 9.4 era versions, but now that pg_upgrade will > rewrite the SLRUs, it's no longer needed. > Yep, multixact.c looks correct to me. As for "XXX could use SimpleLruTruncate()", yes, for sure. Actually, xl_multixact_truncate.startTruncMemb is also no longer needed, is it? > Does the pg_upgrade code work though, if you have that buggy situation > where oldestOffsetKnown == false ? > > > > > if (!TransactionIdIsValid(*xactptr)) > > { > > /* Corner case 3: we must be looking at unused > slot zero */ > > Assert(offset == 0); > > continue; > > } > > After upgrade, this corner case 3 would *not* happen on offset == 0. So > looks like we're still missing test coverage for this upgrade corner case. > Am I understanding correctly that you want to have a test corresponding to the buggy 9.3 and 9.4 era versions? Do you think we could imitate this scenario on a current master branch like that: 1) generate a couple of offsets segments for the first table; 2) generate more segments for a second table; 3) drop first table; 4) stop pg cluster; 5) remove pg_multixact/offsets/ 6) upgrade? PFA, v10-0016-TEST-try-to-replicate-buggy-oldest-offset.patch This test will fail now, for an obvious reason, but is this case a relevant one? -- Best regards, Maxim Orlov. <>
Re: [PATCHES] Post-special page storage TDE support
Re-reading this thread, and this has been nagging me:

On Mon, Nov 13, 2023 at 3:03 PM David Christensen <david.christen...@crunchydata.com> wrote:
>> - so why do we need to recompute offsets on every single page? I'd instead
>> add a distinct offset variable for each feature.
>
> This would work iff there is a single page feature set across all pages in
> the cluster; I'm not sure we don't want more flexibility here.

I'm quite skeptical about the usefulness of this; seems to be at best a footgun to have per-page features. Also, none of the examples given so far (e.g. encryption, better checksums) are of a non-global nature, so did you have specific examples in mind?

Cheers,

Greg
Re: [PATCHES] Post-special page storage TDE support
On Fri, Dec 27, 2024 at 10:12 AM Bruce Momjian wrote: > The value of TDE is limited from a security value perspective, but high on > the list of security policy requirements. Our community is much more > responsive to actual value vs policy compliance value. > True. The number of forks, though, makes me feel this is a "when", not "if" feature. Has there been any other complex feature forked/implemented by so many? Maybe columnar storage? Cheers, Greg
Re: Test to dump and restore objects left behind by regression
> On 20 Dec 2024, at 11:01, Ashutosh Bapat wrote: > On Wed, Dec 18, 2024 at 7:39 PM Daniel Gustafsson wrote: >> >>> On 18 Dec 2024, at 12:28, Ashutosh Bapat >>> wrote: >> + if ( $ENV{PG_TEST_EXTRA} >> + && $ENV{PG_TEST_EXTRA} =~ /\bregress_dump_test\b/) >> Should this also test that $oldnode and $newnode have matching pg_version to >> keep this from running in a cross-version upgrade test? While it can be >> argued >> that running this in a cross-version upgrade is breaking it and getting to >> keep >> both pieces, it's also not ideal to run a resource intensive test we know >> will >> fail. (It can't be done at this exact callsite, just picked to illustrate.) > > You already wrote it in parenthesis. At the exact callsite $oldnode > and $newnode can not be of different versions. In fact newnode is yet > to be created at this point. But $oldnode has the same version as the > server run from the code. In a cross-version upgrade this test will > not be executed. I am confused as to what this comment is about. Sure, it can't be checked until $newnode is created, but it seems like a cheap test to ensure it's not executed as part of someones cross-version tests. >> + my $format_spec = substr($format, 0, 1); >> This doesn't seem great for readability, how about storing the formats and >> specfiers in an array of Perl hashes which can be iterated over with >> descriptive names, like $format{'name'} and $format{'spec'}? > > Instead of an array of hashes, I used a single hash with format > description as key and format spec as value. Hope that's acceptable. LGTM. -- Daniel Gustafsson
Bug and memory leaks with access to file links with long names (Windows, MSVS)
Hi! PostgreSQL build under Windows with MS Visual Studio has functions to work with links (unlike msys2 that has own functions). If a database has link pointing to location longer 130 chars, function pgreadlink fails to recognise this link and cancels query. The reason of the error - small buffer for link name - MAX_PATH symbols, though this buffer must have the place for at least 2 MAX_PATH : substitution and print names of the link. How to reproduce: If build directory has length ~100 chars or longer, pg_rewind/004_pg_xlog_symlink test will fail (Windows, MSVS build) Steps to reproduce: call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvarsall.bat" x64 SET LC_MESSAGES=C SET BUILDDIR=c:\postgresql-builds\length_30\length_40\length_50\length_60\length_70\length_80\length_90\length100\ meson setup %BUILDDIR% --prefix=c:\postgresql-builds\install c: cd %BUILDDIR% meson compile -C %BUILDDIR% meson install -C %BUILDDIR% meson test -C %BUILDDIR% Memory leak: In case of error the function in the code branch reporting the error does not return Windows file handle and Windows heap allocation for error message text. Solution: Proposed patch fixes link information buffer size changing it to the documented value MAXIMUM_REPARSE_DATA_BUFFER_SIZE, and fixes memory leaks - the message buffer copied to a buffer on stack with maximal message size 64Kb. -- Best regards, Vladlen Popolitov. From 4eb8afe77064a0cf3d815aa0ffa1ebf815fd1165 Mon Sep 17 00:00:00 2001 From: Vladlen Popolitov Date: Fri, 27 Dec 2024 15:23:30 +0300 Subject: [PATCH v1] Fix bug with access to file links with long name (Windows, MSVC) --- src/port/dirmod.c | 20 +--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/src/port/dirmod.c b/src/port/dirmod.c index f98d5a7bf2..dbdd8a6e9f 100644 --- a/src/port/dirmod.c +++ b/src/port/dirmod.c @@ -310,7 +310,12 @@ pgreadlink(const char *path, char *buf, size_t size) { DWORD attr; HANDLE h; - char buffer[MAX_PATH * sizeof(WCHAR) + offsetof(REPARSE_JUNCTION_DATA_BUFFER, PathBuffer)]; +/* + * The buffer size described in documentation + * https://learn.microsoft.com/en-us/windows-hardware/drivers/ifs/fsctl-set-reparse-point + * It stores paths for SubstitutaName and PrintName. + */ + char buffer[MAXIMUM_REPARSE_DATA_BUFFER_SIZE]; REPARSE_JUNCTION_DATA_BUFFER *reparseBuf = (REPARSE_JUNCTION_DATA_BUFFER *) buffer; DWORD len; int r; @@ -350,6 +355,8 @@ pgreadlink(const char *path, char *buf, size_t size) NULL)) { LPSTR msg; + /* the maximal size of the message returned by FormatMessage is 64Kb */ + char msgBuffer[0x1]; errno = 0; FormatMessage(FORMAT_MESSAGE_ALLOCATE_BUFFER | @@ -358,17 +365,24 @@ pgreadlink(const char *path, char *buf, size_t size) NULL, GetLastError(), MAKELANGID(LANG_ENGLISH, SUBLANG_DEFAULT), (LPSTR) &msg, 0, NULL); + strncpy(msgBuffer, msg, sizeof(msgBuffer)); + if (strlen(msg) >= sizeof(msgBuffer)) + { + msgBuffer[sizeof(msgBuffer) - 1] = 0; + } #ifndef FRONTEND + LocalFree(msg); + CloseHandle(h); ereport(ERROR, (errcode_for_file_access(), errmsg("could not get junction for \"%s\": %s", - path, msg))); + path, msgBuffer))); #else fprintf(stderr, _("could not get junction for \"%s\": %s\n"), path, msg); -#endif LocalFree(msg); CloseHandle(h); +#endif errno = EINVAL; return -1; } -- 2.42.0.windows.2
Re: Exists pull-up application with JoinExpr
Hi Alena,

Thank you for your work on subqueries with JOIN. Have you considered the scenario where the subquery includes a qual like (tc.aid = 1)? When I tried executing those queries I received different results. In my opinion, to prevent this, we should add filters for such quals within the loop 'foreach (lc, all_clauses)'.

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb, tc WHERE ta.id = tb.id AND tc.aid = 1);
                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join (actual rows=1 loops=1)
   Hash Cond: (ta.id = tb.id)
   Buffers: local hit=3
   ->  Seq Scan on ta (actual rows=3 loops=1)
         Buffers: local hit=1
   ->  Hash (actual rows=3 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 33kB
         Buffers: local hit=2
         ->  HashAggregate (actual rows=3 loops=1)
               Group Key: tb.id
               Batches: 1  Memory Usage: 121kB
               Buffers: local hit=2
               ->  Nested Loop (actual rows=3 loops=1)
                     Buffers: local hit=2
                     ->  Seq Scan on tb (actual rows=3 loops=1)
                           Buffers: local hit=1
                     ->  Materialize (actual rows=1 loops=3)
                           Storage: Memory  Maximum Storage: 17kB
                           Buffers: local hit=1
                           ->  Seq Scan on tc (actual rows=1 loops=1)
                                 Filter: (aid = 1)
                                 Rows Removed by Filter: 1
                                 Buffers: local hit=1
(23 rows)

EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb JOIN tc ON ta.id = tb.id WHERE tc.aid = 1);
                              QUERY PLAN
-----------------------------------------------------------------------
 Seq Scan on ta (actual rows=1 loops=1)
   Filter: EXISTS(SubPlan 1)
   Rows Removed by Filter: 2
   Buffers: local hit=6
   SubPlan 1
     ->  Nested Loop (actual rows=0 loops=3)
           Buffers: local hit=5
           ->  Index Only Scan using tb_pkey on tb (actual rows=0 loops=3)
                 Index Cond: (id = ta.id)
                 Heap Fetches: 1
                 Buffers: local hit=4
           ->  Seq Scan on tc (actual rows=1 loops=1)
                 Filter: (aid = 1)
                 Buffers: local hit=1
(14 rows)

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
Re: [Bug Fix]standby may crash when switching-over in certain special cases
Hi, I inserted the following code in walsender.c:2509(v15.8) to reproduce the issue. { WalSnd *walsnd = MyWalSnd; SpinLockAcquire(&walsnd->mutex); if (walsnd->flush % wal_segment_size == 0 && walsnd->sentPtr == walsnd->flush && walsnd->flush > wal_segment_size) { for (int i = 0; i < max_wal_senders; i++) { WalSnd *walsnder = &WalSndCtl->walsnds[i]; if (walsnder->pid == walsnd->pid) continue; SpinLockAcquire(&walsnder->mutex); if (walsnder->pid == 0) { SpinLockRelease(&walsnder->mutex); continue; } if (walsnder->flush % wal_segment_size == 0 && walsnder->flush > wal_segment_size) elog(PANIC, "simulate primary crashed, flushedlsn %X/%X", LSN_FORMAT_ARGS(walsnder->flush)); SpinLockRelease(&walsnder->mutex); } } SpinLockRelease(&walsnd->mutex); } Regards, Pixian Shi px shi 于2024年10月10日周四 15:01写道: > Although I've not tested it because I don't have good way to reproduce >> the problem > > > > I use GDB to reproduce the issue by killing the primary node with kill -9 > when the standby’s flushedlsn was at the begining of a WAL segment. > > > However, this causes to request the primary to stream data from the >> requested location again, even if the standby has already valid data. It >> could be wasteful, for example, when applying WAL record is delayed due to >> conflict with recovery or recovery_min_apply_delay. > > > I had overlooked this problem, thanks for pointing that out. > > I think that comparing walrcv->receiveStart with recptr would be a better > approach. Specifically, if walrcv->receiveStart is greater than recptr, then > reset walrcv->flushedUpto. This way, we can prevent fetching WAL data from > the primary when the standby already has valid data.A new patch has been > submitted. > > > Regards, > > Pixian Shi > > > Yugo NAGATA 于2024年10月9日周三 14:46写道: > >> On Mon, 30 Sep 2024 15:14:54 +0800 >> px shi wrote: >> >> > Thanks for responding. >> > >> > >> > > It is odd that the standby server crashes when >> > >> > replication fails because the standby would keep retrying to get the >> > >> > next record even in such case. >> > >> > >> > As I mentioned earlier, when replication fails, it retries to establish >> > streaming replication. At this point, the value of *walrcv->flushedUpto >> *is >> > not necessarily the data actually flushed to disk. However, the startup >> > process mistakenly believes that the latest flushed LSN is >> > *walrcv->flushedUpto* and attempts to open the corresponding WAL file, >> > which doesn't exist, leading to a file open failure and causing the >> startup >> > process to PANIC. >> >> Although I've not tested it because I don't have good way to reproduce >> the problem, >> the patch seems to fix the issue by rewinding walrcv->flushedUpto to the >> requested >> WAL location always when the streaming is requested to start. However, >> this causes >> to request the primary to stream data from the requested location again, >> even if >> the standby has already valid data. It could be wasteful, for example, >> when applying WAL >> record is delayed due to conflict with recovery or >> recovery_min_apply_delay. >> >> It might be better if if could notice that there is not requested record >> in the primary's >> timeline before requesting the streaming, but I don't have a good >> solution for now. 
>> >> Regards, >> Yugo Nagata >> >> > Regards, >> > Pixian Shi >> > >> > Yugo NAGATA 于2024年9月30日周一 13:47写道: >> > >> > > On Wed, 21 Aug 2024 09:11:03 +0800 >> > > px shi wrote: >> > > >> > > > Yugo Nagata 于2024年8月21日周三 00:49写道: >> > > > >> > > > > >> > > > > >> > > > > > Is s1 a cascading standby of s2? If otherwise s1 and s2 is the >> > > standbys >> > > > > of >> > > > > > the primary server respectively, it is not surprising that s2 >> has >> > > > > progressed >> > > > > > far than s1 when the primary fails. I believe that this is the >> case >> > > you >> > > > > should >> > > > > > use pg_rewind. Even if flushedUpto is reset as proposed in your >> > > patch, >> > > > > s2 might >> > > > > > already have applied a WAL record that s1 has not processed >> yet, and >> > > > > there >> > > > > > would be no gurantee that subsecuent applys suceed. >> > > > > >> > > > > >> > > > Thank you for your response. In my scenario, s1 and s2 is the >> standbys >> > > of >> > > > the primary server respectively, and s1 a synchronous standby and >> s2 is >> > > an >> > > > asynchronous standby. You mentioned that if s2's replay progress is >> ahead >> > > > of s1, pg_rewind should be used. However, what I'm trying to >> address is >> > > an >> > > > issue
Re: Connection limits/permissions, slotsync workers, etc
Also, here's a patch for the rest of what I was talking about. We'll need to back-patch this given that the CVE-2024-10978 changes caused these sorts of problems in all branches, but I've not yet attempted to back-patch. It looks like it might be a bit painful thanks to past code churn in these areas. I didn't do anything about the idea of making rolconnlimit applicable to superusers. If we do that at all, it should only be in HEAD. Also, I got a shade less enthusiastic about it after noting that this logic is parallel to that for datconnlimit, and it does seems sensible to allow superusers to ignore datconnlimit. Maybe it's fine for the two limits to operate differently, but I'm unsure. Also, it probably would make sense to rename PGPROC.isBackgroundWorker to isRegularBackend (inverting the sense of the boolean), but that doesn't seem like back-patch material either, so I didn't include it here. I think we can get away with a subtle adjustment of which processes that flag is set for in the back branches, but not with renaming it. regards, tom lane From 84017bf4da912cad44416b729860baf2edfe66fd Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Fri, 27 Dec 2024 15:36:31 -0500 Subject: [PATCH v1] Exclude parallel workers from connection privilege/limit checks. Cause parallel workers to not check datallowconn, rolcanlogin, and ACL_CONNECT privileges. The leader already checked these things (except for rolcanlogin which might have been checked for a different role). Re-checking can accomplish little except to induce unexpected failures in applications that might not even be aware that their query has been parallelized. We already had the principle that parallel workers rely on their leader to pass a valid set of authorization information, so this change just extends that a bit further. Also, modify the ReservedConnections, datconnlimit and rolconnlimit logic so that these limits are only enforced against regular backends, and only regular backends are counted while checking if the limits were already reached. Previously, background processes that had an assigned database or role were subject to these limits (with rather random exclusions for autovac workers and walsenders), and the set of existing processes that counted against each limit was quite haphazard as well. The point of these limits, AFAICS, is to ensure the availability of PGPROC slots for regular backends. Since all other types of processes have their own separate pools of PGPROC slots, it makes no sense either to enforce these limits against them or to count them while enforcing the limit. While edge-case failures of these sorts have been possible for a long time, the problem got a good deal worse with commit 5a2fed911 (CVE-2024-10978), which caused parallel workers to make these checks using the leader's current role instead of its AuthenticatedUserId, thus allowing parallel queries to fail after SET ROLE. That older behavior was fairly accidental and I have no desire to return to it. This patch allows reverting 73c9f91a1 which was an emergency hack to suppress these same checks in some cases. It wasn't complete, as shown by a recent bug report from Laurenz Albe. We can also revert fd4d93d26 and 492217301, which hacked around the problems in one regression test. Like 5a2fed911, back-patch to supported branches (which sadly no longer includes v12). 
Discussion: https://postgr.es/m/1808397.1735156...@sss.pgh.pa.us --- src/backend/access/transam/parallel.c| 10 -- src/backend/access/transam/twophase.c| 2 +- src/backend/postmaster/bgworker.c| 4 +-- src/backend/storage/ipc/procarray.c | 4 +-- src/backend/storage/lmgr/proc.c | 4 +-- src/backend/utils/init/miscinit.c| 8 - src/backend/utils/init/postinit.c| 40 src/include/miscadmin.h | 1 + src/include/storage/proc.h | 2 +- src/test/modules/worker_spi/worker_spi.c | 10 -- 10 files changed, 37 insertions(+), 48 deletions(-) diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c index 0a1e089ec1..60d95037b3 100644 --- a/src/backend/access/transam/parallel.c +++ b/src/backend/access/transam/parallel.c @@ -1419,10 +1419,16 @@ ParallelWorkerMain(Datum main_arg) fps->session_user_is_superuser); SetCurrentRoleId(fps->outer_user_id, fps->role_is_superuser); - /* Restore database connection. */ + /* + * Restore database connection. We skip connection authorization checks, + * reasoning that (a) the leader checked these things when it started, and + * (b) we do not want parallel mode to cause these failures, because that + * would make use of parallel query plans not transparent to applications. + */ BackgroundWorkerInitializeConnectionByOid(fps->database_id, fps->authenticated_user_id, - 0); + BGWORKER_BYPASS_ALLOWCONN | + BGWORKER_BYPAS
Re: [PATCHES] Post-special page storage TDE support
On Thu, Dec 12, 2024 at 09:15:55AM -0600, David Christensen wrote: > On Tue, Dec 10, 2024 at 12:54 AM Michael Paquier wrote: > > > > On Wed, Mar 13, 2024 at 11:26:48AM -0500, David Christensen wrote: > > > Enclosing v4 for this patch series, rebased atop the > > > constant-splitting series[1]. For the purposes of having cfbot happy, > > > I am including the prerequisites as a squashed commit v4-, however > > > this is not technically part of this series. > > > > The last update of this thread is from march 2024, with no replies and > > no reviews. Please note that this fails in the CI so I'd suggest a > > rebase for now, and I have marked the patch as waiting on author. If > > there is a lack of interest, well.. > > I can't say there is a lack of interest from the author per se :), but > not really seeing much in the way of community engagement makes me > think it's largely unwanted. I'd certainly be happy to rebase and > reengage, but if it's not wanted at the conceptual level it doesn't > seem worth the effort. It's hard to interpret lack of response as > "don't care, fine" vs "don't want" vs "haven't looked, -hackers is a > firehose". The value of TDE is limited from a security value perspective, but high on the list of security policy requirements. Our community is much more responsive to actual value vs policy compliance value. When I started focusing on TDE, it was going to require changes to buffer reads/writes, WAL, and require a way to store secret keys. I thought those changes would be acceptable given TDE's security value. Once the file I/O changes were required, I think the balance tilted to TDE requiring too many code changes given its security value (not policy compliance value). At least that is my analysis, and part of me wishes I was wrong. I know there are several commercial forks of TDE, mostly because companies are more sensitive to policy compliance value, which translates to monetary value for them. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
Re: An improvement of ProcessTwoPhaseBuffer logic
On Friday, December 27, 2024 10:37 MSK, Michael Paquier wrote: > So please see the attached. You will note that RemoveTwoPhaseFile(), > ProcessTwoPhaseBuffer() and TwoPhaseFilePath() now require a > FullTransactionId, hence callers need to think about the epoch to use. > That should limit future errors compared to passing the file name as > optional argument. In general, I like your solution to use FullTransactionId. I haven't found any evident problems. I just would like to propose to create a couple of new functions like RemoveTwoPhaseFileInCurrentEpoch, TwoPhaseFilePathInCurrentEpoch which accept TransactionId instead FullTransactionId. It may make the code clearer and result into less boilerplate code. I tried to do some hand testing. It seems, the problem is gone with the patch. Thank you! As an idea, I would like to propose to store FullTransactionId in global transaction state instead of TransactionId. I'm not sure, it will consume significant additional memory, but it make the code more clear and probably result into less number of locks. With best regards, Vitaly From 1e74e01a2485e2b3a59f880a5eaf86af1af81cc5 Mon Sep 17 00:00:00 2001 From: Vitaly Davydov Date: Fri, 27 Dec 2024 18:03:28 +0300 Subject: [PATCH] Some cosmetic fixes --- src/backend/access/transam/twophase.c | 92 +++ 1 file changed, 37 insertions(+), 55 deletions(-) diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 01f5d728be..9418fd74ee 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -228,6 +228,7 @@ static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid, TimestampTz prepared_at, Oid owner, Oid databaseid); static void RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning); +static void RemoveTwoPhaseFileInCurrentEpoch(TransactionId xid, bool giveWarning); static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len); /* @@ -934,6 +935,22 @@ TwoPhaseFilePath(char *path, FullTransactionId fxid) XidFromFullTransactionId(fxid)); } +static inline int +TwoPhaseFilePathInCurrentEpoch(char *path, TransactionId xid) +{ + uint32 epoch; + FullTransactionId fxid; + FullTransactionId nextFullXid; + + /* Use current epoch */ + nextFullXid = ReadNextFullTransactionId(); + epoch = EpochFromFullTransactionId(nextFullXid); + + fxid = FullTransactionIdFromEpochAndXid(epoch, xid); + + return TwoPhaseFilePath(path, fxid); +} + /* * 2PC state file format: * @@ -1279,16 +1296,8 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok) pg_crc32c calc_crc, file_crc; int r; - FullTransactionId fxid; - FullTransactionId nextFullXid; - uint32 epoch; - - /* get current epoch */ - nextFullXid = ReadNextFullTransactionId(); - epoch = EpochFromFullTransactionId(nextFullXid); - fxid = FullTransactionIdFromEpochAndXid(epoch, xid); - TwoPhaseFilePath(path, fxid); + TwoPhaseFilePathInCurrentEpoch(path, xid); fd = OpenTransientFile(path, O_RDONLY | PG_BINARY); if (fd < 0) @@ -1656,18 +1665,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit) * be from the current epoch. 
*/ if (ondisk) - { - uint32 epoch; - FullTransactionId nextFullXid; - FullTransactionId fxid; - - /* get current epoch */ - nextFullXid = ReadNextFullTransactionId(); - epoch = EpochFromFullTransactionId(nextFullXid); - - fxid = FullTransactionIdFromEpochAndXid(epoch, xid); - RemoveTwoPhaseFile(fxid, true); - } + RemoveTwoPhaseFileInCurrentEpoch(xid, true); MyLockedGxact = NULL; @@ -1723,6 +1721,21 @@ RemoveTwoPhaseFile(FullTransactionId fxid, bool giveWarning) errmsg("could not remove file \"%s\": %m", path))); } +static void +RemoveTwoPhaseFileInCurrentEpoch(TransactionId xid, bool giveWarning) +{ + uint32 epoch; + FullTransactionId nextFullXid; + FullTransactionId fxid; + + /* get current epoch */ + nextFullXid = ReadNextFullTransactionId(); + epoch = EpochFromFullTransactionId(nextFullXid); + + fxid = FullTransactionIdFromEpochAndXid(epoch, xid); + RemoveTwoPhaseFile(fxid, true); +} + /* * Recreates a state file. This is used in WAL replay and during * checkpoint creation. @@ -1735,21 +1748,13 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len) char path[MAXPGPATH]; pg_crc32c statefile_crc; int fd; - FullTransactionId fxid; - FullTransactionId nextFullXid; - uint32 epoch; /* Recompute CRC */ INIT_CRC32C(statefile_crc); COMP_CRC32C(statefile_crc, content, len); FIN_CRC32C(statefile_crc); - /* Use current epoch */ - nextFullXid = ReadNextFullTransactionId(); - epoch = EpochFromFullTransactionId(nextFullXid); - - fxid = FullTransactionIdFromEpochAndXid(epoch, xid); - TwoPhaseFilePath(path, fxid); + TwoPhaseFilePathInCurrentEpoch(path, xid); fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY); @@ -2286,7 +2291,6 @@ ProcessTwoPhaseBuffer(Fu
Re: Re: proposal: schema variables
hi. + if (!OidIsValid(varid)) + AcceptInvalidationMessages(); + else if (OidIsValid(varid)) + LockDatabaseObject(VariableRelationId, varid, 0, AccessShareLock); we can change it to + if (!OidIsValid(varid)) + AcceptInvalidationMessages(); + else + LockDatabaseObject(VariableRelationId, varid, 0, AccessShareLock); inval_count didn't explain at all, other places did actually explain it. Can we add some text explaining inval_count? (i didn't fully understand this part, that is why i am asking..) seems IdentifyVariable all these three ereport(ERROR...) don't have regress tests, i think we should have it. Am I missing something? create variable v2 as int; let v2.a = 1; ERROR: type "integer" of target session variable "public.v2" is not a composite type LINE 1: let v2.a = 1; ^ the error messages look weird. IMO, it should either be "type of session variable "public.v2" is not a composite type" or "session variable "public.v2" don't have attribute "a" also, the above can be as a regress test for: + if (attrname && !is_rowtype) + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("type \"%s\" of target session variable \"%s.%s\" is not a composite type", + format_type_be(typid), + get_namespace_name(get_session_variable_namespace(varid)), + get_session_variable_name(varid)), + parser_errposition(pstate, stmt->location))); since we don't have coverage tests for it. + if (coerced_expr == NULL) + ereport(ERROR, + (errcode(ERRCODE_DATATYPE_MISMATCH), + errmsg("variable \"%s.%s\" is of type %s," + " but expression is of type %s", + get_namespace_name(get_session_variable_namespace(varid)), + get_session_variable_name(varid), + format_type_be(typid), + format_type_be(exprtypid)), + errhint("You will need to rewrite or cast the expression."), + parser_errposition(pstate, exprLocation((Node *) expr; generally, errmsg(...) should put it into one line for better grep-ability so we can change it to: +errmsg("variable \"%s.%s\" is of type %s, but expression is of type %s"...) also no coverage tests? simple test case for it: create variable v2 as int; let v2 = '1'::jsonb; ---<<<>>>-- +let_target: + ColId opt_indirection + { + $$ = list_make1(makeString($1)); + if ($2) + $$ = list_concat($$, + check_indirection($2, yyscanner)); + } if you look closely, it seems the indentation level is wrong in line "$$ = list_concat($$,"? ---<<<>>>-- i did some refactoring in session_variables.sql for privilege check. make the tests more neat, please check attached. diff --git a/src/test/regress/expected/session_variables.out b/src/test/regress/expected/session_variables.out index 9f75128c42..4873ac5438 100644 --- a/src/test/regress/expected/session_variables.out +++ b/src/test/regress/expected/session_variables.out @@ -115,85 +115,126 @@ ALTER DEFAULT PRIVILEGES SET ROLE TO regress_variable_owner; CREATE VARIABLE svartest.var1 AS int; SET ROLE TO DEFAULT; -\dV+ svartest.var1 -List of variables - Schema | Name | Type | Collation | Owner |Access privileges | Description ---+--+-+---++--+- - svartest | var1 | integer | | regress_variable_owner | regress_variable_owner=rw/regress_variable_owner+| - | | | || regress_variable_reader=r/regress_variable_owner | +--should be ok. 
since ALTER DEFAULT PRIVILEGES +--allow regress_variable_reader to have SELECT priviledge +SELECT has_session_variable_privilege('regress_variable_reader', 'svartest.var1', 'SELECT'); -- t + has_session_variable_privilege + + t (1 row) DROP VARIABLE svartest.var1; DROP SCHEMA svartest; DROP ROLE regress_variable_reader; --- check WITH GRANT OPTION +-begin of check GRANT WITH GRANT OPTION and REVOKE GRANTED BY CREATE ROLE regress_variable_r1; CREATE ROLE regress_variable_r2; SET ROLE TO regress_variable_owner; -CREATE VARIABLE var1 AS int; +CREATE VARIABLE var1 AS int; --var1 will owned by regress_variable_owner GRANT SELECT ON VARIABLE var1 TO regress_variable_r1 WITH GRANT OPTION; SET ROLE TO regress_variable_r1; GRANT SELECT ON VARIABLE var1 TO regress_variable_r2 WITH GRANT OPTION; SET ROLE TO DEFAULT; -SELECT varacl FROM pg_variable WHERE varname = 'var1'; - varacl -- - {regress_variable_owner=rw/regress_variable_owner,regress_variable_r1=r*/regress_variable_owner,regress_variable_r2=r*/regress_variable_r1} +SELECT has_session_variable_privilege('regress_variable_r1', 'pu
Re: Add the ability to limit the amount of memory that can be allocated to backends.
Reviving this thread, because I am thinking about something related -- please ignore the "On Fri, Dec 27, 2024" date, this seems to be an artifact of me re-sending the message, from the list archive. The original message was from January 28, 2024. On Fri, Dec 27, 2024 at 11:02 AM Tomas Vondra wrote: > > Firstly, I agree with the goal of having a way to account for memory > used by the backends, and also ability to enforce some sort of limit. > It's difficult to track the memory at the OS level (interpreting RSS > values is not trivial), and work_mem is not sufficient to enforce a > backend-level limit, not even talking about a global limit. > > But as I said earlier, it seems quite strange to start by introducing > some sort of global limit, combining memory for all backends. I do > understand that the intent is to have such global limit in order to > prevent issues with the OOM killer and/or not to interfere with other > stuff running on the same machine. And while I'm not saying we should > not have such limit, every time I wished to have a memory limit it was a > backend-level one. Ideally a workmem-like limit that would "adjust" the > work_mem values used by the optimizer (but that's not what this patch > aims to do), or at least a backstop in case something goes wrong (say, a > memory leak, OLTP application issuing complex queries, etc.). I think what Tomas suggests is the right strategy. I am thinking of something like: 1. Say we have a backend_work_mem limit. Then the total amount of memory available on the entire system, for all queries, as work_mem, would be backend_work_mem * max_connections. 2. We use this backend_work_mem to "adjust" work_mem values used by the executor. (I don't care about the optimizer right now -- optimizer just does its best to predict what will happen at runtime.) At runtime, every node that uses work_mem currently checks its memory usage against the session work_mem (and possibly hash_mem_multiplier) GUC(s). Instead, now, every node would check against its node-specific "adjusted" work_mem. If it exceeds this limit, it spills, using existing logic. In other words -- existing logic spills based on comparison to global work_mem GUC. Let's make it spill, instead, based on comparison to an operator-local "PlanState.work_mem" field. And then let's set that "PlanState.work_mem" field based on a new "backend_work_mem" GUC. Then no queries will run OOM (at least, not due to work_mem -- we'll address other memory usages separately), so they won't need to be canceled. Instead, they'll spill. This strategy solves the ongoing problem of how to set work_mem, if some queries have lots of operators and others don't -- now we just set backend_work_mem, as a limit on the entire query's total work_mem. And a bit of integration with the optimizer will allow us to distribute the total backend_work_mem to individual execution nodes, with the goal of minimizing spilling, without exceeding the backend_work_mem limit. Anyway, I revived this thread to see if there's interest in this sort of strategy -- Thanks, James
Re: [PATCHES] Post-special page storage TDE support
On Fri, Dec 27, 2024 at 12:25:11PM -0500, Greg Sabino Mullane wrote: > On Fri, Dec 27, 2024 at 10:12 AM Bruce Momjian wrote: > > The value of TDE is limited from a security value perspective, but high on > the list of security policy requirements. Our community is much more > responsive to actual value vs policy compliance value. > > > True. The number of forks, though, makes me feel this is a "when", not "if" > feature. Has there been any other complex feature forked/implemented by so > many? Maybe columnar storage? That is a great question. We have TDE implementations from EDB, Fujitsu, Percona, Cybertec, and Crunchy Data, and perhaps others, and that is a lot of duplicated effort. As far as parallels, I think compatibility with Oracle and MSSQL are areas that several companies have developed that the community is unlikely to ever develop, I think because they are pure compatibility, not functionality. I think TDE having primarily policy compliance value also might make it something the community never develops. I think this blog post is the clearest I have seen about the technical value vs.policy compliance value of TDE: https://www.percona.com/blog/why-postgresql-needs-transparent-database-encryption-tde/ One possible way TDE could be added to community Postgres is if the code changes required were reduced due to an API redesign. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
Re: Windows meson build
Vladlen Popolitov wrote on 2024-12-27 09:57: Kohei Harikae (Fujitsu) wrote on 2024-12-27 04:31: Hi, Thank you for your advice. I added a sample setting of PKG_CONFIG_PATH based on your advice. What do you think about this? Regards, Kohei Harikae Hi, From my point of view it looks good. The relevant information is added and the excessive information is removed. In the phrase "add the directory where the .pc file resides to PKG_CONFIG_PATH." it would be better to add wording that PKG_CONFIG_PATH is a list of directories separated by the PATH separator (; on Windows, : on POSIX). Otherwise the reader has to search the meson documentation for additional information about this variable. Hi I think it is also better to describe the other way to configure external libraries, for the case when they do not have a .pc file: add the meson options defined by PostgreSQL for its configuration (gettext library example) -Dextra_include_dirs=c:\postgres\gettext\x64-windows\include,c:\otherlibs\include -Dextra_lib_dirs=c:\postgres\gettext\x64-windows\lib,c:\otherlibs\lib extra_include_dirs and extra_lib_dirs are options defined by PostgreSQL. These options are comma-separated lists of directories, as long as no value contains a comma itself. -- Best regards, Vladlen Popolitov.
Re: pure parsers and reentrant scanners
On 26/12/2024 20:27, Peter Eisentraut wrote: On 22.12.24 22:43, Andreas Karlsson wrote: On 12/19/24 9:57 PM, Peter Eisentraut wrote: Here is an updated patch set on top of what has been committed so far, with all the issues you pointed out addressed. Other than the discussion of how old versions of flex we should support I think this set of patches is ready to be committed. I looked at it again and everything looks good. I have committed these except the plpgsql one, which was still work in progress. But I have progressed on this now and also converted the parser and put the local state into yyextra. This gets rid of all internal global state now. The patches for this are attached. It's a lot of churn, but otherwise pretty standard stuff. Looks good to me. Along the way I noticed that the flex documentation now recommends a different way to set the yyextra type. So I have changed the ones we already have to that newer style. I inspected the generated C code and there wasn't any significant difference, so I'm not sure, but I figure if we're making changes in this area we might as well use the modern style. +1. According to the flex NEWS file, this syntax was added in flex 2.5.34, and we already require 2.5.35. -- Heikki Linnakangas Neon (https://neon.tech)
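For readers not familiar with the two styles being compared here, a minimal sketch of a flex definitions section showing both follows; core_yy_extra_type is used only as a representative extra-data struct, and the exact lines in the committed patches may differ.

%{
/* older style: override flex's default yyextra type via the preprocessor */
#define YY_EXTRA_TYPE core_yy_extra_type *
%}

/* newer style (flex >= 2.5.34, per the NEWS entry mentioned above):
 * let flex declare the type itself */
%option reentrant
%option extra-type="core_yy_extra_type *"

Either way, scanner actions reach their state through yyextra (e.g. yyextra->literallen = 0;), so the change is purely about where the type is declared.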
Re: Logical replication timeout
> Right, because of the reason I posted [1]. > > I updated the patch which did the same approach. It could pass my CI. > Could you please apply on 17.2 and test it? > > [1]: > https://www.postgresql.org/message-id/OSCPR01MB14966B646506E0C9B81B3A4CFF5022%40OSCPR01MB14966.jpnprd01.prod.outlook.com This is a considerable improvement: the cleanup phase took less than 30s (compared to the former 200s). However, we are talking about a 12s transaction that takes 64s overall to be replicated. In this particular case, the replication system spends most of its time creating/deleting small files. Would it not be possible to create just one spill file for the main transaction?
RE: Logical Replication of sequences
Dear Vignesh, Thanks for updating the patch! Here are my comments. 01. SyncingRelationsState ``` * SYNC_RELATIONS_STATE_NEEDS_REBUILD The subscription relations state is no * longer valid and the subscription relations should be rebuilt. ``` Can we follow the style like other lines? Like: SYNC_RELATIONS_STATE_NEEDS_REBUILD indicates that the subscription relations state is no longer valid and the subscription relations should be rebuilt. 02. FetchRelationStates() ``` /* Fetch tables and sequences that are in non-ready state. */ rstates = GetSubscriptionRelations(MySubscription->oid, true, true, false); ``` I think rstates should be list_free_deep()'d after the foreach(). 03. LogicalRepSyncSequences ``` charslotname[NAMEDATALEN]; ... snprintf(slotname, NAMEDATALEN, "pg_%u_sync_sequences_" UINT64_FORMAT, subid, GetSystemIdentifier()); /* * Here we use the slot name instead of the subscription name as the * application_name, so that it is different from the leader apply worker, * so that synchronous replication can distinguish them. */ LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, true, true, must_use_password, slotname, &err); ``` Hmm, IIUC the sequence sync worker does not require any replication slots. I feel the variable name should be "application_name" and the comment can be updated. 04. LogicalRepSyncSequences ``` /* Get the sequences that should be synchronized. */ sequences = GetSubscriptionRelations(subid, false, true, false); ``` I think rstates should be list_free_deep()'d after the foreach_ptr(). 05. LogicalRepSyncSequences ``` /* * COPY FROM does not honor RLS policies. That is not a problem for * subscriptions owned by roles with BYPASSRLS privilege (or * superuser, who has it implicitly), but other roles should not be * able to circumvent RLS. Disallow logical replication into RLS * enabled relations for such roles. */ if (check_enable_rls(RelationGetRelid(sequence_rel), InvalidOid, false) == RLS_ENABLED) ereport(ERROR, errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("user \"%s\" cannot replicate into sequence with row-level security enabled: \"%s\"", GetUserNameFromId(GetUserId(), true), RelationGetRelationName(sequence_rel))); ``` Can we really set a row level security policy to sequences? I've tested but I couldn't. ``` postgres=# CREATE SEQUENCE s; CREATE SEQUENCE postgres=# ALTER TABLE s ENABLE ROW LEVEL SECURITY ; ERROR: ALTER action ENABLE ROW SECURITY cannot be performed on relation "s" DETAIL: This operation is not supported for sequences. ``` 06. copy_sequence ``` appendStringInfo(&cmd, "SELECT c.oid, c.relkind\n" "FROM pg_catalog.pg_class c\n" "INNER JOIN pg_catalog.pg_namespace n\n" " ON (c.relnamespace = n.oid)\n" "WHERE n.nspname = %s AND c.relname = %s", quote_literal_cstr(nspname), quote_literal_cstr(relname)); ``` I feel the function is not so efficient because it can obtain only a tuple, i.e., information for one sequence at once. I think you basically copied from fetch_remote_table_info(), but it was OK for it because the tablesync worker handles only a table. Can you obtain all sequences at once and check whether each sequences match or not? 07. LogicalRepSyncSequences ``` /* LOG all the sequences synchronized during current batch. 
*/ for (int i = 0; i < curr_batch_seq; i++) { SubscriptionRelState *done_seq; done_seq = (SubscriptionRelState *) lfirst(list_nth_cell(sequences_not_synced, (curr_seq - curr_batch_seq) + i)); ereport(DEBUG1, errmsg_internal("logical replication synchronization for subscription \"%s\", sequence \"%s\" has finished", get_subscr
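Regarding items 02 and 04, the suggested cleanup boils down to the pattern sketched below; the wrapper function name is invented for illustration and the real call sites in the patch look different.

static void
process_nonready_rels(void)
{
	List	   *rstates;
	ListCell   *lc;

	/* fetch tables and sequences that are in non-ready state */
	rstates = GetSubscriptionRelations(MySubscription->oid, true, true, false);

	foreach(lc, rstates)
	{
		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);

		/* ... act on rstate->relid / rstate->state here ... */
		(void) rstate;
	}

	/* the list and its elements are palloc'd, so free them deeply */
	list_free_deep(rstates);
}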
RE: Memory leak in pg_logical_slot_{get,peek}_changes
On Thursday, December 26, 2024 8:01 PM vignesh C wrote: > > Here is an updated version which includes registers to reset the memory > context that is in-line with a recently committed patch at [1]. Thanks for updating the patches! They look good to me. Just to confirm, would the other stuff (streamed_txns) that is allocated under CacheMemoryContext also leak memory? I think it's OK to change that separately if it does, but I would just like to confirm whether there is a risk. > [1] - > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=cfd6cbc > f9be078fbdf9587014231a5772039b386 Best Regards, Hou zj
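For context, the generic way to tie cleanup to a context reset is MemoryContextRegisterResetCallback(); the sketch below only illustrates that mechanism with a made-up cached_data pointer, and is not a claim about how the committed patch above is implemented.

static void *cached_data = NULL;

/* callback invoked when the context is reset or deleted */
static void
cache_reset_callback(void *arg)
{
	/* forget any pointers that pointed into the context being reset */
	cached_data = NULL;
}

static void
register_cache_cleanup(MemoryContext ctx)
{
	MemoryContextCallback *cb;

	/* allocate the callback struct in the context it watches */
	cb = MemoryContextAlloc(ctx, sizeof(MemoryContextCallback));
	cb->func = cache_reset_callback;
	cb->arg = NULL;

	MemoryContextRegisterResetCallback(ctx, cb);
}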
Re: Crash: invalid DSA memory alloc request
On Wed, Dec 18, 2024 at 2:42 AM Andreas 'ads' Scherbaum wrote: > On 17/12/2024 22:32, Nathan Bossart wrote: > > Committed. > > > > Thanks, I see you backpatched it all the way to 13. > Will see how far back I can test this, will take a while. > Was able to test HEAD in all branches back to 13, no crash seen. -- Andreas 'ads' Scherbaum German PostgreSQL User Group European PostgreSQL User Group - Board of Directors Volunteer Regional Contact, Germany - PostgreSQL Project
Re: cannot to compile extension by meson on windows
Pavel Stehule wrote on 2024-12-01 20:52: Hi Did somebody test compilation of any extension on the WIN platform by using meson? I prepared meson.build https://github.com/orafce/orafce/blob/master/meson.build I tested it successfully on Linux. But it fails on Windows - a lot of compilation fails on missing libintl.h DOCDIR = C:/PROGRA~1/POSTGR~1/16/doc HTMLDIR = C:/PROGRA~1/POSTGR~1/16/doc INCLUDEDIR = C:/PROGRA~1/POSTGR~1/16/include PKGINCLUDEDIR = C:/PROGRA~1/POSTGR~1/16/include INCLUDEDIR-SERVER = C:/PROGRA~1/POSTGR~1/16/include/server It looks like msvc cannot work with just this configuration. I can compile orafce when I use the setup described by https://github.com/orafce/orafce/blob/master/README.msvc Regards Pavel Hi! In another thread https://www.postgresql.org/message-id/TYVPR01MB1133078C93F9FE432CA466573E40E2%40TYVPR01MB11330.jpnprd01.prod.outlook.com Kohei Harikae did good work clarifying the meson documentation, especially regarding additional libraries. I suppose that in your case meson did not find the gettext library and libintl.h from that library. You can: 1) install gettext, e.g. with the vcpkg package manager: vcpkg.exe install gettext:x64-windows --x-install-root=c:\postgres\gettext 2) add the directory of the gettext .pc file to PKG_CONFIG_PATH (a ;-separated list; meson uses the configurations in this order): SET PKG_CONFIG_PATH=c:\postgres\gettext\x64-windows\lib\pkgconfig;%PKG_CONFIG_PATH% but this file is not created for this library (you can create it yourself, but another solution exists - see step 4) 3) make all the other steps to run meson, e.g. call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvarsall.bat" x64 4) run meson setup with the option -Dnls=enabled - it enables national languages and uses gettext. And you have to add options for the gettext library and include directories: meson setup c:\builddir -Dnls=enabled -Dextra_include_dirs=c:\postgres\gettext\x64-windows\include,c:\otherlibs\include -Dextra_lib_dirs=c:\postgres\gettext\x64-windows\lib,c:\otherlibs\lib ...other options... extra_include_dirs and extra_lib_dirs are options defined by PostgreSQL. These options are comma-separated lists of directories, as long as no value contains a comma itself (if a value does contain a comma, it is better to read the meson get_options() documentation; it is not easy to explain the meson language syntax briefly). Using PKG_CONFIG_PATH, meson detects the library and does all the work with include and lib paths, but if a library does not have a .pc file, you can define the paths in these options. If you build with -Dnls=enabled, you have to be careful with tests. Tests in this case run with the default system language, and some tests will fail, as they check the log file output and compare it with English answers. I hope this helps you. -- Best regards, Vladlen Popolitov.
Re: Convert macros to static inline functions
On 16.05.22 10:27, Peter Eisentraut wrote: Inspired by [0], I looked to convert more macros to inline functions. This is an older thread where I left something unfinished: Note 2: Many macros in htup_details.h operate both on HeapTupleHeader and on MinimalTuple, so converting them to a function doesn't work in a straightforward way. I have some in-progress work in that area, but I have not included any of that here. Here is the patch set for this. There are actually only two macros that operate on both HeapTupleHeader and MinimalTuple, so it wasn't as much as I had written above. I just left those as macros. I converted the rest to inline functions in a straightforward way as before. A small amount of reordering was necessary. But just for language-nerd fun, I'm including here an additional patch showing how the remaining ones could be done with C11 generic selection. I'm not planning to commit that one at this time. From 422181163bfe1a292ee51941a007d54bc127ccec Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Fri, 27 Dec 2024 11:07:35 +0100 Subject: [PATCH 1/3] Add some const decorations (htup.h) --- src/backend/access/heap/heapam.c | 6 +++--- src/backend/utils/time/combocid.c | 6 +++--- src/include/access/htup.h | 8 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 329e727f80d..75171e8f2c0 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -7419,10 +7419,10 @@ MultiXactIdGetUpdateXid(TransactionId xmax, uint16 t_infomask) * checking the hint bits. */ TransactionId -HeapTupleGetUpdateXid(HeapTupleHeader tuple) +HeapTupleGetUpdateXid(const HeapTupleHeaderData *tup) { - return MultiXactIdGetUpdateXid(HeapTupleHeaderGetRawXmax(tuple), - tuple->t_infomask); + return MultiXactIdGetUpdateXid(HeapTupleHeaderGetRawXmax(tup), + tup->t_infomask); } /* diff --git a/src/backend/utils/time/combocid.c b/src/backend/utils/time/combocid.c index f85510b74ff..1ecdc93b7e2 100644 --- a/src/backend/utils/time/combocid.c +++ b/src/backend/utils/time/combocid.c @@ -101,7 +101,7 @@ static CommandId GetRealCmax(CommandId combocid); */ CommandId -HeapTupleHeaderGetCmin(HeapTupleHeader tup) +HeapTupleHeaderGetCmin(const HeapTupleHeaderData *tup) { CommandId cid = HeapTupleHeaderGetRawCommandId(tup); @@ -115,7 +115,7 @@ HeapTupleHeaderGetCmin(HeapTupleHeader tup) } CommandId -HeapTupleHeaderGetCmax(HeapTupleHeader tup) +HeapTupleHeaderGetCmax(const HeapTupleHeaderData *tup) { CommandId cid = HeapTupleHeaderGetRawCommandId(tup); @@ -150,7 +150,7 @@ HeapTupleHeaderGetCmax(HeapTupleHeader tup) * changes the tuple in shared buffers. 
*/ void -HeapTupleHeaderAdjustCmax(HeapTupleHeader tup, +HeapTupleHeaderAdjustCmax(const HeapTupleHeaderData *tup, CommandId *cmax, bool *iscombo) { diff --git a/src/include/access/htup.h b/src/include/access/htup.h index 116cb1bb273..a5f1ef2441d 100644 --- a/src/include/access/htup.h +++ b/src/include/access/htup.h @@ -78,12 +78,12 @@ typedef HeapTupleData *HeapTuple; #define HeapTupleIsValid(tuple) PointerIsValid(tuple) /* HeapTupleHeader functions implemented in utils/time/combocid.c */ -extern CommandId HeapTupleHeaderGetCmin(HeapTupleHeader tup); -extern CommandId HeapTupleHeaderGetCmax(HeapTupleHeader tup); -extern void HeapTupleHeaderAdjustCmax(HeapTupleHeader tup, +extern CommandId HeapTupleHeaderGetCmin(const HeapTupleHeaderData *tup); +extern CommandId HeapTupleHeaderGetCmax(const HeapTupleHeaderData *tup); +extern void HeapTupleHeaderAdjustCmax(const HeapTupleHeaderData *tup, CommandId *cmax, bool *iscombo); /* Prototype for HeapTupleHeader accessors in heapam.c */ -extern TransactionId HeapTupleGetUpdateXid(HeapTupleHeader tuple); +extern TransactionId HeapTupleGetUpdateXid(const HeapTupleHeaderData *tup); #endif /* HTUP_H */ base-commit: d85ce012f99f63249bb45a78fcecb7a2383920b1 -- 2.47.1 From 89308e0446259a1ae619e39603324d2c2b8ed66c Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Fri, 27 Dec 2024 11:07:35 +0100 Subject: [PATCH 2/3] Convert macros to static inline functions (htup_details.h, itup.h) --- contrib/pageinspect/heapfuncs.c | 15 +- src/include/access/htup_details.h | 543 ++ src/include/access/itup.h | 20 +- 3 files changed, 352 insertions(+), 226 deletions(-) diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c index 41ff597199c..4961a3c866c 100644 --- a/contrib/pageinspect/heapfuncs.c +++ b/contrib/p
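Since the generic-selection patch is included only for language-nerd fun, here is a small self-contained illustration of the C11 _Generic mechanism it relies on, with made-up struct names standing in for HeapTupleHeader and MinimalTuple (both of which expose the fields the remaining macros need):

#include <stdio.h>

/* both structs expose the same field, as the real tuple headers do */
typedef struct FullHeader { int natts; } FullHeader;
typedef struct MiniHeader { int natts; } MiniHeader;

/* compile-time dispatch on the argument's pointer type;
 * only the selected branch is evaluated */
#define HDR_GET_NATTS(h) \
	_Generic((h), \
			 FullHeader *: ((h)->natts), \
			 MiniHeader *: ((h)->natts))

#define HDR_KIND(h) \
	_Generic((h), FullHeader *: "full", MiniHeader *: "minimal")

int
main(void)
{
	FullHeader	f = {.natts = 3};
	MiniHeader	m = {.natts = 5};

	printf("%s: %d\n", HDR_KIND(&f), HDR_GET_NATTS(&f));
	printf("%s: %d\n", HDR_KIND(&m), HDR_GET_NATTS(&m));
	return 0;
}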
Re: Proposal: add new API to stringinfo
Michael said: > New APIs are materials for HEAD, so recompilation needs to happen > anyway. Using a macro makes things slightly simpler and it would not > break unnecessarily the compilation of existing extensions. Ok. David said: > I didn't review the patch in detail, but I think "initsize" would be a > better parameter name than "size". Ok, will change to "initsize". With the opinions from Michael and David in mind, what about the following additional APIs? #define STRINGINFO_DEFAULT_SIZE 1024 /* default initial allocation size */ #define STRINGINFO_SMALL_SIZE 64 /* small initial allocation size */ #define makeStringInfo() makeStringInfoExtended(STRINGINFO_DEFAULT_SIZE) #define initStringInfo(str) initStringInfoExtended(str, STRINGINFO_DEFAULT_SIZE) extern StringInfo makeStringInfoExtended(int initsize); extern void initStringInfoExtended(StringInfo str, int initsize); Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese: http://www.sraoss.co.jp
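For readers skimming the proposal, a minimal sketch of what the two new functions could look like, simply mirroring the existing initStringInfo()/makeStringInfo() bodies in stringinfo.c; this follows the shape proposed above, is not committed code, and the Assert is only illustrative.

void
initStringInfoExtended(StringInfo str, int initsize)
{
	Assert(initsize >= 1);

	str->data = (char *) palloc(initsize);
	str->maxlen = initsize;
	resetStringInfo(str);
}

StringInfo
makeStringInfoExtended(int initsize)
{
	StringInfo	res;

	res = (StringInfo) palloc(sizeof(StringInfoData));
	initStringInfoExtended(res, initsize);

	return res;
}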
Re: Should fix a comment referring to stats collector?
On Tue, Oct 31, 2023 at 11:02 PM Bruce Momjian wrote: > > On Mon, Aug 1, 2022 at 09:05:45PM +0900, torikoshia wrote: > > On 2022-07-30 02:53, Alvaro Herrera wrote: > > > > > I don't think this refers to the statistics collector process; I > > > understand it to refer to ANALYZE that captures the data being stored. > > > > Thanks for the explanation! > > > > > Maybe it should just say "K may be chosen at ANALYZE time". > > > > It seems clearer than current one. > > Change made in master. I believe some redundant wording has been committed: "K may be chosen may be chosen at ANALYZE time". The attached patch fixes it, along with some line adjustments. > > -- > Bruce Momjian https://momjian.us > EDB https://enterprisedb.com > > Only you can decide what is important to you. > > -- Regards Junwang Zhao v1-0001-remove-redundant-wording-in-pg_statistic.h.patch Description: Binary data
Re: [PATCHES] Post-special page storage TDE support
On Fri, Dec 27, 2024 at 1:58 PM Bruce Momjian wrote: > > On Fri, Dec 27, 2024 at 12:25:11PM -0500, Greg Sabino Mullane wrote: > > On Fri, Dec 27, 2024 at 10:12 AM Bruce Momjian wrote: > > > > The value of TDE is limited from a security value perspective, but high > > on > > the list of security policy requirements. Our community is much more > > responsive to actual value vs policy compliance value. > > > > > > True. The number of forks, though, makes me feel this is a "when", not "if" > > feature. Has there been any other complex feature forked/implemented by so > > many? Maybe columnar storage? > > That is a great question. We have TDE implementations from EDB, > Fujitsu, Percona, Cybertec, and Crunchy Data, and perhaps others, and > that is a lot of duplicated effort. > > As far as parallels, I think compatibility with Oracle and MSSQL are > areas that several companies have developed that the community is > unlikely to ever develop, I think because they are pure compatibility, > not functionality. I think TDE having primarily policy compliance value > also might make it something the community never develops. > > I think this blog post is the clearest I have seen about the technical > value vs.policy compliance value of TDE: > > > https://www.percona.com/blog/why-postgresql-needs-transparent-database-encryption-tde/ > > One possible way TDE could be added to community Postgres is if the code > changes required were reduced due to an API redesign. A couple big pieces here could be modifying the API to add PreparePageForWrite()/PreparePageFromRead() hooks to transform the data page once read from disk or getting ready to write to disk. I think I have a (not yet rebased atop bulk read/write API and various incremental backup pieces) patch version which basically refactors things somewhat to support that, but basically making a single call point that we can add things like checksums, page encryption, etc. I think there was also a thread floating around moving various arbitrary read/write file APIs (for temporary files) into a common API; my recollection is there was something along the lines of 4 or 5 different file read abstractions we used at various pieces in the code base, so making a common one that could also be hooked would give us the ability to make a read/write transient file API that we could then "do stuff" with. (My recollection is we could support encryption and compression, but don't have the thread in front of me.) Some form of early init pluggability would be required, since we'd need to support reading encrypted WAL before full startup is accomplished. This seems harder, at least given the bits I was originally plugging into. Obviously existing forks would want to support reading their existing clusters, so not sure there is an all-in-one solution here. Just some musing here... David
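Purely as an illustration of the "single call point" idea being mused about, and reusing the hypothetical names from the message (none of this is an existing PostgreSQL API):

/* hypothetical hook types: transform a page on its way to or from disk */
typedef void (*PreparePageForWrite_hook_type) (Page page, BlockNumber blkno);
typedef void (*PreparePageFromRead_hook_type) (Page page, BlockNumber blkno);

extern PGDLLIMPORT PreparePageForWrite_hook_type PreparePageForWrite_hook;
extern PGDLLIMPORT PreparePageFromRead_hook_type PreparePageFromRead_hook;

/* single call point just before a buffer is handed to the storage manager */
static inline void
PreparePageForWrite(Page page, BlockNumber blkno)
{
	/* checksums, encryption, compression, ... could all hang off this */
	if (PreparePageForWrite_hook)
		PreparePageForWrite_hook(page, blkno);
}

/* ... and a matching call point right after a page is read back in */
static inline void
PreparePageFromRead(Page page, BlockNumber blkno)
{
	if (PreparePageFromRead_hook)
		PreparePageFromRead_hook(page, blkno);
}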
Re: Logging parallel worker draught
Thanks for rebasing. I took a look and have some high level comments. 1/ The way the patches stand now, different parallel commands will result in differently formatted log lines. This may be OK for reading the logs manually, but it will be hard to consume via tools that parse logs, especially if we support other parallel commands in the future. Thinking about this further, it seems to me this logging only makes sense for parallel query and not maintenance commands. This is because for the latter case, the commands are executed manually and a user can issue VACUUM VERBOSE (for the case of vacuum). For CREATE INDEX, one can enable DEBUG1. For example: postgres=# CREATE INDEX tbl_c1 ON tbl(c1); DEBUG: building index "tbl_c1" on table "tbl" with request for 1 parallel workers DEBUG: index "tbl_c1" can safely use deduplication CREATE INDEX It does appear, though, that the DEBUG1 message is missing the number of workers actually used, but this could be taken up in a separate discussion. 2/ For logging parallel query workers, the log line could be simpler. Instead of: + elog(LOG, "%i parallel nodes planned (%i obtained all their workers, %i obtained none), " + "%i workers planned to be launched (%i workers launched)", what about something like: "launched %d parallel workers (planned: %d)" For example, if I have 2 gather nodes and they each plan and launch 2 workers, the log line should be as simple as "launched 4 parallel workers (planned: 4)" This behavior, in which workers planned and launched are aggregated in the log, can be described in the documentation. 3/ Also, log_parallel_query_workers as the name of the GUC will make more sense if we go with logging parallel query only. What do you think? Regards, Sami Imseih Amazon Web Services (AWS) On Thu, Dec 26, 2024 at 8:25 AM Benoit Lobréau wrote: > > This is just a rebase. > > As stated before, I added some information to the error message for parallel queries as an experiment. It can be removed, if you're not convinced. > > --- > Benoit Lobréau > Consultant > http://dalibo.com
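A minimal sketch of what the simpler, aggregated log line could boil down to, assuming counters summed over all Gather/Gather Merge nodes at the end of execution; the GUC and variable names just follow the discussion here and are not taken from the patch.

	/* emitted once per query, after all parallel nodes have finished */
	if (log_parallel_query_workers && workers_planned > 0)
		ereport(LOG,
				(errmsg("launched %d parallel workers (planned: %d)",
						workers_launched, workers_planned)));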
Re: Expand applicability of aggregate's sortop optimization
On 12/11/24 07:22, Michael Paquier wrote: On Fri, Nov 08, 2024 at 01:05:04PM +0700, Andrei Lepikhov wrote: Rebase onto current master and slight improvement. Your patch is failing in the CI, please rebase. I have it to the next CF for now, waiting on author. Thanks, I overlooked the fact that regression tests can be launched in parallel. Therefore, adding an index to the frequently used tenk1 table was a bad idea. In the attachment - the new patch rebased on the current master. -- regards, Andrei LepikhovFrom 880ddf7170a0e01fb852172ba99a32e2494f50c6 Mon Sep 17 00:00:00 2001 From: "Andrei V. Lepikhov" Date: Fri, 8 Nov 2024 10:36:06 +0700 Subject: [PATCH v4] Add prosupport helpers to aggregate functions. Following the spirit of ed1a88d, add a prosupport call for aggregates. Here, we introduce a new support function node type to allow support functions to be called for aggregate functions. This code introduces only min/max trivial optimisation in the core. However, a user can design alternative support functions on their own. --- src/backend/optimizer/plan/planagg.c | 76 + src/backend/optimizer/prep/prepagg.c | 26 + src/include/catalog/pg_proc.dat | 92 +++ src/include/nodes/supportnodes.h | 7 ++ src/test/regress/expected/aggregates.out | 135 ++- src/test/regress/sql/aggregates.sql | 62 +++ src/tools/pgindent/typedefs.list | 1 + 7 files changed, 352 insertions(+), 47 deletions(-) diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c index dbcd3b9a0a..845b17f01b 100644 --- a/src/backend/optimizer/plan/planagg.c +++ b/src/backend/optimizer/plan/planagg.c @@ -33,6 +33,7 @@ #include "catalog/pg_type.h" #include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" +#include "nodes/supportnodes.h" #include "optimizer/cost.h" #include "optimizer/optimizer.h" #include "optimizer/pathnode.h" @@ -43,6 +44,7 @@ #include "parser/parse_clause.h" #include "parser/parsetree.h" #include "rewrite/rewriteManip.h" +#include "utils/fmgrprotos.h" #include "utils/lsyscache.h" #include "utils/syscache.h" @@ -512,3 +514,77 @@ fetch_agg_sort_op(Oid aggfnoid) return aggsortop; } + +Datum +minmax_support(PG_FUNCTION_ARGS) +{ + Node *rawreq = (Node *) PG_GETARG_POINTER(0); + Oid aggsortop; + + if (IsA(rawreq, SupportRequestAggregate)) + { + SupportRequestAggregate *req = (SupportRequestAggregate *) rawreq; + Aggref *aggref = req->aggref; + + if (list_length(aggref->args) != 1 || aggref->aggfilter != NULL) + PG_RETURN_POINTER(NULL); + + aggsortop = fetch_agg_sort_op(aggref->aggfnoid); + if (!OidIsValid(aggsortop)) + PG_RETURN_POINTER(NULL); /* not a MIN/MAX aggregate */ + + if (aggref->aggorder != NIL) + { + SortGroupClause *orderClause; + TargetEntry *curTarget; + + curTarget = (TargetEntry *) linitial(aggref->args); + + /* + * If the order clause is the same column as the one we're + * aggregating, we can still use the index: It is undefined which + * value is MIN() or MAX(), as well as which value is first or + * last when sorted. So, we can still use the index IFF the + * aggregated expression equals the expression used in the + * ordering operation. + */ + + /* + * We only accept a single argument to min/max aggregates, + * orderings that have more clauses won't provide correct results. 
+ */ + Assert(list_length(aggref->aggorder) == 1); + + orderClause = castNode(SortGroupClause, linitial(aggref->aggorder)); + + if (orderClause->tleSortGroupRef != curTarget->ressortgroupref) +elog(ERROR, "Aggregate order clause isn't found in target list"); + + if (orderClause->sortop != aggsortop) + { +List *btclasses; +ListCell *lc; + +btclasses = get_op_btree_interpretation(orderClause->sortop); + +foreach(lc, btclasses) +{ + OpBtreeInterpretation *interpretation; + + interpretation = (OpBtreeInterpretation *) lfirst(lc); + if (op_in_opfamily(aggsortop, interpretation->opfamily_id)) + { + aggref->aggorder = NIL; + break; + } +} + +list_free_deep(btclasses); + } + else +aggref->aggorder = NIL; + } + } + + PG_RETURN_POINTER(NULL); +} diff --git a/src/backend/optimizer/prep/prepagg.c b/src/backend/optimizer/prep/prepagg.c index e935c9d094..74347b3d7e 100644 --- a/src/backend/optimizer/prep/prepagg.c +++ b/src/backend/optimizer/prep/prepagg.c @@ -39,6 +39,7 @@ #include "catalog/pg_type.h" #include "nodes/nodeFuncs.h" #include "nodes/pathnodes.h" +#include "nodes/supportnodes.h" #include "optimizer/cost.h" #include "optimizer/optimizer.h" #include "optimizer/plancat.h" @@ -64,6 +65,25 @@ static int find_compatible_trans(PlannerInfo *root, Aggref *newagg, List *transnos); static Datum GetAggInitVal(Datum textInitVal, Oid transtype); +static void +optimize_aggregates(Aggref *aggref) +{ + SupportRequestAggregate
Re: PoC: history of recent vacuum/checkpoint runs (using new hooks)
On 12/27/24 05:00, Michael Paquier wrote: > On Thu, Dec 26, 2024 at 06:58:11PM +0100, Tomas Vondra wrote: >> If 128MB is insufficient, why would 256MB be OK? A factor of 2x does not >> make a fundamental difference ... >> >> Anyway, the 128MB value is rather arbitrary. I don't mind increasing the >> limit, or possibly removing it entirely (and accepting anything the >> system can handle). > > +DefineCustomIntVariable("stats_history.size", > +"Sets the amount of memory available for past > events.", > > How about some time-based retention? Data size can be hard to think > about for the end user, while there would be a set of users that would > want to retain data for the past week, month, etc? If both size and > time upper-bound are define, then entries that match one condition or > the other are removed. > Right. In my response [1] I suggested replacing the simple memory limit with a time-based limit, but I haven't done anything about that yet. And the more I think about it the more I'm convinced we don't need to keep the data about past runs in memory, a file should be enough (except maybe for a small buffer). That would mean we don't need to worry about dynamic shared memory, etc. I initially rejected this because it seemed like a regression to how pgstat worked initially (sharing data through files), but I don't think that's actually true - this data is different (almost append-only), etc. The one case when we may need co read the data is in response to DROP of a table, when we need to discard entries for that object. Or we could handle that by recording OIDs of dropped objects ... not sure how complex this would need to be. We'd still want to enforce some limits, of course - a time-based limit of the minimum amount of time to cover, and maximum amount of disk space to use (more as a safety). FWIW there's one "issue" with enforcing the time-based limit - we can only do that for the "end time", because that's when we see the entry. If you configure the limit to keep "1 day" history, and then a vacuum completes after running for 2 days, we want to keep it, so that anyone can actually see it. [1] https://www.postgresql.org/message-id/8df7cee1-31aa-4db3-bbb7-83157ca139da%40vondra.me > +checkpoint_log_hook( > + CheckpointStats.ckpt_start_t,/* start_time */ > + CheckpointStats.ckpt_end_t,/* end_time */ > + (flags & CHECKPOINT_IS_SHUTDOWN),/* is_shutdown */ > + (flags & CHECKPOINT_END_OF_RECOVERY),/* is_end_of_recovery */ > + (flags & CHECKPOINT_IMMEDIATE),/* is_immediate */ > + (flags & CHECKPOINT_FORCE),/* is_force */ > + (flags & CHECKPOINT_WAIT),/* is_wait */ > + (flags & CHECKPOINT_CAUSE_XLOG),/* is_wal */ > + (flags & CHECKPOINT_CAUSE_TIME),/* is_time */ > + (flags & CHECKPOINT_FLUSH_ALL),/* is_flush_all */ > + CheckpointStats.ckpt_bufs_written,/* buffers_written */ > + CheckpointStats.ckpt_slru_written,/* slru_written */ > + CheckpointStats.ckpt_segs_added,/* segs_added */ > + CheckpointStats.ckpt_segs_removed,/* segs_removed */ > + CheckpointStats.ckpt_segs_recycled,/* segs_recycled */ > > That's a lot of arguments. CheckpointStatsData and the various > CHECKPOINT_* flags are exposed, why not just send these values to the > hook? > > For v1-0001 as well, I'd suggest some grouping with existing > structures, or expose these structures so as they can be reused for > out-of-core code via the proposed hook. More arguments lead to more > mistakes that could be easily avoided. Yes, I admit the number of parameters seemed a bit annoying to me too, and maybe we could reduce that somewhat. 
Certainly for checkpoints, where we already have a reasonable CheckpointStats struct and flags, wrapping most of the fields. With vacuum it's a bit more complicated, for two reasons: (a) the counters are simply in LVRelState, mixed with all kinds of other info, and it seems "not great" to pass it to a "log" hook, and (b) there are some calculated values, so I guess the hook would need to do that calculation on it's own, and it'd be easy to diverge over time. For (a) we could introduce some "stats" struct to keep these counters for vacuum (the nearby parallel vacuum patch actually does something like that, I think). Not sure if (b) is actually a problem in practice, but I guess we could add those fields to the new "stats" struct too. regards -- Tomas Vondra
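To make the "fewer arguments" point concrete, one possible shape would be to pass a single struct plus the raw flags; the names below are invented for illustration and the fields are simply lifted from the hook call quoted above.

typedef struct CheckpointLogData
{
	TimestampTz start_time;
	TimestampTz end_time;
	int			flags;			/* CHECKPOINT_* bits, tested by the consumer */
	int			buffers_written;
	int			slru_written;
	int			segs_added;
	int			segs_removed;
	int			segs_recycled;
} CheckpointLogData;

typedef void (*checkpoint_log_hook_type) (const CheckpointLogData *data);
extern PGDLLIMPORT checkpoint_log_hook_type checkpoint_log_hook;

/*
 * The call site in the checkpointer then shrinks to filling one struct
 * from CheckpointStats and the flags, and invoking the hook once.
 */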