Re: using index to speedup add not null constraints to a table
On Wed, Feb 5, 2025 at 4:24 PM jian he wrote:
> rebased new patch attached. I also did some cosmetic changes. comments refined.
> make sure using index_scan mechanism to fast check column not-null can only be used via btree index.
> isolation tests are simplified.

I realized that my previous patch was quite wrong: we should not use an index scan to verify individual not-null constraints in Phase 2. So a new patch is attached. The main idea is that Phase 2 collects all to-be-added not-null constraints into AlteredTableInfo->constraints, and then Phase 3 checks whether we can use an index to fast-check each not-null constraint or not.

To minimize concurrency issues, using an index scan to quickly validate NOT NULL constraints requires strict conditions in Phase 3:
* No table rewrite
* No table scan
* Each NOT NULL constraint must have a suitable supporting index for fast checking
* The table must already hold an AccessExclusiveLock
* The DDL must not involve creating any new indexes

I don't have any good ideas for the regression tests. I use
ereport(NOTICE, errmsg("all not-null constraints on relation \"%s\" are validated by index scan", RelationGetRelationName(oldrel)));
to do the tests. For example:

create temp table t2 (x int, y int, z int, primary key (x, y));
create unique index t2_z_uidx on t2(z);
alter table t2 alter column z set not null;
NOTICE:  all not-null constraints on relation "t2" are validated by index scan
ALTER TABLE

From 0b4ad35e8fbc46f93ba6df2ef7013ff2f3055216 Mon Sep 17 00:00:00 2001
From: jian he
Date: Fri, 18 Apr 2025 16:05:42 +0800
Subject: [PATCH v4 1/1] using indexscan to speedup add not null constraints

This patch tries to use index_beginscan() / index_getnext() / index_endscan(), as mentioned in [1], to speed up adding not-null constraints to an existing table. The main logic happens in Phase 3 (ATRewriteTable):
1. Collect all not-null constraints.
2. Check whether each not-null constraint has a corresponding index to verify it; if not, fall back to normal processing.
3. If a table scan or table rewrite occurs, if the table is not held with an AccessExclusiveLock, or if the NOT NULL constraint is on a virtual generated column, then we cannot use an index scan to fast-validate not-null constraints.

concurrency concern:
ALTER TABLE SET NOT NULL will take an ACCESS EXCLUSIVE lock, so fewer kinds of race conditions can occur. To verify this, I wrote some isolation tests. See [2].

performance:
Adding a NOT NULL constraint will only be slower than usual if your index is bloated and a significant portion of that bloat is due to rows with NULL values.

demo:
drop table if exists t;
create unlogged table t(a int, b int, c int) with (autovacuum_enabled = off);
insert into t select g, g+1 from generate_series(1,1_000_000) g;
create index t_idx_a on t(a);

COMMAND: alter table t add constraint t1 not null a;
patch  Time: 1.178 ms
master Time: 15.137 ms

100% table bloat (via "delete from t"):
COMMAND: alter table t add constraint t1 not null a;
patch  Time: 0.962 ms
master Time: 16.404 ms

Case where 20 percent of the values are NULL and have been deleted from the heap but are still present in the index:
drop table if exists t;
create unlogged table t(a int, b int) with (autovacuum_enabled = off);
insert into t select case when g % 5 = 0 then null else g end, g+1 from generate_series(1,1_000_000) g;
create index t_idx_a on t(a);
delete from t where a is null;
alter table t add constraint t1 not null a;
patch  Time: 17.443 ms
master Time: 23.045 ms

references:
[1] https://postgr.es/m/CA%2BTgmoa5NKz8iGW_9v7wz%3D-%2BzQFu%3DE4SZoaTaU1znLaEXRYp-Q%40mail.gmail.com
[2] https://postgr.es/m/900056D1-32DF-4927-8251-3E0C0DC407FD%40anarazel.de
discussion: https://postgr.es/m/CACJufxFiW=4k1is=F1J=r-cx1rubyxqpurwb331u47rsngz...@mail.gmail.com
commitfest entry: https://commitfest.postgresql.org/patch/5444/
---
 src/backend/commands/tablecmds.c              | 129 ++-
 src/backend/executor/execIndexing.c           | 200 ++
 src/include/executor/executor.h               |   2 +
 .../expected/indexscan-check-notnull.out      | 102 +
 src/test/isolation/isolation_schedule         |   1 +
 .../specs/indexscan-check-notnull.spec        |  45
 src/test/regress/expected/aggregates.out      |   1 +
 src/test/regress/expected/alter_table.out     |  42
 src/test/regress/expected/indexing.out        |   2 +
 src/test/regress/expected/publication.out     |   1 +
 src/test/regress/expected/tablespace.out      |   1 +
 src/test/regress/sql/alter_table.sql          |  46
 12 files changed, 566 insertions(+), 6 deletions(-)
 create mode 100644 src/test/isolation/expected/indexscan-check-notnull.out
 create mode 100644 src/test/isolation/specs/indexscan-check-
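For readers who haven't used the index AM calls named in the patch above, here is a minimal sketch of how a NULL probe through a btree index can look. It is illustrative only, not the patch's code: the helper name is invented, the index_beginscan() signature is simplified (recent branches take extra arguments), and locking and error handling are omitted.

/* Roughly needs access/genam.h, access/skey.h, access/tableam.h, executor/tuptable.h. */

/*
 * Hypothetical helper (not from the patch): return true if the first
 * column of "indexRel" has any NULL entry that points to a visible heap
 * tuple.  Assumes the caller already holds AccessExclusiveLock on the table.
 */
static bool
column_has_visible_null(Relation heapRel, Relation indexRel, Snapshot snapshot)
{
	TupleTableSlot *slot = table_slot_create(heapRel, NULL);
	IndexScanDesc scan;
	ScanKeyData skey;
	bool		found;

	/* Ask the btree for entries whose first key column IS NULL. */
	ScanKeyEntryInitialize(&skey,
						   SK_ISNULL | SK_SEARCHNULL,
						   1,				/* first index column */
						   InvalidStrategy,	/* no strategy */
						   InvalidOid,		/* no strategy subtype */
						   InvalidOid,		/* no collation */
						   InvalidOid,		/* no reg proc for this */
						   (Datum) 0);		/* constant */

	/* Signature simplified; newer branches pass extra arguments here. */
	scan = index_beginscan(heapRel, indexRel, snapshot, 1, 0);
	index_rescan(scan, &skey, 1, NULL, 0);

	/* Any heap tuple returned here is a surviving NULL value. */
	found = index_getnext_slot(scan, ForwardScanDirection, slot);

	index_endscan(scan);
	ExecDropSingleTupleTableSlot(slot);
	return found;
}

If no visible NULL is found this way, the NOT NULL constraint can be accepted without scanning the heap, which is where the speedup in the numbers above comes from.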
[BUG] temporary file usage report with extended protocol and unnamed portals
Hi,

It seems there's a bug in the logging of temporary file usage when the extended protocol is used with unnamed portals. For example, with the attached Java / pgJDBC programs, we get the following logs:

[...] LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp525566.0", size 2416640
[...] STATEMENT:  SELECT 1

but it should be:

[...] LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp538230.0", size 2416640
[...] STATEMENT:  SELECT * FROM foo ORDER BY a OFFSET $1 LIMIT 2

It has been tested with HEAD and REL_17_STABLE.

My guess is that there's a race somewhere, probably when the global variable "debug_query_string" is set. The unnamed portal is closed when the BIND of the next query arrives (here: SELECT 1), and I suspect that the variable is set before the temporary file is deleted (and logged).

pgJDBC uses unnamed portals, but I don't think this is specific to JDBC. I see the same problem with the attached Python / psycopg3 program. I think it would be better if the drivers used named portals all the time (and an explicit close message), but this seems to be a postgres bug. What do you think?

Best regards,
Frédéric

PS: the dataset is created like this on the server:

CREATE UNLOGGED TABLE foo(a int);
INSERT INTO foo SELECT * FROM generate_series(1, 20);
ALTER SYSTEM SET log_temp_files = 0;
ALTER SYSTEM SET log_min_duration_statement = -1;
SELECT pg_reload_conf();

Attached Python / psycopg3 program:

import psycopg

with psycopg.connect("dbname=postgres user=postgres port=5434 host=localhost") as conn:
    conn.execute("SELECT * FROM foo ORDER BY a OFFSET %s LIMIT %s", [10, 2])
    conn.execute("SELECT 1")
    conn.close()

Attached Java / pgJDBC program:

// CREATE UNLOGGED TABLE foo(a int);
// INSERT INTO foo SELECT * FROM generate_series(1, 20);
public class Babar
{
    public static void main(String[] args) throws ClassNotFoundException, java.sql.SQLException
    {
        Class.forName("org.postgresql.Driver");
        java.sql.Connection conn = java.sql.DriverManager.getConnection(
            "jdbc:postgresql://127.0.0.1:5434/postgres?user=postgres");
        conn.setAutoCommit(false);

        java.sql.ResultSet rs;
        java.sql.PreparedStatement stmt = conn.prepareStatement(
            "SELECT * FROM foo ORDER BY a OFFSET ? LIMIT 2");
        java.sql.PreparedStatement stmt2 = conn.prepareStatement(
            "SELECT 1");

        stmt.setInt(1, 10);
        rs = stmt.executeQuery();
        rs.close();
        stmt.close();

        rs = stmt2.executeQuery();
        rs.close();
        stmt2.close();

        conn.commit();
        conn.close();
    }
}
Re: Make prep_status() message translatable
On 2025/04/17 13:12, Fujii Masao wrote:
> I've updated the patch to reflect this comment.
> Barring any objections, I'm thinking to commit this patch.

Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: Align memory context level numbering in pg_log_backend_memory_contexts()
On 2025/04/18 18:23, David Rowley wrote:
> On Fri, 18 Apr 2025 at 20:54, Fujii Masao wrote:
>> Shouldn't the example output of pg_log_backend_memory_contexts() in
>> the documentation also be updated to use 1-based numbering for consistency?
>> Patch attached.
>
> Yeah. I failed to notice we had an example of the output. Want to take care of it?

Yeah, I will if you're okay with that!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
RE: Parallel heap vacuum
Dear Sawada-san,

Thanks for updating the patch. I have been reviewing it, and below are my comments for now.

01.
Sorry if I missed something, but is there a reason that parallel_vacuum_compute_workers() is mandatory? My fresh eye felt that the caller could check for a NULL API and treat it as zero workers.

02. table_parallel_vacuum_estimate
We can assert that state is not NULL.

03. table_parallel_vacuum_initialize
Same as 02. Also, I think we should ensure here that table_parallel_vacuum_estimate() has already been called, and the current assertion might not be enough because others can set a random value here. Other functions like table_rescan_tidrange() check an internal flag of the data structure, which looks good to me. How do you feel about checking that pcxt->estimator has a non-zero value?

04. table_parallel_vacuum_initialize_worker
Same as 02. Also, can we ensure that table_parallel_vacuum_initialize() has been done?

05. table_parallel_vacuum_collect_dead_items
Same as 02. Also, can we ensure that table_parallel_vacuum_initialize_worker() has been done?

06. table_parallel_vacuum_initialize_worker
The comments atop the function need to be updated.

07.
While playing with the patch, I found that more parallel vacuum workers can be launched than there are pages:

```
postgres=# CREATE TABLE var (id int);
CREATE TABLE
postgres=# INSERT INTO var VALUES (generate_series(1, 1));
INSERT 0 1
postgres=# VACUUM (parallel 80, verbose) var ;
INFO:  vacuuming "postgres.public.var"
INFO:  launched 80 parallel vacuum workers for collecting dead tuples (planned: 80)
INFO:  finished vacuuming "postgres.public.var": index scans: 0
pages: 0 removed, 45 remain, 45 scanned (100.00% of total), 0 eagerly scanned
...
```

I assume the minimum chunk size is one page, so this situation can hurt performance. How do you feel about capping the value with rel->rd_rel->relpages in heap_parallel_vacuum_compute_workers(), as in the sketch below? That value is not always up-to-date but seems like a good candidate.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
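A minimal sketch of the capping idea in item 07, assuming a compute-workers function shaped roughly like the one named there; the body is invented for illustration and is not taken from the patch:

/*
 * Illustrative only: clamp the computed worker count by the table's
 * last-known page count so we never launch more workers than pages.
 * rd_rel->relpages can be stale, so treat it as a soft cap.
 */
static int
heap_parallel_vacuum_compute_workers(Relation rel, int requested)
{
	int			nworkers = requested;	/* whatever the existing logic chose */
	BlockNumber relpages = (BlockNumber) rel->rd_rel->relpages;

	if (relpages > 0 && (BlockNumber) nworkers > relpages)
		nworkers = (int) relpages;

	return nworkers;
}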
Re: Changing shared_buffers without restart
On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote:
> >
> > In an offlist chat Thomas Munro mentioned that just ftruncate() would
> > be enough to resize the shared memory without touching address maps
> > using mmap and munmap().
> >
> > ftruncate man page seems to concur with him
> >
> >    If the effect of ftruncate() is to decrease the size of a memory
> >    mapped file or a shared memory object and whole pages beyond the
> >    new end were previously mapped, then the whole pages beyond the
> >    new end shall be discarded.
> >
> >    References to discarded pages shall result in the generation of a
> >    SIGBUS signal.
> >
> >    If the effect of ftruncate() is to increase the size of a memory
> >    object, it is unspecified whether the contents of any mapped pages
> >    between the old end-of-file and the new are flushed to the
> >    underlying object.
> >
> > ftruncate() when shrinking memory will release the extra pages and
> > also would cause segmentation fault when memory outside the size of
> > file is accessed even if the actual address map is larger than the
> > mapped file. The expanded memory is allocated as it is written to, and
> > those pages also become visible in the underlying object.
>
> Thanks for sharing. I need to do more thorough tests, but after a quick
> look I'm not sure about that. ftruncate will take care of the memory,
> but AFAICT the memory mapping will stay the same -- is that what you mean?
> In that case if the segment got increased, the memory still can't be
> used because it's beyond the mapping end (at least in my test that's
> what happened). If the segment got shrunk, the memory couldn't be
> reclaimed, because, well, there is already a mapping. Or do I miss
> something?

I was imagining that you might map some maximum possible size at the beginning to reserve the address space permanently, and then adjust the virtual memory object's size with ftruncate as required to provide backing. Doesn't that achieve the goal with fewer steps, using only portable* POSIX stuff, and keeping all pointers stable? I understand that pointer stability may not be required (I can see roughly how that argument is constructed), but isn't it still better to avoid having to prove that and deal with various other problems completely? Is there a downside/cost to having a large mapping that is only partially backed? I suppose choosing that number might offend you, but at least there is an obvious upper bound: physical memory size.

*You might also want to use fallocate after ftruncate on Linux to avoid SIGBUS on allocation failure on first-touch page fault, which raises portability questions since it's unspecified whether you can do that with shm fds and it fails on some systems, but let's call that an independent topic as it's not affected by this choice.
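To make the reserve-then-truncate idea concrete, here is a minimal sketch under the assumption that a single POSIX shared memory object backs the resizable region. The names and the demo object path are invented and error handling is omitted; this is only one reading of what is being proposed, not code from any of the patches.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve the largest address range we may ever need exactly once, and
 * let ftruncate() decide how much of it is actually backed by memory.
 * Pointers into the region stay stable across resizes; touching pages
 * beyond the object's current size raises SIGBUS, per the man page text
 * quoted above.
 */
void *
reserve_and_back(size_t max_size, size_t initial_size, int *fd_out)
{
	int			fd = shm_open("/pg_resize_demo", O_CREAT | O_RDWR, 0600);
	void	   *base;

	ftruncate(fd, initial_size);	/* provide the initial backing */

	/* The mapping covers max_size, but only initial_size is usable. */
	base = mmap(NULL, max_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	*fd_out = fd;
	return base;
}

/* Later: grow or shrink the usable part without remapping anything. */
void
resize_backing(int fd, size_t new_size)
{
	ftruncate(fd, new_size);	/* pages beyond new_size are discarded */
}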
Re: Align memory context level numbering in pg_log_backend_memory_contexts()
On Fri, 18 Apr 2025 at 00:25, Rahila Syed wrote:
> Regarding v2 patch,
> -   int         level = 0;
>
> Retaining the level variable will enhance the code readability, IMO.

When I read that, I suspected it might have been leftover from a refactor during the development that was forgotten about. There'd be thousands of places in our code base that you could make the readability argument for, including the max_level and max_children parameters at the same call-site. But those didn't get the same treatment.

I've now pushed the latest patch. Thanks for the reviews.

David
Re: Align memory context level numbering in pg_log_backend_memory_contexts()
On 2025/04/18 6:11, David Rowley wrote:
> On Fri, 18 Apr 2025 at 00:25, Rahila Syed wrote:
>> Regarding v2 patch,
>> -   int         level = 0;
>>
>> Retaining the level variable will enhance the code readability, IMO.
>
> When I read that, I suspected it might have been leftover from a refactor
> during the development that was forgotten about. There'd be thousands of
> places in our code base that you could make the readability argument for,
> including the max_level and max_children parameters at the same call-site.
> But those didn't get the same treatment.
>
> I've now pushed the latest patch. Thanks for the reviews.

Shouldn't the example output of pg_log_backend_memory_contexts() in the documentation also be updated to use 1-based numbering for consistency? Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From 7b3aa9de467005e540b2246ed89c9e43fe08f6bd Mon Sep 17 00:00:00 2001
From: Fujii Masao
Date: Fri, 18 Apr 2025 17:45:29 +0900
Subject: [PATCH v1] doc: Fix memory context level in
 pg_log_backend_memory_contexts() example.

Commit d9e03864b6b changed the memory context level numbers shown by
pg_log_backend_memory_contexts() to be 1-based. However, the example in
the documentation was not updated and still used 0-based numbering.
This commit updates the example to match the current 1-based output.
---
 doc/src/sgml/func.sgml | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1c5cfee25d1..574a544d9fa 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28922,16 +28922,16 @@
 One message for each memory context will be logged. For example:
 LOG:  logging memory contexts of PID 10377
 STATEMENT:  SELECT pg_log_backend_memory_contexts(pg_backend_pid());
-LOG:  level: 0; TopMemoryContext: 80800 total in 6 blocks; 14432 free (5 chunks); 66368 used
-LOG:  level: 1; pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used
-LOG:  level: 1; TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used
-LOG:  level: 1; RowDescriptionContext: 8192 total in 1 blocks; 6880 free (0 chunks); 1312 used
-LOG:  level: 1; MessageContext: 16384 total in 2 blocks; 5152 free (0 chunks); 11232 used
-LOG:  level: 1; Operator class cache: 8192 total in 1 blocks; 512 free (0 chunks); 7680 used
-LOG:  level: 1; smgr relation table: 16384 total in 2 blocks; 4544 free (3 chunks); 11840 used
-LOG:  level: 1; TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks); 264 used
+LOG:  level: 1; TopMemoryContext: 80800 total in 6 blocks; 14432 free (5 chunks); 66368 used
+LOG:  level: 2; pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used
+LOG:  level: 2; TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used
+LOG:  level: 2; RowDescriptionContext: 8192 total in 1 blocks; 6880 free (0 chunks); 1312 used
+LOG:  level: 2; MessageContext: 16384 total in 2 blocks; 5152 free (0 chunks); 11232 used
+LOG:  level: 2; Operator class cache: 8192 total in 1 blocks; 512 free (0 chunks); 7680 used
+LOG:  level: 2; smgr relation table: 16384 total in 2 blocks; 4544 free (3 chunks); 11840 used
+LOG:  level: 2; TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks); 264 used
 ...
-LOG:  level: 1; ErrorContext: 8192 total in 1 blocks; 7928 free (3 chunks); 264 used
+LOG:  level: 2; ErrorContext: 8192 total in 1 blocks; 7928 free (3 chunks); 264 used
 LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560 used

 If there are more than 100 child contexts under the same parent, the first
--
2.49.0
Re: Changing shared_buffers without restart
> On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote:
>
> In an offlist chat Thomas Munro mentioned that just ftruncate() would
> be enough to resize the shared memory without touching address maps
> using mmap and munmap().
>
> ftruncate man page seems to concur with him
>
>    If the effect of ftruncate() is to decrease the size of a memory
>    mapped file or a shared memory object and whole pages beyond the
>    new end were previously mapped, then the whole pages beyond the
>    new end shall be discarded.
>
>    References to discarded pages shall result in the generation of a
>    SIGBUS signal.
>
>    If the effect of ftruncate() is to increase the size of a memory
>    object, it is unspecified whether the contents of any mapped pages
>    between the old end-of-file and the new are flushed to the
>    underlying object.
>
> ftruncate() when shrinking memory will release the extra pages and
> also would cause segmentation fault when memory outside the size of
> file is accessed even if the actual address map is larger than the
> mapped file. The expanded memory is allocated as it is written to, and
> those pages also become visible in the underlying object.

Thanks for sharing. I need to do more thorough tests, but after a quick look I'm not sure about that. ftruncate will take care of the memory, but AFAICT the memory mapping will stay the same -- is that what you mean? In that case, if the segment got increased, the memory still can't be used because it's beyond the mapping end (at least in my test that's what happened). If the segment got shrunk, the memory couldn't be reclaimed, because, well, there is already a mapping. Or do I miss something?

> > > I might have not noticed it, but are we putting two mappings, one
> > > reserved and one allocated, in the same address space, so that when the
> > > allocated mapping shrinks or expands, the reserved mapping continues
> > > to prohibit any other mapping from appearing there? I looked at some
> > > of the previous emails, but didn't find anything that describes how
> > > the reserved mapped space is managed.
> >
> > I thought so, but this turns out to be incorrect. I have just done a small
> > experiment -- looks like when reserving some space, mapping and
> > unmapping a small segment from it leaves a non-mapped gap. That would
> > mean for shrinking the new available space has to be reserved again.
>
> Right. That's what I thought. But I didn't see the corresponding code.
> So we have to keep track of two mappings for every segment - one for
> allocation and one for reserving space - and resize those two while
> shrinking and expanding buffers. Am I correct?

Not necessarily, depending on what we want. Again, I'll do a bit more testing, but after a quick check it seems that it's possible to "plug" the gap with a new reservation mapping, then reallocate it to another mapping or unmap both reservations (the main and the "gap" one) at once. That would mean that for the current functionality we don't need to track the reservation in any way beyond just the start and the end of the "main" reserved space.

The only consequence I can imagine is possible fragmentation of the reserved space in case of frequent increases/decreases of a segment, even with a decreasing size. But since it's only reserved space, which will not really be used, it's probably not going to be a problem.
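A minimal sketch of the gap-plugging experiment described above, assuming the gap is known to lie entirely inside our own reservation; the variable names are placeholders and this is not code from the patches:

/*
 * Re-reserve an address-space gap left by munmap() inside a larger
 * reservation, so no unrelated mapping can appear there.  MAP_FIXED is
 * only safe because the range is known to belong to our reservation.
 */
void *gap = mmap(gap_start, gap_len, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_NORESERVE,
                 -1, 0);

if (gap == MAP_FAILED)
    elog(ERROR, "could not re-reserve address space gap: %m");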
Re: Changing shared_buffers without restart
> On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote:
>
> 1. Performance of Postgres CLOCK page eviction algorithm depends on number
> of shared buffers. My first naive attempt just to mark unused buffers as
> invalid caused significant degradation of performance

Thanks for sharing! Right, but it concerns the case when the number of shared buffers is high, independently of whether it was changed online or with a restart, correct? In that case it's out of scope for this patch.

> 2. There are several data structures in Postgres whose size depends on
> number of buffers.
> In my patch I used in some cases dynamic shared buffer size, but if this
> structure has to be allocated in shared memory then still maximal size has
> to be used. We have the buffers themselves (8 kB per buffer), then the main
> BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's
> CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry).
> 128 bytes per 8 kB seems too large an overhead (~1%), but it may be
> quite noticeable with size differences larger than 2 orders of magnitude:
> E.g. to support scaling from 0.5GB to 128GB, with 128 bytes/buffer we'd
> have ~2GiB of static overhead on only 0.5GiB of actual buffers.

I'm not sure what you mean by using a maximal size; can you elaborate? In the current patch those structures are allocated as before, except each goes into a separate segment -- without any extra memory overhead as far as I see.

> 3. `madvise` is not portable.

The current implementation doesn't rely on madvise so far (it might for shared memory shrinking), but yeah, there are plenty of other not very portable things (MAP_FIXED, memfd_create). All of that is mentioned in the corresponding patches as a limitation.
Re: Align memory context level numbering in pg_log_backend_memory_contexts()
On Fri, 18 Apr 2025 at 20:54, Fujii Masao wrote: > Shouldn't the example output of pg_log_backend_memory_contexts() in > the documentation also be updated to use 1-based numbering for consistency? > Patch attached. Yeah. I failed to notice we had an example of the output. Want to take care of it? David
Re: n_ins_since_vacuum stats for aborted transactions
On Sat, 12 Apr 2025 at 07:33, Sami Imseih wrote:
> What do you think of the attached?

I looked at the v3 patch and I'm having trouble getting excited about it. I'd say this part is misleading:

@@ -3956,7 +3961,8 @@ description | Waiting for a newly initialized WAL file to reach durable storage
       n_dead_tup bigint

-      Estimated number of dead rows
+      Estimated number of dead rows (updated by committed transactions, or by
+      VACUUM and VACUUM FULL)

An aborted insert will contribute to that counter, but that's not mentioned.

Would it be ok just to adjust n_mod_since_analyze's "Estimated number of rows modified since this table was last analyzed" and inject "by committed transactions" after "modified", then call it a day?

David
Re: Fixup some appendPQExpBuffer() calls
On Thu, 17 Apr 2025 at 19:50, Daniel Gustafsson wrote:
> > On 17 Apr 2025, at 01:44, David Rowley wrote:
> > 1) Commit the attached to master
> > 2) Do nothing.
> >
> > I'd like to do #1.
>
> I vote for #1 as well.

Thanks for the judgment sense check. I suspect anyone who preferred #2 would have come forward by now, so I've pushed the changes.

David
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
On Wed, 16 Apr 2025 at 10:09, Ilia Evdokimov wrote: > I've prepared the updated patches as discussed, including the addition > of estimated lookups in the EXPLAIN output for Memoize. Please let me > know if you have any feedback or further suggestions. While this is fresh, as I might forget before the July CF... I know I did suggest that the hit_ratio should be a percent and it should be multiplied by 100.0. This seems fine for the text format as you can have the % unit there. However, looking at ExplainPropertyFloat(), we don't print the unit for non-text formats. I wonder if printing a number between 0 and 100 there will be confusing. I don't believe we have anything else in EXPLAIN that shows a percentage. I don't quite know what to do about this. One thought I had was to only * 100 for text format, but that might be more confusing. Aside from that, I'd prefer the new fields in struct Memoize to be prefixed with "est_" the same as the existing "est_entries" field. I'm unsure why MemoizePath.calls becomes Memoize.lookups. Seems unnecessary and just means more brain space being used to maintain a mental model. David
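One way the text-format-only multiplication could look, with hypothetical field and variable names (est_hit_ratio stored as a fraction between 0 and 1); this is just a sketch of the trade-off being weighed above, not code from the patches:

/* Hypothetical: spell out the unit only where EXPLAIN's text format allows it. */
if (es->format == EXPLAIN_FORMAT_TEXT)
	appendStringInfo(es->str, "Estimated Hit Ratio: %.2f%%\n",
					 est_hit_ratio * 100.0);
else
	ExplainPropertyFloat("Estimated Hit Ratio", NULL, est_hit_ratio, 4, es);

The remaining question from the mail above is whether having the text and non-text outputs use different scales (percentage vs. fraction) is itself more confusing than always emitting the fraction.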
Re: Changing shared_buffers without restart
On Fri, Apr 18, 2025 at 9:17 PM Thomas Munro wrote:
> On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > Thanks for sharing. I need to do more thorough tests, but after a quick
> > look I'm not sure about that. ftruncate will take care of the memory,
> > but AFAICT the memory mapping will stay the same, is that what you mean?
> > In that case if the segment got increased, the memory still can't be
> > used because it's beyond the mapping end (at least in my test that's
> > what happened). If the segment got shrunk, the memory couldn't be
> > reclaimed, because, well, there is already a mapping. Or do I miss
> > something?
>
> I was imagining that you might map some maximum possible size at the
> beginning to reserve the address space permanently, and then adjust
> the virtual memory object's size with ftruncate as required to provide
> backing. Doesn't that achieve the goal with fewer steps, using only
> portable* POSIX stuff, and keeping all pointers stable? I understand
> that pointer stability may not be required (I can see roughly how that
> argument is constructed), but isn't it still better to avoid having to
> prove that and deal with various other problems completely? Is there
> a downside/cost to having a large mapping that is only partially
> backed? I suppose choosing that number might offend you but at least
> there is an obvious upper bound: physical memory size.

TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side effect on Linux, as if you had called ftruncate on it (fully allocated huge pages I expected up to the object's size, just not magical size changes beyond that when I merely asked to map it). That doesn't happen for regular page size, or for any page size on my local OS's shm objects, and doesn't seem to fit mmap's job description given an fd*, but maybe I'm just confused.

Anyway, a workaround seems to be to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ | PROT_WRITE) new regions after extending with ftruncate(), at least in simple tests...

(*Hmm, wild uninformed speculation: perhaps the size-setting behaviour needed when hugetlbfs is used secretly to implement MAP_ANONYMOUS is being exposed also when a hugetlbfs fd is given explicitly to mmap, generating this bizarro side effect?)
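A minimal sketch of the workaround Thomas describes, assuming Linux and a hugetlb memfd; this is a test-program style illustration only, with error handling omitted and the function names invented:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve the maximum range with PROT_NONE/MAP_NORESERVE so that mapping
 * it does not implicitly extend the hugetlb memfd, then make regions
 * accessible only after ftruncate() has provided backing for them.
 */
void *
reserve_hugetlb_range(int memfd, size_t max_size)
{
	return mmap(NULL, max_size, PROT_NONE,
				MAP_SHARED | MAP_NORESERVE, memfd, 0);
}

void
grow_usable_region(int memfd, void *base, size_t old_size, size_t new_size)
{
	ftruncate(memfd, new_size);	/* provide backing for the new pages */
	mprotect((char *) base + old_size, new_size - old_size,
			 PROT_READ | PROT_WRITE);
}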
Re: magical eref alias names
On Fri, Jan 3, 2025 at 8:44 AM Robert Haas wrote: > > Actually, I noticed that we are failing to honor that in the places > > where we inject "*SELECT*" and "*SELECT* %d" names, because that > > code puts those names into RTE->alias not only RTE->eref. > > I experimented with the attached patch to not do that anymore, > > which is sort of a subset of what you did but just focused on > > not lying about what's generated versus user-written. We could > > alternatively keep the current generated names by extending > > addRangeTableEntryForSubquery's API so that alias and generated eref > > are passed separately. (I didn't look to see if anyplace else > > is messing up this distinction similarly.) > > Hmm, I definitely like not lying about what is generated vs. what is > user-written. I don't have a strong opinion right now on the best way > of accomplishing that. I rediscovered, or re-encountered, this problem today, which motivated me to have a closer look at your (Tom's) patch. My feeling is that it's the right approach. I agree that we could try to keep the current generated names by extending addRangeTableEntryForSubquery, but I'm tentatively inclined to think that we shouldn't. Perhaps that's partly a stylistic preference on my part: I think unnamed_subquery_N looks nicer than "*SELECT * N", but there's also something to be said for keeping the code simple. I think it would be reasonable to instead extend addRangeTableEntryForSubquery if we find that the naming change breaks something, but if it doesn't then I like what you've done better. There's also an argument from consistency: even without the patch, unnamed_subquery leaks out in some contexts, and I think it's nicer to use unnamed_subquery everywhere than to use that in some places and *SELECT* in others. I then went looking for other places that have similar issues. I pretty quickly discovered that convert_ANY_sublink_to_join is another caller of addRangeTableEntryForSubquery that is fabricating an alias when it could just pass NULL; in that case, the fabricated name is ANY_subquery. Also, it seems like the recursive CTE stuff can just set the alias to NULL and the eref as it's currently doing, instead of setting both alias and eref as the code does currently. So, PFA 0001, a rebased version of Tom's patch; 0002, addressing ANY_subquery; and 0003, addressing *TLOCRN* and *TROCRN*. If we decide to adopt all of these, we may want to do some squashing before commit, but we have a little time to figure that out since I think this is v19 material anyway. Apart from those cases, and at least AFAICS, everything that's using a wholly fabricated name seems to be either (1) a case where we intend for the name to be referenceable (old, new, excluded) or (2) a value that is assigned only to eref and not to alias (e.g. *GROUP*). However, I did come across one other mildly interesting case. expand_single_inheritance_child has this: /* * We just duplicate the parent's table alias name for each child. If the * plan gets printed, ruleutils.c has to sort out unique table aliases to * use, which it can handle. */ childrte->alias = childrte->eref = makeAlias(parentrte->eref->aliasname, child_colnames); What I find curious about this is that we're assigning the parent's eref to both the child's eref and the child's alias. Maybe there's something I don't understand here, or maybe it just doesn't matter, but why wouldn't we assign eref to eref and alias to alias? Or even alias to alias and generate a new eref? 
--
Robert Haas
EDB: http://www.enterprisedb.com

v2-0002-Don-t-generate-fake-ANY_subquery-aliases-either.patch
Description: Binary data

v2-0003-Don-t-generate-fake-TLOCRN-or-TROCRN-aliases-eith.patch
Description: Binary data

v2-0001-Don-t-generate-fake-SELECT-or-SELECT-d-subquery-a.patch
Description: Binary data
Re: Changing shared_buffers without restart
On 18/04/2025 12:26 am, Dmitry Dolgov wrote:
> On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote:
> > 1. Performance of Postgres CLOCK page eviction algorithm depends on number
> > of shared buffers. My first naive attempt just to mark unused buffers as
> > invalid caused significant degradation of performance
>
> Thanks for sharing! Right, but it concerns the case when the number of
> shared buffers is high, independently of whether it was changed online or
> with a restart, correct? In that case it's out of scope for this patch.
>
> > 2. There are several data structures in Postgres whose size depends on
> > number of buffers.
> > In my patch I used in some cases dynamic shared buffer size, but if this
> > structure has to be allocated in shared memory then still maximal size has
> > to be used. We have the buffers themselves (8 kB per buffer), then the main
> > BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's
> > CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry).
> > 128 bytes per 8 kB seems too large an overhead (~1%), but it may be
> > quite noticeable with size differences larger than 2 orders of magnitude:
> > E.g. to support scaling from 0.5GB to 128GB, with 128 bytes/buffer we'd
> > have ~2GiB of static overhead on only 0.5GiB of actual buffers.
>
> I'm not sure what you mean by using a maximal size; can you elaborate? In
> the current patch those structures are allocated as before, except each
> goes into a separate segment -- without any extra memory overhead as far
> as I see.
>
> > 3. `madvise` is not portable.
>
> The current implementation doesn't rely on madvise so far (it might for
> shared memory shrinking), but yeah, there are plenty of other not very
> portable things (MAP_FIXED, memfd_create). All of that is mentioned in the
> corresponding patches as a limitation.

Thank you for the explanation. I am sorry that I had not investigated your patch precisely before writing: it seemed to me that you were placing only the content of shared buffers in a separate segment. Now I see that I was wrong, and this is actually the main difference from the memory ballooning approach I have used. Since you are allocating buffer descriptors and the hash table in the same segment, there is no extra memory overhead. The only drawback is that we lose the content of shared buffers on resize. That may be sad, but it doesn't look like there is a better alternative.

But there are still some dependencies on shared buffers size which are not addressed in this patch. I am not sure how critical they are and whether it is possible to do something here, but at least I want to enumerate them:

1. Checkpointer: the maximal number of checkpointer requests depends on NBuffers. So if we start with small shared buffers and then upscale, it may cause too frequent checkpoints:

Size
CheckpointerShmemSize(void)
...
    size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));

CheckpointerShmemInit(void)
    CheckpointerShmem->max_requests = NBuffers;

2. XLOG: the number of xlog buffers is calculated depending on the number of shared buffers:

XLOGChooseNumBuffers(void)
{
...
    xbuffers = NBuffers / 32;

This should not cause errors, but it may be not so efficient if, once again, we start with tiny shared buffers.

3. AIO: AIO max concurrency is also calculated based on the number of shared buffers:

AioChooseMaxConcurrency(void)
{
...
    max_proportional_pins = NBuffers / max_backends;

For small shared buffers (e.g. 1MB) there will be no concurrency at all.

So none of these issues causes an error, just some inefficient behavior. But if we want to start with very small shared buffers and then increase them on demand, it can be a problem. In all three cases NBuffers is used not just to calculate some threshold value, but also to determine the size of a structure in shared memory.
The straightforward solution is to place them in the same segment as shared buffers. But I am not sure how difficult it will be to implement.
Re: disabled SSL log_like tests
I wrote: > What I think happened here is that a previous backend hadn't exited > yet when we start the test, and when its report does come out, > connect_fails prematurely stops waiting. (In the above, evidently > the child process we want to wait for is 599, but we're fooled by > a delayed report for 25401.) So my v1 patch needs work. > Maybe we can make the test compare the PIDs in the "forked new client > backend" and "client backend exited" log messages. Stay tuned... Okay, this version seems more reliable. regards, tom lane diff --git a/src/test/authentication/t/001_password.pl b/src/test/authentication/t/001_password.pl index 756b4146050..37d96d95a1a 100644 --- a/src/test/authentication/t/001_password.pl +++ b/src/test/authentication/t/001_password.pl @@ -66,6 +66,8 @@ sub test_conn my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init; $node->append_conf('postgresql.conf', "log_connections = on\n"); +# Needed to allow connect_fails to inspect postmaster log: +$node->append_conf('postgresql.conf', "log_min_messages = debug2"); $node->start; # Test behavior of log_connections GUC diff --git a/src/test/authentication/t/003_peer.pl b/src/test/authentication/t/003_peer.pl index 69ba73bd2b9..2879800eacf 100644 --- a/src/test/authentication/t/003_peer.pl +++ b/src/test/authentication/t/003_peer.pl @@ -72,6 +72,8 @@ sub test_role my $node = PostgreSQL::Test::Cluster->new('node'); $node->init; $node->append_conf('postgresql.conf', "log_connections = on\n"); +# Needed to allow connect_fails to inspect postmaster log: +$node->append_conf('postgresql.conf', "log_min_messages = debug2"); $node->start; # Set pg_hba.conf with the peer authentication. diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl index 6748b109dec..2dc6bec9b89 100644 --- a/src/test/kerberos/t/001_auth.pl +++ b/src/test/kerberos/t/001_auth.pl @@ -66,6 +66,7 @@ $node->append_conf( listen_addresses = '$hostaddr' krb_server_keyfile = '$krb->{keytab}' log_connections = on +log_min_messages = debug2 lc_messages = 'C' }); $node->start; diff --git a/src/test/ldap/t/001_auth.pl b/src/test/ldap/t/001_auth.pl index 352b0fc1fa7..d1315ed5351 100644 --- a/src/test/ldap/t/001_auth.pl +++ b/src/test/ldap/t/001_auth.pl @@ -48,6 +48,8 @@ note "setting up PostgreSQL instance"; my $node = PostgreSQL::Test::Cluster->new('node'); $node->init; $node->append_conf('postgresql.conf', "log_connections = on\n"); +# Needed to allow connect_fails to inspect postmaster log: +$node->append_conf('postgresql.conf', "log_min_messages = debug2"); $node->start; $node->safe_psql('postgres', 'CREATE USER test0;'); diff --git a/src/test/modules/oauth_validator/t/001_server.pl b/src/test/modules/oauth_validator/t/001_server.pl index d88994abc24..4f035417a40 100644 --- a/src/test/modules/oauth_validator/t/001_server.pl +++ b/src/test/modules/oauth_validator/t/001_server.pl @@ -48,6 +48,8 @@ $node->init; $node->append_conf('postgresql.conf', "log_connections = on\n"); $node->append_conf('postgresql.conf', "oauth_validator_libraries = 'validator'\n"); +# Needed to allow connect_fails to inspect postmaster log: +$node->append_conf('postgresql.conf', "log_min_messages = debug2"); $node->start; $node->safe_psql('postgres', 'CREATE USER test;'); diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index 8759ed2cbba..4b858900139 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -2618,13 +2618,19 @@ to fail. 
=item expected_stderr => B -If this regular expression is set, matches it with the output generated. +If this regular expression is set, matches it to the output generated +by B. =item log_like => [ qr/required message/ ] =item log_unlike => [ qr/prohibited message/ ] -See C. +See C. CAUTION: use of either option requires that +the server's log_min_messages be at least DEBUG2, and that no other +client backend is launched concurrently. These requirements allow +C to wait to see the postmaster-log report of backend +exit, without which there is a race condition as to whether we will +see the expected backend log output. =back @@ -2652,7 +2658,14 @@ sub connect_fails like($stderr, $params{expected_stderr}, "$test_name: matches"); } - $self->log_check($test_name, $log_location, %params); + if (defined($params{log_like}) or defined($params{log_unlike})) + { + $self->wait_for_log( + qr/DEBUG: (?:0: )?forked new client backend, pid=(\d+) socket.*DEBUG: (?:0: )?client backend \(PID \1\) exited with exit code \d/s, + $log_location); + + $self->log_check($test_name, $log_location, %params); + } } =pod diff --git a/src/test/ssl/t/001_ssltests.pl b/src/test/ssl/t/001_ssltests.pl index 086abf3b8b3..8b0de2d8e7e 100644 --- a/src/test/ssl/t/001_ssltests.pl +++ b/src/test/ssl/t/001_ssltests.pl @@ -60,6 +60,8 @@ my $common_connstr; note "setting up data dire
Re: Changing shared_buffers without restart
Hi,

On April 18, 2025 11:17:21 AM GMT+02:00, Thomas Munro wrote:
> Doesn't that achieve the goal with fewer steps, using only
> portable* POSIX stuff, and keeping all pointers stable? I understand
> that pointer stability may not be required (I can see roughly how that
> argument is constructed), but isn't it still better to avoid having to
> prove that and deal with various other problems completely?

I think we should flat out reject any approach that does not maintain pointer stability. It would restrict future optimizations a lot if we can't rely on that (e.g. not materializing tuples when transporting them from worker to leader; pointer-based data structures in shared buffers).

Greetings,

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [BUG] temporary file usage report with extended protocol and unnamed portals
On 4/18/25 10:49, Frédéric Yhuel wrote: Hi, It seems there's a bug in the logging of temporary file usage when the extended protocol is used with unnamed portals. FWIW, the attached patch seems to fix the problem. Best regards, FrédéricFrom afb228f07f847c467ba05dbe204861e7be2ffc32 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Yhuel?= Date: Fri, 18 Apr 2025 13:20:52 +0200 Subject: [PATCH] fix reporting of temp files usage when extended protocol is used with unnamed portals --- src/backend/utils/mmgr/portalmem.c | 4 1 file changed, 4 insertions(+) diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c index e3526e78064..246e711db81 100644 --- a/src/backend/utils/mmgr/portalmem.c +++ b/src/backend/utils/mmgr/portalmem.c @@ -27,6 +27,7 @@ #include "utils/memutils.h" #include "utils/snapmgr.h" #include "utils/timestamp.h" +#include "tcop/tcopprot.h" /* * Estimate of the maximum number of open portals a user would have, @@ -488,6 +489,9 @@ PortalDrop(Portal portal, bool isTopCommit) (errcode(ERRCODE_INVALID_CURSOR_STATE), errmsg("cannot drop active portal \"%s\"", portal->name))); + if (portal->queryDesc) + debug_query_string = portal->queryDesc->sourceText; + /* * Allow portalcmds.c to clean up the state it knows about, in particular * shutting down the executor if still active. This step potentially runs -- 2.47.2
Re: ALTER COLUMN SET DATA TYPE does not change the generation expression's collation
On Wed, Mar 26, 2025 at 1:01 PM jian he wrote: > > hi. > > ATPrepAlterColumnType forbids us to ALTER COLUMN SET DATA TYPE USING (expr) > for generated columns. > however we can still change the generated column type from non-text to text > or text type from one collation to another collation. > > In ATExecAlterColumnType, we also need to set the generation > expression collation? > > We can do this by adding exprSetCollation: > > --- a/src/backend/commands/tablecmds.c > +++ b/src/backend/commands/tablecmds.c > @@ -14115,6 +14115,7 @@ ATExecAlterColumnType(AlteredTableInfo *tab, > Relation rel, > errmsg("default for > column \"%s\" cannot be cast automatically to type %s", > > colName, format_type_be(targettype; > } > + exprSetCollation(defaultexpr, targetcollid); > > > - > CREATE TABLE x1(a int, > b int GENERATED ALWAYS AS (a * 2) stored, > c text GENERATED ALWAYS AS ('1') stored ); > ALTER TABLE x1 alter column b set data type text collate "C"; > ALTER TABLE x1 alter column c set data type text collate "C"; > > SELECT pg_get_expr(d.adbin, d.adrelid) AS default_value, d.adbin > FROM pg_catalog.pg_attributea > JOIN pg_catalog.pg_attrdef d ON (a.attrelid, a.attnum) = (d.adrelid, > d.adnum) > ANDa.attrelid = 'x1'::regclass > ANDa.attname in ('b', 'c'); > by adding exprSetCollation, the output is > > default_value | (a * 2) > adbin | {COERCEVIAIO :arg {OPEXPR :opno 514 :opfuncid 141 > :opresulttype 23 :opretset false :opcollid 0 :inputcollid 0 :args > ({VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 > :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 > :varattnosyn 1 :location -1} {CONST :consttype 23 :consttypmod -1 > :constcollid 0 :constlen 4 :constbyval true :constisnull false > :location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]}) :location -1} > :resulttype 25 :resultcollid 950 :coerceformat 2 :location -1} > -[ RECORD 2 > ]-+ > default_value | '1'::text COLLATE "C" > adbin | {CONST :consttype 25 :consttypmod -1 :constcollid 950 > :constlen -1 :constbyval false :constisnull false :location -1 > :constvalue 5 [ 20 0 0 0 49 ]} > > > master behavior: > > default_value | (a * 2) > adbin | {COERCEVIAIO :arg {OPEXPR :opno 514 :opfuncid 141 > :opresulttype 23 :opretset false :opcollid 0 :inputcollid 0 :args > ({VAR :varno 1 :varattno 1 :vartype 23 :vartypmod -1 :varcollid 0 > :varnullingrels (b) :varlevelsup 0 :varreturningtype 0 :varnosyn 1 > :varattnosyn 1 :location -1} {CONST :consttype 23 :consttypmod -1 > :constcollid 0 :constlen 4 :constbyval true :constisnull false > :location -1 :constvalue 4 [ 2 0 0 0 0 0 0 0 ]}) :location -1} > :resulttype 25 :resultcollid 0 :coerceformat 2 :location -1} > -[ RECORD 2 > ]-+-- > default_value | '1'::text > adbin | {CONST :consttype 25 :consttypmod -1 :constcollid 100 > :constlen -1 :constbyval false :constisnull false :location -1 > :constvalue 5 [ 20 0 0 0 49 ]} I still think this is a bug, so i put it on the https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items#Older_bugs_affecting_stable_branches
Re: Built-in Raft replication
Hi Konstantin On Wed, 16 Apr 2025 at 15:07, Konstantin Osipov wrote: > * Alastair Turner [25/04/16 15:58]: > > > > > If you use build-in failover you have to resort to 3 big Postgres > > > machines because you need 2/3 majority. Of course, you can install > > > MySQL-stype arbiter - host that had no real PGDATA, only participates > in > > > voting. But this is a solution to problem induced by built-in > autofailover. > > > > > > Users find it a waste of resources to deploy 3 big PostgreSQL > > > instances just for HA where 2 suffice even if they deploy 3 > > > lightweight DCS instances. Having only some of the nodes act as DCS > > > and others purely PostgreSQL nodes will reduce waste of resources. > > > > > > The experience of other projects/products with automated failover > based on > > quorum shows that this is a critical issue for adoption. In the In-memory > > Data Grid space (Coherence, Geode/GemFire) the question of how to ensure > > that some nodes didn't carry any data comes up early in many architecture > > discussions. When RabbitMQ shipped their Quorum Queues feature, the first > > and hardest area of pushback was around all nodes hosting message > content. > > > > It's not just about the requirement for compute resources, it's also > about > > bandwidth and latency. Many large organisations have, for historical > > reasons, pairs of data centres with very good point-to-point > connectivity. > > As the requirement for quorum witnesses has come up for all sorts of > > things, including storage arrays, they have built arbiter/witness sites > at > > branches, colocation providers or even on the public cloud. More than not > > holding user data or processing queries, the arbiter can't even be sent > the > > replication stream for the user data in the database, it just won't fit > > down the pipe. > > > > Which feels like a very difficult requirement to meet if the replication > > model for all data is being changed to a quorum model. > > I agree master/replica deployment layouts are very popular and are > not going to directly benefit from raft. They'll still work, but no > automation will be available, just like today with Patroni. > > Users of Patroni and etcd setups can get automation for two-site primary/replica pairs by putting a third etcd node on a third site. Which only requires moving the membership/leadership data to the arbiter site, not all database activity. > However, if the storage cost is an argument, then the logical path is to > disaggregate storage/compute altogether, i.e. use projects like > neon. > > The issue is not generally storage, but network. There may simply not be enough bandwidth available to transmit the whole WAL to the arbiter site. Many on-premises IT setups have this limitation in some form. If your proposal would leave these large, traditional user organisations (which account for thousands of Postgres HA pairs or DR pairs) doing what they currently do with wraparound tooling like Patroni, and create a new, in core, option for balanced 3, 5, 7... member groups, then I don't think it's worth doing. Regards, Alastair
Re: jsonapi: scary new warnings with LTO enabled
Jacob Champion writes: > On Wed, Apr 16, 2025 at 4:04 PM Tom Lane wrote: >> Looking through all of the callers of freeJsonLexContext, quite >> a lot of them use local JsonLexContext structs, and probably some >> of them are more performance-critical than these. So that raises >> the question of why are we seeing warnings for only these call >> sites? > Yeah, I had the same question... After making another pass through the callers of freeJsonLexContext, I observe that the warnings appear in callers that use a local variable *and* contain goto statements. So I'm betting that the presence of goto's causes the LTO optimizer to pull in its horns quite a bit and thereby fail to detect the flag correlation. >> Maybe there is a more elegant way to suppress them. > Can we brute-force ignore this particular warning site with a #pragma > (suggested in [1])? That's surely not elegant :-(. However, I don't especially want to rewrite away the goto's in these callers ... regards, tom lane
Re: jsonapi: scary new warnings with LTO enabled
On Thu, Apr 17, 2025 at 8:20 AM Tom Lane wrote: > I confirm this silences those warnings on my Fedora 41 box. Instead of doing lex = calloc(...); /* (error out on NULL return) */ makeJsonLexContextCstringLen(lex, ...); we need to do lex = makeJsonLexContextCstringLen(NULL, ...); /* (error out on NULL return) */ so that JSONLEX_FREE_STRUCT is set correctly. Otherwise we'll leak the main allocation: ==12550==ERROR: LeakSanitizer: detected memory leaks Direct leak of 120 byte(s) in 1 object(s) allocated from: #0 0xe34d2a84 in __interceptor_calloc (/home/jacob/src/postgres/worktree-oauth/build-clang/src/interfaces/libpq/fuzz_libpq_handle_oauth_sasl_error+0x112a84) (BuildId: 359bf20b63a97771ccb3bd2c238485920485521f) #1 0xe3510ff0 in handle_oauth_sasl_error /home/jacob/src/postgres/worktree-oauth/build-clang/../src/interfaces/libpq/fe-auth-oauth.c:511:8 > I'm content to do it like this, but maybe Jacob wants to > investigate alternatives? I was more worried about it when you said you wanted to get rid of the stack allocation API. (I like having the flexibility to choose between the two forms, not just for performance but also for struct embedding.) But I'm perfectly happy with just adjusting these sites. Thanks! --Jacob
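For context, a minimal sketch of the ownership pattern Jacob describes, with invented surrounding variables: passing NULL as the first argument makes the constructor allocate the struct and set JSONLEX_FREE_STRUCT, so a later freeJsonLexContext() releases it and the leak above goes away.

/* Hypothetical call site illustrating constructor-owned allocation. */
JsonLexContext *lex;

lex = makeJsonLexContextCstringLen(NULL, json_buf, json_len,
								   PG_UTF8, true /* need_escapes */);
if (lex == NULL)
	return false;				/* out of memory; handling is caller-specific */

/* ... run the parser against lex ... */

freeJsonLexContext(lex);		/* frees members and, here, the struct itself */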
SQL functions: avoid making a tuplestore unnecessarily
(I'm not proposing this for v18, but I wanted to get the patch written while functions.c is still fresh in mind.) The attached patch changes functions.c so that we only make a tuplestore object if we're actually going to return multiple rows in it, that is if it's a set-returning function and we cannot use "lazyEval" mode. Otherwise, we can just rely on the result slot we made anyway to hold the one tuple we need. This saves a small number of cycles by not shoving a tuple into the tuplestore only to pull it right back out. But the real reason for changing it is not speed but resource-management worries. The existing code (as it was before 0dca5d68d and is again as of 0313c5dc6) makes the tuplestore in the potentially long-lived fn_mcxt. This'd be merely a slight annoyance if a tuplestore were purely a memory object, but it isn't: it might contain an open file. If it does, that file reference will be backed by a ResourceOwner, specifically whichever ResourceOwner was CurrentResourceOwner at the time of creating the tuplestore. What I am afraid of is that I don't think there's any guarantee that that ResourceOwner is as long-lived as the FmgrInfo. It should be fine within normal SQL query execution, where the FmgrInfo will be part of the executor's state tree. But there are long-lived FmgrInfos, such as those within the typcache, or within an index's relcache entry. What if somebody tries to use a SQL function to implement functionality that's reached through those mechanisms? Given the lack of field complaints over the many years it's been like this, there doesn't seem to be a live problem. I think the explanation for that is (1) those mechanisms are never used to call set-returning functions, (2) therefore, the tuplestore will not be called on to hold more than one result row, (3) therefore, it won't get large enough that it wants to spill to disk, (4) therefore, its potentially dangling resowner pointer is never used. However, this is an uncomfortably shaky chain of assumptions. I want to cut it down by fixing things so that there is no tuplestore, period, in a non-set-returning SQL function. Furthermore, this patch changes things so that when we do make a tuplestore, it lives in the per-query context not the fn_mcxt, providing additional surety that it won't outlive its ResourceManager. This is a change I made in 0dca5d68d and then had to undo for performance reasons, but because of these resource-lifetime considerations I really want it back that way again. Since I think there is probably not a reachable bug here today, I'm going to park this for v19. regards, tom lane diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c index e0bca7cb81c..8d4d062d579 100644 --- a/src/backend/executor/functions.c +++ b/src/backend/executor/functions.c @@ -44,7 +44,7 @@ typedef struct { DestReceiver pub; /* publicly-known function pointers */ - Tuplestorestate *tstore; /* where to put result tuples */ + Tuplestorestate *tstore; /* where to put result tuples, or NULL */ JunkFilter *filter; /* filter to convert tuple type */ } DR_sqlfunction; @@ -145,11 +145,13 @@ typedef struct SQLFunctionCache bool lazyEvalOK; /* true if lazyEval is safe */ bool shutdown_reg; /* true if registered shutdown callback */ bool lazyEval; /* true if using lazyEval for result query */ + bool randomAccess; /* true if tstore needs random access */ bool ownSubcontext; /* is subcontext really a separate context? 
*/ ParamListInfo paramLI; /* Param list representing current args */ - Tuplestorestate *tstore; /* where we accumulate result tuples */ + Tuplestorestate *tstore; /* where we accumulate result for a SRF */ + MemoryContext tscontext; /* memory context that tstore should be in */ JunkFilter *junkFilter; /* will be NULL if function returns VOID */ int jf_generation; /* tracks whether junkFilter is up-to-date */ @@ -1250,7 +1252,7 @@ static void postquel_start(execution_state *es, SQLFunctionCachePtr fcache) { DestReceiver *dest; - MemoryContext oldcontext; + MemoryContext oldcontext = CurrentMemoryContext; Assert(es->qd == NULL); @@ -1296,12 +1298,27 @@ postquel_start(execution_state *es, SQLFunctionCachePtr fcache) fcache->ownSubcontext = false; } + /* + * Build a tuplestore if needed, that is if it's a set-returning function + * and we're producing the function result without using lazyEval mode. + */ + if (es->setsResult) + { + Assert(fcache->tstore == NULL); + if (fcache->func->returnsSet && !es->lazyEval) + { + MemoryContextSwitchTo(fcache->tscontext); + fcache->tstore = tuplestore_begin_heap(fcache->randomAccess, + false, work_mem); + } + } + /* Switch into the selected subcontext (might be a no-op) */ - oldcontext = MemoryContextSwitchTo(fcache->subcontext); + MemoryContextSwitchTo(fcache->subcontext); /* - * If this query produces the function result, send its output to the -
Re: Missing comma in libpq.sgml
> On 17 Apr 2025, at 13:30, Tatsuo Ishii wrote: > > In the "Asynchronous Command Processing" section of libpq.sgml, > there's a line which misses a comma at the end. Attached patch should > fix it. I am going to push attached patch into master and > REL_17_STABLE branches if there's no objection. +1 -- Daniel Gustafsson
Re: Changing shared_buffers without restart
On 25/02/2025 11:52 am, Dmitry Dolgov wrote:
> On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.

Hi Dmitry,

I am sorry that I have not participated in the discussion in this thread from the very beginning, although I am also very interested in dynamic shared buffer resizing and even proposed my own implementation of it: https://github.com/knizhnik/postgres/pull/2 based on memory ballooning and using `madvise`. And it really works (returns unused memory to the system). This PoC allowed me to understand the main drawbacks of this approach:

1. Performance of the Postgres CLOCK page eviction algorithm depends on the number of shared buffers. My first naive attempt just to mark unused buffers as invalid caused a significant degradation of performance.

pgbench -c 32 -j 4 -T 100 -P1 -M prepared -S

(here shared_buffers is the maximal shared buffers size and `available_buffers` is the used part):

| shared_buffers | available_buffers | TPS  |
| -------------- | ----------------- | ---- |
| 128MB          | -1                | 280k |
| 1GB            | -1                | 324k |
| 2GB            | -1                | 358k |
| 32GB           | -1                | 350k |
| 2GB            | 128MB             | 130k |
| 2GB            | 1GB               | 311k |
| 32GB           | 128MB             | 13k  |
| 32GB           | 1GB               | 140k |
| 32GB           | 2GB               | 348k |

My first thought was to replace the clock with an LRU based on a doubly-linked list. Since there is no lockless doubly-linked list implementation, it needs some global lock, and this lock can become a bottleneck. The standard solution is partitioning: use N LRU lists instead of one, just like the partitioned hash table used by the buffer manager to look up buffers. Actually we can use the same partition locks to protect the LRU lists. But it is not clear what to do with ring buffers (strategies). So I decided not to perform such a revolution in bufmgr, but to optimize the clock to skip reserved buffers more efficiently. Just add a skip_count field to the buffer descriptor. And it helps! Now the worst case shared_buffers/available_buffers = 32GB/128MB shows the same performance, 280k, as shared_buffers=128MB without ballooning.

2. There are several data structures in Postgres whose size depends on the number of buffers. In my patch I used in some cases a dynamic shared buffer size, but if a structure has to be allocated in shared memory then still the maximal size has to be used. We have the buffers themselves (8 kB per buffer), then the main BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry). 128 bytes per 8 kB seems too large an overhead (~1%), but it may be quite noticeable with size differences larger than 2 orders of magnitude: e.g. to support scaling from 0.5GB to 128GB, with 128 bytes/buffer we'd have ~2GiB of static overhead on only 0.5GiB of actual buffers.

3. `madvise` is not portable.

Certainly you have moved much further in your proposal compared with my PoC (including huge pages support). But it is still not quite clear to me how you are going to solve the problems with large memory overhead in case of ~100x variation of shared buffers size. I
Re: Changing shared_buffers without restart
Hi Ashutosh / Dmitry, Thanks for the information and discussions, it's been very helpful. I also have a related question about how ftruncate() is used in the patch. In my testing I also see that when using ftruncate to shrink a shared segment, the memory is freed immediately after the call, even if other processes still have that memory mapped, and they will hit SIGBUS if they try to access that memory again as the manpage says. So am I correct to think that, to support the bufferpool shrinking case, it would not be safe to call ftruncate in AnonymousShmemResize as-is, since at that point other processes may still be using pages that belong to the truncated memory? It appears that for shrinking we should only call ftruncate when we're sure no process will access those pages again (eg, all processes have handled the resize interrupt signal barrier). I suppose this can be done by the resize coordinator after synchronizing with all the other processes. But in that case it seems we cannot use the postmaster as the coordinator then? b/c I see some code comments saying the postmaster does not have waiting infrastructure... (maybe even if the postmaster has waiting infra we don't want to use it anyway since it can be blocked for a long time and won't be able to serve other requests). Regards, Jack Ng
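P.S. To sketch the ordering described above (the synchronization step is only hand-waved and the helper name is hypothetical):

```c
/*
 * Rough sketch of the shrink ordering discussed above (not from the patch).
 * ftruncate() releases the pages immediately, so any process that still
 * touches the truncated range afterwards gets SIGBUS; hence the barrier has
 * to complete first.
 */
#include <sys/types.h>
#include <unistd.h>

static int
shrink_shared_segment(int shm_fd, size_t new_size)
{
	/*
	 * Hypothetical step: the coordinator waits here until every attached
	 * process has absorbed the resize barrier and promised not to touch
	 * anything beyond new_size.
	 *
	 * WaitForResizeBarrier();
	 */

	/* Only now is it safe to hand the memory back to the kernel. */
	return ftruncate(shm_fd, (off_t) new_size);
}
```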
Re: Changing shared_buffers without restart
On Thu, Nov 21, 2024 at 8:55 PM Peter Eisentraut wrote:
> On 19.11.24 14:29, Dmitry Dolgov wrote:
> >> I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear how
> >> that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify
> >> both of them if you want huge pages?
> > Correct, both (one flag in memfd_create and one for mmap) are needed to
> > use huge pages.
>
> I was worried because the FreeBSD man page says
>
> MFD_HUGETLB This flag is currently unsupported.
>
> It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant.
>
> But you should make sure in your patch that the right set of flags for
> huge pages is passed.

MFD_HUGETLB does actually work on FreeBSD, but the man page doesn't admit it (guessing an oversight, not sure, will see). And you don't need the corresponding (non-existent) mmap flag. You also have to specify a size, eg MFD_HUGETLB | MFD_HUGE_2MB, or you get ENOTSUPP, but other than that quirk I see it definitely working with eg procstat -v. That might be because FreeBSD doesn't have a default huge page size concept? On Linux that's a boot time setting, I guess rarely changed.

I contemplated that once before, when I wrote a quick demo patch[1] to implement huge_pages=on for FreeBSD (ie explicit rather than transparent). I used a different function, not the Linuxoid one but it's the same under the covers, and I wrote:

+	/*
+	 * Find the matching page size index, or if huge_page_size wasn't set,
+	 * then skip the smallest size and take the next one after that.
+	 */

Swapping that topic back in, I was left wondering: (1) how to choose between SHM_LARGEPAGE_ALLOC_DEFAULT, a policy that will cause ftruncate() to try to defragment physical memory to fulfil your request and can eat some serious CPU, and SHM_LARGEPAGE_ALLOC_NOWAIT, and (2) if it's the second thing, well Linux is like that in respect of failing fast, but for it to succeed you have to configure nr_hugepages in the OS as a separate administrative step and *that's* when it does any defragmentation required, and that's another concept FreeBSD doesn't have. It's a bit of a weird concept too, I mean those pages are not reserved for you in any way and anyone could nab them, which is undeniably practical but it lacks a few qualities one might hope for in a kernel facility... IDK. Anyway, the Linux-like memfd_create() always does it the _DEFAULT way. Either way, we can't have identical "try" semantics: it'll actually put some effort into trying, perhaps burning many seconds of CPU.

I took a peek at what we're doing for Windows and the man pages tell me that it's like that too. I don't recall hearing any complaints about that, but it's gated on a Windows permission that I assume very few enabled, so "try" probably isn't trying for most systems. Quoting:

"Large-page memory regions may be difficult to obtain after the system has been running for a long time because the physical space for each large page must be contiguous, but the memory may have become fragmented. Allocating large pages under these conditions can significantly affect system performance. Therefore, applications should avoid making repeated large-page allocations and instead allocate all large pages one time, at startup."

For Windows we also interpret "on" with GetLargePageMinimum(), which sounds like my "second known page size" idea.
To make Windows do the thing that this thread wants, I found a thread saying that calling VirtualAlloc(..., MEM_RESET) and then convincing every process to call VirtualUnlock(...) might work: https://groups.google.com/g/microsoft.public.win32.programmer.kernel/c/3SvznY38SSc/m/4Sx_xwon1vsJ I'm not sure what to do about the other Unixen. One option is nothing, no feature, patches welcome. Another is to use shm_open(), like DSM segments, except we never need to reopen these ones so we could immediately call shm_unlink() to leave only a very short window to crash and leak a name. It'd be low risk name pollution in a name space that POSIX forgot to provide any way to list. The other idea is non-standard madvise tricks but they seem far too squishy to be part of a "portable" fallback if they even work at all, so it might be better not to have the feature than that I think.
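For reference, the flag combination being discussed looks roughly like this (a sketch following the claims above, not tested on either platform):

```c
/*
 * Sketch of a huge-page-backed shared segment via memfd_create(), following
 * the discussion above: Linux reportedly wants the huge-page flag on both
 * calls, while FreeBSD needs an explicit size flag (e.g. MFD_HUGE_2MB) and
 * no mmap() counterpart.  The size must be a multiple of the huge page size.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

static void *
create_huge_segment(size_t size)
{
	int		fd = memfd_create("pg_shmem",
							  MFD_CLOEXEC | MFD_HUGETLB | MFD_HUGE_2MB);

	if (fd < 0 || ftruncate(fd, (off_t) size) < 0)
		return NULL;

	return mmap(NULL, size, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_HUGETLB,	/* MAP_HUGETLB: Linux only */
				fd, 0);
}
```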
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Tue, Apr 15, 2025 at 2:38 PM Jelte Fennema-Nio wrote:
> libpq_append_conn_error(conn, "no custom OAuth flows are available,
> and the libpq-oauth library could not be loaded. Try
> installing the libpq-oauth package from the same source that you
> installed libpq from");

Thanks! I think that's a little too prescriptive for packagers, personally, but I agree that the current message isn't correct anymore. I've gone with "no custom OAuth flows are available, and the builtin flow is not installed". (I suppose packagers could patch in a platform-specific message if they really wanted?)

--

Other changes in v7:

- The option name remains --with-libcurl.

- Daniel and I have tweaked the documentation, and a draft commit message is up.

- Removed the ENABLE_NLS-mismatch assertion in oauth-utils.c; we don't need to care anymore.

- Added an initialization mutex.

  I was feeling paranoid about injecting dependency pointers concurrently with their use in another thread. They're _supposed_ to be constant... but I have no doubt that someone somewhere knows of a platform/compiler/linker combo where that blows up anyway. Initialization is now run once, under pthread_mutex protection.

- Fixed module load on macOS.

  The green CI was masking a bug with its use of DYLD_LIBRARY_PATH: we don't make use of RPATH on macOS, so after installing libpq, it lost the ability to find libpq-oauth. (A stale installation due to SIP weirdness was masking this on my local machine; sorry for not catching it before.) I have swapped to using an absolute path on Mac only, because unlike LD_LIBRARY_PATH on *nix, DYLD_LIBRARY_PATH can still override absolute paths in dlopen()! Whee. I could use a sanity check from a native Mac developer, but I believe this mirrors the expected behavior for a "typical" runtime dependency: libraries point directly to the things they depend on.

With those, I have no more TODOs and I believe this is ready for a final review round.

Thanks,
--Jacob

v7-0001-oauth-Move-the-builtin-flow-into-a-separate-modul.patch Description: Binary data
Re: disabled SSL log_like tests
On Fri, Apr 18, 2025 at 12:46 PM Tom Lane wrote: > * The commented-out tests in 001_ssltests.pl contained hard-wired > expectations as to certificate serial numbers, which are obsolete now. > I just replaced them with "\d+", but if anyone feels like that's not > good enough, we could continue to check for exact serial numbers and > eat the ensuing maintenance effort. No, I think \d+ should be fine. (I haven't stared closely at the racing code yet, but I like the concept of the patch.) Thanks! --Jacob
SQL function to access to `creating_extension`
Hi, I propose to introduce a `pg_creating_extension()` function that would return the OID of the extension currently being created (or `null` if none is). The core motivation for it is to be able to establish data provenance in tables created by extensions, for use in `pg_extension_config_dump` configuration. This way, even if a secondary extension inserts data, it can still be excluded from the dump by tracking data provenance in a column with a default. Something like

```
create table my_config (
  --- ...
  extension_provenance oid default pg_creating_extension()
)
```

This would allow for a generalized exclusion in pg_extension_config_dump. I've attached a draft patch for this simple function. I am happy to finalize it with tests and documentation if there is a consensus on the shape.

-- Founder at Omnigres https://omnigres.com

v1-0001-pg_creating_extension-to-inspect-if-an-extension-is-.patch Description: Binary data
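To spell out the pg_extension_config_dump tie-in, the extension's install script could then do something like the following (the WHERE condition is only an illustration; the function itself is the only new piece proposed here):

```sql
-- Sketch: mark my_config as a configuration table, and dump only rows that
-- were not inserted by any extension (extension_provenance IS NULL), i.e.
-- user-provided rows.
SELECT pg_catalog.pg_extension_config_dump('my_config',
       'WHERE extension_provenance IS NULL');
```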
Re: disabled SSL log_like tests
I wrote:
> * I was more than slightly surprised to find that there are a bunch of
> other connect_fails callers that are testing log_like or log_unlike
> and thereby risking the same type of race condition. Some of those
> tests are relatively new and perhaps just haven't failed *yet*,
> but I wonder if we changed something since 2022 that solves this
> problem in a different way?

Nope, apparently not. The failure depends on the kernel's scheduler, so unsurprisingly it's quite platform-dependent. But I can reproduce it reliably on mamba's host (NetBSD 10/macppc) if I enable the 001_ssltests.pl log_like tests without applying the connect_fails changes.

The fact that mamba hasn't been failing on the other affected tests is a bit puzzling. mamba isn't running kerberos or ldap or oauth_validator, so the lack of failures there is expected. authentication/t/001_password.pl appears accidentally not vulnerable: it's not using log_like in any connect_fails tests. (It is using log_unlike, so those tests could be giving silent false negatives.) But authentication/t/003_peer.pl has 8 test cases that look vulnerable. I guess there must be some extra dollop of timing weirdness in the ssl tests that's not there in 003_peer.pl.

Unfortunately ... it sometimes fails even with the connect_fails changes, for example

# Failed test 'intermediate client certificate is missing: log matches'
# at /home/tgl/pgsql/src/test/ssl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2666.
# '2025-04-18 17:59:19.358 EDT [1460] DEBUG: assigned pm child slot 2 for backend
# 2025-04-18 17:59:19.359 EDT [1460] DEBUG: forked new client backend, pid=599 socket=8
# 2025-04-18 17:59:19.369 EDT [599] [unknown] LOG: connection received: host=localhost port=63709
# 2025-04-18 17:59:19.436 EDT [1460] DEBUG: releasing pm child slot 1
# 2025-04-18 17:59:19.436 EDT [1460] DEBUG: client backend (PID 25401) exited with exit code 0
# '
# doesn't match '(?^:Client certificate verification failed at depth 0: unable to get local issuer certificate)'

What I think happened here is that a previous backend hadn't exited yet when we start the test, and when its report does come out, connect_fails prematurely stops waiting. (In the above, evidently the child process we want to wait for is 599, but we're fooled by a delayed report for 25401.)

So my v1 patch needs work. Maybe we can make the test compare the PIDs in the "forked new client backend" and "client backend exited" log messages. Stay tuned...

regards, tom lane
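P.S. A rough sketch of that PID-matching idea, based on the log lines quoted above (untested, and the exact helper calls are an assumption):

```perl
# Untested sketch: pull the PID out of the "forked new client backend"
# message for our connection attempt, then wait for the matching
# "client backend (PID n) exited" line before checking log_like patterns.
my $log =
  PostgreSQL::Test::Utils::slurp_file($self->logfile, $log_location);

if ($log =~ /forked new client backend, pid=(\d+)/)
{
	my $pid = $1;
	$self->wait_for_log(qr/client backend \(PID \Q$pid\E\) exited/,
		$log_location);
}
```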
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Thu, Apr 17, 2025 at 5:47 PM Jacob Champion wrote:
> With those, I have no more TODOs and I believe this is ready for a
> final review round.

Some ABI self-review. These references to conn->errorMessage also need the indirection treatment, which I'm working on now:

> if (actx->errctx)
> {
>     appendPQExpBufferStr(&conn->errorMessage,
>                          libpq_gettext(actx->errctx));
>     appendPQExpBufferStr(&conn->errorMessage, ": ");
> ...

I was searching backwards through history to confirm that we don't rearrange struct pg_conn in back branches; turns out that was a false assumption. See e8f60e6fe2:

    While at it, fix some places where parameter-related infrastructure was added with the aid of a dartboard, or perhaps with the aid of the anti-pattern "add new stuff at the end". It should be safe to rearrange the contents of struct pg_conn even in released branches, since that's private to libpq (and we'd have to move some fields in some builds to fix this, anyway).

So that means, I think, the name needs to go back to --, unless anyone can think of a clever way around it. (Injecting conn->errorMessage to avoid the messiness around ENABLE_GSS et al is still useful, but injecting every single offset doesn't seem maintainable to me.)

Sorry, Christoph; I know that's not what you were hoping for.

--Jacob
Re: [BUG] temporary file usage report with extended protocol and unnamed portals
> [...] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp525566.0", size 2416640
> [..] STATEMENT: SELECT 1
> but it should be:
> [...] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp538230.0", size 2416640
> [...] STATEMENT: SELECT * FROM foo ORDER BY a OFFSET $1 LIMIT 2

hmm, looking at the repro and your patch, I was really confused about why the STATEMENT: ends up being SELECT 1 even though the query with the temp file being tracked is the ORDER BY query. So, I ran a slowed down test where there is a 15 second thread sleep between the ORDER BY and SELECT 1 ( see attached ) and I also ran it with the following GUCs enabled

```
log_temp_files = '0'
work_mem = '64kB'
max_parallel_workers_per_gather = '0'
log_min_duration_statement = '0'
```

2025-04-18 13:31:45.572 CDT [95918] LOG: duration: 0.046 ms bind : select pg_sleep(0)
2025-04-18 13:31:45.572 CDT [95918] LOG: duration: 0.009 ms execute : select pg_sleep(0)
2025-04-18 13:31:45.574 CDT [95918] LOG: duration: 0.454 ms parse : SELECT * FROM foo ORDER BY a
2025-04-18 13:31:45.575 CDT [95918] LOG: duration: 0.278 ms bind : SELECT * FROM foo ORDER BY a
2025-04-18 13:31:46.396 CDT [95918] LOG: duration: 821.076 ms execute : SELECT * FROM foo ORDER BY a
2025-04-18 13:32:01.403 CDT [95918] LOG: duration: 0.438 ms parse : SELECT 1
2025-04-18 13:32:01.454 CDT [95918] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp95918.0", size 202039296
2025-04-18 13:32:01.454 CDT [95918] STATEMENT: SELECT 1
2025-04-18 13:32:01.454 CDT [95918] LOG: duration: 50.818 ms bind : SELECT 1
2025-04-18 13:32:01.454 CDT [95918] LOG: duration: 0.022 ms execute : SELECT 1
2025-04-18 13:32:01.457 CDT [95918] LOG: duration: 0.064 ms parse S_1: COMMIT
2025-04-18 13:32:01.457 CDT [95918] LOG: duration: 0.017 ms bind S_1: COMMIT
2025-04-18 13:32:01.457 CDT [95918] LOG: duration: 0.105 ms execute S_1: COMMIT

The log above shows that the temporary file logging does not actually occur until the next query is executed, so at that point the query string is already "SELECT 1".

What is even more interesting is that when running the commands directly from psql using the extended query protocol, the STATEMENT is not logged after the temporary file message.

```
SELECT FROM foo ORDER BY a \bind ;
SELECT 1 \bind ;
```

```
istance=0 kB, estimate=0 kB; lsn=0/E55A088, redo lsn=0/E55A030
2025-04-18 13:36:39.275 CDT [96237] LOG: duration: 1.447 ms parse : SELECT FROM foo ORDER BY a ;
2025-04-18 13:36:39.275 CDT [96237] LOG: duration: 0.719 ms bind : SELECT FROM foo ORDER BY a ;
2025-04-18 13:36:39.822 CDT [96237] LOG: duration: 546.803 ms execute : SELECT FROM foo ORDER BY a ;
2025-04-18 13:36:39.852 CDT [96237] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp96237.0", size 202039296
2025-04-18 13:36:40.736 CDT [96237] LOG: duration: 0.394 ms parse : SELECT 1 ;
2025-04-18 13:36:40.736 CDT [96237] LOG: duration: 0.133 ms bind : SELECT 1 ;
2025-04-18 13:36:40.736 CDT [96237] LOG: duration: 0.023 ms execute : SELECT 1 ;
```

I think the DropPortal is happening later in the JDBC case, and only after the next query is executed, maybe? So your patch does take care of the problem, because it ensures that the drop portal, when it occurs, is referencing the correct SQL.

I am not yet sure if the patch is the right solution; it may be the best answer. I don't have a better answer, but wanted to share this research as well.

-- Sami Imseih Amazon Web Services (AWS)

Test.java Description: Binary data
Re: n_ins_since_vacuum stats for aborted transactions
On Thu, Apr 17, 2025 at 11:13 PM David Rowley wrote:
>
> On Sat, 12 Apr 2025 at 07:33, Sami Imseih wrote:
> > What do you think of the attached?
>
> I looked at the v3 patch and I'm having trouble getting excited about it.
>
> I'd say this part is misleading:
>
> @@ -3956,7 +3961,8 @@ description | Waiting for a newly initialized
> WAL file to reach durable storage
>  n_dead_tup bigint
>
> - Estimated number of dead rows
> + Estimated number of dead rows (updated by committed transactions, or by
> + VACUUM and VACUUM FULL)
>
> An aborted insert will contribute to that counter, but that's not mentioned.

I thought the first line "An aborted transaction will also increment tuple-related counters, unless otherwise noted." makes it clear that it will be updated in an aborted transaction, but after re-reading I can see it being confusing. What about this wording to make it more clear when the field is updated?

 n_dead_tup bigint

- Estimated number of dead rows
+ Estimated number of dead rows (updated either by committed or aborted transactions,
+ or by VACUUM and VACUUM FULL)

> Would it be ok just to adjust n_mod_since_analyze's "Estimated number
> of rows modified since this table was last analyzed" and inject "by
> committed transactions" after "modified", then call it a day?

That works also.

-- Sami Imseih Amazon Web Services (AWS)
Re: Fix slot synchronization with two_phase decoding enabled
On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) wrote: > > > -- > Approach 2 > -- > > Instead of disallowing the use of two-phase and failover together, a more > flexible strategy could be only restrict failover for slots with two-phase > enabled when there's a possibility of existing prepared transactions before > the > two_phase_at that are not yet replicated. During slot creation with two-phase > and failover, we could check for any decoded prepared transactions when > determining the decoding start point (DecodingContextFindStartpoint). For > subsequent attempts to alter failover to true, we ensure that two_phase_at is > less than restart_lsn, indicating that all prepared transactions have been > committed and replicated, thus the bug would not happen. > > pros: > > This method minimizes restrictions for users. Especially during slot creation > with (two_phase=on, failover=on), as it’s uncommon for transactions to prepare > during consistent snapshot creation, the restriction becomes almost > unnoticeable. I think this approach can work for the transactions that are prepared while the slot is created. But if I understand the problem correctly, while the initial table sync is performing, the slot's two_phase is still false, so we need to deal with the transactions that are prepared during the initial table sync too. What do you think? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
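P.S. To illustrate the condition in approach 2 (assuming the two_phase_at column exposed by pg_replication_slots on HEAD), the ALTER-failover check is essentially:

```sql
-- Illustration only: slots with two_phase enabled for which allowing
-- failover should be safe under approach 2, because two_phase_at has
-- already been passed by restart_lsn, i.e. all prepared transactions
-- decoded before two_phase_at are committed and replicated.
SELECT slot_name
FROM pg_replication_slots
WHERE two_phase
  AND two_phase_at < restart_lsn;
```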
Re: ALTER COLUMN SET DATA TYPE does not change the generation expression's collation
jian he writes:
>> ATPrepAlterColumnType forbids us to ALTER COLUMN SET DATA TYPE USING (expr)
>> for generated columns.
>> however we can still change the generated column type from non-text to text
>> or text type from one collation to another collation.

I don't really understand why we allow SET DATA TYPE on a generated column at all. However ...

>> In ATExecAlterColumnType, we also need to set the generation
>> expression collation?

Don't believe that would matter in the slightest. The generation expression is not exposed anywhere --- we don't incorporate it in the plan tree, just evaluate it in ExecComputeStoredGenerated. It could matter in the case of a virtual generated column, but it appears that that works already:

regression=# CREATE TABLE x1(a int, b int GENERATED ALWAYS AS (a * 2) virtual, c text GENERATED ALWAYS AS ('1') stored );
CREATE TABLE
regression=# insert into x1 values (11);
INSERT 0 1
regression=# ALTER TABLE x1 alter column b set data type text collate "C";
ALTER TABLE
regression=# select pg_collation_for(b) from x1;
 pg_collation_for
------------------
 "C"
(1 row)

regression=# ALTER TABLE x1 alter column b set data type text collate "POSIX";
ALTER TABLE
regression=# select pg_collation_for(b) from x1;
 pg_collation_for
------------------
 "POSIX"
(1 row)

(It looks like the reason that works is that build_generation_expression inserts the necessary coercion.)

So I don't see a bug here. If you want to claim that this is a bug deserving of being an open item, I think you need to demonstrate some observable misbehavior. If you want to say it'd be cleaner to fix the stored expression and get rid of the extra step in build_generation_expression, I'd probably agree, but that seems like cleanup that could wait for v19. It's certainly not a bug affecting any stable branches.

regards, tom lane
Re: n_ins_since_vacuum stats for aborted transactions
> I can see it being confusing. What about this wording to make it more > clear when the field is > updated? here are both of the changes in v4. -- Sami Imseih Amazon Web Services (AWS) v4-0001-Clarify-when-aborted-rows-are-tracked-for-tuple-r.patch Description: Binary data
Re: disabled SSL log_like tests
Andrew Dunstan writes:
> On 2025-04-17 Th 10:56 AM, Tom Lane wrote:
>> However, I wonder whether Andres' work at 8b886a4e3 could be used
>> to improve this, either directly or as inspiration?

> I don't think so - these tests are about checking log file contents, not
> a psql problem.

Well, I was mainly leaning on the "inspiration" part: the idea is to wait for something that must come out after the text we want to look for. After digging around a bit I noticed that 002_connection_limits.pl's connect_fails_wait is doing something that's almost what we want: that test cranks log_min_messages up to DEBUG2 and then waits for the postmaster's report of backend exit before believing that it's done. A while later I had the attached patch. Some notes:

* The commented-out tests in 001_ssltests.pl contained hard-wired expectations as to certificate serial numbers, which are obsolete now. I just replaced them with "\d+", but if anyone feels like that's not good enough, we could continue to check for exact serial numbers and eat the ensuing maintenance effort.

* I was more than slightly surprised to find that there are a bunch of other connect_fails callers that are testing log_like or log_unlike and thereby risking the same type of race condition. Some of those tests are relatively new and perhaps just haven't failed *yet*, but I wonder if we changed something since 2022 that solves this problem in a different way? Anyway, after this change any such caller must set log_min_messages = debug2 or fail. I think I got all the relevant test scripts in the attached.

regards, tom lane

diff --git a/src/test/authentication/t/001_password.pl b/src/test/authentication/t/001_password.pl
index 756b4146050..37d96d95a1a 100644
--- a/src/test/authentication/t/001_password.pl
+++ b/src/test/authentication/t/001_password.pl
@@ -66,6 +66,8 @@ sub test_conn
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 # Test behavior of log_connections GUC
diff --git a/src/test/authentication/t/003_peer.pl b/src/test/authentication/t/003_peer.pl
index 69ba73bd2b9..2879800eacf 100644
--- a/src/test/authentication/t/003_peer.pl
+++ b/src/test/authentication/t/003_peer.pl
@@ -72,6 +72,8 @@ sub test_role
 my $node = PostgreSQL::Test::Cluster->new('node');
 $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 # Set pg_hba.conf with the peer authentication.
diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index 6748b109dec..2dc6bec9b89 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -66,6 +66,7 @@ $node->append_conf(
 listen_addresses = '$hostaddr'
 krb_server_keyfile = '$krb->{keytab}'
 log_connections = on
+log_min_messages = debug2
 lc_messages = 'C'
 });
 $node->start;
diff --git a/src/test/ldap/t/001_auth.pl b/src/test/ldap/t/001_auth.pl
index 352b0fc1fa7..d1315ed5351 100644
--- a/src/test/ldap/t/001_auth.pl
+++ b/src/test/ldap/t/001_auth.pl
@@ -48,6 +48,8 @@ note "setting up PostgreSQL instance";
 my $node = PostgreSQL::Test::Cluster->new('node');
 $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 $node->safe_psql('postgres', 'CREATE USER test0;');
diff --git a/src/test/modules/oauth_validator/t/001_server.pl b/src/test/modules/oauth_validator/t/001_server.pl
index d88994abc24..4f035417a40 100644
--- a/src/test/modules/oauth_validator/t/001_server.pl
+++ b/src/test/modules/oauth_validator/t/001_server.pl
@@ -48,6 +48,8 @@
 $node->init;
 $node->append_conf('postgresql.conf', "log_connections = on\n");
 $node->append_conf('postgresql.conf', "oauth_validator_libraries = 'validator'\n");
+# Needed to allow connect_fails to inspect postmaster log:
+$node->append_conf('postgresql.conf', "log_min_messages = debug2");
 $node->start;
 
 $node->safe_psql('postgres', 'CREATE USER test;');
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 8759ed2cbba..7877cc8d994 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2624,7 +2624,12 @@ If this regular expression is set, matches it with the output generated.
 
 =item log_unlike => [ qr/prohibited message/ ]
 
-See C.
+See C. CAUTION: use of either option requires that
+the server's log_min_messages be at least DEBUG2, and that no other
+client backend is launched concurrently. These requirements allow
+C to wait to see the postmaster-l
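For reference, the sort of caller this affects looks roughly like the following (schematic, not copied verbatim from any one test); with this change, such a test must also run the server at log_min_messages = debug2:

```perl
# Schematic example of an affected caller (not verbatim from any one test).
# connect_fails with log_like/log_unlike now requires the server to run at
# log_min_messages = debug2 so it can wait for the backend-exit report.
$node->append_conf('postgresql.conf', "log_min_messages = debug2");
$node->restart;

$node->connect_fails(
	"$common_connstr sslcert=ssl/client.crt",
	"intermediate client certificate is missing",
	log_like => [
		qr/Client certificate verification failed at depth 0: unable to get local issuer certificate/
	]);
```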
Re: ZStandard (with dictionaries) compression support for TOAST compression
On Tue, Apr 15, 2025 at 2:13 PM Nikhil Kumar Veldanda wrote: > Addressing Compressed Datum Leaks problem (via CTAS, INSERT INTO ... SELECT > ...) > > As compressed datums can be copied to other unrelated tables via CTAS, > INSERT INTO ... SELECT, or CREATE TABLE ... EXECUTE, I’ve introduced a > method inheritZstdDictionaryDependencies. This method is invoked at > the end of such statements and ensures that any dictionary > dependencies from source tables are copied to the destination table. > We determine the set of source tables using the relationOids field in > PlannedStmt. With the disclaimer that I haven't opened the patch or thought terribly deeply about this issue, at least not yet, my fairly strong suspicion is that this design is not going to work out, for multiple reasons. In no particular order: 1. I don't think users will like it if dependencies on a zstd dictionary spread like kudzu across all of their tables. I don't think they'd like it even if it were 100% accurate, but presumably this is going to add dependencies any time there MIGHT be a real dependency rather than only when there actually is one. 2. Inserting into a table or updating it only takes RowExclusiveLock, which is not even self-exclusive. I doubt that it's possible to change system catalogs in a concurrency-safe way with such a weak lock. For instance, if two sessions tried to do the same thing in concurrent transactions, they could both try to add the same dependency at the same time. 3. I'm not sure that CTAS, INSERT INTO...SELECT, and CREATE TABLE...EXECUTE are the only ways that datums can creep from one table into another. For example, what if I create a plpgsql function that gets a value from one table and stores it in a variable, and then use that variable to drive an INSERT into another table? I seem to recall there are complex cases involving records and range types and arrays, too, where the compressed object gets wrapped inside of another object; though maybe that wouldn't matter to your implementation if INSERT INTO ... SELECT uses a sufficiently aggressive strategy for adding dependencies. When Dilip and I were working on lz4 TOAST compression, my first instinct was to not let LZ4-compressed datums leak out of a table by forcing them to be decompressed (and then possibly recompressed). We spent a long time trying to make that work before giving up. I think this is approximately where things started to unravel, and I'd suggest you read both this message and some of the discussion before and after: https://www.postgresql.org/message-id/20210316185455.5gp3c5zvvvq66...@alap3.anarazel.de I think we could add plain-old zstd compression without really tackling this issue, but if we are going to add dictionaries then I think we might need to revisit the idea of preventing things from leaking out of tables. What I can't quite remember at the moment is how much of the problem was that it was going to be slow to force the recompression, and how much of it was that we weren't sure we could even find all the places in the code that might need such handling. I'm now also curious to know whether Andres would agree that it's bad if zstd dictionaries are un-droppable. After all, I thought it would be bad if there was no way to eliminate a dependency on a compression method, and he disagreed. So maybe he would also think undroppable dictionaries are fine. But maybe not. 
It seems even worse to me than undroppable compression methods, because you'll probably not have that many compression methods ever, but you could have a large number of dictionaries eventually. -- Robert Haas EDB: http://www.enterprisedb.com
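P.S. To make the leak mechanics concrete, here is what already happens today with the existing compression methods (a sketch; a dictionary-compressed zstd datum under the proposed patch would travel the same way):

```sql
-- Sketch: a compressed datum escapes its source table as-is, which is why
-- the dictionary-dependency question arises.  Uses lz4 because it exists
-- today; dictionary-compressed zstd datums would be copied the same way.
CREATE TABLE src (payload text COMPRESSION lz4);
INSERT INTO src VALUES (repeat('abcdefgh', 100000));

CREATE TABLE dst AS SELECT payload FROM src;      -- datum is not recompressed

SELECT pg_column_compression(payload) FROM dst;   -- still reports lz4
```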
Re: NUMA shared memory interleaving
Hi, On Wed, Apr 16, 2025 at 10:05:04AM -0400, Robert Haas wrote: > On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak > wrote: > > Normal pgbench workloads tend to be not affected, as each backend > > tends to touch just a small partition of shm (thanks to BAS > > strategies). Some remaining questions are: > > 1. How to name this GUC (numa or numa_shm_interleave) ? I prefer the > > first option, as we could potentially in future add more optimizations > > behind that GUC. > > I wonder whether the GUC needs to support interleaving between a > designated set of nodes rather than only being able to do all nodes. > For example, suppose someone is pinning the processes to a certain set > of NUMA nodes; perhaps then they wouldn't want to use memory from > other nodes. +1. That could be used for instances consolidation on the same host. One could ensure that numa nodes are not shared across instances (cpu and memory resource isolation per instance). Bonus point, adding Direct I/O into the game would ensure that the OS page cache is not shared too. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
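P.S. To sketch what interleaving over a designated node set could look like at the libnuma level (illustration only, not the proposed patch):

```c
/*
 * Illustration only (not the proposed patch): apply an interleave policy to
 * an already-mapped shared memory region, restricted to an explicit set of
 * NUMA nodes (here 0 and 1) rather than all of them.  Link with -lnuma.
 */
#include <numa.h>

static void
interleave_on_nodes(void *addr, size_t size)
{
	if (numa_available() < 0)
		return;					/* no NUMA support on this system */

	struct bitmask *nodes = numa_allocate_nodemask();

	numa_bitmask_setbit(nodes, 0);
	numa_bitmask_setbit(nodes, 1);

	numa_interleave_memory(addr, size, nodes);

	numa_free_nodemask(nodes);
}
```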
Re: NUMA shared memory interleaving
Hi, On Thu, Apr 17, 2025 at 01:58:44AM +1200, Thomas Munro wrote: > On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak > wrote: > > 2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an > > expert on DSA/DSM at all) > > I have no answers but I have speculated for years about a very > specific case (without any idea where to begin due to lack of ... I > guess all this sort of stuff): in ExecParallelHashJoinNewBatch(), > workers split up and try to work on different batches on their own to > minimise contention, and when that's not possible (more workers than > batches, or finishing their existing work at different times and going > to help others), they just proceed in round-robin order. A beginner > thought is: if you're going to help someone working on a hash table, > it would surely be best to have the CPUs and all the data on the same > NUMA node. During loading, cache line ping pong would be cheaper, and > during probing, it *might* be easier to tune explicit memory prefetch > timing that way as it would look more like a single node system with a > fixed latency, IDK (I've shared patches for prefetching before that > showed pretty decent speedups, and the lack of that feature is > probably a bigger problem than any of this stuff, who knows...). > Another beginner thought is that the DSA allocator is a source of > contention during loading: the dumbest problem is that the chunks are > just too small, but it might also be interesting to look into per-node > pools. Or something. IDK, just some thoughts... I'm also thinking that could be beneficial for parallel workers. I think the ideal scenario would be to have the parallel workers spread across numa nodes and accessing their "local" memory first (and help with "remote" memory access if there is still more work to do "remotely"). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: pending_since assertion failure on skink
Hi, On Sat, Apr 12, 2025 at 10:33:33AM -0400, Andres Freund wrote: > I suspect that this is related to > > commit 039549d70f6aa2daa3714a13752a08fa8ca2fb05 +1 > We might just have to give up on that assertion, I guess? Probably. I'll have a look at it next week. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com