Re: meson: Specify -Wformat as a common warning flag for extensions
On 29.05.24 08:47, Sutou Kouhei wrote:
In <4707d4ed-f268-43c0-b4dd-cdbc7520f...@eisentraut.org>
  "Re: meson: Specify -Wformat as a common warning flag for extensions" on Tue, 28 May 2024 23:31:05 -0700,
  Peter Eisentraut wrote:
On 07.04.24 18:01, Sutou Kouhei wrote:

+# We don't have "warning_level == 3" and "warning_level ==
+# 'everything'" here because we don't use these warning levels.
+if warning_level == '1'
+  common_builtin_flags += ['-Wall']
+elif warning_level == '2'
+  common_builtin_flags += ['-Wall', '-Wextra']
+endif

I would trim this even further and always export just '-Wall'. The other options aren't really something we support.

OK. How about the v6 patch? It always uses '-Wall'.

I have committed this. Thanks.
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On 2024-Jun-07, Thomas Munro wrote:

> static void
> -ZeroBuffer(Buffer buffer, ReadBufferMode mode)
> +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)

This change makes the API very strange. Should the function be called ZeroAndLockBuffer() instead? Then the addition of a "bool zero" argument makes a lot more sense.

In passing, I noticed that WaitReadBuffers has zero comments, which seems an insufficient number of them.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
using __func__ to locate and distinguish some error messages
hi.
we have 450 appearances of `cache lookup failed .*`
we have 141 appearances of `could not open file .*`
so when it actually happens, it cannot quickly locate which function where the error has happened.
maybe under certain conditions (e.g. certain build type or certain log_min_messages), we can also print out the function name by using gcc __func__.

or we can just do like:

    if (!HeapTupleIsValid(tuple))
        elog(ERROR, "cache lookup failed for relation %u %s",
             RelationGetRelid(rel), __func__);

given that these errors are very unlikely to happen, if one does happen, printing out the function name seems not that invasive?
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent wrote: > > On Fri, 7 Jun 2024 at 07:18, Dilip Kumar wrote: > > > > On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent > > wrote: > >> > >> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela wrote: > >>> > >>> Why not use it too, if not binary_upgrade? > >> > >> Because in the normal case (not during binary_upgrade) you don't want > >> to have to generate 2 checkpoints for every created database, > >> especially not when your shared buffers are large. Checkpoints' costs > >> scale approximately linearly with the size of shared buffers, so being > >> able to skip those checkpoints (with strategy=WAL_LOG) will save a lot > >> of performance in the systems where this performance impact matters > >> most. > > > > I agree with you that we introduced the WAL_LOG strategy to avoid > > these force checkpoints. However, in binary upgrade cases where no > > operations are happening in the system, the FILE_COPY strategy should > > be faster. > > While you would be correct if there were no operations happening in > the system, during binary upgrade we're still actively modifying > catalogs; and this is done with potentially many concurrent jobs. I > think it's not unlikely that this would impact performance. Maybe, but generally, long checkpoints are problematic because they involve a lot of I/O, which hampers overall system performance. However, in the case of a binary upgrade, the concurrent operations are only performing a schema restore, not a real data restore. Therefore, it shouldn't have a significant impact, and the checkpoints should also not do a lot of I/O during binary upgrade, right? > Now that I think about it, arguably, we shouldn't need to run > checkpoints during binary upgrade for the FILE_COPY strategy after > we've restored the template1 database and created a checkpoint after > that: All other databases use template1 as their template database, > and the checkpoint is there mostly to guarantee the FS knows about all > changes in the template database before we task it with copying the > template database over to our new database, so the protections we get > from more checkpoints are practically useless. > If such a change were implemented (i.e. no checkpoints for FILE_COPY > in binary upgrade, with a single manual checkpoint after restoring > template1 in create_new_objects) I think most of my concerns with this > patch would be alleviated. Yeah, I think that's a valid point. The second checkpoint is to ensure that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for binary upgrades, we don't need that guarantee because a checkpoint will be performed during shutdown at the end of the upgrade anyway. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
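For readers following along, these are the two strategies exposed by CREATE DATABASE's STRATEGY clause; a minimal sketch with made-up database names:

    -- WAL_LOG (the default since PostgreSQL 15): copies the template through
    -- shared buffers and WAL-logs every block, so no extra checkpoints are needed.
    CREATE DATABASE app_db TEMPLATE template1 STRATEGY = WAL_LOG;

    -- FILE_COPY: copies the template at the filesystem level; the server issues
    -- checkpoints around the copy, which is the cost being discussed here.
    CREATE DATABASE app_db2 TEMPLATE template1 STRATEGY = FILE_COPY;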
Re: using __func__ to locate and distinguish some error messages
On 2024-Jun-07, jian he wrote: > so when it actually happens, it cannot quickly locate which function > where the error has happened. > maybe under certain conditions (e.g. certain build type or certain > log_min_messages), > we can also print out the function name by using gcc __func__. That information is already in the error data, so you don't need it in the message text. You can change your log_error_verbosity if you want it to show up in the log; in psql you can use \errverbose to have it shown to you after the error is thrown, or you can use \pset VERBOSITY verbose to have it printed for every error message. Tools other than psql would need their own specific ways to display those. -- Álvaro Herrera — Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Hay que recordar que la existencia en el cosmos, y particularmente la elaboración de civilizaciones dentro de él no son, por desgracia, nada idílicas" (Ijon Tichy)
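As a concrete illustration of the above, the function name is part of every error report and psql can be told to display it; the LOCATION line is what carries it (the line number is shown as NNN because it varies by build):

    \pset VERBOSITY verbose
    SELECT 1/0;
    ERROR:  22012: division by zero
    LOCATION:  int4div, int.c:NNN

With the default verbosity, running \errverbose right after the failed statement shows the same details.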
Re: Bug: PGTYPEStimestamp_from_asc() in ECPG pgtypelib
# I'm sorry for my late response.

I confirmed that the regression failure is caused by my code inserting setlocale() into ecpglib on my local branch. No other tests fail in a non-C locale. The following is about other topics.

1. About the regression test

We should test the following:
- PGTYPEStimestamp_from_asc("1994-02-11 26:10:35", NULL) returns 0.
- PGTYPEStimestamp_fmt_asc() can accept a format string including %x and %X.

ecpglib should be affected only by setlocale() called by the user application, and dt_test.pgc does not call it. So the following test is the best, I think. Please see the attached patch for details (fix_pgtypeslib_regress.patch).

    ts1 = PGTYPEStimestamp_from_asc("1994-02-11 3:10:35", NULL);
    text = PGTYPEStimestamp_to_asc(ts1);
    printf("timestamp_to_asc2: %s\n", text);
    PGTYPESchar_free(text);

    /* abc-03:10:35-def-02/11/94-gh */
    /* 12345678901234567890123456789 */
    out = (char *) malloc(32);
    i = PGTYPEStimestamp_fmt_asc(&ts1, out, 31, "abc-%X-def-%x-ghi%%");
    printf("timestamp_fmt_asc: %d: %s\n", i, out);
    free(out);

    ts1 = PGTYPEStimestamp_from_asc("1994-02-11 26:10:35", NULL);
    text = PGTYPEStimestamp_to_asc(ts1);
    printf("timestamp_to_asc3: %s\n", text);
    PGTYPESchar_free(text);

We should also add tests that check that the PGTYPEStimestamp_*() functions set errno correctly for invalid input, but I want to leave that improvement for the next time the timestamp implementation is changed. (Maybe that time will never come.)

2. About the documentation of PGTYPEStimestamp_from_asc() and PGTYPESInvalidTimestamp

0 returned by PGTYPEStimestamp_from_asc() is a valid timestamp, as you commented, and we should not break compatibility. So we should remove the documentation for PGTYPESInvalidTimestamp and add a note about checking errno to the description of PGTYPEStimestamp_from_asc(). Please see the attached patch for details (fix_PGTYPESInvalidTimestamp_doc.patch).

3. About endptr of *_from_asc()

- PGTYPESdate_from_asc() (ParseDate)
- PGTYPEStimestamp_from_asc() (ParseDate)
- PGTYPESinterval_from_asc() (ParseDate)
- PGTYPESnumeric_from_asc()

Basically, they return immediately after detecting an invalid format. However, even after passing that initial parse, they can still fail (e.g. on failure of DecodeInterval(), DecodeISO8601Interval(), malloc(), and so on). So we should write something like the following: If the function detects an invalid format, it stores the address of the first invalid character in endptr. However, don't assume the call succeeded just because endptr points to the end of the input, because other processing (e.g. memory allocation) can fail. Therefore, you should check the return value and errno to detect errors. You can safely set endptr to NULL.

I also opened a Pandora's box: the descriptions of the following functions don't show their behavior on failure. I fixed the docs for them as well. Please see the attached patch (fix_pgtypeslib_funcs_docs.patch).

- PGTYPESdate_from_asc()        # sets errno (cannot check the return value)
- PGTYPESdate_defmt_asc()       # returns -1 and sets errno
- PGTYPEStimestamp_to_asc()     # returns NULL and sets errno
- PGTYPEStimestamp_defmt_asc()  # just returns 1 and doesn't set errno!
- PGTYPESinterval_new()         # returns NULL and sets errno
- PGTYPESinterval_from_asc()    # returns NULL and sets errno
- PGTYPESinterval_to_asc()      # returns NULL and sets errno
- PGTYPESinterval_copy()        # currently always returns 0
- PGTYPESdecimal_new()          # returns NULL and sets errno

4. Bug in PGTYPEStimestamp_defmt_asc()

PGTYPEStimestamp_defmt_asc() doesn't set errno on failure. I haven't made a patch for it yet.
Best Regards
Ryo Matsumura

Attachments: fix_pgtypeslib_regress.patch, fix_pgtypeslib_funcs_docs.patch, fix_PGTYPESInvalidTimestamp_doc.patch
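To illustrate the calling pattern that the proposed documentation change describes (check the return value and errno rather than relying on endptr alone), here is a minimal sketch in plain C; it only assumes the documented pgtypes API, the ecpg/pgtypes include path being set up, and the variable names are arbitrary:

    #include <stdio.h>
    #include <errno.h>
    #include <pgtypes_timestamp.h>

    static void
    parse_one(char *input)
    {
        char       *endptr = NULL;
        timestamp   ts;
        char       *text;

        errno = 0;
        ts = PGTYPEStimestamp_from_asc(input, &endptr);
        /* Do not rely on endptr alone: check errno (and the return value). */
        if (errno != 0)
        {
            fprintf(stderr, "invalid timestamp \"%s\" (errno %d)\n", input, errno);
            return;
        }
        text = PGTYPEStimestamp_to_asc(ts);
        printf("parsed: %s\n", text);
        PGTYPESchar_free(text);
    }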
Re: using __func__ to locate and distinguish some error messages
On Fri, Jun 7, 2024 at 4:28 PM Alvaro Herrera wrote: > > On 2024-Jun-07, jian he wrote: > > > so when it actually happens, it cannot quickly locate which function > > where the error has happened. > > maybe under certain conditions (e.g. certain build type or certain > > log_min_messages), > > we can also print out the function name by using gcc __func__. > > That information is already in the error data, so you don't need it in > the message text. You can change your log_error_verbosity if you want > it to show up in the log; in psql you can use \errverbose to have it > shown to you after the error is thrown, or you can use > \pset VERBOSITY verbose > to have it printed for every error message. Tools other than psql would > need their own specific ways to display those. > Thanks for pointing this out.
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 7:54 PM Alvaro Herrera wrote: > > On 2024-Jun-06, Amit Kapila wrote: > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > > > When the content of a large transaction (size exceeding > > > logical_decoding_work_mem) and its sub-transactions has to be > > > reordered during logical decoding, then, all the changes are written > > > on disk in temporary files located in pg_replslot/. > > > Decoding very large transactions by multiple replication slots can > > > lead to disk space saturation and high I/O utilization. > > I like the general idea of compressing the output of logical decoding. > It's not so clear to me that we only want to do so for spilling to disk; > for instance, if the two nodes communicate over a slow network, it may > even be beneficial to compress when streaming, so to this question: > > > Why can't one use 'streaming' option to send changes to the client > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > I would say that streaming doesn't necessarily have to mean we don't > want compression, because for some users it might be beneficial. +1 > I think a GUC would be a good idea. Also, what if for whatever reason > you want a different compression algorithm or different compression > parameters? Looking at the existing compression UI we offer in > pg_basebackup, perhaps you could add something like this: > > compress_logical_decoding = none > compress_logical_decoding = lz4:42 > compress_logical_decoding = spill-zstd:99 > > "none" says to never use compression (perhaps should be the default), > "lz4:42" says to use lz4 with parameters 42 on both spilling and > streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but > only for spilling to disk. > I think the compression option should be supported at the CREATE SUBSCRIPTION level instead of being controlled by a GUC. This way, we can decide on compression for each subscription individually rather than applying it to all subscribers. It makes more sense for the subscriber to control this, especially when we are planning to compress the data sent downstream. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
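If the subscription-level route is taken, the user-facing syntax could end up looking something like the sketch below; note that no such option exists today, and both the option name and its values are hypothetical placeholders for the idea being discussed:

    -- hypothetical syntax, not implemented
    CREATE SUBSCRIPTION sub_compressed
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION pub_all
        WITH (spill_compression = 'lz4');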
Re: Avoid orphaned objects dependencies, take 3
Hi, On Thu, Jun 06, 2024 at 04:00:23PM -0400, Robert Haas wrote: > On Thu, Jun 6, 2024 at 1:56 AM Bertrand Drouvot > wrote: > > v9 is more invasive (as it changes code in much more places) than v8 but it > > is > > easier to follow (as it is now clear where the new lock is acquired). > > Hmm, this definitely isn't what I had in mind. Possibly that's a sign > that what I had in mind was dumb, but for sure it's not what I > imagined. What I thought you were going to do was add calls like > LockDatabaseObject(NamespaceRelationId, schemaid, 0, AccessShareLock) > in various places, or perhaps LockRelationOid(reloid, > AccessShareLock), or whatever the case may be. I see what you’re saying, doing things like: LockDatabaseObject(TypeRelationId, returnType, 0, AccessShareLock); in ProcedureCreate() for example. > Here you've got stuff > like this: > > - record_object_address_dependencies(&conobject, addrs_auto, > -DEPENDENCY_AUTO); > + lock_record_object_address_dependencies(&conobject, addrs_auto, > + DEPENDENCY_AUTO); > > ...which to me looks like the locking is still pushed down inside the > dependency code. Yes, but it’s now located in places where, I think, it’s easier to understand what’s going on (as compared to v8), except maybe for: recordDependencyOnExpr() makeOperatorDependencies() GenerateTypeDependencies() makeParserDependencies() makeDictionaryDependencies() makeTSTemplateDependencies() makeConfigurationDependencies() but probably for: heap_create_with_catalog() StorePartitionKey() index_create() AggregateCreate() CastCreate() CreateConstraintEntry() ProcedureCreate() RangeCreate() InsertExtensionTuple() CreateTransform() CreateProceduralLanguage() The reasons I keep it linked to the dependency code are: - To ensure we don’t miss anything (well, with the new Assert in place that’s probably a tangential argument) - It’s not only about locking the object: it’s also about 1) verifying the object is pinned, 2) checking it still exists and 3) providing a description in the error message if we can (in case the object does not exist anymore). Relying on an already-built object (in the dependency code) avoids 1) defining the object(s) one more time or 2) creating new functions that would do the same as isObjectPinned() and getObjectDescription() with a different set of arguments. That may sound like weak arguments, but it has been my reasoning. Do you still find the code hard to maintain with v9? > > And you also have stuff like this: > > ObjectAddressSet(referenced, RelationRelationId, childTableId); > + depLockAndCheckObject(&referenced); > recordDependencyOn(&depender, &referenced, DEPENDENCY_PARTITION_SEC); > > But in depLockAndCheckObject you have: > > + if (object->classId == RelationRelationId || object->classId == > AuthMemRelationId) > + return; > > That doesn't seem right, because then it seems like the call isn't > doing anything, but there isn't really any reason for it to not be > doing anything. If we're dropping a dependency on a table, then it > seems like we need to have a lock on that table. Presumably the reason > why we don't end up with dangling dependencies in such cases now is > because we're careful about doing LockRelation() in the right places, Yeah, that's what I think: we're already careful when we deal with relations. > but we're not similarly careful about other operations e.g. > ConstraintSetParentConstraint is called by DefineIndex which calls > table_open(childRelId, ...) first, but there's no logic in DefineIndex > to lock the constraint.
table_open(childRelId, ...) would lock any "ALTER TABLE DROP CONSTRAINT" already. Not sure I understand your concern here. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Add support to TLS 1.3 cipher suites and curves lists
On 07.06.24 08:10, Erica Zhang wrote:
> I’m a Postgres user and I’m looking into restricting the set of allowed
> ciphers on Postgres and configuring a concrete set of curves on our
> postgres instances.

Out of curiosity, why is this needed in practice?

> Could you please help to review to see if you are interested in having
> this change in the upcoming Postgres major release (it should be PG17)?

It would be targeting PG18 now.
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
Hello Thomas, 07.06.2024 09:06, Thomas Munro wrote: On Fri, Jun 7, 2024 at 3:06 PM Thomas Munro wrote: On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin wrote: My bisect run ended with: 210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit Author: Thomas Munro Date: Wed Apr 3 00:03:08 2024 +1300 Provide vectored variant of ReadBuffer(). Other buildfarm failures with this Assert I could find kind of confirm this: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18 (presumably a first failure of this sort) https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08 Looking... What Noah described[1] is what should be happening already, I think, but 210622c6 unconditionally zeroed the page. Oops. The attached seems to cure his repro for me. Does it also cure your test? I couldn't see that variant myself for some reason, but it seems to make sense as the explanation. I would probably adjust the function name or perhaps consider refactoring slightly, but first let's confirm that this is the same issue and fix. Thank you for looking and for the fix! Using the same testing procedure (applying patch for checking lpp, multiplying 026_overwrite_contrecord.pl tests and running 30 tests in parallel, with fsync=on) which I used for bisecting, I got failures on iterations 8, 19, 4 without the fix, but with the fix applied, 125 iterations passed. I think The Cure is sound. Best regards, Alexander
Re: relfilenode statistics
Hi, On Thu, Jun 06, 2024 at 08:38:06PM -0700, Andres Freund wrote: > Hi, > > On 2024-06-03 11:11:46 +, Bertrand Drouvot wrote: > > The main argument is that we currently don’t have writes counters for > > relations. > > The reason is that we don’t have the relation OID when writing buffers out. > > Tracking writes per relfilenode would allow us to track/consolidate writes > > per > > relation (example in the v1 patch and in the message up-thread). > > > > I think that adding instrumentation in this area (writes counters) could be > > beneficial (like it is for the ones we currently have for reads). > > > > Second argument is that this is also beneficial for the "Split index and > > table statistics into different types of stats" thread (mentioned in the > > previous > > message). It would allow us to avoid additional branches in some situations > > (like > > the one mentioned by Andres in the link I provided up-thread). > > I think there's another *very* significant benefit: > > Right now physical replication doesn't populate statistics fields like > n_dead_tup, which can be a huge issue after failovers, because there's little > information about what autovacuum needs to do. > > Auto-analyze *partially* can fix it at times, if it's lucky enough to see > enough dead tuples - but that's not a given and even if it works, is often > wildly inaccurate. > > > Once we put things like n_dead_tup into per-relfilenode stats, Hm - I had in mind to populate relfilenode stats only with stats that are somehow related to I/O activities. Which ones do you have in mind to put in relfilenode stats? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
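For context, the gap being described shows up directly in the cumulative statistics views; right after a failover a query like the one below will typically report zero dead tuples even for heavily updated tables, which is what leaves autovacuum without guidance:

    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;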
Re: Synchronizing slots from primary to standby
On Fri, Jun 7, 2024 at 7:57 AM Zhijie Hou (Fujitsu) wrote: > > Thanks for the comments! Here is the V6 patch that addressed the these. > I have pushed this after making minor changes in the wording. I have also changed one of the queries in docs to ignore the NULL slot_name values. -- With Regards, Amit Kapila.
Re: Compress ReorderBuffer spill files using LZ4
On 2024-Jun-07, Dilip Kumar wrote: > I think the compression option should be supported at the CREATE > SUBSCRIPTION level instead of being controlled by a GUC. This way, we > can decide on compression for each subscription individually rather > than applying it to all subscribers. It makes more sense for the > subscriber to control this, especially when we are planning to > compress the data sent downstream. True. (I think we have some options that are in GUCs for the general behavior and can be overridden by per-subscription options for specific tailoring; would that make sense here? I think it does, considering that what we mostly want is to save disk space in the publisher when spilling to disk.) -- Álvaro Herrera — Breisgau, Deutschland — https://www.EnterpriseDB.com/ "I can't go to a restaurant and order food because I keep looking at the fonts on the menu. Five minutes later I realize that it's also talking about food" (Donald Knuth)
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Fri, 7 Jun 2024 at 10:28, Dilip Kumar wrote: > > On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent > wrote: >> >> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar wrote: >>> >>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent >>> wrote: >>> >>> I agree with you that we introduced the WAL_LOG strategy to avoid >>> these force checkpoints. However, in binary upgrade cases where no >>> operations are happening in the system, the FILE_COPY strategy should >>> be faster. >> >> While you would be correct if there were no operations happening in >> the system, during binary upgrade we're still actively modifying >> catalogs; and this is done with potentially many concurrent jobs. I >> think it's not unlikely that this would impact performance. > > Maybe, but generally, long checkpoints are problematic because they > involve a lot of I/O, which hampers overall system performance. > However, in the case of a binary upgrade, the concurrent operations > are only performing a schema restore, not a real data restore. > Therefore, it shouldn't have a significant impact, and the checkpoints > should also not do a lot of I/O during binary upgrade, right? My primary concern isn't the IO, but the O(shared_buffers) that we have to go through during a checkpoint. As I mentioned upthread, it is reasonably possible the new cluster is already setup with a good fraction of the old system's shared_buffers configured. Every checkpoint has to scan all those buffers, which IMV can get (much) more expensive than the IO overhead caused by the WAL_LOG strategy. It may be a baseless fear as I haven't done the performance benchmarks for this, but I wouldn't be surprised if shared_buffers=8GB would measurably impact the upgrade performance in the current patch (vs the default 128MB). I'll note that the documentation for upgrading with pg_upgrade has the step for updating postgresql.conf / postgresql.auto.conf only after pg_upgrade has run already, but that may not be how it's actually used: after all, we don't have full control in this process, the user is the one who provides the new cluster with initdb. >> If such a change were implemented (i.e. no checkpoints for FILE_COPY >> in binary upgrade, with a single manual checkpoint after restoring >> template1 in create_new_objects) I think most of my concerns with this >> patch would be alleviated. > > Yeah, I think that's a valid point. The second checkpoint is to ensure > that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for > binary upgrades, we don't need that guarantee because a checkpoint > will be performed during shutdown at the end of the upgrade anyway. Indeed. Kind regards, Matthias van de Meent Neon (https://neon.tech)
XACT_EVENT for 'commit prepared'
Hi hackers, I found that in enum XactEvent there is 'XACT_EVENT_PREPARE' for 'prepare transaction', but there is no event for 'commit prepared' or 'rollback prepared'.

For the following SQL:

    begin;
    create table test(a int);
    PREPARE TRANSACTION 'foo';
    rollback prepared 'foo';

When executing 'rollback prepared 'foo';', I expected to get 'XACT_EVENT_ABORT', but actually the event type is 'XACT_EVENT_COMMIT'. I think XACT_EVENT_COMMIT_PREPARED and XACT_EVENT_ROLLBACK_PREPARED could be added in the function 'FinishPreparedTransaction'. I'm confused about why there are no related events for them.
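For reference, these events reach extensions through the transaction callback API; below is a rough sketch of a callback that would benefit from the proposed events (XACT_EVENT_COMMIT_PREPARED and XACT_EVENT_ROLLBACK_PREPARED are only the suggested additions and do not exist in core today):

    #include "postgres.h"
    #include "access/xact.h"

    static void
    my_xact_callback(XactEvent event, void *arg)
    {
        switch (event)
        {
            case XACT_EVENT_PREPARE:
                elog(LOG, "prepare transaction");
                break;
            case XACT_EVENT_COMMIT:
                /* per the report above, this is also what fires today when
                 * COMMIT PREPARED / ROLLBACK PREPARED is executed */
                elog(LOG, "commit");
                break;
            case XACT_EVENT_ABORT:
                elog(LOG, "abort");
                break;
            /* proposed additions, not in core today:
             * case XACT_EVENT_COMMIT_PREPARED:
             * case XACT_EVENT_ROLLBACK_PREPARED:
             */
            default:
                break;
        }
    }

    void
    _PG_init(void)
    {
        RegisterXactCallback(my_xact_callback, NULL);
    }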
Re: Compress ReorderBuffer spill files using LZ4
On Fri, Jun 7, 2024 at 2:39 PM Alvaro Herrera wrote: > > On 2024-Jun-07, Dilip Kumar wrote: > > > I think the compression option should be supported at the CREATE > > SUBSCRIPTION level instead of being controlled by a GUC. This way, we > > can decide on compression for each subscription individually rather > > than applying it to all subscribers. It makes more sense for the > > subscriber to control this, especially when we are planning to > > compress the data sent downstream. > > True. (I think we have some options that are in GUCs for the general > behavior and can be overridden by per-subscription options for specific > tailoring; would that make sense here? I think it does, considering > that what we mostly want is to save disk space in the publisher when > spilling to disk.) Yeah, that makes sense. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
On Fri, Jun 7, 2024 at 2:40 PM Matthias van de Meent wrote: > > On Fri, 7 Jun 2024 at 10:28, Dilip Kumar wrote: > > > > On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent > > wrote: > >> > >> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar wrote: > >>> > >>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent > >>> wrote: > >>> > >>> I agree with you that we introduced the WAL_LOG strategy to avoid > >>> these force checkpoints. However, in binary upgrade cases where no > >>> operations are happening in the system, the FILE_COPY strategy should > >>> be faster. > >> > >> While you would be correct if there were no operations happening in > >> the system, during binary upgrade we're still actively modifying > >> catalogs; and this is done with potentially many concurrent jobs. I > >> think it's not unlikely that this would impact performance. > > > > Maybe, but generally, long checkpoints are problematic because they > > involve a lot of I/O, which hampers overall system performance. > > However, in the case of a binary upgrade, the concurrent operations > > are only performing a schema restore, not a real data restore. > > Therefore, it shouldn't have a significant impact, and the checkpoints > > should also not do a lot of I/O during binary upgrade, right? > > My primary concern isn't the IO, but the O(shared_buffers) that we > have to go through during a checkpoint. As I mentioned upthread, it is > reasonably possible the new cluster is already setup with a good > fraction of the old system's shared_buffers configured. Every > checkpoint has to scan all those buffers, which IMV can get (much) > more expensive than the IO overhead caused by the WAL_LOG strategy. It > may be a baseless fear as I haven't done the performance benchmarks > for this, but I wouldn't be surprised if shared_buffers=8GB would > measurably impact the upgrade performance in the current patch (vs the > default 128MB). Okay, that's a valid point. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
RE: speed up a logical replica setup
Dear Euler,

Thanks for making the follow-up patch! I was looking forward to your updates. I think this patch set is the solution for the buildfarm error that was found. However, there are remaining concerns raised by others. You should reply with what you think about them. At least:

1) There are some misleading messages [1]. I think the v3-0005 patch can solve the issue.
2) pg_createsubscriber may fail if the primary has subscriptions [2]. IIUC the possible approaches are A) "keep subscriptions disabled at the end", B) "by default drop the pre-existing subscriptions", C) "do nothing, just document the risk".

> Before sending this email I realized that I did nothing about physical
> replication slots on the standby. I think we should also remove them too
> unconditionally.

I also thought about this, but it might be difficult to predict what users expect. Can we say for sure that it is not intentional? Regarding the failover slot, it is OK because that's meaningful only on the standby, but I am not sure about other slots. I personally think we can keep the current spec, but what do others think?

Below are comments for each patch.

0001
Basically LGTM. I was a bit confused because the default timeout is not set, but it seemed to follow the suggestion by Tomas [3].

0002
If you want to improve the commit message, please add that sync_replication_slots is disabled during the conversion.

0003
Confirmed it followed the discussion [4].

0004
Basically LGTM.

Other minor comments are included in the attached diff file. It contains changes to follow conventions and pgindent/pgperltidy.

[1]: https://www.postgresql.org/message-id/CAA4eK1J2fAvsJ2HihbWJ_GxETd6sdqSMrZdCVJEutRZRpm1MEQ%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/CANhcyEWvimA1-f6hSrA%3D9qkfR5SonFb56b36M%2B%2BvT%3DLiFj%3D76g%40mail.gmail.com
[3]: https://www.postgresql.org/message-id/5d5dd4cd-6359-4109-88e8-c8e13035ae16%40enterprisedb.com
[4]: https://www.postgresql.org/message-id/CAA4eK1LZxYxcbeiOn3Q5hjXVtZKhJWj-fQtndAeTCvZrPev8BA%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/

Attachment: minor_fix_by_kuroda.diff
Re: relfilenode statistics
Hi, On Thu, Jun 06, 2024 at 08:17:36PM -0700, Andres Freund wrote: > Hi, > > On 2024-06-06 12:27:49 -0400, Robert Haas wrote: > > On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot > > wrote: > > > I think we should keep the stats in the relation during relfilenode > > > changes. > > > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes > > > in > > > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you > > > can > > > see in the example provided up-thread the new heap_blks_written statistic > > > has > > > been preserved during the TRUNCATE. > > > > Yeah, I think there's something weird about this design. Somehow we're > > ending up with both per-relation and per-relfilenode counters: > > > > + pg_stat_get_blocks_written(C.oid) + > > pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN > > C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END, > > C.relfilenode) AS heap_blks_written, > > > > I'll defer to Andres if he thinks that's awesome, but to me it does > > not seem right to track some blocks written in a per-relation counter > > and others in a per-relfilenode counter. > > It doesn't immediately sound awesome. Nor really necessary? > > If we just want to keep prior stats upon a relation rewrite, we can just copy > the stats from the old relfilenode. Agree, that's another option. But I think that would be in another field like "cumulative_XXX" to ensure one could still retrieve stats that are "dedicated" to this particular "new" relfilenode. Thoughts? > Or we can decide that those stats don't > really make sense anymore, and start from scratch. > > > I *guess* I could see an occasional benefit in having both counters for "prior > relfilenodes" and "current relfilenode" - except that stats get reset manually > and upon crash anyway, making this less useful than if it were really > "lifetime" stats. Right, but currently they are not lost during a relation rewrite. If we decide not to keep the relfilenode stats during a rewrite, then a counter like heap_blks_read would stop surviving a rewrite (if we move it to relfilenode stats), while it currently does. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Compress ReorderBuffer spill files using LZ4
On Thu, Jun 6, 2024 at 7:54 PM Alvaro Herrera wrote: > > On 2024-Jun-06, Amit Kapila wrote: > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: > > > > > > When the content of a large transaction (size exceeding > > > logical_decoding_work_mem) and its sub-transactions has to be > > > reordered during logical decoding, then, all the changes are written > > > on disk in temporary files located in pg_replslot/. > > > Decoding very large transactions by multiple replication slots can > > > lead to disk space saturation and high I/O utilization. > > I like the general idea of compressing the output of logical decoding. > It's not so clear to me that we only want to do so for spilling to disk; > for instance, if the two nodes communicate over a slow network, it may > even be beneficial to compress when streaming, so to this question: > > > Why can't one use 'streaming' option to send changes to the client > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > I would say that streaming doesn't necessarily have to mean we don't > want compression, because for some users it might be beneficial. > Fair enough. it would be an interesting feature if we see the wider usefulness of compression/decompression of logical changes. For example, if this can improve the performance of applying large transactions (aka reduce the apply lag for them) even when the 'streaming' option is 'parallel' then it would have a much wider impact. -- With Regards, Amit Kapila.
Re: Compress ReorderBuffer spill files using LZ4
On Fri, Jun 7, 2024 at 2:08 PM Dilip Kumar wrote: > > I think the compression option should be supported at the CREATE > SUBSCRIPTION level instead of being controlled by a GUC. This way, we > can decide on compression for each subscription individually rather > than applying it to all subscribers. It makes more sense for the > subscriber to control this, especially when we are planning to > compress the data sent downstream. > Yes, that makes sense. However, we then need to provide this option via SQL APIs as well for other plugins. -- With Regards, Amit Kapila.
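For reference, output-plugin options are already passed to the SQL-level logical decoding functions as name/value pairs, so a plugin-level knob could ride along the same way; in this sketch 'include-xids' is an existing test_decoding option, while the 'compression' option is purely hypothetical:

    SELECT * FROM pg_logical_slot_peek_changes(
        'my_slot', NULL, NULL,
        'include-xids', '0',
        'compression', 'lz4');   -- hypothetical option name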
Re: Reordering DISTINCT keys to match input path's pathkeys
On Mon, Feb 5, 2024 at 11:18 AM Richard Guo wrote: > cfbot reminds that this patch does not apply any more. So I've rebased > it on master, and also adjusted the test cases a bit. This patch does not apply any more, so here is a new rebase, with some tweaks to the comments. Thanks Richard

Attachment: v3-0001-Reordering-DISTINCT-keys-to-match-input-path-s-pathkeys.patch
Re:Re: Add support to TLS 1.3 cipher suites and curves lists
Hi Peter,

Thanks a lot for the quick response. We are using a Postgres instance in our product. For security considerations, we prefer to use TLS 1.3 cipher suites in our product, with customized values instead of the default value "HIGH:MEDIUM:+3DES:!aNULL". Moreover, we prefer to set a group of ECDH keys instead of a single value.

I see that https://commitfest.postgresql.org/48/ is still open; could it be possible to target this for PG17? As far as I know, PG17 is going to be released this year, so we could upgrade our instances to this new version accordingly.

Original Email --
Sender: "Peter Eisentraut" <pe...@eisentraut.org>
Sent Time: 2024/6/7 16:55
To: "Erica Zhang" <ericazhangy2...@qq.com>; "pgsql-hackers" <pgsql-hackers@lists.postgresql.org>
Subject: Re: Add support to TLS 1.3 cipher suites and curves lists

On 07.06.24 08:10, Erica Zhang wrote:
> I’m a Postgres user and I’m looking into restricting the set of allowed
> ciphers on Postgres and configuring a concrete set of curves on our
> postgres instances.

Out of curiosity, why is this needed in practice?

> Could you please help to review to see if you are interested in having
> this change in the upcoming Postgres major release (it should be PG17)?

It would be targeting PG18 now.
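For context, the server-side TLS settings that exist today are the ones sketched below; the proposal discussed in this thread would add TLS 1.3-specific equivalents, whose names are intentionally not shown here because they are still being designed:

    # postgresql.conf -- existing TLS-related parameters
    ssl = on
    ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'   # cipher list for TLS 1.2 and below
    ssl_ecdh_curve = 'prime256v1'              # currently a single curve/group
    ssl_min_protocol_version = 'TLSv1.2'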
Re: Proposal to include --exclude-extension Flag in pg_dump
On Tue, 19 Mar 2024 at 11:53, Daniel Gustafsson wrote: > > I did notice a few mistakes in the --filter > documentation portion for other keywords but that's unrelated to this patch, > will fix them once this is in to avoid conflicts. > Attached is a patch for the --filter docs, covering the omissions I can see. Regards, Dean diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml new file mode 100644 index 08d7753..b1dfa21 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -866,13 +866,14 @@ PostgreSQL documentation same rules as the corresponding options: -t/--table, --table-and-children, ---exclude-table-and-children or --T for tables, --n/--schema for schemas, +-T/--exclude-table, and +--exclude-table-and-children for tables, +-n/--schema and +-N/--exclude-schema for schemas, --include-foreign-data for data on foreign servers, ---exclude-table-data, +--exclude-table-data and --exclude-table-data-and-children for table data, and --e/--extension or +-e/--extension and --exclude-extension for extensions. To read from STDIN, use - as the filename. The --filter option can be specified in @@ -895,34 +896,37 @@ PostgreSQL documentation - extension: extensions, works like the - -e/--extension option. + extension: extensions. This works like the + -e/--extension and + --exclude-extension options. - foreign_data: data on foreign servers, works like + foreign_data: data on foreign servers. This works like the --include-foreign-data option. This keyword can only be used with the include keyword. - table: tables, works like the - -t/--table option. + table: tables. This works like the + -t/--table and + -T/--exclude-table options. table_and_children: tables including any partitions - or inheritance child tables, works like the - --table-and-children option. + or inheritance child tables. This works like the + --table-and-children and + --exclude-table-and-children options. table_data: table data of any tables matching - pattern, works like the + pattern. This works like the --exclude-table-data option. This keyword can only be used with the exclude keyword. @@ -931,15 +935,16 @@ PostgreSQL documentation table_data_and_children: table data of any tables matching pattern as well as any partitions - or inheritance children of the table(s), works like the + or inheritance children of the table(s). This works like the --exclude-table-data-and-children option. This keyword can only be used with the exclude keyword. - schema: schemas, works like the - -n/--schema option. + schema: schemas. This works like the + -n/--schema and + -N/--exclude-schema options.
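For reference, the file read by --filter contains one rule per line, of the form "include" or "exclude" followed by a keyword and a pattern; a small, hypothetical example of how it is used (object names are made up):

    $ cat dump_rules.txt
    # dump everything except one extension, and skip the data of one table
    exclude extension adminpack
    exclude table_data public.audit_log

    $ pg_dump --filter=dump_rules.txt mydb > mydb.sql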
Re: Logical Replication of sequences
On Fri, Jun 7, 2024 at 7:55 AM Masahiko Sawada wrote: > > On Thu, Jun 6, 2024 at 6:40 PM Amit Kapila wrote: > > > > On Thu, Jun 6, 2024 at 11:10 AM Masahiko Sawada > > wrote: > > > > > > On Wed, Jun 5, 2024 at 9:30 PM Amit Kapila > > > wrote: > > > > > > > > > > > To achieve this, we can allow sequences to be copied during > > > > the initial CREATE SUBSCRIPTION command similar to what we do for > > > > tables. And then later by new/existing command, we re-copy the already > > > > existing sequences on the subscriber. > > > > > > > > The options for the new command could be: > > > > Alter Subscription ... Refresh Sequences > > > > Alter Subscription ... Replicate Sequences > > > > > > > > In the second option, we need to introduce a new keyword Replicate. > > > > Can you think of any better option? > > > > > > Another idea is doing that using options. For example, > > > > > > For initial sequences synchronization: > > > > > > CREATE SUBSCRIPTION ... WITH (copy_sequence = true); > > > > > > > How will it interact with the existing copy_data option? So copy_data > > will become equivalent to copy_table_data, right? > > Right. > > > > > > For re-copy (or update) sequences: > > > > > > ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_sequence = true); > > > > > > > Similar to the previous point it can be slightly confusing w.r.t > > copy_data. And would copy_sequence here mean that it would copy > > sequence values of both pre-existing and newly added sequences, if so, > > that would make it behave differently than copy_data? The other > > possibility in this direction would be to introduce an option like > > replicate_all_sequences/copy_all_sequences which indicates a copy of > > both pre-existing and new sequences, if any. > > Copying sequence data works differently than replicating table data > (initial data copy and logical replication). So I thought the > copy_sequence option (or whatever better name) always does both > updating pre-existing sequences and adding new sequences. REFRESH > PUBLICATION updates the tables to be subscribed, so we also update or > add sequences associated to these tables. > Are you imagining the behavior for sequences associated with tables differently than the ones defined by the CREATE SEQUENCE .. command? I was thinking that users would associate sequences with publications similar to what we do for tables for both cases. For example, they need to explicitly mention the sequences they want to replicate by commands like CREATE PUBLICATION ... FOR SEQUENCE s1, s2, ...; CREATE PUBLICATION ... FOR ALL SEQUENCES, or CREATE PUBLICATION ... FOR SEQUENCES IN SCHEMA sch1; In this, variants FOR ALL SEQUENCES and SEQUENCES IN SCHEMA sch1 should copy both the explicitly defined sequences and sequences defined with the tables. Do you think a different variant for just copying sequences implicitly associated with tables (say for identity columns)? > > > > > > > > > > > In addition to the above, the command Alter Subscription .. Refresh > > > > Publication will fetch any missing sequences similar to what it does > > > > for tables. > > > > > > On the subscriber side, do we need to track which sequences are > > > created via CREATE/ALTER SUBSCRIPTION? > > > > > > > I think so unless we find some other way to know at refresh > > publication time which all new sequences need to be part of the > > subscription. What should be the behavior w.r.t sequences when the > > user performs ALTER SUBSCRIPTION ... REFRESH PUBLICATION? 
I was > > thinking similar to tables, it should fetch any missing sequence > > information from the publisher. > > It seems to make sense to me. But I have one question: do we want to > support replicating sequences that are not associated with any tables? > Yes, unless we see a problem with it. > if yes, what if we refresh two different subscriptions that subscribe > to different tables on the same database? What problem do you see with it? > On the other hand, if no > (i.e. replicating only sequences owned by tables), can we know which > sequences to replicate by checking the subscribed tables? > Sorry, I didn't understand your question. Can you please try to explain in more words or use some examples? -- With Regards, Amit Kapila.
Re: Re: Add support to TLS 1.3 cipher suites and curves lists
On Fri, Jun 07, 2024 at 06:02:37PM +0800, Erica Zhang wrote:
> I see that https://commitfest.postgresql.org/48/ is still open; could
> it be possible to target this for PG17? As far as I know, PG17 is going
> to be released this year, so we could upgrade our instances to this new
> version accordingly.

Echoing Peter, https://commitfest.postgresql.org/48/ is planned to be the first commit fest of the development cycle for Postgres 18. v17 is in feature freeze state and beta, where only bug fixes are accepted, not new features.
--
Michael
Re: Conflict Detection and Resolution
On Thu, Jun 6, 2024 at 5:16 PM Nisha Moond wrote: > > > > Here are more use cases of the "earliest_timestamp_wins" resolution method: > 1) Applications where the record of first occurrence of an event is > important. For example, sensor based applications like earthquake > detection systems, capturing the first seismic wave's time is crucial. > 2) Scheduling systems, like appointment booking, prioritize the > earliest request when handling concurrent ones. > 3) In contexts where maintaining chronological order is important - > a) Social media platforms display comments ensuring that the > earliest ones are visible first. > b) Finance transaction processing systems rely on timestamps to > prioritize the processing of transactions, ensuring that the earliest > transaction is handled first > Thanks for sharing examples. However, these scenarios would be handled by the application and not during replication. What we are discussing here is the timestamp when a row was updated/inserted/deleted (or rather when the transaction that updated row committed/became visible) and not a DML on column which is of type timestamp. Some implementations use a hidden timestamp column but that's different from a user column which captures timestamp of (say) an event. The conflict resolution will be based on the timestamp when that column's value was recorded in the database which may be different from the value of the column itself. If we use the transaction commit timestamp as basis for resolution, a transaction where multiple rows conflict may end up with different rows affected by that transaction being resolved differently. Say three transactions T1, T2 and T3 on separate origins with timestamps t1, t2, and t3 respectively changed rows r1, r2 and r2, r3 and r1, r4 respectively. Changes to r1 and r2 will conflict. Let's say T2 and T3 are applied first and then T1 is applied. If t2 < t1 < t3, r1 will end up with version of T3 and r2 will end up with version of T1 after applying all the three transactions. Would that introduce an inconsistency between r1 and r2? -- Best Wishes, Ashutosh Bapat
Re: Things I don't like about \du's "Attributes" column
On Thu, Jun 6, 2024 at 5:10 PM Pavel Luzanov wrote:

Agree. There is an additional technical argument for removing this replacement. I don't like explicit cast to text of the "Connection limit" column. Without 'Not allowed' it is no longer required. Value -1 can be replaced by NULL with an implicit cast to integer.

Yeah, +1 for that idea.

Example output:

    \du+ regress_du*
                                                           List of roles
        Role name     | Login | Attributes  |         Valid until          | Connection limit |   Description
    ------------------+-------+-------------+------------------------------+------------------+------------------
     regress_du_admin | yes   | Superuser  +|                              |                  | some description
                      |       | Create DB  +|                              |                  |
                      |       | Create role+|                              |                  |
                      |       | Inherit    +|                              |                  |
                      |       | Replication+|                              |                  |
                      |       | Bypass RLS  |                              |                  |
     regress_du_role0 | yes   | Inherit     | Tue Jun 04 00:00:00 2024 PDT |                0 |
     regress_du_role1 | no    | Create role+| infinity                     |                  |
                      |       | Inherit     |                              |                  |
     regress_du_role2 | yes   | Inherit    +|                              |               42 |
                      |       | Replication+|                              |                  |
                      |       | Bypass RLS  |                              |                  |
    (4 rows)

This seems unobjectionable to me. I am not sure whether it is better than the current version, or whether it is what we want. But it seems reasonable.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On 5/27/24 07:48, shveta malik wrote: > On Sat, May 25, 2024 at 2:39 AM Tomas Vondra > wrote: >> >> On 5/23/24 08:36, shveta malik wrote: >>> Hello hackers, >>> >>> Please find the proposal for Conflict Detection and Resolution (CDR) >>> for Logical replication. >>> >> below details.> >>> >>> Introduction >>> >>> In case the node is subscribed to multiple providers, or when local >>> writes happen on a subscriber, conflicts can arise for the incoming >>> changes. CDR is the mechanism to automatically detect and resolve >>> these conflicts depending on the application and configurations. >>> CDR is not applicable for the initial table sync. If locally, there >>> exists conflicting data on the table, the table sync worker will fail. >>> Please find the details on CDR in apply worker for INSERT, UPDATE and >>> DELETE operations: >>> >> >> Which architecture are you aiming for? Here you talk about multiple >> providers, but the wiki page mentions active-active. I'm not sure how >> much this matters, but it might. > > Currently, we are working for multi providers case but ideally it > should work for active-active also. During further discussion and > implementation phase, if we find that, there are cases which will not > work in straight-forward way for active-active, then our primary focus > will remain to first implement it for multiple providers architecture. > >> >> Also, what kind of consistency you expect from this? Because none of >> these simple conflict resolution methods can give you the regular >> consistency models we're used to, AFAICS. > > Can you please explain a little bit more on this.

I was referring to the well established consistency models / isolation levels, e.g. READ COMMITTED or SNAPSHOT ISOLATION. This determines what guarantees the application developer can expect, what anomalies can happen, etc. I don't think any such isolation level can be implemented with simple conflict resolution methods like last-update-wins etc.

For example, consider an active-active setup where both nodes do

    UPDATE accounts SET balance=balance+1000 WHERE id=1

This will inevitably lead to a conflict, and while last-update-wins resolves this "consistently" on both nodes (e.g. ending with the same result), it's essentially a lost update.

This is a very simplistic example of course, I recall there are various more complex examples involving foreign keys, multi-table transactions, constraints, etc. But in principle it's a manifestation of the same inherent limitation of conflict detection and resolution etc.

Similarly, I believe this affects not just active-active, but also the case where one node aggregates data from multiple publishers. Maybe not to the same extent / it might be fine for that use case, but you said the end goal is to use this for active-active. So I'm wondering what's the plan, there.

If I'm writing an application for active-active using this conflict handling, what assumptions can I make? Can I just do stuff as if on a single node, or do I need to be super conscious about the zillion ways things can misbehave in a distributed system?

My personal opinion is that the closer this will be to the regular consistency levels, the better. If past experience taught me anything, it's very hard to predict how distributed systems with eventual consistency behave, and even harder to actually test the application in such an environment.

In any case, if there are any differences compared to the usual behavior, it needs to be very clearly explained in the docs.
>> >>> INSERT >>> >>> To resolve INSERT conflict on subscriber, it is important to find out >>> the conflicting row (if any) before we attempt an insertion. The >>> indexes or search preference for the same will be: >>> First check for replica identity (RI) index. >>> - if not found, check for the primary key (PK) index. >>> - if not found, then check for unique indexes (individual ones or >>> added by unique constraints) >>> - if unique index also not found, skip CDR >>> >>> Note: if no RI index, PK, or unique index is found but >>> REPLICA_IDENTITY_FULL is defined, CDR will still be skipped. >>> The reason being that even though a row can be identified with >>> REPLICAT_IDENTITY_FULL, such tables are allowed to have duplicate >>> rows. Hence, we should not go for conflict detection in such a case. >>> >> >> It's not clear to me why would REPLICA_IDENTITY_FULL mean the table is >> allowed to have duplicate values? It just means the upstream is sending >> the whole original row, there can still be a PK/UNIQUE index on both the >> publisher and subscriber. > > Yes, right. Sorry for confusion. I meant the same i.e. in absence of > 'RI index, PK, or unique index', tables can have duplicates. So even > in presence of Replica-identity (FULL in this case) but in absence of > unique/primary index, CDR will be skipped for INSERT. > >> >>> In case of replica identity ‘nothing’ and in absence
Re: Conflict Detection and Resolution
On 5/28/24 11:17, Nisha Moond wrote: > On Mon, May 27, 2024 at 11:19 AM shveta malik wrote: >> >> On Sat, May 25, 2024 at 2:39 AM Tomas Vondra >> wrote: >>> >>> ... >>> >>> I don't understand the why should update_missing or update_deleted be >>> different, especially considering it's not detected reliably. And also >>> that even if we happen to find the row the associated TOAST data may >>> have already been removed. So why would this matter? >> >> Here, we are trying to tackle the case where the row is 'recently' >> deleted i.e. concurrent UPDATE and DELETE on pub and sub. User may >> want to opt for a different resolution in such a case as against the >> one where the corresponding row was not even present in the first >> place. The case where the row was deleted long back may not fall into >> this category as there are higher chances that they have been removed >> by vacuum and can be considered equivalent to the update_ missing >> case. >> >> Regarding "TOAST column" for deleted row cases, we may need to dig >> more. Thanks for bringing this case. Let me analyze more here. >> > I tested a simple case with a table with one TOAST column and found > that when a tuple with a TOAST column is deleted, both the tuple and > corresponding pg_toast entries are marked as ‘deleted’ (dead) but not > removed immediately. The main tuple and respective pg_toast entry are > permanently deleted only during vacuum. First, the main table’s dead > tuples are vacuumed, followed by the secondary TOAST relation ones (if > available). > Please let us know if you have a specific scenario in mind where the > TOAST column data is deleted immediately upon ‘delete’ operation, > rather than during vacuum, which we are missing. > I'm pretty sure you can vacuum the TOAST table directly, which means you'll end up with a deleted tuple with TOAST pointers, but with the TOAST entries already gone. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
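For anyone wanting to reproduce that, the TOAST relation can be vacuumed on its own, leaving the main table's dead tuple (with its TOAST pointers) in place while the TOAST entries are removed; the pg_toast relation name below is only illustrative, use the one returned by the first query:

    -- find the TOAST table backing a given relation
    SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'my_table';

    -- vacuum only the TOAST relation, not the main table
    VACUUM pg_toast.pg_toast_16384;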
Re: question regarding policy for patches to out-of-support branches
On Thu, Jun 6, 2024 at 10:04 PM Tom Lane wrote: > > I added them here with minimal copy editing and no attempt to organize or > > sort into groups: > > https://wiki.postgresql.org/wiki/Committing_checklist#Policies > > If someone has thoughts on how to improve I am happy to make more changes. > > Thanks! I summoned the energy to make a few more improvements, > particularly updating stuff that seemed out-of-date. I'm sure > there's more that could be added here. This is nice! I wonder if we could interest anyone in creating tooling that could be used to check some of this stuff -- ideally run as part of the regular build process, so that you can't fail to notice that you did it wrong. Not all of these rules are subject to automatic verification, e.g. it's hard to enforce that a change to an out-of-support branch makes no functional change. But an awful lot of them could be, and I would personally be significantly happier and less stressed if I knew that 'ninja && meson test' was going to tell me that I did it wrong before I pushed, instead of finding out afterward and then having to drop everything to go clean it up. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Fri, Jun 7, 2024 at 4:05 AM Alvaro Herrera wrote: > > static void > > -ZeroBuffer(Buffer buffer, ReadBufferMode mode) > > +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero) > > This change makes the API very strange. Should the function be called > ZeroAndLockBuffer() instead? Then the addition of a "bool zero" > argument makes a lot more sense. I agree that's better, but it still looks a bit weird. You have to realize that 'bool zero' means 'is already zeroed' here -- or at least, I guess that's the intention. But then I wonder why you'd call a function called ZeroAndLockBuffer if all you need to do is LockBuffer. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Conflict Detection and Resolution
On 6/3/24 09:30, Amit Kapila wrote: > On Sat, May 25, 2024 at 2:39 AM Tomas Vondra > wrote: >> >> On 5/23/24 08:36, shveta malik wrote: >>> >>> Conflict Resolution >>> >>> a) latest_timestamp_wins:The change with later commit timestamp wins. >>> b) earliest_timestamp_wins: The change with earlier commit timestamp wins. >>> c) apply: Always apply the remote change. >>> d) skip:Remote change is skipped. >>> e) error: Error out on conflict. Replication is stopped, manual >>> action is needed. >>> >> >> Why not to have some support for user-defined conflict resolution >> methods, allowing to do more complex stuff (e.g. merging the rows in >> some way, perhaps even with datatype-specific behavior)? >> >>> The change will be converted to 'UPDATE' and applied if the decision >>> is in favor of applying remote change. >>> >>> It is important to have commit timestamp info available on subscriber >>> when latest_timestamp_wins or earliest_timestamp_wins method is chosen >>> as resolution method. Thus ‘track_commit_timestamp’ must be enabled >>> on subscriber, in absence of which, configuring the said >>> timestamp-based resolution methods will result in error. >>> >>> Note: If the user has chosen the latest or earliest_timestamp_wins, >>> and the remote and local timestamps are the same, then it will go by >>> system identifier. The change with a higher system identifier will >>> win. This will ensure that the same change is picked on all the nodes. >> >> How is this going to deal with the fact that commit LSN and timestamps >> may not correlate perfectly? That is, commits may happen with LSN1 < >> LSN2 but with T1 > T2. >> > > One of the possible scenarios discussed at pgconf.dev with Tomas for > this was as follows: > > Say there are two publisher nodes PN1, PN2, and subscriber node SN3. > The logical replication is configured such that a subscription on SN3 > has publications from both PN1 and PN2. For example, SN3 (sub) -> PN1, > PN2 (p1, p2) > > Now, on PN1, we have the following operations that update the same row: > > T1 > Update-1 on table t1 at LSN1 (1000) on time (200) > > T2 > Update-2 on table t1 at LSN2 (2000) on time (100) > > Then in parallel, we have the following operation on node PN2 that > updates the same row as Update-1, and Update-2 on node PN1. > > T3 > Update-3 on table t1 at LSN(1500) on time (150) > > By theory, we can have a different state on subscribers depending on > the order of updates arriving at SN3 which shouldn't happen. Say, the > order in which they reach SN3 is: Update-1, Update-2, Update-3 then > the final row we have is by Update-3 considering we have configured > last_update_wins as a conflict resolution method. Now, consider the > other order: Update-1, Update-3, Update-2, in this case, the final > row will be by Update-2 because when we try to apply Update-3, it will > generate a conflict and as per the resolution method > (last_update_wins) we need to retain Update-1. > > On further thinking, the operations on node-1 PN-1 as defined above > seem impossible because one of the Updates needs to wait for the other > to write a commit record. So the commits may happen with LSN1 < LSN2 > but with T1 > T2 but they can't be on the same row due to locks. So, > the order of apply should still be consistent. Am, I missing > something? > Sorry, I should have read your message before responding a couple minutes ago. I think you're right this exact example can't happen, due to the dependency between transactions. 
But as I wrote, I'm not quite convinced this means there are not other issues with this way of resolving conflicts. It's more likely a more complex scenario is required. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Compress ReorderBuffer spill files using LZ4
On 6/6/24 16:24, Alvaro Herrera wrote: > On 2024-Jun-06, Amit Kapila wrote: > >> On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires wrote: >>> >>> When the content of a large transaction (size exceeding >>> logical_decoding_work_mem) and its sub-transactions has to be >>> reordered during logical decoding, then, all the changes are written >>> on disk in temporary files located in pg_replslot/. >>> Decoding very large transactions by multiple replication slots can >>> lead to disk space saturation and high I/O utilization. > > I like the general idea of compressing the output of logical decoding. > It's not so clear to me that we only want to do so for spilling to disk; > for instance, if the two nodes communicate over a slow network, it may > even be beneficial to compress when streaming, so to this question: > >> Why can't one use 'streaming' option to send changes to the client >> once it reaches the configured limit of 'logical_decoding_work_mem'? > > I would say that streaming doesn't necessarily have to mean we don't > want compression, because for some users it might be beneficial. > > I think a GUC would be a good idea. Also, what if for whatever reason > you want a different compression algorithm or different compression > parameters? Looking at the existing compression UI we offer in > pg_basebackup, perhaps you could add something like this: > > compress_logical_decoding = none > compress_logical_decoding = lz4:42 > compress_logical_decoding = spill-zstd:99 > > "none" says to never use compression (perhaps should be the default), > "lz4:42" says to use lz4 with parameters 42 on both spilling and > streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but > only for spilling to disk. > > (I don't mean to say that you should implement Zstd compression with > this patch, only that you should choose the implementation so that > adding Zstd support (or whatever) later is just a matter of adding some > branches here and there. With the current #ifdef you propose, it's hard > to do that. Maybe separate the parts that depend on the specific > algorithm to algorithm-agnostic functions.) > I haven't been following the "libpq compression" thread, but wouldn't that also do compression for the streaming case? That was my assumption, at least, and it seems like the right way - we probably don't want to patch every place that sends data over network independently, right? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
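As an illustration of the algorithm-agnostic split suggested above, here is a minimal standalone sketch (all names are hypothetical, and it is not taken from the patch) of how a setting string such as "none", "lz4:42" or "spill-zstd:99" could be parsed into a neutral structure before any LZ4- or Zstd-specific code is reached:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, algorithm-agnostic representation of the setting. */
typedef struct LogicalCompressionSetting
{
	bool		spill_only;		/* "spill-" prefix: compress only when spilling */
	char		algorithm[16];	/* "none", "lz4", "zstd", ... */
	int			level;			/* -1 if not specified */
} LogicalCompressionSetting;

/* Parse strings like "none", "lz4:42" or "spill-zstd:99"; false on syntax error. */
static bool
parse_compress_logical_decoding(const char *value, LogicalCompressionSetting *out)
{
	const char *p = value;
	const char *colon;
	size_t		len;

	out->spill_only = false;
	out->level = -1;

	if (strncmp(p, "spill-", 6) == 0)
	{
		out->spill_only = true;
		p += 6;
	}

	colon = strchr(p, ':');
	len = colon ? (size_t) (colon - p) : strlen(p);
	if (len == 0 || len >= sizeof(out->algorithm))
		return false;
	memcpy(out->algorithm, p, len);
	out->algorithm[len] = '\0';

	if (colon)
	{
		char	   *end;
		long		level = strtol(colon + 1, &end, 10);

		if (end == colon + 1 || *end != '\0')
			return false;
		out->level = (int) level;
	}

	/* "none" should carry neither a level nor a spill- prefix. */
	if (strcmp(out->algorithm, "none") == 0 && (out->spill_only || out->level != -1))
		return false;

	return true;
}

int
main(void)
{
	const char *examples[] = {"none", "lz4:42", "spill-zstd:99"};

	for (int i = 0; i < 3; i++)
	{
		LogicalCompressionSetting s;

		if (parse_compress_logical_decoding(examples[i], &s))
			printf("%s -> algo=%s level=%d spill_only=%d\n",
				   examples[i], s.algorithm, s.level, (int) s.spill_only);
	}
	return 0;
}

Keeping the parsed representation free of library-specific types is what makes it cheap to add another algorithm later, which seems to be the point of the suggestion above.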
Re: Compress ReorderBuffer spill files using LZ4
On 6/6/24 12:58, Julien Tachoires wrote: > ... > > When compiled with LZ4 support (--with-lz4), this patch enables data > compression/decompression of these temporary files. Each transaction > change that must be written on disk (ReorderBufferDiskChange) is now > compressed and encapsulated in a new structure. > I'm a bit confused, but why tie this to having lz4? Why shouldn't this be supported even for pglz, or whatever algorithms we add in the future? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: race condition in pg_class
On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier wrote: > On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote: > > It's not this patch set's fault, but I'm not very pleased to see that > > the injection point wait events have been shoehorned into the > > "Extension" category - which they are not - instead of being a new > > wait_event_type. That would have avoided the ugly wait-event naming > > pattern, inconsistent with everything else, introduced by > > inplace050-tests-inj-v1.patch. > > Not sure to agree with that. The set of core backend APIs supporting > injection points have nothing to do with wait events. The library > attached to one or more injection points *may* decide to use a wait > event like what the wait/wakeup calls in modules/injection_points do, > but that's entirely optional. These rely on custom wait events, > plugged into the Extension category as the code run is itself in an > extension. I am not arguing against the point that it may be > interesting to plug in custom wait event categories, but the current > design of wait events makes that much harder than what core is > currently able to handle, and I am not sure that this brings much at > the end as long as the wait event strings can be customized. > > I've voiced upthread concerns over the naming enforced by the patch > and the way it plugs the namings into the isolation functions, by the > way. I think the core code should provide an "Injection Point" wait event type and let extensions add specific wait events there, just like you did for "Extension". Then this ugly naming would go away. As I see it, "Extension" is only supposed to be used as a catch-all when we have no other information, but here we do. If we refuse to use the wait_event_type field to categorize waits, then people are going to have to find some other way to get that data into the system, as Noah has done. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Compress ReorderBuffer spill files using LZ4
Le ven. 7 juin 2024 à 05:59, Tomas Vondra a écrit : > > On 6/6/24 12:58, Julien Tachoires wrote: > > ... > > > > When compiled with LZ4 support (--with-lz4), this patch enables data > > compression/decompression of these temporary files. Each transaction > > change that must be written on disk (ReorderBufferDiskChange) is now > > compressed and encapsulated in a new structure. > > > > I'm a bit confused, but why tie this to having lz4? Why shouldn't this > be supported even for pglz, or whatever algorithms we add in the future? That's right, reworking this patch in that sense. Regards, JT
Re: relfilenode statistics
On Thu, Jun 6, 2024 at 11:17 PM Andres Freund wrote: > If we just want to keep prior stats upon arelation rewrite, we can just copy > the stats from the old relfilenode. Or we can decide that those stats don't > really make sense anymore, and start from scratch. I think we need to think carefully about what we want the user experience to be here. "Per-relfilenode stats" could mean "sometimes I don't know the relation OID so I want to use the relfilenumber instead, without changing the user experience" or it could mean "some of these stats actually properly pertain to the relfilenode rather than the relation so I want to associate them with the right object and that will affect how the user sees things." We need to decide which it is. If it's the former, then we need to examine whether the goal of hiding the distinction between relfilenode stats and relation stats from the user is in fact feasible. If it's the latter, then we need to make sure the whole patch reflects that design, which would include e.g. NOT copying stats from the old to the new relfilenode, and which would also include documenting the behavior in a way that will be understandable to users. In my experience, the worst thing you can do in cases like this is be somewhere in the middle. Then you tend to end up with stuff like: the difference isn't supposed to be something that the user knows or cares about, except that they do have to know and care because you haven't thoroughly covered up the deception, and often they have to reverse engineer the behavior because you didn't document what was really happening because you imagined that they wouldn't notice. -- Robert Haas EDB: http://www.enterprisedb.com
Re: ssl tests fail due to TCP port conflict
On 2024-06-06 Th 18:02, Jelte Fennema-Nio wrote: On Wed, 5 Jun 2024 at 23:37, Tom Lane wrote: Andrew Dunstan writes: On 2024-06-05 We 16:00, Alexander Lakhin wrote: That is, psql from the test instance 001_ssltests_34 opened a connection to the test server with the client port 50072 and it made using the port by the server from the test instance 001_ssltests_30 impossible. Oh. (kicks self) D'oh. Should we really be allocating ephemeral server ports in the range 41952..65535? Maybe we should be looking for an unallocated port somewhere below 41952, and above, say, 32767, so we couldn't have a client socket collision. Hmm, are there really any standards about how these port numbers are used? I wonder if we don't need to just be prepared to retry the whole thing a few times. Even if it's true that "clients" shouldn't choose ports below 41952, we still have a small chance of failure against a non-Postgres server starting up at the wrong time. My suggestion would be to not touch the ephemeral port range at all for these ports. In practice the ephemeral port range is used for cases where the operating system assigns the port, and the application doesn't care whot it is. Not for when you want to get a free port, but want to know in advance which one it is. For the PgBouncer test suite we do something similar as the PG its perl tests do, but there we allocate a port between 10200 and 32768: https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215 Sure theoretically it's possible to hit a rare case where another server starts up at the wrong time, but that chance seems way lower than a client starting up at the wrong time. Especially since there aren't many servers that use a port with 5 digits. Attached is a patch that updates the port numbers. Makes sense to me. I still think my patch to force TCP mode for the SSL test makes sense as well. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
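For readers unfamiliar with the technique being discussed, here is a rough standalone C sketch (not the Perl test-harness code; the range bounds are simply the ones mentioned above) of probing for a free port by attempting to bind it, which is what "allocate a port between 10200 and 32768" boils down to:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Try to bind a TCP socket on 127.0.0.1:port.  Returns 1 if the port looks
 * free (bind succeeded), 0 otherwise.
 */
static int
port_is_free(int port)
{
	int			fd;
	int			ok = 0;
	struct sockaddr_in addr;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return 0;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons((unsigned short) port);
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0)
		ok = 1;

	close(fd);
	return ok;
}

int
main(void)
{
	/* Range below the typical ephemeral/client range, as discussed above. */
	for (int port = 10200; port < 32768; port++)
	{
		if (port_is_free(port))
		{
			printf("using port %d\n", port);
			return 0;
		}
	}
	fprintf(stderr, "no free port found\n");
	return 1;
}

The check is inherently racy - another process can still take the port between the probe and the real listen() - which is why retrying, as suggested earlier in the thread, remains a sensible fallback.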
Re: 回复: An implementation of multi-key sort
To be accurate, "multi-key sort" includes both "multi-key quick sort" and "multi-key heap sort". This patch includes code change related to only "multi-key quick sort" which is used to replace standard quick sort for tuplesort. The "multi-key heap sort" is about an implementation of multi-key heap and should be treated as a separated task. We need to clarify the naming to avoid confusion. I updated code which is related to only function/var renaming and relevant comments, plus some minor assertions changes. Please see the attachment. Thanks, Yao Wang On Fri, May 31, 2024 at 8:09 PM Yao Wang wrote: > > I added two optimizations to mksort which exist on qsort_tuple(): > > 1. When selecting pivot, always pick the item in the middle of array but > not by random. Theoretically it has the same effect to old approach, but > it can eliminate some unstable perf test results, plus a bit perf benefit by > removing random value generator. > 2. Always check whether the array is ordered already, and return > immediately if it is. The pre-ordered check requires extra cost and > impacts perf numbers on some data sets, but can improve perf > significantly on other data sets. > > By now, mksort has perf results equal or better than qsort on all data > sets I ever used. > > I also updated test case. Please see v3 code as attachment. > > Perf test results: > > Data set 1 (with mass duplicate values): > - > > create table t1 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100)); > insert into t1 values (generate_series(1,49), 0, 0, 0, 0, > 'aaabbb'); > update t1 set c2 = c1 % 100, c3 = c1 % 50, c4 = c1 % 10, c5 = c1 % 3; > update t1 set c6 = 'aaabbb' > || (c1 % 5)::text; > > Query 1: > > explain analyze select c1 from t1 order by c6, c5, c4, c3, c2, c1; > > Disable Mksort > > 3021.636 ms > 3014.669 ms > 3033.588 ms > > Enable Mksort > > 1688.590 ms > 1686.956 ms > 1688.567 ms > > The improvement is 78.9%, which is reduced from the previous version > (129%). The most cost should be the pre-ordered check. > > Query 2: > > create index idx_t1_mk on t1 (c6, c5, c4, c3, c2, c1); > > Disable Mksort > > 1674.648 ms > 1680.608 ms > 1681.373 ms > > Enable Mksort > > 1143.341 ms > 1143.462 ms > 1143.894 ms > > The improvement is ~47%, which is also reduced a bit (52%). > > Data set 2 (with distinct values): > -- > > create table t2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100)); > insert into t2 values (generate_series(1,49), 0, 0, 0, 0, ''); > update t2 set c2 = 90 - c1, c3 = 91 - c1, c4 = 92 - c1, c5 > = 93 - c1; > update t2 set c6 = 'aaabbb' > || (94 - c1)::text; > > Query 1: > > explain analyze select c1 from t2 order by c6, c5, c4, c3, c2, c1; > > Disable Mksort > > 12199.963 ms > 12197.068 ms > 12191.657 ms > > Enable Mksort > > 9538.219 ms > 9571.681 ms > 9536.335 ms > > The improvement is 27.9%, which is much better than the old approach (-6.2%). > > Query 2 (the data is pre-ordered): > > explain analyze select c1 from t2 order by c6 desc, c5, c4, c3, c2, c1; > > Enable Mksort > > 768.191 ms > 768.079 ms > 767.026 ms > > Disable Mksort > > 768.757 ms > 766.166 ms > 766.149 ms > > They are almost the same since no actual sort was performed, and much > better than the old approach (-1198.1%). > > > Thanks, > > Yao Wang > > On Fri, May 24, 2024 at 8:50 PM Yao Wang wrote: > > > > When all leading keys are different, mksort will finish the entire sort at > > the > > first sort key and never touch other keys. For the case, mksort falls back > > to > > kind of qsort actually. 
> > > > I created another data set with distinct values in all sort keys: > > > > create table t2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100)); > > insert into t2 values (generate_series(1,49), 0, 0, 0, 0, ''); > > update t2 set c2 = 90 - c1, c3 = 91 - c1, c4 = 92 - c1, c5 > > = 93 - c1; > > update t2 set c6 = 'aaabbb' > > || (94 - c1)::text; > > explain analyze select c1 from t2 order by c6, c5, c4, c3, c2, c1; > > > > Results: > > > > MKsort: > > 12374.427 ms > > 12528.068 ms > > 12554.718 ms > > > > qsort: > > 12251.422 ms > > 12279.938 ms > > 12280.254 ms > > > > MKsort is a bit slower than qsort, which can be explained by extra > > checks of MKsort. > > > > Yao Wang > > > > On Fri, May 24, 2024 at 8:36 PM Wang Yao wrote: > > > > > > > > > > > > 获取Outlook for Android > > > > > > From: Heikki Linnakangas > > > Sent: Thursday, May 23, 2024 8:47:29 PM > > > To: Wang Yao ; PostgreSQL Hackers > > > > > > Cc: inte...@outlook.com > > > Subject: Re: 回复: An implementation of multi-key sort > > > > > > On 23/05/2024 15:39, Wang Yao wrote: > > > > No obvious perf regression is expected
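As a self-contained sketch of the "already ordered" pre-check described in optimization 2 above (the element type and comparator are simplified to plain integers; this is not the patch code):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	return (x > y) - (x < y);
}

/* Return true if a[0..n-1] is already in non-decreasing order. */
static bool
array_is_presorted(const int *a, int n)
{
	for (int i = 1; i < n; i++)
		if (cmp_int(&a[i - 1], &a[i]) > 0)
			return false;
	return true;
}

/* Sort with the "already ordered" shortcut described above. */
static void
sort_with_precheck(int *a, int n)
{
	if (n < 2 || array_is_presorted(a, n))
		return;					/* O(n) check avoided a full sort */
	qsort(a, n, sizeof(int), cmp_int);	/* stand-in for the real mksort */
}

int
main(void)
{
	int			data[] = {3, 1, 2, 5, 4};

	sort_with_precheck(data, 5);
	for (int i = 0; i < 5; i++)
		printf("%d ", data[i]);
	printf("\n");
	return 0;
}

The trade-off reported above follows directly from this shape: the O(n) scan is pure overhead on random input, but it lets fully pre-ordered input return without doing any sorting at all.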
Re: Patch bug: Fix jsonpath .* on Arrays
On Jun 4, 2024, at 20:45, David E. Wheeler wrote: > Oh FFS, unwrapping still breaks my brain. You’re right, of course. Here’s a > new patch that demonstrates that behavior, since that code path is not > currently represented in tests AFAICT (I would have expected to have broken > it with this patch). Rebased and moved the new tests to the end of the file. D v2-0001-Add-tests-for-jsonpath-.-on-arrays.patch Description: Binary data
Re: Patch bug: Fix jsonpath .* on Arrays
On Jun 7, 2024, at 10:23, David E. Wheeler wrote: > Rebased and moved the new tests to the end of the file. Bah, sorry, that was the previous patch. Here’s v3. D v3-0001-Add-tests-for-jsonpath-.-on-arrays.patch Description: Binary data
Re: ssl tests fail due to TCP port conflict
Andrew Dunstan writes: > On 2024-06-06 Th 18:02, Jelte Fennema-Nio wrote: >> For the PgBouncer test suite we do something similar as the PG its >> perl tests do, but there we allocate a port between 10200 and 32768: >> https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215 > Makes sense to me. > I still think my patch to force TCP mode for the SSL test makes sense as > well. +1 to both things. If that doesn't get the failure rate down to an acceptable level, we can look at the retry idea. regards, tom lane
Re: Postgresql OOM
> > > >> The planner should recognize this situation and avoid use of hash >> join in such cases, but maybe the statistics aren't reflecting the >> problem, or maybe there's something wrong with the logic specific >> to parallel hash join. You've not really provided enough information >> to diagnose why the poor choice of plan. >> >> regards, tom lane >> > > Thanks for looking into this. I'm not sure what information would be > needed to look at the choice of plan. > The statistics for the join conditions in the query would be: > join_condition | min_count | max_count | avg_count > +---+---+ > snd_tro| 0 | 0 | 0. > rpl_rec_tro| 0 | 2 | 0.99869222814474470477 > rec_tro| 0 | 2 | 0.99869222814474470477 > rpl_snd_tro| 0 | 0 | 0. > r | 0 | 1 | 0.49850916663490161653 > > > The relevant columns for the tables are: > postgres=# \d inputrequest >Table "public.inputrequest" > Column |Type | Collation | > Nullable | Default > > --+-+---+--+- > input_sequence | bigint | | not > null | > msg_type | character varying(8)| | not > null | > msg_content | text| | not > null | > msg_reference| character varying(35) | | > | > originalrequest_id | bigint | | > | > receive_time | timestamp without time zone | | not > null | > related_output_sequence | bigint | | > | > msg_status | character varying(15) | | > | > > Indexes: > "inputrequest_pkey" PRIMARY KEY, btree (input_sequence) > "inputrequest_originalrequest_id_idx" btree (originalrequest_id) > > postgres=# \d outputrequest > Table "public.outputrequest" > Column |Type | Collation | > Nullable | Default > > +-+---+--+- > output_sequence| bigint | | not > null | > input_sequence | bigint | | >| > msg_type | character varying(8)| | >| > msg_content| text| | not > null | > msg_reference | character varying(35) | | >| > reply_input_sequence | bigint | | >| > status | integer | | not > null | > related_input_sequence | bigint | | >| > Indexes: > "outputrequest_pkey" PRIMARY KEY, btree (output_sequence) > "outputrequest_input_sequence_idx" btree (input_sequence) > "outputrequest_reply_input_sequence_idx" btree (reply_input_sequence) > > I wonder if our choice of primary keys (input_sequence and output_sequence) has something to do with the skew in the hash bucket distribution. We use the following format: mmddxx , where is more or less a sequence and xx is the node generating the id, i.e. 01,02,etc (with only one or two values in the dataset). I wonder if it would be difficult to have an upper limit on the private memory that can be allocated by one process (or all processes similar to Oracle's pga_aggregate_limit). I would rather have one query failing with an error message instead of postgres eating up all memory and swap on the server. Best regards, Radu
Re: XACT_EVENT for 'commit prepared'
Xiaoran Wang writes: > I found that in enum XactEvent, there is 'XACT_EVENT_PREPARE' for > 'prepare transaction', but there is no event for 'commit prepared' or > 'rollback prepared'. On the whole, it seems like a good idea to me that those commands don't invoke event triggers. It is a core principle of 2PC that if 'prepare' succeeded, 'commit prepared' must not fail. Invoking a trigger during the second step would add failure cases and I'm not sure what value it has. regards, tom lane
Re: PgStat_KindInfo.named_on_disk not required in shared stats
Hi, On 2024-06-07 14:07:33 +0900, Michael Paquier wrote: > While hacking on the area of pgstat_*.c, I have noticed the existence > of named_on_disk in PgStat_KindInfo, that is here to track the fact > that replication slots are a particular case in the PgStat_HashKey for > the dshash table of the stats because this kind of stats requires a > mapping between the replication slot name and the hash key. > > As far as I can see, this field is not required and is used nowhere, > because the code relies on the existence of the to_serialized_name and > from_serialized_name callbacks to do the mapping. > > Wouldn't it make sense to remove it? This field is defined since > 5891c7a8ed8f that introduced the shmem stats, and has never been used > since. Yes, makes sense. Looks like we changed direction during development a bunch of times... > This frees an extra bit in PgStat_KindInfo, which is going to help me > a bit with what I'm doing with this area of the code while keeping the > structure size the same. Note it's just a single bit, not a full byte. So unless you need precisely 30 bits, rather than 29, I don't really see why it'd help? And I don't see a reason to strictly keep the structure size the same. Greetings, Andres Freund
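For readers who have not looked at that struct: the flags in PgStat_KindInfo are single-bit bit-fields, so removing one frees a bit rather than a byte and normally leaves sizeof() unchanged. A simplified standalone illustration (the structs below are stand-ins, not the real PgStat_KindInfo):

#include <stdio.h>

/* Simplified stand-in for a struct using single-bit flags. */
typedef struct WithFlag
{
	unsigned int fixed_amount:1;
	unsigned int accessed_across_databases:1;
	unsigned int named_on_disk:1;	/* the flag being removed */
	unsigned int other_bits:29;
} WithFlag;

typedef struct WithoutFlag
{
	unsigned int fixed_amount:1;
	unsigned int accessed_across_databases:1;
	unsigned int other_bits:30;		/* the freed bit can widen another field */
} WithoutFlag;

int
main(void)
{
	/* Typically both print 4: dropping a 1-bit field does not shrink the struct. */
	printf("with flag:    %zu bytes\n", sizeof(WithFlag));
	printf("without flag: %zu bytes\n", sizeof(WithoutFlag));
	return 0;
}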
RE: AIX support
Hi Team, We are working on trimming the AIX-related changes. As of now we have trimmed the changes with respect to XLC, and with the trimmed changes the buildfarm script passes (build and all the regression tests). The XLC changes were trimmed only in the files below: modified: configure modified: configure.ac We are looking further into the other file changes as well. Warm regards, Sriram.
Re: XACT_EVENT for 'commit prepared'
Hi, On 2024-06-07 11:19:40 -0400, Tom Lane wrote: > Xiaoran Wang writes: > > I found that in enum XactEvent, there is 'XACT_EVENT_PREPARE' for > > 'prepare transaction', but there is no event for 'commit prepared' or > > 'rollback prepared'. > > On the whole, it seems like a good idea to me that those commands > don't invoke event triggers. It is a core principle of 2PC that > if 'prepare' succeeded, 'commit prepared' must not fail. Invoking a > trigger during the second step would add failure cases and I'm not > sure what value it has. Event triggers? Isn't this about RegisterXactCallback? XACT_EVENT_COMMIT is called after the commit record has been flushed and the procarray has been modified. Thus a failure in the hook has somewhat limited consequences. I'd assume XACT_EVENT_COMMIT_PREPARED would do something similar. I suspect the reason we don't have a callback for 2PC commit/rollback prepared is simple: the code for doing a 2PC commit prepared lives in twophase.c, not xact.c... Greetings, Andres Freund
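For context, the mechanism in question is RegisterXactCallback(), which lets an extension hook transaction-end events. A rough sketch of a module using it follows; XACT_EVENT_COMMIT_PREPARED is marked as hypothetical because, as discussed above, no such event currently exists:

#include "postgres.h"

#include "access/xact.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/*
 * Transaction-end hook.  With the current XactEvent enum this fires for
 * plain COMMIT/ABORT/PREPARE, but nothing is fired from twophase.c when
 * COMMIT PREPARED / ROLLBACK PREPARED run -- which is the gap discussed
 * above.
 */
static void
my_xact_callback(XactEvent event, void *arg)
{
	switch (event)
	{
		case XACT_EVENT_COMMIT:
			elog(DEBUG1, "transaction committed");
			break;
		case XACT_EVENT_PREPARE:
			elog(DEBUG1, "transaction prepared");
			break;

			/*
			 * A hypothetical XACT_EVENT_COMMIT_PREPARED (not in core today)
			 * would be handled here if it were ever added.
			 */

		default:
			break;
	}
}

void
_PG_init(void)
{
	RegisterXactCallback(my_xact_callback, NULL);
}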
Re: Postgresql OOM
Hi, On 2024-06-06 15:25:25 +0300, Radu Radutiu wrote: > I have a query that forces an out of memory error, where the OS will kill > the postgresql process. FWIW, it can be useful to configure the OS with strict memory overcommit. That causes postgres to fail more gracefully, because the OOM killer won't be invoked. > The query plan (run immediately after a vacuum analyze) is at > https://explain.depesz.com/s/ITQI#html . Can you get EXPLAIN (ANALYZE, BUFFERS) to complete if you reduce the number of workers? It'd be useful to get some of the information about the actual numbers of tuples etc. Greetings, Andres Freund
Re: Postgresql OOM
Hi, On 2024-06-06 13:58:24 +0100, Pantelis Theodosiou wrote: > I am not qualified to answer on the OOM issue but why are you joining the > same table (outputrequest) 4 times (using an identical join condition)? The conditions aren't actually the same: rpl_rec_tro.input_sequence = r.input_sequence, rpl_snd_tro.reply_input_sequence = r.input_sequence, snd_tro.reply_input_sequence = t.input_sequence. The first two join r.input_sequence to different columns; the third one also uses reply_input_sequence but joins to t, not r. Greetings, Andres Freund
Re: Re: Add support to TLS 1.3 cipher suites and curves lists
On Fri, Jun 7, 2024 at 3:02 AM Erica Zhang wrote: > > For some security consideration, we prefer to use TLS1.3 cipher suites in our > product with some customization values instead of default value > "HIGH:MEDIUM:+3DES:!aNULL". Moreover we prefer to set a group of ecdh keys > instead of a single value. +1 for the curve list feature, at least. No opinions on the 1.3 ciphersuites half, yet. I've added this patch to my planned review for the v18 cycle. Some initial notes: - Could you separate the two features into two patches? That would make it easier for reviewers. (They can still share the same thread and CF entry.) - The "curve" APIs have been renamed "group" in newer OpenSSLs for a while now, and we should probably use those if possible. - I think parsing apart the groups list to check NIDs manually could lead to false negatives. From a docs skim, 3.0 allows providers to add their own group names, and 3.3 now supports question marks in the string to allow graceful fallbacks. - I originally thought it'd be better to just stop calling SSL_set_tmp_ecdh() entirely by default, so we could use OpenSSL's builtin list of groups. But that may have denial-of-service concerns [1]? - We should maybe look into SSL_CTX_config(), if we haven't discussed that already on the list, but that's probably a bigger tangent and doesn't need to be part of this patch. Thanks, --Jacob [1] https://www.openssl.org/blog/blog/2022/10/21/tls-groups-configuration/index.html
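For reference, the "group" spelling of that API is SSL_CTX_set1_groups_list(), which accepts the whole colon-separated list at once. A minimal standalone sketch (the group list is only an example, not a recommendation, and error handling is reduced to the bare minimum):

/* Build with something like: cc demo.c -lssl -lcrypto */
#include <stdio.h>
#include <openssl/ssl.h>

int
main(void)
{
	SSL_CTX    *ctx = SSL_CTX_new(TLS_server_method());

	if (ctx == NULL)
	{
		fprintf(stderr, "could not create SSL context\n");
		return 1;
	}

	/*
	 * Set the key-exchange groups from a colon-separated list instead of a
	 * single curve via SSL_CTX_set_tmp_ecdh().  The list here is purely
	 * illustrative.
	 */
	if (SSL_CTX_set1_groups_list(ctx, "X25519:prime256v1") != 1)
	{
		fprintf(stderr, "invalid group list\n");
		SSL_CTX_free(ctx);
		return 1;
	}

	printf("group list accepted\n");
	SSL_CTX_free(ctx);
	return 0;
}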
WIP: parallel GiST index builds
Hi, After looking into parallel builds for BRIN and GIN indexes, I was wondering if there's a way to do parallel builds for GiST too. I knew next to nothing about how GiST works, but I gave it a shot and here's what I have - the attached patch allows parallel GiST builds for the "unsorted" case (i.e. when the opclass does not include sortsupport), and does not support buffered builds. unsorted builds only Addressing only the unsorted case may seem a bit weird, but I did it this way for two reasons - parallel sort is a solved problem, and adding this to the patch seems quite straightforward. It's what btree does, for example. But I also was not very sure how common this is - we do have sort for points, but I have no idea if the PostGIS indexes define sorting etc. My guess was "no" but I've been told that's no longer true, so I guess sorted builds are more widely applicable than I thought. In any case, I'm not in a rush to parallelize sorted builds. It can be added later, as an improvement, IMHO. In fact, it's a well isolated part of the patch, which might make it a good choice for someone looking for an idea for their first patch ... buffered builds --- The lack of support for buffered builds is a very different thing. The basic idea is that we don't push the index entries all the way to the leaf pages right away, but accumulate them in buffers half-way through. This combines writes and reduces random I/O, which is nice. Unfortunately, the way it's implemented does not work with parallel builds at all - all the state is in private memory, and it assumes the worker is the only possible backend that can split the page (at which point the buffers need to be split too, etc.). But for parallel builds this is obviously not true. I'm not saying parallel builds can't do similar buffering, but it requires moving the buffers into shared memory, and introducing locking to coordinate accesses to the buffers. (Or perhaps it might be enough to only "notify" the workers about page splits, with buffers still in private memory?). Anyway, it seems far too complicated for v1. In fact, I'm not sure the buffering is entirely necessary - maybe the increase in amount of RAM makes this less of an issue? If the index can fit into shared buffers (or at least page cache), maybe the amount of extra I/O is not that bad? I'm sure there may be cases really affected by this, but maybe it's OK to tell people to disable parallel builds in those cases? gistGetFakeLSN -- One more thing - GiST disables WAL-logging during the build, and only logs it once at the end. For serial builds this is fine, because there are no concurrent splits, and so we don't need to rely on page LSNs to detect these cases (in fact, is uses a bogus value). But for parallel builds this would not work - we need page LSNs that actually change, otherwise we'd miss page splits, and the index build would either fail or produce a broken index. But the existing is_build flag affects both things, so I had to introduce a new "is_parallel" flag which only affects the page LSN part, using the gistGetFakeLSN() function, previously used only for unlogged indexes. This means we'll produce WAL during the index build (because gistGetFakeLSN() writes a trivial message into WAL). Compared to the serial builds this produces maybe 25-75% more WAL, but it's an order of magnitude less than with "full" WAL logging (is_build=false). For example, serial build of 5GB index needs ~5GB of WAL. 
A parallel build may need ~7GB, while a parallel build with "full" logging would use 50GB. I think this is a reasonable trade off. There's one "strange" thing, though - the amount of WAL decreases with the number of parallel workers. Consider for example an index on a numeric field, where the index is ~9GB, but the amount of WAL changes like this (0 workers means serial builds):
parallel workers      0      1      3      5      7
WAL (GB)            5.7    9.2    7.6    7.0    6.8
The explanation for this is fairly simple (AFAIK) - gistGetFakeLSN determines if it needs to actually assign a new LSN (and write stuff to WAL) by comparing the last LSN assigned (in a given worker) to the current insert LSN. But the current insert LSN might have been updated by some other worker, in which case we simply use that. Which means that multiple workers may use the same fake LSN, and the likelihood increases with the number of workers - and this is consistent with the observed behavior of the WAL decreasing as the number of workers increases (because more workers use the same LSN). I'm not entirely sure if this is OK or a problem. I was worried two workers might end up using the same LSN for the same page, leading to other workers not noticing the split. But after a week of pretty intensive stress testing, I haven't seen a single such failure ... If this turns out to be a problem, the fix is IMHO quite simple - it should be enough to force gistGetFak
Re: Optimizing COPY with SIMD
On Wed, Jun 5, 2024 at 3:05 PM Nathan Bossart wrote: > For pg_lfind32(), we ended up using an overlapping approach for the > vectorized case (see commit 7644a73). That appeared to help more than it > harmed in the many (admittedly branch predictor friendly) tests I ran. I > wonder if you could do something similar here. > I didn't entirely follow what you are suggesting here -- seems like we would need to do strlen() for the non-SIMD case if we tried to use a similar approach. It'd be interesting to see the threshold where your patch starts winning. > IIUC the vector stuff won't take effect until there are 16 bytes to > process. If we don't expect attributes to ordinarily be >= 16 bytes, it > might be worth trying to mitigate this ~3% regression. Maybe we can find > some other small gains elsewhere to offset it. > For the particular short-strings benchmark I have been using (3 columns with 8-character ASCII strings in each), I suspect the regression is caused by the need to do a strlen(), rather than the vectorized loop itself (we skip the vectorized loop anyway because sizeof(Vector8) == 16 on this machine). (This explains why we see a regression on short strings for text but not CSV: CSV needed to do a strlen() for the non-quoted-string case regardless). Unfortunately this makes it tricky to make the optimization conditional on the length of the string. I suppose we could play some games where we start with a byte-by-byte loop and then switch over to the vectorized path (and take a strlen()) if we have seen more than, say, sizeof(Vector8) bytes so far. Seems a bit kludgy though. I will do some more benchmarking and report back. For the time being, I'm not inclined to push to get the CopyAttributeOutTextVector() into the tree in its current state, as I agree that the short-attribute case is quite important. In the meantime, attached is a revised patch series. This uses SIMD to optimize CopyReadLineText in COPY FROM. Performance results: master @ 8fea1bd5411b: Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-long-strings.sql Time (mean ± σ): 1.944 s ± 0.013 s[User: 0.001 s, System: 0.000 s] Range (min … max):1.927 s … 1.975 s10 runs Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-short-strings.sql Time (mean ± σ): 1.021 s ± 0.017 s[User: 0.002 s, System: 0.001 s] Range (min … max):1.005 s … 1.053 s10 runs master + SIMD patches: Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-long-strings.sql Time (mean ± σ): 1.513 s ± 0.022 s[User: 0.001 s, System: 0.000 s] Range (min … max):1.493 s … 1.552 s10 runs Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-short-strings.sql Time (mean ± σ): 1.032 s ± 0.032 s[User: 0.002 s, System: 0.001 s] Range (min … max):1.009 s … 1.113 s10 runs Neil v4-0005-Optimize-COPY-TO-in-text-format-using-SIMD.patch Description: Binary data v4-0003-Cosmetic-code-cleanup-for-CopyReadLineText.patch Description: Binary data v4-0004-Optimize-COPY-TO-in-CSV-format-using-SIMD.patch Description: Binary data v4-0002-Improve-COPY-test-coverage-for-handling-of-contro.patch Description: Binary data v4-0001-Adjust-misleading-comment-placement.patch Description: Binary data v4-0006-Optimize-COPY-FROM-using-SIMD.patch Description: Binary data
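To make the "start byte-by-byte, then switch to the vectorized path once sizeof(Vector8) bytes have been seen" idea concrete, here is a rough standalone sketch. The vectorized scan is replaced by an ordinary loop so the example stays self-contained; it only illustrates the control flow being described, not the actual COPY code or the Vector8 helpers:

#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define VEC_WIDTH 16			/* stand-in for sizeof(Vector8) */

/*
 * Stand-in for the vectorized scan: the real patch would use the Vector8
 * helpers from port/simd.h; here it is an ordinary loop.
 */
static const char *
scan_special_vectorized(const char *p, const char *end)
{
	for (; p < end; p++)
		if (*p == '\t' || *p == '\n' || *p == '\\')
			return p;
	return end;
}

/*
 * Scan a NUL-terminated attribute value for characters needing escaping.
 * Start byte by byte; only once we've seen VEC_WIDTH ordinary bytes do we
 * pay for strlen() and hand the rest to the (would-be) vectorized path.
 */
static const char *
scan_attribute(const char *s)
{
	const char *p = s;

	for (int i = 0; i < VEC_WIDTH; i++, p++)
	{
		if (*p == '\0')
			return p;			/* short value: never needed strlen() */
		if (*p == '\t' || *p == '\n' || *p == '\\')
			return p;			/* found a special character */
	}

	/* Long value: now it's worth computing the length and vectorizing. */
	return scan_special_vectorized(p, s + strlen(s));
}

int
main(void)
{
	const char *s1 = "short";
	const char *s2 = "a much longer attribute value\twith a tab";

	printf("special/end at offset %td and %td\n",
		   scan_attribute(s1) - s1, scan_attribute(s2) - s2);
	return 0;
}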
Re: Postgresql OOM
On Fri, Jun 7, 2024 at 7:59 PM Andres Freund wrote: > Hi, > > On 2024-06-06 15:25:25 +0300, Radu Radutiu wrote: > > I have a query that forces an out of memory error, where the OS will kill > > the postgresql process. > > FWIW, it can be useful to configure the OS with strict memory overcommit. > That > causes postgres to fail more gracefully, because the OOM killer won't be > invoked. > > > > The query plan (run immediately after a vacuum analyze) is at > > https://explain.depesz.com/s/ITQI#html . > > Can you get EXPLAIN (ANALYZE, BUFFERS) to complete if you reduce the > number of > workers? It'd be useful to get some of the information about the actual > numbers of tuples etc. > > > Hi, I've tried first giving more memory to the OS and mounting a tmpfs in pgsql_tmp. It didn't work, I got ERROR: invalid DSA memory alloc request size 1140850688 CONTEXT: parallel worker I've seen around 2 million temporary files created before the crash. With work_mem 100MB I was not able to get it to work with 2 parallel workers. Next, I've increased work_mem to 200MB and now (with extra memory and tmpfs) it finished: https://explain.depesz.com/s/NnRC Radu
Re: Sort functions with specialized comparators
Hello all. I am interested in the proposed patch and would like to propose some additional changes that would complement it. My changes would introduce similar optimizations when working with a list of integers or object identifiers. Additionally, my patch includes an extension for benchmarking, which shows an average speedup of 30-40%. postgres=# SELECT bench_oid_sort(100); bench_oid_sort Time taken by list_sort: 116990848 ns, Time taken by list_oid_sort: 80446640 ns, Percentage difference: 31.24% (1 row) postgres=# SELECT bench_int_sort(100); bench_int_sort Time taken by list_sort: 118168506 ns, Time taken by list_int_sort: 80523373 ns, Percentage difference: 31.86% (1 row) What do you think about these changes? Best regards, Stepan Neretin. On Fri, Jun 7, 2024 at 11:08 PM Andrey M. Borodin wrote: > Hi! > > In a thread about sorting comparators[0] Andres noted that we have > infrastructure to help compiler optimize sorting. PFA attached PoC > implementation. I've checked that it indeed works on the benchmark from > that thread. > > postgres=# CREATE TABLE arrays_to_sort AS >SELECT array_shuffle(a) arr >FROM >(SELECT ARRAY(SELECT generate_series(1, 100)) a), >generate_series(1, 10); > > postgres=# SELECT (sort(arr))[1] FROM arrays_to_sort; -- original > Time: 990.199 ms > postgres=# SELECT (sort(arr))[1] FROM arrays_to_sort; -- patched > Time: 696.156 ms > > The benefit seems to be on the order of magnitude with 30% speedup. > > There's plenty of sorting by TransactionId, BlockNumber, OffsetNumber, Oid > etc. But this sorting routines never show up in perf top or something like > that. > > Seems like in most cases we do not spend much time in sorting. But > specialization does not cost us much too, only some CPU cycles of a > compiler. I think we can further improve speedup by converting inline > comparator to value extractor: more compilers will see what is actually > going on. But I have no proofs for this reasoning. > > What do you think? > > > Best regards, Andrey Borodin. > > [0] > https://www.postgresql.org/message-id/flat/20240209184014.sobshkcsfjix6u4r%40awork3.anarazel.de#fc23df2cf314bef35095b632380b4a59 > From 74bad4bbcff9ea4a9a68f91618c84854dab24701 Mon Sep 17 00:00:00 2001 From: Stepan Neretin Date: Sat, 8 Jun 2024 01:29:42 +0700 Subject: [PATCH v42 6/6] Implemented benchmarking for optimized sorting This commit adds benchmarking functions to compare the performance of two list sorting operations: bench_int_sort and bench_oid_sort. These functions measure the execution time of sorting lists of integers and OIDs, respectively, using different algorithms (list_sort and custom sorting functions). Random lists of specified sizes are generated, sorted using both methods, and their execution times are recorded. The percentage difference in execution time between the two methods is also calculated. This commit aims to provide insights into the efficiency of the sorting algorithms used. 
--- contrib/Makefile | 1 + contrib/bench_sort_improvements/Makefile | 20 contrib/bench_sort_improvements/bench.c | 105 ++ .../bench_sort_improvements--1.0.sql | 3 + .../bench_sort_improvements.control | 5 + 5 files changed, 134 insertions(+) create mode 100644 contrib/bench_sort_improvements/Makefile create mode 100644 contrib/bench_sort_improvements/bench.c create mode 100644 contrib/bench_sort_improvements/bench_sort_improvements--1.0.sql create mode 100644 contrib/bench_sort_improvements/bench_sort_improvements.control diff --git a/contrib/Makefile b/contrib/Makefile index abd780f277..a1ee9defc2 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -10,6 +10,7 @@ SUBDIRS = \ auto_explain \ basic_archive \ basebackup_to_shell \ + bench_sort_improvements \ bloom \ btree_gin \ btree_gist \ diff --git a/contrib/bench_sort_improvements/Makefile b/contrib/bench_sort_improvements/Makefile new file mode 100644 index 00..46458ee76c --- /dev/null +++ b/contrib/bench_sort_improvements/Makefile @@ -0,0 +1,20 @@ +MODULE_big = bench_sort_improvements + +OBJS = \ + $(WIN32RES) \ + bench.o + +EXTENSION = bench_sort_improvements + +DATA = bench_sort_improvements--1.0.sql + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/bench_sort_improvements +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/bench_sort_improvements/bench.c b/contrib/bench_sort_improvements/bench.c new file mode
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Thu, Jun 6, 2024 at 3:27 PM Jelte Fennema-Nio wrote: > Of course there's always the possibility to review more. But I don't > really agree with this summary of my review activity. Nonetheless, I need to take a break from this to work on some of my own stuff. I'll circle back around to it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: altering a column's collation leaves an invalid foreign key
jian he writes: >> * in TryReuseForeignKey, we can pass the information that our primary >> key old collation is nondeterministic >> and old collation != new collation to the foreign key constraint. I have a basic question about this: why are we allowing FKs to be based on nondeterministic collations at all? ISTM that that breaks the assumption that there is exactly one referenced row for any referencing row. regards, tom lane
Re: PgStat_KindInfo.named_on_disk not required in shared stats
On Fri, Jun 07, 2024 at 08:30:06AM -0700, Andres Freund wrote: > Yes, makes sense. Looks we changed direction during development a bunch of > times...q Thanks for looking, Andres! I guess I'll just apply that once v18 opens up. -- Michael signature.asc Description: PGP signature
Re: Assert in heapgettup_pagemode() fails due to underlying buffer change
On Sat, Jun 8, 2024 at 12:47 AM Robert Haas wrote: > > On Fri, Jun 7, 2024 at 4:05 AM Alvaro Herrera wrote: > > > static void > > > -ZeroBuffer(Buffer buffer, ReadBufferMode mode) > > > +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero) > > > > This change makes the API very strange. Should the function be called > > ZeroAndLockBuffer() instead? Then the addition of a "bool zero" > > argument makes a lot more sense. > > I agree that's better, but it still looks a bit weird. You have to > realize that 'bool zero' means 'is already zeroed' here -- or at > least, I guess that's the intention. But then I wonder why you'd call > a function called ZeroAndLockBuffer if all you need to do is > LockBuffer. The name weirdness comes directly from RBM_ZERO_AND_LOCK (the fact that it doesn't always zero despite shouting ZERO is probably what temporarily confused me). But coming up with a better name is hard and I certainly don't propose to change it now. I think it's reasonable for this internal helper function to have that matching name as Alvaro suggested, with a good comment about that. Even though that quick-demonstration change fixed the two reported repros, I think it is still probably racy (or if it isn't, it relies on higher level interlocking that I don't want to rely on). This case really should be using the standard StartBufferIO/TerminateBufferIO infrastructure as it was before. I had moved that around to deal with multi-block I/O, but dropped the ball on the zero case... sorry. Here's a version like that. The "zero" argument (yeah that was not a good name) is now inverted and called "already_valid", but it's only a sort of optimisation for the case where we already know for sure that it's valid. If it isn't, we do the standard BM_IO_IN_PROGRESS/BM_VALID/CV dance, for correct interaction with any concurrent read or zero operation. v2-0001-Fix-RBM_ZERO_AND_LOCK.patch Description: Binary data
Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed
Hello Bertrand and Michael, 23.01.2024 11:07, Bertrand Drouvot wrote: On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote: Anyway, that's not the end of it. What should we do for snapshot snapshot records coming from the bgwriter? What about? 3) depending on how stabilized this test (and others that suffer from "random" xl_running_xacts) is, then think about the bgwriter. A recent buildfarm failure [1] reminds me of that remaining question. Here we have a slow machine (a successful run, for example [2], shows 541.13s duration of the test) and the following information logged: [13:55:13.725](34.411s) ok 25 - inactiveslot slot invalidation is logged with vacuum on pg_class [13:55:13.727](0.002s) not ok 26 - activeslot slot invalidation is logged with vacuum on pg_class [13:55:13.728](0.001s) # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class' # at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229. [14:27:42.995](1949.267s) # poll_query_until timed out executing this query: # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb' # expecting this output: # t # last actual query output: # f # with stderr: [14:27:42.999](0.004s) not ok 27 - confl_active_logicalslot updated [14:27:43.000](0.001s) # Failed test 'confl_active_logicalslot updated' # at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235. Timed out waiting confl_active_logicalslot to be updated at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235. --- 035_standby_logical_decoding_standby.log: 2024-06-06 13:55:07.715 UTC [9172:7] LOG: invalidating obsolete replication slot "row_removal_inactiveslot" 2024-06-06 13:55:07.715 UTC [9172:8] DETAIL: The slot conflicted with xid horizon 754. 2024-06-06 13:55:07.715 UTC [9172:9] CONTEXT: WAL redo at 0/4020A80 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 754, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 1, nunused: 0, dead: [48]; blkref #0: rel 1663/16384/2610, blk 0 2024-06-06 13:55:14.372 UTC [7532:1] [unknown] LOG: connection received: host=127.0.0.1 port=55328 2024-06-06 13:55:14.381 UTC [7532:2] [unknown] LOG: connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi (C:/prog/bf/root/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_hba.conf:2) 2024-06-06 13:55:14.381 UTC [7532:3] [unknown] LOG: connection authorized: user=pgrunner database=postgres application_name=035_standby_logical_decoding.pl 2024-06-06 13:55:14.443 UTC [7532:4] 035_standby_logical_decoding.pl LOG: statement: select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb' 2024-06-06 13:55:14.452 UTC [7532:5] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.090 user=pgrunner database=postgres host=127.0.0.1 port=55328 # (there is no `invalidating obsolete replication slot "row_removal_activeslot"` message) ... 
2024-06-06 14:27:42.675 UTC [4032:4] 035_standby_logical_decoding.pl LOG: statement: select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb' 2024-06-06 14:27:42.681 UTC [4032:5] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.080 user=pgrunner database=postgres host=127.0.0.1 port=58713 2024-06-06 14:27:43.095 UTC [7892:2] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. It's hard to determine from this info, why row_removal_activeslot was not invalidated, but running this test on a slowed down Windows VM, I (still) get the same looking failures caused by RUNNING_XACTS appeared just before `invalidating obsolete replication slot "row_removal_inactiveslot"`. So I would consider this failure as yet another result of bgwriter activity and add it to the list of known failures as such... [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-06-06%2012%3A36%3A11 [2] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2024-06-05%2017%3A03%3A13&stg=misc-check Best regards, Alexander
Re: altering a column's collation leaves an invalid foreign key
On Sat, Jun 8, 2024 at 4:12 AM Tom Lane wrote: > > jian he writes: > >> * in TryReuseForeignKey, we can pass the information that our primary > >> key old collation is nondeterministic > >> and old collation != new collation to the foreign key constraint. > > I have a basic question about this: why are we allowing FKs to be > based on nondeterministic collations at all? ISTM that that breaks > the assumption that there is exactly one referenced row for any > referencing row. > For FKs based on nondeterministic collations, I think that would require the PRIMARY KEY collation to be nondeterministic as well. For example: CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1'); DROP TABLE IF EXISTS fktable, pktable; CREATE TABLE pktable (x text COLLATE ignore_accent_case PRIMARY KEY); CREATE TABLE fktable (x text REFERENCES pktable on update cascade on delete cascade); INSERT INTO pktable VALUES ('A'); INSERT INTO fktable VALUES ('a'); INSERT INTO fktable VALUES ('A'); update pktable set x = 'Å'; table fktable; If the FK collation is nondeterministic, then the PK starts to behave more like an FK: in the following example, one FK row is referenced by two PK rows. DROP TABLE IF EXISTS fktable, pktable; CREATE TABLE pktable (x text COLLATE "C" PRIMARY KEY); CREATE TABLE fktable (x text COLLATE ignore_accent_case REFERENCES pktable on update cascade on delete cascade); INSERT INTO pktable VALUES ('A'), ('Å'); INSERT INTO fktable VALUES ('A'); begin; delete from pktable where x = 'Å'; TABLE fktable; rollback; begin; delete from pktable where x = 'A'; TABLE fktable; rollback;
New function normal_rand_array function to contrib/tablefunc.
Here is a new function which could produce an array of numbers with a controllable array length and duplicated elements in these arrays. I used it when working with gin index, and I think it would be helpful for others as well, so here it is. select * from normal_rand_array(5, 10, 1.8::numeric, 3.5::numeric); normal_rand_array --- {3.3,2.3,2.7,3.2,2.0,2.7,3.4,2.7,2.3,2.9} {3.3,1.8,2.9,3.4,2.0,1.8,2.0,3.5,2.8,2.5} {2.1,1.9,2.3,1.9,2.5,2.7,2.4,2.9,1.8} {2.3,2.5,2.4,2.7,2.7,2.3,2.9,3.3,3.3,1.9,3.5} {2.8,3.4,2.7,1.8,3.3,2.3,2.2,3.5,2.6,2.5} (5 rows) select * from normal_rand_array(5, 10, 1.8::int4, 3.5::int4); normal_rand_array - {3,2,2,3,4,2} {2,4,2,3,3,3,3,2,2,3,3,2,3,2} {2,4,3} {4,2,3,4,2,4,2,2,3,4,3,3,2,4,4,2,3} {4,3,3,4,3,3,4,2,4} (5 rows) the 5 means it needs to produce 5 rows in total and the 10 is the average array length, and 1.8 is the minvalue for the random function and 3.5 is the maxvalue. -- Best Regards Andy Fan >From 397dcaf67f29057b80aebbb6116b49ac8344547c Mon Sep 17 00:00:00 2001 From: Andy Fan Date: Sat, 8 Jun 2024 13:21:08 +0800 Subject: [PATCH v20240608 1/1] Add function normal_rand_array function to contrib/tablefunc. It can produce an array of numbers with n controllable array length and duplicated elements in these arrays. --- contrib/tablefunc/Makefile| 2 +- contrib/tablefunc/expected/tablefunc.out | 26 contrib/tablefunc/sql/tablefunc.sql | 10 ++ contrib/tablefunc/tablefunc--1.0--1.1.sql | 7 ++ contrib/tablefunc/tablefunc.c | 140 ++ contrib/tablefunc/tablefunc.control | 2 +- doc/src/sgml/tablefunc.sgml | 10 ++ src/backend/utils/adt/arrayfuncs.c| 7 ++ 8 files changed, 202 insertions(+), 2 deletions(-) create mode 100644 contrib/tablefunc/tablefunc--1.0--1.1.sql diff --git a/contrib/tablefunc/Makefile b/contrib/tablefunc/Makefile index 191a3a1d38..f0c67308fd 100644 --- a/contrib/tablefunc/Makefile +++ b/contrib/tablefunc/Makefile @@ -3,7 +3,7 @@ MODULES = tablefunc EXTENSION = tablefunc -DATA = tablefunc--1.0.sql +DATA = tablefunc--1.0.sql tablefunc--1.0--1.1.sql PGFILEDESC = "tablefunc - various functions that return tables" REGRESS = tablefunc diff --git a/contrib/tablefunc/expected/tablefunc.out b/contrib/tablefunc/expected/tablefunc.out index ddece79029..9f0cbbfbbe 100644 --- a/contrib/tablefunc/expected/tablefunc.out +++ b/contrib/tablefunc/expected/tablefunc.out @@ -12,6 +12,32 @@ SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2); -- negative number of tuples SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2); ERROR: number of rows cannot be negative +SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::numeric, 8::numeric) as i; + count |avg +---+ +10 | 3. +(1 row) + +SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int4, 8::int4) as i; + count |avg +---+ +10 | 3. +(1 row) + +SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int8, 8::int8) as i; + count |avg +---+ +10 | 3. +(1 row) + +SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::float8, 8::float8) as i; + count |avg +---+ +10 | 3. +(1 row) + +SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 'abc'::text, 'def'::text) as i; +ERROR: unsupported type 25 in normal_rand_array. 
-- -- crosstab() -- diff --git a/contrib/tablefunc/sql/tablefunc.sql b/contrib/tablefunc/sql/tablefunc.sql index 0fb8e40de2..dec57cfc66 100644 --- a/contrib/tablefunc/sql/tablefunc.sql +++ b/contrib/tablefunc/sql/tablefunc.sql @@ -8,6 +8,16 @@ SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2); -- negative number of tuples SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2); +SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::numeric, 8::numeric) as i; + +SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int4, 8::int4) as i; + +SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int8, 8::int8) as i; + +SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::float8, 8::float8) as i; + +SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 'abc'::text, 'def'::text) as i; + -- -- crosstab() -- diff --git a/contrib/tablefunc/tablefunc--1.0--1.1.sql b/contrib/tablefunc/tablefunc--1.0--1.1.sql new file m