Add isCatalogRel in rmgrdesc
Hi hackers, 6af1793954 added a new field namely "isCatalogRel" in some WAL records to help detecting row removal conflict during logical decoding from standby. Please find attached a patch to add this field in the related rmgrdesc (i.e all the ones that already provide the snapshotConflictHorizon except the one related to xl_heap_visible: indeed a new bit was added in its flag field in 6af1793954 instead of adding the isCatalogRel bool). I think it's worth it, as this new field could help diagnose conflicts issues (if any). Looking forward to your feedback, Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.comFrom 564e689cf74e1b9893a28b5ecf6bdbc6fb549232 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Mon, 11 Dec 2023 21:10:30 + Subject: [PATCH v1] adding isCatalogRel to rmgrdesc --- src/backend/access/rmgrdesc/gistdesc.c | 10 ++ src/backend/access/rmgrdesc/hashdesc.c | 5 +++-- src/backend/access/rmgrdesc/heapdesc.c | 10 ++ src/backend/access/rmgrdesc/nbtdesc.c | 10 ++ src/backend/access/rmgrdesc/spgdesc.c | 5 +++-- 5 files changed, 24 insertions(+), 16 deletions(-) 100.0% src/backend/access/rmgrdesc/ diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c index 5dc6e1dcee..8008c041a6 100644 --- a/src/backend/access/rmgrdesc/gistdesc.c +++ b/src/backend/access/rmgrdesc/gistdesc.c @@ -26,18 +26,20 @@ out_gistxlogPageUpdate(StringInfo buf, gistxlogPageUpdate *xlrec) static void out_gistxlogPageReuse(StringInfo buf, gistxlogPageReuse *xlrec) { - appendStringInfo(buf, "rel %u/%u/%u; blk %u; snapshotConflictHorizon %u:%u", + appendStringInfo(buf, "rel %u/%u/%u; blk %u; snapshotConflictHorizon %u:%u, isCatalogRel %u", xlrec->locator.spcOid, xlrec->locator.dbOid, xlrec->locator.relNumber, xlrec->block, EpochFromFullTransactionId(xlrec->snapshotConflictHorizon), - XidFromFullTransactionId(xlrec->snapshotConflictHorizon)); + XidFromFullTransactionId(xlrec->snapshotConflictHorizon), +xlrec->isCatalogRel); } static void out_gistxlogDelete(StringInfo buf, gistxlogDelete *xlrec) { - appendStringInfo(buf, "delete: snapshotConflictHorizon %u, nitems: %u", -xlrec->snapshotConflictHorizon, xlrec->ntodelete); + appendStringInfo(buf, "delete: snapshotConflictHorizon %u, nitems: %u, isCatalogRel %u", +xlrec->snapshotConflictHorizon, xlrec->ntodelete, +xlrec->isCatalogRel); } static void diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c index b6810a9320..49855ce467 100644 --- a/src/backend/access/rmgrdesc/hashdesc.c +++ b/src/backend/access/rmgrdesc/hashdesc.c @@ -113,9 +113,10 @@ hash_desc(StringInfo buf, XLogReaderState *record) { xl_hash_vacuum_one_page *xlrec = (xl_hash_vacuum_one_page *) rec; - appendStringInfo(buf, "ntuples %d, snapshotConflictHorizon %u", + appendStringInfo(buf, "ntuples %d, snapshotConflictHorizon %u, isCatalogRel %u", xlrec->ntuples, - xlrec->snapshotConflictHorizon); + xlrec->snapshotConflictHorizon, + xlrec->isCatalogRel); break; } } diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c index f382c0f623..3c17485b02 100644 --- a/src/backend/access/rmgrdesc/heapdesc.c +++ b/src/backend/access/rmgrdesc/heapdesc.c @@ -179,10 +179,11 @@ heap2_desc(StringInfo buf, XLogReaderState *record) { xl_heap_prune *xlrec = (xl_heap_prune *) rec; - appendStringInfo(buf, "snapshotConflictHorizon: %u, nredirected: %u, ndead: %u", + appendStringInfo(buf, "snapshotConflictHorizon: %u, nredirected: %u, ndead: 
%u, isCatalogRel %u", xlrec->snapshotConflictHorizon, xlrec->nredirected, -xlrec->ndead); +xlrec->ndead, +xlrec->isCatalogRel); if (XLogRecHasBlockData(record, 0)) { @@ -238,8 +239,9 @@ heap2_desc(StringInfo
planner chooses incremental but not the best one
Dear Hackers, I've come across a behaviour of the planner I can't explain. After a migration from 11 to 15 (on RDS) we noticed a degradation in response time on a query, it went from a few seconds to ten minutes. A vacuum(analyze) has been realized to be sure that all is clean. The 'explain analyze' shows us a change of plan. Postgresql 15 chooses `incremental sort` with an index corresponding to the ORDER BY clause (on the created_at column). The previous v11 plan used a more efficient index. By deactivating incremental sort, response times in v15 are equal to v11 one. Here is the query SELECT inputdocum0_.id AS col_0_0_ FROM document_management_services.input_document inputdocum0_ WHERE (inputdocum0_.indexation_domain_id in ('2d29daf6-e151-479a-a52a-78b08bb3009d')) AND (inputdocum0_.indexation_subsidiary_id in ('9f9df402-f70b-40d9-b283-a3c35232469a')) AND (inputdocum0_.locked_at IS NULL) AND (inputdocum0_.locked_by_app IS NULL) AND (inputdocum0_.locked_by_user IS NULL) AND (inputdocum0_.lock_time_out IS NULL) AND inputdocum0_.archiving_state<> 'DESTROYED' AND (inputdocum0_.creation_state in ('READY')) AND inputdocum0_.active_content=true AND (inputdocum0_.processing_state in ('PENDING_INDEXATION')) ORDER BY inputdocum0_.created_at ASC, inputdocum0_.reception_id ASC, inputdocum0_.reception_order ASC LIMIT 50 ; Here are some details, the table `input_document` is partionned by hash with 20 partitions with a lot of indexes Indexes: "input_document_pkey" PRIMARY KEY, btree (id) "input_document_api_version_idx" btree (api_version) INVALID "input_document_created_at_idx" btree (created_at) "input_document_created_by_user_profile_idx" btree (created_by_user_profile) "input_document_dashboard_idx" btree (processing_state, indexation_family_id, indexation_group_id, reception_id) INCLUDE (active_content, archiving_state, creation_state) WHERE active_content = true AND archiving_state <> 'DESTROYED'::text AND creation_state <> 'PENDING'::text "input_document_fts_description_idx" gin (to_tsvector('simple'::regconfig, description)) "input_document_fts_insured_firstname_idx" gin (to_tsvector('simple'::regconfig, indexation_insured_firstname)) "input_document_fts_insured_lastname_idx" gin (to_tsvector('simple'::regconfig, indexation_insured_lastname)) "input_document_indexation_activity_id_idx" btree (indexation_activity_id) "input_document_indexation_agency_id_idx" btree (indexation_agency_id) "input_document_indexation_distributor_id_idx" btree (indexation_distributor_id) "input_document_indexation_domain_id_idx" btree (indexation_domain_id) "input_document_indexation_family_id_idx" btree (indexation_family_id) "input_document_indexation_group_id_idx" btree (indexation_group_id) "input_document_indexation_insurer_id_idx" btree (indexation_insurer_id) "input_document_indexation_nature_id_idx" btree (indexation_nature_id) "input_document_indexation_reference_idx" btree (indexation_reference) "input_document_indexation_subsidiary_id_idx" btree (indexation_subsidiary_id) "input_document_indexation_warranty_id_idx" btree (indexation_warranty_id) "input_document_locked_by_user_idx" btree (locked_by_user) "input_document_modified_at_idx" btree (modified_at) "input_document_modified_by_user_profile_idx" btree (modified_by_user_profile) "input_document_processing_state_idx" btree (processing_state) "input_document_stock_idx" btree (active_content, archiving_state, creation_state, processing_state) WHERE active_content AND archiving_state <> 'DESTROYED'::text AND creation_state <> 'PENDING'::text AND 
(processing_state = ANY ('{PENDING_PROCESSING,PENDING_INDEXATION,READY}'::text[])) "input_dom_act_pi_idx" btree (indexation_activity_id, indexation_domain_id) WHERE processing_state = 'PENDING_INDEXATION'::text "input_dom_act_pp_idx" btree (indexation_activity_id, indexation_domain_id) WHERE processing_state = 'PENDING_PROCESSING'::text "input_dom_act_sub_idx" btree (indexation_activity_id, indexation_domain_id, indexation_subsidiary_id) "input_reception_id_created_at_idx" btree (reception_id, created_at) "input_reception_id_reception_order_idx" btree (reception_id, reception_order) "operational_perimeter_view_idx" btree (processing_state, indexation_distributor_id) WHERE processing_state = 'PENDING_PROCESSING'::text Please find attached the 3 plans: explain_analyse_incremental_off.txt with enable_incremental_sort set to off, explain_analyse_incremental_on.txt with enable_incremental_sort set to on, and explain_analyse_incremental_on_limit5000 with enable_incremental_sort set to on but with the limit increased to 5000; in that case the chosen plan does not use `Incremental Sort`. The point that I don't understand in the plan (incremental_sort to on) is the top level one, the limit co
Re: How abnormal server shutdown could be detected by tests?
On Sat, Dec 9, 2023 at 9:30 AM Alexander Lakhin wrote: > > Hello hackers, > > While studying bug #18158, I've come to the conclusion that the existing > testing infrastructure is unable to detect abnormal situations. of some > kind. > > Just a simple example: > With Assert(0) injected in walsender (say: > sed "s/WalSndDone(send_data)/Assert(0)/" -i > src/backend/replication/walsender.c > ), I observe the following: > $ make -s check -C src/test/recovery PROVE_TESTS="t/012*" > > t/012_subtransactions.pl .. ok > All tests successful. > > (The same with meson: > 1/1 postgresql:recovery / recovery/012_subtransactions OK > 3.24s 12 subtests passed) > > But: > $ grep TRAP src/test/recovery/tmp_check/log/* > src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373729 > src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373750 > src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373821 > src/test/recovery/tmp_check/log/012_subtransactions_standby.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373786 > > src/test/recovery/tmp_check/log/012_subtransactions_primary.log contains: > ... > 2023-12-09 03:23:20.210 UTC [375933] LOG: server process (PID 375975) was > terminated by signal 6: Aborted > 2023-12-09 03:23:20.210 UTC [375933] DETAIL: Failed process was running: > START_REPLICATION 0/300 TIMELINE 3 > 2023-12-09 03:23:20.210 UTC [375933] LOG: terminating any other active > server processes > 2023-12-09 03:23:20.210 UTC [375933] LOG: abnormal database system shutdown > 2023-12-09 03:23:20.211 UTC [375933] LOG: database system is shut down > ... > > So the shutdown definitely was considered abnormal, but we missed that. > > With log_min_messages = DEBUG3, I can see in the log: > 2023-12-09 03:26:50.145 UTC [377844] LOG: abnormal database system shutdown > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: shmem_exit(1): 0 > before_shmem_exit callbacks to make > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: shmem_exit(1): 5 on_shmem_exit > callbacks to make > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: cleaning up orphaned dynamic > shared memory with ID 2898643884 > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: cleaning up dynamic shared > memory control segment with ID 3446966170 > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: proc_exit(1): 2 callbacks to make > 2023-12-09 03:26:50.146 UTC [377844] LOG: database system is shut down > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: exit(1) > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: shmem_exit(-1): 0 > before_shmem_exit callbacks to make > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: shmem_exit(-1): 0 on_shmem_exit > callbacks to make > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: proc_exit(-1): 0 callbacks to > make > > The postmaster process exits with exit code 1, but pg_ctl can't get the > code and just reports that stop was completed successfully. > For what it's worth, there is another thread which stated the similar problem: https://www.postgresql.org/message-id/flat/2366244.1651681550%40sss.pgh.pa.us > Should this be improved? And if yes, how it can be done? > Maybe postmaster shouldn't remove it's postmaster.pid when it exits > abnormally to let pg_ctl know of it? > thanks Shveta
Re: Synchronizing slots from primary to standby
On Mon, Dec 11, 2023 at 7:12 PM Amit Kapila wrote: > > On Mon, Dec 11, 2023 at 2:41 PM shveta malik wrote: > > > > > > > > 5. > > > +synchronize_slots(WalReceiverConn *wrconn) > > > { > > > ... > > > ... > > > + /* The syscache access needs a transaction env. */ > > > + StartTransactionCommand(); > > > + > > > + /* > > > + * Make result tuples live outside TopTransactionContext to make them > > > + * accessible even after transaction is committed. > > > + */ > > > + MemoryContextSwitchTo(oldctx); > > > + > > > + /* Construct query to get slots info from the primary server */ > > > + initStringInfo(&s); > > > + construct_slot_query(&s); > > > + > > > + elog(DEBUG2, "slot sync worker's query:%s \n", s.data); > > > + > > > + /* Execute the query */ > > > + res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow); > > > + pfree(s.data); > > > + > > > + if (res->status != WALRCV_OK_TUPLES) > > > + ereport(ERROR, > > > + (errmsg("could not fetch failover logical slots info " > > > + "from the primary server: %s", res->err))); > > > + > > > + CommitTransactionCommand(); > > > ... > > > ... > > > } > > > > > > Where exactly in the above code, there is a syscache access as > > > mentioned above StartTransactionCommand()? > > > > > > > It is in walrcv_exec (libpqrcv_processTuples). I have changed the > > comments to add this info. > > > > Okay, I see that the patch switches context twice once after starting > the transaction and the second time after committing the transaction, > why is that required? Also, can't we extend the duration of the > transaction till the remote_slot information is constructed? If we extend duration, we have to extend till remote_slot information is consumed and not only till it is constructed. > I am > asking this because the context used is TopMemoryContext which should > be used only if we need something specific to be retained at the > process level which doesn't seem to be the case here. > Okay, I understand your concern. But this needs more thoughts on shall we have all the slots synchronized in one txn or is it better to have it existing way i.e. each slot being synchronized in its own txn started in synchronize_one_slot. If we go by the former, can it have any implications? I need to review this bit more before concluding. . > I have noticed a few other minor things: > 1. > postgres=# select * from pg_logical_slot_get_changes('log_slot_2', NULL, > NULL); > ERROR: cannot use replication slot "log_slot_2" for logical decoding > DETAIL: This slot is being synced from the primary server. > ... > ... > postgres=# select * from pg_drop_replication_slot('log_slot_2'); > ERROR: cannot drop replication slot "log_slot_2" > DETAIL: This slot is being synced from the primary. > > I think the DETAIL message should be the same in the above two cases. > > 2. > +void > +WalSndWaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) > +{ > + List*standby_slots; > + > + Assert(!am_walsender); > + > + if (!MyReplicationSlot->data.failover) > + return; > + > + standby_slots = GetStandbySlotList(true); > + > + ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv); > ... > ... > > Shouldn't we return if the standby slot names list is NIL unless there > is a reason to do ConditionVariablePrepareToSleep() or any of the code > following it? > > -- > With Regards, > Amit Kapila.
Re: Add isCatalogRel in rmgrdesc
On Tue, Dec 12, 2023 at 09:23:46AM +0100, Drouvot, Bertrand wrote: > Please find attached a patch to add this field in the related rmgrdesc (i.e > all the ones that already provide the snapshotConflictHorizon except the one > related to xl_heap_visible: indeed a new bit was added in its flag field in > 6af1793954 > instead of adding the isCatalogRel bool). > > I think it's worth it, as this new field could help diagnose conflicts issues > (if any). Agreed that this is helpful. One would likely guess if you are dealing with a catalog relation depending on its relfilenode, but that does not take into account user_catalog_table that can be set as a reloption, impacting the value of isCatalogRel stored in the records. -- Michael signature.asc Description: PGP signature
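To illustrate with a hypothetical table (not taken from the thread): a perfectly ordinary user relation can become relevant for logical decoding through the reloption alone, which is exactly what the relfilenode would not reveal:

CREATE TABLE replication_metadata (id int PRIMARY KEY, payload text)
    WITH (user_catalog_table = true);

Row removal on such a table can then conflict with logical decoding on a standby, so having isCatalogRel spelled out in the rmgrdesc output should make those cases easier to spot.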
Re: Add isCatalogRel in rmgrdesc
Hi, On 12/12/23 10:15 AM, Michael Paquier wrote: On Tue, Dec 12, 2023 at 09:23:46AM +0100, Drouvot, Bertrand wrote: Please find attached a patch to add this field in the related rmgrdesc (i.e all the ones that already provide the snapshotConflictHorizon except the one related to xl_heap_visible: indeed a new bit was added in its flag field in 6af1793954 instead of adding the isCatalogRel bool). I think it's worth it, as this new field could help diagnose conflicts issues (if any). Agreed that this is helpful. Thanks for looking at it! One would likely guess if you are dealing with a catalog relation depending on its relfilenode, but that does not take into account user_catalog_table that can be set as a reloption, impacting the value of isCatalogRel stored in the records. Exactly, and that is not even mentioning the other checks in RelationIsAccessibleInLogicalDecoding(), like the wal_level >= logical one. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
RE: Synchronizing slots from primary to standby
On Monday, December 11, 2023 3:32 PM Peter Smith > > Here are some review comments for v44-0001 > > == > src/backend/replication/slot.c > > > 1. ReplicationSlotCreate > > * during getting changes, if the two_phase option is enabled it can skip > * prepare because by that time start decoding point has been moved. So > the > * user will only get commit prepared. > + * failover: Allows the slot to be synced to physical standbys so that > logical > + * replication can be resumed after failover. > */ > void > ReplicationSlotCreate(const char *name, bool db_specific, > > ~ > > /Allows the slot.../If enabled, allows the slot.../ Changed. > > == > > 2. validate_standby_slots > > +validate_standby_slots(char **newval) > +{ > + char*rawname; > + List*elemlist; > + ListCell *lc; > + bool ok = true; > + > + /* Need a modifiable copy of string */ rawname = pstrdup(*newval); > + > + /* Verify syntax and parse string into list of identifiers */ if (!(ok > + = SplitIdentifierString(rawname, ',', &elemlist))) > + GUC_check_errdetail("List syntax is invalid."); > + > + /* > + * If there is a syntax error in the name or if the replication slots' > + * data is not initialized yet (i.e., we are in the startup process), > + skip > + * the slot verification. > + */ > + if (!ok || !ReplicationSlotCtl) > + { > + pfree(rawname); > + list_free(elemlist); > + return ok; > + } > > > 2a. > You don't need to initialize 'ok' during declaration because it is assigned > immediately anyway. > > ~ > > 2b. > AFAIK assignment within a conditional like this is not a normal PG coding > style > unless there is no other way to do it. > Changed. > ~ > > 2c. > /into list/into a list/ > > SUGGESTION > /* Verify syntax and parse string into a list of identifiers */ ok = > SplitIdentifierString(rawname, ',', &elemlist); if (!ok) > GUC_check_errdetail("List syntax is invalid."); > > Changed. > ~~~ > > 3. assign_standby_slot_names > > + if (!SplitIdentifierString(standby_slot_names_cpy, ',', > + &standby_slots)) { > + /* This should not happen if GUC checked check_standby_slot_names. */ > + elog(ERROR, "list syntax is invalid"); } > > This error here and in validate_standby_slots() are different -- "list" versus > "List". > The message has been changed to "invalid list syntax" to be consistent with other elog. > == > src/backend/replication/walsender.c > > > 4. WalSndFilterStandbySlots > > > + foreach(lc, standby_slots_cpy) > + { > + char*name = lfirst(lc); > + XLogRecPtr restart_lsn = InvalidXLogRecPtr; bool invalidated = false; > + char*warningfmt = NULL; > + ReplicationSlot *slot; > + > + slot = SearchNamedReplicationSlot(name, true); > + > + if (slot && SlotIsPhysical(slot)) > + { > + SpinLockAcquire(&slot->mutex); > + restart_lsn = slot->data.restart_lsn; > + invalidated = slot->data.invalidated != RS_INVAL_NONE; > + SpinLockRelease(&slot->mutex); } > + > + /* Continue if the current slot hasn't caught up. 
*/ if (!invalidated > + && !XLogRecPtrIsInvalid(restart_lsn) && restart_lsn < wait_for_lsn) { > + /* Log warning if no active_pid for this physical slot */ if > + (slot->active_pid == 0) ereport(WARNING, errmsg("replication slot > + \"%s\" specified in parameter \"%s\" does > not have active_pid", > +name, "standby_slot_names"), > + errdetail("Logical replication is waiting on the " > + "standby associated with \"%s\"", name), errhint("Consider starting > + standby associated with " > + "\"%s\" or amend standby_slot_names", name)); > + > + continue; > + } > + else if (!slot) > + { > + /* > + * It may happen that the slot specified in standby_slot_names GUC > + * value is dropped, so let's skip over it. > + */ > + warningfmt = _("replication slot \"%s\" specified in parameter > \"%s\" does not exist, ignoring"); > + } > + else if (SlotIsLogical(slot)) > + { > + /* > + * If a logical slot name is provided in standby_slot_names, issue > + * a WARNING and skip it. Although logical slots are disallowed in > + * the GUC check_hook(validate_standby_slots), it is still > + * possible for a user to drop an existing physical slot and > + * recreate a logical slot with the same name. Since it is > + * harmless, a WARNING should be enough, no need to error-out. > + */ > + warningfmt = _("cannot have logical replication slot \"%s\" in > parameter \"%s\", ignoring"); > + } > + else if (XLogRecPtrIsInvalid(restart_lsn) || invalidated) { > + /* > + * Specified physical slot may have been invalidated, so there is no > + point > + * in waiting for it. > + */ > + warningfmt = _("physical slot \"%s\" specified in parameter \"%s\" > has been invalidated, ignoring"); > + } > + else > + { > + Assert(restart_lsn >= wait_for_lsn); > + } > > This if/else chain seems structured awkwardly. IMO it would be tidier to > eliminate the NULL slot and IsLogicalSlot up-front, which would also simplify > some of the subsequent conditions > > SUGGESTION > > slot = SearchNamedRepli
Re: logical decoding and replication of sequences, take 2
Hi, There's been a lot discussed over the past month or so, and it's become difficult to get a good idea what's the current state - what issues remain to be solved, what's unrelated to this patch, and how to move if forward. Long-running threads tend to be confusing, so I had a short call with Amit to discuss the current state yesterday, and to make sure we're on the same page. I believe it was very helpful, and I've promised to post a short summary of the call - issues, what we agreed seems like a path forward, etc. Obviously, I might have misunderstood something, in which case Amit can correct me. And I'd certainly welcome opinions from others. In general, we discussed three areas - desirability of the feature, correctness and performance. I believe a brief summary of the agreement would be this: - desirability of the feature: Random IDs (UUIDs etc.) are likely a much better solution for distributed (esp. active-active) systems. But there are important use cases that are likely to keep using regular sequences (online upgrades of single-node instances, existing systems, ...). - correctness: There's one possible correctness issue, when the snapshot changes to FULL between record creating a sequence relfilenode and that sequence advancing. This needs to be verified/reproduced, and fixed. - performance issues: We've agreed the case with a lot of aborts (when DecodeCommit consumes a lot of CPU) is unrelated to this patch. We've discussed whether the overhead with many sequence changes (nextval-40) is acceptable, and/or how to improve it. Next, I'll go over these points in more details, with my understanding of what the challenges are, possible solutions etc. Most of this was discussed/agreed on the call, but some are ideas I had only after the call when writing this summary. 1) desirability of the feature Firstly, do we actually want/need this feature? I believe that's very much a question of what use cases we're targeting. If we only focus on distributed databases (particularly those with multiple active nodes), then we probably agree that the right solution is to not use sequences (~generators of incrementing values) but UUIDs or similar random identifiers (better not call them sequences, there's not much sequential about it). The huge advantage is this does not require replicating any state between the nodes, so logical decoding can simply ignore them and replicate just the generated values. I don't think there's any argument about that. If I as building such distributed system, I'd certainly use such random IDs. The question is what to do about the other use cases - online upgrades relying on logical decoding, failovers to logical replicas, and so on. Or what to do about existing systems that can't be easily changed to use different/random identifiers. Those are not really distributed systems and therefore don't quite need random IDs. Furthermore, it's not like random IDs have no drawbacks - UUIDv4 can easily lead to massive write amplification, for example. There are variants like UUIDv7 reducing the impact, but there's other stuff. My takeaway from this is there's still value in having this feature. 2) correctness The only correctness issue I'm aware of is the question what happens when the snapshot switches to SNAPBUILD_FULL_SNAPSHOT between decoding the relfilenode creation and the sequence increment, pointed out by Dilip in [1]. 
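For context, a minimal sketch (not from the thread) of the two kinds of sequence changes the decoder has to tell apart, assuming the patch's rule that a change is transactional when the sequence's relfilenode was created by the same, still-running transaction:

BEGIN;
CREATE SEQUENCE s;       -- relfilenode created by this transaction
SELECT nextval('s');     -- transactional change: replayed only if the transaction commits
COMMIT;

SELECT nextval('s');     -- relfilenode predates this transaction:
                         -- non-transactional, decoded and applied immediately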
If this happens (and while I don't have a reproducer, I also don't have a very clear idea why it couldn't happen), it breaks how the patch decides between transactional and non-transactional sequence changes. So this seems like a fatal flaw - it definitely needs to be solved. I don't have a good idea how to do that, unfortunately. The problem is the dependency on an earlier record, and that this needs to be evaluated immediately (in the decode phase). Logical messages don't have the same issue because the "transactional" flag does not depend on earlier stuff, and other records are not interpreted until apply/commit, when we know everything relevant was decoded. I don't know what the solution is. Either we find a way to make sure not to lose/skip the smgr record, or we need to rethink how we determine the transactional flag (perhaps even try again adding it to the WAL record, but we didn't find a way to do that earlier). 3) performance issues We have discussed two cases - "ddl-abort" and "nextval-40". The "ddl-abort" is when the workload does a lot of DDL and then aborts them, leading to profiles dominated by DecodeCommit. The agreement here is that while this is a valid issue and we should try fixing it, it's unrelated to this patch. The issue exists even on master. So in the context of this patch we can ignore this issue. The "nextval-40" applies to workloads doing a lot of regular sequence changes. We only decode/apply changes written to WAL, and that happens only for every 32 increments or so. The test was with a very simple transaction (just seque
Re: Add --check option to pgindent
> On 12 Dec 2023, at 01:09, Tristan Partin wrote: > > Not sold on the name, but --check is a combination of --silent-diff and > --show-diff. I envision --check mostly being used in CI environments. I > recently came across a situation where this behavior would have been useful. > Without --check, you're left to capture the output of --show-diff and exit 2 > if the output isn't empty by yourself. I wonder if we should model this around the semantics of git diff to keep it similar to other CI jobs which often use git diff? git diff --check means "are there conflicts or issues" which isn't really comparable to here, git diff --exit-code however is pretty much exactly what this is trying to accomplish. That would make pgindent --show-diff --exit-code exit with 1 if there were diffs and 0 if there are no diffs. -- Daniel Gustafsson
Re: Add --check option to pgindent
Hi, On Tue, Dec 12, 2023 at 11:18:59AM +0100, Daniel Gustafsson wrote: > > On 12 Dec 2023, at 01:09, Tristan Partin wrote: > > > > Not sold on the name, but --check is a combination of --silent-diff > > and --show-diff. I envision --check mostly being used in CI > > environments. I recently came across a situation where this behavior > > would have been useful. Without --check, you're left to capture the > > output of --show-diff and exit 2 if the output isn't empty by > > yourself. > > I wonder if we should model this around the semantics of git diff to > keep it similar to other CI jobs which often use git diff? git diff > --check means "are there conflicts or issues" which isn't really > comparable to here, git diff --exit-code however is pretty much > exactly what this is trying to accomplish. > > That would make pgindent --show-diff --exit-code exit with 1 if there > were diffs and 0 if there are no diffs. To be honest, I find that rather convoluted; contrary to "git diff", I believe the primary action of pgident is not to show diffs, so I find the proposed --check option to be entirely reasonable from a UX perspective. On the other hand, tying a "does this need re-indenting?" question to a "--show-diff --exit-code" option combination is not very obvious (to me, at least). Cheers, Michael
[meson] expose buildtype debug/optimization info to pg_config
The configure-based build system sets VAL_CFLAGS with the debug and optimization flags, so pg_config reports this information. Some extensions depend on that mechanism. This patch exposes the same flags under meson, with a typo fixed along the way. -- Regards Junwang Zhao 0001-meson-expose-buildtype-debug-optimization-info-to-pg.patch Description: Binary data
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Tue, Dec 12, 2023 at 10:27:09AM +0530, Dilip Kumar wrote: > Oops, I only included the code changes where I am adding injection > points and some comments to verify that, but missed the actual test > file. Attaching it here. I see. Interesting that this requires persistent connections to work. That's something I've found clunky to rely on when the scenarios a test needs to deal with are rather complex. That's an area that could be made easier to use outside of this patch.. Something got proposed by Andrew Dunstan to make the libpq routines usable through a perl module, for example. > Note: I think the latest patches are conflicting with the head, can you > rebase? Indeed, as per the recent manipulations in ipci.c for the shmem initialization areas. Here goes a v6. -- Michael From 1e11c0400bf802834792f3e9c897b17a3d14bca1 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 12 Dec 2023 11:35:24 +0100 Subject: [PATCH v6 1/4] Add backend facility for injection points This adds a set of routines allowing developers to attach, detach and run custom code based on arbitrary code paths set with a centralized macro called INJECTION_POINT(). Injection points are registered in a shared hash table. Processes also use a local cache to over loading callbacks more than necessary, cleaning up their cache if a callback has found to be removed. --- src/include/pg_config.h.in| 3 + src/include/utils/injection_point.h | 36 ++ src/backend/storage/ipc/ipci.c| 3 + src/backend/storage/lmgr/lwlocknames.txt | 1 + .../utils/activity/wait_event_names.txt | 1 + src/backend/utils/misc/Makefile | 1 + src/backend/utils/misc/injection_point.c | 317 ++ src/backend/utils/misc/meson.build| 1 + doc/src/sgml/installation.sgml| 30 ++ doc/src/sgml/xfunc.sgml | 56 configure | 34 ++ configure.ac | 7 + meson.build | 1 + meson_options.txt | 3 + src/Makefile.global.in| 1 + src/tools/pgindent/typedefs.list | 2 + 16 files changed, 497 insertions(+) create mode 100644 src/include/utils/injection_point.h create mode 100644 src/backend/utils/misc/injection_point.c diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 5f16918243..288bb9cb42 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -698,6 +698,9 @@ /* Define to build with ICU support. (--with-icu) */ #undef USE_ICU +/* Define to 1 to build with injection points. (--enable-injection-points) */ +#undef USE_INJECTION_POINTS + /* Define to 1 to build with LDAP support. (--with-ldap) */ #undef USE_LDAP diff --git a/src/include/utils/injection_point.h b/src/include/utils/injection_point.h new file mode 100644 index 00..6335260fea --- /dev/null +++ b/src/include/utils/injection_point.h @@ -0,0 +1,36 @@ +/*- + * injection_point.h + * Definitions related to injection points. + * + * Copyright (c) 2001-2023, PostgreSQL Global Development Group + * + * src/include/utils/injection_point.h + * -- + */ +#ifndef INJECTION_POINT_H +#define INJECTION_POINT_H + +/* + * Injections points require --enable-injection-points. + */ +#ifdef USE_INJECTION_POINTS +#define INJECTION_POINT(name) InjectionPointRun(name) +#else +#define INJECTION_POINT(name) ((void) name) +#endif + +/* + * Typedef for callback function launched by an injection point. 
+ */ +typedef void (*InjectionPointCallback) (const char *name); + +extern Size InjectionPointShmemSize(void); +extern void InjectionPointShmemInit(void); + +extern void InjectionPointAttach(const char *name, +const char *library, +const char *function); +extern void InjectionPointRun(const char *name); +extern void InjectionPointDetach(const char *name); + +#endif /* INJECTION_POINT_H */ diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 0e0ac22bdd..81799c5688 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -49,6 +49,7 @@ #include "storage/sinvaladt.h" #include "storage/spin.h" #include "utils/guc.h" +#include "utils/injection_point.h" #include "utils/snapmgr.h" #include "utils/wait_event.h" @@ -147,6 +148,7 @@ CalculateShmemSize(int *num_semaphores) size = add_size(size, AsyncShmemSize()); size = add_size(size, StatsShmemSize()); size = add_size(size, WaitEventExtensionShmemSize()); + size = add_size(size, InjectionPointShmemSize()); #ifdef EXEC_BACKEND size = add_size(size,
Re: Sorting regression of text function result since commit 586b98fdf1aae
On Mon, 11 Dec 2023 15:43:12 -0500 Tom Lane wrote: > Jehan-Guillaume de Rorthais writes: > > It looks like since 586b98fdf1aae, the result type collation of > > "convert_from" is forced to "C", like the patch does for type "name", > > instead of the "default" collation for type "text". > > Well, convert_from() inherits its result collation from the input, > per the normal rules for collation assignment [1]. > > > Looking at hints in the header comment of function "exprCollation", I poked > > around and found that the result collation wrongly follow the input > > collation in this case. > > It's not "wrong", it's what the SQL standard requires. Mh, OK. This is at least a surprising behavior. Having a non-data-related argument impact the result collation seems counter-intuitive. But I understand this is per the standard, no need to discuss it. > > I couldn't find anything explaining this behavior in the changelog. It looks > > like a regression to me, but if this is actually expected, maybe this > > deserve some documentation patch? > > The v12 release notes do say > > Type name now behaves much like a domain over type text that has > default collation “C”. Sure, and I saw it, but reading this entry, I couldn't guess it could have such an implication on the text result of a function call. That's why I hunted for the precise commit and was surprised to find this was the actual change. > You'd have similar results from an expression involving such a domain, > I believe. > > I'm less than excited about patching the v12 release notes four > years later. Maybe, if this point had come up in a more timely > fashion, we'd have mentioned it --- but it's hardly possible to > cover every potential implication of such a change in the > release notes. This could have been documented in the collation concept page, as a trap to be aware of. A link from the release notes to such a small paragraph would have been enough to warn developers that this might have implications when mixed with other collatable types. But I understand we cannot document every trap paving the way to the standard anyway. Thank you for your explanation! Regards,
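For readers hitting the same surprise, a hypothetical illustration (the table and column names are invented): the encoding-name argument of convert_from() is of type name, whose "C" collation is inherited by the text result since v12, so an ORDER BY on that result sorts in "C" order unless the collation is overridden explicitly:

SELECT convert_from(raw_bytes, 'UTF8') AS txt
FROM messages
ORDER BY convert_from(raw_bytes, 'UTF8');                      -- sorts per collation "C"

SELECT convert_from(raw_bytes, 'UTF8') AS txt
FROM messages
ORDER BY convert_from(raw_bytes, 'UTF8') COLLATE "default";    -- database default ordering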
Re: brininsert optimization opportunity
On 12/11/23 16:41, Tomas Vondra wrote: > On 12/11/23 16:00, Alexander Lakhin wrote: >> Hello Tomas and Soumyadeep, >> >> 25.11.2023 23:06, Tomas Vondra wrote: >>> I've done a bit more cleanup on the last version of the patch (renamed >>> the fields to start with bis_ as agreed, rephrased the comments / docs / >>> commit message a bit) and pushed. >> >> Please look at a query, which produces warnings similar to the ones >> observed upthread: >> CREATE TABLE t(a INT); >> INSERT INTO t SELECT x FROM generate_series(1,1) x; >> CREATE INDEX idx ON t USING brin (a); >> REINDEX index CONCURRENTLY idx; >> >> WARNING: resource was not closed: [1863] (rel=base/16384/16389, >> blockNum=1, flags=0x9380, refcount=1 1) >> WARNING: resource was not closed: [1862] (rel=base/16384/16389, >> blockNum=0, flags=0x9380, refcount=1 1) >> >> The first bad commit for this anomaly is c1ec02be1. >> > > Thanks for the report. I haven't investigated what the issue is, but it > seems we fail to release the buffers in some situations - I'd bet we > fail to call the cleanup callback in some place, or something like that. > I'll take a look tomorrow. > Yeah, just as I expected this seems to be a case of failing to call the index_insert_cleanup after doing some inserts - in this case the inserts happen in table_index_validate_scan, but validate_index has no idea it needs to do the cleanup. The attached 0001 patch fixes this by adding the call to validate_index, which seems like the proper place as it's where the indexInfo is allocated and destroyed. But this makes me think - are there any other places that might call index_insert without then also doing the cleanup? I did grep for the index_insert() calls, and it seems OK except for this one. There's a call in toast_internals.c, but that seems OK because that only deals with btree indexes (and those don't need any cleanup). The same logic applies to unique_key_recheck(). The rest goes through execIndexing.c, which should do the cleanup in ExecCloseIndices(). Note: We should probably call the cleanup even in the btree cases, even if only to make it clear it needs to be called after index_insert(). I was thinking maybe we should have some explicit call to destroy the IndexInfo, but that seems rather inconvenient - it'd force everyone to carefully track lifetimes of the IndexInfo instead of just relying on memory context reset/destruction. That seems quite error-prone. I propose we do a much simpler thing instead - allow the cache may be initialized / cleaned up repeatedly, and make sure it gets reset at convenient place (typically after index_insert calls that don't go through execIndexing). That'd mean the cleanup does not need to happen very far from the index_insert(), which makes the reasoning much easier. 0002 does this. But maybe there's a better way to ensure the cleanup gets called even when not using execIndexing. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 0cf18dd53842e3809b76525867214ce8ff823f32 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Tue, 12 Dec 2023 12:01:07 +0100 Subject: [PATCH 1/2] call cleanup in validate_index --- src/backend/catalog/index.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 7b186c0220b..7a0e337a418 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -3414,6 +3414,8 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot) /* Done with tuplesort object */ tuplesort_end(state.tuplesort); + index_insert_cleanup(indexRelation, indexInfo); + elog(DEBUG2, "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples", state.htups, state.itups, state.tups_inserted); -- 2.42.0 From 0c04b53573cf164c5a0108d909f05ffcbae4e72d Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Tue, 12 Dec 2023 12:24:05 +0100 Subject: [PATCH 2/2] call index_insert_cleanup repeatedly --- src/backend/access/heap/heapam_handler.c | 3 +++ src/backend/access/index/indexam.c | 11 ++- src/backend/catalog/index.c | 2 -- 3 files changed, 13 insertions(+), 3 deletions(-) diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 7c28dafb728..5db0cf1dffa 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -1976,6 +1976,9 @@ heapam_index_validate_scan(Relation heapRelation, /* These may have been pointing to the now-gone estate */ indexInfo->ii_ExpressionsState = NIL; indexInfo->ii_PredicateState = NULL; + + /* also make sure the IndexInfo cache gets freed */ + index_insert_cleanup(indexRelation, indexInfo); } /* diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index f23e0199f08..515d7649c4d 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -2
Re: Show WAL write and fsync stats in pg_stat_io
Hi, Thanks for the feedback! The new version of the patch is attached. On Tue, 5 Dec 2023 at 09:16, Michael Paquier wrote: > > - if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND) > + if (io_op == IOOP_EXTEND || io_op == IOOP_WRITE) > > Unrelated diff. Done. > > + if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL && > + io_op == IOOP_FSYNC) > + PendingWalStats.wal_sync += cnt; > > Nah, I really don't think that adding this dependency within > pg_stat_io is a good idea. > > - PendingWalStats.wal_sync++; > + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC, > + io_start, 1); > > This is the only caller where this matters, and the count is always 1. I reverted that, pgstat_count_io_op_n doesn't count PendingWalStats.wal_sync now. > > + no_wal_normal_read = bktype == B_AUTOVAC_LAUNCHER || > + bktype == B_AUTOVAC_WORKER || bktype == B_BACKEND || > + bktype == B_BG_WORKER || bktype == B_BG_WRITER || > + bktype == B_CHECKPOINTER || bktype == B_WAL_RECEIVER || > + bktype == B_WAL_SENDER || bktype == B_WAL_WRITER; > + > + if (no_wal_normal_read && > + (io_object == IOOBJECT_WAL && > +io_op == IOOP_READ)) > + return false; > > This may be more readable if an enum is applied, without a default > clause so as it would not be forgotten if a new type is added, perhaps > in its own little routine. Done. > > - if (track_io_timing) > + if (track_io_timing || track_wal_io_timing) > INSTR_TIME_SET_CURRENT(io_start); > else > > This interface from pgstat_prepare_io_time() is not really good, > because we could finish by setting io_start in the existing code paths > calling this routine even if track_io_timing is false when > track_wal_io_timing is true. Why not changing this interface a bit > and pass down a GUC (track_io_timing or track_wal_io_timing) as an > argument of the function depending on what we expect to trigger the > timings? Done in 0001. > > - /* Convert counters from microsec to millisec for display */ > - values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / > 1000.0); > - values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / > 1000.0); > + /* > +* There is no need to calculate timings for both pg_stat_wal and > +* pg_stat_io. So, fetch timings from pg_stat_io to make stats > gathering > +* cheaper. Note that, since timings are fetched from pg_stat_io; > +* pg_stat_reset_shared('io') will reset pg_stat_wal's timings too. > +* > +* Convert counters from microsec to millisec for display > +*/ > + values[6] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL, > + > IOCONTEXT_NORMAL, > + > IOOP_WRITE)); > + values[7] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL, > + > IOCONTEXT_NORMAL, > + > IOOP_FSYNC)); > > Perhaps it is simpler to remove these columns from pg_stat_get_wal() > and plug an SQL upgrade to the view definition of pg_stat_wal? Done in 0003 but I am not sure if that is what you expected. > Finding a good balance between the subroutines, the two GUCs, the > contexts, the I/O operation type and the objects is the tricky part of > this patch. If the dependency to PendingWalStats is removed and if > the interface of pgstat_prepare_io_time is improved, things are a bit > cleaner, but it feels like we could do more.. Nya. I agree. The patch is not logically complicated but it is hard to select the best way. Any kind of feedback would be appreciated. 
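As a quick way to eyeball the result (a hypothetical query, assuming the patched view reports WAL under object 'wal'):

SELECT backend_type, object, context, reads, writes, fsyncs
FROM pg_stat_io
WHERE object = 'wal';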
-- Regards, Nazir Bilal Yavuz Microsoft From b7bf7b92fa274775136314ecfde90fa32ed435cb Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Wed, 29 Nov 2023 15:30:03 +0300 Subject: [PATCH v6 6/6] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / read tests --- src/test/regress/expected/stats.out | 12 src/test/regress/sql/stats.sql | 8 2 files changed, 20 insertions(+) diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out index 4adda9e479..7f5340cd7e 100644 --- a/src/test/regress/expected/stats.out +++ b/src/test/regress/expected/stats.out @@ -881,6 +881,18 @@ SELECT current_setting('fsync') = 'off' t (1 row) +-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_NORMAL / read. +-- When the servers starts, StartupXLOG function must be called by postmaster +-- or standalone-backend startup and WAL read must be done. +-- So, check these before stats get resetted. +SELECT SUM(reads) >
Re: Add --check option to pgindent
On Tue, Dec 12, 2023, at 7:28 AM, Michael Banck wrote: > On Tue, Dec 12, 2023 at 11:18:59AM +0100, Daniel Gustafsson wrote: > > > On 12 Dec 2023, at 01:09, Tristan Partin wrote: > > > > > > Not sold on the name, but --check is a combination of --silent-diff > > > and --show-diff. I envision --check mostly being used in CI > > > environments. I recently came across a situation where this behavior > > > would have been useful. Without --check, you're left to capture the > > > output of --show-diff and exit 2 if the output isn't empty by > > > yourself. > > > > I wonder if we should model this around the semantics of git diff to > > keep it similar to other CI jobs which often use git diff? git diff > > --check means "are there conflicts or issues" which isn't really > > comparable to here, git diff --exit-code however is pretty much > > exactly what this is trying to accomplish. > > > > That would make pgindent --show-diff --exit-code exit with 1 if there > > were diffs and 0 if there are no diffs. > > To be honest, I find that rather convoluted; contrary to "git diff", I > believe the primary action of pgident is not to show diffs, so I find > the proposed --check option to be entirely reasonable from a UX > perspective. > > On the other hand, tying a "does this need re-indenting?" question to a > "--show-diff --exit-code" option combination is not very obvious (to me, > at least). Multiple options to accomplish a use case might not be obvious. I'm wondering if we can combine it into a unique option. --show-diff show the changes that would be made --silent-diff exit with status 2 if any changes would be made + --check combination of --show-diff and --silent-diff I mean --diff=show,silent,check When you add exceptions, it starts to complicate the UI. usage("Cannot have both --silent-diff and --show-diff") if $silent_diff && $show_diff; +usage("Cannot have both --check and --show-diff") + if $check && $show_diff; + +usage("Cannot have both --check and --silent-diff") + if $check && $silent_diff; + -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Synchronizing slots from primary to standby
A review comment on the v45 patch:

If one creates a logical slot with failover=true as:

select pg_create_logical_replication_slot('logical_slot','pgoutput', false, true, true);

and then uses the existing logical slot while creating a subscription:

postgres=# create subscription sub4 connection 'dbname=postgres host=localhost port=5433' publication pub1t4 WITH (slot_name=logical_slot, create_slot=false, failover=true);
NOTICE: changed the failover state of replication slot "logical_slot" on publisher to false
CREATE SUBSCRIPTION

then despite configuring logical_slot's failover to true and specifying failover=true during subscription creation, the NOTICE indicates a change in the failover state to 'false', without providing any explanation for this transition. It can be confusing for users, so IMO the notice should include the reason for switching failover to 'false', or should give a hint to use either refresh=false or copy_data=false to enable failover=true for the slot, as we do in other similar 'alter subscription ...' scenarios. -- Thanks & Regards, Nisha
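A sketch of the alternative such a hint could point at (hypothetical, assuming that skipping the initial data copy indeed lets the pre-created slot keep failover=true, as the suggested hint implies):

create subscription sub4 connection 'dbname=postgres host=localhost port=5433' publication pub1t4
    WITH (slot_name=logical_slot, create_slot=false, failover=true, copy_data=false);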
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Mon, Dec 11, 2023 at 10:05 PM Alena Rybakina wrote: > > Hi! Thank you for your work. Your patch looks better! > Yes, thank you! It works fine, and I see that the regression tests have been > passed. 🙂 > However, when I ran 'copy from with save_error' operation with simple csv > files (copy_test.csv, copy_test1.csv) for tables test, test1 (how I created > it, I described below): > > postgres=# create table test (x int primary key, y int not null); > postgres=# create table test1 (x int, z int, CONSTRAINT fk_x > FOREIGN KEY(x) > REFERENCES test(x)); > > I did not find a table with saved errors after operation, although I received > a log about it: > > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV > save_error > NOTICE: 2 rows were skipped because of error. skipped row saved to table > public.test_error > ERROR: duplicate key value violates unique constraint "test_pkey" > DETAIL: Key (x)=(2) already exists. > CONTEXT: COPY test, line 3 > > postgres=# select * from public.test_error; > ERROR: relation "public.test_error" does not exist > LINE 1: select * from public.test_error; > > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV > save_error > NOTICE: 2 rows were skipped because of error. skipped row saved to table > public.test1_error > ERROR: insert or update on table "test1" violates foreign key constraint > "fk_x" > DETAIL: Key (x)=(2) is not present in table "test". > > postgres=# select * from public.test1_error; > ERROR: relation "public.test1_error" does not exist > LINE 1: select * from public.test1_error; > > Two lines were written correctly in the csv files, therefore they should have > been added to the tables, but they were not added to the tables test and > test1. > > If I leave only the correct rows, everything works fine and the rows are > added to the tables. > > in copy_test.csv: > > 2,0 > > 1,1 > > in copy_test1.csv: > > 2,0 > > 2,1 > > 1,1 > > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV > COPY 2 > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV > save_error > NOTICE: No error happened.Error holding table public.test1_error will be > droped > COPY 3 > > Maybe I'm launching it the wrong way. If so, let me know about it. looks like the above is about constraints violation while copying. constraints violation while copying not in the scope of this patch. Since COPY FROM is very like the INSERT command, you do want all the valid constraints to check all the copied rows? but the notice raised by the patch is not right. So I place the drop error saving table or raise notice logic above `ExecResetTupleTable(estate->es_tupleTable, false)` in the function CopyFrom. > > I also notice interesting behavior if the table was previously created by the > user. When I was creating an error_table before the 'copy from' operation, > I received a message saying that it is impossible to create a table with the > same name (it is shown below) during the 'copy from' operation. > I think you should add information about this in the documentation, since > this seems to be normal behavior to me. > doc changed. you may check it. From 3024bf3b727b728c58dfef41c62d7a93c083b887 Mon Sep 17 00:00:00 2001 From: pgaddict Date: Tue, 12 Dec 2023 20:58:45 +0800 Subject: [PATCH v11 1/1] Make COPY FROM more error tolerant Currently COPY FROM has 3 types of error while processing the source file. * extra data after last expected column * missing data for column \"%s\" * data type conversion error. 
Instead of throwing errors while copying, save_error specifier will save errors to a error saving table automatically. We check the error saving table definition by column name and column data type. if table already exists and meets the criteria then errors will save to that table. if the table does not exist, then create one. Only works for COPY FROM, non-BINARY mode. While copying, if error never happened, error saving table will be dropped at the ending of COPY FROM. If the error saving table exists, meaning at least once COPY FROM errors has happened, then all the future errors will be saved to that table. We save the error related meta info to error saving table using SPI, that is construct a query string, then execute the query. --- contrib/file_fdw/file_fdw.c | 4 +- doc/src/sgml/ref/copy.sgml | 100 +- src/backend/commands/copy.c | 12 ++ src/backend/commands/copyfrom.c | 146 +++- src/backend/commands/copyfromparse.c | 169 +-- src/backend/parser/gram.y| 8 +- src/bin/psql/tab-complete.c | 3 +- src/include/commands/copy.h | 3 +- src/include/commands/copyfrom_internal.h | 7 + src/include/parser/kwlist.h | 1 + src/test/regress/expected/copy2.out | 135 ++ src/test/regress/sq
Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)
On Mon, Dec 11, 2023 at 6:16 PM Peter Geoghegan wrote: > Will you be in Prague this week? If not this might have to wait. Sorry, I won't be in Prague this week. Due to my current immigration status, I can't travel. I wish you a lovely time in Prague. I'm OK with waiting; review it once you can. I will probably provide a more polished version in the meantime. -- Regards, Alexander Korotkov
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
[Added Andrey again in CC, because as I understand they are using this code or something like it in production. Please don't randomly remove people from CC lists.] I've been looking at this some more, and I'm not confident that the group clog update stuff is correct. I think the injection points test case was good enough to discover a problem, but it's hard to get peace of mind that there aren't other, more subtle problems. The problem I see is that the group update mechanism is designed around contention of the global xact-SLRU control lock; it uses atomics to coordinate a single queue when the lock is contended. So if we split up the global SLRU control lock using banks, then multiple processes using different bank locks might not contend. OK, this is fine, but what happens if two separate groups of processes encounter contention on two different bank locks? I think they will both try to update the same queue, and coordinate access to that *using different bank locks*. I don't see how this can work correctly. I suspect the first part of that algorithm, where atomics are used to create the list without a lock, might work fine. But will each "leader" process, each of which is potentially using a different bank lock, coordinate correctly? Maybe this works correctly because only one process will find the queue head not empty? If this is what happens, then there need to be comments about it. Without any explanation, this seems broken and potentially dangerous, as some transaction commit bits might become lost given high enough concurrency and bad luck. Maybe this can be made to work by having one more lwlock that we use solely to coordinate this task. Though we would have to demonstrate that coordinating this task with a different lock works correctly in conjunction with the per-bank lwlock usage in the regular slru.c paths. Andrey, do you have any stress tests or anything else that you used to gain confidence in this code? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El sabio habla porque tiene algo que decir; el tonto, porque tiene que decir algo" (Platon).
Re: [PATCH]: Not to invaldiate CatalogSnapshot for local invalidation messages
Hi hackers, I would like to give more details about my patch.

In Postgres, a global snapshot “CatalogSnapshot” is used to check catalog data visibility. “CatalogSnapshot” is always updated to the latest version so that the latest catalog table content is visible. If there is any update on catalog tables, then to make the changes visible to the following commands in the current transaction, the call chain “CommandCounterIncrement” -> “AtCCI_LocalCache” -> “CommandEndInvalidationMessages” -> “LocalExecuteInvalidationMessage” -> “InvalidateCatalogSnapshot” invalidates the “CatalogSnapshot” by setting it to NULL. The next time the “CatalogSnapshot” is needed and found to be NULL, it is regenerated. In a query, “CommandCounterIncrement” may be called many times, so the “CatalogSnapshot” may be destroyed and recreated many times.

To reduce such overhead, instead of invalidating the “CatalogSnapshot” we can keep it and just increase its “curcid”. When the transaction is committed or aborted, or when there are catalog invalidation messages from other backends, the “CatalogSnapshot” is still invalidated and regenerated. Sometimes the “CatalogSnapshot” is not the same snapshot as the transaction's “CurrentSnapshot”, but we can still update the CatalogSnapshot's “curcid”, as the “curcid” is only checked when the tuple was inserted or deleted by the current transaction.

Xiaoran Wang wrote on Thu, Dec 7, 2023 at 10:13: > Hi hackers, > > > For local invalidation messages, there is no need to call > `InvalidateCatalogSnapshot` to set the CatalogSnapshot to NULL and > rebuild it later. Instead, just update the CatalogSnapshot's `curcid` > in `SnapshotSetCommandId`, this way can make the CatalogSnapshot work > well too. > > This optimization can reduce the overhead of rebuilding CatalogSnapshot > after each command. > > > Best regards, xiaoran >
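A hypothetical micro-workload where the difference should be visible: every iteration updates the catalogs, so each “CommandCounterIncrement” currently invalidates the “CatalogSnapshot”, which then has to be rebuilt for the next catalog lookup, whereas with the patch only its “curcid” would be bumped.

DO $$
BEGIN
    FOR i IN 1..10000 LOOP
        CREATE TEMP TABLE catsnap_demo (a int);   -- catalog inserts plus local inval messages
        DROP TABLE catsnap_demo;                  -- more catalog updates
    END LOOP;
END
$$;

(catsnap_demo and the loop count are invented for illustration only.)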
Re: How abnormal server shutdown could be detected by tests?
Hello Shveta, 12.12.2023 11:44, shveta malik wrote: The postmaster process exits with exit code 1, but pg_ctl can't get the code and just reports that stop was completed successfully. For what it's worth, there is another thread which stated the similar problem: https://www.postgresql.org/message-id/flat/2366244.1651681550%40sss.pgh.pa.us Thank you for the reference! So I refreshed a first part of the question Tom Lane raised before... I've made a quick experiment with leaving postmaster.pid intact in case of abnormal shutdown: @@ -1113,6 +1113,7 @@ UnlinkLockFiles(int status, Datum arg) { char *curfile = (char *) lfirst(l); +if (strcmp(curfile, DIRECTORY_LOCK_FILE) != 0 || status == 0) unlink(curfile); /* Should we complain if the unlink fails? */ } and `make check-world` passed for me with no failure. (In the meantime, the assertion failure forced as above is detected.) Though there is a minor issue with a couple of tests. Namely, 003_recovery_targets.pl does the following: # wait for the error message in the standby log foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) { $logfile = slurp_file($node_primary->logfile()); $res = ($logfile =~ qr/FATAL: .* recovery ended before configured recovery target was reached/); if ($res) { last; } usleep(100_000); } ok($res, 'recovery end before target reached is a fatal error'); With postmaster.pid left after unclean shutdown, the test waits for 300 seconds by default and then completes successfully. If rewrite that loop as follows: # wait for the error message in the standby log foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) { $logfile = slurp_file($node_primary->logfile()); $res = ($logfile =~ qr/FATAL: .* recovery ended before configured recovery target was reached/); if ($res) { last; } usleep(100_000); } ok($res, 'recovery end before target reached is a fatal error'); the test completes as quickly as before. (standby.log is only 2kb, so rereading it isn't a big deal, IMO) So maybe it's the way to go? Another way I can think of is sending some signal to pg_ctl in case postmaster terminates with status 0. Though I think it would complicate things a little as it allows for three different states: postmaster.pid preserved (in case postmaster killed with -9), postmaster.pid removed and the signal received/not received. Best regards, Alexander
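PS: For illustration, with postmaster.pid preserved on abnormal exit, a TAP test could detect the situation with something as simple as the following (sketch only; it relies on the experimental UnlinkLockFiles() change above, and $node is a PostgreSQL::Test::Cluster instance whose server is expected to have crashed):

ok( -f $node->data_dir . '/postmaster.pid',
	'postmaster.pid left behind, so the shutdown was abnormal');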
Re: Add --check option to pgindent
"Euler Taveira" writes: > When you add exceptions, it starts to complicate the UI. Indeed. It seems like --silent-diff was poorly defined and poorly named, and we need to rethink that option along the way to adding this behavior. The idea that --show-diff and --silent-diff can be used together is just inherently confusing, because they sound like opposites. regards, tom lane
Re: Add --check option to pgindent
On Tue Dec 12, 2023 at 5:47 AM CST, Euler Taveira wrote: On Tue, Dec 12, 2023, at 7:28 AM, Michael Banck wrote: > On Tue, Dec 12, 2023 at 11:18:59AM +0100, Daniel Gustafsson wrote: > > > On 12 Dec 2023, at 01:09, Tristan Partin wrote: > > > > > > Not sold on the name, but --check is a combination of --silent-diff > > > and --show-diff. I envision --check mostly being used in CI > > > environments. I recently came across a situation where this behavior > > > would have been useful. Without --check, you're left to capture the > > > output of --show-diff and exit 2 if the output isn't empty by > > > yourself. > > > > I wonder if we should model this around the semantics of git diff to > > keep it similar to other CI jobs which often use git diff? git diff > > --check means "are there conflicts or issues" which isn't really > > comparable to here, git diff --exit-code however is pretty much > > exactly what this is trying to accomplish. > > > > That would make pgindent --show-diff --exit-code exit with 1 if there > > were diffs and 0 if there are no diffs. > > To be honest, I find that rather convoluted; contrary to "git diff", I > believe the primary action of pgident is not to show diffs, so I find > the proposed --check option to be entirely reasonable from a UX > perspective. > > On the other hand, tying a "does this need re-indenting?" question to a > "--show-diff --exit-code" option combination is not very obvious (to me, > at least). Multiple options to accomplish a use case might not be obvious. I'm wondering if we can combine it into a unique option. --show-diff show the changes that would be made --silent-diff exit with status 2 if any changes would be made + --check combination of --show-diff and --silent-diff I mean --diff=show,silent,check When you add exceptions, it starts to complicate the UI. usage("Cannot have both --silent-diff and --show-diff") if $silent_diff && $show_diff; +usage("Cannot have both --check and --show-diff") + if $check && $show_diff; + +usage("Cannot have both --check and --silent-diff") + if $check && $silent_diff; + I definitely agree. I just didn't want to remove options, but if people are ok with that, I will just change the interface. I like the git-diff semantics or have --diff and --check, similar to how Python's black[0] is. [0]: https://github.com/psf/black -- Tristan Partin Neon (https://neon.tech)
Re: Add --check option to pgindent
On 2023-Dec-12, Tom Lane wrote: > "Euler Taveira" writes: > > When you add exceptions, it starts to complicate the UI. > > Indeed. It seems like --silent-diff was poorly defined and poorly > named, and we need to rethink that option along the way to adding > this behavior. The idea that --show-diff and --silent-diff can > be used together is just inherently confusing, because they sound > like opposites. Maybe it's enough to rename --silent-diff to --check. You can do "--show-diff --check" and get both the error and the diff printed; or just "--check" and it'll throw an error without further ado; or "--show-diff" and it will both apply the diff and print it. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "I must say, I am absolutely impressed with what pgsql's implementation of VALUES allows me to do. It's kind of ridiculous how much "work" goes away in my code. Too bad I can't do this at work (Oracle 8/9)." (Tom Allison) http://archives.postgresql.org/pgsql-general/2007-06/msg00016.php
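PS: A rough sketch of that, on top of pgindent's per-file loop (assuming the script's existing $source, $orig_source, $source_filename, $show_diff variables and the show_diff()/write_source() helpers; $check is the renamed --silent-diff):

my $needs_indent = 0;

# inside the existing loop over files
if ($source ne $orig_source)
{
	$needs_indent = 1;
	print show_diff($source, $source_filename) if $show_diff;
	write_source($source, $source_filename) unless $check;
}

# after all files have been processed
exit 2 if $check && $needs_indent;

That keeps --check usable on its own (quiet, exit status 2 if anything needs re-indenting) or together with --show-diff (print every diff, then exit 2), which seems to match the CI use case upthread.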
Re: typedefs.list glitches
On Thu May 12, 2022 at 4:21 PM CDT, Tom Lane wrote: I wrote: > every buildfarm member that's contributing to the typedefs list > builds with OpenSSL. That wouldn't surprise me, except that > my own animal sifaka should be filling that gap. Looking at > its latest attempt[1], it seems to be generating an empty list, > which I guess means that our recipe for extracting typedefs > doesn't work on macOS/arm64. I shall investigate. Found it. Current macOS produces $ objdump -W /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/objdump: error: unknown argument '-W' where last year's vintage produced $ objdump -W objdump: Unknown command line argument '-W'. Try: '/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/objdump --help' objdump: Did you mean '-C'? This confuses run_build.pl into taking the "Linux and sometimes windows" code path instead of the $using_osx one. I think simplest fix is to move the $using_osx branch ahead of the heuristic ones, as attached. Hey Tom, Was this patch ever committed? -- Tristan Partin Neon (https://neon.tech)
Re: typedefs.list glitches
"Tristan Partin" writes: > Was this patch ever committed? Yes, though not till commit dcca861554e90d6395c3c153317b0b0e3841f103 Author: Andrew Dunstan Date: Sun Jan 15 07:32:50 2023 -0500 Improve typedef logic for MacOS sifaka is currently generating typedefs, and I'm pretty certain it's using unpatched REL_17 BF code. regards, tom lane
Move walreceiver state assignment (to WALRCV_STREAMING) in WalReceiverMain()
Hi hackers, Currently walrcv->walRcvState is set to WALRCV_STREAMING at the beginning of WalReceiverMain(). But it seems that after this assignment things could go wrong before the walreceiver actually starts streaming (like not being able to connect to the primary). It looks to me that WALRCV_STREAMING should be set once walrcv_startstreaming() returns true: this is the proposal of this patch. I don't think the current assignment location is causing any issues, but I think it's more appropriate to move it like in the attached. Looking forward to your feedback, Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 000bb21e54085abcf9df8cc75e4a86c21e24700e Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Mon, 11 Dec 2023 20:05:25 + Subject: [PATCH v1] move walrcv->walRcvState assignment to WALRCV_STREAMING walrcv->walRcvState = WALRCV_STREAMING was set too early. Move the assignment to a more appropriate place. --- src/backend/replication/walreceiver.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) 100.0% src/backend/replication/ diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 26ded928a7..47f1d50144 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -242,7 +242,6 @@ WalReceiverMain(void) } /* Advertise our PID so that the startup process can kill us */ walrcv->pid = MyProcPid; - walrcv->walRcvState = WALRCV_STREAMING; /* Fetch information required to start streaming */ walrcv->ready_to_display = false; @@ -413,9 +412,14 @@ WalReceiverMain(void) if (walrcv_startstreaming(wrconn, &options)) { if (first_stream) + { + SpinLockAcquire(&walrcv->mutex); + walrcv->walRcvState = WALRCV_STREAMING; + SpinLockRelease(&walrcv->mutex); ereport(LOG, (errmsg("started streaming WAL from primary at %X/%X on timeline %u", LSN_FORMAT_ARGS(startpoint), startpointTLI))); + } else ereport(LOG, (errmsg("restarted WAL streaming at %X/%X on timeline %u", -- 2.34.1
Re: Move walreceiver state assignment (to WALRCV_STREAMING) in WalReceiverMain()
On Tue, Dec 12, 2023, at 12:58 PM, Drouvot, Bertrand wrote: > Currently walrcv->walRcvState is set to WALRCV_STREAMING at the > beginning of WalReceiverMain(). > > But it seems that after this assignment things could go wrong before the > walreceiver actually starts streaming (like not being able to connect > to the primary). > > It looks to me that WALRCV_STREAMING should be set once > walrcv_startstreaming() > returns true: this is the proposal of this patch. Per the state name (streaming), it seems it should be set later as you proposed; however, I'm concerned about the previous state (WALRCV_STARTING). If I'm reading the code correctly, WALRCV_STARTING is assigned at RequestXLogStreaming(): SetInstallXLogFileSegmentActive(); RequestXLogStreaming(tli, ptr, PrimaryConnInfo, PrimarySlotName, wal_receiver_create_temp_slot); flushedUpto = 0; } /* * Check if WAL receiver is active or wait to start up. */ if (!WalRcvStreaming()) { lastSourceFailed = true; break; } After a few lines the function WalRcvStreaming() has: SpinLockRelease(&walrcv->mutex); /* * If it has taken too long for walreceiver to start up, give up. Setting * the state to STOPPED ensures that if walreceiver later does start up * after all, it will see that it's not supposed to be running and die * without doing anything. */ if (state == WALRCV_STARTING) { pg_time_t now = (pg_time_t) time(NULL); if ((now - startTime) > WALRCV_STARTUP_TIMEOUT) { bool stopped = false; SpinLockAcquire(&walrcv->mutex); if (walrcv->walRcvState == WALRCV_STARTING) { state = walrcv->walRcvState = WALRCV_STOPPED; stopped = true; } SpinLockRelease(&walrcv->mutex); if (stopped) ConditionVariableBroadcast(&walrcv->walRcvStoppedCV); } } Couldn't it give up before starting if you apply your patch? My main concern is that on a slow system, walrcv_connect() could take too long in WalReceiverMain() and the code above would then kill the walreceiver while it is still in the process of starting. Since you cannot control the hardcoded WALRCV_STARTUP_TIMEOUT value, you might have issues during overload periods. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Report planning memory in EXPLAIN ANALYZE
I would replace the text in explain.sgml with this: + Include information on memory consumption by the query planning phase. + This includes the precise amount of storage used by planner in-memory + structures, as well as overall total consumption of planner memory, + including allocation overhead. + This parameter defaults to FALSE. and remove the new example you're adding; it's not really necessary. In struct ExplainState, I'd put the new member just below summary. If we're creating a new memory context for planner, we don't need the "mem_counts_start" thing, and we don't need the MemoryContextSizeDifference thing. Just add the MemoryContextMemConsumed() function (which ISTM should be just below MemoryContextMemAllocated() instead of in the middle of the MemoryContextStats implementation), and just report the values we get there. I think the SizeDifference stuff increases surface damage for no good reason. ExplainOnePlan() is given a MemoryUsage object (or, if we do as above and no longer have a MemoryUsage struct at all in the first place, a MemoryContextCounters object) even when the MEMORY option is false. This means we waste computing memory usage when not needed. Let's avoid that useless overhead. I'd also do away with the comment you added in explain_ExecutorEnd() and do just this, below setting of es->summary: + /* No support for MEMORY option */ + /* es->memory = false; */ We don't need to elaborate more at present. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Doing what he did amounts to sticking his fingers under the hood of the implementation; if he gets his fingers burnt, it's his problem." (Tom Lane)
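PS: To make sure we're talking about the same shape of code, roughly this (a sketch only, reusing the parameter names of ExplainOneQuery() and the MemoryContextMemConsumed() function proposed above; not the actual patch):

	MemoryContext planner_ctx;
	MemoryContext oldctx;
	PlannedStmt *plan;

	planner_ctx = AllocSetContextCreate(CurrentMemoryContext,
										"planner",
										ALLOCSET_DEFAULT_SIZES);
	oldctx = MemoryContextSwitchTo(planner_ctx);
	plan = pg_plan_query(query, queryString, cursorOptions, params);
	MemoryContextSwitchTo(oldctx);

	if (es->memory)
	{
		MemoryContextCounters consumed;

		MemoryContextMemConsumed(planner_ctx, &consumed);
		ExplainPropertyInteger("Planner Memory", "bytes",
							   consumed.totalspace - consumed.freespace, es);
	}

i.e. no separate MemoryUsage struct, no start/stop accounting, and nothing computed at all unless the MEMORY option was requested.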
Re: Report planning memory in EXPLAIN ANALYZE
On 2023-Dec-12, Alvaro Herrera wrote: > I would replace the text in explain.sgml with this: > > + Include information on memory consumption by the query planning phase. > + This includes the precise amount of storage used by planner in-memory > + structures, as well as overall total consumption of planner memory, > + including allocation overhead. > + This parameter defaults to FALSE. (This can still use a lot more wordsmithing of course, such as avoiding repeated use of "include/ing", ugh) -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "The Gord often wonders why people threaten never to come back after they've been told never to return" (www.actsofgord.com)
Re: pg_upgrade and logical replication
On Thu, Dec 7, 2023 at 8:15 PM vignesh C wrote: > > On Tue, 5 Dec 2023 at 10:56, Michael Paquier wrote: > > > > On Mon, Dec 04, 2023 at 04:30:49PM +0530, Amit Kapila wrote: > > > I have made minor changes in the comments and code at various places. > > > See and let me know if you are not happy with the changes. I think > > > unless there are more suggestions or comments, we can proceed with > > > committing it. > > > > Yeah. I am planning to look more closely at what you have here, and > > it is going to take me a bit more time though (some more stuff planned > > for next CF, an upcoming conference and end/beginning-of-year > > vacations), but I think that targetting the beginning of next CF in > > January would be OK. > > > > Overall, I have the impression that the patch looks pretty solid, with > > a restriction in place for "init" and "ready" relations, while there > > are tests to check all the states that we expect. Seeing coverage > > about all that makes me a happy hacker. > > > > + * If retain_lock is true, then don't release the locks taken in this > > function. > > + * We normally release the locks at the end of transaction but in > > binary-upgrade > > + * mode, we expect to release those immediately. > > > > I think that this should be documented in pg_upgrade_support.c where > > the caller expects the locks to be released, and why these should be > > released. There is a risk that this comment becomes obsolete if > > AddSubscriptionRelState() with locks released is called in a different > > code path. Anyway, I am not sure to get why this is OK, or even > > necessary. It seems like a good practice to keep the locks on the > > subscription until the transaction that updates its state. If there's > > a specific reason explaining why that's better, the patch should tell > > why. > > Added comments for this. > > > + * However, this shouldn't be a problem as the upgrade ensures > > + * that all the transactions were replicated before upgrading the > > + * publisher. > > This wording looks a bit confusing to me, as "the upgrade" could refer > > to the upgrade of a subscriber, but what we want to tell is that the > > replay of the transactions is enforced when doing a publisher upgrade. > > I'd suggest something like "the upgrade of the publisher ensures that > > all the transactions were replicated before upgrading it". > > Modified > > > +my $result = $old_sub->safe_psql('postgres', > > + "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'i'"); > > +is($result, qq(t), "Check that the table is in init state"); > > > > Hmm. Not sure that this is safe. Shouldn't this be a > > poll_query_until(), polling that the state of the relation is what we > > want it to be after requesting a fresh of the publication on the > > subscriber? > > This is not required as the table will be added in init state after > "Alter Subscription ... Refresh .." command itself. > > Thanks for the comments, the attached v24 version patch has the > changes for the same. Thank you for updating the patch. Here are some minor comments: +if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) +ereport(ERROR, +errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("relation %u does not exist", relid)); + I think the error code should be ERRCODE_UNDEFINED_TABLE, and the error message should be something like "relation with OID %u does not exist". Or we might not need such checks since an undefined-object error is caught by relation_open()? --- +/* Fetch the existing tuple. 
*/ +tup = SearchSysCache2(SUBSCRIPTIONNAME, MyDatabaseId, + CStringGetDatum(subname)); +if (!HeapTupleIsValid(tup)) +ereport(ERROR, +errcode(ERRCODE_UNDEFINED_OBJECT), +errmsg("subscription \"%s\" does not exist", subname)); + +form = (Form_pg_subscription) GETSTRUCT(tup); +subid = form->oid; The above code can be replaced with "get_subscription_oid(subname, false)". binary_upgrade_replorigin_advance() has the same code. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Built-in CTYPE provider
On 12/5/23 3:46 PM, Jeff Davis wrote: > === Character Classification === > > Character classification is used for regexes, e.g. whether a character > is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" > class. Unicode defines what character properties map into these > classes in TR #18 [1], specifying both a "Standard" variant and a > "POSIX Compatible" variant. The main difference with the POSIX variant > is that symbols count as punctuation. > > === LOWER()/INITCAP()/UPPER() === > > The LOWER() and UPPER() functions are defined in the SQL spec with > surprising detail, relying on specific Unicode General Category > assignments. How to map characters seems to be left (implicitly) up to > Unicode. If the input string is normalized, the output string must be > normalized, too. Weirdly, there's no room in the SQL spec to localize > LOWER()/UPPER() at all to handle issues like [1]. Also, the standard > specifies one example, which is that "ß" becomes "SS" when folded to > upper case. INITCAP() is not in the SQL spec. > > === Questions === > > * Is a built-in ctype provider a reasonable direction for Postgres as > a project? > * Does it feel like it would be simpler or more complex than what > we're doing now? > * Do we want to just try to improve our ICU support instead? > * Do we want the built-in provider to be one thing, or have a few > options (e.g. "standard" or "posix" character classification; > "simple" or "full" case mapping)? Generally, I am in favor of this - I think we need to move in the direction of having an in-database option around unicode for PG users, given how easy it is for administrators to mis-manage dependencies. Especially when OS admins can be different from DB admins, and when nobody really understands risks of changing libs with in-place moves to new operating systems - except for like 4 of us on the mailing lists. My biggest concern is around maintenance. Every year Unicode is assigning new characters to existing code points, and those existing code points can of course already be stored in old databases before libs are updated. When users start to notice that regex [[:digit:]] or upper/lower functions aren't working correctly with characters in their DB, they'll probably come asking for fixes. And we may end up with something like the timezone database where we need to periodically add a more current ruleset - albeit alongside as a new version in this case. Here are direct links to charts of newly assigned characters from the last few Unicode updates: 2022: https://www.unicode.org/charts/PDF/Unicode-15.0/ 2021: https://www.unicode.org/charts/PDF/Unicode-14.0/ 2020: https://www.unicode.org/charts/PDF/Unicode-13.0/ 2019: https://www.unicode.org/charts/PDF/Unicode-12.0/ If I'm reading the Unicode 15 update correctly, PostgreSQL regex expressions with [[:digit:]] will not correctly identify Kaktovik or Nag Mundari or Kawi digits without that update to character type specs. If I'm reading the Unicode 12 update correctly, then upper/lower functions aren't going to work correctly on Latin Glottal A and I and U characters without that update to character type specs. Overall I see a lot fewer Unicode updates involving upper/lower than I do with digits - especially since new scripts often involve their own numbering characters which makes new digits more common. But lets remember that people like to build indexes on character classification functions like upper/lower, for case insensitive searching. 
It's another case where the index will be corrupted if someone happened to store Latin Glottal vowels in their database and then we update libs to the latest character type rules. So even with something as basic as character type, if we're going to do it right, we still need to either version it or definitively decide that we're not going to ever support newly added Unicode characters like Latin Glottals. -Jeremy -- http://about.me/jeremy_schneider
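PS: For example, something as ordinary as

	CREATE INDEX users_email_lower_idx ON users (lower(email));
	SELECT * FROM users WHERE lower(email) = lower('Foo@example.com');

(table and column names made up) is only as stable as the case-mapping rules the collation provider happened to have when the index was built.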
Clean up find_typedefs and add support for Mac
The script was using a few deprecated things according to POSIX: - -o instead of || - egrep - `` instead of $() I removed those for their "modern" equivalents. Hopefully no buildfarm member complains. I can remove any of those patches though. I did go ahead and remove egrep usage from the entire codebase while I was at it. There is still a configure check though. I'm thinking that could also be removed? I moved system detection to use uname -s. I hope that isn't a big deal. Not sure the best way to identify Mac otherwise. The big patch here is adding support for Mac. objdump -W doesn't work on Mac. So, I used dsymutil and dwarfdump to achieve the same result. I am not someone who ever uses awk, so someone should definitely check my work there. I can only confirm this works on the latest version of Mac, and have no clue how backward compatible it is. I also wrote this without having a Mac. I had to ping a coworker with a Mac for help. My goal with the Mac support is to enable use of find_typedef for extension developers, where using a Mac might be more prominent than upstream Postgres development, but that is just a guess. -- Tristan Partin Neon (https://neon.tech) From e717a14a171c0226921ffed003dedd104bf3cf99 Mon Sep 17 00:00:00 2001 From: Tristan Partin Date: Tue, 12 Dec 2023 10:13:13 -0600 Subject: [PATCH v1 1/6] Replace egrep with grep -E GNU grep has been warning about grep -E since the 3.8 series. Further GNU commentary: > 7th Edition Unix had commands egrep and fgrep that were the counterparts > of the modern grep -E and grep -F. Although breaking up grep into three > programs was perhaps useful on the small computers of the 1970s, egrep > and fgrep were not standardized by POSIX and are no longer needed. In > the current GNU implementation, egrep and fgrep issue a warning and then > act like their modern counterparts; eventually, they are planned to be > removed entirely. Further man page documentation: > This grep has been enhanced in an upwards-compatible way to provide the > exact functionality of the historical egrep and fgrep commands as well. > It was the clear intention of the standard developers to consolidate the > three greps into a single command. 
--- src/backend/port/aix/mkldexport.sh | 4 ++-- src/tools/find_typedef | 6 +++--- src/tools/perlcheck/find_perl_files | 2 +- src/tools/pginclude/pgrminclude | 6 +++--- 4 files changed, 9 insertions(+), 9 deletions(-) diff --git a/src/backend/port/aix/mkldexport.sh b/src/backend/port/aix/mkldexport.sh index adf3793e86..41405eba02 100755 --- a/src/backend/port/aix/mkldexport.sh +++ b/src/backend/port/aix/mkldexport.sh @@ -53,9 +53,9 @@ else fi fi $NM -BCg $1 | \ - egrep ' [TDB] ' | \ + grep -E ' [TDB] ' | \ sed -e 's/.* //' | \ - egrep -v '\$' | \ + grep -E -v '\$' | \ sed -e 's/^[.]//' | \ sort | \ uniq diff --git a/src/tools/find_typedef b/src/tools/find_typedef index 24e9b76651..fec0520c32 100755 --- a/src/tools/find_typedef +++ b/src/tools/find_typedef @@ -36,12 +36,12 @@ do # if objdump -W is recognized, only one line of error should appear if [ `objdump -W 2>&1 | wc -l` -eq 1 ] then # Linux objdump -W "$DIR"/* | - egrep -A3 '\(DW_TAG_typedef\)' | + grep -E -A3 '\(DW_TAG_typedef\)' | awk ' $2 == "DW_AT_name" {print $NF}' elif [ `readelf -w 2>&1 | wc -l` -gt 1 ] then # FreeBSD, similar output to Linux readelf -w "$DIR"/* | - egrep -A3 '\(DW_TAG_typedef\)' | + grep -E -A3 '\(DW_TAG_typedef\)' | awk ' $1 == "DW_AT_name" {print $NF}' fi done | @@ -50,4 +50,4 @@ sort | uniq | # these are used both for typedefs and variable names # so do not include them -egrep -v '^(date|interval|timestamp|ANY)$' +grep -E -v '^(date|interval|timestamp|ANY)$' diff --git a/src/tools/perlcheck/find_perl_files b/src/tools/perlcheck/find_perl_files index 20dceb800d..406ec7f3a0 100644 --- a/src/tools/perlcheck/find_perl_files +++ b/src/tools/perlcheck/find_perl_files @@ -12,7 +12,7 @@ find_perl_files () { find "$@" -type f -name '*.p[lm]' -print # take executable files that file(1) thinks are perl files find "$@" -type f -perm -100 -exec file {} \; -print | - egrep -i ':.*perl[0-9]*\>' | + grep -E -i ':.*perl[0-9]*\>' | cut -d: -f1 } | sort -u | grep -v '^\./\.git/' } diff --git a/src/tools/pginclude/pgrminclude b/src/tools/pginclude/pgrminclude index 7cbd2e7c9c..27c6c5bfb5 100755 --- a/src/tools/pginclude/pgrminclude +++ b/src/tools/pginclude/pgrminclude @@ -64,14 +64,14 @@ compile_file() { [ "$INCLUDE" = "pg_config.h" ] && continue [ "$INCLUDE" = "c.h" ] && continue # Stringify macros will expand undefined identifiers, so skip files that use it - egrep -q '\<(CppAsString2?|CppConcat)\>' "$FILE" && continue + grep -E -q '\<(CppAsString2?|CppConcat)\>' "$FILE" && continue # preserve configure-specific includes # these includes are surrounded by #ifdef's grep -B1 '^#include[ ][ ]*[<"]'"$INCLUDE"'[>"]' "$FILE" | - egrep -q '^#if|^#else|^#elif
Re: encoding affects ICU regex character classification
On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote: > > How would you specify what you want? One proposal would be to have a builtin collation provider: https://postgr.es/m/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.ca...@j-davis.com I don't think there are very many ctype options, but they could be specified as part of the locale, or perhaps even as some provider- specific options specified at CREATE COLLATION time. > As with collating, I like the > idea of keeping support for libc even if it is terrible (some libcs > more than others) and eventually not the default, because I think > optional agreement with other software on the same host is a feature. Of course we should keep the libc support around. I'm not sure how relevant such a feature is, but I don't think we actually have to remove it. > Unless you also > implement built-in case mapping, you'd still have to call libc or ICU > for that, right? We can do built-in case mapping, see: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.ca...@j-davis.com > It seems a bit strange to use different systems for > classification and mapping. If you do implement mapping too, you > have > to decide if you believe it is language-dependent or not, I think? A complete solution would need to do the language-dependent case mapping. But that seems to only be 3 locales ("az", "lt", and "tr"), and only a handful of mapping changes, so we can handle that with the builtin provider as well. > Hmm, let's see what we're doing now... for ICU the regex code is > using > "simple" case mapping functions like u_toupper(c) that don't take a > locale, so no Turkish i/İ conversion for you, unlike our SQL > upper()/lower(), which this is supposed to agree with according to > the > comments at the top. I see why: POSIX can only do one-by-one > character mappings (which cannot handle Greek's context-sensitive > Σ->σ/ς or German's multi-character ß->SS) Regexes are inherently character-by-character, so transformations like ß->SS are not going to work for case-insensitive regex matching regardless of the provider. Σ->σ/ς does make sense, and what we have seems to be just broken: select 'ς' ~* 'Σ'; -- false in both libc and ICU select 'Σ' ~* 'ς'; -- true in both libc and ICU Similarly for titlecase variants: select 'Dž' ~* 'dž'; -- false in libc and ICU select 'dž' ~* 'Dž'; -- true in libc and ICU If we do the case mapping ourselves, we can make those work. We'd just have to modify the APIs a bit so that allcases() can actually get all of the case variants, rather than relying on just towupper/towlower. Regards, Jeff Davis
Re: encoding affects ICU regex character classification
On 12/12/23 1:39 PM, Jeff Davis wrote: > On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote: >> Unless you also >> implement built-in case mapping, you'd still have to call libc or ICU >> for that, right? > > We can do built-in case mapping, see: > > https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.ca...@j-davis.com > >> It seems a bit strange to use different systems for >> classification and mapping. If you do implement mapping too, you >> have >> to decide if you believe it is language-dependent or not, I think? > > A complete solution would need to do the language-dependent case > mapping. But that seems to only be 3 locales ("az", "lt", and "tr"), > and only a handful of mapping changes, so we can handle that with the > builtin provider as well. This thread has me second-guessing the reply I just sent on the other thread. Is someone able to test out upper & lower functions on U+A7BA ... U+A7BF across a few libs/versions? Theoretically the upper/lower behavior should change in ICU between Ubuntu 18.04 LTS and Ubuntu 20.04 LTS (specifically in ICU 64 / Unicode 12). And I have no idea if or when glibc might have picked up the new unicode characters. -Jeremy -- http://about.me/jeremy_schneider
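PS: Something like this should do it on a UTF-8 database (U+A7BA..U+A7BF are the capital/small Latin glottal A, I, U pairs added in Unicode 12; the "en-x-icu" collation exists when the server is built with ICU, and dropping the COLLATE clause exercises the default, typically libc, collation instead):

SELECT upper(U&'\A7BB\A7BD\A7BF' COLLATE "en-x-icu") AS upper_icu,
       lower(U&'\A7BA\A7BC\A7BE' COLLATE "en-x-icu") AS lower_icu,
       upper(U&'\A7BB\A7BD\A7BF') AS upper_default,
       lower(U&'\A7BA\A7BC\A7BE') AS lower_default;

On libraries older than Unicode 12 the characters should come back unchanged; on newer ones each letter should map to its paired form.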
Re: Clean up find_typedefs and add support for Mac
"Tristan Partin" writes: > The big patch here is adding support for Mac. objdump -W doesn't work on > Mac. So, I used dsymutil and dwarfdump to achieve the same result. We should probably nuke the current version of src/tools/find_typedef altogether in favor of copying the current buildfarm code for that. We know that the buildfarm's version works, whereas I'm not real sure that src/tools/find_typedef is being used in anger by anyone. Also, we're long past the point where developers can avoid having Perl installed. Ideally, the buildfarm client would then start to use find_typedef from the tree rather than have its own copy, both to reduce duplication and to ensure that the in-tree copy keeps working. regards, tom lane
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar wrote: > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada wrote: > > > > Hi all, > > > > As the comment of ReorderBufferLargestTXN() says, it's very slow with > > many subtransactions: > > > > /* > > * Find the largest transaction (toplevel or subxact) to evict (spill to > > disk). > > * > > * XXX With many subtransactions this might be quite slow, because we'll > > have > > * to walk through all of them. There are some options how we could improve > > * that: (a) maintain some secondary structure with transactions sorted by > > * amount of changes, (b) not looking for the entirely largest transaction, > > * but e.g. for transaction using at least some fraction of the memory > > limit, > > * and (c) evicting multiple transactions at once, e.g. to free a given > > portion > > * of the memory limit (e.g. 50%). > > */ > > > > This is because the reorderbuffer has transaction entries for each > > top-level and sub transaction, and ReorderBufferLargestTXN() walks > > through all transaction entries to pick the transaction to evict. > > I've heard the report internally that replication lag became huge when > > decoding transactions each consisting of 500k sub transactions. Note > > that ReorderBufferLargetstTXN() is used only in non-streaming mode. > > > > Here is a test script for a many subtransactions scenario. In my > > environment, the logical decoding took over 2min to decode one top > > transaction having 100k subtransctions. > > > > - > > create table test (c int); > > create or replace function testfn (cnt int) returns void as $$ > > begin > > for i in 1..cnt loop > > begin > > insert into test values (i); > > exception when division_by_zero then > > raise notice 'caught error'; > > return; > > end; > > end loop; > > end; > > $$ > > language plpgsql; > > select testfn(10) > > set logical_decoding_work_mem to '4MB'; > > select count(*) from pg_logical_slot_peek_changes('s', null, null) > > > > > > To deal with this problem, I initially thought of the idea (a) > > mentioned in the comment; use a binary heap to maintain the > > transactions sorted by the amount of changes or the size. But it seems > > not a good idea to try maintaining all transactions by its size since > > the size of each transaction could be changed frequently. > > > > The attached patch uses a different approach that consists of three > > strategies; (1) maintain the list of transactions whose size is larger > > than 10% of logical_decoding_work_mem, and preferentially evict a > > transaction from this list. If the list is empty, all transactions are > > small enough, (2) so we evict the oldest top-level transaction from > > rb->toplevel_by_lsn list. Evicting older transactions would help in > > freeing memory blocks in GenerationContext. Finally, if this is also > > empty, (3) we evict a transaction that size is > 0. Here, we need to > > note the fact that even if a transaction is evicted the > > ReorderBufferTXN entry is not removed from rb->by_txn but its size is > > 0. In the worst case where all (quite a few) transactions are smaller > > than 10% of the memory limit, we might end up checking many > > transactions to find non-zero size transaction entries to evict. So > > the patch adds a new list to maintain all transactions that have at > > least one change in memory. > > > > Summarizing the algorithm I've implemented in the patch, > > > > 1. pick a transaction from the list of large transactions (larger than > > 10% of memory limit). > > 2. 
pick a transaction from the top-level transaction list in LSN order. > > 3. pick a transaction from the list of transactions that have at least > > one change in memory. > > > > With the patch, the above test case completed within 3 seconds in my > > environment. > > Thanks for working on this, I think it would be good to test other > scenarios as well where this might have some negative impact and see > where we stand. Agreed. > 1) A scenario where suppose you have one very large transaction that > is consuming ~40% of the memory and 5-6 comparatively smaller > transactions that are just above 10% of the memory limit. And now for > coming under the memory limit instead of getting 1 large transaction > evicted out, we are evicting out multiple times. Given the large transaction list will have up to 10 transactions, I think it's cheap to pick the largest transaction among them. It's O(N) but N won't be large. > 2) Another scenario where all the transactions are under 10% of the > memory limit but let's say there are some transactions are consuming > around 8-9% of the memory limit each but those are not very old > transactions whereas there are certain old transactions which are > fairly small and consuming under 1% of memory limit and there are many > such transactions. So how it would affect if we frequently select > many of these transactions to come under memory limit instead
Re: Oversight in reparameterize_path_by_child leading to executor crash
On 6/12/2023 14:30, Richard Guo wrote: I've self-reviewed this patch again and I think it's now in a committable state. I'm wondering if we can mark it as 'Ready for Committer' now, or whether we need more review comments/feedback. To recap, this patch postpones reparameterization of paths until createplan.c, which would help avoid building the reparameterized expressions we might not use. More importantly, it makes it possible to modify the expressions in RTEs (e.g. tablesample) and in RelOptInfos (e.g. baserestrictinfo) for reparameterization. Failing to do that can lead to executor crashes, wrong results, or planning errors, as we have already seen. As I see it, this patch contains the following modifications: 1. Changes in create_nestloop_path: here, it arranges all the tests around the value of top_parent_relids, if any. It is OK for me. 2. Changes in reparameterize_path_by_child and the coupled new function path_is_reparameterizable_by_child. I compared them, and it looks good. One thing here: you use an assertion trap in the case of inconsistency in the behaviour of these two functions. How disastrous would it be if someone found such an inconsistency in production? Maybe it would be better to use an elog(PANIC, ...) report? 3. The ADJUST_CHILD_ATTRS() macro. Understanding the necessity of this change, I want to say it looks a bit ugly. 4. The SampleScan reparameterization part. It looks OK. It can help us in the future with asymmetric join and something else. 5. Tests. All of them are related to partitioning and the SampleScan issue. Maybe it is enough, but remember, this change now impacts the parameterization feature in general. 6. I can't judge how this switch of overall approach could impact something in the planner. I am only wary that, from the first reparameterization up to plan creation, we have some inconsistency in expressions (partitions refer to expressions with the parent relid). What if an extension in the middle changes the parent expression? By and large, this patch is in a good state and may be committed. -- regards, Andrei Lepikhov Postgres Professional
Re: broken master regress tests
On Tue, Oct 10, 2023 at 05:54:34PM -0700, Andres Freund wrote: > On 2023-10-10 17:08:25 -0700, Jeff Davis wrote: > > After this, it seems "make check" no longer picks up the locale from > > the system environment by default. > > Yea. I wonder if the better fix would have been to copy setenv("LC_MESSAGES", > "C", 1); > to the initdb template creation. That afaict also fixes the issue, with a > smaller blast radius? +1, that would restore the testing semantics known from v16-. I think the intent of the template was to optimize without changing semantics, and the above proposal aligns with that. Currently, for v17 alone, one needs installcheck or build file hacks to reproduce a locale-specific test failure. An alternative would be to declare that the tests are supported in one encoding+locale only, then stop testing others in the buildfarm. Even if we did that, I'm fairly sure we'd standardize on UTF8, not SQL_ASCII, as the one testable encoding.
Re: broken master regress tests
Noah Misch writes: > An alternative would be to declare that the tests are supported in one > encoding+locale only, then stop testing others in the buildfarm. Surely that's not even a thinkably acceptable choice. regards, tom lane
Re: Synchronizing slots from primary to standby
On Tue, Dec 12, 2023 at 2:44 PM shveta malik wrote: > > On Mon, Dec 11, 2023 at 7:12 PM Amit Kapila wrote: > > > > > I am > > asking this because the context used is TopMemoryContext which should > > be used only if we need something specific to be retained at the > > process level which doesn't seem to be the case here. > > > > Okay, I understand your concern. But this needs more thoughts on shall > we have all the slots synchronized in one txn or is it better to have > it existing way i.e. each slot being synchronized in its own txn > started in synchronize_one_slot. If we go by the former, can it have > any implications? > I think the one advantage of syncing each slot in a different transaction could have been if that helps with the visibility of updated slot information but that is not the case here as we always persist it to file. As per my understanding, here we need a transaction as we may access catalogs while creating/updating slots, so, a single transaction should be okay unless there are any other reasons. -- With Regards, Amit Kapila.
Re: Improve upcasting for INT range and multi range types
On Fri, Dec 8, 2023 at 4:21 AM Federico wrote: > > Hi, > > Postgresql seems to be missing upcasting when doing INT range and > multi-range operation, for example when checking if an int4 is inside > an int8 range. > Some non working example are the following > > SELECT 2::INT4 <@ '[1, 4)'::INT8RANGE > -- ERROR: operator does not exist: integer <@ int8range
select oprname, oprleft::regtype, oprright::regtype, oprcode from pg_operator where oprname = '<@';
Looking at the results, the related entries are:
 oprname |  oprleft   |   oprright    |           oprcode
---------+------------+---------------+------------------------------
 <@      | anyelement | anyrange      | elem_contained_by_range
 <@      | anyelement | anymultirange | elem_contained_by_multirange
For SELECT 2::INT4 <@ '[1, 4)'::INT8RANGE, the parser first does the operator sanity check, resolving anyelement and anyrange to concrete non-polymorphic data types, and only then calls the function elem_contained_by_range; here it fails at that first step. Per the docs https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC Similarly, if there are positions declared anyrange and others declared anyelement or anyarray, the actual range type in the anyrange positions must be a range whose subtype is the same type appearing in the anyelement positions and the same as the element type of the anyarray positions. If there are positions declared anymultirange, their actual multirange type must contain ranges matching parameters declared anyrange and base elements matching parameters declared anyelement and anyarray. Based on my interpretation, I don't think SELECT 2::INT4 <@ '[1, 4)'::INT8RANGE is doable.
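(For comparison, the spellings that do resolve today:

SELECT 2::int8 <@ '[1,4)'::int8range;   -- works, element cast to the range's subtype
SELECT 2 <@ '[1,4)'::int4range;         -- works, subtype already matches

so explicitly casting the element to the range's subtype is the current workaround.)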
Re: Improve upcasting for INT range and multi range types
jian he writes: > Based on my interpretation, I don't think SELECT 2::INT4 <@ '[1, > 4)'::INT8RANGE is doable. Yeah, it would require a considerable expansion of the scope of what can be matched by a polymorphic operator. I'm afraid that the negative consequences (mainly, "ambiguous operator" failures because more than one thing can be matched) would outweigh the benefits. It is kind of annoying though that the system can't do the "obvious" right thing here. regards, tom lane
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Dec 12, 2023 at 6:58 PM Alvaro Herrera wrote: > > [Added Andrey again in CC, because as I understand they are using this > code or something like it in production. Please don't randomly remove > people from CC lists.] Oh, glad to know that. Yeah, I generally do not remove people, but I have noticed that in the mail chain some of the reviewers just replied to me and the hackers' list, and from that point onwards I lost track of the CC list. > I've been looking at this some more, and I'm not confident that the > group clog update stuff is correct. I think the injection points test > case was good enough to discover a problem, but it's hard to get peace > of mind that there aren't other, more subtle problems. Yeah, I agree. > The problem I see is that the group update mechanism is designed around > contention of the global xact-SLRU control lock; it uses atomics to > coordinate a single queue when the lock is contended. So if we split up > the global SLRU control lock using banks, then multiple processes using > different bank locks might not contend. OK, this is fine, but what > happens if two separate groups of processes encounter contention on two > different bank locks? I think they will both try to update the same > queue, and coordinate access to that *using different bank locks*. I > don't see how this can work correctly. Let's back up a bit and start from the current design with the centralized lock. With that, if one process is holding the lock, the other processes will try to perform the group update; but if there is already a group that still hasn't got the lock and that group is trying to update a different CLOG page than the one this process wants to update, then the process will not add itself to the group update and will instead fall back to the normal lock wait. Now, in another situation, it may so happen that the group leader of the other group has already got the control lock, in which case it would have cleared 'procglobal->clogGroupFirst', which means we will now start forming a different group. So, logically, if we talk only about the optimization part, the assumption is that when we are trying to commit a lot of concurrent xids at a time, those xids are mostly of the same range and will fall in the same SLRU page, and the group update will help them. But if we are getting some out-of-range xid of some long-running transaction, it might not even go for the group update, as the page number will be different. Although the situation might be better here with a bank-wise lock, because if those xids fall in an altogether different bank they might not even contend. Now, let's talk about correctness. I think that even though we may get processes that are contending on different bank locks, we are still ensuring that in a group all the processes are trying to update the same SLRU page (i.e. the same bank as well; we will talk about the exception later[1]). One of the processes becomes the leader, and as soon as the leader gets the lock it detaches the queue from 'procglobal->clogGroupFirst' by setting it to INVALID_PGPROCNO so that other group update requesters can now form another parallel group. But here I do not see a problem with correctness. I agree someone might say that, since now there is a possibility that different groups get formed for different bank locks, we do not let other groups get formed until we get the lock for our bank, as we do not clear 'procglobal->clogGroupFirst' before we get the lock.
Other requesters might want to update the page in different banks so why block them? But the thing is the group update design is optimized for the cases where all requesters are trying to update the status of xids generated near the same range. > I suspect the first part of that algorithm, where atomics are used to > create the list without a lock, might work fine. But will each "leader" > process, each of which is potentially using a different bank lock, > coordinate correctly? Maybe this works correctly because only one > process will find the queue head not empty? If this is what happens, > then there needs to be comments about it. Yes, you got it right, I will try to comment on it better. Without any explanation, > this seems broken and potentially dangerous, as some transaction commit > bits might become lost given high enough concurrency and bad luck. > > Maybe this can be made to work by having one more lwlock that we use > solely to coordinate this task. Do you mean to say a different lock for adding/removing in the list instead of atomic operation? I think then we will lose the benefit we got in the group update by having contention on another lock. [1] I think we already know about the exception case and I have explained in the comments as well that in some cases we might add different clog page update requests in the same group, and for handling that exceptional case we are checking the respective bank lock for each page and if that exc
Re: Synchronizing slots from primary to standby
On Mon, Dec 11, 2023 at 5:13 PM shveta malik wrote: > > On Mon, Dec 11, 2023 at 1:22 PM Drouvot, Bertrand > wrote: > > > > > If we agree > > > on that then it would be good to prohibit setting this GUC on standby > > > or at least it should be a no-op even if this GUC should be set on > > > physical standby. > > > > I'd prefer to completely prohibit it on standby (to make it very clear it's > > not > > working at all) as long as one can enable it without downtime once the > > standby > > is promoted (which is the case currently). > > And I think slot-sync worker should exit as well on cascading standby. > Thoughts? > I think if one has set all the valid parameters for the slot-sync worker on a standby, we should not exit; rather, it should be a no-op, which means it should not try to sync slots from another standby. One scenario where this may help is when users promote a standby which has already synced slots from the primary. In this case, the cascading standby will become non-cascading and should sync slots. -- With Regards, Amit Kapila.
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada wrote: > > > Thanks for working on this, I think it would be good to test other > > scenarios as well where this might have some negative impact and see > > where we stand. > > Agreed. > > > 1) A scenario where suppose you have one very large transaction that > > is consuming ~40% of the memory and 5-6 comparatively smaller > > transactions that are just above 10% of the memory limit. And now for > > coming under the memory limit instead of getting 1 large transaction > > evicted out, we are evicting out multiple times. > > Given the large transaction list will have up to 10 transactions, I > think it's cheap to pick the largest transaction among them. It's O(N) > but N won't be large. Yeah, that makes sense. > > 2) Another scenario where all the transactions are under 10% of the > > memory limit but let's say there are some transactions are consuming > > around 8-9% of the memory limit each but those are not very old > > transactions whereas there are certain old transactions which are > > fairly small and consuming under 1% of memory limit and there are many > > such transactions. So how it would affect if we frequently select > > many of these transactions to come under memory limit instead of > > selecting a couple of large transactions which are consuming 8-9%? > > Yeah, probably we can do something for small transactions (i.e. small > and on-memory transactions). One idea is to pick the largest > transaction among them by iterating over all of them. Given that the > more transactions are evicted, the less transactions the on-memory > transaction list has, unlike the current algorithm, we would still > win. Or we could even split it into several sub-lists in order to > reduce the number of transactions to check. For example, splitting it > into two lists: transactions consuming 5% < and 5% >= of the memory > limit, and checking the 5% >= list preferably. The cost for > maintaining these lists could increase, though. > > Do you have any ideas? Yeah, something like what you mention might be good: we maintain 3 lists for large, medium, and small transactions. In the large transaction list, suppose we allow transactions that consume more than 10%; there could be at most 10 of them, so we can do a sequential search and spill the largest of all. Whereas in the medium list, suppose we keep transactions ranging from e.g. 3-10%; then it's just fine to pick from the head, because the size differences between the largest and smallest transactions in this list are not very significant. The rest go into the small transaction list, and from the small transaction list we can choose to spill multiple transactions at a time. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
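PS: Just to illustrate the first list, a sketch only (not the patch; 'large_node' is a hypothetical dlist link in ReorderBufferTXN used for this list):

static ReorderBufferTXN *
ReorderBufferLargestLargeTXN(dlist_head *large_txns)
{
	dlist_iter	iter;
	ReorderBufferTXN *largest = NULL;

	dlist_foreach(iter, large_txns)
	{
		ReorderBufferTXN *txn;

		txn = dlist_container(ReorderBufferTXN, large_node, iter.cur);
		if (largest == NULL || txn->size > largest->size)
			largest = txn;
	}

	return largest;
}

Since that list is capped at roughly 10 entries (each >10% of logical_decoding_work_mem), the linear scan stays cheap.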
Subsequent request from pg client
Can a postgres client send subsequent requests without first receiving the response to the previous one? If so, how does the client correlate a request with its response? Where can I get more hints about this?
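For reference, this is the kind of thing I mean, as a minimal libpq sketch (pipeline mode, PostgreSQL 14+; my understanding is that results come back strictly in the order the queries were sent, so correlation is purely positional):

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK || !PQenterPipelineMode(conn))
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		exit(1);
	}

	/* queue two queries without reading any results in between */
	PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
	PQsendQueryParams(conn, "SELECT 2", 0, NULL, NULL, NULL, NULL, 0);
	PQpipelineSync(conn);

	/* the first query's results arrive first, terminated by a NULL */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			printf("first: %s\n", PQgetvalue(res, 0, 0));
		PQclear(res);
	}

	/* then the second query's results */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			printf("second: %s\n", PQgetvalue(res, 0, 0));
		PQclear(res);
	}

	/* finally the PGRES_PIPELINE_SYNC marker for PQpipelineSync() */
	while ((res = PQgetResult(conn)) != NULL)
		PQclear(res);

	PQexitPipelineMode(conn);
	PQfinish(conn);
	return 0;
}

Is that the right mental model, or is there more to it at the protocol level?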
Re: Synchronizing slots from primary to standby
On Tue, Dec 12, 2023 at 5:56 PM Nisha Moond wrote: > > A review on v45 patch: > > If one creates a logical slot with failover=true as - > select pg_create_logical_replication_slot('logical_slot','pgoutput', > false, true, true); > > Then, uses the existing logical slot while creating a subscription - > postgres=# create subscription sub4 connection 'dbname=postgres > host=localhost port=5433' publication pub1t4 WITH > (slot_name=logical_slot, create_slot=false, failover=true); > NOTICE: changed the failover state of replication slot "logical_slot" > on publisher to false > CREATE SUBSCRIPTION > > Despite configuring logical_slot's failover to true and specifying > failover=true during subscription creation, the NOTICE indicates a > change in the failover state to 'false', without providing any > explanation for this transition. > It can be confusing for users, so IMO, the notice should include the > reason for switching failover to 'false' or should give a hint to use > either refresh=false or copy_data=false to enable failover=true for > the slot as we do in other similar 'alter subscription...' scenarios. > Agree. The NOTICE should be more informative. thanks SHveta
Re: Synchronizing slots from primary to standby
On Wed, Dec 13, 2023 at 10:40 AM Amit Kapila wrote: > > On Mon, Dec 11, 2023 at 5:13 PM shveta malik wrote: > > > > On Mon, Dec 11, 2023 at 1:22 PM Drouvot, Bertrand > > wrote: > > > > > > > If we agree > > > > on that then it would be good to prohibit setting this GUC on standby > > > > or at least it should be a no-op even if this GUC should be set on > > > > physical standby. > > > > > > I'd prefer to completely prohibit it on standby (to make it very clear > > > it's not > > > working at all) as long as one can enable it without downtime once the > > > standby > > > is promoted (which is the case currently). > > > > And I think slot-sync worker should exit as well on cascading standby. > > Thoughts? > > > > I think one has set all the valid parameters for the slot-sync worker > on standby, we should not exit, rather it should be no-op which means > it should not try to sync slots from another standby. One scenario > where this may help is when users promote the standby which has > already synced slots from the primary. In this case, cascading standby > will become non-cascading and should sync slots. > Right, then perhaps we should increase naptime in this no-op case. It could be even more then current inactivity naptime which is just 10sec. Shall it be say 5min in this case? thanks Shveta
Re: pg_upgrade and logical replication
On Wed, 13 Dec 2023 at 01:56, Masahiko Sawada wrote: > > On Thu, Dec 7, 2023 at 8:15 PM vignesh C wrote: > > > > On Tue, 5 Dec 2023 at 10:56, Michael Paquier wrote: > > > > > > On Mon, Dec 04, 2023 at 04:30:49PM +0530, Amit Kapila wrote: > > > > I have made minor changes in the comments and code at various places. > > > > See and let me know if you are not happy with the changes. I think > > > > unless there are more suggestions or comments, we can proceed with > > > > committing it. > > > > > > Yeah. I am planning to look more closely at what you have here, and > > > it is going to take me a bit more time though (some more stuff planned > > > for next CF, an upcoming conference and end/beginning-of-year > > > vacations), but I think that targetting the beginning of next CF in > > > January would be OK. > > > > > > Overall, I have the impression that the patch looks pretty solid, with > > > a restriction in place for "init" and "ready" relations, while there > > > are tests to check all the states that we expect. Seeing coverage > > > about all that makes me a happy hacker. > > > > > > + * If retain_lock is true, then don't release the locks taken in this > > > function. > > > + * We normally release the locks at the end of transaction but in > > > binary-upgrade > > > + * mode, we expect to release those immediately. > > > > > > I think that this should be documented in pg_upgrade_support.c where > > > the caller expects the locks to be released, and why these should be > > > released. There is a risk that this comment becomes obsolete if > > > AddSubscriptionRelState() with locks released is called in a different > > > code path. Anyway, I am not sure to get why this is OK, or even > > > necessary. It seems like a good practice to keep the locks on the > > > subscription until the transaction that updates its state. If there's > > > a specific reason explaining why that's better, the patch should tell > > > why. > > > > Added comments for this. > > > > > + * However, this shouldn't be a problem as the upgrade ensures > > > + * that all the transactions were replicated before upgrading the > > > + * publisher. > > > This wording looks a bit confusing to me, as "the upgrade" could refer > > > to the upgrade of a subscriber, but what we want to tell is that the > > > replay of the transactions is enforced when doing a publisher upgrade. > > > I'd suggest something like "the upgrade of the publisher ensures that > > > all the transactions were replicated before upgrading it". > > > > Modified > > > > > +my $result = $old_sub->safe_psql('postgres', > > > + "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = > > > 'i'"); > > > +is($result, qq(t), "Check that the table is in init state"); > > > > > > Hmm. Not sure that this is safe. Shouldn't this be a > > > poll_query_until(), polling that the state of the relation is what we > > > want it to be after requesting a fresh of the publication on the > > > subscriber? > > > > This is not required as the table will be added in init state after > > "Alter Subscription ... Refresh .." command itself. > > > > Thanks for the comments, the attached v24 version patch has the > > changes for the same. > > Thank you for updating the patch. 
> > Here are some minor comments: > > +if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) > +ereport(ERROR, > +errcode(ERRCODE_INVALID_PARAMETER_VALUE), > +errmsg("relation %u does not exist", relid)); > + > > I think the error code should be ERRCODE_UNDEFINED_TABLE, and the > error message should be something like "relation with OID %u does not > exist". Or we might not need such checks since an undefined-object > error is caught by relation_open()? I have removed this as it will be caught by relation_open. > --- > +/* Fetch the existing tuple. */ > +tup = SearchSysCache2(SUBSCRIPTIONNAME, MyDatabaseId, > + CStringGetDatum(subname)); > +if (!HeapTupleIsValid(tup)) > +ereport(ERROR, > +errcode(ERRCODE_UNDEFINED_OBJECT), > +errmsg("subscription \"%s\" does not > exist", subname)); > + > +form = (Form_pg_subscription) GETSTRUCT(tup); > +subid = form->oid; Modified > The above code can be replaced with "get_subscription_oid(subname, > false)". binary_upgrade_replorigin_advance() has the same code. Modified Thanks for the comments, the attached v25 version patch has the changes for the same. Regards, Vignesh From a227419d761e9e121eeaf2db621d227c6abc6dc0 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Thu, 7 Dec 2023 09:50:27 +0530 Subject: [PATCH v25] Allow upgrades to preserve the full subscription's state. This feature will allow us to replicate the changes on subscriber nodes after the upgrade. Previously, only the subscription
Fix bug with indexes on whole-row expressions
Hi Thomas Munro and Laurenz Albe, Since I wasn't subscribed to the pgsql-hackers mailing list before this bug was raised, please forgive me for not being able to reply directly to the thread below. https://www.postgresql.org/message-id/flat/e48a5d9a2d3d72985d61ee254314f5f5f5444a55.ca...@cybertec.at The following patch forbids creating indexes on whole-row expressions. I'd like to hear your opinions. -- Best Wishes, ywgrit diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index cd23ab3b25..e4451b1d36 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -572,9 +572,18 @@ DefineIndex(Oid relationId, Oid root_save_userid; int root_save_sec_context; int root_save_nestlevel; + ListCell *lc; root_save_nestlevel = NewGUCNestLevel(); + foreach (lc, stmt->indexParams) + { + IndexElem *ielem = castNode(IndexElem, lfirst(lc)); + if (IsA(ielem->expr, Var) && castNode(Var, ielem->expr)->varattno == 0) + ereport(ERROR, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("cannot create index on whole-row expression of table '%s'", ielem->indexcolname))); + } /* * Some callers need us to run with an empty default_tablespace; this is a * necessary hack to be able to reproduce catalog state accurately when