Add isCatalogRel in rmgrdesc
Hi hackers, 6af1793954 added a new field namely "isCatalogRel" in some WAL records to help detecting row removal conflict during logical decoding from standby. Please find attached a patch to add this field in the related rmgrdesc (i.e all the ones that already provide the snapshotConflictHorizon except the one related to xl_heap_visible: indeed a new bit was added in its flag field in 6af1793954 instead of adding the isCatalogRel bool). I think it's worth it, as this new field could help diagnose conflicts issues (if any). Looking forward to your feedback, Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.comFrom 564e689cf74e1b9893a28b5ecf6bdbc6fb549232 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Mon, 11 Dec 2023 21:10:30 + Subject: [PATCH v1] adding isCatalogRel to rmgrdesc --- src/backend/access/rmgrdesc/gistdesc.c | 10 ++ src/backend/access/rmgrdesc/hashdesc.c | 5 +++-- src/backend/access/rmgrdesc/heapdesc.c | 10 ++ src/backend/access/rmgrdesc/nbtdesc.c | 10 ++ src/backend/access/rmgrdesc/spgdesc.c | 5 +++-- 5 files changed, 24 insertions(+), 16 deletions(-) 100.0% src/backend/access/rmgrdesc/ diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c index 5dc6e1dcee..8008c041a6 100644 --- a/src/backend/access/rmgrdesc/gistdesc.c +++ b/src/backend/access/rmgrdesc/gistdesc.c @@ -26,18 +26,20 @@ out_gistxlogPageUpdate(StringInfo buf, gistxlogPageUpdate *xlrec) static void out_gistxlogPageReuse(StringInfo buf, gistxlogPageReuse *xlrec) { - appendStringInfo(buf, "rel %u/%u/%u; blk %u; snapshotConflictHorizon %u:%u", + appendStringInfo(buf, "rel %u/%u/%u; blk %u; snapshotConflictHorizon %u:%u, isCatalogRel %u", xlrec->locator.spcOid, xlrec->locator.dbOid, xlrec->locator.relNumber, xlrec->block, EpochFromFullTransactionId(xlrec->snapshotConflictHorizon), - XidFromFullTransactionId(xlrec->snapshotConflictHorizon)); + XidFromFullTransactionId(xlrec->snapshotConflictHorizon), +xlrec->isCatalogRel); } static void out_gistxlogDelete(StringInfo buf, gistxlogDelete *xlrec) { - appendStringInfo(buf, "delete: snapshotConflictHorizon %u, nitems: %u", -xlrec->snapshotConflictHorizon, xlrec->ntodelete); + appendStringInfo(buf, "delete: snapshotConflictHorizon %u, nitems: %u, isCatalogRel %u", +xlrec->snapshotConflictHorizon, xlrec->ntodelete, +xlrec->isCatalogRel); } static void diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c index b6810a9320..49855ce467 100644 --- a/src/backend/access/rmgrdesc/hashdesc.c +++ b/src/backend/access/rmgrdesc/hashdesc.c @@ -113,9 +113,10 @@ hash_desc(StringInfo buf, XLogReaderState *record) { xl_hash_vacuum_one_page *xlrec = (xl_hash_vacuum_one_page *) rec; - appendStringInfo(buf, "ntuples %d, snapshotConflictHorizon %u", + appendStringInfo(buf, "ntuples %d, snapshotConflictHorizon %u, isCatalogRel %u", xlrec->ntuples, - xlrec->snapshotConflictHorizon); + xlrec->snapshotConflictHorizon, + xlrec->isCatalogRel); break; } } diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c index f382c0f623..3c17485b02 100644 --- a/src/backend/access/rmgrdesc/heapdesc.c +++ b/src/backend/access/rmgrdesc/heapdesc.c @@ -179,10 +179,11 @@ heap2_desc(StringInfo buf, XLogReaderState *record) { xl_heap_prune *xlrec = (xl_heap_prune *) rec; - appendStringInfo(buf, "snapshotConflictHorizon: %u, nredirected: %u, ndead: %u", + appendStringInfo(buf, "snapshotConflictHorizon: %u, nredirected: %u, ndead: 
%u, isCatalogRel %u", xlrec->snapshotConflictHorizon, xlrec->nredirected, -xlrec->ndead); +xlrec->ndead, +xlrec->isCatalogRel); if (XLogRecHasBlockData(record, 0)) { @@ -238,8 +239,9 @@ heap2_desc(StringInfo
planner chooses incremental but not the best one
Dear Hackers, I've come across a behaviour of the planner I can't explain. After a migration from 11 to 15 (on RDS) we noticed a degradation in response time on a query, it went from a few seconds to ten minutes. A vacuum(analyze) has been realized to be sure that all is clean. The 'explain analyze' shows us a change of plan. Postgresql 15 chooses `incremental sort` with an index corresponding to the ORDER BY clause (on the created_at column). The previous v11 plan used a more efficient index. By deactivating incremental sort, response times in v15 are equal to v11 one. Here is the query SELECT inputdocum0_.id AS col_0_0_ FROM document_management_services.input_document inputdocum0_ WHERE (inputdocum0_.indexation_domain_id in ('2d29daf6-e151-479a-a52a-78b08bb3009d')) AND (inputdocum0_.indexation_subsidiary_id in ('9f9df402-f70b-40d9-b283-a3c35232469a')) AND (inputdocum0_.locked_at IS NULL) AND (inputdocum0_.locked_by_app IS NULL) AND (inputdocum0_.locked_by_user IS NULL) AND (inputdocum0_.lock_time_out IS NULL) AND inputdocum0_.archiving_state<> 'DESTROYED' AND (inputdocum0_.creation_state in ('READY')) AND inputdocum0_.active_content=true AND (inputdocum0_.processing_state in ('PENDING_INDEXATION')) ORDER BY inputdocum0_.created_at ASC, inputdocum0_.reception_id ASC, inputdocum0_.reception_order ASC LIMIT 50 ; Here are some details, the table `input_document` is partionned by hash with 20 partitions with a lot of indexes Indexes: "input_document_pkey" PRIMARY KEY, btree (id) "input_document_api_version_idx" btree (api_version) INVALID "input_document_created_at_idx" btree (created_at) "input_document_created_by_user_profile_idx" btree (created_by_user_profile) "input_document_dashboard_idx" btree (processing_state, indexation_family_id, indexation_group_id, reception_id) INCLUDE (active_content, archiving_state, creation_state) WHERE active_content = true AND archiving_state <> 'DESTROYED'::text AND creation_state <> 'PENDING'::text "input_document_fts_description_idx" gin (to_tsvector('simple'::regconfig, description)) "input_document_fts_insured_firstname_idx" gin (to_tsvector('simple'::regconfig, indexation_insured_firstname)) "input_document_fts_insured_lastname_idx" gin (to_tsvector('simple'::regconfig, indexation_insured_lastname)) "input_document_indexation_activity_id_idx" btree (indexation_activity_id) "input_document_indexation_agency_id_idx" btree (indexation_agency_id) "input_document_indexation_distributor_id_idx" btree (indexation_distributor_id) "input_document_indexation_domain_id_idx" btree (indexation_domain_id) "input_document_indexation_family_id_idx" btree (indexation_family_id) "input_document_indexation_group_id_idx" btree (indexation_group_id) "input_document_indexation_insurer_id_idx" btree (indexation_insurer_id) "input_document_indexation_nature_id_idx" btree (indexation_nature_id) "input_document_indexation_reference_idx" btree (indexation_reference) "input_document_indexation_subsidiary_id_idx" btree (indexation_subsidiary_id) "input_document_indexation_warranty_id_idx" btree (indexation_warranty_id) "input_document_locked_by_user_idx" btree (locked_by_user) "input_document_modified_at_idx" btree (modified_at) "input_document_modified_by_user_profile_idx" btree (modified_by_user_profile) "input_document_processing_state_idx" btree (processing_state) "input_document_stock_idx" btree (active_content, archiving_state, creation_state, processing_state) WHERE active_content AND archiving_state <> 'DESTROYED'::text AND creation_state <> 'PENDING'::text AND 
(processing_state = ANY ('{PENDING_PROCESSING,PENDING_INDEXATION,READY}'::text[])) "input_dom_act_pi_idx" btree (indexation_activity_id, indexation_domain_id) WHERE processing_state = 'PENDING_INDEXATION'::text "input_dom_act_pp_idx" btree (indexation_activity_id, indexation_domain_id) WHERE processing_state = 'PENDING_PROCESSING'::text "input_dom_act_sub_idx" btree (indexation_activity_id, indexation_domain_id, indexation_subsidiary_id) "input_reception_id_created_at_idx" btree (reception_id, created_at) "input_reception_id_reception_order_idx" btree (reception_id, reception_order) "operational_perimeter_view_idx" btree (processing_state, indexation_distributor_id) WHERE processing_state = 'PENDING_PROCESSING'::text Please find attached the 3 plans: explain_analyse_incremental_off.txt with enable_incremental_sort set to off, explain_analyse_incremental_on.txt with enable_incremental_sort set to on, and explain_analyse_incremental_on_limit5000 with enable_incremental_sort set to on but with the limit increased to 5000; in that case the chosen plan does not use `Incremental Sort`. The point that I don't understand in the plan (incremental_sort to on) is the top level one, the limit co
Re: How abnormal server shutdown could be detected by tests?
On Sat, Dec 9, 2023 at 9:30 AM Alexander Lakhin wrote: > > Hello hackers, > > While studying bug #18158, I've come to the conclusion that the existing > testing infrastructure is unable to detect abnormal situations. of some > kind. > > Just a simple example: > With Assert(0) injected in walsender (say: > sed "s/WalSndDone(send_data)/Assert(0)/" -i > src/backend/replication/walsender.c > ), I observe the following: > $ make -s check -C src/test/recovery PROVE_TESTS="t/012*" > > t/012_subtransactions.pl .. ok > All tests successful. > > (The same with meson: > 1/1 postgresql:recovery / recovery/012_subtransactions OK > 3.24s 12 subtests passed) > > But: > $ grep TRAP src/test/recovery/tmp_check/log/* > src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373729 > src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373750 > src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373821 > src/test/recovery/tmp_check/log/012_subtransactions_standby.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373786 > > src/test/recovery/tmp_check/log/012_subtransactions_primary.log contains: > ... > 2023-12-09 03:23:20.210 UTC [375933] LOG: server process (PID 375975) was > terminated by signal 6: Aborted > 2023-12-09 03:23:20.210 UTC [375933] DETAIL: Failed process was running: > START_REPLICATION 0/300 TIMELINE 3 > 2023-12-09 03:23:20.210 UTC [375933] LOG: terminating any other active > server processes > 2023-12-09 03:23:20.210 UTC [375933] LOG: abnormal database system shutdown > 2023-12-09 03:23:20.211 UTC [375933] LOG: database system is shut down > ... > > So the shutdown definitely was considered abnormal, but we missed that. > > With log_min_messages = DEBUG3, I can see in the log: > 2023-12-09 03:26:50.145 UTC [377844] LOG: abnormal database system shutdown > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: shmem_exit(1): 0 > before_shmem_exit callbacks to make > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: shmem_exit(1): 5 on_shmem_exit > callbacks to make > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: cleaning up orphaned dynamic > shared memory with ID 2898643884 > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: cleaning up dynamic shared > memory control segment with ID 3446966170 > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: proc_exit(1): 2 callbacks to make > 2023-12-09 03:26:50.146 UTC [377844] LOG: database system is shut down > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: exit(1) > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: shmem_exit(-1): 0 > before_shmem_exit callbacks to make > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: shmem_exit(-1): 0 on_shmem_exit > callbacks to make > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: proc_exit(-1): 0 callbacks to > make > > The postmaster process exits with exit code 1, but pg_ctl can't get the > code and just reports that stop was completed successfully. > For what it's worth, there is another thread which stated the similar problem: https://www.postgresql.org/message-id/flat/2366244.1651681550%40sss.pgh.pa.us > Should this be improved? And if yes, how it can be done? > Maybe postmaster shouldn't remove it's postmaster.pid when it exits > abnormally to let pg_ctl know of it? > thanks Shveta
Re: Synchronizing slots from primary to standby
On Mon, Dec 11, 2023 at 7:12 PM Amit Kapila wrote: > > On Mon, Dec 11, 2023 at 2:41 PM shveta malik wrote: > > > > > > > > 5. > > > +synchronize_slots(WalReceiverConn *wrconn) > > > { > > > ... > > > ... > > > + /* The syscache access needs a transaction env. */ > > > + StartTransactionCommand(); > > > + > > > + /* > > > + * Make result tuples live outside TopTransactionContext to make them > > > + * accessible even after transaction is committed. > > > + */ > > > + MemoryContextSwitchTo(oldctx); > > > + > > > + /* Construct query to get slots info from the primary server */ > > > + initStringInfo(&s); > > > + construct_slot_query(&s); > > > + > > > + elog(DEBUG2, "slot sync worker's query:%s \n", s.data); > > > + > > > + /* Execute the query */ > > > + res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow); > > > + pfree(s.data); > > > + > > > + if (res->status != WALRCV_OK_TUPLES) > > > + ereport(ERROR, > > > + (errmsg("could not fetch failover logical slots info " > > > + "from the primary server: %s", res->err))); > > > + > > > + CommitTransactionCommand(); > > > ... > > > ... > > > } > > > > > > Where exactly in the above code, there is a syscache access as > > > mentioned above StartTransactionCommand()? > > > > > > > It is in walrcv_exec (libpqrcv_processTuples). I have changed the > > comments to add this info. > > > > Okay, I see that the patch switches context twice once after starting > the transaction and the second time after committing the transaction, > why is that required? Also, can't we extend the duration of the > transaction till the remote_slot information is constructed? If we extend duration, we have to extend till remote_slot information is consumed and not only till it is constructed. > I am > asking this because the context used is TopMemoryContext which should > be used only if we need something specific to be retained at the > process level which doesn't seem to be the case here. > Okay, I understand your concern. But this needs more thoughts on shall we have all the slots synchronized in one txn or is it better to have it existing way i.e. each slot being synchronized in its own txn started in synchronize_one_slot. If we go by the former, can it have any implications? I need to review this bit more before concluding. . > I have noticed a few other minor things: > 1. > postgres=# select * from pg_logical_slot_get_changes('log_slot_2', NULL, > NULL); > ERROR: cannot use replication slot "log_slot_2" for logical decoding > DETAIL: This slot is being synced from the primary server. > ... > ... > postgres=# select * from pg_drop_replication_slot('log_slot_2'); > ERROR: cannot drop replication slot "log_slot_2" > DETAIL: This slot is being synced from the primary. > > I think the DETAIL message should be the same in the above two cases. > > 2. > +void > +WalSndWaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) > +{ > + List*standby_slots; > + > + Assert(!am_walsender); > + > + if (!MyReplicationSlot->data.failover) > + return; > + > + standby_slots = GetStandbySlotList(true); > + > + ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv); > ... > ... > > Shouldn't we return if the standby slot names list is NIL unless there > is a reason to do ConditionVariablePrepareToSleep() or any of the code > following it? > > -- > With Regards, > Amit Kapila.
Re: Add isCatalogRel in rmgrdesc
On Tue, Dec 12, 2023 at 09:23:46AM +0100, Drouvot, Bertrand wrote: > Please find attached a patch to add this field in the related rmgrdesc (i.e > all the ones that already provide the snapshotConflictHorizon except the one > related to xl_heap_visible: indeed a new bit was added in its flag field in > 6af1793954 > instead of adding the isCatalogRel bool). > > I think it's worth it, as this new field could help diagnose conflicts issues > (if any). Agreed that this is helpful. One would likely guess if you are dealing with a catalog relation depending on its relfilenode, but that does not take into account user_catalog_table that can be set as a reloption, impacting the value of isCatalogRel stored in the records. -- Michael signature.asc Description: PGP signature
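To illustrate with a hypothetical table (not taken from the thread): a perfectly ordinary user relation can become relevant for logical decoding through the reloption alone, which is exactly what the relfilenode would not reveal:

CREATE TABLE replication_metadata (id int PRIMARY KEY, payload text)
    WITH (user_catalog_table = true);

Row removal on such a table can then conflict with logical decoding on a standby, so having isCatalogRel spelled out in the rmgrdesc output should make those cases easier to spot.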
Re: Add isCatalogRel in rmgrdesc
Hi, On 12/12/23 10:15 AM, Michael Paquier wrote: On Tue, Dec 12, 2023 at 09:23:46AM +0100, Drouvot, Bertrand wrote: Please find attached a patch to add this field in the related rmgrdesc (i.e all the ones that already provide the snapshotConflictHorizon except the one related to xl_heap_visible: indeed a new bit was added in its flag field in 6af1793954 instead of adding the isCatalogRel bool). I think it's worth it, as this new field could help diagnose conflicts issues (if any). Agreed that this is helpful. Thanks for looking at it! One would likely guess if you are dealing with a catalog relation depending on its relfilenode, but that does not take into account user_catalog_table that can be set as a reloption, impacting the value of isCatalogRel stored in the records. Exactly, and that is not even mentioning the other checks in RelationIsAccessibleInLogicalDecoding(), like the wal_level >= logical one. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
RE: Synchronizing slots from primary to standby
On Monday, December 11, 2023 3:32 PM Peter Smith > > Here are some review comments for v44-0001 > > == > src/backend/replication/slot.c > > > 1. ReplicationSlotCreate > > * during getting changes, if the two_phase option is enabled it can skip > * prepare because by that time start decoding point has been moved. So > the > * user will only get commit prepared. > + * failover: Allows the slot to be synced to physical standbys so that > logical > + * replication can be resumed after failover. > */ > void > ReplicationSlotCreate(const char *name, bool db_specific, > > ~ > > /Allows the slot.../If enabled, allows the slot.../ Changed. > > == > > 2. validate_standby_slots > > +validate_standby_slots(char **newval) > +{ > + char*rawname; > + List*elemlist; > + ListCell *lc; > + bool ok = true; > + > + /* Need a modifiable copy of string */ rawname = pstrdup(*newval); > + > + /* Verify syntax and parse string into list of identifiers */ if (!(ok > + = SplitIdentifierString(rawname, ',', &elemlist))) > + GUC_check_errdetail("List syntax is invalid."); > + > + /* > + * If there is a syntax error in the name or if the replication slots' > + * data is not initialized yet (i.e., we are in the startup process), > + skip > + * the slot verification. > + */ > + if (!ok || !ReplicationSlotCtl) > + { > + pfree(rawname); > + list_free(elemlist); > + return ok; > + } > > > 2a. > You don't need to initialize 'ok' during declaration because it is assigned > immediately anyway. > > ~ > > 2b. > AFAIK assignment within a conditional like this is not a normal PG coding > style > unless there is no other way to do it. > Changed. > ~ > > 2c. > /into list/into a list/ > > SUGGESTION > /* Verify syntax and parse string into a list of identifiers */ ok = > SplitIdentifierString(rawname, ',', &elemlist); if (!ok) > GUC_check_errdetail("List syntax is invalid."); > > Changed. > ~~~ > > 3. assign_standby_slot_names > > + if (!SplitIdentifierString(standby_slot_names_cpy, ',', > + &standby_slots)) { > + /* This should not happen if GUC checked check_standby_slot_names. */ > + elog(ERROR, "list syntax is invalid"); } > > This error here and in validate_standby_slots() are different -- "list" versus > "List". > The message has been changed to "invalid list syntax" to be consistent with other elog. > == > src/backend/replication/walsender.c > > > 4. WalSndFilterStandbySlots > > > + foreach(lc, standby_slots_cpy) > + { > + char*name = lfirst(lc); > + XLogRecPtr restart_lsn = InvalidXLogRecPtr; bool invalidated = false; > + char*warningfmt = NULL; > + ReplicationSlot *slot; > + > + slot = SearchNamedReplicationSlot(name, true); > + > + if (slot && SlotIsPhysical(slot)) > + { > + SpinLockAcquire(&slot->mutex); > + restart_lsn = slot->data.restart_lsn; > + invalidated = slot->data.invalidated != RS_INVAL_NONE; > + SpinLockRelease(&slot->mutex); } > + > + /* Continue if the current slot hasn't caught up. 
*/ if (!invalidated > + && !XLogRecPtrIsInvalid(restart_lsn) && restart_lsn < wait_for_lsn) { > + /* Log warning if no active_pid for this physical slot */ if > + (slot->active_pid == 0) ereport(WARNING, errmsg("replication slot > + \"%s\" specified in parameter \"%s\" does > not have active_pid", > +name, "standby_slot_names"), > + errdetail("Logical replication is waiting on the " > + "standby associated with \"%s\"", name), errhint("Consider starting > + standby associated with " > + "\"%s\" or amend standby_slot_names", name)); > + > + continue; > + } > + else if (!slot) > + { > + /* > + * It may happen that the slot specified in standby_slot_names GUC > + * value is dropped, so let's skip over it. > + */ > + warningfmt = _("replication slot \"%s\" specified in parameter > \"%s\" does not exist, ignoring"); > + } > + else if (SlotIsLogical(slot)) > + { > + /* > + * If a logical slot name is provided in standby_slot_names, issue > + * a WARNING and skip it. Although logical slots are disallowed in > + * the GUC check_hook(validate_standby_slots), it is still > + * possible for a user to drop an existing physical slot and > + * recreate a logical slot with the same name. Since it is > + * harmless, a WARNING should be enough, no need to error-out. > + */ > + warningfmt = _("cannot have logical replication slot \"%s\" in > parameter \"%s\", ignoring"); > + } > + else if (XLogRecPtrIsInvalid(restart_lsn) || invalidated) { > + /* > + * Specified physical slot may have been invalidated, so there is no > + point > + * in waiting for it. > + */ > + warningfmt = _("physical slot \"%s\" specified in parameter \"%s\" > has been invalidated, ignoring"); > + } > + else > + { > + Assert(restart_lsn >= wait_for_lsn); > + } > > This if/else chain seems structured awkwardly. IMO it would be tidier to > eliminate the NULL slot and IsLogicalSlot up-front, which would also simplify > some of the subsequent conditions > > SUGGESTION > > slot = SearchNamedRepli
Re: logical decoding and replication of sequences, take 2
Hi, There's been a lot discussed over the past month or so, and it's become difficult to get a good idea what's the current state - what issues remain to be solved, what's unrelated to this patch, and how to move if forward. Long-running threads tend to be confusing, so I had a short call with Amit to discuss the current state yesterday, and to make sure we're on the same page. I believe it was very helpful, and I've promised to post a short summary of the call - issues, what we agreed seems like a path forward, etc. Obviously, I might have misunderstood something, in which case Amit can correct me. And I'd certainly welcome opinions from others. In general, we discussed three areas - desirability of the feature, correctness and performance. I believe a brief summary of the agreement would be this: - desirability of the feature: Random IDs (UUIDs etc.) are likely a much better solution for distributed (esp. active-active) systems. But there are important use cases that are likely to keep using regular sequences (online upgrades of single-node instances, existing systems, ...). - correctness: There's one possible correctness issue, when the snapshot changes to FULL between record creating a sequence relfilenode and that sequence advancing. This needs to be verified/reproduced, and fixed. - performance issues: We've agreed the case with a lot of aborts (when DecodeCommit consumes a lot of CPU) is unrelated to this patch. We've discussed whether the overhead with many sequence changes (nextval-40) is acceptable, and/or how to improve it. Next, I'll go over these points in more details, with my understanding of what the challenges are, possible solutions etc. Most of this was discussed/agreed on the call, but some are ideas I had only after the call when writing this summary. 1) desirability of the feature Firstly, do we actually want/need this feature? I believe that's very much a question of what use cases we're targeting. If we only focus on distributed databases (particularly those with multiple active nodes), then we probably agree that the right solution is to not use sequences (~generators of incrementing values) but UUIDs or similar random identifiers (better not call them sequences, there's not much sequential about it). The huge advantage is this does not require replicating any state between the nodes, so logical decoding can simply ignore them and replicate just the generated values. I don't think there's any argument about that. If I as building such distributed system, I'd certainly use such random IDs. The question is what to do about the other use cases - online upgrades relying on logical decoding, failovers to logical replicas, and so on. Or what to do about existing systems that can't be easily changed to use different/random identifiers. Those are not really distributed systems and therefore don't quite need random IDs. Furthermore, it's not like random IDs have no drawbacks - UUIDv4 can easily lead to massive write amplification, for example. There are variants like UUIDv7 reducing the impact, but there's other stuff. My takeaway from this is there's still value in having this feature. 2) correctness The only correctness issue I'm aware of is the question what happens when the snapshot switches to SNAPBUILD_FULL_SNAPSHOT between decoding the relfilenode creation and the sequence increment, pointed out by Dilip in [1]. 
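For context, a minimal sketch (not from the thread) of the two kinds of sequence changes the decoder has to tell apart, assuming the patch's rule that a change is transactional when the sequence's relfilenode was created by the same, still-running transaction:

BEGIN;
CREATE SEQUENCE s;       -- relfilenode created by this transaction
SELECT nextval('s');     -- transactional change: replayed only if the transaction commits
COMMIT;

SELECT nextval('s');     -- relfilenode predates this transaction:
                         -- non-transactional, decoded and applied immediately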
If this happens (and while I don't have a reproducer, I also don't have a very clear idea why it couldn't happen), it breaks how the patch decides between transactional and non-transactional sequence changes. So this seems like a fatal flaw - it definitely needs to be solved. I don't have a good idea how to do that, unfortunately. The problem is the dependency on an earlier record, and that this needs to be evaluated immediately (in the decode phase). Logical messages don't have the same issue because the "transactional" flag does not depend on earlier stuff, and other records are not interpreted until apply/commit, when we know everything relevant was decoded. I don't know what the solution is. Either we find a way to make sure not to lose/skip the smgr record, or we need to rethink how we determine the transactional flag (perhaps even try again adding it to the WAL record, but we didn't find a way to do that earlier). 3) performance issues We have discussed two cases - "ddl-abort" and "nextval-40". The "ddl-abort" is when the workload does a lot of DDL and then aborts them, leading to profiles dominated by DecodeCommit. The agreement here is that while this is a valid issue and we should try fixing it, it's unrelated to this patch. The issue exists even on master. So in the context of this patch we can ignore this issue. The "nextval-40" applies to workloads doing a lot of regular sequence changes. We only decode/apply changes written to WAL, and that happens only for every 32 increments or so. The test was with a very simple transaction (just seque
Re: Add --check option to pgindent
> On 12 Dec 2023, at 01:09, Tristan Partin wrote: > > Not sold on the name, but --check is a combination of --silent-diff and > --show-diff. I envision --check mostly being used in CI environments. I > recently came across a situation where this behavior would have been useful. > Without --check, you're left to capture the output of --show-diff and exit 2 > if the output isn't empty by yourself. I wonder if we should model this around the semantics of git diff to keep it similar to other CI jobs which often use git diff? git diff --check means "are there conflicts or issues" which isn't really comparable to here, git diff --exit-code however is pretty much exactly what this is trying to accomplish. That would make pgindent --show-diff --exit-code exit with 1 if there were diffs and 0 if there are no diffs. -- Daniel Gustafsson
Re: Add --check option to pgindent
Hi, On Tue, Dec 12, 2023 at 11:18:59AM +0100, Daniel Gustafsson wrote: > > On 12 Dec 2023, at 01:09, Tristan Partin wrote: > > > > Not sold on the name, but --check is a combination of --silent-diff > > and --show-diff. I envision --check mostly being used in CI > > environments. I recently came across a situation where this behavior > > would have been useful. Without --check, you're left to capture the > > output of --show-diff and exit 2 if the output isn't empty by > > yourself. > > I wonder if we should model this around the semantics of git diff to > keep it similar to other CI jobs which often use git diff? git diff > --check means "are there conflicts or issues" which isn't really > comparable to here, git diff --exit-code however is pretty much > exactly what this is trying to accomplish. > > That would make pgindent --show-diff --exit-code exit with 1 if there > were diffs and 0 if there are no diffs. To be honest, I find that rather convoluted; contrary to "git diff", I believe the primary action of pgident is not to show diffs, so I find the proposed --check option to be entirely reasonable from a UX perspective. On the other hand, tying a "does this need re-indenting?" question to a "--show-diff --exit-code" option combination is not very obvious (to me, at least). Cheers, Michael
[meson] expose buildtype debug/optimization info to pg_config
The configure-based build system sets VAL_CFLAGS with the debug and optimization flags, so pg_config reports this information. Some extensions depend on that mechanism. This patch exposes the same flags under meson, with a typo fixed along the way. -- Regards Junwang Zhao 0001-meson-expose-buildtype-debug-optimization-info-to-pg.patch Description: Binary data
Re: Adding facility for injection points (or probe points?) for more advanced tests
On Tue, Dec 12, 2023 at 10:27:09AM +0530, Dilip Kumar wrote: > Oops, I only included the code changes where I am adding injection > points and some comments to verify that, but missed the actual test > file. Attaching it here. I see. Interesting that this requires persistent connections to work. That's something I've found clunky to rely on when the scenarios a test needs to deal with are rather complex. That's an area that could be made easier to use outside of this patch.. Something got proposed by Andrew Dunstan to make the libpq routines usable through a perl module, for example. > Note: I think the latest patches are conflicting with the head, can you > rebase? Indeed, as per the recent manipulations in ipci.c for the shmem initialization areas. Here goes a v6. -- Michael From 1e11c0400bf802834792f3e9c897b17a3d14bca1 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 12 Dec 2023 11:35:24 +0100 Subject: [PATCH v6 1/4] Add backend facility for injection points This adds a set of routines allowing developers to attach, detach and run custom code based on arbitrary code paths set with a centralized macro called INJECTION_POINT(). Injection points are registered in a shared hash table. Processes also use a local cache to over loading callbacks more than necessary, cleaning up their cache if a callback has found to be removed. --- src/include/pg_config.h.in| 3 + src/include/utils/injection_point.h | 36 ++ src/backend/storage/ipc/ipci.c| 3 + src/backend/storage/lmgr/lwlocknames.txt | 1 + .../utils/activity/wait_event_names.txt | 1 + src/backend/utils/misc/Makefile | 1 + src/backend/utils/misc/injection_point.c | 317 ++ src/backend/utils/misc/meson.build| 1 + doc/src/sgml/installation.sgml| 30 ++ doc/src/sgml/xfunc.sgml | 56 configure | 34 ++ configure.ac | 7 + meson.build | 1 + meson_options.txt | 3 + src/Makefile.global.in| 1 + src/tools/pgindent/typedefs.list | 2 + 16 files changed, 497 insertions(+) create mode 100644 src/include/utils/injection_point.h create mode 100644 src/backend/utils/misc/injection_point.c diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 5f16918243..288bb9cb42 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -698,6 +698,9 @@ /* Define to build with ICU support. (--with-icu) */ #undef USE_ICU +/* Define to 1 to build with injection points. (--enable-injection-points) */ +#undef USE_INJECTION_POINTS + /* Define to 1 to build with LDAP support. (--with-ldap) */ #undef USE_LDAP diff --git a/src/include/utils/injection_point.h b/src/include/utils/injection_point.h new file mode 100644 index 00..6335260fea --- /dev/null +++ b/src/include/utils/injection_point.h @@ -0,0 +1,36 @@ +/*- + * injection_point.h + * Definitions related to injection points. + * + * Copyright (c) 2001-2023, PostgreSQL Global Development Group + * + * src/include/utils/injection_point.h + * -- + */ +#ifndef INJECTION_POINT_H +#define INJECTION_POINT_H + +/* + * Injections points require --enable-injection-points. + */ +#ifdef USE_INJECTION_POINTS +#define INJECTION_POINT(name) InjectionPointRun(name) +#else +#define INJECTION_POINT(name) ((void) name) +#endif + +/* + * Typedef for callback function launched by an injection point. 
+ */ +typedef void (*InjectionPointCallback) (const char *name); + +extern Size InjectionPointShmemSize(void); +extern void InjectionPointShmemInit(void); + +extern void InjectionPointAttach(const char *name, +const char *library, +const char *function); +extern void InjectionPointRun(const char *name); +extern void InjectionPointDetach(const char *name); + +#endif /* INJECTION_POINT_H */ diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 0e0ac22bdd..81799c5688 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -49,6 +49,7 @@ #include "storage/sinvaladt.h" #include "storage/spin.h" #include "utils/guc.h" +#include "utils/injection_point.h" #include "utils/snapmgr.h" #include "utils/wait_event.h" @@ -147,6 +148,7 @@ CalculateShmemSize(int *num_semaphores) size = add_size(size, AsyncShmemSize()); size = add_size(size, StatsShmemSize()); size = add_size(size, WaitEventExtensionShmemSize()); + size = add_size(size, InjectionPointShmemSize()); #ifdef EXEC_BACKEND size = add_size(size,
Re: Sorting regression of text function result since commit 586b98fdf1aae
On Mon, 11 Dec 2023 15:43:12 -0500 Tom Lane wrote: > Jehan-Guillaume de Rorthais writes: > > It looks like since 586b98fdf1aae, the result type collation of > > "convert_from" is forced to "C", like the patch does for type "name", > > instead of the "default" collation for type "text". > > Well, convert_from() inherits its result collation from the input, > per the normal rules for collation assignment [1]. > > > Looking at hints in the header comment of function "exprCollation", I poked > > around and found that the result collation wrongly follow the input > > collation in this case. > > It's not "wrong", it's what the SQL standard requires. Mh, OK. This is at least a surprising behavior. Having a non-data-related argument impact the result collation seems counter-intuitive. But I understand this is per the standard, no need to discuss it. > > I couldn't find anything explaining this behavior in the changelog. It looks > > like a regression to me, but if this is actually expected, maybe this > > deserve some documentation patch? > > The v12 release notes do say > > Type name now behaves much like a domain over type text that has > default collation “C”. Sure, and I saw it, but reading this entry, I couldn't guess it could have such an implication on the text result of a function call. That's why I hunted for the precise commit and was surprised to find this was the actual change. > You'd have similar results from an expression involving such a domain, > I believe. > > I'm less than excited about patching the v12 release notes four > years later. Maybe, if this point had come up in a more timely > fashion, we'd have mentioned it --- but it's hardly possible to > cover every potential implication of such a change in the > release notes. This could have been documented in the collation concept page, as a trap to be aware of. A link from the release notes to such a small paragraph would have been enough to warn developers that this might have implications when mixed with other collatable types. But I understand we cannot document every trap paving the way to the standard anyway. Thank you for your explanation! Regards,
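For readers hitting the same surprise, a hypothetical illustration (the table and column names are invented): the encoding-name argument of convert_from() is of type name, whose "C" collation is inherited by the text result since v12, so an ORDER BY on that result sorts in "C" order unless the collation is overridden explicitly:

SELECT convert_from(raw_bytes, 'UTF8') AS txt
FROM messages
ORDER BY convert_from(raw_bytes, 'UTF8');                      -- sorts per collation "C"

SELECT convert_from(raw_bytes, 'UTF8') AS txt
FROM messages
ORDER BY convert_from(raw_bytes, 'UTF8') COLLATE "default";    -- database default ordering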
Re: brininsert optimization opportunity
On 12/11/23 16:41, Tomas Vondra wrote: > On 12/11/23 16:00, Alexander Lakhin wrote: >> Hello Tomas and Soumyadeep, >> >> 25.11.2023 23:06, Tomas Vondra wrote: >>> I've done a bit more cleanup on the last version of the patch (renamed >>> the fields to start with bis_ as agreed, rephrased the comments / docs / >>> commit message a bit) and pushed. >> >> Please look at a query, which produces warnings similar to the ones >> observed upthread: >> CREATE TABLE t(a INT); >> INSERT INTO t SELECT x FROM generate_series(1,1) x; >> CREATE INDEX idx ON t USING brin (a); >> REINDEX index CONCURRENTLY idx; >> >> WARNING: resource was not closed: [1863] (rel=base/16384/16389, >> blockNum=1, flags=0x9380, refcount=1 1) >> WARNING: resource was not closed: [1862] (rel=base/16384/16389, >> blockNum=0, flags=0x9380, refcount=1 1) >> >> The first bad commit for this anomaly is c1ec02be1. >> > > Thanks for the report. I haven't investigated what the issue is, but it > seems we fail to release the buffers in some situations - I'd bet we > fail to call the cleanup callback in some place, or something like that. > I'll take a look tomorrow. > Yeah, just as I expected this seems to be a case of failing to call the index_insert_cleanup after doing some inserts - in this case the inserts happen in table_index_validate_scan, but validate_index has no idea it needs to do the cleanup. The attached 0001 patch fixes this by adding the call to validate_index, which seems like the proper place as it's where the indexInfo is allocated and destroyed. But this makes me think - are there any other places that might call index_insert without then also doing the cleanup? I did grep for the index_insert() calls, and it seems OK except for this one. There's a call in toast_internals.c, but that seems OK because that only deals with btree indexes (and those don't need any cleanup). The same logic applies to unique_key_recheck(). The rest goes through execIndexing.c, which should do the cleanup in ExecCloseIndices(). Note: We should probably call the cleanup even in the btree cases, even if only to make it clear it needs to be called after index_insert(). I was thinking maybe we should have some explicit call to destroy the IndexInfo, but that seems rather inconvenient - it'd force everyone to carefully track lifetimes of the IndexInfo instead of just relying on memory context reset/destruction. That seems quite error-prone. I propose we do a much simpler thing instead - allow the cache may be initialized / cleaned up repeatedly, and make sure it gets reset at convenient place (typically after index_insert calls that don't go through execIndexing). That'd mean the cleanup does not need to happen very far from the index_insert(), which makes the reasoning much easier. 0002 does this. But maybe there's a better way to ensure the cleanup gets called even when not using execIndexing. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 0cf18dd53842e3809b76525867214ce8ff823f32 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Tue, 12 Dec 2023 12:01:07 +0100 Subject: [PATCH 1/2] call cleanup in validate_index --- src/backend/catalog/index.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 7b186c0220b..7a0e337a418 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -3414,6 +3414,8 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot) /* Done with tuplesort object */ tuplesort_end(state.tuplesort); + index_insert_cleanup(indexRelation, indexInfo); + elog(DEBUG2, "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples", state.htups, state.itups, state.tups_inserted); -- 2.42.0 From 0c04b53573cf164c5a0108d909f05ffcbae4e72d Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Tue, 12 Dec 2023 12:24:05 +0100 Subject: [PATCH 2/2] call index_insert_cleanup repeatedly --- src/backend/access/heap/heapam_handler.c | 3 +++ src/backend/access/index/indexam.c | 11 ++- src/backend/catalog/index.c | 2 -- 3 files changed, 13 insertions(+), 3 deletions(-) diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 7c28dafb728..5db0cf1dffa 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -1976,6 +1976,9 @@ heapam_index_validate_scan(Relation heapRelation, /* These may have been pointing to the now-gone estate */ indexInfo->ii_ExpressionsState = NIL; indexInfo->ii_PredicateState = NULL; + + /* also make sure the IndexInfo cache gets freed */ + index_insert_cleanup(indexRelation, indexInfo); } /* diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index f23e0199f08..515d7649c4d 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -2
Re: Show WAL write and fsync stats in pg_stat_io
Hi, Thanks for the feedback! The new version of the patch is attached. On Tue, 5 Dec 2023 at 09:16, Michael Paquier wrote: > > - if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND) > + if (io_op == IOOP_EXTEND || io_op == IOOP_WRITE) > > Unrelated diff. Done. > > + if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL && > + io_op == IOOP_FSYNC) > + PendingWalStats.wal_sync += cnt; > > Nah, I really don't think that adding this dependency within > pg_stat_io is a good idea. > > - PendingWalStats.wal_sync++; > + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC, > + io_start, 1); > > This is the only caller where this matters, and the count is always 1. I reverted that, pgstat_count_io_op_n doesn't count PendingWalStats.wal_sync now. > > + no_wal_normal_read = bktype == B_AUTOVAC_LAUNCHER || > + bktype == B_AUTOVAC_WORKER || bktype == B_BACKEND || > + bktype == B_BG_WORKER || bktype == B_BG_WRITER || > + bktype == B_CHECKPOINTER || bktype == B_WAL_RECEIVER || > + bktype == B_WAL_SENDER || bktype == B_WAL_WRITER; > + > + if (no_wal_normal_read && > + (io_object == IOOBJECT_WAL && > +io_op == IOOP_READ)) > + return false; > > This may be more readable if an enum is applied, without a default > clause so as it would not be forgotten if a new type is added, perhaps > in its own little routine. Done. > > - if (track_io_timing) > + if (track_io_timing || track_wal_io_timing) > INSTR_TIME_SET_CURRENT(io_start); > else > > This interface from pgstat_prepare_io_time() is not really good, > because we could finish by setting io_start in the existing code paths > calling this routine even if track_io_timing is false when > track_wal_io_timing is true. Why not changing this interface a bit > and pass down a GUC (track_io_timing or track_wal_io_timing) as an > argument of the function depending on what we expect to trigger the > timings? Done in 0001. > > - /* Convert counters from microsec to millisec for display */ > - values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / > 1000.0); > - values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / > 1000.0); > + /* > +* There is no need to calculate timings for both pg_stat_wal and > +* pg_stat_io. So, fetch timings from pg_stat_io to make stats > gathering > +* cheaper. Note that, since timings are fetched from pg_stat_io; > +* pg_stat_reset_shared('io') will reset pg_stat_wal's timings too. > +* > +* Convert counters from microsec to millisec for display > +*/ > + values[6] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL, > + > IOCONTEXT_NORMAL, > + > IOOP_WRITE)); > + values[7] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL, > + > IOCONTEXT_NORMAL, > + > IOOP_FSYNC)); > > Perhaps it is simpler to remove these columns from pg_stat_get_wal() > and plug an SQL upgrade to the view definition of pg_stat_wal? Done in 0003 but I am not sure if that is what you expected. > Finding a good balance between the subroutines, the two GUCs, the > contexts, the I/O operation type and the objects is the tricky part of > this patch. If the dependency to PendingWalStats is removed and if > the interface of pgstat_prepare_io_time is improved, things are a bit > cleaner, but it feels like we could do more.. Nya. I agree. The patch is not logically complicated but it is hard to select the best way. Any kind of feedback would be appreciated. 
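As a quick way to eyeball the result (a hypothetical query, assuming the patched view reports WAL under object 'wal'):

SELECT backend_type, object, context, reads, writes, fsyncs
FROM pg_stat_io
WHERE object = 'wal';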
-- Regards, Nazir Bilal Yavuz Microsoft From b7bf7b92fa274775136314ecfde90fa32ed435cb Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Wed, 29 Nov 2023 15:30:03 +0300 Subject: [PATCH v6 6/6] Add IOOBJECT_WAL / IOCONTEXT_NORMAL / read tests --- src/test/regress/expected/stats.out | 12 src/test/regress/sql/stats.sql | 8 2 files changed, 20 insertions(+) diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out index 4adda9e479..7f5340cd7e 100644 --- a/src/test/regress/expected/stats.out +++ b/src/test/regress/expected/stats.out @@ -881,6 +881,18 @@ SELECT current_setting('fsync') = 'off' t (1 row) +-- Test pg_stat_io IOOBJECT_WAL / IOCONTEXT_NORMAL / read. +-- When the servers starts, StartupXLOG function must be called by postmaster +-- or standalone-backend startup and WAL read must be done. +-- So, check these before stats get resetted. +SELECT SUM(reads) >
Re: Add --check option to pgindent
On Tue, Dec 12, 2023, at 7:28 AM, Michael Banck wrote: > On Tue, Dec 12, 2023 at 11:18:59AM +0100, Daniel Gustafsson wrote: > > > On 12 Dec 2023, at 01:09, Tristan Partin wrote: > > > > > > Not sold on the name, but --check is a combination of --silent-diff > > > and --show-diff. I envision --check mostly being used in CI > > > environments. I recently came across a situation where this behavior > > > would have been useful. Without --check, you're left to capture the > > > output of --show-diff and exit 2 if the output isn't empty by > > > yourself. > > > > I wonder if we should model this around the semantics of git diff to > > keep it similar to other CI jobs which often use git diff? git diff > > --check means "are there conflicts or issues" which isn't really > > comparable to here, git diff --exit-code however is pretty much > > exactly what this is trying to accomplish. > > > > That would make pgindent --show-diff --exit-code exit with 1 if there > > were diffs and 0 if there are no diffs. > > To be honest, I find that rather convoluted; contrary to "git diff", I > believe the primary action of pgident is not to show diffs, so I find > the proposed --check option to be entirely reasonable from a UX > perspective. > > On the other hand, tying a "does this need re-indenting?" question to a > "--show-diff --exit-code" option combination is not very obvious (to me, > at least). Multiple options to accomplish a use case might not be obvious. I'm wondering if we can combine it into a unique option. --show-diff show the changes that would be made --silent-diff exit with status 2 if any changes would be made + --check combination of --show-diff and --silent-diff I mean --diff=show,silent,check When you add exceptions, it starts to complicate the UI. usage("Cannot have both --silent-diff and --show-diff") if $silent_diff && $show_diff; +usage("Cannot have both --check and --show-diff") + if $check && $show_diff; + +usage("Cannot have both --check and --silent-diff") + if $check && $silent_diff; + -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Synchronizing slots from primary to standby
A review comment on the v45 patch:

If one creates a logical slot with failover=true as:

select pg_create_logical_replication_slot('logical_slot','pgoutput', false, true, true);

and then uses the existing logical slot while creating a subscription:

postgres=# create subscription sub4 connection 'dbname=postgres host=localhost port=5433' publication pub1t4 WITH (slot_name=logical_slot, create_slot=false, failover=true);
NOTICE: changed the failover state of replication slot "logical_slot" on publisher to false
CREATE SUBSCRIPTION

then despite configuring logical_slot's failover to true and specifying failover=true during subscription creation, the NOTICE indicates a change in the failover state to 'false', without providing any explanation for this transition. It can be confusing for users, so IMO the notice should include the reason for switching failover to 'false', or should give a hint to use either refresh=false or copy_data=false to enable failover=true for the slot, as we do in other similar 'alter subscription ...' scenarios. -- Thanks & Regards, Nisha
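A sketch of the alternative such a hint could point at (hypothetical, assuming that skipping the initial data copy indeed lets the pre-created slot keep failover=true, as the suggested hint implies):

create subscription sub4 connection 'dbname=postgres host=localhost port=5433' publication pub1t4
    WITH (slot_name=logical_slot, create_slot=false, failover=true, copy_data=false);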
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Mon, Dec 11, 2023 at 10:05 PM Alena Rybakina wrote: > > Hi! Thank you for your work. Your patch looks better! > Yes, thank you! It works fine, and I see that the regression tests have been > passed. 🙂 > However, when I ran 'copy from with save_error' operation with simple csv > files (copy_test.csv, copy_test1.csv) for tables test, test1 (how I created > it, I described below): > > postgres=# create table test (x int primary key, y int not null); > postgres=# create table test1 (x int, z int, CONSTRAINT fk_x > FOREIGN KEY(x) > REFERENCES test(x)); > > I did not find a table with saved errors after operation, although I received > a log about it: > > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV > save_error > NOTICE: 2 rows were skipped because of error. skipped row saved to table > public.test_error > ERROR: duplicate key value violates unique constraint "test_pkey" > DETAIL: Key (x)=(2) already exists. > CONTEXT: COPY test, line 3 > > postgres=# select * from public.test_error; > ERROR: relation "public.test_error" does not exist > LINE 1: select * from public.test_error; > > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV > save_error > NOTICE: 2 rows were skipped because of error. skipped row saved to table > public.test1_error > ERROR: insert or update on table "test1" violates foreign key constraint > "fk_x" > DETAIL: Key (x)=(2) is not present in table "test". > > postgres=# select * from public.test1_error; > ERROR: relation "public.test1_error" does not exist > LINE 1: select * from public.test1_error; > > Two lines were written correctly in the csv files, therefore they should have > been added to the tables, but they were not added to the tables test and > test1. > > If I leave only the correct rows, everything works fine and the rows are > added to the tables. > > in copy_test.csv: > > 2,0 > > 1,1 > > in copy_test1.csv: > > 2,0 > > 2,1 > > 1,1 > > postgres=# \copy test from '/home/alena/copy_test.csv' DELIMITER ',' CSV > COPY 2 > postgres=# \copy test1 from '/home/alena/copy_test1.csv' DELIMITER ',' CSV > save_error > NOTICE: No error happened.Error holding table public.test1_error will be > droped > COPY 3 > > Maybe I'm launching it the wrong way. If so, let me know about it. looks like the above is about constraints violation while copying. constraints violation while copying not in the scope of this patch. Since COPY FROM is very like the INSERT command, you do want all the valid constraints to check all the copied rows? but the notice raised by the patch is not right. So I place the drop error saving table or raise notice logic above `ExecResetTupleTable(estate->es_tupleTable, false)` in the function CopyFrom. > > I also notice interesting behavior if the table was previously created by the > user. When I was creating an error_table before the 'copy from' operation, > I received a message saying that it is impossible to create a table with the > same name (it is shown below) during the 'copy from' operation. > I think you should add information about this in the documentation, since > this seems to be normal behavior to me. > doc changed. you may check it. From 3024bf3b727b728c58dfef41c62d7a93c083b887 Mon Sep 17 00:00:00 2001 From: pgaddict Date: Tue, 12 Dec 2023 20:58:45 +0800 Subject: [PATCH v11 1/1] Make COPY FROM more error tolerant Currently COPY FROM has 3 types of error while processing the source file. * extra data after last expected column * missing data for column \"%s\" * data type conversion error. 
Instead of throwing errors while copying, save_error specifier will save errors to a error saving table automatically. We check the error saving table definition by column name and column data type. if table already exists and meets the criteria then errors will save to that table. if the table does not exist, then create one. Only works for COPY FROM, non-BINARY mode. While copying, if error never happened, error saving table will be dropped at the ending of COPY FROM. If the error saving table exists, meaning at least once COPY FROM errors has happened, then all the future errors will be saved to that table. We save the error related meta info to error saving table using SPI, that is construct a query string, then execute the query. --- contrib/file_fdw/file_fdw.c | 4 +- doc/src/sgml/ref/copy.sgml | 100 +- src/backend/commands/copy.c | 12 ++ src/backend/commands/copyfrom.c | 146 +++- src/backend/commands/copyfromparse.c | 169 +-- src/backend/parser/gram.y| 8 +- src/bin/psql/tab-complete.c | 3 +- src/include/commands/copy.h | 3 +- src/include/commands/copyfrom_internal.h | 7 + src/include/parser/kwlist.h | 1 + src/test/regress/expected/copy2.out | 135 ++ src/test/regress/sq
Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans)
On Mon, Dec 11, 2023 at 6:16 PM Peter Geoghegan wrote: > Will you be in Prague this week? If not this might have to wait. Sorry, I won't be in Prague this week. Due to my current immigration status, I can't travel. I wish you a lovely time in Prague. I'm OK with waiting; review it once you can. I will probably provide a more polished version in the meantime. -- Regards, Alexander Korotkov
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
[Added Andrey again in CC, because as I understand they are using this code or something like it in production. Please don't randomly remove people from CC lists.] I've been looking at this some more, and I'm not confident that the group clog update stuff is correct. I think the injection points test case was good enough to discover a problem, but it's hard to get peace of mind that there aren't other, more subtle problems. The problem I see is that the group update mechanism is designed around contention of the global xact-SLRU control lock; it uses atomics to coordinate a single queue when the lock is contended. So if we split up the global SLRU control lock using banks, then multiple processes using different bank locks might not contend. OK, this is fine, but what happens if two separate groups of processes encounter contention on two different bank locks? I think they will both try to update the same queue, and coordinate access to that *using different bank locks*. I don't see how this can work correctly. I suspect the first part of that algorithm, where atomics are used to create the list without a lock, might work fine. But will each "leader" process, each of which is potentially using a different bank lock, coordinate correctly? Maybe this works correctly because only one process will find the queue head not empty? If this is what happens, then there need to be comments about it. Without any explanation, this seems broken and potentially dangerous, as some transaction commit bits might become lost given high enough concurrency and bad luck. Maybe this can be made to work by having one more lwlock that we use solely to coordinate this task. Though we would have to demonstrate that coordinating this task with a different lock works correctly in conjunction with the per-bank lwlock usage in the regular slru.c paths. Andrey, do you have any stress tests or anything else that you used to gain confidence in this code? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El sabio habla porque tiene algo que decir; el tonto, porque tiene que decir algo" (Platon).
Re: [PATCH]: Not to invaldiate CatalogSnapshot for local invalidation messages
Hi hackers, I would like to give more details about my patch.

In Postgres, a global snapshot “CatalogSnapshot” is used to check catalog data visibility. “CatalogSnapshot” is always updated to the latest version so that the latest catalog table content is visible. If there is any update on catalog tables, then to make the changes visible to the following commands in the current transaction, the call chain “CommandCounterIncrement” -> “AtCCI_LocalCache” -> “CommandEndInvalidationMessages” -> “LocalExecuteInvalidationMessage” -> “InvalidateCatalogSnapshot” invalidates the “CatalogSnapshot” by setting it to NULL. The next time the “CatalogSnapshot” is needed and found to be NULL, it is regenerated. In a query, “CommandCounterIncrement” may be called many times, so the “CatalogSnapshot” may be destroyed and recreated many times.

To reduce such overhead, instead of invalidating the “CatalogSnapshot” we can keep it and just increase its “curcid”. When the transaction is committed or aborted, or when there are catalog invalidation messages from other backends, the “CatalogSnapshot” is still invalidated and regenerated. Sometimes the “CatalogSnapshot” is not the same snapshot as the transaction's “CurrentSnapshot”, but we can still update the CatalogSnapshot's “curcid”, as the “curcid” is only checked when the tuple was inserted or deleted by the current transaction.

Xiaoran Wang wrote on Thu, Dec 7, 2023 at 10:13: > Hi hackers, > > > For local invalidation messages, there is no need to call > `InvalidateCatalogSnapshot` to set the CatalogSnapshot to NULL and > rebuild it later. Instead, just update the CatalogSnapshot's `curcid` > in `SnapshotSetCommandId`, this way can make the CatalogSnapshot work > well too. > > This optimization can reduce the overhead of rebuilding CatalogSnapshot > after each command. > > > Best regards, xiaoran >
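A hypothetical micro-workload where the difference should be visible: every iteration updates the catalogs, so each “CommandCounterIncrement” currently invalidates the “CatalogSnapshot”, which then has to be rebuilt for the next catalog lookup, whereas with the patch only its “curcid” would be bumped.

DO $$
BEGIN
    FOR i IN 1..10000 LOOP
        CREATE TEMP TABLE catsnap_demo (a int);   -- catalog inserts plus local inval messages
        DROP TABLE catsnap_demo;                  -- more catalog updates
    END LOOP;
END
$$;

(catsnap_demo and the loop count are invented for illustration only.)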
Re: How abnormal server shutdown could be detected by tests?
Hello Shveta, 12.12.2023 11:44, shveta malik wrote: The postmaster process exits with exit code 1, but pg_ctl can't get the code and just reports that stop was completed successfully. For what it's worth, there is another thread which stated the similar problem: https://www.postgresql.org/message-id/flat/2366244.1651681550%40sss.pgh.pa.us Thank you for the reference! So I refreshed a first part of the question Tom Lane raised before... I've made a quick experiment with leaving postmaster.pid intact in case of abnormal shutdown: @@ -1113,6 +1113,7 @@ UnlinkLockFiles(int status, Datum arg) { char *curfile = (char *) lfirst(l); +if (strcmp(curfile, DIRECTORY_LOCK_FILE) != 0 || status == 0) unlink(curfile); /* Should we complain if the unlink fails? */ } and `make check-world` passed for me with no failure. (In the meantime, the assertion failure forced as above is detected.) Though there is a minor issue with a couple of tests. Namely, 003_recovery_targets.pl does the following: # wait for the error message in the standby log foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) { $logfile = slurp_file($node_primary->logfile()); $res = ($logfile =~ qr/FATAL: .* recovery ended before configured recovery target was reached/); if ($res) { last; } usleep(100_000); } ok($res, 'recovery end before target reached is a fatal error'); With postmaster.pid left after unclean shutdown, the test waits for 300 seconds by default and then completes successfully. If rewrite that loop as follows: # wait for the error message in the standby log foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) { $logfile = slurp_file($node_primary->logfile()); $res = ($logfile =~ qr/FATAL: .* recovery ended before configured recovery target was reached/); if ($res) { last; } usleep(100_000); } ok($res, 'recovery end before target reached is a fatal error'); the test completes as quickly as before. (standby.log is only 2kb, so rereading it isn't a big deal, IMO) So maybe it's the way to go? Another way I can think of is sending some signal to pg_ctl in case postmaster terminates with status 0. Though I think it would complicate things a little as it allows for three different states: postmaster.pid preserved (in case postmaster killed with -9), postmaster.pid removed and the signal received/not received. Best regards, Alexander
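PS: For illustration, with postmaster.pid preserved on abnormal exit, a TAP test could detect the situation with something as simple as the following (sketch only; it relies on the experimental UnlinkLockFiles() change above, and $node is a PostgreSQL::Test::Cluster instance whose server is expected to have crashed):

ok( -f $node->data_dir . '/postmaster.pid',
	'postmaster.pid left behind, so the shutdown was abnormal');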
Re: Add --check option to pgindent
"Euler Taveira" writes: > When you add exceptions, it starts to complicate the UI. Indeed. It seems like --silent-diff was poorly defined and poorly named, and we need to rethink that option along the way to adding this behavior. The idea that --show-diff and --silent-diff can be used together is just inherently confusing, because they sound like opposites. regards, tom lane
Re: Add --check option to pgindent
On Tue Dec 12, 2023 at 5:47 AM CST, Euler Taveira wrote: On Tue, Dec 12, 2023, at 7:28 AM, Michael Banck wrote: > On Tue, Dec 12, 2023 at 11:18:59AM +0100, Daniel Gustafsson wrote: > > > On 12 Dec 2023, at 01:09, Tristan Partin wrote: > > > > > > Not sold on the name, but --check is a combination of --silent-diff > > > and --show-diff. I envision --check mostly being used in CI > > > environments. I recently came across a situation where this behavior > > > would have been useful. Without --check, you're left to capture the > > > output of --show-diff and exit 2 if the output isn't empty by > > > yourself. > > > > I wonder if we should model this around the semantics of git diff to > > keep it similar to other CI jobs which often use git diff? git diff > > --check means "are there conflicts or issues" which isn't really > > comparable to here, git diff --exit-code however is pretty much > > exactly what this is trying to accomplish. > > > > That would make pgindent --show-diff --exit-code exit with 1 if there > > were diffs and 0 if there are no diffs. > > To be honest, I find that rather convoluted; contrary to "git diff", I > believe the primary action of pgident is not to show diffs, so I find > the proposed --check option to be entirely reasonable from a UX > perspective. > > On the other hand, tying a "does this need re-indenting?" question to a > "--show-diff --exit-code" option combination is not very obvious (to me, > at least). Multiple options to accomplish a use case might not be obvious. I'm wondering if we can combine it into a unique option. --show-diff show the changes that would be made --silent-diff exit with status 2 if any changes would be made + --check combination of --show-diff and --silent-diff I mean --diff=show,silent,check When you add exceptions, it starts to complicate the UI. usage("Cannot have both --silent-diff and --show-diff") if $silent_diff && $show_diff; +usage("Cannot have both --check and --show-diff") + if $check && $show_diff; + +usage("Cannot have both --check and --silent-diff") + if $check && $silent_diff; + I definitely agree. I just didn't want to remove options, but if people are ok with that, I will just change the interface. I like the git-diff semantics or have --diff and --check, similar to how Python's black[0] is. [0]: https://github.com/psf/black -- Tristan Partin Neon (https://neon.tech)
Re: Add --check option to pgindent
On 2023-Dec-12, Tom Lane wrote: > "Euler Taveira" writes: > > When you add exceptions, it starts to complicate the UI. > > Indeed. It seems like --silent-diff was poorly defined and poorly > named, and we need to rethink that option along the way to adding > this behavior. The idea that --show-diff and --silent-diff can > be used together is just inherently confusing, because they sound > like opposites. Maybe it's enough to rename --silent-diff to --check. You can do "--show-diff --check" and get both the error and the diff printed; or just "--check" and it'll throw an error without further ado; or "--show-diff" and it will both apply the diff and print it. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "I must say, I am absolutely impressed with what pgsql's implementation of VALUES allows me to do. It's kind of ridiculous how much "work" goes away in my code. Too bad I can't do this at work (Oracle 8/9)." (Tom Allison) http://archives.postgresql.org/pgsql-general/2007-06/msg00016.php
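PS: A rough sketch of that, on top of pgindent's per-file loop (assuming the script's existing $source, $orig_source, $source_filename, $show_diff variables and the show_diff()/write_source() helpers; $check is the renamed --silent-diff):

my $needs_indent = 0;

# inside the existing loop over files
if ($source ne $orig_source)
{
	$needs_indent = 1;
	print show_diff($source, $source_filename) if $show_diff;
	write_source($source, $source_filename) unless $check;
}

# after all files have been processed
exit 2 if $check && $needs_indent;

That keeps --check usable on its own (quiet, exit status 2 if anything needs re-indenting) or together with --show-diff (print every diff, then exit 2), which seems to match the CI use case upthread.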
Re: typedefs.list glitches
On Thu May 12, 2022 at 4:21 PM CDT, Tom Lane wrote: I wrote: > every buildfarm member that's contributing to the typedefs list > builds with OpenSSL. That wouldn't surprise me, except that > my own animal sifaka should be filling that gap. Looking at > its latest attempt[1], it seems to be generating an empty list, > which I guess means that our recipe for extracting typedefs > doesn't work on macOS/arm64. I shall investigate. Found it. Current macOS produces $ objdump -W /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/objdump: error: unknown argument '-W' where last year's vintage produced $ objdump -W objdump: Unknown command line argument '-W'. Try: '/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/objdump --help' objdump: Did you mean '-C'? This confuses run_build.pl into taking the "Linux and sometimes windows" code path instead of the $using_osx one. I think simplest fix is to move the $using_osx branch ahead of the heuristic ones, as attached. Hey Tom, Was this patch ever committed? -- Tristan Partin Neon (https://neon.tech)
Re: typedefs.list glitches
"Tristan Partin" writes: > Was this patch ever committed? Yes, though not till commit dcca861554e90d6395c3c153317b0b0e3841f103 Author: Andrew Dunstan Date: Sun Jan 15 07:32:50 2023 -0500 Improve typedef logic for MacOS sifaka is currently generating typedefs, and I'm pretty certain it's using unpatched REL_17 BF code. regards, tom lane
Move walreceiver state assignment (to WALRCV_STREAMING) in WalReceiverMain()
Hi hackers, Currently walrcv->walRcvState is set to WALRCV_STREAMING at the beginning of WalReceiverMain(). But it seems that after this assignment things could go wrong before the walreceiver actually starts streaming (like not being able to connect to the primary). It looks to me that WALRCV_STREAMING should be set once walrcv_startstreaming() returns true: this is the proposal of this patch. I don't think the current assignment location is causing any issues, but I think it's more appropriate to move it like in the attached. Looking forward to your feedback, Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 000bb21e54085abcf9df8cc75e4a86c21e24700e Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Mon, 11 Dec 2023 20:05:25 + Subject: [PATCH v1] move walrcv->walRcvState assignment to WALRCV_STREAMING walrcv->walRcvState = WALRCV_STREAMING was set too early. Move the assignment to a more appropriate place. --- src/backend/replication/walreceiver.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) 100.0% src/backend/replication/ diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 26ded928a7..47f1d50144 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -242,7 +242,6 @@ WalReceiverMain(void) } /* Advertise our PID so that the startup process can kill us */ walrcv->pid = MyProcPid; - walrcv->walRcvState = WALRCV_STREAMING; /* Fetch information required to start streaming */ walrcv->ready_to_display = false; @@ -413,9 +412,14 @@ WalReceiverMain(void) if (walrcv_startstreaming(wrconn, &options)) { if (first_stream) + { + SpinLockAcquire(&walrcv->mutex); + walrcv->walRcvState = WALRCV_STREAMING; + SpinLockRelease(&walrcv->mutex); ereport(LOG, (errmsg("started streaming WAL from primary at %X/%X on timeline %u", LSN_FORMAT_ARGS(startpoint), startpointTLI))); + } else ereport(LOG, (errmsg("restarted WAL streaming at %X/%X on timeline %u", -- 2.34.1
Re: Move walreceiver state assignment (to WALRCV_STREAMING) in WalReceiverMain()
On Tue, Dec 12, 2023, at 12:58 PM, Drouvot, Bertrand wrote: > Currently walrcv->walRcvState is set to WALRCV_STREAMING at the > beginning of WalReceiverMain(). > > But it seems that after this assignment things could go wrong before the > walreceiver actually starts streaming (like not being able to connect > to the primary). > > It looks to me that WALRCV_STREAMING should be set once > walrcv_startstreaming() > returns true: this is the proposal of this patch. Per the state name (streaming), it seems it should be set later as you proposed; however, I'm concerned about the previous state (WALRCV_STARTING). If I'm reading the code correctly, WALRCV_STARTING is assigned at RequestXLogStreaming(): SetInstallXLogFileSegmentActive(); RequestXLogStreaming(tli, ptr, PrimaryConnInfo, PrimarySlotName, wal_receiver_create_temp_slot); flushedUpto = 0; } /* * Check if WAL receiver is active or wait to start up. */ if (!WalRcvStreaming()) { lastSourceFailed = true; break; } After a few lines the function WalRcvStreaming() has: SpinLockRelease(&walrcv->mutex); /* * If it has taken too long for walreceiver to start up, give up. Setting * the state to STOPPED ensures that if walreceiver later does start up * after all, it will see that it's not supposed to be running and die * without doing anything. */ if (state == WALRCV_STARTING) { pg_time_t now = (pg_time_t) time(NULL); if ((now - startTime) > WALRCV_STARTUP_TIMEOUT) { bool stopped = false; SpinLockAcquire(&walrcv->mutex); if (walrcv->walRcvState == WALRCV_STARTING) { state = walrcv->walRcvState = WALRCV_STOPPED; stopped = true; } SpinLockRelease(&walrcv->mutex); if (stopped) ConditionVariableBroadcast(&walrcv->walRcvStoppedCV); } } Couldn't it give up before starting if you apply your patch? My main concern is that on a slow system, walrcv_connect() could take too long in WalReceiverMain() and the code above would then kill the walreceiver while it is still in the process of starting. Since you cannot control the hardcoded WALRCV_STARTUP_TIMEOUT value, you might have issues during overload periods. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Report planning memory in EXPLAIN ANALYZE
I would replace the text in explain.sgml with this: + Include information on memory consumption by the query planning phase. + This includes the precise amount of storage used by planner in-memory + structures, as well as overall total consumption of planner memory, + including allocation overhead. + This parameter defaults to FALSE. and remove the new example you're adding; it's not really necessary. In struct ExplainState, I'd put the new member just below summary. If we're creating a new memory context for planner, we don't need the "mem_counts_start" thing, and we don't need the MemoryContextSizeDifference thing. Just add the MemoryContextMemConsumed() function (which ISTM should be just below MemoryContextMemAllocated() instead of in the middle of the MemoryContextStats implementation), and just report the values we get there. I think the SizeDifference stuff increases surface damage for no good reason. ExplainOnePlan() is given a MemoryUsage object (or, if we do as above and no longer have a MemoryUsage struct at all in the first place, a MemoryContextCounters object) even when the MEMORY option is false. This means we waste computing memory usage when not needed. Let's avoid that useless overhead. I'd also do away with the comment you added in explain_ExecutorEnd() and do just this, below setting of es->summary: + /* No support for MEMORY option */ + /* es->memory = false; */ We don't need to elaborate more at present. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Doing what he did amounts to sticking his fingers under the hood of the implementation; if he gets his fingers burnt, it's his problem." (Tom Lane)
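PS: To make sure we're talking about the same shape of code, roughly this (a sketch only, reusing the parameter names of ExplainOneQuery() and the MemoryContextMemConsumed() function proposed above; not the actual patch):

	MemoryContext planner_ctx;
	MemoryContext oldctx;
	PlannedStmt *plan;

	planner_ctx = AllocSetContextCreate(CurrentMemoryContext,
										"planner",
										ALLOCSET_DEFAULT_SIZES);
	oldctx = MemoryContextSwitchTo(planner_ctx);
	plan = pg_plan_query(query, queryString, cursorOptions, params);
	MemoryContextSwitchTo(oldctx);

	if (es->memory)
	{
		MemoryContextCounters consumed;

		MemoryContextMemConsumed(planner_ctx, &consumed);
		ExplainPropertyInteger("Planner Memory", "bytes",
							   consumed.totalspace - consumed.freespace, es);
	}

i.e. no separate MemoryUsage struct, no start/stop accounting, and nothing computed at all unless the MEMORY option was requested.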
Re: Report planning memory in EXPLAIN ANALYZE
On 2023-Dec-12, Alvaro Herrera wrote: > I would replace the text in explain.sgml with this: > > + Include information on memory consumption by the query planning phase. > + This includes the precise amount of storage used by planner in-memory > + structures, as well as overall total consumption of planner memory, > + including allocation overhead. > + This parameter defaults to FALSE. (This can still use a lot more wordsmithing of course, such as avoiding repeated use of "include/ing", ugh) -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "The Gord often wonders why people threaten never to come back after they've been told never to return" (www.actsofgord.com)
Re: pg_upgrade and logical replication
On Thu, Dec 7, 2023 at 8:15 PM vignesh C wrote: > > On Tue, 5 Dec 2023 at 10:56, Michael Paquier wrote: > > > > On Mon, Dec 04, 2023 at 04:30:49PM +0530, Amit Kapila wrote: > > > I have made minor changes in the comments and code at various places. > > > See and let me know if you are not happy with the changes. I think > > > unless there are more suggestions or comments, we can proceed with > > > committing it. > > > > Yeah. I am planning to look more closely at what you have here, and > > it is going to take me a bit more time though (some more stuff planned > > for next CF, an upcoming conference and end/beginning-of-year > > vacations), but I think that targetting the beginning of next CF in > > January would be OK. > > > > Overall, I have the impression that the patch looks pretty solid, with > > a restriction in place for "init" and "ready" relations, while there > > are tests to check all the states that we expect. Seeing coverage > > about all that makes me a happy hacker. > > > > + * If retain_lock is true, then don't release the locks taken in this > > function. > > + * We normally release the locks at the end of transaction but in > > binary-upgrade > > + * mode, we expect to release those immediately. > > > > I think that this should be documented in pg_upgrade_support.c where > > the caller expects the locks to be released, and why these should be > > released. There is a risk that this comment becomes obsolete if > > AddSubscriptionRelState() with locks released is called in a different > > code path. Anyway, I am not sure to get why this is OK, or even > > necessary. It seems like a good practice to keep the locks on the > > subscription until the transaction that updates its state. If there's > > a specific reason explaining why that's better, the patch should tell > > why. > > Added comments for this. > > > + * However, this shouldn't be a problem as the upgrade ensures > > + * that all the transactions were replicated before upgrading the > > + * publisher. > > This wording looks a bit confusing to me, as "the upgrade" could refer > > to the upgrade of a subscriber, but what we want to tell is that the > > replay of the transactions is enforced when doing a publisher upgrade. > > I'd suggest something like "the upgrade of the publisher ensures that > > all the transactions were replicated before upgrading it". > > Modified > > > +my $result = $old_sub->safe_psql('postgres', > > + "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'i'"); > > +is($result, qq(t), "Check that the table is in init state"); > > > > Hmm. Not sure that this is safe. Shouldn't this be a > > poll_query_until(), polling that the state of the relation is what we > > want it to be after requesting a fresh of the publication on the > > subscriber? > > This is not required as the table will be added in init state after > "Alter Subscription ... Refresh .." command itself. > > Thanks for the comments, the attached v24 version patch has the > changes for the same. Thank you for updating the patch. Here are some minor comments: +if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) +ereport(ERROR, +errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("relation %u does not exist", relid)); + I think the error code should be ERRCODE_UNDEFINED_TABLE, and the error message should be something like "relation with OID %u does not exist". Or we might not need such checks since an undefined-object error is caught by relation_open()? --- +/* Fetch the existing tuple. 
*/ +tup = SearchSysCache2(SUBSCRIPTIONNAME, MyDatabaseId, + CStringGetDatum(subname)); +if (!HeapTupleIsValid(tup)) +ereport(ERROR, +errcode(ERRCODE_UNDEFINED_OBJECT), +errmsg("subscription \"%s\" does not exist", subname)); + +form = (Form_pg_subscription) GETSTRUCT(tup); +subid = form->oid; The above code can be replaced with "get_subscription_oid(subname, false)". binary_upgrade_replorigin_advance() has the same code. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Built-in CTYPE provider
On 12/5/23 3:46 PM, Jeff Davis wrote: > === Character Classification === > > Character classification is used for regexes, e.g. whether a character > is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" > class. Unicode defines what character properties map into these > classes in TR #18 [1], specifying both a "Standard" variant and a > "POSIX Compatible" variant. The main difference with the POSIX variant > is that symbols count as punctuation. > > === LOWER()/INITCAP()/UPPER() === > > The LOWER() and UPPER() functions are defined in the SQL spec with > surprising detail, relying on specific Unicode General Category > assignments. How to map characters seems to be left (implicitly) up to > Unicode. If the input string is normalized, the output string must be > normalized, too. Weirdly, there's no room in the SQL spec to localize > LOWER()/UPPER() at all to handle issues like [1]. Also, the standard > specifies one example, which is that "ß" becomes "SS" when folded to > upper case. INITCAP() is not in the SQL spec. > > === Questions === > > * Is a built-in ctype provider a reasonable direction for Postgres as > a project? > * Does it feel like it would be simpler or more complex than what > we're doing now? > * Do we want to just try to improve our ICU support instead? > * Do we want the built-in provider to be one thing, or have a few > options (e.g. "standard" or "posix" character classification; > "simple" or "full" case mapping)? Generally, I am in favor of this - I think we need to move in the direction of having an in-database option around unicode for PG users, given how easy it is for administrators to mis-manage dependencies. Especially when OS admins can be different from DB admins, and when nobody really understands risks of changing libs with in-place moves to new operating systems - except for like 4 of us on the mailing lists. My biggest concern is around maintenance. Every year Unicode is assigning new characters to existing code points, and those existing code points can of course already be stored in old databases before libs are updated. When users start to notice that regex [[:digit:]] or upper/lower functions aren't working correctly with characters in their DB, they'll probably come asking for fixes. And we may end up with something like the timezone database where we need to periodically add a more current ruleset - albeit alongside as a new version in this case. Here are direct links to charts of newly assigned characters from the last few Unicode updates: 2022: https://www.unicode.org/charts/PDF/Unicode-15.0/ 2021: https://www.unicode.org/charts/PDF/Unicode-14.0/ 2020: https://www.unicode.org/charts/PDF/Unicode-13.0/ 2019: https://www.unicode.org/charts/PDF/Unicode-12.0/ If I'm reading the Unicode 15 update correctly, PostgreSQL regex expressions with [[:digit:]] will not correctly identify Kaktovik or Nag Mundari or Kawi digits without that update to character type specs. If I'm reading the Unicode 12 update correctly, then upper/lower functions aren't going to work correctly on Latin Glottal A and I and U characters without that update to character type specs. Overall I see a lot fewer Unicode updates involving upper/lower than I do with digits - especially since new scripts often involve their own numbering characters which makes new digits more common. But lets remember that people like to build indexes on character classification functions like upper/lower, for case insensitive searching. 
It's another case where the index will be corrupted if someone happened to store Latin Glottal vowels in their database and then we update libs to the latest character type rules. So even with something as basic as character type, if we're going to do it right, we still need to either version it or definitively decide that we're not going to ever support newly added Unicode characters like Latin Glottals. -Jeremy -- http://about.me/jeremy_schneider
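PS: For example, something as ordinary as

	CREATE INDEX users_email_lower_idx ON users (lower(email));
	SELECT * FROM users WHERE lower(email) = lower('Foo@example.com');

(table and column names made up) is only as stable as the case-mapping rules the collation provider happened to have when the index was built.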
Clean up find_typedefs and add support for Mac
The script was using a few deprecated things according to POSIX: - -o instead of || - egrep - `` instead of $() I removed those for their "modern" equivalents. Hopefully no buildfarm member complains. I can remove any of those patches though. I did go ahead and remove egrep usage from the entire codebase while I was at it. There is still a configure check though. I'm thinking that could also be removed? I moved system detection to use uname -s. I hope that isn't a big deal. Not sure the best way to identify Mac otherwise. The big patch here is adding support for Mac. objdump -W doesn't work on Mac. So, I used dsymutil and dwarfdump to achieve the same result. I am not someone who ever uses awk, so someone should definitely check my work there. I can only confirm this works on the latest version of Mac, and have no clue how backward compatible it is. I also wrote this without having a Mac. I had to ping a coworker with a Mac for help. My goal with the Mac support is to enable use of find_typedef for extension developers, where using a Mac might be more prominent than upstream Postgres development, but that is just a guess. -- Tristan Partin Neon (https://neon.tech) From e717a14a171c0226921ffed003dedd104bf3cf99 Mon Sep 17 00:00:00 2001 From: Tristan Partin Date: Tue, 12 Dec 2023 10:13:13 -0600 Subject: [PATCH v1 1/6] Replace egrep with grep -E GNU grep has been warning about grep -E since the 3.8 series. Further GNU commentary: > 7th Edition Unix had commands egrep and fgrep that were the counterparts > of the modern grep -E and grep -F. Although breaking up grep into three > programs was perhaps useful on the small computers of the 1970s, egrep > and fgrep were not standardized by POSIX and are no longer needed. In > the current GNU implementation, egrep and fgrep issue a warning and then > act like their modern counterparts; eventually, they are planned to be > removed entirely. Further man page documentation: > This grep has been enhanced in an upwards-compatible way to provide the > exact functionality of the historical egrep and fgrep commands as well. > It was the clear intention of the standard developers to consolidate the > three greps into a single command. 
--- src/backend/port/aix/mkldexport.sh | 4 ++-- src/tools/find_typedef | 6 +++--- src/tools/perlcheck/find_perl_files | 2 +- src/tools/pginclude/pgrminclude | 6 +++--- 4 files changed, 9 insertions(+), 9 deletions(-) diff --git a/src/backend/port/aix/mkldexport.sh b/src/backend/port/aix/mkldexport.sh index adf3793e86..41405eba02 100755 --- a/src/backend/port/aix/mkldexport.sh +++ b/src/backend/port/aix/mkldexport.sh @@ -53,9 +53,9 @@ else fi fi $NM -BCg $1 | \ - egrep ' [TDB] ' | \ + grep -E ' [TDB] ' | \ sed -e 's/.* //' | \ - egrep -v '\$' | \ + grep -E -v '\$' | \ sed -e 's/^[.]//' | \ sort | \ uniq diff --git a/src/tools/find_typedef b/src/tools/find_typedef index 24e9b76651..fec0520c32 100755 --- a/src/tools/find_typedef +++ b/src/tools/find_typedef @@ -36,12 +36,12 @@ do # if objdump -W is recognized, only one line of error should appear if [ `objdump -W 2>&1 | wc -l` -eq 1 ] then # Linux objdump -W "$DIR"/* | - egrep -A3 '\(DW_TAG_typedef\)' | + grep -E -A3 '\(DW_TAG_typedef\)' | awk ' $2 == "DW_AT_name" {print $NF}' elif [ `readelf -w 2>&1 | wc -l` -gt 1 ] then # FreeBSD, similar output to Linux readelf -w "$DIR"/* | - egrep -A3 '\(DW_TAG_typedef\)' | + grep -E -A3 '\(DW_TAG_typedef\)' | awk ' $1 == "DW_AT_name" {print $NF}' fi done | @@ -50,4 +50,4 @@ sort | uniq | # these are used both for typedefs and variable names # so do not include them -egrep -v '^(date|interval|timestamp|ANY)$' +grep -E -v '^(date|interval|timestamp|ANY)$' diff --git a/src/tools/perlcheck/find_perl_files b/src/tools/perlcheck/find_perl_files index 20dceb800d..406ec7f3a0 100644 --- a/src/tools/perlcheck/find_perl_files +++ b/src/tools/perlcheck/find_perl_files @@ -12,7 +12,7 @@ find_perl_files () { find "$@" -type f -name '*.p[lm]' -print # take executable files that file(1) thinks are perl files find "$@" -type f -perm -100 -exec file {} \; -print | - egrep -i ':.*perl[0-9]*\>' | + grep -E -i ':.*perl[0-9]*\>' | cut -d: -f1 } | sort -u | grep -v '^\./\.git/' } diff --git a/src/tools/pginclude/pgrminclude b/src/tools/pginclude/pgrminclude index 7cbd2e7c9c..27c6c5bfb5 100755 --- a/src/tools/pginclude/pgrminclude +++ b/src/tools/pginclude/pgrminclude @@ -64,14 +64,14 @@ compile_file() { [ "$INCLUDE" = "pg_config.h" ] && continue [ "$INCLUDE" = "c.h" ] && continue # Stringify macros will expand undefined identifiers, so skip files that use it - egrep -q '\<(CppAsString2?|CppConcat)\>' "$FILE" && continue + grep -E -q '\<(CppAsString2?|CppConcat)\>' "$FILE" && continue # preserve configure-specific includes # these includes are surrounded by #ifdef's grep -B1 '^#include[ ][ ]*[<"]'"$INCLUDE"'[>"]' "$FILE" | - egrep -q '^#if|^#else|^#elif
Re: encoding affects ICU regex character classification
On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote: > > How would you specify what you want? One proposal would be to have a builtin collation provider: https://postgr.es/m/9d63548c4d86b0f820e1ff15a83f93ed9ded4543.ca...@j-davis.com I don't think there are very many ctype options, but they could be specified as part of the locale, or perhaps even as some provider- specific options specified at CREATE COLLATION time. > As with collating, I like the > idea of keeping support for libc even if it is terrible (some libcs > more than others) and eventually not the default, because I think > optional agreement with other software on the same host is a feature. Of course we should keep the libc support around. I'm not sure how relevant such a feature is, but I don't think we actually have to remove it. > Unless you also > implement built-in case mapping, you'd still have to call libc or ICU > for that, right? We can do built-in case mapping, see: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.ca...@j-davis.com > It seems a bit strange to use different systems for > classification and mapping. If you do implement mapping too, you > have > to decide if you believe it is language-dependent or not, I think? A complete solution would need to do the language-dependent case mapping. But that seems to only be 3 locales ("az", "lt", and "tr"), and only a handful of mapping changes, so we can handle that with the builtin provider as well. > Hmm, let's see what we're doing now... for ICU the regex code is > using > "simple" case mapping functions like u_toupper(c) that don't take a > locale, so no Turkish i/İ conversion for you, unlike our SQL > upper()/lower(), which this is supposed to agree with according to > the > comments at the top. I see why: POSIX can only do one-by-one > character mappings (which cannot handle Greek's context-sensitive > Σ->σ/ς or German's multi-character ß->SS) Regexes are inherently character-by-character, so transformations like ß->SS are not going to work for case-insensitive regex matching regardless of the provider. Σ->σ/ς does make sense, and what we have seems to be just broken: select 'ς' ~* 'Σ'; -- false in both libc and ICU select 'Σ' ~* 'ς'; -- true in both libc and ICU Similarly for titlecase variants: select 'Dž' ~* 'dž'; -- false in libc and ICU select 'dž' ~* 'Dž'; -- true in libc and ICU If we do the case mapping ourselves, we can make those work. We'd just have to modify the APIs a bit so that allcases() can actually get all of the case variants, rather than relying on just towupper/towlower. Regards, Jeff Davis
Re: encoding affects ICU regex character classification
On 12/12/23 1:39 PM, Jeff Davis wrote: > On Sun, 2023-12-10 at 10:39 +1300, Thomas Munro wrote: >> Unless you also >> implement built-in case mapping, you'd still have to call libc or ICU >> for that, right? > > We can do built-in case mapping, see: > > https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.ca...@j-davis.com > >> It seems a bit strange to use different systems for >> classification and mapping. If you do implement mapping too, you >> have >> to decide if you believe it is language-dependent or not, I think? > > A complete solution would need to do the language-dependent case > mapping. But that seems to only be 3 locales ("az", "lt", and "tr"), > and only a handful of mapping changes, so we can handle that with the > builtin provider as well. This thread has me second-guessing the reply I just sent on the other thread. Is someone able to test out upper & lower functions on U+A7BA ... U+A7BF across a few libs/versions? Theoretically the upper/lower behavior should change in ICU between Ubuntu 18.04 LTS and Ubuntu 20.04 LTS (specifically in ICU 64 / Unicode 12). And I have no idea if or when glibc might have picked up the new unicode characters. -Jeremy -- http://about.me/jeremy_schneider
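PS: Something like this should do it on a UTF-8 database (U+A7BA..U+A7BF are the capital/small Latin glottal A, I, U pairs added in Unicode 12; the "en-x-icu" collation exists when the server is built with ICU, and dropping the COLLATE clause exercises the default, typically libc, collation instead):

SELECT upper(U&'\A7BB\A7BD\A7BF' COLLATE "en-x-icu") AS upper_icu,
       lower(U&'\A7BA\A7BC\A7BE' COLLATE "en-x-icu") AS lower_icu,
       upper(U&'\A7BB\A7BD\A7BF') AS upper_default,
       lower(U&'\A7BA\A7BC\A7BE') AS lower_default;

On libraries older than Unicode 12 the characters should come back unchanged; on newer ones each letter should map to its paired form.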
Re: Clean up find_typedefs and add support for Mac
"Tristan Partin" writes: > The big patch here is adding support for Mac. objdump -W doesn't work on > Mac. So, I used dsymutil and dwarfdump to achieve the same result. We should probably nuke the current version of src/tools/find_typedef altogether in favor of copying the current buildfarm code for that. We know that the buildfarm's version works, whereas I'm not real sure that src/tools/find_typedef is being used in anger by anyone. Also, we're long past the point where developers can avoid having Perl installed. Ideally, the buildfarm client would then start to use find_typedef from the tree rather than have its own copy, both to reduce duplication and to ensure that the in-tree copy keeps working. regards, tom lane
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar wrote: > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada wrote: > > > > Hi all, > > > > As the comment of ReorderBufferLargestTXN() says, it's very slow with > > many subtransactions: > > > > /* > > * Find the largest transaction (toplevel or subxact) to evict (spill to > > disk). > > * > > * XXX With many subtransactions this might be quite slow, because we'll > > have > > * to walk through all of them. There are some options how we could improve > > * that: (a) maintain some secondary structure with transactions sorted by > > * amount of changes, (b) not looking for the entirely largest transaction, > > * but e.g. for transaction using at least some fraction of the memory > > limit, > > * and (c) evicting multiple transactions at once, e.g. to free a given > > portion > > * of the memory limit (e.g. 50%). > > */ > > > > This is because the reorderbuffer has transaction entries for each > > top-level and sub transaction, and ReorderBufferLargestTXN() walks > > through all transaction entries to pick the transaction to evict. > > I've heard the report internally that replication lag became huge when > > decoding transactions each consisting of 500k sub transactions. Note > > that ReorderBufferLargetstTXN() is used only in non-streaming mode. > > > > Here is a test script for a many subtransactions scenario. In my > > environment, the logical decoding took over 2min to decode one top > > transaction having 100k subtransctions. > > > > - > > create table test (c int); > > create or replace function testfn (cnt int) returns void as $$ > > begin > > for i in 1..cnt loop > > begin > > insert into test values (i); > > exception when division_by_zero then > > raise notice 'caught error'; > > return; > > end; > > end loop; > > end; > > $$ > > language plpgsql; > > select testfn(10) > > set logical_decoding_work_mem to '4MB'; > > select count(*) from pg_logical_slot_peek_changes('s', null, null) > > > > > > To deal with this problem, I initially thought of the idea (a) > > mentioned in the comment; use a binary heap to maintain the > > transactions sorted by the amount of changes or the size. But it seems > > not a good idea to try maintaining all transactions by its size since > > the size of each transaction could be changed frequently. > > > > The attached patch uses a different approach that consists of three > > strategies; (1) maintain the list of transactions whose size is larger > > than 10% of logical_decoding_work_mem, and preferentially evict a > > transaction from this list. If the list is empty, all transactions are > > small enough, (2) so we evict the oldest top-level transaction from > > rb->toplevel_by_lsn list. Evicting older transactions would help in > > freeing memory blocks in GenerationContext. Finally, if this is also > > empty, (3) we evict a transaction that size is > 0. Here, we need to > > note the fact that even if a transaction is evicted the > > ReorderBufferTXN entry is not removed from rb->by_txn but its size is > > 0. In the worst case where all (quite a few) transactions are smaller > > than 10% of the memory limit, we might end up checking many > > transactions to find non-zero size transaction entries to evict. So > > the patch adds a new list to maintain all transactions that have at > > least one change in memory. > > > > Summarizing the algorithm I've implemented in the patch, > > > > 1. pick a transaction from the list of large transactions (larger than > > 10% of memory limit). > > 2. 
pick a transaction from the top-level transaction list in LSN order. > > 3. pick a transaction from the list of transactions that have at least > > one change in memory. > > > > With the patch, the above test case completed within 3 seconds in my > > environment. > > Thanks for working on this, I think it would be good to test other > scenarios as well where this might have some negative impact and see > where we stand. Agreed. > 1) A scenario where suppose you have one very large transaction that > is consuming ~40% of the memory and 5-6 comparatively smaller > transactions that are just above 10% of the memory limit. And now for > coming under the memory limit instead of getting 1 large transaction > evicted out, we are evicting out multiple times. Given the large transaction list will have up to 10 transactions, I think it's cheap to pick the largest transaction among them. It's O(N) but N won't be large. > 2) Another scenario where all the transactions are under 10% of the > memory limit but let's say there are some transactions are consuming > around 8-9% of the memory limit each but those are not very old > transactions whereas there are certain old transactions which are > fairly small and consuming under 1% of memory limit and there are many > such transactions. So how it would affect if we frequently select > many of these transactions to come under memory limit instead
Re: Oversight in reparameterize_path_by_child leading to executor crash
On 6/12/2023 14:30, Richard Guo wrote: I've self-reviewed this patch again and I think it's now in a committable state. I'm wondering if we can mark it as 'Ready for Committer' now, or whether we need more review comments/feedback. To recap, this patch postpones reparameterization of paths until createplan.c, which would help avoid building the reparameterized expressions we might not use. More importantly, it makes it possible to modify the expressions in RTEs (e.g. tablesample) and in RelOptInfos (e.g. baserestrictinfo) for reparameterization. Failing to do that can lead to executor crashes, wrong results, or planning errors, as we have already seen. As I see it, this patch contains the following modifications: 1. Changes in create_nestloop_path: here, it arranges all the tests around the value of top_parent_relids, if any. It is OK for me. 2. Changes in reparameterize_path_by_child and the coupled new function path_is_reparameterizable_by_child. I compared them, and it looks good. One thing here: you use an assertion trap in the case of inconsistency in the behaviour of these two functions. How disastrous would it be if someone found such an inconsistency in production? Maybe it would be better to use an elog(PANIC, ...) report? 3. The ADJUST_CHILD_ATTRS() macro. Understanding the necessity of this change, I want to say it looks a bit ugly. 4. The SampleScan reparameterization part. It looks OK. It can help us in the future with asymmetric join and something else. 5. Tests. All of them are related to partitioning and the SampleScan issue. Maybe it is enough, but remember, this change now impacts the parameterization feature in general. 6. I can't judge how this switch of overall approach could impact something in the planner. I am only wary that, from the first reparameterization up to plan creation, we have some inconsistency in expressions (partitions refer to expressions with the parent relid). What if an extension in the middle changes the parent expression? By and large, this patch is in a good state and may be committed. -- regards, Andrei Lepikhov Postgres Professional
Re: broken master regress tests
On Tue, Oct 10, 2023 at 05:54:34PM -0700, Andres Freund wrote: > On 2023-10-10 17:08:25 -0700, Jeff Davis wrote: > > After this, it seems "make check" no longer picks up the locale from > > the system environment by default. > > Yea. I wonder if the better fix would have been to copy setenv("LC_MESSAGES", > "C", 1); > to the initdb template creation. That afaict also fixes the issue, with a > smaller blast radius? +1, that would restore the testing semantics known from v16-. I think the intent of the template was to optimize without changing semantics, and the above proposal aligns with that. Currently, for v17 alone, one needs installcheck or build file hacks to reproduce a locale-specific test failure. An alternative would be to declare that the tests are supported in one encoding+locale only, then stop testing others in the buildfarm. Even if we did that, I'm fairly sure we'd standardize on UTF8, not SQL_ASCII, as the one testable encoding.
Re: broken master regress tests
Noah Misch writes: > An alternative would be to declare that the tests are supported in one > encoding+locale only, then stop testing others in the buildfarm. Surely that's not even a thinkably acceptable choice. regards, tom lane
Re: Synchronizing slots from primary to standby
On Tue, Dec 12, 2023 at 2:44 PM shveta malik wrote: > > On Mon, Dec 11, 2023 at 7:12 PM Amit Kapila wrote: > > > > > I am > > asking this because the context used is TopMemoryContext which should > > be used only if we need something specific to be retained at the > > process level which doesn't seem to be the case here. > > > > Okay, I understand your concern. But this needs more thoughts on shall > we have all the slots synchronized in one txn or is it better to have > it existing way i.e. each slot being synchronized in its own txn > started in synchronize_one_slot. If we go by the former, can it have > any implications? > I think the one advantage of syncing each slot in a different transaction could have been if that helps with the visibility of updated slot information but that is not the case here as we always persist it to file. As per my understanding, here we need a transaction as we may access catalogs while creating/updating slots, so, a single transaction should be okay unless there are any other reasons. -- With Regards, Amit Kapila.
Re: Improve upcasting for INT range and multi range types
On Fri, Dec 8, 2023 at 4:21 AM Federico wrote: > > Hi, > > Postgresql seems to be missing upcasting when doing INT range and > multi-range operation, for example when checking if an int4 is inside > an int8 range. > Some non working example are the following > > SELECT 2::INT4 <@ '[1, 4)'::INT8RANGE > -- ERROR: operator does not exist: integer <@ int8range
select oprname, oprleft::regtype, oprright::regtype, oprcode from pg_operator where oprname = '<@';
Looking at the results, the related entries are:
 oprname |  oprleft   |   oprright    |           oprcode
---------+------------+---------------+------------------------------
 <@      | anyelement | anyrange      | elem_contained_by_range
 <@      | anyelement | anymultirange | elem_contained_by_multirange
For SELECT 2::INT4 <@ '[1, 4)'::INT8RANGE, the parser first does the operator sanity check, resolving anyelement and anyrange to concrete non-polymorphic data types, and only then calls the function elem_contained_by_range; here it fails at that first step. Per the docs https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC Similarly, if there are positions declared anyrange and others declared anyelement or anyarray, the actual range type in the anyrange positions must be a range whose subtype is the same type appearing in the anyelement positions and the same as the element type of the anyarray positions. If there are positions declared anymultirange, their actual multirange type must contain ranges matching parameters declared anyrange and base elements matching parameters declared anyelement and anyarray. Based on my interpretation, I don't think SELECT 2::INT4 <@ '[1, 4)'::INT8RANGE is doable.
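(For comparison, the spellings that do resolve today:

SELECT 2::int8 <@ '[1,4)'::int8range;   -- works, element cast to the range's subtype
SELECT 2 <@ '[1,4)'::int4range;         -- works, subtype already matches

so explicitly casting the element to the range's subtype is the current workaround.)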
Re: Improve upcasting for INT range and multi range types
jian he writes: > Based on my interpretation, I don't think SELECT 2::INT4 <@ '[1, > 4)'::INT8RANGE is doable. Yeah, it would require a considerable expansion of the scope of what can be matched by a polymorphic operator. I'm afraid that the negative consequences (mainly, "ambiguous operator" failures because more than one thing can be matched) would outweigh the benefits. It is kind of annoying though that the system can't do the "obvious" right thing here. regards, tom lane
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Dec 12, 2023 at 6:58 PM Alvaro Herrera wrote: > > [Added Andrey again in CC, because as I understand they are using this > code or something like it in production. Please don't randomly remove > people from CC lists.] Oh, glad to know that. Yeah, I generally do not remove people, but I have noticed that in the mail chain some of the reviewers just replied to me and the hackers' list, and from that point onwards I lost track of the CC list. > I've been looking at this some more, and I'm not confident that the > group clog update stuff is correct. I think the injection points test > case was good enough to discover a problem, but it's hard to get peace > of mind that there aren't other, more subtle problems. Yeah, I agree. > The problem I see is that the group update mechanism is designed around > contention of the global xact-SLRU control lock; it uses atomics to > coordinate a single queue when the lock is contended. So if we split up > the global SLRU control lock using banks, then multiple processes using > different bank locks might not contend. OK, this is fine, but what > happens if two separate groups of processes encounter contention on two > different bank locks? I think they will both try to update the same > queue, and coordinate access to that *using different bank locks*. I > don't see how this can work correctly. Let's back up a bit and start from the current design with the centralized lock. With that, if one process is holding the lock, the other processes will try to perform the group update; but if there is already a group that still hasn't got the lock and that group is trying to update a different CLOG page than the one this process wants to update, then the process will not add itself to the group update and will instead fall back to the normal lock wait. Now, in another situation, it may so happen that the group leader of the other group has already got the control lock, in which case it would have cleared 'procglobal->clogGroupFirst', which means we will now start forming a different group. So, logically, if we talk only about the optimization part, the assumption is that when we are trying to commit a lot of concurrent xids at a time, those xids are mostly of the same range and will fall in the same SLRU page, and the group update will help them. But if we are getting some out-of-range xid of some long-running transaction, it might not even go for the group update, as the page number will be different. Although the situation might be better here with a bank-wise lock, because if those xids fall in an altogether different bank they might not even contend. Now, let's talk about correctness. I think that even though we may get processes that are contending on different bank locks, we are still ensuring that in a group all the processes are trying to update the same SLRU page (i.e. the same bank as well; we will talk about the exception later[1]). One of the processes becomes the leader, and as soon as the leader gets the lock it detaches the queue from 'procglobal->clogGroupFirst' by setting it to INVALID_PGPROCNO so that other group update requesters can now form another parallel group. But here I do not see a problem with correctness. I agree someone might say that, since now there is a possibility that different groups get formed for different bank locks, we do not let other groups get formed until we get the lock for our bank, as we do not clear 'procglobal->clogGroupFirst' before we get the lock.
Other requesters might want to update the page in different banks so why block them? But the thing is the group update design is optimized for the cases where all requesters are trying to update the status of xids generated near the same range. > I suspect the first part of that algorithm, where atomics are used to > create the list without a lock, might work fine. But will each "leader" > process, each of which is potentially using a different bank lock, > coordinate correctly? Maybe this works correctly because only one > process will find the queue head not empty? If this is what happens, > then there needs to be comments about it. Yes, you got it right, I will try to comment on it better. Without any explanation, > this seems broken and potentially dangerous, as some transaction commit > bits might become lost given high enough concurrency and bad luck. > > Maybe this can be made to work by having one more lwlock that we use > solely to coordinate this task. Do you mean to say a different lock for adding/removing in the list instead of atomic operation? I think then we will lose the benefit we got in the group update by having contention on another lock. [1] I think we already know about the exception case and I have explained in the comments as well that in some cases we might add different clog page update requests in the same group, and for handling that exceptional case we are checking the respective bank lock for each page and if that exc
Re: Synchronizing slots from primary to standby
On Mon, Dec 11, 2023 at 5:13 PM shveta malik wrote: > > On Mon, Dec 11, 2023 at 1:22 PM Drouvot, Bertrand > wrote: > > > > > If we agree > > > on that then it would be good to prohibit setting this GUC on standby > > > or at least it should be a no-op even if this GUC should be set on > > > physical standby. > > > > I'd prefer to completely prohibit it on standby (to make it very clear it's > > not > > working at all) as long as one can enable it without downtime once the > > standby > > is promoted (which is the case currently). > > And I think slot-sync worker should exit as well on cascading standby. > Thoughts? > I think if one has set all the valid parameters for the slot-sync worker on a standby, we should not exit; rather, it should be a no-op, which means it should not try to sync slots from another standby. One scenario where this may help is when users promote a standby which has already synced slots from the primary. In this case, the cascading standby will become non-cascading and should sync slots. -- With Regards, Amit Kapila.
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada wrote: > > > Thanks for working on this, I think it would be good to test other > > scenarios as well where this might have some negative impact and see > > where we stand. > > Agreed. > > > 1) A scenario where suppose you have one very large transaction that > > is consuming ~40% of the memory and 5-6 comparatively smaller > > transactions that are just above 10% of the memory limit. And now for > > coming under the memory limit instead of getting 1 large transaction > > evicted out, we are evicting out multiple times. > > Given the large transaction list will have up to 10 transactions, I > think it's cheap to pick the largest transaction among them. It's O(N) > but N won't be large. Yeah, that makes sense. > > 2) Another scenario where all the transactions are under 10% of the > > memory limit but let's say there are some transactions are consuming > > around 8-9% of the memory limit each but those are not very old > > transactions whereas there are certain old transactions which are > > fairly small and consuming under 1% of memory limit and there are many > > such transactions. So how it would affect if we frequently select > > many of these transactions to come under memory limit instead of > > selecting a couple of large transactions which are consuming 8-9%? > > Yeah, probably we can do something for small transactions (i.e. small > and on-memory transactions). One idea is to pick the largest > transaction among them by iterating over all of them. Given that the > more transactions are evicted, the less transactions the on-memory > transaction list has, unlike the current algorithm, we would still > win. Or we could even split it into several sub-lists in order to > reduce the number of transactions to check. For example, splitting it > into two lists: transactions consuming 5% < and 5% >= of the memory > limit, and checking the 5% >= list preferably. The cost for > maintaining these lists could increase, though. > > Do you have any ideas? Yeah, something like what you mention might be good: we maintain 3 lists for large, medium, and small transactions. In the large transaction list, suppose we allow transactions that consume more than 10%; there could be at most 10 of them, so we can do a sequential search and spill the largest of all. Whereas in the medium list, suppose we keep transactions ranging from e.g. 3-10%; then it's just fine to pick from the head, because the size differences between the largest and smallest transactions in this list are not very significant. The rest go into the small transaction list, and from the small transaction list we can choose to spill multiple transactions at a time. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
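PS: Just to illustrate the first list, a sketch only (not the patch; 'large_node' is a hypothetical dlist link in ReorderBufferTXN used for this list):

static ReorderBufferTXN *
ReorderBufferLargestLargeTXN(dlist_head *large_txns)
{
	dlist_iter	iter;
	ReorderBufferTXN *largest = NULL;

	dlist_foreach(iter, large_txns)
	{
		ReorderBufferTXN *txn;

		txn = dlist_container(ReorderBufferTXN, large_node, iter.cur);
		if (largest == NULL || txn->size > largest->size)
			largest = txn;
	}

	return largest;
}

Since that list is capped at roughly 10 entries (each >10% of logical_decoding_work_mem), the linear scan stays cheap.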
Subsequent request from pg client
Can a postgres client send subsequent requests without first receiving the response to the previous one? If so, how does the client correlate a request with its response? Where can I get more hints about this?
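For reference, this is the kind of thing I mean, as a minimal libpq sketch (pipeline mode, PostgreSQL 14+; my understanding is that results come back strictly in the order the queries were sent, so correlation is purely positional):

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK || !PQenterPipelineMode(conn))
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		exit(1);
	}

	/* queue two queries without reading any results in between */
	PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
	PQsendQueryParams(conn, "SELECT 2", 0, NULL, NULL, NULL, NULL, 0);
	PQpipelineSync(conn);

	/* the first query's results arrive first, terminated by a NULL */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			printf("first: %s\n", PQgetvalue(res, 0, 0));
		PQclear(res);
	}

	/* then the second query's results */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			printf("second: %s\n", PQgetvalue(res, 0, 0));
		PQclear(res);
	}

	/* finally the PGRES_PIPELINE_SYNC marker for PQpipelineSync() */
	while ((res = PQgetResult(conn)) != NULL)
		PQclear(res);

	PQexitPipelineMode(conn);
	PQfinish(conn);
	return 0;
}

Is that the right mental model, or is there more to it at the protocol level?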
Re: Synchronizing slots from primary to standby
On Tue, Dec 12, 2023 at 5:56 PM Nisha Moond wrote: > > A review on v45 patch: > > If one creates a logical slot with failover=true as - > select pg_create_logical_replication_slot('logical_slot','pgoutput', > false, true, true); > > Then, uses the existing logical slot while creating a subscription - > postgres=# create subscription sub4 connection 'dbname=postgres > host=localhost port=5433' publication pub1t4 WITH > (slot_name=logical_slot, create_slot=false, failover=true); > NOTICE: changed the failover state of replication slot "logical_slot" > on publisher to false > CREATE SUBSCRIPTION > > Despite configuring logical_slot's failover to true and specifying > failover=true during subscription creation, the NOTICE indicates a > change in the failover state to 'false', without providing any > explanation for this transition. > It can be confusing for users, so IMO, the notice should include the > reason for switching failover to 'false' or should give a hint to use > either refresh=false or copy_data=false to enable failover=true for > the slot as we do in other similar 'alter subscription...' scenarios. > Agree. The NOTICE should be more informative. thanks SHveta
Re: Synchronizing slots from primary to standby
On Wed, Dec 13, 2023 at 10:40 AM Amit Kapila wrote: > > On Mon, Dec 11, 2023 at 5:13 PM shveta malik wrote: > > > > On Mon, Dec 11, 2023 at 1:22 PM Drouvot, Bertrand > > wrote: > > > > > > > If we agree > > > > on that then it would be good to prohibit setting this GUC on standby > > > > or at least it should be a no-op even if this GUC should be set on > > > > physical standby. > > > > > > I'd prefer to completely prohibit it on standby (to make it very clear > > > it's not > > > working at all) as long as one can enable it without downtime once the > > > standby > > > is promoted (which is the case currently). > > > > And I think slot-sync worker should exit as well on cascading standby. > > Thoughts? > > > > I think one has set all the valid parameters for the slot-sync worker > on standby, we should not exit, rather it should be no-op which means > it should not try to sync slots from another standby. One scenario > where this may help is when users promote the standby which has > already synced slots from the primary. In this case, cascading standby > will become non-cascading and should sync slots. > Right, then perhaps we should increase naptime in this no-op case. It could be even more then current inactivity naptime which is just 10sec. Shall it be say 5min in this case? thanks Shveta
Re: pg_upgrade and logical replication
On Wed, 13 Dec 2023 at 01:56, Masahiko Sawada wrote: > > On Thu, Dec 7, 2023 at 8:15 PM vignesh C wrote: > > > > On Tue, 5 Dec 2023 at 10:56, Michael Paquier wrote: > > > > > > On Mon, Dec 04, 2023 at 04:30:49PM +0530, Amit Kapila wrote: > > > > I have made minor changes in the comments and code at various places. > > > > See and let me know if you are not happy with the changes. I think > > > > unless there are more suggestions or comments, we can proceed with > > > > committing it. > > > > > > Yeah. I am planning to look more closely at what you have here, and > > > it is going to take me a bit more time though (some more stuff planned > > > for next CF, an upcoming conference and end/beginning-of-year > > > vacations), but I think that targetting the beginning of next CF in > > > January would be OK. > > > > > > Overall, I have the impression that the patch looks pretty solid, with > > > a restriction in place for "init" and "ready" relations, while there > > > are tests to check all the states that we expect. Seeing coverage > > > about all that makes me a happy hacker. > > > > > > + * If retain_lock is true, then don't release the locks taken in this > > > function. > > > + * We normally release the locks at the end of transaction but in > > > binary-upgrade > > > + * mode, we expect to release those immediately. > > > > > > I think that this should be documented in pg_upgrade_support.c where > > > the caller expects the locks to be released, and why these should be > > > released. There is a risk that this comment becomes obsolete if > > > AddSubscriptionRelState() with locks released is called in a different > > > code path. Anyway, I am not sure to get why this is OK, or even > > > necessary. It seems like a good practice to keep the locks on the > > > subscription until the transaction that updates its state. If there's > > > a specific reason explaining why that's better, the patch should tell > > > why. > > > > Added comments for this. > > > > > + * However, this shouldn't be a problem as the upgrade ensures > > > + * that all the transactions were replicated before upgrading the > > > + * publisher. > > > This wording looks a bit confusing to me, as "the upgrade" could refer > > > to the upgrade of a subscriber, but what we want to tell is that the > > > replay of the transactions is enforced when doing a publisher upgrade. > > > I'd suggest something like "the upgrade of the publisher ensures that > > > all the transactions were replicated before upgrading it". > > > > Modified > > > > > +my $result = $old_sub->safe_psql('postgres', > > > + "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = > > > 'i'"); > > > +is($result, qq(t), "Check that the table is in init state"); > > > > > > Hmm. Not sure that this is safe. Shouldn't this be a > > > poll_query_until(), polling that the state of the relation is what we > > > want it to be after requesting a fresh of the publication on the > > > subscriber? > > > > This is not required as the table will be added in init state after > > "Alter Subscription ... Refresh .." command itself. > > > > Thanks for the comments, the attached v24 version patch has the > > changes for the same. > > Thank you for updating the patch. 
> > Here are some minor comments: > > +if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) > +ereport(ERROR, > +errcode(ERRCODE_INVALID_PARAMETER_VALUE), > +errmsg("relation %u does not exist", relid)); > + > > I think the error code should be ERRCODE_UNDEFINED_TABLE, and the > error message should be something like "relation with OID %u does not > exist". Or we might not need such checks since an undefined-object > error is caught by relation_open()? I have removed this as it will be caught by relation_open. > --- > +/* Fetch the existing tuple. */ > +tup = SearchSysCache2(SUBSCRIPTIONNAME, MyDatabaseId, > + CStringGetDatum(subname)); > +if (!HeapTupleIsValid(tup)) > +ereport(ERROR, > +errcode(ERRCODE_UNDEFINED_OBJECT), > +errmsg("subscription \"%s\" does not > exist", subname)); > + > +form = (Form_pg_subscription) GETSTRUCT(tup); > +subid = form->oid; Modified > The above code can be replaced with "get_subscription_oid(subname, > false)". binary_upgrade_replorigin_advance() has the same code. Modified Thanks for the comments, the attached v25 version patch has the changes for the same. Regards, Vignesh From a227419d761e9e121eeaf2db621d227c6abc6dc0 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Thu, 7 Dec 2023 09:50:27 +0530 Subject: [PATCH v25] Allow upgrades to preserve the full subscription's state. This feature will allow us to replicate the changes on subscriber nodes after the upgrade. Previously, only the subscription
Fix bug with indexes on whole-row expressions
Hi Thomas Munro and Laurenz Albe, Since I wasn't subscribed to the pgsql-hackers mailing list before this bug was raised, please forgive me for not being able to reply directly to the thread below. https://www.postgresql.org/message-id/flat/e48a5d9a2d3d72985d61ee254314f5f5f5444a55.ca...@cybertec.at The following patch forbids creating indexes on whole-row expressions. I'd like to hear your opinions. -- Best Wishes, ywgrit diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index cd23ab3b25..e4451b1d36 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -572,9 +572,18 @@ DefineIndex(Oid relationId, Oid root_save_userid; int root_save_sec_context; int root_save_nestlevel; + ListCell *lc; root_save_nestlevel = NewGUCNestLevel(); + foreach (lc, stmt->indexParams) + { + IndexElem *ielem = castNode(IndexElem, lfirst(lc)); + if (IsA(ielem->expr, Var) && castNode(Var, ielem->expr)->varattno == 0) + ereport(ERROR, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), + errmsg("cannot create index on whole-row expression of table '%s'", ielem->indexcolname))); + } /* * Some callers need us to run with an empty default_tablespace; this is a * necessary hack to be able to reproduce catalog state accurately when