Re: pg_dump: Remove "blob" terminology

2022-12-02 Thread Daniel Gustafsson
> On 2 Dec 2022, at 08:09, Peter Eisentraut  
> wrote:

> fixed

+1 on this version of the patch, LGTM.

> I also put back the old long options forms in the documentation and help but 
> marked them deprecated.

+  --blobs (deprecated)
While not in scope for this patch, I wonder if we should add a similar
(deprecated) marker on other command-line options which are documented as
deprecated.  -i on bin/postgres comes to mind as one candidate.

--
Daniel Gustafsson   https://vmware.com/





Re: Minimal logical decoding on standbys

2022-12-02 Thread Drouvot, Bertrand

Hi,

On 11/25/22 11:26 AM, Drouvot, Bertrand wrote:

Hi,

On 9/30/22 2:11 PM, Drouvot, Bertrand wrote:

Hi,

On 7/6/22 3:30 PM, Drouvot, Bertrand wrote:

Hi,

On 10/28/21 11:07 PM, Andres Freund wrote:

Hi,

On 2021-10-28 16:24:22 -0400, Robert Haas wrote:

On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand  wrote:

So you have in mind to check for XLogLogicalInfoActive() first, and if true, 
then open the relation and call
RelationIsAccessibleInLogicalDecoding()?

I think 0001 is utterly unacceptable. We cannot add calls to
table_open() in low-level functions like this. Suppose for example
that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
a buffer lock on the page. The idea that it's safe to call
table_open() while holding a buffer lock cannot be taken seriously.

Yes - that's pretty clearly a deadlock hazard. It shouldn't too hard to fix, I
think. Possibly a bit more verbose than nice, but...

Alternatively we could propagate the information whether a relcache entry is
for a catalog from the table to the index. Then we'd not need to change the
btree code to pass the table down.


Looking closer at RelationIsAccessibleInLogicalDecoding(), it seems to me that the 
missing part needed to tell whether or not an index is for a catalog is the 
rd_options->user_catalog_table value of its related heap relation.

Then, a way to achieve that could be to:

- Add to Relation a new "heap_rd_options" representing the rd_options of the 
related heap relation when appropriate

- Trigger the related indexes relcache invalidations when an 
ATExecSetRelOptions() is triggered on a heap relation

- Write an equivalent of RelationIsUsedAsCatalogTable() for indexes that would 
make use of the heap_rd_options instead

Does that sound like a valid option to you or do you have another idea in mind 
to propagate the information whether a relcache entry is for a catalog from the 
table to the index?



I ended up with the attached proposal to propagate the catalog information to 
the indexes.

The attached adds a new field "isusercatalog" in pg_index to indicate whether 
or not the index is linked to a table that has the storage parameter user_catalog_table 
set to true.

Then it defines new macros, including "IndexIsAccessibleInLogicalDecoding" 
making use of this new field.

This new macro replaces get_rel_logical_catalog() that was part of the previous 
patch version.

What do you think about this approach and the attached?

If that sounds reasonable, then I'll add TAP tests for it and try to improve the way 
isusercatalog is propagated to the index(es) in case a reset is done on 
user_catalog_table on the table (currently in this POC patch, it's hardcoded to 
"false", which is the default value for user_catalog_table in boolRelOpts[]). A 
better approach would probably be to retrieve the value from the table once the 
reset is done and then propagate it to the index(es).
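To make the idea concrete, here is a stand-alone C sketch of how such an index-level check could look. The struct layouts, the field name "indisusercatalog", and the helper are illustrative stubs based on this thread, not the actual PostgreSQL definitions:

```c
#include <stdbool.h>

/*
 * Illustrative stand-alone sketch; FormData_pg_index and RelationData
 * below are stubs, not the real PostgreSQL structures.
 */
typedef struct FormData_pg_index
{
	bool		indisusercatalog;	/* hypothetical flag: owning table has
									 * user_catalog_table = true */
} FormData_pg_index;

typedef struct RelationData
{
	FormData_pg_index *rd_index;
	bool		wal_level_logical;	/* stand-in for XLogLogicalInfoActive() */
} RelationData, *Relation;

/* The point of the proposal: checking the index needs no table_open(). */
#define IndexIsUserCatalog(rel) ((rel)->rd_index->indisusercatalog)
#define IndexIsAccessibleInLogicalDecoding(rel) \
	((rel)->wal_level_logical && IndexIsUserCatalog(rel))

/* Convenience wrapper so the check is easy to exercise. */
bool
index_accessible(bool logical_wal, bool usercatalog)
{
	FormData_pg_index form = {.indisusercatalog = usercatalog};
	RelationData rel = {.rd_index = &form, .wal_level_logical = logical_wal};

	return IndexIsAccessibleInLogicalDecoding(&rel);
}
```

Since the flag lives in the index's own catalog row, the check is safe even while holding a buffer lock, which was the hazard Robert pointed out.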


Please find attached a rebase to propagate the catalog information to the 
indexes.
It also takes care of the RESET on user_catalog_table (adding a new Macro 
"HEAP_DEFAULT_USER_CATALOG_TABLE") and adds a few tests in 
contrib/test_decoding/sql/ddl.sql.


Please find attached a new patch series:

v27-0001-Add-info-in-WAL-records-in-preparation-for-logic.patch
v27-0002-Handle-logical-slot-conflicts-on-standby.patch
v27-0003-Allow-logical-decoding-on-standby.patch
v27-0004-New-TAP-test-for-logical-decoding-on-standby.patch
v27-0005-Doc-changes-describing-details-about-logical-dec.patch
v27-0006-Fixing-Walsender-corner-case-with-logical-decodi.patch

with the previous comments addressed, mainly:

1/ don't call table_open() in low-level functions in 0001: this is done with a new field 
"isusercatalog" in pg_index to indicate whether or not the index is linked to a table 
that has the storage parameter user_catalog_table set to true (we may want to make this field 
"invisible" though). This new field is then used in the new 
IndexIsAccessibleInLogicalDecoding Macro (through IndexIsUserCatalog).

2/ Renaming the new field generated in the xlog record (to arrange conflict handling) from 
"onCatalogTable" to "onCatalogAccessibleInLogicalDecoding" to avoid any 
confusion (see 0001).

3/ Making sure that "currTLI" is the current one in logical_read_xlog_page() 
(see 0003).

4/ Fixing Walsender/startup process corner case: It's done in 0006 (I thought it 
better to keep the other patches purely "feature" related and to address this 
corner case separately to ease the review). The fix makes use of a new 
condition variable "replayedCV" so that the startup process can broadcast to the 
walsender(s) once a replay is done.
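As a stand-alone illustration of the broadcast/wait pattern described in 4/ (using POSIX condition variables rather than PostgreSQL's ConditionVariable API; all names here are invented for the example):

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Stand-alone analogue of the proposed "replayedCV": the startup process
 * advances the last-replayed position and broadcasts; walsenders sleep
 * until the position they need has been replayed.
 */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t replayedCV = PTHREAD_COND_INITIALIZER;
static uint64_t lastReplayedLSN = 0;

/* startup-process side: record replay progress, then wake all waiters */
void
advance_replay(uint64_t upto)
{
	pthread_mutex_lock(&lock);
	lastReplayedLSN = upto;
	pthread_cond_broadcast(&replayedCV);
	pthread_mutex_unlock(&lock);
}

/* walsender side: block until the requested position has been replayed */
uint64_t
wait_for_replay(uint64_t target)
{
	uint64_t	seen;

	pthread_mutex_lock(&lock);
	while (lastReplayedLSN < target)	/* re-check after every wakeup */
		pthread_cond_wait(&replayedCV, &lock);
	seen = lastReplayedLSN;
	pthread_mutex_unlock(&lock);
	return seen;
}
```

The re-check loop matters: a broadcast only means "progress happened", so each walsender must verify its own target position before proceeding.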

Remarks:

- The new confl_active_logicalslot field added in pg_stat_database_conflicts 
(see 0002) is incremented only if the slot being invalidated is active (I think 
it makes more sense in regard to the other fields too). In all the cases 
(active/not

Re: [PATCH] Add native windows on arm64 support

2022-12-02 Thread Niyas Sait



On 02/12/2022 05:02, Michael Paquier wrote:

Thanks for the updated version.  I have been looking at it closely and
it looks like it should be able to do the job (no arm64 machine for
Windows here, sigh).

I have one tiny comment about this part:

-   USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK => undef,
+   USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK => 1,

Shouldn't we only enable this flag when we are under aarch64?
Similarly, I don't think that it is a good idea to switch on
USE_SSE42_CRC32C_WITH_RUNTIME_CHECK all the time.  We should only set
it when building under x86 and x86_64.



Ok, I will fix ARM64 specific one in the next revision.

--
Niyas




Re: [PATCH] Add native windows on arm64 support

2022-12-02 Thread Niyas Sait

On 02/12/2022 05:41, John Naylor wrote:

I couldn't find something more official for the sse2neon library part.

Not quite sure what this is referring to, but it seems we can just point to
the __aarch64__ section in the same file, which uses the same instruction:

spin_delay(void)
{
   __asm__ __volatile__(
   " isb; \n");
}

...and which already explains the choice with a comment.


Good point. Will add the comment.


+  if cc.get_id() == 'msvc'
+cdata.set('USE_ARMV8_CRC32C', false)
+cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
+have_optimized_crc = true
+  else

That seems like a heavy-handed way to force it. Could we just use the same
gating in the test program that the patch puts in the code of interest?
Namely:

+#ifndef _MSC_VER
 #include 
+#endif

I took a similar approach as the x86 MSVC code. I don't think the test 
program would work with MSVC; the compiler options are not MSVC friendly.


--
Niyas




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-02 Thread Amit Kapila
On Fri, Dec 2, 2022 at 2:29 PM Peter Smith  wrote:
>
> 3. pa_setup_dsm
>
> +/*
> + * Set up a dynamic shared memory segment.
> + *
> + * We set up a control region that contains a fixed-size worker info
> + * (ParallelApplyWorkerShared), a message queue, and an error queue.
> + *
> + * Returns true on success, false on failure.
> + */
> +static bool
> +pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
>
> IMO that's confusing to say "fixed-sized worker info" when it's
> referring to the ParallelApplyWorkerShared structure and not the other
> ParallelApplyWorkerInfo.
>
> Might be better to say:
>
> "a fixed-size worker info (ParallelApplyWorkerShared)" -> "a
> fixed-size struct (ParallelApplyWorkerShared)"
>
> ~~~
>

I find the existing wording better than what you are proposing. We can
remove the structure name if you think that is better, but IMO the current
wording is good.

>
> 6. pa_free_worker_info
>
> + /*
> + * Ensure this worker information won't be reused during worker
> + * allocation.
> + */
> + ParallelApplyWorkersList = list_delete_ptr(ParallelApplyWorkersList,
> +winfo);
>
> SUGGESTION 1
> Removing from the worker pool ensures this information won't be reused
> during worker allocation.
>
> SUGGESTION 2 (more simply)
> Remove from the worker pool.
>

+1 for the second suggestion.

> ~~~
>
> 7. HandleParallelApplyMessage
>
> + /*
> + * The actual error must have been reported by the parallel
> + * apply worker.
> + */
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("logical replication parallel apply worker exited abnormally"),
> + errcontext("%s", edata.context)));
>
> Maybe it's better to remove the comment, but replace it with an
> errhint that tells the user "For the cause of this error see the error
> logged by the logical replication parallel apply worker."
>

I am not sure if such an errhint is a good idea; anyway, I think both
the errors will be adjacent in the logs unless there is some other
error in the short span.
> ~~~
>
> 17. apply_handle_stream_stop
>
> + case TRANS_PARALLEL_APPLY:
> + elog(DEBUG1, "applied %u changes in the streaming chunk",
> + parallel_stream_nchanges);
> +
> + /*
> + * By the time parallel apply worker is processing the changes in
> + * the current streaming block, the leader apply worker may have
> + * sent multiple streaming blocks. This can lead to parallel apply
> + * worker start waiting even when there are more chunk of streams
> + * in the queue. So, try to lock only if there is no message left
> + * in the queue. See Locking Considerations atop
> + * applyparallelworker.c.
> + */
>
> SUGGESTION (minor rewording)
>
> By the time the parallel apply worker is processing the changes in the
> current streaming block, the leader apply worker may have sent
> multiple streaming blocks. To prevent the parallel apply worker from waiting
> unnecessarily, try to lock only if there is no message left in the
> queue. See Locking Considerations atop applyparallelworker.c.
>

I have proposed the additional line (This can lead to parallel apply
worker start waiting even when there are more chunk of streams in the
queue.) because it took me some time to understand this particular
scenario.

-- 
With Regards,
Amit Kapila.




Re: ExecRTCheckPerms() and many prunable partitions

2022-12-02 Thread Alvaro Herrera
Hello,

On 2022-Dec-02, Amit Langote wrote:

> This sounds like a better idea than adding a new AttrMap, so done this
> way in the attached 0001.

Thanks for doing that!  I have pushed it, but I renamed
ri_RootToPartitionMap to ri_RootToChildMap and moved it to another spot
in ResultRelInfo, which made it possible to simplify the comments.

> I've also merged into 0002 the delta patch I had posted earlier to add
> a copy of RTEPermInfos into the flattened permInfos list instead of
> adding the Query's copy.

Great.  At this point I have no other comments, except that in both
parse_relation.c and rewriteManip.c you've chosen to add the new
functions at the bottom of each file, which is seldom a good choice.
I think in the case of CombineRangeTables it should be the very first
function in the file, before all the walker-type stuff; and for
Add/GetRTEPermissionInfo I would suggest that right below
addRangeTableEntryForENR might be a decent choice (need to fix the .h
files to match, of course.)

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"(Andrew Morton)




Re: Fix gin index cost estimation

2022-12-02 Thread Ronan Dunklau
Le mardi 25 octobre 2022, 16:18:57 CET Tom Lane a écrit :
> Alexander Korotkov  writes:
> > I think Tom's point was that it's wrong to add a separate entry-tree CPU
> > cost estimation to another estimation, which tries (very inadequately) to
> > estimate the whole scan cost. Instead, I propose writing better
> > estimations
> > for both entry-tree CPU cost and data-trees CPU cost and replacing
> > existing
> > CPU estimation altogether.
> 
> Great idea, if someone is willing to do it ...
> 
>   regards, tom lane

Hello,

Sorry for the delay, but here is an updated patch which changes the costing in 
the following way:

- a descent cost similar to the btree one is charged for the initial 
entry-tree descent
- additionally, a charge is applied per page in both the entry tree and 
posting trees / lists
- instead of charging the quals to each tuple, charge them per entry only. We 
still charge cpu_index_tuple_cost per tuple though.

With those changes, there is no need to tweak the magic number formula estimating 
the number of pages. Maybe we can come up with something better for estimating 
those later on?

Best regards,

--
Ronan Dunklau
From 49bc71f25967f9367196803be363509922aac8ab Mon Sep 17 00:00:00 2001
From: Ronan Dunklau 
Date: Mon, 12 Sep 2022 15:40:18 +0200
Subject: [PATCH v3] Fix gin costing.

GIN index scans were not taking any descent CPU-based cost into account. That made
them look cheaper than other types of indexes when they shouldn't be.

We use the same heuristic as for btree indexes, but multiplying it by
the number of searched entries.

Additionally, the cpu cost for the tree was based largely on
genericcostestimate. For a GIN index, we should not charge index quals
per tuple, but per entry. On top of this, charge cpu_index_tuple_cost
per actual tuple.

This should fix the cases where a GIN index is preferred over a btree,
and the ones where a memoize node is not added on top of the GIN index
scan because it seemed too cheap.

Per report of Hung Nguyen.
---
 src/backend/utils/adt/selfuncs.c | 50 +---
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index f116924d3c..1d07ae8e95 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7445,6 +7445,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 qual_arg_cost,
 spc_random_page_cost,
 outer_scans;
+	Cost		descentCost;
 	Relation	indexRel;
 	GinStatsData ginStats;
 	ListCell   *lc;
@@ -7669,6 +7670,41 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 */
 	dataPagesFetched = ceil(numDataPages * partialScale);
 
+	*indexStartupCost = 0;
+	*indexTotalCost = 0;
+
+	/*
+	 * Add a CPU-cost component to represent the costs of initial entry btree
+	 * descent.  We don't charge any I/O cost for touching upper btree levels,
+	 * since they tend to stay in cache, but we still have to do about log2(N)
+	 * comparisons to descend a btree of N leaf tuples.  We charge one
+	 * cpu_operator_cost per comparison.
+	 *
+	 * If there are ScalarArrayOpExprs, charge this once per SA scan.  The
+	 * ones after the first one are not startup cost so far as the overall
+	 * plan is concerned, so add them only to "total" cost.
+	 */
+	if (numEntries > 1)			/* avoid computing log(0) */
+	{
+		descentCost = ceil(log(numEntries) / log(2.0)) * cpu_operator_cost;
+		*indexStartupCost += descentCost * counts.searchEntries;
+		*indexTotalCost += counts.arrayScans * descentCost * counts.searchEntries;
+	}
+
+	/*
+	 * Add a cpu cost per entry-page fetched. This is not amortized over a loop.
+	 */
+	*indexStartupCost += entryPagesFetched * 50.0 * cpu_operator_cost;
+	*indexTotalCost += entryPagesFetched * counts.arrayScans * 50.0 * cpu_operator_cost;
+
+	/*
+	 * Add a cpu cost per data-page fetched. This is also not amortized over a loop.
+	 * We only charge one data page for the startup cost, and everything else to
+	 * the total cost.
+	 */
+	*indexStartupCost += 50.0 * cpu_operator_cost;
+	*indexTotalCost += dataPagesFetched * counts.arrayScans * 50.0 * cpu_operator_cost;
+
 	/*
 	 * Calculate cache effects if more than one scan due to nestloops or array
 	 * quals.  The result is pro-rated per nestloop scan, but the array qual
@@ -7692,7 +7728,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 * Here we use random page cost because logically-close pages could be far
 	 * apart on disk.
 	 */
-	*indexStartupCost = (entryPagesFetched + dataPagesFetched) * spc_random_page_cost;
+	*indexStartupCost += (entryPagesFetched + dataPagesFetched) * spc_random_page_cost;
 
 	/*
 	 * Now compute the number of data pages fetched during the scan.
@@ -7720,6 +7756,9 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	if (dataPagesFetchedBySel > dataPagesFetched)
 		dataPagesFetched = dataPagesFetchedBySel;
 
+	/* Add once a

Re: generic plans and "initial" pruning

2022-12-02 Thread Amit Langote
On Thu, Dec 1, 2022 at 9:43 PM Amit Langote  wrote:
> On Thu, Dec 1, 2022 at 8:21 PM Alvaro Herrera  wrote:
> > On 2022-Dec-01, Amit Langote wrote:
> > > Hmm, how about keeping the [Merge]Append's parent relation's RT index
> > > in the PartitionPruneInfo and passing it down to
> > > ExecInitPartitionPruning() from ExecInit[Merge]Append() for
> > > cross-checking?  Both Append and MergeAppend already have a
> > > 'apprelids' field that we can save a copy of in the
> > > PartitionPruneInfo.  Tried that in the attached delta patch.
> >
> > Ah yeah, that sounds about what I was thinking.  I've merged that in and
> > pushed to github, which had a strange pg_upgrade failure on Windows
> > mentioning log files that were not captured by the CI tooling.  So I
> > pushed another one trying to grab those files, in case it wasn't an
> > one-off failure.  It's running now:
> >   https://cirrus-ci.com/task/5857239638999040
> >
> > If all goes well with this run, I'll get this 0001 pushed.
>
> Thanks for pushing 0001.
>
> Rebased 0002 attached.

Thought it might be good for PartitionPruneResult to also have
root_parent_relids that matches the corresponding
PartitionPruneInfo.  ExecInitPartitionPruning() does a sanity check
that the root_parent_relids of a given pair of PartitionPrune{Info |
Result} match.

Posting the patch separately as the attached 0002, just in case you
might think that the extra cross-checking would be overkill.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com


v26-0002-Add-root_parent_relids-to-PartitionPruneResult.patch
Description: Binary data


v26-0001-Optimize-AcquireExecutorLocks-by-locking-only-un.patch
Description: Binary data


Re: Logical Replication Custom Column Expression

2022-12-02 Thread Ashutosh Bapat
On Wed, Nov 30, 2022 at 2:09 PM Stavros Koureas
 wrote:
>
>
>
> On Tue, Nov 29, 2022 at 3:27 PM, Ashutosh Bapat
> wrote:
> > That would be too restrictive - not necessarily in your application
> > but generally. There could be some tables where consolidating rows
> > with same PK from different publishers into a single row in subscriber
> > would be desirable. I think we need to enable the property for every
> > subscriber that intends to add publisher column to the desired and
> > subscribed tables. But there should be another option per table which
> > will indicate that receiver should add publisher when INSERTING row to
> > that table.
>
> So we are discussing the scope of this property: whether it will be 
> implemented at the subscriber level or at the subscriber-table level.
> In that case I am not sure how this will be implemented as currently postgres 
> subscribers can have multiple tables streamed from a single publisher.
> In that case we may have an additional syntax on subscriber, for example:
>
> CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost port=5432 user=postgres 
> password=XX dbname=publisher1' PUBLICATION pub1 with (enabled = false, 
> create_slot = false, slot_name = NONE, tables = {tableA:union, tableB:none, 
> });
>
> Something like this?

Nope, I think we will need to add a table level property through table
options or receiver can infer it by looking at the table columns -
e.g. existence of origin_id column or some such thing.


-- 
Best Wishes,
Ashutosh Bapat




WAL Insertion Lock Improvements (was: Re: Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish())

2022-12-02 Thread Bharath Rupireddy
On Fri, Dec 2, 2022 at 6:10 AM Andres Freund  wrote:
>
> On 2022-11-25 16:54:19 +0530, Bharath Rupireddy wrote:
> > On Fri, Nov 25, 2022 at 12:16 AM Andres Freund  wrote:
> > > I think we could improve this code more significantly by avoiding the 
> > > call to
> > > LWLockWaitForVar() for all locks that aren't acquired or don't have a
> > > conflicting insertingAt, that'd require just a bit more work to handle 
> > > systems
> > > without tear-free 64bit writes/reads.
> > >
> > > The easiest way would probably be to just make insertingAt a 64bit atomic,
> > > that transparently does the required work to make even non-atomic 
> > > read/writes
> > > tear free. Then we could trivially avoid the spinlock in
> > > LWLockConflictsWithVar(), LWLockReleaseClearVar() and with just a bit more
> > > work add a fastpath to LWLockUpdateVar(). We don't need to acquire the 
> > > wait
> > > list lock if there aren't any waiters.
> > >
> > > I'd bet that start to have visible effects in a workload with many small
> > > records.
> >
> > Thanks Andres! I quickly came up with the attached patch. I also ran
> > an insert test [1], below are the results. I also attached the results
> > graph. The cirrus-ci is happy with the patch -
> > https://github.com/BRupireddy/postgres/tree/wal_insertion_lock_improvements_v1_2.
> > Please let me know if the direction of the patch seems right.
> > clients    HEAD  PATCHED
> >       1    1354     1499
> >       2    1451     1464
> >       4    3069     3073
> >       8    5712     5797
> >      16   11331    11157
> >      32   22020    22074
> >      64   41742    42213
> >     128   71300    76638
> >     256  103652   118944
> >     512  111250   161582
> >     768   99544   161987
> >    1024   96743   164161
> >    2048   72711   156686
> >    4096   54158   135713
>
> Nice.

Thanks for taking a look at it.

> > From 293e789f9c1a63748147acd613c556961f1dc5c4 Mon Sep 17 00:00:00 2001
> > From: Bharath Rupireddy 
> > Date: Fri, 25 Nov 2022 10:53:56 +
> > Subject: [PATCH v1] WAL Insertion Lock Improvements
> >
> > ---
> >  src/backend/access/transam/xlog.c |  8 +++--
> >  src/backend/storage/lmgr/lwlock.c | 56 +--
> >  src/include/storage/lwlock.h  |  7 ++--
> >  3 files changed, 41 insertions(+), 30 deletions(-)
> >
> > diff --git a/src/backend/access/transam/xlog.c 
> > b/src/backend/access/transam/xlog.c
> > index a31fbbff78..b3f758abb3 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -376,7 +376,7 @@ typedef struct XLogwrtResult
> >  typedef struct
> >  {
> >   LWLock  lock;
> > - XLogRecPtr  insertingAt;
> > + pg_atomic_uint64insertingAt;
> >   XLogRecPtr  lastImportantAt;
> >  } WALInsertLock;
> >
> > @@ -1482,6 +1482,10 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
> >   {
> >   XLogRecPtr  insertingat = InvalidXLogRecPtr;
> >
> > + /* Quickly check and continue if no one holds the lock. */
> > + if (!IsLWLockHeld(&WALInsertLocks[i].l.lock))
> > + continue;
>
> I'm not sure this is quite right - don't we need a memory barrier. But I don't
> see a reason to not just leave this code as-is. I think this should be
> optimized entirely in lwlock.c

Actually, we don't need that at all as LWLockWaitForVar() will return
immediately if the lock is free. So, I removed it.

> I'd probably split the change to an atomic from other changes either way.

Done. I've added commit messages to each of the patches.

I've also brought the patch from [1] here as 0003.

Thoughts?
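For readers following along, the 64-bit-atomic part of 0001 can be illustrated stand-alone with C11 atomics (this is an analogue of what pg_atomic_uint64 provides, not the patch itself; the struct and function names are invented):

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Stand-alone C11 analogue of turning insertingAt into an atomic: 64-bit
 * loads/stores that cannot be "torn" into two 32-bit halves on platforms
 * without native 64-bit memory operations, so a reader such as
 * LWLockConflictsWithVar() could inspect the value without taking the
 * wait-list spinlock.
 */
typedef struct WALInsertLockDemo
{
	_Atomic uint64_t insertingAt;
} WALInsertLockDemo;

void
set_inserting_at(WALInsertLockDemo *lock, uint64_t ptr)
{
	/* release store: publish the insert position as one unit */
	atomic_store_explicit(&lock->insertingAt, ptr, memory_order_release);
}

uint64_t
read_inserting_at(WALInsertLockDemo *lock)
{
	/* acquire load: always sees a whole 64-bit value, never a torn one */
	return atomic_load_explicit(&lock->insertingAt, memory_order_acquire);
}

/* Round-trip helper to make the guarantee easy to exercise. */
uint64_t
roundtrip_inserting_at(uint64_t value)
{
	WALInsertLockDemo lock;

	atomic_init(&lock.insertingAt, 0);
	set_inserting_at(&lock, value);
	return read_inserting_at(&lock);
}
```

On platforms with native 64-bit atomics this compiles to plain loads and stores plus ordering; elsewhere the implementation falls back to whatever the toolchain provides, which is exactly the transparency Andres described.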

[1] 
https://www.postgresql.org/message-id/CALj2ACXtQdrGXtb%3DrbUOXddm1wU1vD9z6q_39FQyX0166dq%3D%3DA%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v2-0001-Make-insertingAt-64-bit-atomic.patch
Description: Binary data


v2-0002-Add-fastpath-to-LWLockUpdateVar.patch
Description: Binary data


v2-0003-Make-lastImportantAt-64-bit-atomic.patch
Description: Binary data


Re: [PATCH] Add native windows on arm64 support

2022-12-02 Thread Niyas Sait

Hello,

I've attached a new revision of the patch (v5), which includes the following 
changes:


1. Add support for meson build system
2. Extend MSVC scripts to handle ARM64 platform.
3. Add arm64 definition of spin_delay function.
4. Exclude arm_acle.h import with MSVC compiler.

V4->V5: * Added reference to Microsoft arm64 intrinsic documentation
* Conditionally enable USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK
* Fixed styling issue spotted by Ian Lawrence Barwick
V3->V4: Add support for meson build system
V2->V3: Removed ASLR enablement and rebased on master.
V1->V2: Rebased on top of master

--
Niyas

From 58b82107088456fc71d239f4e1b995d91a94b4ef Mon Sep 17 00:00:00 2001
From: Niyas Sait 
Date: Tue, 22 Feb 2022 13:07:24 +
Subject: [PATCH v5] Enable postgres native build for windows-arm64 platform

Following changes are included

- Extend MSVC scripts to handle ARM64 platform
- Add arm64 definition of spin_delay function
- Exclude arm_acle.h import for MSVC
- Add support for meson build
---
 meson.build  | 33 +++-
 src/include/storage/s_lock.h | 10 +-
 src/port/pg_crc32c_armv8.c   |  2 ++
 src/tools/msvc/MSBuildProject.pm | 16 
 src/tools/msvc/Mkvcbuild.pm  |  9 +++--
 src/tools/msvc/Solution.pm   | 11 ---
 src/tools/msvc/gendef.pl |  8 
 7 files changed, 62 insertions(+), 27 deletions(-)

diff --git a/meson.build b/meson.build
index 725e10d815..e354ad7650 100644
--- a/meson.build
+++ b/meson.build
@@ -1944,7 +1944,13 @@ int main(void)
 
 elif host_cpu == 'arm' or host_cpu == 'aarch64'
 
-  prog = '''
+  if cc.get_id() == 'msvc'
+cdata.set('USE_ARMV8_CRC32C', false)
+cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
+have_optimized_crc = true
+  else
+
+prog = '''
 #include 
 
 int main(void)
@@ -1960,18 +1966,19 @@ int main(void)
 }
 '''
 
-  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
without -march=armv8-a+crc',
-  args: test_c_args)
-# Use ARM CRC Extension unconditionally
-cdata.set('USE_ARMV8_CRC32C', 1)
-have_optimized_crc = true
-  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
with -march=armv8-a+crc',
-  args: test_c_args + ['-march=armv8-a+crc'])
-# Use ARM CRC Extension, with runtime check
-cflags_crc += '-march=armv8-a+crc'
-cdata.set('USE_ARMV8_CRC32C', false)
-cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
-have_optimized_crc = true
+if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
without -march=armv8-a+crc',
+args: test_c_args)
+  # Use ARM CRC Extension unconditionally
+  cdata.set('USE_ARMV8_CRC32C', 1)
+  have_optimized_crc = true
+elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd 
with -march=armv8-a+crc',
+args: test_c_args + ['-march=armv8-a+crc'])
+  # Use ARM CRC Extension, with runtime check
+  cflags_crc += '-march=armv8-a+crc'
+  cdata.set('USE_ARMV8_CRC32C', false)
+  cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
+  have_optimized_crc = true
+endif
   endif
 endif
 
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 8b19ab160f..ab6a6e0281 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -708,13 +708,21 @@ typedef LONG slock_t;
 #define SPIN_DELAY() spin_delay()
 
 /* If using Visual C++ on Win64, inline assembly is unavailable.
- * Use a _mm_pause intrinsic instead of rep nop.
+ * Use _mm_pause (x64) or __isb (arm64) intrinsic instead of rep nop.
  */
 #if defined(_WIN64)
 static __forceinline void
 spin_delay(void)
 {
+#ifdef _M_ARM64
+   /*
+* See spin_delay aarch64 inline assembly definition above for details
+* ref: 
https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions
+   */
+   __isb(_ARM64_BARRIER_SY);
+#else
_mm_pause();
+#endif
 }
 #else
 static __forceinline void
diff --git a/src/port/pg_crc32c_armv8.c b/src/port/pg_crc32c_armv8.c
index 9e301f96f6..981718752f 100644
--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -14,7 +14,9 @@
  */
 #include "c.h"
 
+#ifndef _MSC_VER
 #include 
+#endif
 
 #include "port/pg_crc32c.h"
 
diff --git a/src/tools/msvc/MSBuildProject.pm b/src/tools/msvc/MSBuildProject.pm
index 58590fdac2..274ddc8860 100644
--- a/src/tools/msvc/MSBuildProject.pm
+++ b/src/tools/msvc/MSBuildProject.pm
@@ -310,10 +310,18 @@ sub WriteItemDefinitionGroup
  : ($self->{type} eq "dll" ? 'DynamicLibrary' : 'StaticLibrary');
my $libs = $self->GetAdditionalLinkerDependencies($cfgname, ';');
 
-   my $targetmachine =
- $self->{platform} eq 'Win32' ? 'MachineX86' : 'MachineX64';
-   my $arch =
- $self->{platform} eq 'Win32' ? 'x86' : 'x86_64';
+   my $targetmachine;
+   my $arch;
+   if ($self->{platform} eq 'Win32') {
+

Re: Missing MaterialPath support in reparameterize_path_by_child

2022-12-02 Thread Ashutosh Bapat
Hi Tom,

On Fri, Dec 2, 2022 at 8:25 AM Tom Lane  wrote:
>
> Whilst fooling with my outer-join-aware-Vars patch, I tripped
> across a multi-way join query that failed with
>   ERROR:  could not devise a query plan for the given query
> when enable_partitionwise_join is on.
>
> I traced that to the fact that reparameterize_path_by_child()
> omits support for MaterialPath, so that if the only surviving
> path(s) for a child join include materialization steps, we'll
> fail outright to produce a plan for the parent join.
>
> Unfortunately, I don't have an example that produces such a
> failure against HEAD.  It seems certain to me that such cases
> exist, though, so I'd like to apply and back-patch the attached.

From this comment, which I wrote back when I implemented that function,
I wonder if we thought a MaterialPath wouldn't appear on the inner side
of a nestloop join. But that can't be the case. Or probably we just didn't
encounter a MaterialPath there in our tests.
 * This function is currently only applied to the inner side of a nestloop
 * join that is being partitioned by the partitionwise-join code.  Hence,
 * we need only support path types that plausibly arise in that context.
But I think it's good to have MaterialPath there.

>
> I'm suspicious now that reparameterize_path() should be
> extended likewise, but I don't really have any hard
> evidence for that.

I think we need it there since the scope of paths under appendrel has
certainly expanded a lot because of partitioned table optimizations.

The patch looks good to me.

-- 
Best Wishes,
Ashutosh Bapat




RE: Perform streaming logical transactions by background workers and parallel apply

2022-12-02 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thanks for making the patch. Following are my comments for v54-0003 and 0004.

0003

pa_free_worker()

+   /* Unlink any files that were needed to serialize partial changes. */
+   if (winfo->serialize_changes)
+   stream_cleanup_files(MyLogicalRepWorker->subid, 
winfo->shared->xid);
+

I think this part is not needed, because the LA cannot reach here if 
winfo->serialize_changes is true. Moreover stream_cleanup_files() is done in 
pa_free_worker_info().

LogicalParallelApplyLoop()

The parallel apply worker wakes up every 0.1s even if we are in the 
PARTIAL_SERIALIZE mode. Do you have an idea for reducing that?

```
+   pa_spooled_messages();
```

Comments are needed here, like "Changes may be serialized...".

pa_stream_abort()

```
+   /*
+* Reopen the file and set the file position to 
the saved
+* position.
+*/
+   if (reopen_stream_fd)
+   {
+   charpath[MAXPGPATH];
+
+   changes_filename(path, 
MyLogicalRepWorker->subid, xid);
+   stream_fd = 
BufFileOpenFileSet(&MyParallelShared->fileset,
+   
   path, O_RDONLY, false);
+   BufFileSeek(stream_fd, fileno, offset, 
SEEK_SET);
+   }
```

MyParallelShared->serialize_changes may be used instead of reopen_stream_fd.


worker.c

```
-#include "storage/buffile.h"
```

I think this include should not be removed.


handle_streamed_transaction()

```
+   if (apply_action == TRANS_LEADER_SEND_TO_PARALLEL)
+   pa_send_data(winfo, s->len, s->data);
+   else
+   stream_write_change(action, &original_msg);
```

Comments are needed here; 0001 had that but it was removed in 0002.
There are some similar lines.


```
+   /*
+    * It is possible that while sending this change to the parallel
+    * apply worker we need to switch to serialize mode.
+    */
+   if (winfo->serialize_changes)
+       pa_set_fileset_state(winfo->shared, FS_READY);
```

There are three identical parts in the code; can we combine them into a common part?
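For illustration, the three duplicated checks could collapse into a single helper along these lines (a sketch only; the stub types and the helper name are invented, not the actual worker.c definitions):

```c
#include <assert.h>

/* Stub stand-ins for the real parallel-apply structures. */
typedef enum { FS_BUSY, FS_READY } PartialFileSetState;

typedef struct
{
    PartialFileSetState fileset_state;
} ParallelApplyWorkerShared;

typedef struct
{
    int serialize_changes;
    ParallelApplyWorkerShared *shared;
} ParallelApplyWorkerInfo;

/* Stub for the real state-setting routine. */
static void
pa_set_fileset_state(ParallelApplyWorkerShared *shared, PartialFileSetState state)
{
    shared->fileset_state = state;
}

/* The three duplicated call sites could all go through one helper. */
static void
pa_mark_fileset_ready_if_serializing(ParallelApplyWorkerInfo *winfo)
{
    if (winfo->serialize_changes)
        pa_set_fileset_state(winfo->shared, FS_READY);
}
```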

apply_spooled_messages()

```
+   /*
+    * Break the loop if the parallel apply worker has finished applying
+    * the transaction. The parallel apply worker should have closed the
+    * file before committing.
+    */
+   if (am_parallel_apply_worker() &&
+       MyParallelShared->xact_state == PARALLEL_TRANS_FINISHED)
+       goto done;
```

I think pfree(buffer) and pfree(s2.data) should not be skipped.
And this part should be below "nchanges++;"


0004

set_subscription_retry()

```
+   LockSharedObject(SubscriptionRelationId, MySubscription->oid, 0,
+AccessShareLock);
+
```

I think AccessExclusiveLock should be acquired instead of AccessShareLock.
In AlterSubscription(), LockSharedObject(AccessExclusiveLock) seems to be used.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Avoid streaming the transaction which are skipped (in corner cases)

2022-12-02 Thread Ashutosh Bapat
Hi Dilip,

On Tue, Nov 29, 2022 at 9:38 AM Dilip Kumar  wrote:
>
> >
> > You are right we need this in ReorderBufferProcessPartialChange() as
> > well.  I will fix this in the next version.
>
> Fixed this in the attached patch.
>

I focused my attention on the SnapBuildXactNeedsSkip() usages, and I see
they are using different end points of the WAL record:
1 decode.c        logicalmsg_decode               594  SnapBuildXactNeedsSkip(builder, buf->origptr)))
2 decode.c        DecodeTXNNeedSkip              1250  return (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
3 reorderbuffer.c AssertTXNLsnOrder               897  if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr))
4 reorderbuffer.c ReorderBufferCanStartStreaming 3922  !SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
5 snapbuild.c     SnapBuildXactNeedsSkip          429  SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)

The first two use the origin pointer while the third and fourth use the
end pointer (the fifth is the function definition itself). You have
fixed the fourth one. Do we need to fix the third one as well?

Probably we need to create two wrappers (macros) around
SnapBuildXactNeedsSkip(), one which accepts a XLogRecordBuffer and
other which accepts XLogReaderState. Then use those. That way at least
we have logic unified as to which XLogRecPtr to use.
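A minimal sketch of that wrapper idea, with invented stub structures and wrapper names (only the shape of the core check mirrors the real function):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy stand-ins for the real decoding structures: just the fields
 * the wrappers would touch.
 */
typedef uint64_t XLogRecPtr;
typedef struct XLogRecordBuffer { XLogRecPtr origptr; } XLogRecordBuffer;
typedef struct XLogReaderState { XLogRecPtr EndRecPtr; } XLogReaderState;
typedef struct SnapBuild { XLogRecPtr start_decoding_at; } SnapBuild;

/* Simplified core check: skip anything before the decoding start point. */
static int
SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
{
    return ptr < builder->start_decoding_at;
}

/* One wrapper per caller type pins down which LSN gets passed. */
#define SnapBuildXactNeedsSkipBuf(builder, buf) \
    SnapBuildXactNeedsSkip((builder), (buf)->origptr)
#define SnapBuildXactNeedsSkipReader(builder, reader) \
    SnapBuildXactNeedsSkip((builder), (reader)->EndRecPtr)
```

With these in place each call site picks the wrapper matching the object it already has, so the origin-vs-end decision lives in one spot.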

--
Best Wishes,
Ashutosh Bapat




Re: Fix gin index cost estimation

2022-12-02 Thread Alexander Korotkov
Hi, Ronan!

On Fri, Dec 2, 2022 at 1:19 PM Ronan Dunklau  wrote:
> Sorry for the delay, but here is an updated patch which changes the costing in
> the following way:
>
> - add a descent cost similar to the btree one is charged for the initial
> entry-tree
> - additionally, a charge is applied per page in both the entry tree and
> posting trees / lists
> - instead of charging the quals to each tuple, charge them per entry only. We
> still charge cpu_index_tuple_cost per tuple though.
>
> With those changes, no need to tweak the magic number formula estimating the
> number of pages. Maybe we can come up with something better for estimating
> those later on ?

Thank you for your patch.  Couple of quick questions.
1) What does the magic number 50.0 stand for?  I think we at least should make
it a macro.
2) "We only charge one data page for the startup cost" – should this
be dependent on number of search entries?

--
Regards,
Alexander Korotkov




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-02 Thread Amit Kapila
On Fri, Dec 2, 2022 at 4:57 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> handle_streamed_transaction()
>
> ```
> +   if (apply_action == TRANS_LEADER_SEND_TO_PARALLEL)
> +   pa_send_data(winfo, s->len, s->data);
> +   else
> +   stream_write_change(action, &original_msg);
> ```
>
> Comments are needed here; 0001 had that but it was removed in 0002.
> There are some similar lines.
>

I have suggested removing them because they were just saying what is
evident from the code and didn't seem to add any value. I would
say they were rather confusing.

-- 
With Regards,
Amit Kapila.




Re: pg_upgrade: Make testing different transfer modes easier

2022-12-02 Thread Daniel Gustafsson
> On 1 Dec 2022, at 16:18, Peter Eisentraut  
> wrote:
> 
> I wanted to test the different pg_upgrade transfer modes (--link, --clone), 
> but that was not that easy, because there is more than one place in the test 
> script you have to find and manually change.  So I wrote a little patch to 
> make that easier.  It's still manual, but it's a start.  (In principle, we 
> could automatically run the tests with each supported mode in a loop, but 
> that would be very slow.)

Wouldn't it be possible, and require less manual code changing, to accept this
via an extension to PROVE_FLAGS?  Any options after :: to prove are passed to
the test(s) [0], so we could perhaps inspect @ARGV for the mode if we invent a
new way to pass arguments.  Something along the lines of the untested sketch
below in the pg_upgrade test:

+# Optionally set the file transfer mode for the tests via arguments to PROVE
+my ($mode) = @ARGV;
+$mode = '--copy' unless defined $mode;

.. together with an extension to Makefile.global.in ..

-   $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl)
+   $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl) $(PROVE_TEST_ARGS)

.. should *I think* allow for passing the mode to the tests via:

make -C src/bin/pg_upgrade check PROVE_TEST_ARGS=":: --link"

The '::' part should of course ideally be injected automatically but the above
is mostly thinking out loud pseudocode so I didn't add that.

This could probably benefit other tests as well, to make it eas{y|ier} to run
extended testing on certain buildfarm animals or in the CFBot CI on specific
patches in the commitfest.

> While doing that, I also found it strange that the default transfer mode 
> (referred to as "copy" internally) did not have any external representation, 
> so it is awkward to refer to it in text, and obscure to see where it is used 
> for example in those test scripts.  So I added an option --copy, which 
> effectively does nothing, but it's not uncommon to have options that select 
> default behaviors explicitly.  (I also thought about something like a "mode" 
> option with an argument, but given that we already have --link and --clone, 
> this seemed the most sensible.)

Agreed, +1 on adding --copy regardless of the above.

--
Daniel Gustafsson   https://vmware.com/

[0] https://perldoc.perl.org/prove#Arguments-to-Tests



Re: Optimizing Node Files Support

2022-12-02 Thread John Naylor
On Thu, Dec 1, 2022 at 8:02 PM Ranier Vilela  wrote:
>
> Hi,
>
> I believe that has room for improving generation node files.
>
> The patch attached reduced the size of generated files by 27 kbytes.
> From 891 kbytes to 864 kbytes.
>
> About the patch:
> 1. Avoid useless attribution when from->field is NULL, once that
> the new node is palloc0.
>
> 2. Avoid useless declaration variable Size, when it is unnecessary.

Not useless -- it prevents a multiple evaluation hazard, which this patch
introduces.

> 3. Optimize comparison functions like memcmp and strcmp, using
>  a short-cut comparison of the first element.

Not sure if the juice is worth the squeeze. Profiling would tell.
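For reference, the short-cut pattern under discussion has this general shape (a sketch, not the patch's actual code):

```c
#include <assert.h>
#include <string.h>

/*
 * Compare the first byte/element inline before paying for the
 * library call; differing first elements settle it immediately.
 */
static int
str_eq_shortcut(const char *a, const char *b)
{
    return a[0] == b[0] && strcmp(a, b) == 0;
}

static int
mem_eq_shortcut(const unsigned char *a, const unsigned char *b, size_t len)
{
    return len == 0 || (a[0] == b[0] && memcmp(a, b, len) == 0);
}
```

Whether the saved call outweighs the extra branch on typical node data is exactly the kind of question profiling would answer.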

> 4. Switch several copy attributions like COPY_SCALAR_FIELD or
COPY_LOCATION_FIELD
> by one memcpy call.

My first thought is, it would cause code churn.

> 5. Avoid useless attribution, ignoring the result of pg_strtok when it is
unnecessary.

Looks worse.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Report roles in pg_upgrade pg_ prefix check

2022-12-02 Thread Daniel Gustafsson
> On 29 Nov 2022, at 00:24, Michael Paquier  wrote:
> 
> On Mon, Nov 28, 2022 at 09:58:46AM +0100, Daniel Gustafsson wrote:
>> We are a bit inconsistent in how much details we include in the report
>> textfiles, so could do that without breaking any consistency in reporting.
>> Looking at other checks, the below format would match what we already do 
>> fairly
>> well:
>> 
>>  (oid=)
>> 
>> Done in the attached.
> 
> WFM.  Thanks!

Took another look at it, and applied it. Thanks!

--
Daniel Gustafsson   https://vmware.com/





Re: Using AF_UNIX sockets always for tests on Windows

2022-12-02 Thread Andrew Dunstan

On 2022-12-01 Th 21:10, Andres Freund wrote:
> Hi,
>
> On 2022-12-01 20:56:18 -0500, Tom Lane wrote:
>> Andres Freund  writes:
>>> On 2022-12-01 20:30:36 -0500, Tom Lane wrote:
 If we remove that, won't we have a whole lot of code that's not
 tested at all on any platform, ie all the TCP-socket code?
>>> There's some coverage via the auth and ssl tests. But I agree it's an
>>> issue. But to me the fix for that seems to be to add a dedicated test for
>>> that, rather than relying on windows to test our socket code - that's quite 
>>> a
>>> few separate code paths from the tcp support of other platforms.
>> IMO that's not the best way forward, because you'll always have
>> nagging questions about whether a single-purpose test covers
>> everything that needs coverage.
> Still seems better than not having any coverage in our development
> environments...
>
>
>> I think the best place to be in would be to be able to run the whole test
>> suite using either TCP or UNIX sockets, on any platform (with stuff like the
>> SSL test overriding the choice as needed).
> I agree that that's useful. But it seems somewhat independent from the
> majority of the proposed changes. To be able to test force-tcp-everywhere we
> don't need e.g.  code for setting sspi auth in pg_regress etc - it's afaik
> just needed so there's a secure way of running tests at all on windows.
>
> I think 0003 should be "trimmed" to only change the default for
> $use_unix_sockets on windows and to remove PG_TEST_USE_UNIX_SOCKETS. Whoever
> wants to, can then add a new environment variable to force tap tests to use
> tcp.
>

Not sure if it's useful here, but a few months ago I prepared patches to
remove the config-auth option of pg_regress, which struck me as more
than odd, and replace it with a perl module. I didn't get around to
finishing them, but the patches as of then are attached.

I agree that having some switch that says "run everything with TCP" or
"run (almost) everything with Unix sockets" would be good.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
From 5c7fad30aa2db2d47d0cc3869f132ca9d14a8814 Mon Sep 17 00:00:00 2001
From: Andrew Dunstan 
Date: Tue, 26 Apr 2022 10:49:26 -0400
Subject: [PATCH 1/2] Do config_auth in perl code for TAP tests and
 vcregress.pl.

That means there is no longer any need to call pg_regress --config-auth
to set up Windows authentication. Instead a simple perl function call
does the work.
---
 contrib/basebackup_to_shell/t/001_basic.pl   |   2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |   2 +-
 src/bin/pg_ctl/t/001_start_stop.pl   |   4 +-
 src/bin/pg_dump/t/010_dump_connstr.pl|  16 ++-
 src/bin/pg_rewind/t/RewindTest.pm|   2 +-
 src/test/perl/PostgreSQL/Test/Cluster.pm |   7 +-
 src/test/perl/PostgreSQL/Test/ConfigAuth.pm  | 115 +++
 src/test/recovery/t/001_stream_rep.pl|   2 +-
 src/tools/msvc/vcregress.pl  |   8 +-
 9 files changed, 135 insertions(+), 23 deletions(-)
 create mode 100644 src/test/perl/PostgreSQL/Test/ConfigAuth.pm

diff --git a/contrib/basebackup_to_shell/t/001_basic.pl b/contrib/basebackup_to_shell/t/001_basic.pl
index 350d42079a..b5de23b48f 100644
--- a/contrib/basebackup_to_shell/t/001_basic.pl
+++ b/contrib/basebackup_to_shell/t/001_basic.pl
@@ -21,7 +21,7 @@ my $node = PostgreSQL::Test::Cluster->new('primary');
 # Make sure pg_hba.conf is set up to allow connections from backupuser.
 # This is only needed on Windows machines that don't use UNIX sockets.
 $node->init('allows_streaming' => 1,
-			'auth_extra' => [ '--create-role', 'backupuser' ]);
+			'auth_extra' => [ extra_roles => 'backupuser' ]);
 
 $node->append_conf('postgresql.conf',
    "shared_preload_libraries = 'basebackup_to_shell'");
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 056fcf3597..a0f4fd7b92 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -29,7 +29,7 @@ umask(0077);
 
 # Initialize node without replication settings
 $node->init(extra => ['--data-checksums'],
-			auth_extra => [ '--create-role', 'backupuser' ]);
+			auth_extra => [ extra_roles => 'backupuser' ]);
 $node->start;
 my $pgdata = $node->data_dir;
 
diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index fdffd76d99..c92b06ad85 100644
--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -4,6 +4,7 @@
 use strict;
 use warnings;
 
+use PostgreSQL::Test::ConfigAuth;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -20,8 +21,7 @@ command_exit_is([ 'pg_ctl', 'start', '-D', "$tempdir/nonexistent" ],
 
 command_ok([ 'pg_ctl', 'initdb', '-D', "$tempdir/data", '-o', '-N' ],
 	'pg_ctl initdb');
-command_ok([ $ENV{PG_REGRESS}, '--config-auth', "$tempdir/data" ],
-	'configure authentication');
+config_auth("$tempdir/data");
 my $node_

Re: Missing MaterialPath support in reparameterize_path_by_child

2022-12-02 Thread Richard Guo
On Fri, Dec 2, 2022 at 7:21 PM Ashutosh Bapat 
wrote:

> > I'm suspicious now that reparameterize_path() should be
> > extended likewise, but I don't really have any hard
> > evidence for that.
>
> I think we need it there since the scope of paths under appendrel has
> certainly expanded a lot because of partitioned table optimizations.


I tried to see if a similar error can be triggered because of the lack
of MaterialPath support in reparameterize_path but didn't succeed.
Instead I see the optimization opportunity here if we can extend
reparameterize_path.  As an example, consider query

create table t (a int, b int);
insert into t select i, i from generate_series(1,1)i;
create index on t(a);
analyze t;

explain (costs off)
select * from (select * from t t1 union all select * from t t2 TABLESAMPLE
system_time (10)) s join (select * from t t3 limit 1) ss on s.a > ss.a;

Currently parameterized append path is not possible because MaterialPath
is not supported in reparameterize_path.  The current plan looks like

 QUERY PLAN

 Nested Loop
   Join Filter: (t1.a > t3.a)
   ->  Limit
 ->  Seq Scan on t t3
   ->  Append
 ->  Seq Scan on t t1
 ->  Materialize
   ->  Sample Scan on t t2
 Sampling: system_time ('10'::double precision)
(9 rows)

If we extend reparameterize_path to support MaterialPath, we would have
the additional parameterized append path and generate a better plan as
below

 QUERY PLAN

 Nested Loop
   ->  Limit
 ->  Seq Scan on t t3
   ->  Append
 ->  Index Scan using t_a_idx on t t1
   Index Cond: (a > t3.a)
 ->  Materialize
   ->  Sample Scan on t t2
 Sampling: system_time ('10'::double precision)
 Filter: (a > t3.a)
(10 rows)

So I also agree it's worth doing.

BTW, the code changes I'm using:

--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3979,6 +3979,17 @@ reparameterize_path(PlannerInfo *root, Path *path,
   apath->path.parallel_aware,
   -1);
}
+   case T_Material:
+   {
+   MaterialPath *matpath = (MaterialPath *) path;
+   Path *spath = matpath->subpath;
+
+   spath = reparameterize_path(root, spath,
+   required_outer,
+   loop_count);
+
+   return (Path *) create_material_path(rel, spath);
+   }

Thanks
Richard


Re: Fix gin index cost estimation

2022-12-02 Thread Ronan Dunklau
Le vendredi 2 décembre 2022, 12:33:33 CET Alexander Korotkov a écrit :
> Hi, Ronan!
> Thank you for your patch.  Couple of quick questions.
> 1) What does the magic number 50.0 stand for?  I think we at least should make
> it a macro.

This is what is used in other tree-descending estimation functions, so I used
that too. Maybe a DEFAULT_PAGE_CPU_COST macro would work for both? If so I'll
separate this into two patches, one introducing the macro for the other
estimation functions, and this patch for gin.

> 2) "We only charge one data page for the startup cost" – should this
> be dependent on number of search entries?

Good point, multiplying it by the number of search entries would do the trick. 

Thank you for looking at this !

Regards,

--
Ronan Dunklau







Re: Removing another gen_node_support.pl special case

2022-12-02 Thread Peter Eisentraut

On 29.11.22 22:34, Tom Lane wrote:

I wrote:

I notice that EquivalenceClass is already marked as no_copy_equal,
which means that gen_node_support.pl can know that emitting a
recursive node-copy or node-compare request is a bad idea.  What
do you think of using the patch as it stands, plus a cross-check
that we don't emit COPY_NODE_FIELD or COMPARE_NODE_FIELD if the
target node type is no_copy or no_equal?


Concretely, it seems like something like the attached could be
useful, independently of the other change.


Yes, right now you can easily declare things that don't make sense. 
Cross-checks like these look useful.






Re: Missing MaterialPath support in reparameterize_path_by_child

2022-12-02 Thread Richard Guo
On Fri, Dec 2, 2022 at 8:49 PM Richard Guo  wrote:

> BTW, the code changes I'm using:
>
> --- a/src/backend/optimizer/util/pathnode.c
> +++ b/src/backend/optimizer/util/pathnode.c
> @@ -3979,6 +3979,17 @@ reparameterize_path(PlannerInfo *root, Path *path,
>apath->path.parallel_aware,
>-1);
> }
> +   case T_Material:
> +   {
> +   MaterialPath *matpath = (MaterialPath *) path;
> +   Path *spath = matpath->subpath;
> +
> +   spath = reparameterize_path(root, spath,
> +   required_outer,
> +   loop_count);
> +
> +   return (Path *) create_material_path(rel, spath);
> +   }
>

BTW, the subpath needs to be checked for NULL after being
reparameterized, since it might be of a path type that is not supported
yet.

--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3979,6 +3979,19 @@ reparameterize_path(PlannerInfo *root, Path *path,
   apath->path.parallel_aware,
   -1);
}
+   case T_Material:
+   {
+   MaterialPath *matpath = (MaterialPath *) path;
+   Path *spath = matpath->subpath;
+
+   spath = reparameterize_path(root, spath,
+   required_outer,
+   loop_count);
+   if (spath == NULL)
+   return NULL;
+
+   return (Path *) create_material_path(rel, spath);
+   }

Thanks
Richard


Re: Error-safe user functions

2022-12-02 Thread Robert Haas
On Thu, Dec 1, 2022 at 5:33 PM Tom Lane  wrote:
> Robert Haas  writes:
> > It sounds to me like we're crafting something that is specific to and
> > can only be used with type input and output functions, so the name
> > probably should reflect that rather than being something totally
> > generic like ereturn() or error_stash() or whatever.
>
> My opinion is exactly the opposite.  Don't we already have a need
> for error-safe type conversions, too, in the JSON stuff?  Even if
> I couldn't point to a need-it-now requirement, I think we will
> eventually find a use for this with some other classes of functions.



But you yourself proposed a new node called IOCallContext. It can't be
right to have the names be specific to I/O functions in one part of
the patch and totally generic in another part.

Hmm, but yesterday I saw that you were calling it FuncCallContext.

I think the design is evolving in your head as you think about this
more, which is totally understandable and actually very good. However,
this is also why I think that you should produce the patch you
actually want instead of letting other people repeatedly submit
patches and then complain that they weren't what you had in mind.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Optimize common expressions in projection evaluation

2022-12-02 Thread David G. Johnston
On Fri, Dec 2, 2022 at 12:52 AM Peifeng Qiu  wrote:

> Hi hackers.
>
> When a star(*) expands into multiple fields, our current
> implementation is to generate multiple copies of the expression
> and do FieldSelects. This is very inefficient because the same
> expression get evaluated multiple times but actually we only need
> to do it once.


And so we implemented the SQL Standard LATERAL and all was well.

Given both how long we went without LATERAL or anything like this, and
how long LATERAL has now been around without this really coming up, the
need for this code seems not that great.  But as to the code itself I'm
unable to properly judge.

David J.


Re: Optimizing Node Files Support

2022-12-02 Thread Ranier Vilela
Hi, thanks for reviewing this.

Em sex., 2 de dez. de 2022 às 09:24, John Naylor <
john.nay...@enterprisedb.com> escreveu:

>
> On Thu, Dec 1, 2022 at 8:02 PM Ranier Vilela  wrote:
> >
> > Hi,
> >
> > I believe that has room for improving generation node files.
> >
> > The patch attached reduced the size of generated files by 27 kbytes.
> > From 891 kbytes to 864 kbytes.
> >
> > About the patch:
> > 1. Avoid useless attribution when from->field is NULL, once that
> > the new node is palloc0.
> >
> > 2. Avoid useless declaration variable Size, when it is unnecessary.
>
> Not useless -- it prevents a multiple evaluation hazard, which this patch
> introduces.
>
I doubt that the patch introduces any hazard here.
But I think that casting size_t (typedef Size) to size_t is worse and is
unnecessary.
Adjusted in the v1 patch.


> > 3. Optimize comparison functions like memcmp and strcmp, using
> >  a short-cut comparison of the first element.
>
> Not sure if the juice is worth the squeeze. Profiling would tell.
>
This is a cheap test and IMO it can be a real optimization, avoiding a function
call.


> > 4. Switch several copy attributions like COPY_SCALAR_FIELD or
> COPY_LOCATION_FIELD
> > by one memcpy call.
>
> My first thought is, it would cause code churn.
>
That's a weak argument.
It removes 27 kB from the generated source code, which is really worth it.


> > 5. Avoid useless attribution, ignoring the result of pg_strtok when it
> is unnecessary.
>
> Looks worse.
>
Better to inform the compiler that we really don't need the result.

regards,
Ranier Vilela


v1-optimize_gen_nodes_support.patch
Description: Binary data


Re: initdb: Refactor PG_CMD_PUTS loops

2022-12-02 Thread Andrew Dunstan


On 2022-12-01 Th 05:02, Peter Eisentraut wrote:
> Keeping the SQL commands that initdb runs in string arrays before
> feeding them to PG_CMD_PUTS() seems unnecessarily verbose and
> inflexible.  In some cases, the array only has one member.  In other
> cases, one might want to use PG_CMD_PRINTF() instead, to parametrize a
> command, but that would require breaking up the loop or using
> workarounds like replace_token().  This patch unwinds all that; it's
> much simpler that way.


Looks reasonable. (Most of this dates back to 2003/2004, the very early
days of initdb.c.)


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Add support for DEFAULT specification in COPY FROM

2022-12-02 Thread Israel Barth Rubio
Hello all,

I'm submitting a new version of the patch. Instead of changing signature
of several functions in order to use the defaults parameter, it is now
storing
that in the cstate structure, which is already passed to all functions that
were previously modified.

Best regards,
Israel.

Em sex., 7 de out. de 2022 às 17:54, Israel Barth Rubio <
barthisr...@gmail.com> escreveu:

> Hello Zhihong,
>
> > For the last question, please take a look at:
> >
> > #define MemSetAligned(start, val, len) \
> >
> > which is called by palloc0().
>
> Oh, I totally missed that. Thanks for the heads up!
>
> I'm attaching the new patch version, which contains both the fix
> to the problem reported by Andres, and removes this useless
> MemSet call.
>
> Best regards,
> Israel.
>


v6-0001-Added-support-for-DEFAULT-in-COPY-FROM.patch
Description: Binary data


Re: Error-safe user functions

2022-12-02 Thread Tom Lane
Robert Haas  writes:
> I think the design is evolving in your head as you think about this
> more, which is totally understandable and actually very good. However,
> this is also why I think that you should produce the patch you
> actually want instead of letting other people repeatedly submit
> patches and then complain that they weren't what you had in mind.

OK, Corey hasn't said anything, so I will have a look at this over
the weekend.

regards, tom lane




Re: pg_dump: Remove "blob" terminology

2022-12-02 Thread Tom Lane
Peter Eisentraut  writes:
> On 30.11.22 09:07, Daniel Gustafsson wrote:
>> Should BLOB be changed to BLOBS here (and in similar comments) to make it
>> clearer that it refers to the archive entry and the concept of a binary large
>> object in general?

> I changed this one and went through it again to tidy it up a bit more. 
> (There are both "BLOB" and "BLOBS" archive entries, so both forms still 
> exist in the code now.)

I've not read this patch and don't have time right this moment, but
I wanted to make a note about something to have in the back of your
head: we really need to do something about situations with $BIGNUM
large objects.  Currently those tend to run pg_dump or pg_restore
out of memory because of TOC bloat, and we've seen multiple field
reports of that actually happening.

The scheme I've vaguely thought about, but not got round to writing,
is to merge all blobs with the same owner and ACL into one TOC entry.
One would hope that would get it down to a reasonable number of
TOC entries in practical applications.  (Perhaps there'd need to be
a switch to make this optional.)

I'm not asking you to make that happen as part of this patch, but
please don't refactor things in a way that will make it harder.

regards, tom lane




Is this an oversight in reparameterizing Memoize path?

2022-12-02 Thread Richard Guo
When reviewing another patch I noticed there might be an oversight for
MemoizePath in reparameterize_path.  In reparameterize_path we are
supposed to increase the path's parameterization to required_outer.
However, AFAICS for MemoizePath we just re-create the same path thus its
parameterization does not get increased.

I'm not sure if this has consequences in practice.  Just from reading
the codes, it seems this may cause assertion failure after the call of
reparameterize_path.

Assert(bms_equal(PATH_REQ_OUTER(path), required_outer));

Thanks
Richard


Re: Error-safe user functions

2022-12-02 Thread Andrew Dunstan


On 2022-12-02 Fr 09:12, Tom Lane wrote:
> Robert Haas  writes:
>> I think the design is evolving in your head as you think about this
>> more, which is totally understandable and actually very good. However,
>> this is also why I think that you should produce the patch you
>> actually want instead of letting other people repeatedly submit
>> patches and then complain that they weren't what you had in mind.
> OK, Corey hasn't said anything, so I will have a look at this over
> the weekend.
>
>   


Great. Let's hope we can get this settled early next week and then we
can get to work on the next tranche of functions, those that will let
the SQL/JSON work restart.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: fixing CREATEROLE

2022-12-02 Thread Robert Haas
On Mon, Nov 28, 2022 at 8:33 PM Robert Haas  wrote:
> Hmm, that's an interesting alternative to what I actually implemented.
> Some people might like it better, because it puts the behavior fully
> under the control of the CREATEROLE user, which a number of you seem
> to favor.

Here's an updated patch set.

0001 adds more precise and extensive documentation for the current
(broken) state of affairs. I propose to back-patch this to all
supported branches. It also removes a tip suggesting that you should
use a CREATEDB & CREATEROLE role instead of a superuser, because that
is pretty pointless as things stand, and is too simplistic for the new
system that I'm proposing to put in place, too.

0002 and 0003 are refactoring, unchanged from v1.

0004 is the core fix to CREATEROLE. It has been updated from the
previous version with documentation and some bug fixes.

0005 adopts David's suggestion: instead of giving the superuser a way
to control the options on the implicit grant, give CREATEROLE users a
way to grant newly-created roles to themselves automatically. I made
this a GUC, which means that the person setting up the system could
configure a default in postgresql.conf, but a user who doesn't prefer
that default can also override it using ALTER ROLE .. SET or ~/.psqlrc
or whatever. This is simpler than what I had before, doesn't involve a
catalog change, makes it clear that the behavior is not
security-critical, and puts the decision fully in the hands of the
CREATEROLE user rather than being partly controlled by that user and
partly by the superuser. Hopefully that's an improvement.

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-Improve-documentation-of-the-CREATEROLE-attibute.patch
Description: Binary data


v2-0002-Refactor-permissions-checking-for-role-grants.patch
Description: Binary data


v2-0003-Pass-down-current-user-ID-to-AddRoleMems-and-DelR.patch
Description: Binary data


v2-0004-Restrict-the-privileges-of-CREATEROLE-users.patch
Description: Binary data


v2-0005-Add-new-GUC-createrole_self_grant.patch
Description: Binary data


Re: pg_dump: Remove "blob" terminology

2022-12-02 Thread Daniel Gustafsson
> On 2 Dec 2022, at 15:18, Tom Lane  wrote:

> I'm not asking you to make that happen as part of this patch, but
> please don't refactor things in a way that will make it harder.

I have that on my TODO as well since 
7da8823d83a2b66bdd917aa6cb2c5c2619d86011.ca...@credativ.de,
and having read this patch I don't think it moves the needle in any
way which complicates that.

--
Daniel Gustafsson   https://vmware.com/





Re: [PATCH] Allow specification of custom slot for custom nodes

2022-12-02 Thread Alexander Korotkov
On Sat, Nov 26, 2022 at 2:05 PM Ian Lawrence Barwick  wrote:
> On Tue, Nov 22, 2022 at 5:50 AM Alexander Korotkov  wrote:
> >
> > On Mon, Nov 21, 2022 at 4:34 PM Pavel Borisov  
> > wrote:
> > > The following review has been posted through the commitfest application:
> > > make installcheck-world:  tested, passed
> > > Implements feature:   tested, passed
> > > Spec compliant:   not tested
> > > Documentation:not tested
> > >
> > > I've looked at this patch and don't see any problems with it. It is 
> > > minimally invasive, it doesn't affect functionality unless anyone (e.g. 
> > > extension) sets its own slotOps in CustomScanState.
> > > Furthermore, the current patch very slightly modifies patch 0b03e5951bf0 
> > > with the intention of introducing extensibility. So I think adding more 
> > > extensibility regarding different tuple formats is an excellent thing to 
> > > do.
> > >
> > > I'm going to mark it as RfC if there are no objections.
> >
> > Thank you for your feedback.  I also don't see how this patch could
> > affect anybody.
> > I'm going to push this if there are no objections.
>
> I see this was pushed (cee1209514) so have closed it in the CF app.

Yes, I forgot to do this.  Thank you.

--
Regards,
Alexander Korotkov




Re: Is this an oversight in reparameterizing Memoize path?

2022-12-02 Thread Tom Lane
Richard Guo  writes:
> When reviewing another patch I noticed there might be an oversight for
> MemoizePath in reparameterize_path.  In reparameterize_path we are
> supposed to increase the path's parameterization to required_outer.
> However, AFAICS for MemoizePath we just re-create the same path thus its
> parameterization does not get increased.

Yeah, that sure looks wrong.  At minimum we should be recursively
fixing the subpath.  (It looks like doing that and re-calling
create_memoize_path might be sufficient.)
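
For reference, the recursive fix being described might look roughly like the
following in reparameterize_path()'s switch. This is an untested sketch, not a
patch; it assumes create_memoize_path()'s current argument list:

```c
/* Hypothetical MemoizePath case for reparameterize_path(); untested sketch. */
case T_MemoizePath:
    {
        MemoizePath *mpath = (MemoizePath *) path;
        Path        *spath = mpath->subpath;

        /* Recursively increase the subpath's parameterization first. */
        spath = reparameterize_path(root, spath,
                                    required_outer, loop_count);
        if (spath == NULL)
            return NULL;
        return (Path *) create_memoize_path(root, rel, spath,
                                            mpath->param_exprs,
                                            mpath->hash_operators,
                                            mpath->singlerow,
                                            mpath->binary_mode,
                                            mpath->calls);
    }
```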

According to [1] our code coverage for reparameterize_path is just
awful.  MemoizePath in reparameterize_pathlist_by_child isn't
tested either ...

regards, tom lane

[1] 
https://coverage.postgresql.org/src/backend/optimizer/util/pathnode.c.gcov.html




Re: [PATCH] Check snapshot argument of index_beginscan and family

2022-12-02 Thread Alexander Korotkov
Hi!

On Mon, Nov 28, 2022 at 1:30 PM Aleksander Alekseev
 wrote:
> Thanks for the feedback!
>
> > I think it's a nice catch and worth fixing. The one thing I don't
> > agree with is using asserts for handling the error that can appear
> > because most probably the server is built with assertions off and in
> > this case, there still will be a crash in this case. I'd do this with
> > report ERROR. Otherwise, the patch looks right and worth committing.
>
> Normally a developer is not supposed to pass NULLs there so I figured
> having extra if's in the release builds doesn't worth it. Personally I
> wouldn't mind using ereport() but my intuition tells me that the
> committers are not going to approve this :)
>
> Let's see what the rest of the community thinks.

I think this is a harmless assertion patch.  I'm going to push this if
no objections.

--
Regards,
Alexander Korotkov




Re: Missing MaterialPath support in reparameterize_path_by_child

2022-12-02 Thread Tom Lane
Ashutosh Bapat  writes:
> On Fri, Dec 2, 2022 at 8:25 AM Tom Lane  wrote:
>> Unfortunately, I don't have an example that produces such a
>> failure against HEAD.  It seems certain to me that such cases
>> exist, though, so I'd like to apply and back-patch the attached.

> From this comment, that I wrote back when I implemented that function,
> I wonder if we thought MaterialPath wouldn't appear on the inner side
> of nestloop join. But that can't be the case. Or probably we didn't
> find MaterialPath being there from our tests.
>  * This function is currently only applied to the inner side of a nestloop
>  * join that is being partitioned by the partitionwise-join code.  Hence,
>  * we need only support path types that plausibly arise in that context.
> But I think it's good to have MaterialPath there.

So thinking about this a bit: the reason it is okay if reparameterize_path
fails is that it's not fatal.  We just go on our way without making
a parameterized path for that appendrel.  However, if
reparameterize_path_by_child fails for every available child path,
we end up with "could not devise a query plan", because the
partitionwise-join code is brittle and won't tolerate failure
to build a parent-join path.  Seems like we should be willing to
fall back to a non-partitionwise join in that case.

regards, tom lane




Re: Add tracking of backend memory allocated to pg_stat_activity

2022-12-02 Thread Reid Thompson
On Mon, 2022-11-28 at 10:59 -0800, Andres Freund wrote:
> On 2022-11-26 22:10:06 -0500, Reid Thompson wrote:
> >    - zero allocated bytes after fork to avoid double counting
> > postmaster allocations
> 
> I still don't understand this - postmaster shouldn't be counted at
> all. It
> doesn't have a PgBackendStatus. There simply shouldn't be any tracked
> allocations from it.
> 
> Greetings,
> 
> Andres Freund

Hi Andres,
I based this on the following.

It appears to me that Postmaster populates the local variable that
*my_allocated_bytes points to. That allocation is passed to forked
children, and if not zeroed out, will be double counted as part of
the child allocation. Is this invalid?

$ ps -ef|grep postgres
postgres6389   1  0 Dec01 ?00:00:17 /usr/sbin/pgbouncer -d 
/etc/pgbouncer/pgbouncer.ini
rthompso 2937799   1  0 09:45 ?00:00:00 
/tmp/postgres/install/pg-stats-memory/bin/postgres -D /var/tmp/pg-stats-memory 
-p 5433
rthompso 2937812 2937799  0 09:45 ?00:00:00 postgres: checkpointer 
rthompso 2937813 2937799  0 09:45 ?00:00:00 postgres: background writer 
rthompso 2937816 2937799  0 09:45 ?00:00:00 postgres: walwriter 
rthompso 2937817 2937799  0 09:45 ?00:00:00 postgres: autovacuum 
launcher 
rthompso 2937818 2937799  0 09:45 ?00:00:00 postgres: logical 
replication launcher 
rthompso 2938877 2636586  0 09:46 pts/400:00:00 
/usr/lib/postgresql/12/bin/psql -h localhost -p 5433 postgres
rthompso 2938909 2937799  0 09:46 ?00:00:00 postgres: rthompso postgres 
127.0.0.1(44532) idle
rthompso 2942164 1987403  0 09:49 pts/300:00:00 grep postgres

Bracketing fork_process() calls with logging to print *my_allocated_bytes 
immediately prior and after fork_process...
To me, this indicates that the forked children for this invocation
(checkpointer, walwriter, autovac launcher, client backend, autovac worker, etc)
are inheriting 240672 bytes from postmaster.  

$ ccat logfile 
2022-12-02 09:45:05.871 EST [2937799] LOG:  starting PostgreSQL 16devel on 
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 
64-bit
2022-12-02 09:45:05.872 EST [2937799] LOG:  listening on IPv4 address 
"127.0.0.1", port 5433
2022-12-02 09:45:05.874 EST [2937799] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5433"
parent StartChildProcess process. pid: 2937799 *my_allocated_bytes: 240672
parent StartChildProcess process. pid: 2937799 *my_allocated_bytes: 240672
parent StartChildProcess process. pid: 2937799 *my_allocated_bytes: 240672
child StartChildProcess process. pid: 2937812 *my_allocated_bytes: 240672
child StartChildProcess process. pid: 2937813 *my_allocated_bytes: 240672
child StartChildProcess process. pid: 2937814 *my_allocated_bytes: 240672
2022-12-02 09:45:05.884 EST [2937814] LOG:  database system was shut down at 
2022-12-02 09:41:13 EST
parent StartChildProcess process. pid: 2937799 *my_allocated_bytes: 240672
parent StartAutoVacLauncher process. pid: 2937799 *my_allocated_bytes: 240672
child StartChildProcess process. pid: 2937816 *my_allocated_bytes: 240672
parent do_start_bgworker process. pid: 2937799 *my_allocated_bytes: 240672
child StartAutoVacLauncher process. pid: 2937817 *my_allocated_bytes: 240672
2022-12-02 09:45:05.889 EST [2937799] LOG:  database system is ready to accept 
connections
child do_start_bgworker process. pid: 2937818 *my_allocated_bytes: 240672
parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672
child StartAutoVacWorker process. pid: 2938417 *my_allocated_bytes: 240672
parent BackendStartup process. pid: 2937799 *my_allocated_bytes: 240672
child BackendStartup process. pid: 2938909 *my_allocated_bytes: 240672
parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672
child StartAutoVacWorker process. pid: 2938910 *my_allocated_bytes: 240672
parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672
child StartAutoVacWorker process. pid: 2939340 *my_allocated_bytes: 240672
parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672
child StartAutoVacWorker process. pid: 2939665 *my_allocated_bytes: 240672
parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672
child StartAutoVacWorker process. pid: 2940038 *my_allocated_bytes: 240672
parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672
child StartAutoVacWorker process. pid: 2940364 *my_allocated_bytes: 240672
parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672
child StartAutoVacWorker process. pid: 2940698 *my_allocated_bytes: 240672
parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672
child StartAutoVacWorker process. pid: 2941317 *my_allocated_bytes: 240672
parent StartAutoVacWorker process. pid: 2937799 *my_allocated_bytes: 240672
child StartAutoVacWorker process. pid: 2941825 *my_allocated_bytes: 240672


Reid Thompson
Senior Software Engineer
Crunchy Data, Inc.


Re: Failed Assert in pgstat_assoc_relation

2022-12-02 Thread Andres Freund
Hi,

On 2022-12-02 01:03:35 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On December 1, 2022 9:48:48 PM PST, Tom Lane  wrote:
> >> I'd suggest putting in an assertion that the relkind isn't changing,
> >> instead.
> 
> > Sounds like a plan. Will you do that when you remove the table-to-view 
> > hack? 
> 
> I'd suggest committing it concurrently with the v15 fix, instead,
> so that there's a cross-reference to what some future hacker might
> need to install if they remove the assertion.

Good idea.


> I guess that means that the table-to-view removal has to go in
> first.  I should be able to take care of that tomorrow, or if
> you're in a hurry I don't mind if you commit it for me.

No particular hurry from my end.

Greetings,

Andres Freund




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-12-02 Thread Andres Freund
Hi,

On 2022-12-02 14:22:55 +0900, Michael Paquier wrote:
> On Fri, Nov 04, 2022 at 09:52:39AM +0900, Ian Lawrence Barwick wrote:
> > CommitFest 2022-11 is currently underway, so if you are interested
> > in moving this patch forward, now would be a good time to update it.
> 
> No replies after 4 weeks, so I have marked this entry as returned
> with feedback.  I am still wondering what would be the best thing to
> do here..

IMO this is a bugfix; I don't think we can just close the entry, even if Matthias
doesn't have time / energy to push it forward.


I think the big issue with the patch as it stands is that it will typically
cause PANICs on failure, because the record-too-large ERROR would be raised in a critical
section. That's still better than generating a record that can't be replayed,
but it's not good.

There's not all that many places with potentially huge records. I wonder if we
ought to modify at least the most prominent ones to prepare the record before
the critical section. I think the by far most prominent real-world case is
RecordTransactionCommit(). I think we could rename XactLogCommitRecord() to
XactBuildCommitRecord(), have it build the commit record, then have the caller do
START_CRIT_SECTION(), set DELAY_CHKPT_START, and only then do the
XLogInsert().

That'd even have the nice side-effect of reducing the window in which
DELAY_CHKPT_START is set a bit.

Greetings,

Andres Freund




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-12-02 Thread Andres Freund
Hi,

On 2022-07-26 18:58:02 +0200, Matthias van de Meent wrote:
> - updated the MaxXLogRecordSize and XLogRecordLengthIsValid(len)
> macros (now in xlogrecord.h), with a max length of the somewhat
> arbitrary 1020MiB.
>  This leaves room for approx. 4MiB of per-record allocation overhead
> before you'd hit MaxAllocSize, and also detaches the dependency on
> memutils.h.
> 
> - Retained the check in XLogRegisterData, so that we check against
> integer overflows in the registerdata code instead of only an assert
> in XLogRecordAssemble where it might be too late.
> - Kept the inline static elog-ing function (as per Andres' suggestion
> on 2022-03-14; this decreases binary sizes)

I don't think it should be a static inline. It should to be a *non* inlined
function, so we don't include the code for the elog in the callers.


> +/*
> + * Error due to exceeding the maximum size of a WAL record, or registering
> + * more datas than are being accounted for by the XLog infrastructure.
> + */
> +static inline void
> +XLogErrorDataLimitExceeded()
> +{
> + elog(ERROR, "too much WAL data");
> +}

I think this should be pg_noinline, as mentioned above.


>  /*
>   * Begin constructing a WAL record. This must be called before the
>   * XLogRegister* functions and XLogInsert().
> @@ -348,14 +359,29 @@ XLogRegisterBlock(uint8 block_id, RelFileLocator 
> *rlocator, ForkNumber forknum,
>   * XLogRecGetData().
>   */
>  void
> -XLogRegisterData(char *data, int len)
> +XLogRegisterData(char *data, uint32 len)
>  {
>   XLogRecData *rdata;
>  
> - Assert(begininsert_called);
> + Assert(begininsert_called && XLogRecordLengthIsValid(len));
> +
> + /*
> +  * Check against max_rdatas; and ensure we don't fill a record with
> +  * more data than can be replayed. Records are allocated in one chunk
> +  * with some overhead, so ensure XLogRecordLengthIsValid() for that
> +  * size of record.
> +  *
> +  * Additionally, check that we don't accidentally overflow the
> +  * intermediate sum value on 32-bit systems by ensuring that the
> +  * sum of the two inputs is no less than one of the inputs.
> +  */
> + if (num_rdatas >= max_rdatas ||
> +#if SIZEOF_SIZE_T == 4
> +  mainrdata_len + len < len ||
> +#endif
> + !XLogRecordLengthIsValid((size_t) mainrdata_len + (size_t) len))
> + XLogErrorDataLimitExceeded();

This is quite a complicated check, and the SIZEOF_SIZE_T == 4 bit is fairly
ugly.

I think we should make mainrdata_len a uint64, then we don't have to worry
about it overflowing on 32bit systems. And TBH, we don't care about some minor
inefficiency on 32bit systems.



> @@ -399,8 +425,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
>   elog(ERROR, "no block with id %d registered with WAL insertion",
>block_id);
>  
> - if (num_rdatas >= max_rdatas)
> - elog(ERROR, "too much WAL data");
> + /*
> +  * Check against max_rdatas; and ensure we don't register more data per
> +  * buffer than can be handled by the physical data format; 
> +  * i.e. that regbuf->rdata_len does not grow beyond what
> +  * XLogRecordBlockHeader->data_length can hold.
> +  */
> + if (num_rdatas >= max_rdatas ||
> + regbuf->rdata_len + len > UINT16_MAX)
> + XLogErrorDataLimitExceeded();
> +
>   rdata = &rdatas[num_rdatas++];
>  
>   rdata->data = data;

This partially has been applied in ffd1b6bb6f8, I think we should consider
adding XLogErrorDataLimitExceeded() separately too.


>   rdt_datas_last->next = regbuf->rdata_head;
> @@ -858,6 +907,16 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>   for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
>   COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
>  
> + /*
> +  * Ensure that the XLogRecord is not too large.
> +  *
> +  * XLogReader machinery is only able to handle records up to a certain
> +  * size (ignoring machine resource limitations), so make sure we will
> +  * not emit records larger than those sizes we advertise we support.
> +  */
> + if (!XLogRecordLengthIsValid(total_len))
> + XLogErrorDataLimitExceeded();
> +
>   /*
>* Fill in the fields in the record header. Prev-link is filled in 
> later,
>* once we know where in the WAL the record will be inserted. The CRC 
> does

I think this needs to mention that it'll typically cause a PANIC.


Greetings,

Andres Freund




Re: Failed Assert in pgstat_assoc_relation

2022-12-02 Thread Tom Lane
I wrote:
> I guess that means that the table-to-view removal has to go in
> first.  I should be able to take care of that tomorrow, or if
> you're in a hurry I don't mind if you commit it for me.

Done now, feel free to deal with the pgstat problem.

regards, tom lane




Re: Add tracking of backend memory allocated to pg_stat_activity

2022-12-02 Thread Andres Freund
Hi,

On 2022-12-02 11:09:30 -0500, Reid Thompson wrote:
> It appears to me that Postmaster populates the local variable that
> *my_allocated_bytes points to. That allocation is passed to forked
> children, and if not zeroed out, will be double counted as part of
> the child allocation. Is this invalid?

I don't think we should count allocations made before backend_status.c has
been initialized.

Greetings,

Andres Freund




Re: pg_dump: Remove "blob" terminology

2022-12-02 Thread Andrew Dunstan


On 2022-12-02 Fr 09:18, Tom Lane wrote:
> we really need to do something about situations with $BIGNUM
> large objects.  Currently those tend to run pg_dump or pg_restore
> out of memory because of TOC bloat, and we've seen multiple field
> reports of that actually happening.
>
> The scheme I've vaguely thought about, but not got round to writing,
> is to merge all blobs with the same owner and ACL into one TOC entry.
> One would hope that would get it down to a reasonable number of
> TOC entries in practical applications.  (Perhaps there'd need to be
> a switch to make this optional.)


+1 for fixing this. Your scheme seems reasonable. This has been a pain
point for a long time. I'm not sure what we'd gain by making the fix
optional.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: refactor ExecGrant_*() functions

2022-12-02 Thread Andres Freund
Hi,

On 2022-12-02 08:30:55 +0100, Peter Eisentraut wrote:
> From 200879e5edfc1ce93b7af3cbfafc1f618626cbe9 Mon Sep 17 00:00:00 2001
> From: Peter Eisentraut 
> Date: Fri, 2 Dec 2022 08:16:53 +0100
> Subject: [PATCH] Refactor ExecGrant_*() functions
>
> Instead of half a dozen of mostly-duplicate ExecGrant_Foo() functions,
> write one common function ExecGrant_generic() that can handle most of
> them.

I'd name it ExecGrant_common() or such instead - ExecGrant_generic() sounds
like it will handle arbitrary things, which it doesn't. And, as you mention,
we could implement e.g. ExecGrant_Language() as using ExecGrant_common() +
additional checks.

Perhaps it'd be useful to add a callback to ExecGrant_generic() that can
perform additional checks, so that e.g. ExecGrant_Language() can easily be
implemented using ExecGrant_generic()?


> 1 file changed, 34 insertions(+), 628 deletions(-)

Very neat.


It seems wrong that most (all?) ExecGrant_* functions have the
  foreach(cell, istmt->objects)
loop. But that's a lot easier to address once the code has been
deduplicated radically.

Greetings,

Andres Freund




Bogus rte->relkind for EXCLUDED pseudo-relation

2022-12-02 Thread Tom Lane
In the wake of b23cd185f (pushed just now), I tried adding Asserts
to rewriteHandler.c that relkinds in RTEs don't change, as attached.
This blew up the regression tests immediately.  On investigation,
I find that

(1) ON CONFLICT's EXCLUDED pseudo-relation is assigned
rte->relkind = RELKIND_COMPOSITE, a rather horrid hack
installed by commit ad2278379.

(2) If a stored rule involves ON CONFLICT, then while loading
the rule AcquireRewriteLocks overwrites that with the actual
relkind, ie RELKIND_RELATION.  Or it did without the
attached Assert, anyway.

It appears to me that this means whatever safeguards are created
by the use of RELKIND_COMPOSITE will fail to apply in a rule.
Maybe that's okay because the relevant behaviors only occur at
parse time not rewrite/planning/execution, but even to write that
is to doubt how reliable and future-proof the assumption is.

I'm inclined to think we'd be well advised to undo that aspect of
ad2278379 and solve it some other way.  Maybe a new RTEKind would
be a better idea.  Alternatively, we could drop rewriteHandler.c's
attempts to update relkind.  Theoretically that's safe now, but
I hadn't quite wanted to just pull that trigger right away ...

regards, tom lane

diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index fb0c687bd8..49e0f54355 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -193,6 +193,7 @@ AcquireRewriteLocks(Query *parsetree,
  * While we have the relation open, update the RTE's relkind,
  * just in case it changed since this rule was made.
  */
+Assert(rte->relkind == rel->rd_rel->relkind);
 rte->relkind = rel->rd_rel->relkind;
 
 table_close(rel, NoLock);
@@ -3223,6 +3224,7 @@ rewriteTargetView(Query *parsetree, Relation view)
 	 * While we have the relation open, update the RTE's relkind, just in case
 	 * it changed since this view was made (cf. AcquireRewriteLocks).
 	 */
+	Assert(base_rte->relkind == base_rel->rd_rel->relkind);
 	base_rte->relkind = base_rel->rd_rel->relkind;
 
 	/*


Re: pg_dump: Remove "blob" terminology

2022-12-02 Thread Tom Lane
Andrew Dunstan  writes:
> On 2022-12-02 Fr 09:18, Tom Lane wrote:
>> The scheme I've vaguely thought about, but not got round to writing,
>> is to merge all blobs with the same owner and ACL into one TOC entry.
>> One would hope that would get it down to a reasonable number of
>> TOC entries in practical applications.  (Perhaps there'd need to be
>> a switch to make this optional.)

> +1 for fixing this. Your scheme seems reasonable. This has been a pain
> point for a long time. I'm not sure what we'd gain by making the fix
> optional.

Well, what this would lose is the ability to selectively restore
individual large objects using "pg_restore -L".  I'm not sure who
out there might be depending on that, but if we assume that's the
empty set I fear we'll find out it's not.  So a workaround switch
seemed possibly worth the trouble.  I don't have a position yet
on which way ought to be the default.

regards, tom lane




Re: [RFC] building postgres with meson - v13

2022-12-02 Thread Andres Freund
Hi,

On 2022-09-22 04:29:15 -0400, Andrew Dunstan wrote:
> Now I'll start on buildfarm support. Given my current commitments, this will
> take me a while, but I hope to have a working client by about the beginning
> of November.

Just checking: Any progress on this? Anything I can help with?

I'd like to move towards dropping src/tools/msvc at some point not too far
away, and we can't do so before having buildfarm support. I was just reminded
of this by looking at the windows-arm support patch...

Greetings,

Andres Freund




Re: Bogus rte->relkind for EXCLUDED pseudo-relation

2022-12-02 Thread Andres Freund
Hi,

On 2022-12-02 12:34:36 -0500, Tom Lane wrote:
> In the wake of b23cd185f (pushed just now), I tried adding Asserts
> to rewriteHandler.c that relkinds in RTEs don't change, as attached.
> This blew up the regression tests immediately.  On investigation,
> I find that
>
> (1) ON CONFLICT's EXCLUDED pseudo-relation is assigned
> rte->relkind = RELKIND_COMPOSITE, a rather horrid hack
> installed by commit ad2278379.

Is it that horrid? I guess we can add a full blown relkind for it, but that'd
not really change that we'd need logic to force AcquireRewriteLocks() to keep
its hands off the relkind?

We don't really have a different way to represent something that looks like a
table's tuple, but without system columns, and that shouldn't be affected by RLS
rewrite magic. We could add a distinct RELKIND of course, but that'd afaict
look very similar to RELKIND_COMPOSITE_TYPE.


> I'm inclined to think we'd be well advised to undo that aspect of
> ad2278379 and solve it some other way.  Maybe a new RTEKind would
> be a better idea.  Alternatively, we could drop rewriteHandler.c's
> attempts to update relkind.  Theoretically that's safe now, but
> I hadn't quite wanted to just pull that trigger right away ...

I think it'd be good to not have rewriteHandler.c update relkind, even if we
undo the RELKIND_COMPOSITE aspect of ad2278379. Changing relkind seems fairly
dangerous to me, particularly because we don't ever expect that to happen
now. I think it might make sense to make it an elog() rather than an Assert()
though.

Greetings,

Andres Freund




Re: Failed Assert in pgstat_assoc_relation

2022-12-02 Thread Andres Freund
Hi,

On 2022-12-02 12:15:37 -0500, Tom Lane wrote:
> I wrote:
> > I guess that means that the table-to-view removal has to go in
> > first.  I should be able to take care of that tomorrow, or if
> > you're in a hurry I don't mind if you commit it for me.
> 
> Done now, feel free to deal with the pgstat problem.

Thanks.  I'm out for a few hours without proper computer access, couldn't
quite get it finished inbetween your push and now. Will deal with it once I
get back.

Greetings,

Andres Freund




Re: Error-safe user functions

2022-12-02 Thread Corey Huinker
On Fri, Dec 2, 2022 at 9:12 AM Tom Lane  wrote:

> Robert Haas  writes:
> > I think the design is evolving in your head as you think about this
> > more, which is totally understandable and actually very good. However,
> > this is also why I think that you should produce the patch you
> > actually want instead of letting other people repeatedly submit
> > patches and then complain that they weren't what you had in mind.
>
> OK, Corey hasn't said anything, so I will have a look at this over
> the weekend.
>
> regards, tom lane
>

Sorry, had several life issues intervene. I'm glancing over what was discussed;
it seems pretty similar to what I had in mind. Will respond back in
detail shortly.


Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2022-12-02 Thread Jacob Champion
On Thu, Dec 1, 2022 at 9:26 PM Michael Paquier  wrote:
> On Mon, Nov 07, 2022 at 05:04:14PM -0800, Jacob Champion wrote:
> > The macOS/OpenSSL 3.0.0 failure is still unfixed.
>
> Err, could you look at that?  I am switching the patch as waiting on
> author.

Thanks for the nudge -- running with OpenSSL 3.0.7 in CI did not fix
the issue. I suspect a problem with our error stack handling...

Separately from this, our brew cache in Cirrus is extremely out of
date. Is there something that's supposed to be running `brew update`
(or autoupdate) that is stuck or broken?

Thanks,
--Jacob




Re: Error-safe user functions

2022-12-02 Thread Corey Huinker
On Fri, Dec 2, 2022 at 9:34 AM Andrew Dunstan  wrote:

>
> On 2022-12-02 Fr 09:12, Tom Lane wrote:
> > Robert Haas  writes:
> >> I think the design is evolving in your head as you think about this
> >> more, which is totally understandable and actually very good. However,
> >> this is also why I think that you should produce the patch you
> >> actually want instead of letting other people repeatedly submit
> >> patches and then complain that they weren't what you had in mind.
> > OK, Corey hasn't said anything, so I will have a look at this over
> > the weekend.
> >
> >
>
>
> Great. Let's hope we can get this settled early next week and then we
> can get to work on the next tranche of functions, those that will let
> the SQL/JSON work restart.
>
>
> cheers
>
>
> andrew
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>
I'm still working on organizing my patch, but it grew out of a desire to do
this:

CAST(value AS TypeName DEFAULT expr)

This is a thing that exists in other forms in other databases and while it
may look unrelated, it is essentially the SQL/JSON casts within a nested
data structure issue, just a lot simpler.

My original plan had been two new params to all _in() functions: a boolean
error_mode and a default expression Datum.

After consulting with Jeff Davis and Michael Paquier, the notion of
modifying fcinfo itself with two booleans:
  allow_error (this call is allowed to return if there was an error with
INPUT) and
  has_error (this function has the concept of a purely-input-based error,
and found one)

The nice part about this is that unaware functions can ignore these values,
and custom data types that did not check these values would continue to
work as before. It wouldn't respect the CAST default, but that's up to the
extension writer to fix, and that's a pretty acceptable failure mode.

Where this gets tricky is arrays and complex types: the default expression
applies only to the object explicitly casted, so if somebody tried CAST
('{"123","abc"}'::text[] AS integer[] DEFAULT '{0}') the inner casts need
to know that they _can_ allow input errors but have no default to offer;
they need merely report their failure upstream...and that's where the
issues align with the SQL/JSON issue.

In pursuing this, I see that my method was simultaneously too little and
too much. Too much in the sense that it alters the structure for all fmgr
functions, though in a very minor and back-compatible way, and too little
in the sense that we could actually return the ereport info in a structure
and leave it to the node to decide whether to raise it or not. Though I
should add that there are many situations where we don't care about the
specifics of the error, we just want to know that one existed and move on,
so time spent forming that return structure would be time wasted.

The one gap I see so far in the patch presented is that it returns a null
value on bad input, which might be ok if the node has the default, but that
then presents the node with having to understand whether it was a null
because of bad input vs a null that was expected.


Re: Error-safe user functions

2022-12-02 Thread Tom Lane
Corey Huinker  writes:
> I'm still working on organizing my patch, but it grew out of a desire to do
> this:
> CAST(value AS TypeName DEFAULT expr)
> This is a thing that exists in other forms in other databases and while it
> may look unrelated, it is essentially the SQL/JSON casts within a nested
> data structure issue, just a lot simpler.

Okay, maybe that's why I was thinking we had a requirement for
failure-free casts.  Sure, you can transform it to the other thing
by always implementing this as a cast-via-IO, but you could run into
semantics issues that way.  (If memory serves, we already have cases
where casting X to Y gives a different result from casting X to text
to Y.)

> My original plan had been two new params to all _in() functions: a boolean
> error_mode and a default expression Datum.
> After consulting with Jeff Davis and Michael Paquier, the notion of
> modifying fcinfo itself two booleans:
>   allow_error (this call is allowed to return if there was an error with
> INPUT) and
>   has_error (this function has the concept of a purely-input-based error,
> and found one)

Hmm ... my main complaint about that is the lack of any way to report
the details of the error.  I realize that a plain boolean failure flag
might be enough for our immediate use-cases, but I don't want to expend
the amount of effort this is going to involve and then later find we
have a use-case where we need the error details.

The sketch that's in my head at the moment is to make use of the existing
"context" field of FunctionCallInfo: if that contains a node of some
to-be-named type, then we are requesting that errors not be thrown
but instead be reported by passing back an ErrorData using a field of
that node.  The issue about not constructing an ErrorData if the outer
caller doesn't need it could perhaps be addressed by adding some boolean
flag fields in that node, but the details of that need not be known to
the functions reporting errors this way; it'd be a side channel from the
outer caller to elog.c.

The main objection I can see to this approach is that we only support
one context value per call, so you could not easily combine this
functionality with existing use-cases for the context field.  A quick
census of InitFunctionCallInfoData calls finds aggregates, window
functions, triggers, and procedures, none of which seem like plausible
candidates for wanting no-error behavior, so I'm not too concerned
about that.  (Maybe the error-reporting node could be made a sub-node
of the context node in any future cases where we do need it?)

> The one gap I see so far in the patch presented is that it returns a null
> value on bad input, which might be ok if the node has the default, but that
> then presents the node with having to understand whether it was a null
> because of bad input vs a null that was expected.

Yeah.  That's something we could probably get away with for the case of
input functions only, but I think explicit out-of-band signaling that
there was an error is a more future-proof solution.

regards, tom lane




Re: Error-safe user functions

2022-12-02 Thread Corey Huinker
On Fri, Dec 2, 2022 at 1:46 PM Tom Lane  wrote:

> Corey Huinker  writes:
> > I'm still working on organizing my patch, but it grew out of a desire to
> do
> > this:
> > CAST(value AS TypeName DEFAULT expr)
> > This is a thing that exists in other forms in other databases and while
> it
> > may look unrelated, it is essentially the SQL/JSON casts within a nested
> > data structure issue, just a lot simpler.
>
> Okay, maybe that's why I was thinking we had a requirement for
> failure-free casts.  Sure, you can transform it to the other thing
> by always implementing this as a cast-via-IO, but you could run into
> semantics issues that way.  (If memory serves, we already have cases
> where casting X to Y gives a different result from casting X to text
> to Y.)
>

Yes, I was setting aside the issue of direct cast functions for my v0.1


>
> > My original plan had been two new params to all _in() functions: a
> boolean
> > error_mode and a default expression Datum.
> > After consulting with Jeff Davis and Michael Paquier, the notion of
> > modifying fcinfo itself two booleans:
> >   allow_error (this call is allowed to return if there was an error with
> > INPUT) and
> >   has_error (this function has the concept of a purely-input-based error,
> > and found one)
>
> Hmm ... my main complaint about that is the lack of any way to report
> the details of the error.  I realize that a plain boolean failure flag
> might be enough for our immediate use-cases, but I don't want to expend
> the amount of effort this is going to involve and then later find we
> have a use-case where we need the error details.
>

I agree, but then we're past a boolean for allow_error, and we probably get
into a list of modes like this:

CAST_ERROR_ERROR  /* default ereport(), what we do now */
CAST_ERROR_REPORT_FULL /* report that the cast failed, everything that you
would have put in the ereport() instead put in a struct that gets returned
to caller */
CAST_ERROR_REPORT_SILENT /* report that the cast failed, but nobody cares
why so don't even form the ereport strings, good for bulk operations */
CAST_ERROR_WARNING /* report that the cast failed, but emit ereport() as a
warning */
CAST_ERROR_[NOTICE,LOG,DEBUG1,..DEBUG5] /* same, but some other loglevel */


>
> The sketch that's in my head at the moment is to make use of the existing
> "context" field of FunctionCallInfo: if that contains a node of some
> to-be-named type, then we are requesting that errors not be thrown
> but instead be reported by passing back an ErrorData using a field of
> that node.  The issue about not constructing an ErrorData if the outer
> caller doesn't need it could perhaps be addressed by adding some boolean
> flag fields in that node, but the details of that need not be known to
> the functions reporting errors this way; it'd be a side channel from the
> outer caller to elog.c.
>

That should be a good place for it, assuming it's not already used like
fn_extra is. It would also squash those cases above into just three: ERROR,
REPORT_FULL, and REPORT_SILENT, leaving it up to the node what type of
erroring/logging is appropriate.


>
> The main objection I can see to this approach is that we only support
> one context value per call, so you could not easily combine this
> functionality with existing use-cases for the context field.  A quick
> census of InitFunctionCallInfoData calls finds aggregates, window
> functions, triggers, and procedures, none of which seem like plausible
> candidates for wanting no-error behavior, so I'm not too concerned
> about that.  (Maybe the error-reporting node could be made a sub-node
> of the context node in any future cases where we do need it?)
>

A subnode had occurred to me when fiddling about with fn_extra, so that
applies here, but if we're doing a sub-node, then maybe it's worth its own
parameter. I struggled with that because of how few of these functions will
use it versus how often they're executed.


>
> > The one gap I see so far in the patch presented is that it returns a null
> > value on bad input, which might be ok if the node has the default, but
> that
> > then presents the node with having to understand whether it was a null
> > because of bad input vs a null that was expected.
>
> Yeah.  That's something we could probably get away with for the case of
> input functions only, but I think explicit out-of-band signaling that
> there was an error is a more future-proof solution.
>

I think we'll run into it fairly soon, because if I recall correctly,
SQL/JSON has a formatting spec that essentially means that we're not
calling input functions, we're calling TO_CHAR() and TO_DATE(), but they're
very similar to input functions.

One positive side effect of all this is that we get an isa(value, typname)
construct "for free": we just try the cast and return the value.

I'm still working on my patch, even though it will probably be sidelined,
in the hope that it informs us of any subsequent issues.


Re: O(n) tasks cause lengthy startups and checkpoints

2022-12-02 Thread Nathan Bossart
On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote:
> On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart  
> wrote:
>> The test appears to reliably create snapshot and mapping files, so if the
>> directories are empty at some point after the checkpoint at the end, we can
>> be reasonably certain the custodian took action.  I didn't add explicit
>> checks that there are files in the directories before the checkpoint
>> because a concurrent checkpoint could make such checks unreliable.
> 
> I think you're right. I added sqls to see if the snapshot and mapping
> files count > 0, see [1] and the cirrus-ci members are happy too -
> https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
> think we can consider adding these count > 0 checks to tests.

My worry about adding "count > 0" checks is that a concurrent checkpoint
could make them unreliable.  In other words, those checks might ordinarily
work, but if an automatic checkpoint causes the files to be cleaned up just
beforehand, they will fail.

> Having said above, I'm okay to process ShutdownRequestPending as early
> as possible, however, should we also add CHECK_FOR_INTERRUPTS()
> alongside ShutdownRequestPending?

I'm not seeing a need for CHECK_FOR_INTERRUPTS.  Do you see one?

> While thinking about this, one thing that really struck me is what
> happens if we let the custodian exit, say after processing
> ShutdownRequestPending immediately or after a restart, leaving other
> queued tasks? The custodian will never get to work on those tasks
> unless the requestors (say checkpoint or some other process) requests
> it to do so after restart. Maybe, we don't need to worry about it.
> Maybe we need to worry about it. Maybe it's an overkill to save the
> custodian's task state to disk so that it can come up and do the
> leftover tasks upon restart.

Yes, tasks will need to be retried when the server starts again.  The ones
in this patch set should be requested again during the next checkpoint.
Temporary file cleanup would always be requested during server start, so
that should be handled as well.  Even today, the server might abruptly shut
down while executing these tasks, and we don't have any special handling
for that.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-02 Thread Nathan Bossart
On Thu, Dec 01, 2022 at 04:21:30PM -0800, Nathan Bossart wrote:
> Okay, here is a new version of the patch.  This seems to clear up
> everything that I could find via the tests.

I cleaned up the patch a bit.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 72adfb291ee84b8a20f6634639b15eff6d48cf9a Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 21 Nov 2022 16:01:01 -0800
Subject: [PATCH v5 1/1] wake up logical workers as needed instead of relying
 on periodic wakeups

---
 src/backend/access/transam/xact.c|  3 ++
 src/backend/catalog/pg_subscription.c|  4 +++
 src/backend/commands/alter.c |  7 
 src/backend/commands/subscriptioncmds.c  |  4 +++
 src/backend/replication/logical/worker.c | 46 
 src/include/replication/logicalworker.h  |  3 ++
 6 files changed, 67 insertions(+)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8086b857b9..dc00e66cfb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -47,6 +47,7 @@
 #include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/logicallauncher.h"
+#include "replication/logicalworker.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
 #include "replication/syncrep.h"
@@ -2360,6 +2361,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true, is_parallel_worker);
 	AtEOXact_Snapshot(true, false);
 	AtEOXact_ApplyLauncher(true);
+	AtEOXact_LogicalRepWorkers(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2860,6 +2862,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false, is_parallel_worker);
 		AtEOXact_ApplyLauncher(false);
+		AtEOXact_LogicalRepWorkers(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index a506fc3ec8..ab75506094 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -25,6 +25,7 @@
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "replication/logicalworker.h"
 #include "storage/lmgr.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
@@ -322,6 +323,9 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
 
 	/* Cleanup. */
 	table_close(rel, NoLock);
+
+	/* Wake up the logical replication workers to handle this change quickly. */
+	LogicalRepWorkersWakeupAtCommit(subid);
 }
 
 /*
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 10b6fe19a2..da14e91552 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -59,6 +59,7 @@
 #include "commands/user.h"
 #include "miscadmin.h"
 #include "parser/parse_func.h"
+#include "replication/logicalworker.h"
 #include "rewrite/rewriteDefine.h"
 #include "tcop/utility.h"
 #include "utils/builtins.h"
@@ -410,6 +411,12 @@ ExecRenameStmt(RenameStmt *stmt)
 		   stmt->newname);
 table_close(catalog, RowExclusiveLock);
 
+/*
+ * Wake up the logical replication workers to handle this
+ * change quickly.
+ */
+LogicalRepWorkersWakeupAtCommit(address.objectId);
+
 return address;
 			}
 
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index d673557ea4..e29cdcbff9 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -34,6 +34,7 @@
 #include "nodes/makefuncs.h"
 #include "pgstat.h"
 #include "replication/logicallauncher.h"
+#include "replication/logicalworker.h"
 #include "replication/origin.h"
 #include "replication/slot.h"
 #include "replication/walreceiver.h"
@@ -1358,6 +1359,9 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
 
 	table_close(rel, RowExclusiveLock);
 
+	/* Wake up the logical replication workers to handle this change quickly. */
+	LogicalRepWorkersWakeupAtCommit(subid);
+
 	ObjectAddressSet(myself, SubscriptionRelationId, subid);
 
 	InvokeObjectPostAlterHook(SubscriptionRelationId, subid, 0);
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f9efe6c4c6..d735be4f10 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -253,6 +253,8 @@ WalReceiverConn *LogRepWorkerWalRcvConn = NULL;
 Subscription *MySubscription = NULL;
 static bool MySubscriptionValid = false;
 
+static List *on_commit_wakeup_workers_subids = NIL;
+
 bool		in_remote_transaction = false;
 static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
 
@@ -4092,3 +4094,47 @@ reset_apply_error_context_info(void)
 	apply_error_callback_arg.remote_attnum = -1;
 	set_apply_error_context_xact(InvalidTransactionId, InvalidXLogRecPtr);
 }
+
+/*
+ * Wakeup the stored subscriptions' workers on commit if requested.
+ */
+void
+AtEOXact_LogicalRepWorkers(bool isCommit)
+{
+	if (isCommit && on_commit_wakeup_workers

suppressing useless wakeups in logical/worker.c

2022-12-02 Thread Nathan Bossart
Hi hackers,

I've attached an attempt at porting a similar change to 05a7be9 [0] to
logical/worker.c.  The bulk of the patch is lifted from the walreceiver
patch, but I did need to add a hack for waking up after
wal_retrieve_retry_interval to start sync workers.  This hack involves a
new wakeup variable that process_syncing_tables_for_apply() sets.

For best results, this patch should be applied on top of [1], which is an
attempt at fixing all the stuff that only runs within a reasonable
timeframe because logical worker processes currently wake up at least once
a second.  With the attached patch applied, those periodic wakeups are
gone, so we need to make sure we wake up the logical workers as needed.

[0] 
https://postgr.es/m/CA%2BhUKGJGhX4r2LPUE3Oy9BX71Eum6PBcS8L3sJpScR9oKaTVaA%40mail.gmail.com
[1] https://postgr.es/m/20221122004119.GA132961%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From d1af5b4b3073984fba1e5e86b9c034a96b9e235a Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 1 Dec 2022 20:50:21 -0800
Subject: [PATCH v1 1/1] suppress unnecessary wakeups in logical/worker.c

---
 src/backend/replication/logical/tablesync.c |  20 +++
 src/backend/replication/logical/worker.c| 156 +++-
 src/include/replication/worker_internal.h   |   3 +
 3 files changed, 142 insertions(+), 37 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 94e813ac53..168a2da3a1 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -418,6 +418,13 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 
 	Assert(!IsTransactionState());
 
+	/*
+	 * If we've made it past our previously-stored special wakeup time, reset
+	 * it so that it can be recalculated as needed.
+	 */
+	if (next_sync_start <= GetCurrentTimestamp())
+		next_sync_start = PG_INT64_MAX;
+
 	/* We need up-to-date sync state info for subscription tables here. */
 	FetchTableStates(&started_tx);
 
@@ -612,6 +619,19 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
  rstate->relid);
 		hentry->last_start_time = now;
 	}
+	else if (found)
+	{
+		TimestampTz retry_time = hentry->last_start_time +
+ (wal_retrieve_retry_interval *
+  INT64CONST(1000));
+
+		/*
+		 * Store when we can start the sync worker so that we
+		 * know how long to sleep.
+		 */
+		if (retry_time < next_sync_start)
+			next_sync_start = retry_time;
+	}
 }
 			}
 		}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f9efe6c4c6..03dc42ceee 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -196,8 +196,6 @@
 #include "utils/syscache.h"
 #include "utils/timeout.h"
 
-#define NAPTIME_PER_CYCLE 1000	/* max sleep time between cycles (1s) */
-
 typedef struct FlushPosition
 {
 	dlist_node	node;
@@ -279,6 +277,26 @@ static XLogRecPtr skip_xact_finish_lsn = InvalidXLogRecPtr;
 /* BufFile handle of the current streaming file */
 static BufFile *stream_fd = NULL;
 
+/*
+ * Reasons to wake up and perform periodic tasks.
+ */
+typedef enum LogRepWorkerWakeupReason
+{
+	LRW_WAKEUP_TERMINATE,
+	LRW_WAKEUP_PING,
+	LRW_WAKEUP_STATUS,
+	NUM_LRW_WAKEUPS
+} LogRepWorkerWakeupReason;
+
+/*
+ * Wake up times for periodic tasks.
+ */
+static TimestampTz wakeup[NUM_LRW_WAKEUPS];
+TimestampTz next_sync_start;
+
+static void LogRepWorkerComputeNextWakeup(LogRepWorkerWakeupReason reason,
+		  TimestampTz now);
+
 typedef struct SubXactInfo
 {
 	TransactionId xid;			/* XID of the subxact */
@@ -2708,10 +2726,9 @@ UpdateWorkerStats(XLogRecPtr last_lsn, TimestampTz send_time, bool reply)
 static void
 LogicalRepApplyLoop(XLogRecPtr last_received)
 {
-	TimestampTz last_recv_timestamp = GetCurrentTimestamp();
-	bool		ping_sent = false;
 	TimeLineID	tli;
 	ErrorContextCallback errcallback;
+	TimestampTz now;
 
 	/*
 	 * Init the ApplyMessageContext which we clean up after each replication
@@ -2740,6 +2757,12 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
+	/* Initialize nap wakeup times. */
+	now = GetCurrentTimestamp();
+	for (int i = 0; i < NUM_LRW_WAKEUPS; i++)
+		LogRepWorkerComputeNextWakeup(i, now);
+	next_sync_start = PG_INT64_MAX;
+
 	/* This outer loop iterates once per wait. */
 	for (;;)
 	{
@@ -2756,6 +2779,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 
 		len = walrcv_receive(LogRepWorkerWalRcvConn, &buf, &fd);
 
+		now = GetCurrentTimestamp();
 		if (len != 0)
 		{
 			/* Loop to process all available data (without blocking). */
@@ -2779,9 +2803,9 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 	int			c;
 	StringInfoData s;
 
-	/* Reset timeout. */
-	last_recv_timestamp = GetCurrentTimestamp();
-	ping_sent =

Re: Questions regarding distinct operation implementation

2022-12-02 Thread Ankit Kumar Pandey


On 02/12/22 00:40, Ankit Kumar Pandey wrote:



On 25/11/22 11:00, Ankit Kumar Pandey wrote:


On 25/11/22 02:14, David Rowley wrote:
On Fri, 25 Nov 2022 at 06:57, Ankit Kumar Pandey 
 wrote:

Please let me know any opinions on this.

I think if you're planning on working on this then step 1 would have
to be checking the SQL standard to see which set of rows it asks
implementations to consider for duplicate checks when deciding if the
transition should be performed or not.  Having not looked, I don't
know if this is the entire partition or just the rows in the current
frame.

Depending on what you want, an alternative today would be to run a
subquery to uniquify the rows the way you want and then do the window
function stuff.

David
Thanks David, these are excellent pointers, I will look into SQL 
standard first and so on.



Hi,

Looking further into it, I am a bit clearer about the expectations for 
distinct in window aggregates (I couldn't get my hands on the SQL 
standard, as it is not in the public domain, but distinct in window 
aggregates is supported by Oracle, so I am using that as a reference).


For table (mytable):

id, name
1, A
1, A
10, B
3, A
1, A


/select avg(distinct id) over (partition by name) from mytable/ (in 
Oracle) yields:

2
2
2
2
10


From this, it can be seen that distinct is taken across all rows in the 
partition.


I also thought of using a subquery approach: /select avg(id) over 
(partition by name) from (select distinct(id), name from mytable)/


but this obviously doesn't yield the right answer, because the result 
should contain the same number of rows as the input. This implies we need 
to form the partition first and then remove duplicates within the 
partition.


Can we avoid any ordering/sort until the existing logic determines whether 
a value is in the frame (so as to respect any /order by/ clause given by 
the user), and once a tuple is determined to be in the frame, skip it if 
it is a duplicate? If the aforementioned approach is right, the question 
is how we check whether it is a duplicate. Should we create a lookup table 
(as tuples coming to advance_windowaggregate can be in arbitrary order)? 
Or would another approach be better?


Any opinion on this will be appreciated.

--
Regards,
Ankit Kumar Pandey


Hi,

I am still looking at this but am unable to move ahead, as I am not able 
to reuse the prior use-case (normal aggregates) to implement distinct in 
window functions, because the two differ in design (and window functions 
are a bit unique in general). One approach (among others) that I 
considered was, during spool_tuples, to rescan the tuplestore and add new 
tuples only if they are not already present. This is not very efficient 
because of the multiple read operations on the tuplestore just to check 
whether a tuple already exists, and because of other issues (like 
tuplestore_in_memory forcing the entire partition to get spooled in one 
go), etc.


Any ideas will be much appreciated.

Thanks.

--
Regards,
Ankit Kumar Pandey


Re: Removing another gen_node_support.pl special case

2022-12-02 Thread Tom Lane
Peter Eisentraut  writes:
> On 29.11.22 22:34, Tom Lane wrote:
>> Concretely, it seems like something like the attached could be
>> useful, independently of the other change.

> Yes, right now you can easily declare things that don't make sense. 
> Cross-checks like these look useful.

Checking my notes from awhile back, there was one other cross-check
that I thought was pretty high-priority: verifying that array_size
fields precede their array fields.  Without that, a read function
will fail entirely, and a compare function might index off the
end of an array depending on which array-size field it chooses
to believe.  It seems like an easy mistake to make, too.

I added that and pushed.

regards, tom lane




Re: Error-safe user functions

2022-12-02 Thread Robert Haas
On Fri, Dec 2, 2022 at 1:46 PM Tom Lane  wrote:
> The main objection I can see to this approach is that we only support
> one context value per call, so you could not easily combine this
> functionality with existing use-cases for the context field.  A quick
> census of InitFunctionCallInfoData calls finds aggregates, window
> functions, triggers, and procedures, none of which seem like plausible
> candidates for wanting no-error behavior, so I'm not too concerned
> about that.  (Maybe the error-reporting node could be made a sub-node
> of the context node in any future cases where we do need it?)

I kind of wonder why we don't just add another member to FmgrInfo.
It's 48 bytes right now and this would increase the size to 56 bytes,
so it's not as if we're increasing the number of cache lines or even
using up all of the remaining byte space. It's an API break, but
people have to recompile for new major versions anyway, so I guess I
don't really see the downside.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Error-safe user functions

2022-12-02 Thread Tom Lane
Robert Haas  writes:
> On Fri, Dec 2, 2022 at 1:46 PM Tom Lane  wrote:
>> The main objection I can see to this approach is that we only support
>> one context value per call, so you could not easily combine this
>> functionality with existing use-cases for the context field.

> I kind of wonder why we don't just add another member to FmgrInfo.
> It's 48 bytes right now and this would increase the size to 56 bytes,

This'd be FunctionCallInfoData not FmgrInfo.

I'm not terribly concerned about the size of FunctionCallInfoData,
but I am concerned about the number of cycles spent to initialize it,
because we do that pretty durn often.  So I don't really want to add
fields to it without compelling use-cases, and I don't see one here.

regards, tom lane




Re: Using WaitEventSet in the postmaster

2022-12-02 Thread Thomas Munro
On Fri, Dec 2, 2022 at 3:36 PM Thomas Munro  wrote:
> On Fri, Dec 2, 2022 at 2:40 PM Andres Freund  wrote:
> > It doesn't seem trivial (but not impossible either) to make SetLatch() 
> > robust
> > against arbitrary corruption. So it seems easier to me to just put the latch
> > in process local memory, and do a SetLatch() in postmaster's SIGUSR1 
> > handler.
>
> Alright, good idea, I'll do a v2 like that.

Here's an iteration like that.  Still WIP grade.  It passes, but there
must be something I don't understand about this computer program yet,
because if I move the "if (pending_..." section up into the block
where WL_LATCH_SET has arrived (instead of testing those variables
every time through the loop), a couple of tests leave zombie
(unreaped) processes behind, indicating that something funky happened
to the state machine that I haven't yet grokked.  Will look more next
week.

By the way, I think if we do this and then also do
s/select(/WaitLatchOrSocket(/ in auth.c's RADIUS code, then we could
then drop a chunk of newly unreachable code in
src/backend/port/win32/socket.c (though maybe I missed something; it's
quite hard to grep for "select" in a SQL database :-D).  There's also
a bunch of suspect stuff in there about UDP that is already dead
thanks to the pgstats work.
From 9f60cb42b222952ab94d0d4d89017c1390400196 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 9 Nov 2022 22:59:58 +1300
Subject: [PATCH v2] Give the postmaster a WaitEventSet and a latch.

Traditionally, the postmaster's architecture was quite unusual.  It did
a lot of work inside signal handlers, which were only unblocked while
waiting in select() to make that safe.

Switch to a more typical architecture, where signal handlers just set
flags and use a latch to close races.  Now the postmaster looks like
all other PostgreSQL processes, multiplexing its event processing in
epoll_wait()/kevent()/poll()/WaitForMultipleObjects() depending on the
OS.

Changes:

 * WL_SOCKET_ACCEPT is a new event for an incoming connection (on Unix, this is
   just another name for WL_SOCKET_READABLE, but Windows has a different
   underlying event; this mirrors WL_SOCKET_CONNECTED on the other
   end of a connection)

 * Small adjustments to WES to allow it to run in the postmaster.

 * Allow the postmaster to set up its own local latch.  For now we don't
   want other backends setting the postmaster's latch directly (perhaps
   later we'll figure out how to use a shared latch "robustly", so that
   memory corruption can't interfere with the postmaster's
   cleanup-and-restart responsibilities, but for now there is a two-step
   signal protocol SIGUSR1 -> SIGURG).

 * The existing signal handlers are cut in two: a handle_XXX part that
   sets a pending_XXX variable and sets the local latch, and a
   process_XXX part.

 * Signal handlers are now installed with the regular pqsignal()
   function rather than the special pqsignal_pm() function; the concerns
   about the portability of SA_RESTART vs select() are no longer
   relevant: SUSv2 left it implementation-defined whether select()
   restarts, but didn't add that qualification for poll(), and it doesn't
   matter anyway because we call SetLatch() creating a new reason to wake
   up.

Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CA%2BhUKG%2BZ-HpOj1JsO9eWUP%2Bar7npSVinsC_npxSy%2BjdOMsx%3DGg%40mail.gmail.com
---
 src/backend/libpq/pqsignal.c|  40 
 src/backend/postmaster/postmaster.c | 330 ++--
 src/backend/storage/ipc/latch.c |  19 ++
 src/backend/tcop/postgres.c |   1 -
 src/backend/utils/init/miscinit.c   |  13 +-
 src/include/libpq/pqsignal.h|   3 -
 src/include/miscadmin.h |   1 +
 src/include/storage/latch.h |   8 +-
 8 files changed, 203 insertions(+), 212 deletions(-)

diff --git a/src/backend/libpq/pqsignal.c b/src/backend/libpq/pqsignal.c
index 1ab34c5214..718043a39d 100644
--- a/src/backend/libpq/pqsignal.c
+++ b/src/backend/libpq/pqsignal.c
@@ -97,43 +97,3 @@ pqinitmask(void)
 	sigdelset(&StartupBlockSig, SIGALRM);
 #endif
 }
-
-/*
- * Set up a postmaster signal handler for signal "signo"
- *
- * Returns the previous handler.
- *
- * This is used only in the postmaster, which has its own odd approach to
- * signal handling.  For signals with handlers, we block all signals for the
- * duration of signal handler execution.  We also do not set the SA_RESTART
- * flag; this should be safe given the tiny range of code in which the
- * postmaster ever unblocks signals.
- *
- * pqinitmask() must have been invoked previously.
- */
-pqsigfunc
-pqsignal_pm(int signo, pqsigfunc func)
-{
-	struct sigaction act,
-oact;
-
-	act.sa_handler = func;
-	if (func == SIG_IGN || func == SIG_DFL)
-	{
-		/* in these cases, act the same as pqsignal() */
-		sigemptyset(&act.sa_mask);
-		act.sa_flags = SA_RESTART;
-	}
-	else
-	{
-		act.sa_mask = BlockSig;
-		act.sa_flags = 0;
-	}
-#ifdef SA_NOCLDSTOP
-	if (signo == SIGCHLD)

Fwd: Perform streaming logical transactions by background workers and parallel apply

2022-12-02 Thread Peter Smith
-- Forwarded message -
From: Peter Smith 
Date: Sat, Dec 3, 2022 at 8:03 AM
Subject: Re: Perform streaming logical transactions by background
workers and parallel apply
To: Amit Kapila 


On Fri, Dec 2, 2022 at 8:57 PM Amit Kapila  wrote:
>
> On Fri, Dec 2, 2022 at 2:29 PM Peter Smith  wrote:
> >
> > 3. pa_setup_dsm
> >
> > +/*
> > + * Set up a dynamic shared memory segment.
> > + *
> > + * We set up a control region that contains a fixed-size worker info
> > + * (ParallelApplyWorkerShared), a message queue, and an error queue.
> > + *
> > + * Returns true on success, false on failure.
> > + */
> > +static bool
> > +pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
> >
> > IMO that's confusing to say "fixed-sized worker info" when it's
> > referring to the ParallelApplyWorkerShared structure and not the other
> > ParallelApplyWorkerInfo.
> >
> > Might be better to say:
> >
> > "a fixed-size worker info (ParallelApplyWorkerShared)" -> "a
> > fixed-size struct (ParallelApplyWorkerShared)"
> >
> > ~~~
> >
>
> I find the existing wording better than what you are proposing. We can
> remove the structure name if you think that is better but IMO, current
> wording is good.
>

Including the structure name was helpful, but "worker info" made me
wrongly think it was talking about ParallelApplyWorkerInfo (e.g.
"worker info" was too much like WorkerInfo). So any different way to
say "worker info" might avoid that confusion.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Bogus rte->relkind for EXCLUDED pseudo-relation

2022-12-02 Thread Tom Lane
Andres Freund  writes:
> On 2022-12-02 12:34:36 -0500, Tom Lane wrote:
>> I find that
>> (1) ON CONFLICT's EXCLUDED pseudo-relation is assigned
>> rte->relkind = RELKIND_COMPOSITE_TYPE, a rather horrid hack
>> installed by commit ad2278379.

> Is it that horrid?

It's pretty bad IMO.  You didn't even bother to update the comments
for RangeTblEntry to explain that

charrelkind;/* relation kind (see pg_class.relkind) */

might now not be the rel's relkind at all.  Changing RTEKind
would likely be a better way, though of course we couldn't
do that in back branches.

And I think that we do have an issue in the back branches.
According to the commit message for ad2278379,

4) References to EXCLUDED were rewritten by the RLS machinery, as
   EXCLUDED was treated as if it were the underlying relation.

That rewriting would be post-rule-load, so it sure seems to me
that a rule containing EXCLUDED would be improperly subject to
RLS rewriting.  I don't have enough familiarity with RLS to come
up with a test case, and I don't see any relevant examples in
the mailing list threads referenced by ad2278379, but I bet
that it is broken.

The back-branch fix could just be to teach rewriteHandler.c
to not overwrite RELKIND_COMPOSITE_TYPE, perhaps.  We can't
remove the update completely because of the table-to-view case.

regards, tom lane




Think-o in foreign key comments

2022-12-02 Thread Paul Jungwirth

Hello,

I noticed a few places in the new foreign key code where a comment says 
"the ON DELETE SET NULL/DELETE clause". I believe it should say "ON 
DELETE SET NULL/DEFAULT".


These comments were added in d6f96ed94e7, "Allow specifying column list 
for foreign key ON DELETE SET actions." Here is a patch to correct them. 
I don't think you usually create a commitfest entry for tiny fixes like 
this, right? But if you'd like one please let me know and I'll add it.


Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com

>From dc5e317762266b93bfa44bb303fe1882853d8de7 Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth" 
Date: Fri, 2 Dec 2022 14:02:55 -0800
Subject: [PATCH v1] Fix FK comment think-o

---
 src/backend/commands/tablecmds.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3e83f375b5..716793e157 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -9467,10 +9467,10 @@ validateFkOnDeleteSetColumns(int numfks, const int16 *fkattnums,
  * numfks is the number of columns in the foreign key
  * pkattnum is the attnum array of referenced attributes.
  * fkattnum is the attnum array of referencing attributes.
- * numfkdelsetcols is the number of columns in the ON DELETE SET NULL/DELETE
+ * numfkdelsetcols is the number of columns in the ON DELETE SET NULL/DEFAULT
  *  (...) clause
  * fkdelsetcols is the attnum array of the columns in the ON DELETE SET
- *  NULL/DELETE clause
+ *  NULL/DEFAULT clause
  * pf/pp/ffeqoperators are OID array of operators between columns.
  * old_check_ok signals that this constraint replaces an existing one that
  * was already validated (thus this one doesn't need validation).
@@ -9686,10 +9686,10 @@ addFkRecurseReferenced(List **wqueue, Constraint *fkconstraint, Relation rel,
  * pkattnum is the attnum array of referenced attributes.
  * fkattnum is the attnum array of referencing attributes.
  * pf/pp/ffeqoperators are OID array of operators between columns.
- * numfkdelsetcols is the number of columns in the ON DELETE SET NULL/DELETE
+ * numfkdelsetcols is the number of columns in the ON DELETE SET NULL/DEFAULT
  *  (...) clause
  * fkdelsetcols is the attnum array of the columns in the ON DELETE SET
- *  NULL/DELETE clause
+ *  NULL/DEFAULT clause
  * old_check_ok signals that this constraint replaces an existing one that
  *		was already validated (thus this one doesn't need validation).
  * lockmode is the lockmode to acquire on partitions when recursing.
-- 
2.25.1



Re: Patch: Global Unique Index

2022-12-02 Thread David Zhang

Thanks a lot for all the comments.

On 2022-11-29 3:13 p.m., Tom Lane wrote:

... not to mention creating a high probability of deadlocks between
concurrent insertions to different partitions.  If they each
ex-lock their own partition's index before starting to look into
other partitions' indexes, it seems like a certainty that such
cases would fail.  The rule of thumb about locking multiple objects
is that all comers had better do it in the same order, and this
isn't doing that.
In the current POC patch, the deadlock happens when backend-1 
inserts a value into index X (partition-1), and backend-2 tries to insert a 
conflicting value right after backend-1 has released the buffer block lock but 
before it starts the uniqueness check on index Y (partition-2). In this case, 
backend-1 holds ExclusiveLock on transaction-1 and waits for ShareLock 
on transaction-2, while backend-2 holds ExclusiveLock on transaction-2 
and waits for ShareLock on transaction-1. Based on my debugging tests, 
this only happens when backend-1 and backend-2 want to insert a conflicting 
value. If this is true, then is it OK to error out with either `deadlock` or 
`duplicated value`, since this is a conflicting value? (Hopefully 
end users can handle it in a similar way.) I think such a deadlock 
requires two conditions: 1) users insert a conflicting value, and 
2) the uniqueness checking happens at the right moment (see above).

That specific issue could perhaps be fixed by having everybody
examine all the indexes in the same order, inserting when you
come to your own partition's index and otherwise just checking
for conflicts.  But that still means serializing insertions
across all the partitions.  And the fact that you need to lock
all the partitions, or even just know what they all are,
Here is the main change for the cross-partition uniqueness check on insertion 
in `0004-support-global-unique-index-insert-and-update.patch`:
 result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);

+    if (checkUnique != UNIQUE_CHECK_NO)
+        btinsert_check_unique_gi(itup, rel, heapRel, checkUnique);
+
 pfree(itup);

where a cross-partition uniqueness check is added after the btree index 
tuple insertion on the current partition. The idea is to make sure other 
backends can see the index tuple just inserted (even before it is marked 
visible), while the uniqueness check on the current partition can be 
skipped as it has already been done. Based on this change, I think 
insertion serialization can happen in two cases: 1) two insertions 
happen on the same buffer block (buffer lock waiting); 2) two ongoing 
insertions with duplicated values (transaction id waiting).







Re: Patch: Global Unique Index

2022-12-02 Thread David Zhang

On 2022-11-29 6:16 p.m., Tom Lane wrote:

Assuming that you are inserting into index X, and you've checked
index Y to find that it has no conflicts, what prevents another
backend from inserting a conflict into index Y just after you look?
AIUI the idea is to prevent that by continuing to hold an exclusive
lock on the whole index Y until you've completed the insertion.
Perhaps there's a better way to do that, but it's not what was
described.
Another main change in patch 
`0004-support-global-unique-index-insert-and-update.patch`:

+                search_global:
+                        stack = _bt_search(iRel, insertstate.itup_key,
+                                           &insertstate.buf, BT_READ, NULL);

+                        xwait = _bt_check_unique_gi(iRel, &insertstate,
+                                                    hRel, checkUnique, &is_unique,
+                                                    &speculativeToken, heapRel);
+                        if (unlikely(TransactionIdIsValid(xwait)))
+                        {
... ...
+                            goto search_global;
+                        }

Here, I am trying to use `BT_READ` to take an LW_SHARED lock on the 
buffer block if a match is found using the `itup_key` search key. The 
cross-partition uniqueness check will wait if the index tuple insertion 
on this buffer block is not yet done, and otherwise checks whether there 
is an ongoing transaction which may insert a conflicting value. Once the 
ongoing insertion is done, it goes back and checks again (I think it can 
also handle the case where a potentially conflicting index tuple is 
later marked as dead in the same transaction). Based on this change, my 
test results are:


1) a select-only query will not be blocked by the ongoing insertion on 
index X


2) an insertion on index Y may wait for the buffer block lock when 
inserting a different value, but it does not wait for the transaction 
lock held by the insertion on index X


3) when an insertion inserts a conflicting value on index Y,
    3.1) it waits for the buffer block lock if that lock is held by 
the insertion on index X
    3.2) then, it waits for the transaction lock until the insertion on 
index X is done






Re: WAL Insertion Lock Improvements (was: Re: Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish())

2022-12-02 Thread Nathan Bossart
On Fri, Dec 02, 2022 at 04:32:38PM +0530, Bharath Rupireddy wrote:
> On Fri, Dec 2, 2022 at 6:10 AM Andres Freund  wrote:
>> I'm not sure this is quite right - don't we need a memory barrier. But I 
>> don't
>> see a reason to not just leave this code as-is. I think this should be
>> optimized entirely in lwlock.c
> 
> Actually, we don't need that at all as LWLockWaitForVar() will return
> immediately if the lock is free. So, I removed it.

I briefly looked at the latest patch set, and I'm curious how this change
avoids introducing memory ordering bugs.  Perhaps I am missing something
obvious.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Generate pg_stat_get_* functions with Macros

2022-12-02 Thread Nathan Bossart
Overall, this change looks straightforward, and it saves a couple hundred
lines.

On Tue, Nov 22, 2022 at 08:09:22AM +0100, Drouvot, Bertrand wrote:
> +/* pg_stat_get_numscans */
> +PGSTAT_DEFINE_REL_INT64_FIELD_ACCESSOR(pg_stat_get_, numscans);
> +
> +/* pg_stat_get_tuples_returned */
> +PGSTAT_DEFINE_REL_INT64_FIELD_ACCESSOR(pg_stat_get_, tuples_returned);
> +
> +/* pg_stat_get_tuples_fetched */
> +PGSTAT_DEFINE_REL_INT64_FIELD_ACCESSOR(pg_stat_get_, tuples_fetched);

Can we hard-code the prefix in the macro?  It looks like all of these use
the same one.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add LZ4 compression in pg_dump

2022-12-02 Thread Michael Paquier
On Fri, Dec 02, 2022 at 04:15:10PM +, gkokola...@pm.me wrote:
> You are very correct. However one can glob after the fact. Please find
> 0001 of the attached v14 which attempts to implement it.

+   if ($pgdump_runs{$run}->{glob_pattern})
+   {
+   my $glob_pattern = $pgdump_runs{$run}->{glob_pattern};
+   my @glob_output = glob($glob_pattern);
+   is(scalar(@glob_output) > 0, 1, "glob pattern matched")
+   }

While this is correct in checking that the contents are compressed
under --with-zlib, this also removes the coverage where we make sure
that this command is able to complete under --without-zlib without
compressing any of the table data files.  Hence my point from
upthread: this test had better not use compile_option, but change
glob_pattern depending on if the build uses zlib or not.

In order to check this behavior with defaults_custom_format, perhaps
we could just remove the -Z6 from it or add an extra command for its
default behavior?
--
Michael




RE: Add semi-join pushdown to postgres_fdw

2022-12-02 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr. Pyhalov,

Thank you for your work on this useful patch.
I'm starting to review the v2 patch.
I have checked that the v2 patch applies to commit 
ec386948948c1708c0c28c48ef08b9c4dd9d47cc
(Date: Thu Dec 1 12:56:21 2022 +0100).
I briefly looked at the whole thing and stepped through it
by running simple queries such as the following.

query1) select * from f_t1 a1 where a1.c1 in (select c1 from f_t2);
query2) select * from f_t1 a1 join f_t3 a2 on a1.c1 = a2.c1 where a1.c1 in 
(select c1 from f_t3) ;
query3) update f_t2 set c1 = 1 from f_t1 a1 where a1.c2 = f_t2.c2 and exists 
(select null from f_t2 where c1 = a1.c1);

Although I haven't looked at all of the v2 patch, for now I have the following questions.

question1) 
  > + if (jointype == JOIN_SEMI && bms_is_member(var->varno, innerrel->relids) && !bms_is_member(var->varno, outerrel->relids))
  It took me some time to work out in which cases this condition is true.
  There are cases in which this condition is true for a semi-join of two baserels 
  when running a query which joins more than two relations, such as query2 and query3.
  For queries such as query2, you presumably want to push down only the semi-join path of 
  the joinrel (outerrel) defined by (f_t1 a1 join f_t3 a2 on a1.c1 = a2.c1) and the 
  baserel (innerrel) f_t3, for deparse safety, so you added this condition.
  Because of this limitation, your patch can't push down the subquery expression 
  "exists (select null from f_t2 where c1 = a1.c1)" in query3.
  I think this is one of the tricky points of semi-join pushdown.
  This is my understanding of the intent of this condition and the restrictions 
it imposes.
  Is my understanding right?
  I think comments describing the intent of this condition and the restrictions 
it imposes  
  would help PostgreSQL developers. What do you think?

question2) In foreign_join_ok:
  > * Constructing queries representing ANTI joins is hard, hence
  Is this true? Is it hard to extend your approach to ANTI join pushdown?

question3) You use variables named "addl_condXXX" in the following code.
  > appendStringInfo(addl_conds, "EXISTS (SELECT NULL FROM %s", join_sql_i.data);
  Does this naming mean "additional condition"?
  Is there a more comprehensive name, such as "subquery_exprXXX"?

question4) A really minor point: some expressions include spaces, such as
  "ft4.c2 = ft2.c2", and some do not, such as "c1=ftupper.c1".
  Is there a reason for this difference? If not, should we use a consistent 
policy for spacing?

Later, I'm going to look at the parts of your patch that are used when running 
more complex queries.

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation


Re: Failed Assert in pgstat_assoc_relation

2022-12-02 Thread Andres Freund
Hi,

On 2022-12-02 09:51:39 -0800, Andres Freund wrote:
> On 2022-12-02 12:15:37 -0500, Tom Lane wrote:
> > I wrote:
> > > I guess that means that the table-to-view removal has to go in
> > > first.  I should be able to take care of that tomorrow, or if
> > > you're in a hurry I don't mind if you commit it for me.
> > 
> > Done now, feel free to deal with the pgstat problem.
> 
> Thanks.  I'm out for a few hours without proper computer access, couldn't
> quite get it finished inbetween your push and now. Will deal with it once I
> get back.

Pushed that now. I debated for a bit whether to backpatch the test all the
way, but after it took me a while to convince myself that there's no active
problem in the older branches, I decided it's a good idea.

Thanks, Vignesh, for the bug reports!

Greetings,

Andres Freund




Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-12-02 Thread Bharath Rupireddy
On Fri, Dec 2, 2022 at 12:50 PM Michael Paquier  wrote:
>
> On Thu, Nov 17, 2022 at 11:53:23AM +0530, Bharath Rupireddy wrote:
> > Please do, if you feel so. Thanks for your review.
>
> I don't really mind the addition of the LSN when operating on a given
> record where we are reading a location, like in the five error paths
> for the header validation or the three ones in ReadRecord()

Thanks for reviewing.

> Now this one looks confusing:
> +   XLogSegNoOffsetToRecPtr(openLogSegNo, startoffset,
> +   wal_segment_size, lsn);
> ereport(PANIC,
> (errcode_for_file_access(),
>  errmsg("could not write to log file %s "
> -   "at offset %u, length %zu: %m",
> -   xlogfname, startoffset, nleft)));
> +   "at offset %u, LSN %X/%X, length %zu: %m",
> +   xlogfname, startoffset,
> +   LSN_FORMAT_ARGS(lsn), nleft)));
>
> This does not always refer to an exact LSN of a record as we may be in
> the middle of a write, so I would leave it as-is.
>
> Similarly the addition of wre_lsn would be confusing?  The offset
> looks kind of enough to me when referring to the middle of a page in
> WALReadRaiseError().

Yes, I removed those changes. Even if someone sees an offset of a
record within a WAL file elsewhere, they have the new utility function
(0002) pg_walfile_offset_lsn().

I'm attaching the v4 patch set for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 452f6adc4617230f8843a9339331170cb6f85723 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 3 Dec 2022 03:13:57 +
Subject: [PATCH v4] Emit record LSN in WAL read and validate error messages

The error messages reported for failures in WAL record reads or
validation currently contain the offset within the WAL file, which
is hard to convert to a WAL LSN while debugging. Reporting
the WAL record LSN at which the read or validation failure
occurs saves time and makes life easier here.

Author: Bharath Rupireddy
Reviewed-by: Maxim Orlov, Michael Paquier
Reviewed-by: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/CALj2ACWV=FCddsxcGbVOA=cvPyMr75YCFbSQT6g4KDj=gcJK4g@mail.gmail.com
---
 src/backend/access/transam/xlogreader.c   | 15 ++-
 src/backend/access/transam/xlogrecovery.c | 13 -
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 93f667b254..a38a80e049 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1226,9 +1226,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
-			  "invalid magic number %04X in WAL segment %s, offset %u",
+			  "invalid magic number %04X in WAL segment %s, LSN %X/%X, offset %u",
 			  hdr->xlp_magic,
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1240,9 +1241,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
-			  "invalid info bits %04X in WAL segment %s, offset %u",
+			  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 			  hdr->xlp_info,
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1281,9 +1283,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
 		/* hmm, first page of file doesn't have a long header? */
 		report_invalid_record(state,
-			  "invalid info bits %04X in WAL segment %s, offset %u",
+			  "invalid info bits %04X in WAL segment %s, LSN %X/%X, offset %u",
 			  hdr->xlp_info,
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1300,9 +1303,10 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 		XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 		report_invalid_record(state,
-			  "unexpected pageaddr %X/%X in WAL segment %s, offset %u",
+			  "unexpected pageaddr %X/%X in WAL segment %s, LSN %X/%X, offset %u",
 			  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
 			  fname,
+			  LSN_FORMAT_ARGS(recptr),
 			  offset);
 		return false;
 	}
@@ -1325,10 +1329,11 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 			XLogFileName(fname, state->seg.ws_tli, segno, state->segcxt.ws_segsize);
 
 			report_invalid_record(state,
-  "out-of-sequence timeline ID %u (after %u) in WAL segment %s, offset %u",
+  "out-of-sequence timeline ID %u (after %u) in WAL segment %

Re: pg_basebackup: add test about zstd compress option

2022-12-02 Thread Ian Lawrence Barwick
Hi

I was looking at the commitfest entry for this patch [1] as it's been dormant
for quite a while, with the intent of returning it with feedback.

[1] https://commitfest.postgresql.org/40/3835/

On Thu, 25 Aug 2022 at 16:52, Dong Wook Lee  wrote:
>
> On Tue, Aug 23, 2022 at 11:37 AM Michael Paquier  wrote:
>
> > It seems to me that checking that the contents generated are valid is
> > equally necessary.  We do that with zlib with gzip --test, and you
> > could use ${ZSTD} in the context of this test.
>
> Thank you for the good points.
> I supplemented the test according to your suggestion.
> However, there was a problem.
> Even though I did export ZSTD in the Makefile, the test runner can't
> find ZSTD when the tests actually run.
> ```
> my $zstd = $ENV{ZSTD};
> skip "program zstd is not found in your system", 1
>   if (!defined $zstd
> || $zstd eq '');
> ```
> log: regress_log_010_pg_basebackup
> ```
> ok 183 # skip program zstd is not found in your system.
> ```
> Could you check if I missed anything?

Taking a quick look at the patch itself, as-is it does actually work; maybe
 the zstd binary itself was missing or not in the normal system path?
 It might not have been installed even if the devel library was (IIRC
 this was the case on Rocky Linux).

However, the code largely duplicates the preceding gzip test,
 and as Michael mentioned, lz4 still lacks coverage.
Attached patch refactors this part of the test so it can be used
for multiple compression methods, similar to the test in
src/bin/pg_verifybackup/t/009_extract.pl mentioned by Robert.
The difference to that test is that we can exercise all the
command line options and directly check the generated files with
the respective binary.

Though on reflection maybe it's overkill and the existing tests
suffice. Anyway leaving the patch here in the interests of pushing
this forward in some direction.

Regards

Ian Barwick
From 73bf5fe94534563de64b77f74930a40823bdb875 Mon Sep 17 00:00:00 2001
From: Ian Barwick 
Date: Sat, 3 Dec 2022 13:09:15 +0900
Subject: [PATCH v3] pg_basebackup: refactor compression tests
 010_pg_basebackup.pl

This adds coverage for zstd and lz4 compression methods, and provides
infrastructure to easily add tests for additional options or new
compression methods.

Based on an earlier patch by Dong Wook Lee .
---
 src/bin/pg_basebackup/Makefile   |   1 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 204 +--
 2 files changed, 148 insertions(+), 57 deletions(-)

diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index 0035ebcef5..9cf092f574 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -21,6 +21,7 @@ include $(top_builddir)/src/Makefile.global
 # make these available to TAP test scripts
 export LZ4
 export TAR
+export ZSTD
 # Note that GZIP cannot be used directly as this environment variable is
 # used by the command "gzip" to pass down options, so stick with a different
 # name.
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index bcc3dd5f0b..c9382451c4 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -815,66 +815,156 @@ rmtree("$tempdir/backup_corrupt4");
 $node->safe_psql('postgres', "DROP TABLE corrupt1;");
 $node->safe_psql('postgres', "DROP TABLE corrupt2;");
 
-note "Testing pg_basebackup with compression methods";
-
-# Check ZLIB compression if available.
-SKIP:
-{
-	skip "postgres was not built with ZLIB support", 7
-	  if (!check_pg_config("#define HAVE_LIBZ 1"));
+# Verify available compression methods and options
+
+my @compression_methods = (
+# ZLIB
+{
+'compression_method' => 'ZLIB',
+'enabled'=> check_pg_config('#define HAVE_LIBZ 1'),
+'executable' => $ENV{GZIP_PROGRAM},
+'expected_files' => ['base.tar.gz', 'pg_wal.tar.gz'],
+'backup_flags' => [
+['--compress', '1'],
+['--gzip'],
+['--compress', 'gzip:1'],
+],
+},
+# ZSTD
+{
+'compression_method' => 'ZSTD',
+'enabled'=> check_pg_config('#define USE_ZSTD 1'),
+'expected_files' => ['base.tar.zst'],
+'executable' => $ENV{ZSTD},
+'backup_flags' => [
+['--compress', 'zstd'],
+['--compress', 'server-zstd'],
+['--compress', 'server-zstd:1'],
+['--compress', 'server-zstd:level=2'],
+['--compress', 'client-zstd'],
+['--compress', 'client-zstd:1'],
+['--compress', 'client-zstd:level=2'],
+# These might fail if the available zstd does not support multiple workers
+['--compress', 'server-zstd:workers=2'],
+['--compress', 'server-zstd:workers=2,level=2'],
+['--compress', 'client-zstd:workers=2'],
+['--compress', 'client-zstd:workers=2,level=2'],
+  

Re: Temporary tables versus wraparound... again

2022-12-02 Thread Greg Stark
On Thu, 1 Dec 2022 at 14:18, Andres Freund  wrote:
>
> Hi,
>
> On 2022-12-01 11:13:01 -0500, Greg Stark wrote:
> > On Sat, 5 Nov 2022 at 11:34, Tom Lane  wrote:
> >
> > > * I find 0001 a bit scary, specifically that it's decided it's
> > > okay to apply extract_autovac_opts, pgstat_fetch_stat_tabentry_ext,
> > > and especially relation_needs_vacanalyze to another session's
> > > temp table.  How safe is that really?
> >
> > I can look a bit more closely, but none of them are doing anything
> > with the table itself, just the catalog entries, which AFAIK have
> > always been fair game for other sessions. So I'm not really clear what
> > kind of unsafeness you're asking about.
>
> Is that actually true? Don't we skip some locking operations for temporary
> tables, which then also means catalog modifications cannot safely be done in
> other sessions?

This code isn't doing any catalog modifications from other sessions.
The code Tom's referring to is just autovacuum looking at relfrozenxid
and relfrozenmxid and printing warnings if they're approaching the
wraparound limits that would otherwise trigger an anti-wraparound
vacuum.

> I find it problematic that ResetVacStats() bypasses tableam. Normal vacuums
> etc go through tableam but you put a ResetVacStats() besides each call to
> table_relation_nontransactional_truncate().  Seems like this should just be in
> heapam_relation_nontransactional_truncate()?

OK. I think this patch actually predated the tableam (by a lot; I've had
others approach me about whether there's a good solution because it's
been biting them too) and I missed that when merging forward.

> Is it a good idea to use heap_inplace_update() in ResetVacStats()?

This is a good question. I had the impression it was the right thing to
do, and there have been bugs in the past caused by *not* using
heap_inplace_update(), so I think it's important to get this right.

I don't see any documentation explaining what the rules are for when
inplace edits are safe or unsafe or indeed when they're necessary for
correctness. So I searched back through the archives and checked when
it came up.

It seems there are a few issues:

a) Nontransactional operations like vacuum have to use it because they
don't have a transaction. Presumably this is why vacuum normally uses
inplace_update for these stats.

b) in the past SnapshotNow scans would behave incorrectly if we do
normal mvcc updates on rows without exclusive locks protecting against
concurrent scans. I'm not sure if this is still a factor these days
with the types of scans that still exist.

c) There are some constraints having to do with logical replication
that I didn't understand. I hope they don't relate to frozenxid but I
don't know

d) There were also some issues having to do with SInval messages but I
think they were additional constraints that inplace updates needed to
be concerned about.

These truncates are done at end of transaction, but pre-commit, so the
transaction is still alive; and there obviously should be no concurrent
scans on temporary tables, so I think it should be safe to do a regular
MVCC update. Is it a good idea to bloat the catalog though? If you
have many temporary tables and don't actually touch more than a few of
them in your transaction, that could be a lot of new tuple inserts on
every commit.

Actually it's only sort of true -- if no persistent xid is created
then we would be creating one just for this. But that shouldn't happen
because we only truncate if the transaction ever "touched" a temporary
table. It occurs to me it could still be kind of a problem if you have
a temporary table that you use once and then your session stays alive
for a long time without using temporary tables. Then it won't be
truncated and the frozenxid won't be advanced :(

It's kind of annoying that we have to put RecentXmin and
Get{Our,}OldestMultiXactId() in the table when truncating and then
keep advancing them even if there's no data in the table. Ideally
wouldn't it be better to be able to have Invalid{Sub,}Xid there and
only initialize it when a first insert is made?

-- 
greg




Re: Think-o in foreign key comments

2022-12-02 Thread Ian Lawrence Barwick
On Sat, 3 Dec 2022 at 7:19, Paul Jungwirth  wrote:
>
> Hello,
>
> I noticed a few places in the new foreign key code where a comment says
> "the ON DELETE SET NULL/DELETE clause". I believe it should say "ON
> DELETE SET NULL/DEFAULT".
>
> These comments were added in d6f96ed94e7, "Allow specifying column list
> for foreign key ON DELETE SET actions." Here is a patch to correct them.

LGTM.

I do notice the same patch adds the function "validateFkOnDeleteSetColumns"
but the name in the comment preceding it is "validateFkActionSetColumns",
might as well fix that the same time.

> I don't think you usually create a commitfest entry for tiny fixes like
> this, right? But if you'd like one please let me know and I'll add it.

From experience, usually a committer will pick up trivial fixes like this
within a few days, but if it escapes notice for more than a couple of weeks
a reminder and/or CF entry might be useful to make sure it doesn't get
forgotten.

Regards

Ian Barwick




Re: Avoid streaming the transaction which are skipped (in corner cases)

2022-12-02 Thread Amit Kapila
On Fri, Dec 2, 2022 at 4:58 PM Ashutosh Bapat
 wrote:
>
> Hi Dilip,
>
> On Tue, Nov 29, 2022 at 9:38 AM Dilip Kumar  wrote:
> >
> > >
> > > You are right we need this in ReorderBufferProcessPartialChange() as
> > > well.  I will fix this in the next version.
> >
> > Fixed this in the attached patch.
> >
>
> I focused my attention on SnapBuildXactNeedsSkip() usages and I see
> they are using different end points of the WAL record:
> 1 decode.c  logicalmsg_decode  594  SnapBuildXactNeedsSkip(builder, buf->origptr)))
> 2 decode.c  DecodeTXNNeedSkip  1250  return (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
> 3 reorderbuffer.c  AssertTXNLsnOrder  897  if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr))
> 4 reorderbuffer.c  ReorderBufferCanStartStreaming  3922  !SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr))
> 5 snapbuild.c  SnapBuildXactNeedsSkip  429  SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
>
> The first two use the origin pointer, while the third and fourth use the
> end pointer (the fifth is the definition of SnapBuildXactNeedsSkip()
> itself). You have fixed the fourth one. Do we need to fix the third one
> as well?
>

I think we can change the third one as well but I haven't tested it.
Adding Sawada-San for his inputs as it is added in commit 16b1fe0037.
In any case, I think we can do that as a separate patch because it is
not directly related to the streaming case we are trying to solve as
part of this patch.

> Probably we need to create two wrappers (macros) around
> SnapBuildXactNeedsSkip(), one which accepts a XLogRecordBuffer and
> other which accepts XLogReaderState. Then use those. That way at least
> we have logic unified as to which XLogRecPtr to use.
>

I don't know how that will be an improvement because both those have
the start and end locations of the record.

-- 
With Regards,
Amit Kapila.




Transaction timeout

2022-12-02 Thread Andrey Borodin
Hello,

We have statement_timeout, idle_in_transaction_timeout,
idle_session_timeout and many more! But we have no
transaction_timeout. I've skimmed thread [0,1] about existing timeouts
and found no contraindications to implement transaction_timeout.

Nikolay asked me if I can prototype the feature for testing by him,
and it seems straightforward. Please find attached. If it's not known
to be a bad idea - we'll work on it.

Thanks!

Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/763A0689-F189-459E-946F-F0EC4458980B%40hotmail.com


v1-0001-Intorduce-transaction_timeout.patch
Description: Binary data


docs: add missing id elements for developer GUCs

2022-12-02 Thread Ian Lawrence Barwick
Hi

A few of the developer option GUCs were missing the "id" attribute
in their markup, making it impossible to link to them directly.

Specifically the entries from "trace_locks" to "log_btree_build_stats" here:

https://www.postgresql.org/docs/current/runtime-config-developer.html

Patch applies cleanly back to REL_11_STABLE.

Regards

Ian Barwick
From 68f17b1bb652f6da0636b8b72b2549968a6dd937 Mon Sep 17 00:00:00 2001
From: Ian Barwick 
Date: Sat, 3 Dec 2022 15:44:21 +0900
Subject: [PATCH v3] doc: add missing  id elements for developer
 GUCs

---
 doc/src/sgml/config.sgml | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 39d1c89e33..ff6fcd902a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11099,7 +11099,7 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
-     <varlistentry>
+     <varlistentry id="guc-trace-locks" xreflabel="trace_locks">
       <term><varname>trace_locks</varname> (<type>boolean</type>)
       <indexterm>
        <primary><varname>trace_locks</varname> configuration parameter</primary>
@@ -11140,7 +11140,7 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
       </listitem>
      </varlistentry>
 
-     <varlistentry>
+     <varlistentry id="guc-trace-lwlocks" xreflabel="trace_lwlocks">
       <term><varname>trace_lwlocks</varname> (<type>boolean</type>)
       <indexterm>
        <primary><varname>trace_lwlocks</varname> configuration parameter</primary>
@@ -11160,7 +11160,7 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
       </listitem>
      </varlistentry>
 
-     <varlistentry>
+     <varlistentry id="guc-trace-userlocks" xreflabel="trace_userlocks">
       <term><varname>trace_userlocks</varname> (<type>boolean</type>)
       <indexterm>
        <primary><varname>trace_userlocks</varname> configuration parameter</primary>
@@ -11179,7 +11179,7 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
       </listitem>
      </varlistentry>
 
-     <varlistentry>
+     <varlistentry id="guc-trace-lock-oidmin" xreflabel="trace_lock_oidmin">
       <term><varname>trace_lock_oidmin</varname> (<type>integer</type>)
       <indexterm>
        <primary><varname>trace_lock_oidmin</varname> configuration parameter</primary>
@@ -11198,7 +11198,7 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
       </listitem>
      </varlistentry>
 
-     <varlistentry>
+     <varlistentry id="guc-trace-lock-table" xreflabel="trace_lock_table">
       <term><varname>trace_lock_table</varname> (<type>integer</type>)
       <indexterm>
        <primary><varname>trace_lock_table</varname> configuration parameter</primary>
@@ -11216,7 +11216,7 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
       </listitem>
      </varlistentry>
 
-     <varlistentry>
+     <varlistentry id="guc-debug-deadlocks" xreflabel="debug_deadlocks">
       <term><varname>debug_deadlocks</varname> (<type>boolean</type>)
       <indexterm>
        <primary><varname>debug_deadlocks</varname> configuration parameter</primary>
@@ -11235,7 +11235,7 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
       </listitem>
      </varlistentry>
 
-     <varlistentry>
+     <varlistentry id="guc-log-btree-build-stats" xreflabel="log_btree_build_stats">
       <term><varname>log_btree_build_stats</varname> (<type>boolean</type>)
       <indexterm>
        <primary><varname>log_btree_build_stats</varname> configuration parameter</primary>

base-commit: cb2e7ddfe571e2a158725200a33f728232059c2e
-- 
2.17.1



Re: Transaction timeout

2022-12-02 Thread Nikolay Samokhvalov
On Fri, Dec 2, 2022 at 9:18 PM Andrey Borodin  wrote:

> Hello,
>
> We have statement_timeout, idle_in_transaction_session_timeout,
> idle_session_timeout and many more! But we have no
> transaction_timeout. I've skimmed thread [0,1] about existing timeouts
> and found no contraindications to implement transaction_timeout.
>
> Nikolay asked me if I can prototype the feature for testing by him,
> and it seems straightforward. Please find attached. If it's not known
> to be a bad idea - we'll work on it.
>

Thanks!! It was a super quick reaction to my proposal. Honestly, I have
been thinking about this for several years, wondering why it's still not
implemented.

The reasons to have it should be straightforward – here are a couple of
them I can see.

First one. In the OLTP context, we usually have:
- a hard timeout set in application server
- a hard timeout set in HTTP server
- users not willing to wait more than several seconds – and almost never
being ready to wait for more than 30-60s.

If Postgres allows longer transactions (it does, since we cannot reliably
limit their duration now – it's always virtually unlimited), it might be
doing work that nobody is waiting for or needs anymore, spending
resources, affecting health, etc.

Why can't we limit transaction duration reliably? The existing timeouts
(namely, statement_timeout + idle_in_transaction_session_timeout) don't
protect against transactions consisting of a series of small statements
with short pauses between them. If such behavior happens (e.g., a long
series of fast UPDATEs in a loop), it can be dangerous, affecting general
DB health (bloat issues). This is reason number two – DBAs might want to
minimize the cases of long transactions, setting transaction limits
globally (and allowing them to be set locally for particular sessions or
for some users in rare cases).

Speaking of the patch – I just tested it (gitpod:
https://gitpod.io/#https://gitlab.com/NikolayS/postgres/tree/transaction_timeout);
it applies, works as expected for single-statement transactions:

postgres=# set transaction_timeout to '2s';
SET
postgres=# select pg_sleep(3);
ERROR:  canceling transaction due to transaction timeout

But it fails in the "worst" case I've described above – a series of small
statements:

postgres=# set transaction_timeout to '2s';
SET
postgres=# begin; select pg_sleep(1); select pg_sleep(1); select
pg_sleep(1); select pg_sleep(1); select pg_sleep(1); commit;
BEGIN
 pg_sleep
----------

(1 row)

 pg_sleep
----------

(1 row)

 pg_sleep
----------

(1 row)

 pg_sleep
----------

(1 row)

 pg_sleep
----------

(1 row)

COMMIT
postgres=#


Re: Questions regarding distinct operation implementation

2022-12-02 Thread Ankit Kumar Pandey



On 02/12/22 03:07, David Rowley wrote:

On Fri, 2 Dec 2022 at 08:10, Ankit Kumar Pandey  wrote:

select avg(distinct id) over (partition by name) from mytable (in Oracle DB)
yields:
2
2
2
2
10

From this, it is seen that DISTINCT is taken across all rows in the partition.

Due to the lack of ORDER BY clause, all rows in the partition are in
the window frame at once.  The question is, what *should* happen if
you add an ORDER BY.

Looking at the copy of the standard that I have, I see nothing
explicitly mentioned about aggregates with DISTINCT used as window
functions, however, I do see in the Window Function section:

"The window aggregate functions compute a <set function type>
(COUNT, SUM, AVG, etc.), the same as
a group aggregate function, except that the computation aggregates
over the window frame of a row rather than
over a group of a grouped table. The hypothetical set functions are
not permitted as window aggregate functions."

So you could deduce that the DISTINCT would also need to be applied
over the frame too.

The question is, what do you want to make work?  If you're not worried
about supporting DISTINCT when there is an ORDER BY clause and the
frame options are effectively ROWS BETWEEN UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING, then it's going to be much easier to make work.
You never need to worry about rows dropping out of visibility in the
frame. Simply all rows in the partition are in the frame.

You do need to be careful as, if I remember correctly, we do support
some non-standard things here. I believe the standard requires an
ORDER BY when specifying frame options. I think we didn't see any
technical reason to apply that limitation, so didn't.  That means in
Postgres, you can do things like:

select avg(id) over (partition by name ROWS BETWEEN CURRENT ROW AND 3
FOLLOWING) from mytable;

but that's unlikely to work on most other databases without adding an ORDER BY.

So if you are going to limit this to only being supported without an
ORDER BY, then you'll need to ensure that the specified frame options
don't cause your code to break.  I'm unsure, but this might be a case
of checking for FRAMEOPTION_NONDEFAULT unless it's
FRAMEOPTION_START_UNBOUNDED_PRECEDING|FRAMEOPTION_END_UNBOUNDED_FOLLOWING.
You'll need to study that a bit more than I just did though.

One way to make that work might be to add code to
eval_windowaggregates() around the call to advance_windowaggregate(),
you can see the row being aggregated is set by:

winstate->tmpcontext->ecxt_outertuple = agg_row_slot;

what you'd need to do here is change the code so that you put all the
rows to aggregate into a tuplesort then sort them by the distinct
column and instead, feed the tuplesort rows to
advance_windowaggregate(). You'd need to add code similar to what is
in process_ordered_aggregate_single() in nodeAgg.c to have the
duplicate consecutive rows skipped.

Just a word of warning on this. This is a hugely complex area of
Postgres. If I was you, I'd make sure and spend quite a bit of time
reading nodeWindowAgg.c and likely much of nodeAgg.c. Any changes we
accept in that area are going to have to be very carefully done. Make
sure you're comfortable with the code before doing too much. It would
be very easy to end up with a giant mess if you try to do this without
fully understanding the implications of your changes.  Also, you'll
need to show you've not regressed the performance of the existing
features with the code you've added.

Good luck!

David


Thanks a lot David, this is of immense help. I will go through the
mentioned pointers, the biggest takeaway being that this is a complex
piece of code and will take its due course.


Thanks again

--
Regards,
Ankit Kumar Pandey





Re: Questions regarding distinct operation implementation

2022-12-02 Thread Ankit Kumar Pandey



On 02/12/22 03:21, David G. Johnston wrote:
 The main concern, I suspect, isn't implementation ability; it is
speed and memory consumption.


Hi David,

Shouldn't this be an acceptable tradeoff if someone wants to perform an
extra operation on plain old aggregates? Although I am not sure how much
extra memory and compute usage would be considered acceptable.


--
Regards,
Ankit Kumar Pandey