Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
On 2022-Sep-19, Bharath Rupireddy wrote: > We have a bunch of messages [1] that have an offset, but not LSN in > the error message. Firstly, is there an easiest way to figure out LSN > from offset reported in the error messages? If not, is adding LSN to > these messages along with offset a good idea? Of course, we can't just > convert offset to LSN using XLogSegNoOffsetToRecPtr() and report, but > something meaningful like reporting the LSN of the page that we are > reading-in or writing-out etc. Maybe adding an errcontext() somewhere that reports the LSN would be appropriate. For example, the page_read() callbacks have the LSN readily available, so the ones in the backend could install the errcontext callback; or perhaps ReadPageInternal can do it #ifndef FRONTEND. Not sure which of those options is best, but either of them sounds better than sticking the LSN in a lower-level routine that doesn't necessarily have the info already. -- Álvaro Herrera    Breisgau, Deutschland — https://www.EnterpriseDB.com/
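For illustration, a minimal sketch of the kind of errcontext callback being suggested here. The callback-argument struct and the function names are made up for the example; only errcontext(), ErrorContextCallback, XLogRecPtr and LSN_FORMAT_ARGS() are existing PostgreSQL infrastructure.

```c
/* Hypothetical context passed to the callback; not an existing struct. */
typedef struct WalReadErrCtx
{
	XLogRecPtr	targetPagePtr;	/* LSN of the WAL page being read */
} WalReadErrCtx;

static void
wal_read_error_callback(void *arg)
{
	WalReadErrCtx *ctx = (WalReadErrCtx *) arg;

	/* adds "CONTEXT: while reading WAL page ..." to any error raised */
	errcontext("while reading WAL page beginning at %X/%X",
			   LSN_FORMAT_ARGS(ctx->targetPagePtr));
}

/*
 * Installed around the read in the usual way:
 *
 *	errcallback.callback = wal_read_error_callback;
 *	errcallback.arg = &myctx;
 *	errcallback.previous = error_context_stack;
 *	error_context_stack = &errcallback;
 *	... read the page ...
 *	error_context_stack = errcallback.previous;
 */
```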
Re: Add 64-bit XIDs into PostgreSQL 15
Hi, Is this patch targeted at PG16 now? I want to have a look at these patches, but applying them on master failed:

Applying: Use 64-bit numbering of SLRU pages.
Applying: Use 64-bit format to output XIDs
Applying: Use 64-bit FullTransactionId instead of Epoch:xid
Applying: Use 64-bit pages representation in SLRU callers.
error: patch failed: src/backend/access/transam/multixact.c:1228
error: src/backend/access/transam/multixact.c: patch does not apply
Patch failed at 0004 Use 64-bit pages representation in SLRU callers.
hint: Use 'git am --show-current-patch=diff' to see the failed patch

Regards, Zhang Mingli
Re: missing indexes in indexlist with partitioned tables
Thank you for having a look at the patch. On Tue, 20 Sept 2022 at 18:41, Amit Langote wrote: > Agreed, though the patch's changes to tests does not seem to have to > do with join removal? I don't really understand what the test changes > are all about. I wonder why the patch doesn't instead add the test > case that Arne showed in the file he attached with [1]. > [1] > https://www.postgresql.org/message-id/2641568c18de40e8b1528fc9d4d80127%40index.de I adjusted a test in partition_join.sql to add an additional column to the fract_t table. Before the change that table only had a single column and due to the query's join condition being USING(id), none of the columns from the left joined table were being used. That resulted in the updated code performing a left join removal as it was passing the checks for no columns being used in the left joined table in analyzejoins.c. The test in partition_join.sql claims to be testing "partitionwise join with fractional paths", so I thought we'd better not have a query that the planner removes the join when we're meant to be testing joins. It probably wouldn't hurt to have a new test to ensure left join removals work with a partitioned table. That should go in join.sql along with the other join removal tests. I didn't study Arne's patch to see what test he added. I was only interested in writing enough code so I could check there was no good reason not to add the partitioned index into RelOptInfo.indexlist. Arne sent me an off-list message to say he's planning on working on the patch that uses the existing field instead of the new one he originally added. Let's hold off for that patch. David
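For reference, a minimal, made-up illustration of the general left-join-removal case described above (plain tables; the patch under discussion extends this to partitioned tables by exposing their indexes in RelOptInfo.indexlist):

```sql
CREATE TABLE t1 (id int PRIMARY KEY, payload text);
CREATE TABLE t2 (id int PRIMARY KEY);

-- only t1's columns appear in the target list and t2.id is provably
-- unique, so this LEFT JOIN contributes nothing and the planner can
-- remove it
EXPLAIN SELECT t1.* FROM t1 LEFT JOIN t2 USING (id);
```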
Re: missing indexes in indexlist with partitioned tables
On Tue, Sep 20, 2022 at 4:53 PM David Rowley wrote: > Thank you for having a look at the patch. > > On Tue, 20 Sept 2022 at 18:41, Amit Langote wrote: > > Agreed, though the patch's changes to tests does not seem to have to > > do with join removal? I don't really understand what the test changes > > are all about. I wonder why the patch doesn't instead add the test > > case that Arne showed in the file he attached with [1]. > > [1] > > https://www.postgresql.org/message-id/2641568c18de40e8b1528fc9d4d80127%40index.de > > I adjusted a test in partition_join.sql to add an additional column to > the fract_t table. Before the change that table only had a single > column and due to the query's join condition being USING(id), none of > the columns from the left joined table were being used. That resulted > in the updated code performing a left join removal as it was passing > the checks for no columns being used in the left joined table in > analyzejoins.c. The test in partition_join.sql claims to be testing > "partitionwise join with fractional paths", so I thought we'd better > not have a query that the planner removes the join when we're meant to > be testing joins. Ah, got it, thanks for the explanation. > It probably wouldn't hurt to have a new test to ensure left join > removals work with a partitioned table. That should go in join.sql > along with the other join removal tests. Makes sense. > I didn't study Arne's patch > to see what test he added. I was only interested in writing enough > code so I could check there was no good reason not to add the > partitioned index into RelOptInfo.indexlist. Arne sent me an off-list > message to say he's planning on working on the patch that uses the > existing field instead of the new one he originally added. Let's hold > off for that patch. Ok, sure. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: Add 64-bit XIDs into PostgreSQL 15
Hi,

With these patches, it seems that we don't need to handle wraparound in GetNextLocalTransactionId() either, as LocalTransactionId is uint64 now.

```
LocalTransactionId
GetNextLocalTransactionId(void)
{
	LocalTransactionId result;

	/* loop to avoid returning InvalidLocalTransactionId at wraparound */
	do
	{
		result = nextLocalTransactionId++;
	} while (!LocalTransactionIdIsValid(result));

	return result;
}
```

Regards, Zhang Mingli
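In other words, with a 64-bit counter the loop could presumably be reduced to something like the sketch below. This is just an illustration of the suggestion, not a reviewed change; it assumes nextLocalTransactionId is never initialized to InvalidLocalTransactionId, since without the loop nothing would skip over the invalid value.

```c
LocalTransactionId
GetNextLocalTransactionId(void)
{
	/*
	 * With a 64-bit LocalTransactionId, wraparound to
	 * InvalidLocalTransactionId is no longer a practical concern.
	 */
	return nextLocalTransactionId++;
}
```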
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Sep 16, 2022 at 4:54 PM John Naylor wrote: > > On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada wrote: > > > > On Mon, Aug 15, 2022 at 10:39 PM John Naylor > > wrote: > > > > > > bool, buth = and <=. Should be pretty close. Also, i believe if you > > > left this for last as a possible refactoring, it might save some work. > > v6 demonstrates why this should have been put off towards the end. (more > below) > > > > In any case, I'll take a look at the latest patch next month. > > Since the CF entry said "Needs Review", I began looking at v5 again > this week. Hopefully not too much has changed, but in the future I > strongly recommend setting to "Waiting on Author" if a new version is > forthcoming. I realize many here share updated patches at any time, > but I'd like to discourage the practice especially for large patches. Understood. Sorry for the inconveniences. > > > I've updated the radix tree patch. It's now separated into two patches. > > > > 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find > > better names) that are similar to the pg_lfind8() family but they > > return the index of the key in the vector instead of true/false. The > > patch includes regression tests. > > I don't want to do a full review of this just yet, but I'll just point > out some problems from a quick glance. > > +/* > + * Return the index of the first element in the vector that is greater than > + * or eual to the given scalar. Return sizeof(Vector8) if there is no such > + * element. > > That's a bizarre API to indicate non-existence. > > + * > + * Note that this function assumes the elements in the vector are sorted. > + */ > > That is *completely* unacceptable for a general-purpose function. > > +#else /* USE_NO_SIMD */ > + Vector8 r = 0; > + uint8 *rp = (uint8 *) &r; > + > + for (Size i = 0; i < sizeof(Vector8); i++) > + rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0; > > I don't think we should try to force the non-simd case to adopt the > special semantics of vector comparisons. It's much easier to just use > the same logic as the assert builds. > > +#ifdef USE_SSE2 > + return (uint32) _mm_movemask_epi8(v); > +#elif defined(USE_NEON) > + static const uint8 mask[16] = { > +1 << 0, 1 << 1, 1 << 2, 1 << 3, > +1 << 4, 1 << 5, 1 << 6, 1 << 7, > +1 << 0, 1 << 1, 1 << 2, 1 << 3, > +1 << 4, 1 << 5, 1 << 6, 1 << 7, > + }; > + > +uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) > vshrq_n_s8(v, 7)); > +uint8x16_t maskedhi = vextq_u8(masked, masked, 8); > + > +return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi)); > > For Arm, we need to be careful here. This article goes into a lot of > detail for this situation: > > https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon > > Here again, I'd rather put this off and focus on getting the "large > details" in good enough shape so we can got towards integrating with > vacuum. Thank you for the comments! These above comments are addressed by Nathan in a newly derived thread. I'll work on the patch. I'll consider how to integrate with vacuum as the next step. One concern for me is how to limit the memory usage to maintenance_work_mem. Unlike using a flat array, memory space for adding one TID varies depending on the situation. If we want strictly not to allow using memory more than maintenance_work_mem, probably we need to estimate the memory consumption in a conservative way. 
Regards, -- Masahiko Sawada PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
I've gone through the v11-0001 patch in more detail. Here are some more review comments (nothing functional I think - mostly just wording) == 1. src/backend/executor/execReplication.c - build_replindex_scan_key - * This is not generic routine, it expects the idxrel to be replication - * identity of a rel and meet all limitations associated with that. + * This is not generic routine, it expects the idxrel to be an index + * that planner would choose if the searchslot includes all the columns + * (e.g., REPLICA IDENTITY FULL on the source). */ -static bool +static int build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel, TupleTableSlot *searchslot) (I know this is not caused by your patch but maybe fix it at the same time?) "This is not generic routine, it expects..." -> "This is not a generic routine - it expects..." ~~~ 2. + IndexInfo *indexInfo PG_USED_FOR_ASSERTS_ONLY; + + /* + * This attribute is an expression, and + * SuitablePathsForRepIdentFull() was called earlier while the + * index for subscriber is selected. There, the indexes comprising + * *only* expressions have already been eliminated. + * + * We sanity check this now. + */ +#ifdef USE_ASSERT_CHECKING + indexInfo = BuildIndexInfo(idxrel); + Assert(!IndexOnlyOnExpression(indexInfo)); +#endif 2a. "while the index for subscriber is selected..." -> "when the index for the subscriber was selected...” ~ 2b. Because there is only one declaration in this code block you could simplify this a bit if you wanted to. SUGGESTION /* * This attribute is an expression, and * SuitablePathsForRepIdentFull() was called earlier while the * index for subscriber is selected. There, the indexes comprising * *only* expressions have already been eliminated. * * We sanity check this now. */ #ifdef USE_ASSERT_CHECKING IndexInfo *indexInfo = BuildIndexInfo(idxrel); Assert(!IndexOnlyOnExpression(indexInfo)); #endif ~~~ 3. src/backend/executor/execReplication.c - RelationFindReplTupleByIndex + /* Start an index scan. */ + scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0); retry: found = false; It might be better to have a blank line before that ‘retry’ label, like in the original code. == 4. src/backend/replication/logical/relation.c +/* see LogicalRepPartMapEntry for details in logicalrelation.h */ static HTAB *LogicalRepPartMap = NULL; Personally, I'd word that something like: "/* For LogicalRepPartMap details see LogicalRepPartMapEntry in logicalrelation.h */" but YMMV. ~~~ 5. src/backend/replication/logical/relation.c - GenerateDummySelectPlannerInfoForRelation +/* + * This is not a generic function, helper function for + * GetCheapestReplicaIdentityFullPath. The function creates + * a dummy PlannerInfo for the given relationId as if the + * relation is queried with SELECT command. + */ +static PlannerInfo * +GenerateDummySelectPlannerInfoForRelation(Oid relationId) (mentioned this one in my previous post) "This is not a generic function, helper function" -> "This is not a generic function. It is a helper function" ~~~ 6. src/backend/replication/logical/relation.c - GetCheapestReplicaIdentityFullPath +/* + * Generate all the possible paths for the given subscriber relation, + * for the cases that the source relation is replicated via REPLICA + * IDENTITY FULL. The function returns the cheapest Path among the + * eligible paths, see SuitablePathsForRepIdentFull(). + * + * The function guarantees to return a path, because it adds sequential + * scan path if needed. 
+ * + * The function assumes that all the columns will be provided during + * the execution phase, given that REPLICA IDENTITY FULL guarantees + * that. + */ +static Path * +GetCheapestReplicaIdentityFullPath(Relation localrel) "for the cases that..." -> "for cases where..." ~~~ 7. + /* + * Currently it is not possible for planner to pick a partial index or + * indexes only on expressions. We still want to be explicit and eliminate + * such paths proactively. "for planner..." -> "for the planner..." == 8. .../subscription/t/032_subscribe_use_index.pl - general 8a. (remove the 'we') "# we wait until..." -> "# wait until..." X many occurrences ~ 8b. (remove the 'we') "# we show that..." -> “# show that..." X many occurrences ~~~ 9. There is inconsistent wording for some of your test case start/end comments 9a. e.g. start: SUBSCRIPTION USES INDEX UPDATEs MULTIPLE ROWS end: SUBSCRIPTION USES INDEX MODIFIES MULTIPLE ROWS ~ 9b. e.g. start: SUBSCRIPTION USES INDEX WITH MULTIPLE COLUMNS end: SUBSCRIPTION USES INDEX MODIFIES MULTIPLE ROWS ~~~ 10. I did not really understand the point of having special subscription names tap_sub_rep_full_0 tap_sub_rep_full_2 tap_sub_rep_full_3 tap_sub_rep_full_4 etc... Since you drop/recreate these for each test case can't they just be called 'tap_sub_rep_full'? ~~~ 11. SUBSCRIPTION USES INDEX WITH MULTIPLE COLUMNS +# updates 200 rows +$node_publisher->safe_psql('postgres', + "D
Re: Add common function ReplicationOriginName.
Hi Amit, > I think it is better to use Size. Even though, it may not fail now as > the size of names for origin will always be much lesser but it is > better if we are consistent. If we agree with this, then as a first > patch, we can make it to use Size in existing places and then > introduce this new function. OK, here is the updated patchset. * 0001 replaces int's with Size's in the existing code * 0002 applies Peter's patch on top of 0001 -- Best regards, Aleksander Alekseev v2-0001-Pass-Size-as-a-2nd-argument-for-snprintf-in-table.patch Description: Binary data v2-0002-Add-common-function-ReplicationOriginName.patch Description: Binary data
Re: Summary function for pg_buffercache
Hi, Also I suggest changing the names of the columns in order to make them > consistent with the rest of the system. If you consider pg_stat_activity > and family [1] you will notice that the columns are named > (entity)_(property), e.g. backend_xid, backend_type, client_addr, etc. So > instead of used_buffers and unused_buffers the naming should be > buffers_used and buffers_unused. > > [1]: https://www.postgresql.org/docs/current/monitoring-stats.html I changed these names and updated the patch. However I have somewhat mixed feelings about avg_usagecount. Generally >> AVG() is a relatively useless methric for monitoring. What if the user >> wants MIN(), MAX() or let's say a 99th percentile? I suggest splitting it >> into usagecount_min, usagecount_max and usagecount_sum. AVG() can be >> derived as usercount_sum / used_buffers. >> > > Won't be usagecount_max almost always 5 as "BM_MAX_USAGE_COUNT" set to 5 > in buf_internals.h? I'm not sure about how much usagecount_min would add > either. > A usagecount is always an integer between 0 and 5, it's not > something unbounded. I think the 99th percentile would be much better than > average if strong outlier values could occur. But in this case, I feel like > an average value would be sufficiently useful as well. > usagecount_sum would actually be useful since average can be derived from > it. If you think that the sum of usagecounts has a meaning just by itself, > it makes sense to include it. Otherwise, wouldn't showing directly averaged > value be more useful? > Aleksander, do you still think the average usagecount is a bit useless? Or does it make sense to you to keep it like this? > I suggest we focus on saving the memory first and then think about the > > performance, if necessary. > > Personally I think the locks part is at least as important - it's what > makes > the production impact higher. > I agree that it's important due to its high impact. I'm not sure how to avoid any undefined behaviour without locks though. Even with locks, performance is much better. But is it good enough for production? Thanks, Melih v6-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
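For reference, a usage sketch of the function under discussion, assuming the renamed columns mentioned above (buffers_used, buffers_unused) plus an averaged usage count make it into the final version — the exact column set may differ in later patch revisions:

```sql
CREATE EXTENSION pg_buffercache;

-- one summary row for all of shared_buffers, instead of one row per buffer
SELECT buffers_used, buffers_unused, usagecount_avg
FROM pg_buffercache_summary();
```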
Re: About displaying NestLoopParam
On Fri, Sep 16, 2022 at 5:59 PM Richard Guo wrote: > I'm not saying this is a bug, but just curious why param 0 cannot be > displayed as the referenced expression. And I find the reason is that in > function find_param_referent(), we have the 'in_same_plan_level' flag > controlling that if we have emerged from a subplan, i.e. not the same > plan level any more, we would not look further for the matching > NestLoopParam. Param 0 suits this situation. > > And there is a comment there also saying, > > /* > * NestLoops transmit params to their inner child only; also, once > * we've crawled up out of a subplan, this couldn't possibly be > * the right match. > */ > After thinking about this some more, I still don't see a reason why we cannot display NestLoopParam after we've emerged from a subplan. It seems these params come from a parameterized subqueryscan and their values are supplied by an upper nestloop. These params should have been processed in process_subquery_nestloop_params(), where we just add the PlannerParamItem entries to root->curOuterParams, in the form of NestLoopParam, using the same PARAM_EXEC slots. So I propose the attached patch to remove the 'in_same_plan_level' flag so that we can display NestLoopParam across subplans. Please correct me if I'm wrong. Thanks Richard v1-0001-Display-NestLoopParam-across-subplan.patch Description: Binary data
Re: HOT chain validation in verify_heapam()
On Mon, Sep 19, 2022 at 10:04 PM Aleksander Alekseev < aleksan...@timescale.com> wrote: > Hi Himanshu, > > > I have changed this in the attached patch. > > If it's not too much trouble could you please base your changes on v4 > that I submitted? I put some effort into writing a proper commit > message, editing the comments, etc. The easiest way of doing this is > using `git am` and `git format-patch`. > > Please find it attached. -- Regards, Himanshu Upadhyaya EnterpriseDB: http://www.enterprisedb.com From 3f3290ffb857117385f79f85aa599c588340924b Mon Sep 17 00:00:00 2001 From: Himanshu Upadhyaya Date: Tue, 20 Sep 2022 14:23:31 +0530 Subject: [PATCH v5] Implement HOT chain validation in verify_heapam() Himanshu Upadhyaya, reviewed by Robert Haas, Aleksander Alekseev Discussion: https://postgr.es/m/CAPF61jBBR2-iE-EmN_9v0hcQEfyz_17e5Lbb0%2Bu2%3D9ukA9sWmQ%40mail.gmail.com --- contrib/amcheck/verify_heapam.c | 214 1 file changed, 214 insertions(+) diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c index c875f3e5a2..e90e1ef2f3 100644 --- a/contrib/amcheck/verify_heapam.c +++ b/contrib/amcheck/verify_heapam.c @@ -399,6 +399,9 @@ verify_heapam(PG_FUNCTION_ARGS) for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++) { OffsetNumber maxoff; + OffsetNumber predecessor[MaxOffsetNumber] = {0}; + OffsetNumber successor[MaxOffsetNumber] = {0}; + bool lp_valid[MaxOffsetNumber] = {false}; CHECK_FOR_INTERRUPTS(); @@ -433,6 +436,8 @@ verify_heapam(PG_FUNCTION_ARGS) for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff; ctx.offnum = OffsetNumberNext(ctx.offnum)) { + OffsetNumber nextoffnum; + ctx.itemid = PageGetItemId(ctx.page, ctx.offnum); /* Skip over unused/dead line pointers */ @@ -469,6 +474,12 @@ verify_heapam(PG_FUNCTION_ARGS) report_corruption(&ctx, psprintf("line pointer redirection to unused item at offset %u", (unsigned) rdoffnum)); + +/* + * make entry in successor array, redirected tuple will be + * validated at the time when we loop over successor array + */ +successor[ctx.offnum] = rdoffnum; continue; } @@ -504,9 +515,205 @@ verify_heapam(PG_FUNCTION_ARGS) /* It should be safe to examine the tuple's header, at least */ ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid); ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr); + lp_valid[ctx.offnum] = true; /* Ok, ready to check this next tuple */ check_tuple(&ctx); + + /* + * Add the data to the successor array. It will be used later to + * generate the predecessor array. + * + * We need to access the tuple's header to populate the + * predecessor array. However the tuple is not necessarily sanity + * checked yet so delaying construction of predecessor array until + * all tuples are sanity checked. + */ + nextoffnum = ItemPointerGetOffsetNumber(&(ctx.tuphdr)->t_ctid); + if (ItemPointerGetBlockNumber(&(ctx.tuphdr)->t_ctid) == ctx.blkno && +nextoffnum != ctx.offnum) + { +successor[ctx.offnum] = nextoffnum; + } + } + + /* + * Loop over offset and populate predecessor array from all entries + * that are present in successor array. 
+ */ + ctx.attnum = -1; + for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff; + ctx.offnum = OffsetNumberNext(ctx.offnum)) + { + ItemId curr_lp; + ItemId next_lp; + HeapTupleHeader curr_htup; + HeapTupleHeader next_htup; + TransactionId curr_xmax; + TransactionId next_xmin; + + OffsetNumber nextoffnum = successor[ctx.offnum]; + + curr_lp = PageGetItemId(ctx.page, ctx.offnum); + if (nextoffnum == 0) + { +/* + * This is either the last updated tuple in the chain or a + * corruption raised for this tuple. + */ +continue; + } + if (ItemIdIsRedirected(curr_lp) && lp_valid[nextoffnum]) + { +next_lp = PageGetItemId(ctx.page, nextoffnum); +next_htup = (HeapTupleHeader) PageGetItem(ctx.page, next_lp);; +if (!HeapTupleHeaderIsHeapOnly(next_htup)) +{ + report_corruption(&ctx, + psprintf("redirected tuple at line pointer offset %u is not heap only tuple", + (unsigned) nextoffnum)); +} +if ((next_htup->t_infomask & HEAP_UPDATED) == 0) +{ + report_corruption(&ctx, + psprintf("redirected tuple at line pointer offset %u is not heap updated tuple", + (unsigned) nextoffnum)); +} +continue; + } + + /* + * Add a line pointer offset to the predecessor array. 1) If xmax + * is matching with xmin of next tuple (reaching via its t_ctid). + * 2) If the next tuple is in the same page. Raise corruption if + * we have two tuples having the same predecessor. + * + * We add the offset to the predecessor array irrespective of the + * transaction (t_xmin) status. We will do validation related to + * the transaction status (and
Re: Add common function ReplicationOriginName.
On Tue, Sep 20, 2022 at 6:36 PM Aleksander Alekseev wrote: > > Hi Amit, > > > I think it is better to use Size. Even though, it may not fail now as > > the size of names for origin will always be much lesser but it is > > better if we are consistent. If we agree with this, then as a first > > patch, we can make it to use Size in existing places and then > > introduce this new function. > > OK, here is the updated patchset. > > * 0001 replaces int's with Size's in the existing code > * 0002 applies Peter's patch on top of 0001 > LGTM. Thanks! - Kind Regards, Peter Smith. Fujitsu Australia
Re: Add 64-bit XIDs into PostgreSQL 15
On Tue, Sep 20, 2022 at 03:37:47PM +0800, Zhang Mingli wrote: > I want to have a look at these patches, but apply on master failed: Yeah, it's likely to break every week or more often. You have a few options: 0) resolve the conflict yourself; 1) apply the patch to the commit that the authors sent it against, or some commit before the conflicting file(s) were changed in master. Like maybe "git checkout -b 64bitxids f66d997fd". 2) Use the last patch that cfbot successfully created. You can read the patch on github's web interface, or add cfbot's user as a remote to use the patch locally for review and/or compilation. Something like "git remote add cfbot https://github.com/postgresql-cfbot/postgresql; git fetch cfbot commitfest/39/3594; git checkout -b 64bitxids cfbot/commitfest/39/3594". (Unfortunately, cfbot currently squishes the patch series into a single commit and loses the commit message). You could also check the git link in the commitfest, to see if the author has already rebased it, but haven't yet mailed the rebased patch to the list. In this case, that's not true, but you could probably use the author's branch on github, too. https://commitfest.postgresql.org/39/3594/ -- Justin
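For readability, option 2 above as a copy-pasteable sequence (these are exactly the commands given in the text):

```sh
git remote add cfbot https://github.com/postgresql-cfbot/postgresql
git fetch cfbot commitfest/39/3594
git checkout -b 64bitxids cfbot/commitfest/39/3594
```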
Re: why can't a table be part of the same publication as its schema
On 2022-Sep-20, Amit Kapila wrote: > On Mon, Sep 19, 2022 at 8:46 PM Alvaro Herrera > wrote: > > This seems a pretty arbitrary restriction. It feels like you're adding > > this restriction precisely so that you don't have to write the code to > > reject the ALTER .. SET SCHEMA if an incompatible configuration is > > detected. But we already have such checks in several cases, so I don't > > see why this one does not seem a good idea. > > > I agree that we have such checks at other places as well and one > somewhat similar is in ATPrepChangePersistence(). > > ATPrepChangePersistence() > { > ... > ... > /* > * Check that the table is not part of any publication when changing to > * UNLOGGED, as UNLOGGED tables can't be published. > */ Right, I think this is a sensible approach. > However, another angle to look at it is that we try to avoid adding > restrictions in other DDL commands for defined publications. Well, it makes sense to avoid restrictions wherever possible. But here, the consequence is that you end up with a restriction in the publication definition that is not very sensible. Imagine if you said "you can't add schema S because it contains an unlogged table". It's absurd. Maybe this can be relaxed in a future release, but it's quite odd. > The intention was to be in sync with FOR ALL TABLES. I guess we can change both (FOR ALL TABLES and IN SCHEMA) later. -- Álvaro Herrera    Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Add 64-bit XIDs into PostgreSQL 15
Hi, On Sep 20, 2022, 17:26 +0800, Justin Pryzby , wrote: > On Tue, Sep 20, 2022 at 03:37:47PM +0800, Zhang Mingli wrote: > > I want to have a look at these patches, but apply on master failed: > > Yeah, it's likely to break every week or more often. > > You have a few options: > > 0) resolve the conflict yourself; > > 1) apply the patch to the commit that the authors sent it against, or > some commit before the conflicting file(s) were changed in master. Like > maybe "git checkout -b 64bitxids f66d997fd". > > 2) Use the last patch that cfbot successfully created. You can read the > patch on github's web interface, or add cfbot's user as a remote to use > the patch locally for review and/or compilation. Something like "git > remote add cfbot https://github.com/postgresql-cfbot/postgresql; git > fetch cfbot commitfest/39/3594; git checkout -b 64bitxids > cfbot/commitfest/39/3594". (Unfortunately, cfbot currently squishes the > patch series into a single commit and loses the commit message). > > You could also check the git link in the commitfest, to see if the > author has already rebased it, but haven't yet mailed the rebased patch > to the list. In this case, that's not true, but you could probably use > the author's branch on github, too. > https://commitfest.postgresql.org/39/3594/ > > -- > Justin Got it, thanks. Regards, Zhang Mingli
Re: Summary function for pg_buffercache
Hi Melih, > I changed these names and updated the patch. Thanks for the updated patch! > Aleksander, do you still think the average usagecount is a bit useless? Or > does it make sense to you to keep it like this? I don't mind keeping the average. > I'm not sure how to avoid any undefined behaviour without locks though. > Even with locks, performance is much better. But is it good enough for > production? Potentially you could avoid taking locks by utilizing atomic operations and lock-free algorithms. But these algorithms are typically error-prone and do not always produce faster code than the lock-based ones. I'm pretty confident this is out of scope of this particular patch. The patch v6 had several defects: * Trailing whitespaces (can be checked by applying the patch with `git am`) * Wrong code formatting (can be fixed with pgindent) * Several empty lines were removed that are not related to the proposed change (can be seen with `git diff`) * An unlikely division by zero if buffers_used = 0 * Missing part of the commit message added in v4 Here is a corrected patch v7. To me it seems to be in pretty good shape, unless cfbot and/or other hackers report any issues. -- Best regards, Aleksander Alekseev v7-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
Re: ICU for global collation
On 17.09.22 10:33, Marina Polyakova wrote: Thanks to Kyotaro Horiguchi review we found out that there're interesting cases due to the order of some ICU checks: 1. ICU locale vs supported encoding: 1.1. On 2022-09-15 09:52, Kyotaro Horiguchi wrote: If I executed initdb as follows, I would be told to specify --icu-locale option. $ initdb --encoding sql-ascii --locale-provider icu hoge ... initdb: error: ICU locale must be specified However, when I reran the command, it complains about incompatible encoding this time. I think it's more user-friendly to check for the encoding compatibility before the check for missing --icu-locale option. This a valid point, but it would require quite a bit of work to move all those checks around and re-verify the result, so I don't want to do it in PG15. 1.2. (ok?) $ initdb --encoding sql-ascii --icu-locale en-US hoge initdb: error: --icu-locale cannot be specified unless locale provider "icu" is chosen $ initdb --encoding sql-ascii --icu-locale en-US --locale-provider icu hoge ... initdb: error: encoding mismatch initdb: detail: The encoding you selected (SQL_ASCII) is not supported with the ICU provider. initdb: hint: Rerun initdb and either do not specify an encoding explicitly, or choose a matching combination. $ createdb --encoding sql-ascii --icu-locale en-US hoge createdb: error: database creation failed: ERROR: ICU locale cannot be specified unless locale provider is ICU $ createdb --encoding sql-ascii --icu-locale en-US --locale-provider icu hoge createdb: error: database creation failed: ERROR: encoding "SQL_ASCII" is not supported with ICU provider I don't see a problem here. 2. For builds without ICU: 2.1. $ initdb --locale-provider icu hoge ... initdb: error: ICU locale must be specified $ initdb --locale-provider icu --icu-locale en-US hoge ... initdb: error: ICU is not supported in this build $ createdb --locale-provider icu hoge createdb: error: database creation failed: ERROR: ICU locale must be specified $ createdb --locale-provider icu --icu-locale en-US hoge createdb: error: database creation failed: ERROR: ICU is not supported in this build IMO, it would be more user-friendly to inform an unsupported build in the first runs too.. Again, this would require reorganizing a bunch of code to get some cosmetic benefit, which isn't a good idea now for PG15. 2.2. (ok?) 2.3. same here 3. The locale provider is ICU, but it has not yet been set from the template database: $ initdb --locale-provider icu --icu-locale en-US -D data && pg_ctl -D data -l logfile start && createdb --icu-locale ru-RU --template template0 mydb ... createdb: error: database creation failed: ERROR: ICU locale cannot be specified unless locale provider is ICU Please see attached patch for a fix. Does that work for you?From 5bb07e390baf8a471f7d30582469268c8a6d71ca Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 20 Sep 2022 05:50:41 -0400 Subject: [PATCH] Improve ICU option handling in CREATE DATABASE We check that the ICU locale is only specified if the ICU locale provider is selected. But we did that too early. We need to wait until we load the settings of the template database, since that could also set what the locale provider is. 
Reported-by: Marina Polyakova Discussion: https://www.postgresql.org/message-id/9ba4cd1ea6ed6b7b15c0ff15e6f54...@postgrespro.ru --- src/backend/commands/dbcommands.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index cba929b7f998..5dfec5c6b056 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -907,10 +907,6 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) errmsg("unrecognized locale provider: %s", locproviderstr))); } - if (diculocale && dblocprovider != COLLPROVIDER_ICU) - ereport(ERROR, - (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), -errmsg("ICU locale cannot be specified unless locale provider is ICU"))); if (distemplate && distemplate->arg) dbistemplate = defGetBoolean(distemplate); if (dallowconnections && dallowconnections->arg) @@ -1050,6 +1046,13 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) check_icu_locale(dbiculocale); } + else + { + if (dbiculocale) + ereport(ERROR, + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION), +errmsg("ICU locale cannot be specified unless locale provider is ICU"))); + } /* * Check that the new encoding and locale settings match the source -- 2.37.3
Re: Proposal to use JSON for Postgres Parser format
On Tue, Sep 20, 2022 at 7:48 AM Tom Lane wrote: > Peter Geoghegan writes: > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane wrote: > >> Our existing format is certainly not great on those metrics, but > >> I do not see how "let's use JSON!" is a route to improvement. > > > The existing format was designed with developer convenience as a goal, > > though -- despite my complaints, and in spite of your objections. > > As Munro adduces nearby, it'd be a stretch to conclude that the current > format was designed with any Postgres-related goals in mind at all. > I think he's right that it's a variant of some Lisp-y dump format that's > probably far hoarier than even Berkeley Postgres. > > > If it didn't have to be easy (or even practical) for developers to > > directly work with the output format, then presumably the format used > > internally could be replaced with something lower level and faster. So > > it seems like the two goals (developer ergonomics and faster > > interchange format for users) might actually be complementary. > > I think the principal mistake in what we have now is that the storage > format is identical to the "developer friendly" text format (plus or > minus some whitespace). First we need to separate those. We could > have more than one equivalent text format perhaps, and I don't have > any strong objection to basing the text format (or one of them) on > JSON. +1 for considering storage format and text format separately. Let's consider what our criteria could be for the storage format. 1) Storage effectiveness (shorter is better) and serialization/deserialization effectiveness (faster is better). On this criterion, the custom binary format looks perfect. 2) Robustness in the case of corruption. It seems much easier to detect the data corruption and possibly make some partial manual recovery for textual format. 3) Standartness. It's better to use something known worldwide or at least used in other parts of PostgreSQL than something completely custom. From this perspective, JSON/JSONB is better than custom things. -- Regards, Alexander Korotkov
Re: Proposal to use JSON for Postgres Parser format
On Tue, Sep 20, 2022 at 1:00 PM Alexander Korotkov wrote: > On Tue, Sep 20, 2022 at 7:48 AM Tom Lane wrote: > > Peter Geoghegan writes: > > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane wrote: > > >> Our existing format is certainly not great on those metrics, but > > >> I do not see how "let's use JSON!" is a route to improvement. > > > > > The existing format was designed with developer convenience as a goal, > > > though -- despite my complaints, and in spite of your objections. > > > > As Munro adduces nearby, it'd be a stretch to conclude that the current > > format was designed with any Postgres-related goals in mind at all. > > I think he's right that it's a variant of some Lisp-y dump format that's > > probably far hoarier than even Berkeley Postgres. > > > > > If it didn't have to be easy (or even practical) for developers to > > > directly work with the output format, then presumably the format used > > > internally could be replaced with something lower level and faster. So > > > it seems like the two goals (developer ergonomics and faster > > > interchange format for users) might actually be complementary. > > > > I think the principal mistake in what we have now is that the storage > > format is identical to the "developer friendly" text format (plus or > > minus some whitespace). First we need to separate those. We could > > have more than one equivalent text format perhaps, and I don't have > > any strong objection to basing the text format (or one of them) on > > JSON. > > +1 for considering storage format and text format separately. > > Let's consider what our criteria could be for the storage format. > > 1) Storage effectiveness (shorter is better) and > serialization/deserialization effectiveness (faster is better). On > this criterion, the custom binary format looks perfect. > 2) Robustness in the case of corruption. It seems much easier to > detect the data corruption and possibly make some partial manual > recovery for textual format. > 3) Standartness. It's better to use something known worldwide or at > least used in other parts of PostgreSQL than something completely > custom. From this perspective, JSON/JSONB is better than custom > things. (sorry, I've accidentally cut the last paragraph from the message) It seems that there is no perfect fit for this multi-criteria optimization, and we should pick what is more important. Any thoughts? -- Regards, Alexander Korotkov
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Peter, Thanks for the quick response. > 1. Commit message > > It looks like some small mistake happened. You wrote [1] that my > previous review comments about the commit message were fixed, but it > seems the v11 commit message is unchanged since v10. > > Oops, yes you are right, I forgot to push commit message changes. I'll incorporate all these suggestions on v12. > == > > 2. src/backend/replication/logical/relation.c - > GenerateDummySelectPlannerInfoForRelation > > +/* > + * This is not a generic function, helper function for > + * GetCheapestReplicaIdentityFullPath. The function creates > + * a dummy PlannerInfo for the given relationId as if the > + * relation is queried with SELECT command. > + */ > +static PlannerInfo * > +GenerateDummySelectPlannerInfoForRelation(Oid relationId) > > "generic function, helper function" -> "generic function. It is a > helper function" > > Fixed. I'll attach the changes in the next email with v12. Thanks, Onder
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Peter, > > 1. src/backend/executor/execReplication.c - build_replindex_scan_key > > - * This is not generic routine, it expects the idxrel to be replication > - * identity of a rel and meet all limitations associated with that. > + * This is not generic routine, it expects the idxrel to be an index > + * that planner would choose if the searchslot includes all the columns > + * (e.g., REPLICA IDENTITY FULL on the source). > */ > -static bool > +static int > build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel, > TupleTableSlot *searchslot) > > > (I know this is not caused by your patch but maybe fix it at the same > time?) > > "This is not generic routine, it expects..." -> "This is not a generic > routine - it expects..." > > Fixed > > 2. > > + IndexInfo *indexInfo PG_USED_FOR_ASSERTS_ONLY; > + > + /* > + * This attribute is an expression, and > + * SuitablePathsForRepIdentFull() was called earlier while the > + * index for subscriber is selected. There, the indexes comprising > + * *only* expressions have already been eliminated. > + * > + * We sanity check this now. > + */ > +#ifdef USE_ASSERT_CHECKING > + indexInfo = BuildIndexInfo(idxrel); > + Assert(!IndexOnlyOnExpression(indexInfo)); > +#endif > > 2a. > "while the index for subscriber is selected..." -> "when the index for > the subscriber was selected...” > > fixed > ~ > > 2b. > Because there is only one declaration in this code block you could > simplify this a bit if you wanted to. > > SUGGESTION > /* > * This attribute is an expression, and > * SuitablePathsForRepIdentFull() was called earlier while the > * index for subscriber is selected. There, the indexes comprising > * *only* expressions have already been eliminated. > * > * We sanity check this now. > */ > #ifdef USE_ASSERT_CHECKING > IndexInfo *indexInfo = BuildIndexInfo(idxrel); > Assert(!IndexOnlyOnExpression(indexInfo)); > #endif > > Makes sense, no reason to declare above > ~~~ > > 3. src/backend/executor/execReplication.c - RelationFindReplTupleByIndex > > + /* Start an index scan. */ > + scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0); > retry: > found = false; > > It might be better to have a blank line before that ‘retry’ label, > like in the original code. > agreed, fixed > > == > > 4. src/backend/replication/logical/relation.c > > +/* see LogicalRepPartMapEntry for details in logicalrelation.h */ > static HTAB *LogicalRepPartMap = NULL; > > Personally, I'd word that something like: > "/* For LogicalRepPartMap details see LogicalRepPartMapEntry in > logicalrelation.h */" > > but YMMV. > I also don't have any strong opinions on that, updated to your suggestion. > > ~~~ > > 5. src/backend/replication/logical/relation.c - > GenerateDummySelectPlannerInfoForRelation > > +/* > + * This is not a generic function, helper function for > + * GetCheapestReplicaIdentityFullPath. The function creates > + * a dummy PlannerInfo for the given relationId as if the > + * relation is queried with SELECT command. > + */ > +static PlannerInfo * > +GenerateDummySelectPlannerInfoForRelation(Oid relationId) > > (mentioned this one in my previous post) > > "This is not a generic function, helper function" -> "This is not a > generic function. It is a helper function" > Yes, applied. > > ~~~ > > 6. src/backend/replication/logical/relation.c - > GetCheapestReplicaIdentityFullPath > > +/* > + * Generate all the possible paths for the given subscriber relation, > + * for the cases that the source relation is replicated via REPLICA > + * IDENTITY FULL. 
The function returns the cheapest Path among the > + * eligible paths, see SuitablePathsForRepIdentFull(). > + * > + * The function guarantees to return a path, because it adds sequential > + * scan path if needed. > + * > + * The function assumes that all the columns will be provided during > + * the execution phase, given that REPLICA IDENTITY FULL guarantees > + * that. > + */ > +static Path * > +GetCheapestReplicaIdentityFullPath(Relation localrel) > > > "for the cases that..." -> "for cases where..." > > sounds good > ~~~ > > 7. > > + /* > + * Currently it is not possible for planner to pick a partial index or > + * indexes only on expressions. We still want to be explicit and eliminate > + * such paths proactively. > > "for planner..." -> "for the planner..." > fixed > > == > > 8. .../subscription/t/032_subscribe_use_index.pl - general > > 8a. > (remove the 'we') > "# we wait until..." -> "# wait until..." X many occurrences > > ~ > > 8b. > (remove the 'we') > "# we show that..." -> “# show that..." X many occurrences > Ok, removed all "we"s in the test > > ~~~ > > 9. > > There is inconsistent wording for some of your test case start/end comments > > 9a. > e.g. > start: SUBSCRIPTION USES INDEX UPDATEs MULTIPLE ROWS > end: SUBSCRIPTION USES INDEX MODIFIES MULTIPLE ROWS > > ~ > > 9b. > e.g. > start: SUBSCRIPTION USES INDEX WITH MULTIPLE COLUMNS > end: SUBSCRIPTION USES INDEX M
Re: pg_create_logical_replication_slot argument incongruency
On Tue, Sep 20, 2022 at 08:41:56AM +0200, Florin Irion wrote: > OK, patch only for the docs attached. Thanks, applied. -- Michael
Re: cataloguing NOT NULL constraints
On 2022-Sep-19, Robert Haas wrote:

> On Wed, Aug 17, 2022 at 2:12 PM Alvaro Herrera wrote:
> > 55489 16devel 1776237=# \d tab
> >                 Table "public.tab"
> >  Column │  Type   │ Collation │ Nullable │ Default
> > ────────┼─────────┼───────────┼──────────┼─────────
> >  a      │ integer │           │ not null │
> > Check constraints:
> >     "tab_a_not_null" CHECK (a IS NOT NULL)
>
> In a table with many columns, most of which are NOT NULL, this is
> going to produce a ton of clutter. I don't like that.
>
> I'm not sure what a good alternative would be, though.

Perhaps that can be solved by displaying the constraint name in the table:

55489 16devel 1776237=# \d tab
                    Table "public.tab"
 Column │  Type   │ Collation │    Nullable    │ Default
────────┼─────────┼───────────┼────────────────┼─────────
 a      │ integer │           │ tab_a_not_null │

(Perhaps we can change the column title also, not sure.)

--
Álvaro Herrera    Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The Gord often wonders why people threaten never to come back after they've been told never to return" (www.actsofgord.com)
Re: cataloguing NOT NULL constraints
On 2022-Sep-19, Isaac Morland wrote: > I thought I saw some discussion about the SQL standard saying that there is > a difference between putting NOT NULL in a column definition, and CHECK > (column_name NOT NULL). So if we're going to take this seriously, Was it Peter E.'s reply to this thread? https://postgr.es/m/bac841ed-b86d-e3c2-030d-02a3db067...@enterprisedb.com because if it wasn't there, then I would like to know what your source is. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end." (2nd Commandment for C programmers)
Re: cataloguing NOT NULL constraints
On 2022-Sep-19, Matthias van de Meent wrote: > I'm not sure on the 'good' part of this alternative, but we could go > with a single row-based IS NOT NULL to reduce such clutter, utilizing > the `ROW() IS NOT NULL` requirement of a row only matching IS NOT NULL > when all attributes are also IS NOT NULL: > > Check constraints: > "tab_notnull_check" CHECK (ROW(a, b, c, d, e) IS NOT NULL) There's no way to mark this NOT VALID individually or validate it afterwards, though. > But the performance of repeated row-casting would probably not be as > good as our current NULL checks The NULL checks would still be mostly done by the attnotnull checks internally, so there shouldn't be too much of a difference. .. though I'm now wondering if there's additional overhead from checking the constraint twice on each row: first the attnotnull bit, then the CHECK itself. Hmm. That's probably quite bad. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Summary function for pg_buffercache
Hi hackers, > Here is a corrected patch v7. To me it seems to be in pretty good > shape, unless cfbot and/or other hackers will report any issues. There was a missing empty line in pg_buffercache.out which made the tests fail. Here is a corrected v8 patch. -- Best regards, Aleksander Alekseev v8-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
Re: Summary function for pg_buffercache
On Tue, 20 Sep 2022 at 13:57, Aleksander Alekseev wrote: > There was a missing empty line in pg_buffercache.out which made the > tests fail. Here is a corrected v8 patch. I was just sending a corrected patch without the missing line. Thanks a lot for all these reviews and the corrected patch. Best, Melih
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
Hi, On 9/20/22 6:30 AM, Michael Paquier wrote: On Tue, Sep 20, 2022 at 12:09:33AM -0400, Tom Lane wrote: You have to assume that somebody (a) has a role or DB name starting with slash, (b) has an explicit reference to that name in their pg_hba.conf, (c) doesn't read the release notes, and (d) doesn't notice that things are misbehaving until after some hacker manages to break into their installation on the strength of the misbehaving entry. OK, I'll grant that the probability of (c) is depressingly close to unity; but each of the other steps seems quite low probability. All four of them happening in one installation is something I doubt will happen. It is the kind of things that could blow up as a CVE and some bad PR for the project, so I cannot get excited about enforcing this new rule in an authentication file (aka before a role is authenticated) while we are talking about 3~4 code paths (?) that would need an extra check to make sure that no instances have such object names. I also have the feeling that having (a), (b) and (d) is low probability. That said, if the user "/john" already exists and has an hba entry, then this entry will still match with the patch. The problem is that all the users that contain "john" would also now match. But things get worse if, say, /a is an existing user and hba entry, as the entry would match any user that contains "a" with the patch. I assume (maybe I should not) that if objects starting with / already exist, there are very good reasons behind them. Then I don't think that preventing their creation in the DDL would help (quite the contrary for the ones that really need them). It looks to me that adding a GUC (off by default) to enable/disable the regexp usage in the hba could be a fair compromise. It won't block any creation starting with a / and won't open more doors (if such objects exist) by default. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
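To make the concern concrete, here is a hypothetical pg_hba.conf fragment under the patch's proposed leading-slash-means-regex syntax (the role name and network below are made up):

```
# today: matches only a role literally named "/john"
# with the patch: read as the regex "john", so "john", "johnny",
# "upjohn", ... would all match this entry unless the pattern is anchored
host    all    /john    192.168.0.0/24    scram-sha-256
```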
Re: pg_create_logical_replication_slot argument incongruency
Thank you! On Tue, 20 Sep 2022 at 12:29, Michael Paquier wrote: > On Tue, Sep 20, 2022 at 08:41:56AM +0200, Florin Irion wrote: > > OK, patch only for the docs attached. > > Thanks, applied. > -- > Michael >
Re: Proposal to use JSON for Postgres Parser format
On Tue, 20 Sept 2022 at 12:00, Alexander Korotkov wrote: > On Tue, Sep 20, 2022 at 7:48 AM Tom Lane wrote: > > Peter Geoghegan writes: > > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane wrote: > > >> Our existing format is certainly not great on those metrics, but > > >> I do not see how "let's use JSON!" is a route to improvement. > > > > > The existing format was designed with developer convenience as a goal, > > > though -- despite my complaints, and in spite of your objections. > > > > As Munro adduces nearby, it'd be a stretch to conclude that the current > > format was designed with any Postgres-related goals in mind at all. > > I think he's right that it's a variant of some Lisp-y dump format that's > > probably far hoarier than even Berkeley Postgres. > > > > > If it didn't have to be easy (or even practical) for developers to > > > directly work with the output format, then presumably the format used > > > internally could be replaced with something lower level and faster. So > > > it seems like the two goals (developer ergonomics and faster > > > interchange format for users) might actually be complementary. > > > > I think the principal mistake in what we have now is that the storage > > format is identical to the "developer friendly" text format (plus or > > minus some whitespace). First we need to separate those. We could > > have more than one equivalent text format perhaps, and I don't have > > any strong objection to basing the text format (or one of them) on > > JSON. > > +1 for considering storage format and text format separately. > > Let's consider what our criteria could be for the storage format. > > 1) Storage effectiveness (shorter is better) and > serialization/deserialization effectiveness (faster is better). On > this criterion, the custom binary format looks perfect. > 2) Robustness in the case of corruption. It seems much easier to > detect the data corruption and possibly make some partial manual > recovery for textual format. > 3) Standartness. It's better to use something known worldwide or at > least used in other parts of PostgreSQL than something completely > custom. From this perspective, JSON/JSONB is better than custom > things. Allow me to add: compressability In the thread surrounding [0] there were complaints about the size of catalogs, and specifically the template database. Significant parts of that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which consists mostly of serialized Nodes. If we're going to replace our current NodeToText infrastructure, we'd better know we can effectively compress this data. In that same thread, I also suggested that we could try to not emit a Node's fields if they contain their default values while serializing; such as the common `:location -1` or `:mynodefield <>`. Those fields still take up space in the format, while conveying no interesting information (the absense of that field in the struct definition would convey the same). It would be useful if this new serialized format would allow us to do similar tricks cheaply. As for JSON vs JSONB for storage: I'm fairly certain that JSONB is less compact than JSON (without taking compression into the picture) due to the 4-byte guaranteed overhead for each jsonb element; while for JSON that is only 2 bytes for each (up to 3 when you consider separators, plus potential extra overhead for escaped values that are unlikely to appear our catalogs). 
Some numbers can be stored more efficiently in JSONB, but only large numbers and small fractions that we're unlikely to hit in system views: a back-of-the-envelope calculation puts the cutoff point of efficient storage between strings-of-decimals and Numeric at >10^12, < -10^11, or very precise fractional values. Kind regards, Matthias van de Meent [0] https://www.postgresql.org/message-id/CAEze2WgGexDM63dOvndLdAWwA6uSmSsc97jmrCuNmrF1JEDK7w%40mail.gmail.com
Re: binary version of pg_current_wal_insert_lsn and pg_walfile_name functions
On Mon, Sep 19, 2022 at 8:19 PM Ashutosh Sharma wrote: > > Hi All, > > Currently, we have pg_current_wal_insert_lsn and pg_walfile_name sql > functions which gives us information about the next wal insert > location and the WAL file that the next wal insert location belongs > to. Can we have a binary version of these sql functions? +1 for the idea in general. As said, pg_waldump seems to be the right candidate. I think we want the lsn of the last WAL record and its info and the WAL file name given an input data directory or just the pg_wal directory or any directory where WAL files are located. For instance, one can use this on an archive location containing archived WAL files or on a node where pg_receivewal is receiving WAL files. Am I missing any other use-cases? pg_waldump currently can't understand compressed and partial files. I think that we need to fix this as well. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
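Purely as a strawman for what such a binary interface could look like — the option name below is hypothetical and does not exist in pg_waldump today:

```sh
# hypothetical invocation; not an existing pg_waldump option
pg_waldump --last-lsn -p /path/to/wal/directory
# imagined output: the LSN of the last WAL record found, its rmgr/info,
# and the name of the WAL file containing it
```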
Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)
On Tue, Sep 20, 2022 at 7:20 AM Michael Paquier wrote: > > The main regression test suite should not include direct calls to > pg_backup_start() or pg_backup_stop() as these depend on wal_level, > and we spend a certain amount of resources in keeping the tests a > maximum portable across such configurations, wal_level = minimal being > one of them. One portable example is in 001_stream_rep.pl. Understood. > > That's a good idea. I'm marking a flag if the label name overflows (> > > MAXPGPATH), later using it in do_pg_backup_start() to error out. We > > could've thrown error directly, but that changes the behaviour - right > > now, first " > > wal_level must be set to \"replica\" or \"logical\" at server start." > > gets emitted and then label name overflow error - I don't want to do > > that. > > - if (strlen(backupidstr) > MAXPGPATH) > + if (state->name_overflowed == true) > ereport(ERROR, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > errmsg("backup label too long (max %d bytes)", > It does not strike me as a huge issue to force a truncation of such > backup label names. 1024 is large enough for basically all users, > in my opinion. Or you could just truncate in the internal logic, but > still complain at MAXPGPATH - 1 as the last byte would be for the zero > termination. In short, there is no need to complicate things with > name_overflowed. We currently allow MAXPGPATH bytes of label name excluding null termination. I don't want to change this behaviour. In the attached v9 patch, I'm truncating the larger names to MAXPGPATH + 1 bytes in backup state (one extra byte for representing that the name has overflown, and another extra byte for null termination). > + StringInfo backup_label; /* backup_label file contents */ > + StringInfo tablespace_map; /* tablespace_map file contents */ > + StringInfo history_file; /* history file contents */ > IMV, repeating a point I already made once upthread, BackupState > should hold none of these. Let's just generate the contents of these > files in the contexts where they are needed, making BackupState > something to rely on to build them in the code paths where they are > necessary. This is going to make the reasoning around the memory > contexts where each one of them is stored much easier and reduce the > changes of bugs in the long-term. I've separated out these variables from the backup state, please see the attached v9 patch. > > I've also taken help of the error callback mechanism to clean up the > > allocated memory in case of a failure. For do_pg_abort_backup() cases, > > I think we don't need to clean the memory because that gets called on > > proc exit (before_shmem_exit()). > > Memory could still bloat while the process running the SQL functions > is running depending on the error code path, anyway. I didn't get your point. Can you please elaborate it? I think adding error callbacks at the right places would free up the memory for us. Please note that we already are causing memory leaks on HEAD today. I addressed the above review comments. I also changed a wrong comment [1] that lies before pg_backup_start() even after the removal of exclusive backup. I'm attaching v9 patch set herewith, 0001 - refactors the backup code with backup state, 0002 - adds error callbacks to clean up the memory allocated for backup variables. Please review them further. [1] * Essentially what this does is to create a backup label file in $PGDATA, * where it will be archived as part of the backup dump. 
The label file * contains the user-supplied label string (typically this would be used * to tell where the backup dump will be stored) and the starting time and * starting WAL location for the dump. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From ab1e86ac7fb75a2d2219c7681ead40faf8c01446 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Tue, 20 Sep 2022 10:04:29 + Subject: [PATCH v9] Refactor backup related code Refactor backup related code, advantages of doing so are following: 1) backup state is more structured now - all in a single structure, callers can create backup_label contents whenever required, either during the pg_backup_start or the pg_backup_stop or in between. 2) no parsing of backup_label file lines now, no error checking for invalid parsing. 3) backup_label and history file contents have most of the things in common, they can now be created within a single function. 4) makes backup related code extensible and readable. This introduces new source files xlogbackup.c/.h for backup related code and adds the new code in there. The xlog.c file has already grown to ~9000 LOC (as of this time). Eventually, we would want to move all the backup related code from xlog.c, xlogfuncs.c, elsewhere to here. Author: Bharath Rupireddy Reviewed-by: Michael Paquier Reviewed-by: Fujii Masao Discussion: ht
Re: why can't a table be part of the same publication as its schema
On Tue, Sep 20, 2022 at 2:57 PM Alvaro Herrera wrote: > > On 2022-Sep-20, Amit Kapila wrote: > > > On Mon, Sep 19, 2022 at 8:46 PM Alvaro Herrera > > wrote: > > > > This seems a pretty arbitrary restriction. It feels like you're adding > > > this restriction precisely so that you don't have to write the code to > > > reject the ALTER .. SET SCHEMA if an incompatible configuration is > > > detected. But we already have such checks in several cases, so I don't > > > see why this one does not seem a good idea. > > > > > I agree that we have such checks at other places as well and one > > somewhat similar is in ATPrepChangePersistence(). > > > > ATPrepChangePersistence() > > { > > ... > > ... > > /* > > * Check that the table is not part of any publication when changing to > > * UNLOGGED, as UNLOGGED tables can't be published. > > */ > > Right, I think this is a sensible approach. > > > However, another angle to look at it is that we try to avoid adding > > restrictions in other DDL commands for defined publications. > > Well, it makes sense to avoid restrictions wherever possible. But here, > the consequence is that you end up with a restriction in the publication > definition that is not very sensible. Imagine if you said "you can't > add schema S because it contains an unlogged table". It's absurd. > > Maybe this can be relaxed in a future release, but it's quite odd. > Yeah, we can relax it in a future release based on some field experience, or maybe we can keep the current restriction of not allowing to add a table when the schema of the table is part of the same publication and try to relax that in a future release based on field experience. > > The intention was to be in sync with FOR ALL TABLES. > > I guess we can change both (FOR ALL TABLES and IN SCHEMA) later. > That sounds reasonable. -- With Regards, Amit Kapila.
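For concreteness, this is roughly the kind of command sequence the thread is debating (a sketch only; object names are illustrative, and whether the last statement is rejected is exactly the question under discussion):

```sql
CREATE SCHEMA sales;
CREATE TABLE sales.orders (id int, note text);
CREATE PUBLICATION pub_sales FOR ALL TABLES IN SCHEMA sales;
-- Adding a table (here with a column list) whose schema is already published
ALTER PUBLICATION pub_sales ADD TABLE sales.orders (id);
```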
Re: Support pg_attribute_aligned and noreturn in MSVC
On Mon, Sep 19, 2022 at 11:21 PM Michael Paquier wrote: > > On Mon, Sep 19, 2022 at 08:51:37PM -0400, James Coleman wrote: > > Yes, fixed. > > The CF bot is failing compilation on Windows: > http://commitfest.cputube.org/james-coleman.html > https://api.cirrus-ci.com/v1/task/5376566577332224/logs/build.log > > There is something going on with noreturn() after applying it to > elog.h: > 01:11:00.468] c:\cirrus\src\include\utils\elog.h(410,45): error C2085: > 'ThrowErrorData': not in formal parameter list (compiling source file > src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj] > [01:11:00.468] c:\cirrus\src\include\mb\pg_wchar.h(701,80): error > C2085: 'pgwin32_message_to_UTF16': not in formal parameter list > (compiling source file src/common/encnames.c) > [c:\cirrus\libpgcommon.vcxproj] > [01:11:00.468] c:\cirrus\src\include\utils\elog.h(411,54): error > C2085: 'pg_re_throw': not in formal parameter list (compiling source > file src/common/hashfn.c) [c:\cirrus\libpgcommon.vcxproj] > > align() seems to look fine, at least. I'd be tempted to apply the > align part first, as that's independently useful for itemptr.h. I don't have access to a Windows machine for testing, but re-reading the documentation it looks like the issue is that our noreturn macro is used after the definition while the MSVC equivalent is used before. I've removed that for now (and commented about it); it's not as valuable anyway since it's mostly an indicator for code analysis (human and machine). James Coleman v3-0001-Support-pg_attribute_aligned-in-MSVC.patch Description: Binary data
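To illustrate the placement difference James describes, here is a minimal sketch (the GCC-style line follows the existing pg_attribute_noreturn() convention in elog.h; the MSVC line shows the standard __declspec spelling, which is why a drop-in macro at the same position does not compile):

```c
/* GCC/Clang style used today: the attribute macro follows the parameter list */
extern void pg_re_throw(void) pg_attribute_noreturn();

/* MSVC requires its equivalent before the declarator */
__declspec(noreturn) void pg_re_throw(void);
```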
Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
On Tue, Sep 20, 2022 at 9:02 AM Nathan Bossart wrote: > > On Mon, Sep 19, 2022 at 03:16:42PM -0700, Nathan Bossart wrote: > > It seems like you want the opposite of pg_walfile_name_offset(). Perhaps > > we could add a function like pg_walfile_offset_lsn() that accepts a WAL > > file name and byte offset and returns the LSN. > > Like so... Yeah, something like this will be handy for sure, but I'm not sure if we want this to be in core. Let's hear from others. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Summary function for pg_buffercache
Hi, Seems like the cfbot tests are passing now: https://cirrus-ci.com/build/4727923671302144 Best, Melih On Tue, 20 Sep 2022 at 14:00, Melih Mutlu wrote: > On Tue, 20 Sep 2022 at 13:57, Aleksander Alekseev wrote: > >> There was a missing empty line in pg_buffercache.out which made the >> tests fail. Here is a corrected v8 patch. >> > > I was just sending a corrected patch without the missing line. > > Thanks a lot for all these reviews and the corrected patch. > > Best, > Melih >
Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures
On Tue, Sep 20, 2022 at 12:57 PM Alvaro Herrera wrote: > > On 2022-Sep-19, Bharath Rupireddy wrote: > > > We have a bunch of messages [1] that have an offset, but not LSN in > > the error message. Firstly, is there an easiest way to figure out LSN > > from offset reported in the error messages? If not, is adding LSN to > > these messages along with offset a good idea? Of course, we can't just > > convert offset to LSN using XLogSegNoOffsetToRecPtr() and report, but > > something meaningful like reporting the LSN of the page that we are > > reading-in or writing-out etc. > > Maybe add errcontext() somewhere that reports the LSN would be > appropriate. For example, the page_read() callbacks have the LSN > readily available, so the ones in backend could install the errcontext > callback; or perhaps ReadPageInternal can do it #ifndef FRONTEND. Not > sure what is best of those options, but either of those sounds better > than sticking the LSN in a lower-level routine that doesn't necessarily > have the info already. All of the error messages [1] have the LSN from which offset was calculated, I think we can just append that to the error messages (something like " offset %u, LSN %X/%X: %m") and not complicate it. Thoughts? [1] errmsg("could not read from WAL segment %s, offset %u: %m", errmsg("could not read from WAL segment %s, offset %u: %m", errmsg("could not write to log file %s " "at offset %u, length %zu: %m", errmsg("unexpected timeline ID %u in WAL segment %s, offset %u", errmsg("could not read from WAL segment %s, offset %u: read %d of %zu", pg_log_error("received write-ahead log record for offset %u with no file open", "invalid magic number %04X in WAL segment %s, offset %u", "invalid info bits %04X in WAL segment %s, offset %u", "invalid info bits %04X in WAL segment %s, offset %u", "unexpected pageaddr %X/%X in WAL segment %s, offset %u", "out-of-sequence timeline ID %u (after %u) in WAL segment %s, offset %u", -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
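As a hypothetical sketch of that wording, assuming the caller already has the page's start LSN in hand (variable names here are assumptions, not taken from an actual call site):

```c
/* Sketch only: xlogfname, offset and targetPagePtr are illustrative names */
ereport(ERROR,
		(errcode_for_file_access(),
		 errmsg("could not read from WAL segment %s, offset %u, LSN %X/%X: %m",
				xlogfname, offset,
				LSN_FORMAT_ARGS(targetPagePtr))));
```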
Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
On Mon, Sep 19, 2022 at 4:42 PM Dmitry Koval wrote: > Thanks for comments and advice! > I thought about this problem and discussed about it with colleagues. > Unfortunately, I don't know of a good general solution. Yeah, me neither. > But for specific situation like this (certain partition is not changing) > we can add CONCURRENTLY modifier. > Our DDL query can be like > > ALTER TABLE...SPLIT PARTITION [CONCURRENTLY]; > > With CONCURRENTLY modifier we can lock partitioned table in > ShareUpdateExclusiveLock mode and split partition - in > AccessExclusiveLock mode. So we don't lock partitioned table in > AccessExclusiveLock mode and can modify other partitions during SPLIT > operation (except split partition). > If smb try to modify split partition, he will receive error "relation > does not exist" at end of operation (because split partition will be drop). I think that a built-in DDL command can't really assume that the user won't modify anything. You'd have to take a ShareLock. But you might be able to have a CONCURRENTLY variant of the command that does the same kind of multi-transaction thing as, e.g., CREATE INDEX CONCURRENTLY. You would probably have to be quite careful about race conditions (e.g. you commit the first transaction and before you start the second one, someone drops or detaches the partition you were planning to merge or split). Might take some thought, but feels possibly doable. I've never been excited enough about this kind of thing to want to put a lot of energy into engineering it, because doing it "manually" feels so much nicer to me, and doubly so given that we now have ATTACH CONCURRENTLY and DETACH CONCURRENTLY, but it does seem like a thing some people would probably use and value. -- Robert Haas EDB: http://www.enterprisedb.com
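For contrast, the "manual" route alluded to above already works with reduced locking today, roughly along these lines (a sketch; table, column, and bound values are illustrative, and only one of the two new halves is shown):

```sql
-- Detach without blocking concurrent use of the parent (runs in two
-- transactions, much like CREATE INDEX CONCURRENTLY)
ALTER TABLE measurements DETACH PARTITION meas_2022 CONCURRENTLY;

-- Build the replacement partition separately, then attach it; ATTACH only
-- needs SHARE UPDATE EXCLUSIVE on the parent
CREATE TABLE meas_2022_h1 (LIKE measurements);
INSERT INTO meas_2022_h1 SELECT * FROM meas_2022 WHERE ts < '2022-07-01';
ALTER TABLE meas_2022_h1
    ADD CONSTRAINT meas_2022_h1_bound
    CHECK (ts >= '2022-01-01' AND ts < '2022-07-01');  -- avoids a validation scan
ALTER TABLE measurements ATTACH PARTITION meas_2022_h1
    FOR VALUES FROM ('2022-01-01') TO ('2022-07-01');
```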
Re: Summary function for pg_buffercache
Hi, Correct me if I’m wrong. The doc says we don’t take locks during pg_buffercache_summary, but I see locks taken in the v8 patch; is that right? ``` Similar to pg_buffercache_pages function pg_buffercache_summary doesn't take buffer manager locks, thus the result is not consistent across all buffers. This is intentional. The purpose of this function is to provide a general idea about the state of shared buffers as fast as possible. Additionally, pg_buffercache_summary allocates much less memory. ``` Regards, Zhang Mingli On Sep 20, 2022, 20:10 +0800, Melih Mutlu wrote: > Hi, > > Seems like the cfbot tests are passing now: > https://cirrus-ci.com/build/4727923671302144 > > Best, > Melih > > > On Tue, 20 Sep 2022 at 14:00, Melih Mutlu > > wrote: > > > On Tue, 20 Sep 2022 at 13:57, Aleksander Alekseev > > > wrote: > > > > > There was a missing empty line in pg_buffercache.out which made the > > > > > tests fail. Here is a corrected v8 patch. > > > > > > > > I was just sending a corrected patch without the missing line. > > > > > > > > Thanks a lot for all these reviews and the corrected patch. > > > > > > > > Best, > > > > Melih
Re: Summary function for pg_buffercache
Hi Zhang, > The doc says we don’t take lock during pg_buffercache_summary, but I see > locks in the v8 patch, Isn’t it? > > ``` > Similar to pg_buffercache_pages function > pg_buffercache_summary doesn't take buffer manager > locks [...] > ``` Correct, the procedure doesn't take the locks of the buffer manager. It does take the locks of every individual buffer. I agree that the text is somewhat confusing, but it is consistent with the current description of pg_buffercache [1]. I think this is a problem worth addressing but it also seems to be out of scope of the proposed patch. [1]: https://www.postgresql.org/docs/current/pgbuffercache.html -- Best regards, Aleksander Alekseev
Re: Summary function for pg_buffercache
Hi, Regards, Zhang Mingli On Sep 20, 2022, 20:43 +0800, Aleksander Alekseev , wrote: > > Correct, the procedure doesn't take the locks of the buffer manager. > It does take the locks of every individual buffer. Ah, now I get it, thanks.
Re: Summary function for pg_buffercache
Hi Zhang, Those are two different locks. The locks taken in the patch are buffer header locks. Such a lock covers only the current buffer and makes that particular buffer's info consistent within itself. However, the lock mentioned in the doc is the buffer manager lock, which would prevent changes on any buffer while it's held. pg_buffercache_summary (and pg_buffercache_pages) does not hold the buffer manager lock, so consistency across all buffers is not guaranteed. For pg_buffercache_pages, self-consistent buffer information is useful since it shows each buffer separately. For pg_buffercache_summary, even self-consistency may not matter much since results are aggregated and we can't see individual buffer information. Consistency across all buffers is also not a concern since its purpose is to give an overall idea about the state of buffers. I see that these two different locks in the same context can be confusing. I hope it is a bit clearer now. Best, Melih
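For reference, the intended usage is simply something like the following (assuming the v8 patch is installed; the exact output columns may still change during review):

```sql
CREATE EXTENSION pg_buffercache;
-- One aggregated row instead of one row per buffer
SELECT * FROM pg_buffercache_summary();
```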
Re: Summary function for pg_buffercache
Hi, On Sep 20, 2022, 20:49 +0800, Melih Mutlu , wrote: > Hi Zhang, > > Those are two different locks. > The locks that are taken in the patch are for buffer headers. This locks only > the current buffer and makes that particular buffer's info consistent within > itself. > > However, the lock mentioned in the doc is for buffer manager which would > prevent changes on any buffer if it's held. > pg_buffercache_summary (and pg_buffercache_pages) does not hold buffer > manager lock. Therefore, consistency across all buffers is not guaranteed. > > For pg_buffercache_pages, self-consistent buffer information is useful since > it shows each buffer separately. > > For pg_buffercache_summary, even self-consistency may not matter much since > results are aggregated and we can't see individual buffer information. > Consistency across all buffers is also not a concern since its purpose is to > give an overall idea about the state of buffers. > > I see that these two different locks in the same context can be confusing. I > hope it is a bit more clear now. > > Best, > Melih Thanks for your explanation, LGTM.
Re: HOT chain validation in verify_heapam()
On Tue, Sep 20, 2022 at 5:00 AM Himanshu Upadhyaya wrote: > Please find it attached. This patch still has no test cases. Just as we have test cases for the existing corruption checks, we should have test cases for these new corruption checks, showing cases where they actually fire. I think I would be inclined to set lp_valid[x] = true in both the redirected and non-redirected case, and then have the very first thing that the second loop does be if (nextoffnum == 0 || !lp_valid[ctx.offnum]) continue. I think that would be more clear about the intent to ignore line pointers that failed validation. Also, if you did it that way, then you could catch the case of a redirected line pointer pointing to another redirected line pointer, which is a corruption condition that the current code does not appear to check. +/* + * Validation via the predecessor array. 1) If the predecessor's + * xmin is aborted or in progress, the current tuples xmin should + * be aborted or in progress respectively. Also both xmin's must + * be equal. 2) If the predecessor's xmin is not frozen, then + * current tuple's shouldn't be either. 3) If the predecessor's + * xmin is equal to the current tuple's xmin, the current tuple's + * cmin should be greater than the predecessor's cmin. 4) If the + * current tuple is not HOT then its predecessor's tuple must not + * be HEAP_HOT_UPDATED. 5) If the current tuple is HOT then its + * predecessor's tuple must be HEAP_HOT_UPDATED. + */ This comment needs to be split up into pieces and the pieces need to be moved closer to the tests to which they correspond. + psprintf("unfrozen tuple was updated to produce a tuple at offset %u which is not frozen", Shouldn't this say "which is frozen"? + * Not a corruption if current tuple is updated/deleted by a + * different transaction, means t_cid will point to cmax (that is + * command id of deleting transaction) and cid of predecessor not + * necessarily will be smaller than cid of current tuple. t_cid I think that the next person who reads this code is likely to understand that the CIDs of different transactions are numerically unrelated. What's less obvious is that if the XID is the same, the newer update must have a higher CID. + * can hold combo command id but we are not worrying here since + * combo command id of the next updated tuple (if present) must be + * greater than combo command id of the current tuple. So here we + * are not checking HEAP_COMBOCID flag and simply doing t_cid + * comparison. I disapprove of ignoring the HEAP_COMBOCID flag. Emitting a message claiming that the CID has a certain value when that's actually a combo CID is misleading, so at least a different message wording is needed in such cases. But it's also not clear to me that the newer update has to have a higher combo CID, because combo CIDs can be reused. If you have multiple cursors open in the same transaction, the updates can be interleaved, and it seems to me that it might be possible for an older CID to have created a certain combo CID after a newer CID, and then both cursors could update the same page in succession and end up with combo CIDs out of numerical order. Unless somebody can make a convincing argument that this isn't possible, I think we should just skip this check for cases where either tuple has a combo CID. 
+if (TransactionIdEquals(pred_xmin, curr_xmin) && +(TransactionIdEquals(curr_xmin, curr_xmax) || + !TransactionIdIsValid(curr_xmax)) && pred_cmin >= curr_cmin) I don't understand the reason for the middle part of this condition -- TransactionIdEquals(curr_xmin, curr_xmax) || !TransactionIdIsValid(curr_xmax). I suppose the comment is meant to explain this, but I still don't get it. If a tuple with XMIN 12345 CMIN 2 is updated to produce a tuple with XMIN 12345 CMIN 1, that's corruption, regardless of what the XMAX of the second tuple may happen to be. +if (HeapTupleHeaderIsHeapOnly(curr_htup) && +!HeapTupleHeaderIsHotUpdated(pred_htup)) +if (!HeapTupleHeaderIsHeapOnly(curr_htup) && +HeapTupleHeaderIsHotUpdated(pred_htup)) I think it would be slightly clearer to write these tests the other way around i.e. check the previous tuple's state first. +if (!TransactionIdIsValid(curr_xmax) && HeapTupleHeaderIsHotUpdated(tuphdr)) +{ +report_corruption(ctx, + psprintf("tuple has been updated, but xmax is 0")); +result = false; +} I guess this message needs to say "tuple has been HOT updated, but xmax is 0" or something like that. -- Robert Haas EDB: http://www.enterprisedb.com
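One way the combo-CID suggestion could look in code, as a sketch only (variable names are reused from the quoted patch, HEAP_COMBOCID is the existing infomask bit; this is not the committed check):

```c
/*
 * Sketch: if either tuple's t_cid is actually a combo CID, its numeric
 * value cannot be compared meaningfully, so skip the cmin-ordering check
 * rather than report a bogus corruption.
 */
if ((pred_htup->t_infomask & HEAP_COMBOCID) == 0 &&
    (curr_htup->t_infomask & HEAP_COMBOCID) == 0 &&
    TransactionIdEquals(pred_xmin, curr_xmin) &&
    pred_cmin >= curr_cmin)
    report_corruption(ctx,
                      psprintf("tuple with cmin %u was updated to produce a tuple at offset %u with cmin %u",
                               pred_cmin, (unsigned) ctx.offnum, curr_cmin));
```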
Re: why can't a table be part of the same publication as its schema
On Mon, Sep 19, 2022 at 11:03 PM Jonathan S. Katz wrote: > For #1 (allowing calls that have schema/table overlap...), there appears > to be both a patch that allows this (reversing[8]), and a suggestion for > dealing with a corner-case that is reasonable, i.e. disallowing adding > schemas to a publication when specifying column-lists. Do we think we > can have consensus on this prior to the RC1 freeze? I am not sure whether we can or should rush a fix in that fast, but I agree with this direction. > For #2 ("ALL TABLES IN SCHEMA" syntax), this was heavily discussed on > the original thread[1][3][4][5][7]. I thought Tom's proposal on the > syntax[3] was reasonable as it "future proofs" for when we allow other > schema-scoped objects to be published and give control over which ones > can be published. All right, well, I still don't like it and think it's confusing, but perhaps I'm in the minority. > I don't think we should change this behavior that's already in logical > replication. While I understand the reasons why "GRANT ... ALL TABLES IN > SCHEMA" has a different behavior (i.e. it's not applied to future > objects) and do not advocate to change it, I have personally been > affected where I thought a permission would be applied to all future > objects, only to discover otherwise. I believe it's more intuitive to > think that "ALL" applies to "everything, always." Nah, there's room for multiple behaviors here. It's reasonable to want to add all the tables currently in the schema to a publication (or grant permissions on them) and it's reasonable to want to include all current and future tables in the schema in a publication (or grant permissions on them) too. The reason I don't like the ALL TABLES IN SCHEMA syntax is that it sounds like the former, but actually is the latter. Based on your link to the email from Tom, I understand now the reason why it's like that, but it's still counterintuitive to me. > For #3 (more superuser-only), in general I do agree that we shouldn't be > adding more of these. However, we have in this release, and not just to > this feature. ALTER SUBSCRIPTION ... SKIP[11] requires superuser. I > think it's easier for us to "relax" privileges (e.g. w/predefined roles) > than to make something "superuser-only" in the future, so I don't see > this being a blocker for v15. The feature will continue to work for > users even if we remove "superuser-only" in the future. Yeah, this is clearly not a release blocker, I think. -- Robert Haas EDB: http://www.enterprisedb.com
Re: On login trigger: take three
On 02.09.2022 18:36, Daniel Gustafsson wrote: This had bitrotted a fair bit, attached is a rebase along with (mostly) documentation fixes. 0001 adds a generic GUC for ignoring event triggers and 0002 adds the login event with event trigger support, and hooks it up to the GUC such that broken triggers wont require single-user mode. Moving the CF entry back to Needs Review. Hello! There is a race around setting and clearing of dathasloginevt. Steps to reproduce: 1. Create a trigger: CREATE FUNCTION on_login_proc() RETURNS event_trigger AS $$ BEGIN RAISE NOTICE 'You are welcome!'; END; $$ LANGUAGE plpgsql; CREATE EVENT TRIGGER on_login_trigger ON login EXECUTE PROCEDURE on_login_proc(); 2. Then drop it, but don't start new sessions: DROP EVENT TRIGGER on_login_trigger; 3. Create another trigger, but don't commit yet: BEGIN; CREATE EVENT TRIGGER on_login_trigger ON login EXECUTE PROCEDURE on_login_proc(); 4. Start a new session. This clears dathasloginevt. 5. Commit the transaction: COMMIT; Now we have a trigger, but it doesn't fire, because dathasloginevt=false. If two sessions create triggers concurrently, one of them will fail. Steps: 1. In the first session, start a transaction and create a trigger: BEGIN; CREATE EVENT TRIGGER on_login_trigger1 ON login EXECUTE PROCEDURE on_login_proc(); 2. In the second session, create another trigger (this query blocks): CREATE EVENT TRIGGER on_login_trigger2 ON login EXECUTE PROCEDURE on_login_proc(); 3. Commit in the first session: COMMIT; The second session fails: postgres=# CREATE EVENT TRIGGER on_login_trigger2 ON login EXECUTE PROCEDURE on_login_proc(); ERROR: tuple concurrently updated What else bothers me is that login triggers execute in an environment under user control and one has to be very careful. The example trigger from the documentation +DECLARE + hour integer = EXTRACT('hour' FROM current_time); + rec boolean; +BEGIN +-- 1. Forbid logging in between 2AM and 4AM. +IF hour BETWEEN 2 AND 4 THEN + RAISE EXCEPTION 'Login forbidden'; +END IF; can be bypassed with PGOPTIONS='-c timezone=...'. Probably this is nothing new and concerns any SECURITY DEFINER function, but still... Best regards, -- Sergey Shinderukhttps://postgrespro.com/
Re: why can't a table be part of the same publication as its schema
On 2022-Sep-13, Kyotaro Horiguchi wrote: > At Mon, 12 Sep 2022 04:26:48 +, "houzj.f...@fujitsu.com" > wrote in > > I feel we'd better compare the syntax with the existing publication command: > > FOR ALL TABLES. If you create a publication FOR ALL TABLES, it means > > publishing > > all the tables in the database *including* tables created in the future. I > > think both the syntax and meaning of ALL TABLES IN SCHEMA are consistent > > with > > the existing FOR ALL TABLES. > > IMHO, I feel closer to Robert. "ALL TABLES IN SCHEMA" sounds like the > concrete tables at the time of invocation. While I agree that it is > not directly comparable to GRANT, What if we remove the ALL keyword from there? That would leave us with "FOR TABLES IN SCHEMA", which seems to better convey that it doesn't restrict to current tables in there. > but if I see "ALTER PUBLICATION p1 ADD SCHEMA s1", I automatically > translate that into "all tables in the schema s1 at the time of using > this publication". ... but that translation is wrong if replication supports other kinds of objects, as it inevitably will in the near future. Clearly the fact that we spell out TABLES there is important. When we add support for sequences, we could have combinations ADD [ALL] TABLES IN SCHEMA s ADD [ALL] SEQUENCES IN SCHEMA s ADD [ALL] TABLES AND SEQUENCES IN SCHEMA s and at that point, the unadorned ADD SCHEMA one will become ambiguous. > At least, it would cause less confusion when it were "ALT PUB p1 DROP > SCEMA s1" aginst "DROP ALL TABLES IN SCHEMA s1". I'm not sure what you mean here. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "En las profundidades de nuestro inconsciente hay una obsesiva necesidad de un universo lógico y coherente. Pero el universo real se halla siempre un paso más allá de la lógica" (Irulan)
Re: On login trigger: take three
> On 20 Sep 2022, at 15:43, Sergey Shinderuk wrote: > > On 02.09.2022 18:36, Daniel Gustafsson wrote: >> This had bitrotted a fair bit, attached is a rebase along with (mostly) >> documentation fixes. 0001 adds a generic GUC for ignoring event triggers and >> 0002 adds the login event with event trigger support, and hooks it up to the >> GUC such that broken triggers won't require single-user mode. Moving the CF >> entry back to Needs Review. > There is a race around setting and clearing of dathasloginevt. Thanks for the report! The whole dathasloginevt mechanism is IMO too clunky and inelegant to go ahead with; I wouldn't be surprised if there are other bugs lurking there. Since the original authors seem to have retired from the patch (I've only rebased it to try and help), I am inclined to mark it as returned with feedback. -- Daniel Gustafsson https://vmware.com/
Re: cataloguing NOT NULL constraints
On Tue, 20 Sept 2022 at 06:56, Alvaro Herrera wrote: The NULL checks would still be mostly done by the attnotnull checks > internally, so there shouldn't be too much of a difference. > > .. though I'm now wondering if there's additional overhead from checking > the constraint twice on each row: first the attnotnull bit, then the > CHECK itself. Hmm. That's probably quite bad. > Another reason to treat NOT NULL-implementing constraints differently. My thinking is that pg_constraint entries for NOT NULL columns are mostly an implementation detail. I've certainly never cared whether I had an actual constraint corresponding to my NOT NULL columns. So I think marking them as such, or a different contype, and excluding them from \d+ display, probably makes sense. Just need to deal with the issue of trying to create a constraint and having its name conflict with a NOT NULL constraint. Could it work to reserve [field name]_notnull for NOT NULL-implementing constraints? I'd be worried about what happens with field renames; renaming the constraint automatically seems a bit weird, but maybe…
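To make the naming concern concrete, a hypothetical sequence (the "<column>_notnull" naming scheme is an assumption for illustration, not what the patch actually does):

```sql
-- Suppose NOT NULL-implementing constraints were auto-named "<column>_notnull"
CREATE TABLE t (a int NOT NULL);                        -- would create "a_notnull"
ALTER TABLE t ADD CONSTRAINT a_notnull CHECK (a > 0);   -- now a name conflict
ALTER TABLE t RENAME COLUMN a TO b;                     -- does "a_notnull" get renamed too?
```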
Re: making relfilenodes 56 bits
On Fri, Sep 9, 2022 at 3:32 PM Dilip Kumar wrote: > > On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar wrote: > > > On a separate note, while reviewing the latest patch I see there is some > > risk of using the unflushed relfilenumber in GetNewRelFileNumber() > > function. Basically, in the current code, the flushing logic is tightly > > coupled with the logging new relfilenumber logic and that might not work > > with all the values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD. So the idea > > is we need to keep the flushing logic separate from the logging, I am > > working on the idea and I will post the patch soon. > > I have fixed the issue, so now we will track nextRelFileNumber, > loggedRelFileNumber and flushedRelFileNumber. So whenever > nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the > loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more > relfilenumbers. And whenever nextRelFileNumber reaches the > flushedRelFileNumber then we will do XlogFlush for WAL upto the last > loggedRelFileNumber. Ideally flushedRelFileNumber should always be > VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can > avoid tracking the flushedRelFileNumber. But I feel keeping track of > the flushedRelFileNumber looks cleaner and easier to understand. For > more details refer to the code in GetNewRelFileNumber(). > Here are a few minor suggestions I came across while reading this patch, might be useful: +#ifdef USE_ASSERT_CHECKING + + { Unnecessary space after USE_ASSERT_CHECKING. -- + return InvalidRelFileNumber;/* placate compiler */ I don't think we needed this after the error on the latest branches. -- + LWLockAcquire(RelFileNumberGenLock, LW_SHARED); + if (shutdown) + checkPoint.nextRelFileNumber = ShmemVariableCache->nextRelFileNumber; + else + checkPoint.nextRelFileNumber = ShmemVariableCache->loggedRelFileNumber; + + LWLockRelease(RelFileNumberGenLock); This is done for the good reason, I think, it should have a comment describing different checkPoint.nextRelFileNumber assignment need and crash recovery perspective. -- +#define SizeOfRelFileLocatorBackend \ + (offsetof(RelFileLocatorBackend, backend) + sizeof(BackendId)) Can append empty parenthesis "()" to the macro name, to look like a function call at use or change the macro name to uppercase? -- + if (val < 0 || val > MAX_RELFILENUMBER) .. if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \ How about adding a macro for this condition as RelFileNumberIsValid()? We can replace all the checks referring to MAX_RELFILENUMBER with this. Regards, Amul
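One possible shape of the macro suggested in that last point, as a sketch (the name follows the review comment and MAX_RELFILENUMBER comes from the patch under review; the actual patch may choose differently):

```c
/* Sketch of the suggested range-check macro; not the patch's definition */
#define RelFileNumberIsValid(relfilenumber) \
	((relfilenumber) >= 0 && (relfilenumber) <= MAX_RELFILENUMBER)
```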
Re: cataloguing NOT NULL constraints
On 2022-Sep-20, Isaac Morland wrote: > On Tue, 20 Sept 2022 at 06:56, Alvaro Herrera > wrote: > > > .. though I'm now wondering if there's additional overhead from checking > > the constraint twice on each row: first the attnotnull bit, then the > > CHECK itself. Hmm. That's probably quite bad. > > Another reason to treat NOT NULL-implementing constraints differently. Yeah. > My thinking is that pg_constraint entries for NOT NULL columns are mostly > an implementation detail. I've certainly never cared whether I had an > actual constraint corresponding to my NOT NULL columns. Naturally, all catalog entries are implementation details; a user never really cares if an entry exists or not, only that the desired semantics are provided. In this case, we want the constraint row because it gives us some additional features, such as the ability to mark NOT NULL constraints NOT VALID and validate them later, which is a useful thing to do in large production databases. We have some hacks to provide part of that functionality using straight CHECK constraints, but you cannot cleanly get the `attnotnull` flag set for a column (which means it's hard to add a primary key, for example). It is also supposed to fix some inconsistencies such as disallowing removal of a constraint on a table when it is implied by a constraint on an ancestor table. Right now we have ad-hoc protections for partitions, but we don't do that for legacy inheritance. That said, the patch I posted for this ~10 years ago used a separate contype and was simpler than what I ended up with now, but amusingly enough it was returned at the time with the argument that it would be better to treat them as normal CHECK constraints; so I want to be very sure that we're not just going around in circles. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
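As an illustration of what the catalog entries are meant to enable, today's CHECK-based workaround looks roughly like this (a sketch; the point of the patch is to get the same two-step validation for NOT NULL itself, with attnotnull set properly):

```sql
-- Existing workaround: add the constraint without scanning the table ...
ALTER TABLE big_table ADD CONSTRAINT a_not_null CHECK (a IS NOT NULL) NOT VALID;
-- ... then validate later, taking only a weak lock during the scan
ALTER TABLE big_table VALIDATE CONSTRAINT a_not_null;
```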
Re: RFC: Logging plan of the running query
On 2022-09-19 17:47, Алена Рыбакина wrote: Thanks for your review and comments! Hi, I'm sorry,if this message is duplicated previous this one, but I'm not sure that the previous message is sent correctly. I sent it from email address a.rybak...@postgrespro.ru and I couldn't send this one email from those address. I've successfully received your mail from both a.rybak...@postgrespro.ru and lena.riback...@yandex.ru. I like idea to create patch for logging query plan. After reviewing this code and notice some moments and I'd rather ask you some questions. Firstly, I suggest some editing in the comment of commit. I think, it is turned out the more laconic and the same clear. I wrote it below since I can't think of any other way to add it. ``` Currently, we have to wait for finishing of the query execution to check its plan. This is not so convenient in investigation long-running queries on production environments where we cannot use debuggers. To improve this situation there is proposed the patch containing the pg_log_query_plan() function which request to log plan of the specified backend process. By default, only superusers are allowed to request log of the plan otherwise allowing any users to issue this request could create cause lots of log messages and it can lead to denial of service. At the next requesting CHECK_FOR_INTERRUPTS(), the target backend logs its plan at LOG_SERVER_ONLY level and therefore this plan will appear in the server log only, not to be sent to the client. Thanks, I have incorporated your comments. Since the latter part of the original message comes from the commit message of pg_log_backend_memory_contexts(43620e328617c), so I left it as it was for consistency. Secondly, I have question about deleting USE_ASSERT_CHECKING in lock.h? It supposed to have been checked in another placed of the code by matching values. I am worry about skipping errors due to untesting with assert option in the places where it (GetLockMethodLocalHash) participates and we won't able to get core file in segfault cases. I might not understand something, then can you please explain to me? Since GetLockMethodLocalHash() is only used for assertions, this is only defined when USE_ASSERT_CHECKING is enabled. This patch uses GetLockMethodLocalHash() not only for the assertion purpose, so I removed "ifdef USE_ASSERT_CHECKING" for this function. I belive it does not lead to skip errors. Thirdly, I have incomprehension of the point why save_ActiveQueryDesc is declared in the pquery.h? I am seemed to save_ActiveQueryDesc to be used in an once time in the ExecutorRun function and its declaration superfluous. I added it in the attached patch. Exactly. Fourthly, it seems to me there are not enough explanatory comments in the code. I also added them in the attached patch. Thanks! | + /* | +* Save value of ActiveQueryDesc before having called | +* ExecutorRun_hook function due to having reset by | +* AbortSubTransaction. | +*/ | + | save_ActiveQueryDesc = ActiveQueryDesc; | ActiveQueryDesc = queryDesc; | | @@ -314,6 +320,7 @@ ExecutorRun(QueryDesc *queryDesc, | else | standard_ExecutorRun(queryDesc, direction, count, execute_once); | | + /* We set the actual value of ActiveQueryDesc */ | ActiveQueryDesc = save_ActiveQueryDesc; Since these processes are needed for nested queries, not only for AbortSubTransaction[1], added comments on it. | +/* Function to handle the signal to output the query plan. 
*/ | extern void HandleLogQueryPlanInterrupt(void); I feel this comment is unnecessary since the explanation of HandleLogQueryPlanInterrupt() is written in explain.c and no functions in explain.h have comments in it. Lastly, I have incomprehension about handling signals since have been unused it before. Could another signal disabled calling this signal to log query plan? I noticed this signal to be checked the latest in procsignal_sigusr1_handler function. Are you concerned that one signal will not be processed when multiple signals are sent in succession? AFAIU both of them are processed since SendProcSignal flags ps_signalFlags for each signal. ``` SendProcSignal(pid_t pid, ProcSignalReason reason, BackendId backendId) { volatile ProcSignalSlot *slot; ...(snip)... 278 if (slot->pss_pid == pid) 279 { 280 /* Atomically set the proper flag */ 281 slot->pss_signalFlags[reason] = true; 282 /* Send signal */ 283 return kill(pid, SIGUSR1); ``` Comments of ProcSignalReason also say 'We can cope with concurrent signals for different reasons'. ```C /* * Reasons for signaling a Postgres child process (a backend or an auxiliary * process, like checkpointer). We can cope with concurrent signals for different * reasons. However, if the same reason is signaled multiple times in quick * succession, the process is likely to observe only one notification of
Re: why can't a table be part of the same publication as its schema
> On Sep 19, 2022, at 8:03 PM, Jonathan S. Katz wrote: > > "When a partitioned table is added to a publication, all of its existing and > future partitions are implicitly considered to be part of the > publication."[10] > > Additionally, this is the behavior that is already present in "FOR ALL > TABLES": > > "Marks the publication as one that replicates changes for all tables in the > database, including tables created in the future."[10] > > I don't think we should change this behavior that's already in logical > replication. The existing behavior in logical replication doesn't have any "IN SCHEMA" qualifiers. > While I understand the reasons why "GRANT ... ALL TABLES IN SCHEMA" has a > different behavior (i.e. it's not applied to future objects) and do not > advocate to change it, I have personally been affected where I thought a > permission would be applied to all future objects, only to discover > otherwise. I believe it's more intuitive to think that "ALL" applies to > "everything, always." The conversation is focusing on what "ALL TABLES" means, but the ambiguous part is what "IN SCHEMA" means. In GRANT it means "currently in schema, computed now." We are about to create confusion by adding the "IN SCHEMA" phrase to publication commands meaning "later in schema, computed then." A user who diligently consults the documentation for one command to discover what "IN SCHEMA" means may fairly, but wrongly, assume it means the same thing in another command. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal to use JSON for Postgres Parser format
On 2022-Sep-20, Matthias van de Meent wrote: > Allow me to add: compressibility > > In the thread surrounding [0] there were complaints about the size of > catalogs, and specifically the template database. Significant parts of > that (688kB of 8080kB in a fresh PG14 database) are in pg_rewrite, which > consists mostly of serialized Nodes. If we're going to replace our > current NodeToText infrastructure, we'd better know we can effectively > compress this data. True. Currently, the largest ev_action values compress pretty well. I think if we wanted this to be more succinct, we would have to invent some binary format -- perhaps something like Protocol Buffers: it'd be stored in the binary format in catalogs, but for output it would be converted into something easy to read (we already do this for pg_statistic_ext_data for example). We'd probably lose compressibility, but that'd be okay because the binary format would already remove most of the redundancy by nature. Do we want to go there? -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Java is clearly an example of money oriented programming" (A. Stepanov)
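For anyone who wants to reproduce the measurement being discussed, something along these lines gives a rough idea (numbers vary by version and build):

```sql
-- Total on-disk size of pg_rewrite, and the serialized-node portion of it
SELECT pg_size_pretty(pg_total_relation_size('pg_rewrite'));
SELECT pg_size_pretty(sum(pg_column_size(ev_action))::bigint) FROM pg_rewrite;
```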
Re: [PATCH]Feature improvement for MERGE tab completion
On 2022-Sep-18, Fujii Masao wrote: > The tab-completion code for MERGE was added in the middle of that for LOCK TABLE. > This seems to be an oversight in the commit that originally supported > tab-completion > for MERGE. I fixed this issue. Argh, thanks. > "MERGE" was tab-completed just after "EXPLAIN" or "EXPLAIN ANALYZE", etc. > Since "INTO" always follows "MERGE", it's better to complete with "MERGE INTO" > there. I replaced "MERGE" with "MERGE INTO" in those tab-completions. OK, that would be similar to REFRESH MATERIALIZED VIEW. The rules starting at line 4111 make me a bit nervous, since nothing restricts them to operating only on MERGE lines. I don't think it's a real problem since USING is not terribly common anyway. Likewise for the ones with WHEN [NOT] MATCHED. I kinda wish we had a way to search for stuff like "keyword MERGE appears earlier in the command", but we don't have that. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "I'm always right, but sometimes I'm more right than other times." (Linus Torvalds)
Re: Tree-walker callbacks vs -Wdeprecated-non-prototype
I wrote: > (That verbiage is from the gcc manual; clang seems to act the same > except that -Wcast-function-type is selected by -Wall, or perhaps is > even on by default.) Nah, scratch that: the reason -Wcast-function-type is on is that we explicitly enable it, and have done so since de8feb1f3 (v14). I did not happen to see this warning with gcc because the test runs I made with this patch already had c35ba141d, whereas I did my clang test on another machine that wasn't quite up to HEAD. So we should have good warning coverage for bogus walker signatures on both compilers. regards, tom lane
Re: A question about wording in messages
On 2022-Sep-14, Kyotaro Horiguchi wrote: > At Tue, 13 Sep 2022 22:38:46 -0400, Tom Lane wrote in > > Kyotaro Horiguchi writes: > > > I saw the following message recently modified. > > >> This controls the maximum distance we can read ahead in the WAL to > > >> prefetch referenced data blocks. > > > Maybe the "we" means "PostgreSQL program and you" but I see it > > > somewhat out of place. > > > > +1, I saw that today and thought it was outside our usual style. > > The whole thing is awfully verbose for a GUC description, too. > > Maybe > > > > "Maximum distance to read ahead in WAL to prefetch data blocks." I failed to notice this issue. I agree it's unusual and +1 for changing it. > It seems to sufficiently work for average users and rather easy to > read, but it looks a short description. > So, taking the middle of them, how about the following? > > Short: Buffer size for reading ahead in the WAL during recovery. > Extra: This controls the maximum distance to read ahead in WAL to prefetch > data blocks." But why do we care that it's short? We don't need it to be long .. we only need it to explain what it needs to explain. After spending way too much time editing this line, I ended up with exactly what Tom proposed, so +1 for his version. I think "This controls" adds nothing very useful, and we don't have it anywhere else, except tcp_keepalives_count from where I also propose to remove it. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans) >From 5b8bf15ed5d9f1a21150039da33a557f027640d4 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 20 Sep 2022 18:19:34 +0200 Subject: [PATCH] fix some GUC strings --- src/backend/utils/misc/guc_tables.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 550e95056c..ab3140ff3a 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2591,7 +2591,7 @@ struct config_int ConfigureNamesInt[] = { {"wal_decode_buffer_size", PGC_POSTMASTER, WAL_RECOVERY, gettext_noop("Buffer size for reading ahead in the WAL during recovery."), - gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced data blocks."), + gettext_noop("Maximum distance to read ahead in the WAL to prefetch referenced data blocks."), GUC_UNIT_BYTE }, &wal_decode_buffer_size, @@ -3248,7 +3248,7 @@ struct config_int ConfigureNamesInt[] = { {"tcp_keepalives_count", PGC_USERSET, CONN_AUTH_TCP, gettext_noop("Maximum number of TCP keepalive retransmits."), - gettext_noop("This controls the number of consecutive keepalive retransmits that can be " + gettext_noop("Number of consecutive keepalive retransmits that can be " "lost before a connection is considered dead. A value of 0 uses the " "system default."), }, -- 2.30.2
default sorting behavior for index
Hi, I was looking at this check in src/backend/parser/parse_utilcmd.c w.r.t. constraints: if (indclass->values[i] != defopclass || attform->attcollation != index_rel->rd_indcollation[i] || attoptions != (Datum) 0 || index_rel->rd_indoption[i] != 0) ereport(ERROR, (errcode(ERRCODE_WRONG_OBJECT_TYPE), errmsg("index \"%s\" column number %d does not have default sorting behavior", index_name, i + 1), errdetail("Cannot create a primary key or unique constraint using such an index."), It seems this check first came in via `Indexes with INCLUDE columns and their support in B-tree`. If the index has DESC sort order, why can't it be used to back a constraint? A concrete example would help me understand this. Thanks
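A concrete case that reaches this check, in case it helps (a sketch against a stock build; the DESC ordering sets rd_indoption for the column, which trips the error):

```sql
CREATE TABLE t (a int NOT NULL);
CREATE UNIQUE INDEX t_a_desc_idx ON t (a DESC);
ALTER TABLE t ADD CONSTRAINT t_a_key UNIQUE USING INDEX t_a_desc_idx;
-- ERROR:  index "t_a_desc_idx" column number 1 does not have default sorting behavior
-- DETAIL:  Cannot create a primary key or unique constraint using such an index.
```

One plausible reason for the restriction (not stated in the code itself): the constraint syntax has no way to carry a non-default ordering, opclass, or options, so a constraint recreated from a dump could not rebuild an equivalent index.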
Re: cataloguing NOT NULL constraints
On Tue, Sep 20, 2022 at 10:39 AM Alvaro Herrera wrote: > That said, the patch I posted for this ~10 years ago used a separate > contype and was simpler than what I ended up with now, but amusingly > enough it was returned at the time with the argument that it would be > better to treat them as normal CHECK constraints; so I want to be very > sure that we're not just going around in circles. I don't have an intrinsic view on whether we ought to have one contype or two, but I think this comment of yours from a few messages ago is right on point: ".. though I'm now wondering if there's additional overhead from checking the constraint twice on each row: first the attnotnull bit, then the CHECK itself. Hmm. That's probably quite bad." For that exact reason, it seems absolutely necessary to be able to somehow identify the "redundant" check constraints, so that we don't pay the expense of revalidating them. Another contype would be one way of identifying such constraints, but it could probably be done in other ways, too. Perhaps it could even be discovered dynamically, like when we build a relcache entry. I actually have no idea what design is best. I am a little confused as to why we want to do this, though. While we're on the topic of what is more complicated and simpler, what functionality do we get out of adding all of these additional catalog entries that then have to be linked back to the corresponding attnotnull markings? And couldn't we get that functionality in some much simpler way? Like, if you want to track whether the NOT NULL constraint has been validated, we could add an attnotnullvalidated column, or probably better, change the existing attnotnull column to a character used as an enum, or maybe an integer bit-field of some kind. I'm not trying to blow up your patch with dynamite or anything, but I'm a little suspicious that this may be one of those cases where pgsql-hackers discussed turns a complicated project into an even more complicated project to no real benefit. One thing that I don't particularly like about this whole design is that it feels like it creates a bunch of catalog bloat. Now all of the attnotnull flags also generate additional pg_constraint rows. The catalogs in the default install will be bigger than before, and the catalogs after user tables are created will be more bigger. If we get some nifty benefit out of all that, cool! But if we're just anti-optimizing the catalog storage out of some feeling that the result will be intellectually purer than some competing design, maybe we should reconsider. It's not stupid to optimize for common special cases, and making a column as NOT NULL is probably at least one and maybe several orders of magnitude more common than putting some other CHECK constraint on it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: making relfilenodes 56 bits
On Fri, Sep 9, 2022 at 6:02 AM Dilip Kumar wrote: > [ new patch ] +typedef pg_int64 RelFileNumber; This seems really random to me. First, why isn't this an unsigned type? OID is unsigned and I don't see a reason to change to a signed type. But even if we were going to change to a signed type, why pg_int64? That is declared like this: /* Define a signed 64-bit integer type for use in client API declarations. */ typedef PG_INT64_TYPE pg_int64; Surely this is not a client API declaration Note that if we change this a lot of references to INT64_FORMAT will need to become UINT64_FORMAT. I think we should use int64 at the SQL level, because we don't have an unsigned 64-bit SQL type, and a signed 64-bit type can hold 56 bits. So it would still be Int64GetDatum((int64) rd_rel->relfilenode) or similar. But internally I think using unsigned is cleaner. + * RelFileNumber is unique within a cluster. Not really, because of CREATE DATABASE. Probably just drop this line. Or else expand it: we never assign the same RelFileNumber twice within the lifetime of the same cluster, but there can be multiple relations with the same RelFileNumber e.g. because CREATE DATABASE duplicates the RelFileNumber values from the template database. But maybe we don't need this here, as it's already explained in relfilelocator.h. +ret = (int8) (tag->relForkDetails[0] >> BUFTAG_RELNUM_HIGH_BITS); Why not declare ret as ForkNumber instead of casting twice? +uint64 relnum; + +Assert(relnumber <= MAX_RELFILENUMBER); +Assert(forknum <= MAX_FORKNUM); + +relnum = relnumber; Perhaps it'd be better to write uint64 relnum = relnumber instead of initializing on a separate line. +#define RELNUMBERCHARS 20 /* max chars printed by %llu */ Maybe instead of %llu we should say UINT64_FORMAT (or INT64_FORMAT if there's some reason to stick with a signed type). +elog(ERROR, "relfilenumber is out of bound"); It would have to be "out of bounds", with an "s". But maybe "is too large" would be better. +nextRelFileNumber = ShmemVariableCache->nextRelFileNumber; +loggedRelFileNumber = ShmemVariableCache->loggedRelFileNumber; +flushedRelFileNumber = ShmemVariableCache->flushedRelFileNumber; Maybe it would be a good idea to asset that next <= flushed and flushed <= logged? +#ifdef USE_ASSERT_CHECKING + +{ +RelFileLocatorBackend rlocator; +char *rpath; Let's add a comment here, like "Because the RelFileNumber counter only ever increases and never wraps around, it should be impossible for the newly-allocated RelFileNumber to already be in use. But, if Asserts are enabled, double check that there's no main-fork relation file with the new RelFileNumber already on disk." +elog(ERROR, "cannot forward RelFileNumber during recovery"); forward -> set (or advance) +if (relnumber >= ShmemVariableCache->loggedRelFileNumber) It probably doesn't make any difference, but to me it seems better to test flushedRelFileNumber rather than logRelFileNumber here. What do you think? /* * We set up the lockRelId in case anything tries to lock the dummy - * relation. Note that this is fairly bogus since relNumber may be - * different from the relation's OID. It shouldn't really matter though. - * In recovery, we are running by ourselves and can't have any lock - * conflicts. While syncing, we already hold AccessExclusiveLock. + * relation. Note we are setting relId to just FirstNormalObjectId which + * is completely bogus. It shouldn't really matter though. In recovery, + * we are running by ourselves and can't have any lock conflicts. 
While + * syncing, we already hold AccessExclusiveLock. */ rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid; -rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber; +rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId; Boy, this makes me uncomfortable. The existing logic is pretty bogus, and we're replacing it with some other bogus thing. Do we know whether anything actually does try to use this for locking? One notable difference between the existing logic and your change is that, with the existing logic, we use a bogus value that will differ from one relation to the next, whereas with this change, it will always be the same value. Perhaps el->rd_lockInfo.lockRelId.relId = (Oid) rlocator.relNumber would be a more natural adaptation? +#define CHECK_RELFILENUMBER_RANGE(relfilenumber)\ +do {\ +if ((relfilenumber) < 0 || (relfilenumber) > MAX_RELFILENUMBER) \ +ereport(ERROR, \ +errcode(ERRCODE_INVALID_PARAMETER_VALUE), \ +errmsg("relfilenumber %lld is out of range",\ +(long long) (relfilenumber))); \ +} while (0) Here, you take the approach of casting the relfilenumber to
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
On Mon, Sep 19, 2022 at 9:09 PM Tom Lane wrote: > You have to assume that somebody (a) has a role or DB name starting > with slash, (b) has an explicit reference to that name in their > pg_hba.conf, (c) doesn't read the release notes, and (d) doesn't > notice that things are misbehaving until after some hacker manages > to break into their installation on the strength of the misbehaving > entry. OK, I'll grant that the probability of (c) is depressingly > close to unity; but each of the other steps seems quite low probability. > All four of them happening in one installation is something I doubt > will happen. I can't argue with (a) or (b), but (d) seems decently likely to me. If your normal user base consists of people who are authorized to access your system, what clues would you have that your HBA is silently failing open? --Jacob
Re: A question about wording in messages
On 2022-Sep-16, Amit Kapila wrote: > We only want to commit the changes to 15 (a) if those fix a problem > introduced in 15, or (b) those are for a bug fix. I think the error > message improvements fall into none of those categories, we can map it > to (b) but I feel those are an improvement in the current messages and > don't seem critical to me. IMO at least the GUC one does fix a problem related to the wording of a user-visible message, which also flows into the translations. I prefer to have that one fixed in 15 also. The other messages (errors) don't seem very interesting because they're not as visible, so I don't care if those are not backpatched. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "This is a foot just waiting to be shot" (Andrew Dunstan)
Auto explain after query timeout
Hopefully I'm not missing something obvious, but as far as I know there's no way to configure auto_explain to log a query's plan when statement_timeout fires. I'd like to look into this at some point, but I'm wondering if anyone has thought about it before, and, if so, is there any obvious impediment to doing so? Thanks, James Coleman
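For context, a minimal sketch of the behavior today (settings are illustrative): even with auto_explain loaded, a query cancelled by statement_timeout leaves no plan in the log, since as far as I can tell auto_explain only logs from ExecutorEnd, which a cancelled query never reaches.

```sql
-- Illustrative session setup; the threshold values are arbitrary
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '5s';
SET statement_timeout = '10s';
-- This is cancelled after 10s, and no plan reaches the server log
SELECT pg_sleep(30);
```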
Tighten pg_get_object_address argument checking
For publication schemas (OBJECT_PUBLICATION_NAMESPACE) and user mappings (OBJECT_USER_MAPPING), pg_get_object_address() checked the array length of the second argument, but not of the first argument. If the first argument was too long, it would just silently ignore everything but the first argument. Fix that by checking the length of the first argument as well. I wouldn't be surprised if there were more holes like this in this area. I just happened to find these while working on something related.From eb80c87a083464160a1436e5f983df840b282085 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 20 Sep 2022 13:37:27 -0400 Subject: [PATCH] Tighten pg_get_object_address argument checking For publication schemas (OBJECT_PUBLICATION_NAMESPACE) and user mappings (OBJECT_USER_MAPPING), pg_get_object_address() checked the array length of the second argument, but not of the first argument. If the first argument was too long, it would just silently ignore everything but the first argument. Fix that by checking the length of the first argument as well. --- src/backend/catalog/objectaddress.c | 10 -- src/test/regress/expected/object_address.out | 16 +++- src/test/regress/sql/object_address.sql | 2 +- 3 files changed, 20 insertions(+), 8 deletions(-) diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c index 8377b4f7d4d1..27616ac2ad26 100644 --- a/src/backend/catalog/objectaddress.c +++ b/src/backend/catalog/objectaddress.c @@ -2239,10 +2239,16 @@ pg_get_object_address(PG_FUNCTION_ARGS) */ switch (type) { + case OBJECT_PUBLICATION_NAMESPACE: + case OBJECT_USER_MAPPING: + if (list_length(name) != 1) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("name list length must be exactly %d", 1))); + /* fall through to check args length */ + /* FALLTHROUGH */ case OBJECT_DOMCONSTRAINT: case OBJECT_CAST: - case OBJECT_USER_MAPPING: - case OBJECT_PUBLICATION_NAMESPACE: case OBJECT_PUBLICATION_REL: case OBJECT_DEFACL: case OBJECT_TRANSFORM: diff --git a/src/test/regress/expected/object_address.out b/src/test/regress/expected/object_address.out index 4117fc27c9a5..cbb99c7b9f94 100644 --- a/src/test/regress/expected/object_address.out +++ b/src/test/regress/expected/object_address.out @@ -105,7 +105,7 @@ BEGIN ('text search template'), ('text search configuration'), ('policy'), ('user mapping'), ('default acl'), ('transform'), ('operator of access method'), ('function of access method'), - ('publication relation') + ('publication namespace'), ('publication relation') LOOP FOR names IN VALUES ('{eins}'), ('{addr_nsp, zwei}'), ('{eins, zwei, drei}') LOOP @@ -285,10 +285,10 @@ WARNING: error for policy,{eins,zwei,drei},{}: schema "eins" does not exist WARNING: error for policy,{eins,zwei,drei},{integer}: schema "eins" does not exist WARNING: error for user mapping,{eins},{}: argument list length must be exactly 1 WARNING: error for user mapping,{eins},{integer}: user mapping for user "eins" on server "integer" does not exist -WARNING: error for user mapping,{addr_nsp,zwei},{}: argument list length must be exactly 1 -WARNING: error for user mapping,{addr_nsp,zwei},{integer}: user mapping for user "addr_nsp" on server "integer" does not exist -WARNING: error for user mapping,{eins,zwei,drei},{}: argument list length must be exactly 1 -WARNING: error for user mapping,{eins,zwei,drei},{integer}: user mapping for user "eins" on server "integer" does not exist +WARNING: error for user mapping,{addr_nsp,zwei},{}: name list length must be exactly 1 +WARNING: error 
for user mapping,{addr_nsp,zwei},{integer}: name list length must be exactly 1 +WARNING: error for user mapping,{eins,zwei,drei},{}: name list length must be exactly 1 +WARNING: error for user mapping,{eins,zwei,drei},{integer}: name list length must be exactly 1 WARNING: error for default acl,{eins},{}: argument list length must be exactly 1 WARNING: error for default acl,{eins},{integer}: unrecognized default ACL object type "i" WARNING: error for default acl,{addr_nsp,zwei},{}: argument list length must be exactly 1 @@ -313,6 +313,12 @@ WARNING: error for function of access method,{addr_nsp,zwei},{}: name list leng WARNING: error for function of access method,{addr_nsp,zwei},{integer}: name list length must be at least 3 WARNING: error for function of access method,{eins,zwei,drei},{}: argument list length must be exactly 2 WARNING: error for function of access meth
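To illustrate how the tightened check surfaces to a caller, here is a sketch based on the regression output above (the user-mapping and server names are hypothetical):

SELECT * FROM pg_get_object_address('user mapping', '{alice,bogus}', '{some_server}');
-- previously the extra element "bogus" was silently ignored; with this fix the
-- call now errors out up front with: name list length must be exactly 1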
Re: CFM Manager
On Thu, Sep 8, 2022 at 2:34 PM Jacob Champion wrote: > I still have yet to update the section "5 to 7 days before end of CF" > and onward. Well, I've saved the hardest part for last... Ibrar, Hamid, have the checklist rewrites been helpful so far? Are you planning on doing an (optional!) triage, and if so, are there any pieces in particular you'd like me to document? --Jacob
Re: oat_post_create expected behavior
On Tue, 2022-08-02 at 13:30 -0700, Mary Xu wrote: > > Right, same thing I'm saying. I also think we should discourage > > people from doing cowboy CCIs inside their OAT hooks, because that > > makes the testability problem even worse. Maybe that means we > > need to uniformly move the CREATE hooks to after a system-provided > > CCI, but I've not thought hard about the implications of that. > > I like this approach, however, I am relatively new to the PG scene > and > am not sure how or what I should look into in terms of the > implications that Tom mentioned. Are there any tips? What should be > the next course of action here? I could update my patch to always > call > CCI before the create hooks. I didn't see a clear consensus that we should call OAT_POST_CREATE after CCI, so I went ahead and updated the comment. We can always update the behavior later when we do have consensus, but until that time, at least the comment will be more helpful. If you are satisfied you can mark the CF issue as "committed", or you can leave it open if you think it's still unresolved. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: Support tls-exporter as channel binding for TLSv1.3
On Mon, Sep 19, 2022 at 5:54 PM Michael Paquier wrote: > X509_get_signature_nid() has been introduced in 1.0.2. > SSL_export_keying_material() is older than that, being present since > 1.0.1. Considering the fact that we want to always have > tls-server-end-point as default, it seems to me that we should always > publish SCRAM-SHA-256-PLUS and support channel binding when building > with OpenSSL >= 1.0.2. The same can be said about the areas where we > have HAVE_BE_TLS_GET_[PEER_]CERTIFICATE_HASH. Should we advertise support even if the client is too old to provide an extended master secret? > I was planning to continue working on this patch based on your latest > review. Anyway, as that's originally your code, I am fine to let you > take the lead here. Just let me know which way you prefer, and I'll > stick to it :) Well, I'm working on a next version, but it's ballooning in complexity as I try to navigate the fix for OpenSSL 1.0.1 (which is currently failing the tests, unsurprisingly). You mentioned not wanting to add maintenance burden for 1.0.1, and I'm curious to see whether the approach you have in mind would be easier than what mine is turning out to be. Maybe we can compare notes? --Jacob
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: > I have gone through the thread, and I'd agree with getting more > granularity when it comes to assigning ACLs to relations rather than > just an on/off switch for the objects of a given type would be nice. > I've been looking at the whole use of AclMode and AclItem in the code, > and I don't quite see why a larger size could have a noticeable > impact. There are a few things that could handle a large number of > AclItems, though, say for array operations like aclupdate(). These > could be easily checked with some micro-benchmarking or some SQL > queries that emulate a large number of items in aclitem[] arrays. I performed a few quick tests with a couple thousand ACLs on my laptop, and I'm consistently seeing a 4.3% regression. > Any impact for the column sizes of the catalogs holding ACL > information? Just asking while browsing the patch set. Since each aclitem requires 16 bytes instead of 12, I assume so. However, in my testing, I hit a "row is too big" error with the same number of aclitems in a pg_class row before and after the change. I might be missing something in my patch, or maybe I am misunderstanding how arrays of aclitems are stored on disk. > Some comments in utils/acl.h need a refresh as the number of lower and > upper bits looked at from ai_privs changes. Oops, I missed that one. I fixed it in the attached patch set. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 15e9a60b990070d36f4e1f749cbfbb638b37a666 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 7 Sep 2022 22:25:29 -0700 Subject: [PATCH v6 1/4] Change AclMode from a uint32 to a uint64. --- src/backend/nodes/outfuncs.c| 2 +- src/bin/pg_upgrade/check.c | 35 + src/include/catalog/pg_type.dat | 2 +- src/include/nodes/parsenodes.h | 6 +++--- src/include/utils/acl.h | 28 +- 5 files changed, 54 insertions(+), 19 deletions(-) diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 60610e3a4b..dbb9ad52da 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -528,7 +528,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node) WRITE_BOOL_FIELD(lateral); WRITE_BOOL_FIELD(inh); WRITE_BOOL_FIELD(inFromCl); - WRITE_UINT_FIELD(requiredPerms); + WRITE_UINT64_FIELD(requiredPerms); WRITE_OID_FIELD(checkAsUser); WRITE_BITMAPSET_FIELD(selectedCols); WRITE_BITMAPSET_FIELD(insertedCols); diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c index f1bc1e6886..615a53a864 100644 --- a/src/bin/pg_upgrade/check.c +++ b/src/bin/pg_upgrade/check.c @@ -28,6 +28,7 @@ static void check_for_incompatible_polymorphics(ClusterInfo *cluster); static void check_for_tables_with_oids(ClusterInfo *cluster); static void check_for_composite_data_type_usage(ClusterInfo *cluster); static void check_for_reg_data_type_usage(ClusterInfo *cluster); +static void check_for_aclitem_data_type_usage(ClusterInfo *cluster); static void check_for_jsonb_9_4_usage(ClusterInfo *cluster); static void check_for_pg_role_prefix(ClusterInfo *cluster); static void check_for_new_tablespace_dir(ClusterInfo *new_cluster); @@ -107,6 +108,13 @@ check_and_dump_old_cluster(bool live_check) check_for_reg_data_type_usage(&old_cluster); check_for_isn_and_int8_passing_mismatch(&old_cluster); + /* + * PG 16 increased the size of the 'aclitem' type, which breaks the on-disk + * format for existing data. 
+ */ + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500) + check_for_aclitem_data_type_usage(&old_cluster); + /* * PG 14 changed the function signature of encoding conversion functions. * Conversions from older versions cannot be upgraded automatically @@ -1319,6 +1327,33 @@ check_for_reg_data_type_usage(ClusterInfo *cluster) check_ok(); } +/* + * check_for_aclitem_data_type_usage + * + * aclitem changed its storage format in 16, so check for it. + */ +static void +check_for_aclitem_data_type_usage(ClusterInfo *cluster) +{ + char output_path[MAXPGPATH]; + + prep_status("Checking for incompatible aclitem data type in user tables"); + + snprintf(output_path, sizeof(output_path), "tables_using_aclitem.txt"); + + if (check_for_data_type_usage(cluster, "pg_catalog.aclitem", output_path)) + { + pg_log(PG_REPORT, "fatal"); + pg_fatal("Your installation contains the \"aclitem\" data type in user tables.\n" + "The internal format of \"aclitem\" changed in PostgreSQL version 16\n" + "so this cluster cannot currently be upgraded. You can drop the\n" + "problem columns and restart the upgrade. A list of the problem\n" + "columns is in the file:\n" + "%s", output_path); + } + else + check_ok(); +} /* * check_for_jsonb_9_4_usage() diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat index df45879463..484dec39e8 100644 --- a/src/include/catalog/pg_type.dat +
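For reference, the kind of SQL-level emulation of a large aclitem[] array that Michael suggests could look roughly like this (a sketch only; the table and role names are made up):

CREATE TABLE acl_bench (i int);
DO $$
BEGIN
    FOR n IN 1..2000 LOOP
        EXECUTE format('CREATE ROLE acl_bench_role_%s', n);
        EXECUTE format('GRANT SELECT ON acl_bench TO acl_bench_role_%s', n);
    END LOOP;
END $$;

Each GRANT rewrites the relacl array, so timing a run like this before and after the patch is one way to approximate the per-aclitem overhead discussed above.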
Re: Auto explain after query timeout
On Tue Sep 20, 2022 at 10:34 AM PDT, James Coleman wrote: > Hopefully I'm not missing something obvious, but as far as I know > there's no way to configure auto explain to work fire > statement_timeout fires. I believe you're correct. > I'd like to look into this at some point, but I'm wondering if anyone > has thought about it before, and, if so, is there any obvious > impediment to doing so? This would be a neat feature. Since the changes would be fairly localized to the contrib module, this would be a great first patch for someone new to contributing. This can be exposed as a new GUC auto_explain.log_on_statement_timeout. I wish our conventions allowed for creating hierarchies of GUC parameters, e.g. auto_explain.when.statement_timeout. For someone who would like to achieve this in the field today, I believe they can set auto_explain.log_min_duration equal to, or less than, statement_timeout. Best regards, Gurjeet http://Gurje.et
Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)
Here is a rebased patch for cfbot. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 9ae1f5409ddeee17b99a9565247da885cbb86be9 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 22 Mar 2022 15:35:34 -0700 Subject: [PATCH v6 1/1] Disallow setting XMAX_COMMITTED and XMAX_LOCK_ONLY together. Even though Postgres doesn't set both the XMAX_COMMITTED and XMAX_LOCK_ONLY infomask bits on the same tuple header, the heap visibility logic still accepts it. However, other functions like compute_new_xmax_infomask() seem to assume that this bit pattern is not possible. This change marks this bit pattern as disallowed, removes the heap visibility logic that handles it, and adds checks like those for other disallowed infomask bit combinations (e.g., XMAX_COMMITTED and XMAX_IS_MULTI). Besides simplifying the heap visibility logic a bit, this change aims to reduce ambiguity about the legal tuple header states. Note that this change also disallows XMAX_COMMITTED together with the special pre-v9.3 locked-only bit pattern that HEAP_XMAX_IS_LOCKED_ONLY checks for. This locked-only bit pattern may still be present on servers pg_upgraded from pre-v9.3 versions. --- contrib/amcheck/verify_heapam.c | 19 src/backend/access/heap/README.tuplock | 2 +- src/backend/access/heap/heapam.c| 10 +++ src/backend/access/heap/heapam_visibility.c | 97 ++--- src/backend/access/heap/hio.c | 2 + 5 files changed, 96 insertions(+), 34 deletions(-) diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c index d33f33f170..f74f88afc5 100644 --- a/contrib/amcheck/verify_heapam.c +++ b/contrib/amcheck/verify_heapam.c @@ -663,6 +663,25 @@ check_tuple_header(HeapCheckContext *ctx) */ } + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && + (ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY)) + { + report_corruption(ctx, + pstrdup("locked-only should not be marked committed")); + + /* As above, do not skip further checks. */ + } + + /* also check for pre-v9.3 lock-only bit pattern */ + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && + HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask)) + { + report_corruption(ctx, + pstrdup("tuple with HEAP_XMAX_EXCL_LOCK set and HEAP_XMAX_IS_MULTI unset should not be marked committed")); + + /* As above, do not skip further checks. */ + } + if (infomask & HEAP_HASNULL) expected_hoff = MAXALIGN(SizeofHeapTupleHeader + BITMAPLEN(ctx->natts)); else diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock index 6441e8baf0..4e565e60ab 100644 --- a/src/backend/access/heap/README.tuplock +++ b/src/backend/access/heap/README.tuplock @@ -152,4 +152,4 @@ The following infomask bits are applicable: whether the XMAX is a TransactionId or a MultiXactId. We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit -is set. +or the HEAP_XMAX_LOCK_ONLY bit is set. diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index eb811d751e..616df576c3 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -5115,6 +5115,16 @@ l5: MultiXactStatus status; MultiXactStatus new_status; + /* + * Currently we don't allow XMAX_LOCK_ONLY and XMAX_COMMITTED to both be + * set in a tuple header, so cross-check. + * + * Note that this also checks for the special locked-only bit pattern + * that may be present on servers that were pg_upgraded from pre-v9.3 + * versions. 
+ */ + Assert(!HEAP_XMAX_IS_LOCKED_ONLY(old_infomask)); + if (old_infomask2 & HEAP_KEYS_UPDATED) status = MultiXactStatusUpdate; else diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c index 6e33d1c881..e6ee8ff1fa 100644 --- a/src/backend/access/heap/heapam_visibility.c +++ b/src/backend/access/heap/heapam_visibility.c @@ -78,6 +78,31 @@ #include "utils/snapmgr.h" +/* + * InfomaskAssertions() + * + * Checks for invalid infomask bit patterns. + */ +static inline void +InfomaskAssertions(HeapTupleHeader tuple) +{ + /* + * XMAX_COMMITTED and XMAX_LOCK_ONLY cannot both be set in a tuple header. + * + * Note that this also checks for the special locked-only bit pattern that + * may be present on servers that were pg_upgraded from pre-v9.3 versions. + */ + Assert(!((tuple->t_infomask & HEAP_XMAX_COMMITTED) && + HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))); + + /* + * XMAX_COMMITTED and XMAX_IS_MULTI cannot be be set in a tuple header. + */ + Assert(!((tuple->t_infomask & HEAP_XMAX_COMMITTED) && + (tuple->t_infomask & HEAP_XMAX_IS_MULTI))); +} + + /* * SetHintBits() * @@ -113,6 +138,8 @@ static inline void SetHintBits(HeapTupleHeader tuple, Buffer buffer, uint16 infomask, TransactionId xid) { + InfomaskAssertions(tuple); + if (TransactionIdIsValid(xid)) { /* NB: xid must be known committed here!
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 2:12 PM Gurjeet wrote: > > On Tue Sep 20, 2022 at 10:34 AM PDT, James Coleman wrote: > > Hopefully I'm not missing something obvious, but as far as I know > > there's no way to configure auto explain to work fire > > statement_timeout fires. > > I believe you're correct. > > > I'd like to look into this at some point, but I'm wondering if anyone > > has thought about it before, and, if so, is there any obvious > > impediment to doing so? > > This would be a neat feature. Since the changes would be fairly > localized to the contrib module, this would be a great first patch for > someone new to contributing. > > This can be exposed at a new GUC auto_explain.log_on_statement_timeout. > I wish our conventions allowed for creating hierarchies of GUC > parameters, e.g. auto_explain.when.statmeent_timeout. > > For someone who would like to achieve this in the field today, I believe > they can set auto_explain.log_min_duration equal to, or less than, > statement_timeout. Either I'm missing something (and/or this was fixed in a later PG version), but I don't think this is how it works. We have this specific problem now: we set auto_explain.log_min_duration to 200 (ms) and statement_timeout set to 30s, but when a statement times out we do not get the plan logged with auto-explain. James Coleman
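For concreteness, the configuration being described is essentially the following (values taken from the report above), and with it a statement cancelled by the timeout still produces no auto_explain output:

LOAD 'auto_explain';                        -- unless already preloaded
SET auto_explain.log_min_duration = 200;    -- milliseconds
SET statement_timeout = '30s';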
Re: Support tls-exporter as channel binding for TLSv1.3
On Tue, Sep 20, 2022 at 11:01 AM Jacob Champion wrote: > Well, I'm working on a next version, but it's ballooning in complexity > as I try to navigate the fix for OpenSSL 1.0.1 (which is currently > failing the tests, unsurprisingly). To be more specific: I think I'm hitting the case that Heikki pointed out several years ago [1]: > The problematic case is when e.g. the server > only supports tls-unique and the client only supports > tls-server-end-point. What we would (usually) like to happen, is to fall > back to not using channel binding. But it's not clear how to make that > work, and still protect from downgrade attacks. The problem was deferred when tls-unique was removed. We might have to actually solve it now. bcc: Heikki, in case he would like to weigh in. --Jacob [1] https://www.postgresql.org/message-id/ec787074-2305-c6f4-86aa-6902f98485a4%40iki.fi
Re: A question about wording in messages
Alvaro Herrera writes: > After spending way too much time editing this line, I ended up with > exactly what Tom proposed, so +1 for his version. I think "This > controls" adds nothing very useful, and we don't have it anywhere else, > except tcp_keepalives_count from where I also propose to remove it. LGTM. regards, tom lane
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 2:35 PM James Coleman wrote: > Either I'm missing something (and/or this was fixed in a later PG > version), but I don't think this is how it works. We have this > specific problem now: we set auto_explain.log_min_duration to 200 (ms) > and statement_timeout set to 30s, but when a statement times out we do > not get the plan logged with auto-explain. I think you're correct. auto_explain uses the ExecutorEnd hook, but that hook will not be fired in the case of an error. Indeed, if an error has already been thrown, it would be unsafe to try to auto-explain anything. For instance -- and this is just one problem of probably many -- ExplainTargetRel() performs catalog lookups, which is not OK in a failed transaction. To make this work, I think you'd need a hook that fires just BEFORE the error is thrown. However, previous attempts to introduce hooks into ProcessInterrupts() have not met with a warm response from Tom, so it might be a tough sell. And maybe for good reason. I see at least two problems. The first is that explaining a query is a pretty complicated operation that involves catalog lookups and lots of complicated stuff, and I don't think that it would be safe to do all of that at any arbitrary point in the code where ProcessInterrupts() happened to fire. What if one of the syscache lookups misses the cache and we have to open the underlying catalog? Then AcceptInvalidationMessages() will fire, but we don't currently assume that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if the running query has called a user-defined function or procedure which is running DDL which is in the middle of changing catalog state for some relation involved in the query at the moment that the statement timeout arrives? I feel like there are big problems here. The other problem I see is that ProcessInterrupts() is our mechanism for allowing people to escape from queries that would otherwise run forever by hitting ^C. But what if the explain code goes crazy and itself needs to be interrupted, when we're already inside ProcessInterrupts()? Maybe that would work out OK... or maybe it wouldn't. I'm not really sure. But it's another reason to be very, very cautious about putting complicated logic inside ProcessInterrupts(). -- Robert Haas EDB: http://www.enterprisedb.com
Re: Auto explain after query timeout
On Tue Sep 20, 2022 at 11:34 AM PDT, James Coleman wrote: > On Tue, Sep 20, 2022 at 2:12 PM Gurjeet wrote: > > > > For someone who would like to achieve this in the field today, I believe > > they can set auto_explain.log_min_duration equal to, or less than, > > statement_timeout. > > Either I'm missing something (and/or this was fixed in a later PG > version), but I don't think this is how it works. We have this > specific problem now: we set auto_explain.log_min_duration to 200 (ms) > and statement_timeout set to 30s, but when a statement times out we do > not get the plan logged with auto-explain. My DBA skills are rusty, so I'll take your word for it. If this is the current behaviour, I'm inclined to treat this as a bug, or at least a desirable improvement, and see if auto_explain can be improved to emit the plan on statement_timeout. From what I understand, the behaviour of auto_explain is that it waits for the query to finish before it emits the plan. That plan would be very useful for diagnosing long-running queries that are still running. Many a time, you encounter such queries in production workloads, and reproducing such a scenario later on is either undesirable, expensive, or even impossible. So being able to see the plan of a query that has crossed auto_explain.log_min_duration as soon as possible would be highly desirable. Best regards, Gurjeet http://Gurje.et
Support logical replication of large objects
Hello, I’m working on a patch to support logical replication of large objects (LOBs). This is a useful feature when a database in logical replication has lots of tables, functions and other objects that change over time, such as during an online cross-major-version upgrade. As an example, this lets users replicate large objects between different PostgreSQL major versions. The topic was previously discussed in [1]. Moreover, we need to address the following three challenges. I have worked on some designs and would appreciate feedback:

1. Replication of the change stream of LOBs
My suggestion is that we add a check when any LOB function or API is called and executed in the backend, and then log a simple SQL command in WAL to do the replication. Take lo_unlink as an example[2]: we can create a “simple” SQL command like SELECT lo_unlink(); and log it in WAL, so that the replica only needs to replay that “simple” SQL command and unlink the LOBs accordingly.
Pros:
a. We do not need to add much additional functionality.
b. For most of the LOB-related APIs, we don’t need to touch whole LOBs, except for the creation of LOBs.
Cons:
a. For the creation of LOBs, we may need to split the whole LOB content into WAL, which will increase the volume of WAL and replicated writes dramatically. This could be avoided if we can make sure the original file is publicly shared, such as a URL from cloud storage, or already exists on the replica host as well.

2. Initializing replication of LOBs
When a subscription is established, LOBs in the source should be replicated even if they do not exist on the replica. Here are two options to approach this problem:
Option 1: Support LOB-related system catalogs in logical replication. We can make an exception at this line[3] in the check_publication_add_schema function.
Pros: All required LOBs can be transferred to the replica.
Cons: There is currently no support for logical replication of system catalogs.
Option 2: Run a function or a script from the source instance when it detects that logical replication has been established. The point is that we can ask the source to replicate whole LOBs when a new subscription is created. Maybe we can copy the LOB-related system catalogs, replicate the copy to the replica, and then restore the original LOBs on the replica from that copy.
Cons: This will increase the volume of WAL and replicated writes dramatically.
I currently do not have a preference for either option. I would like to hear whether others have thoughts on how we could approach this.

3. OID conflicts
A common case is that the OID we want to publish is already used on the subscriber.
Option 1 (my recommendation): Create or update an existing system catalog to map the OID when a conflict happens. Maybe we could add another column, say mapping_oid, to the system catalog pg_largeobject_metadata on the replica. When the replica detects that the OID the source is replicating (e.g. 16400) is already used on the replica, the replica could store 16400 as mapping_oid and create a new OID (e.g. 16500) to be used locally, so whatever operation is done on 16400 on the source only needs to be performed on 16500 on the replica.
Cons: We would need to add additional columns to the system catalog.
Option 2: Raise an error message on the replica and let the user handle it manually.
Cons: This is not user-friendly.
Please let me know your thoughts.
Borui Yang Amazon RDS/Aurora for PostgreSQL [1] https://www.postgresql.org/message-id/VisenaEmail.16.35d9e854e3626e81.15cd0de93df%40tc7-visena [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/libpq/be-fsstubs.c#l312 [3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/catalog/pg_publication.c#l98
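To make point 1 above concrete, the idea is roughly this (a sketch, reusing the example OID from the OID-conflict discussion): when a backend on the publisher executes

SELECT lo_unlink(16400);

the server would WAL-log an equivalent “simple” statement for the subscriber to replay, rather than shipping the large object's contents through the change stream.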
Re: why can't a table be part of the same publication as its schema
On 9/20/22 10:55 AM, Mark Dilger wrote: On Sep 19, 2022, at 8:03 PM, Jonathan S. Katz wrote: "When a partitioned table is added to a publication, all of its existing and future partitions are implicitly considered to be part of the publication."[10] Additionally, this is the behavior that is already present in "FOR ALL TABLES": "Marks the publication as one that replicates changes for all tables in the database, including tables created in the future."[10] I don't think we should change this behavior that's already in logical replication. The existing behavior in logical replication doesn't have any "IN SCHEMA" qualifiers. This behavior exists "FOR ALL TABLES" without the "IN SCHEMA" qualifier. This was discussed multiple times on the original thread[1]. While I understand the reasons why "GRANT ... ALL TABLES IN SCHEMA" has a different behavior (i.e. it's not applied to future objects) and do not advocate to change it, I have personally been affected where I thought a permission would be applied to all future objects, only to discover otherwise. I believe it's more intuitive to think that "ALL" applies to "everything, always." The conversation is focusing on what "ALL TABLES" means, but the ambiguous part is what "IN SCHEMA" means. In GRANT it means "currently in schema, computed now." We are about to create confusion by adding the "IN SCHEMA" phrase to publication commands meaning "later in schema, computed then." A user who diligently consults the documentation for one command to discover what "IN SCHEMA" means may fairly, but wrongly, assume it means the same thing in another command. I tried to diligently read the sections where we talk about granting + privileges[2][3] to see what it says about "ALL * IN SCHEMA". Unless I missed it, and I read through it twice, it does not explicitly state whether or not "GRANT" applies to all objects at only that given moment, or to future objects of that type which are created in that schema. Maybe the behavior is implied or is part of the standard, but it's not currently documented. We do link to "ALTER DEFAULT PRIVILEGES" at the bottom of the GRANT[2] docs, but we don't give any indication as to why. (This is also to say we should document in GRANT that ALL * IN SCHEMA does not apply to future objects; if you need that behavior use ALTER DEFAULT PRIVILEGES. Separate thread :) I understand there is a risk of confusion of the similar grammar across commands, but the current command in logical replication has this is building on the existing behavior. Thanks, Jonathan [1] https://www.postgresql.org/message-id/flat/CALDaNm0OANxuJ6RXqwZsM1MSY4s19nuH3734j4a72etDwvBETQ%40mail.gmail.com [2] https://www.postgresql.org/docs/current/sql-grant.html [3] https://www.postgresql.org/docs/current/ddl-priv.html OpenPGP_signature Description: OpenPGP digital signature
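For anyone following along, the distinction being discussed is the one between these two commands (hypothetical role and schema names):

-- applies only to the tables that exist in schema foo when the command runs:
GRANT SELECT ON ALL TABLES IN SCHEMA foo TO bob;
-- needed to also cover tables created in foo later (and note that default
-- privileges only affect objects later created by the role(s) they are set for):
ALTER DEFAULT PRIVILEGES IN SCHEMA foo GRANT SELECT ON TABLES TO bob;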
Re: why can't a table be part of the same publication as its schema
(RMT hat on, unless otherwise noted) On 9/20/22 9:42 AM, Robert Haas wrote: On Mon, Sep 19, 2022 at 11:03 PM Jonathan S. Katz wrote: For #1 (allowing calls that have schema/table overlap...), there appears to be both a patch that allows this (reversing[8]), and a suggestion for dealing with a corner-case that is reasonable, i.e. disallowing adding schemas to a publication when specifying column-lists. Do we think we can have consensus on this prior to the RC1 freeze? I am not sure whether we can or should rush a fix in that fast, but I agree with this direction. The RMT met today to discuss this. We did agree that the above is an open item that should be resolved before this release. While it is an accepted pattern for us to "ERROR" on unsupported behavior and then later introduce said behavior, we do agree with Peter's original post in this thread and would like it resolved. As for the state of the fix, the patch has been iterated on and Amit felt ready to commit it[1]. We do want to hear how others feel about this, but the folks behind this feature have been working on this patch since this was reported. For #2 ("ALL TABLES IN SCHEMA" syntax), this was heavily discussed on the original thread[1][3][4][5][7]. I thought Tom's proposal on the syntax[3] was reasonable as it "future proofs" for when we allow other schema-scoped objects to be published and give control over which ones can be published. All right, well, I still don't like it and think it's confusing, but perhaps I'm in the minority. The RMT discussed this as well. The RMT feels that there should not be any changes to syntax/behavior for this release. This doesn't preclude future work in this area (e.g. having a toggle for "all future behavior"), but based on all the discussions and existing behavior in this feature, we do not see a need to make changes or delay the release on this. I don't think we should change this behavior that's already in logical replication. While I understand the reasons why "GRANT ... ALL TABLES IN SCHEMA" has a different behavior (i.e. it's not applied to future objects) and do not advocate to change it, I have personally been affected where I thought a permission would be applied to all future objects, only to discover otherwise. I believe it's more intuitive to think that "ALL" applies to "everything, always." Nah, there's room for multiple behaviors here. It's reasonable to want to add all the tables currently in the schema to a publication (or grant permissions on them) and it's reasonable to want to include all current and future tables in the schema in a publication (or grant permissions on them) too. The reason I don't like the ALL TABLES IN SCHEMA syntax is that it sounds like the former, but actually is the latter. Based on your link to the email from Tom, I understand now the reason why it's like that, but it's still counterintuitive to me. I understand your view on "multiple behaviors" and I do agree with your reasoning. I still think we should leave this as is, but perhaps this opens up an option we add later to specify the behavior. For #3 (more superuser-only), in general I do agree that we shouldn't be adding more of these. However, we have in this release, and not just to this feature. ALTER SUBSCRIPTION ... SKIP[11] requires superuser. I think it's easier for us to "relax" privileges (e.g. w/predefined roles) than to make something "superuser-only" in the future, so I don't see this being a blocker for v15. 
The feature will continue to work for users even if we remove "superuser-only" in the future. Yeah, this is clearly not a release blocker, I think. The RMT concurs. We do recommend future work on "relaxing" the superuser-only requirement. Thanks, Jonathan [1] https://www.postgresql.org/message-id/CAA4eK1LDhoBM8K5uVme8PZ%2BkxNOfVpRh%3DoO42JtFdqBgBuj1bA%40mail.gmail.com OpenPGP_signature Description: OpenPGP digital signature
Re: why can't a table be part of the same publication as its schema
> On Sep 20, 2022, at 12:36 PM, Jonathan S. Katz wrote: > > This behavior exists "FOR ALL TABLES" without the "IN SCHEMA" qualifier. This > was discussed multiple times on the original thread[1]. Yes, nobody is debating that as far as I can see. And I do take your point that this stuff was discussed in other threads quite a while back. > I tried to diligently read the sections where we talk about granting + > privileges[2][3] to see what it says about "ALL * IN SCHEMA". Unless I missed > it, and I read through it twice, it does not explicitly state whether or not > "GRANT" applies to all objects at only that given moment, or to future > objects of that type which are created in that schema. Maybe the behavior is > implied or is part of the standard, but it's not currently documented. Interesting. Thanks for that bit of research. > We do link to "ALTER DEFAULT PRIVILEGES" at the bottom of the GRANT[2] docs, > but we don't give any indication as to why. > > (This is also to say we should document in GRANT that ALL * IN SCHEMA does > not apply to future objects; Yes, I agree this should be documented. > if you need that behavior use ALTER DEFAULT PRIVILEGES. Separate thread :) > > I understand there is a risk of confusion of the similar grammar across > commands, but the current command in logical replication has this is building > on the existing behavior. I don't complain that it is building on the existing behavior. I'm *only* concerned about the keywords we're using for this. Consider the following:

-- AS ADMIN
CREATE USER bob NOSUPERUSER;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA foo TO bob;
SET ROLE bob;
CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;

We're going to have that fail in pg15 because the FOR ALL TABLES IN SCHEMA option is reserved to superusers. But we agreed that was a stop-gap solution that we'd potentially loosen in the future. Certainly we'll need wiggle room in the syntax to perform that loosening:

--- Must be superuser for this in pg15, and in subsequent releases.
CREATE PUBLICATION bobs_pub FOR ALL FUTURE TABLES IN SCHEMA foo;

--- Not supported in pg15, but reserved for some future pg versions to allow
--- non-superusers to create publications on tables currently in schema foo,
--- assuming they have sufficient privileges on those tables
CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;

Doing it this way makes the syntax consistent between the GRANT...TO bob and the CREATE PUBLICATION bobs_pub. Surely this makes more sense? I'm not a huge fan of the keyword "FUTURE" here, but I found a reference to another database that uses that keyword for what I think is a similar purpose. We should choose *something* for this, though, if we want things to be rational going forward. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making C function declaration parameter names consistent with corresponding definition names
On Mon, Sep 19, 2022 at 11:36 PM Peter Geoghegan wrote: > Attached revision v4 fixes those pg_dump patch items. > > It also breaks out the ecpg changes into their own patch. I pushed much of this just now. All that remains to bring the entire codebase into compliance is the ecpg patch and the pg_dump patch. Those two areas are relatively tricky. But it's now unlikely that I'll need to push a commit that makes relatively many CF patches stop applying against HEAD -- that part is over. Once we're done with ecpg and pg_dump, we can talk about the actual practicalities of formally adopting a project policy on consistent parameter names. I mostly use clang-tidy via my editor's support for the clangd language server -- clang-tidy is primarily a linter, so it isn't necessarily run in bulk all that often. I'll need to come up with instructions for running clang-tidy from the command line that are easy to follow. I've found that the run_clang_tidy script (AKA run-clang-tidy.py) works, but the whole experience feels cobbled together. I think that we really need something like a build target for this -- something comparable to what we do to support GCOV. That would also allow us to use additional clang-tidy checks, which might be useful. We might even find it useful to come up with some novel check of our own. Apparently it's not all that difficult to write one from scratch, to implement custom rules. There are already custom rules for big open source projects such as the Linux Kernel, Chromium, and LLVM itself. -- Peter Geoghegan
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 3:06 PM Robert Haas wrote: > > On Tue, Sep 20, 2022 at 2:35 PM James Coleman wrote: > > Either I'm missing something (and/or this was fixed in a later PG > > version), but I don't think this is how it works. We have this > > specific problem now: we set auto_explain.log_min_duration to 200 (ms) > > and statement_timeout set to 30s, but when a statement times out we do > > not get the plan logged with auto-explain. > > I think you're correct. auto_explain uses the ExecutorEnd hook, but > that hook will not be fired in the case of an error. Indeed, if an > error has already been thrown, it would be unsafe to try to > auto-explain anything. For instance -- and this is just one problem of > probably many -- ExplainTargetRel() performs catalog lookups, which is > not OK in a failed transaction. > > To make this work, I think you'd need a hook that fires just BEFORE > the error is thrown. However, previous attempts to introduce hooks > into ProcessInterrupts() have not met with a wam response from Tom, so > it might be a tough sell. And maybe for good reason. I see at least > two problems. The first is that explaining a query is a pretty > complicated operation that involves catalog lookups and lots of > complicated stuff, and I don't think that it would be safe to do all > of that at any arbitrary point in the code where ProcessInterrupts() > happened to fire. What if one of the syscache lookups misses the cache > and we have to open the underlying catalog? Then > AcceptInvalidationMessages() will fire, but we don't currently assume > that any old CHECK_FOR_INTERRUPTS() can process invalidations. What if > the running query has called a user-defined function or procedure > which is running DDL which is in the middle of changing catalog state > for some relation involved in the query at the moment that the > statement timeout arrives? I feel like there are big problems here. > > The other problem I see is that ProcessInterrupts() is our mechanism > for allowing people to escape from queries that would otherwise run > forever by hitting ^C. But what if the explain code goes crazy and > itself needs to be interrupted, when we're already inside > ProcessInterrupts()? Maybe that would work out OK... or maybe it > wouldn't. I'm not really sure. But it's another reason to be very, > very cautious about putting complicated logic inside > ProcessInterrupts(). This is exactly the kind of background I was hoping someone would provide; thank you, Robert. It seems like one could imagine addressing all of these by having one of: - A safe explain (e.g., disallow catalog access) that is potentially missing information. - A safe way to interrupt queries such as "safe shutdown" of a node (e.g., a seq scan could stop returning tuples early) and allow a configurable buffer of time after the statement timeout before firing a hard abort of the query (and transaction). Both of those seem like a significant amount of work. Alternatively I wonder if it's possible (this would maybe assume no catalog changes in the current transaction -- or at least none that would be referenced by the current query) to open a new transaction (with the same horizon information) and duplicate the plan over to that transaction and run the explain there. This way you do it *after* the error is raised. That's some serious spit-balling -- I'm not saying that's doable, just trying to imagine how one might comprehensively address the concerns. Does any of that seem at all like a path you could imagine being fruitful? 
Thanks, James Coleman
Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans
On Mon, Sep 12, 2022 at 2:01 PM Peter Geoghegan wrote: > I'd like to talk about one such technique on this thread. The attached > WIP patch reduces the size of xl_heap_freeze_page records by applying > a simple deduplication process. Attached is v2, which I'm just posting to keep CFTester happy. No real changes here. -- Peter Geoghegan From c555f716a79c7de1f26a5e1e87265f2702b980ee Mon Sep 17 00:00:00 2001 From: Peter Geoghegan Date: Sat, 30 Jul 2022 10:47:59 -0700 Subject: [PATCH v2] Shrink freeze WAL records via deduplication. --- src/include/access/heapam.h| 25 ++ src/include/access/heapam_xlog.h | 34 +-- src/backend/access/heap/heapam.c | 310 - src/backend/access/heap/vacuumlazy.c | 39 +--- src/backend/access/rmgrdesc/heapdesc.c | 4 +- 5 files changed, 289 insertions(+), 123 deletions(-) diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 9dab35551..579ef32b0 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -99,6 +99,19 @@ typedef enum HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */ } HTSV_Result; +/* heap_prepare_freeze_tuple state describing how to freeze a tuple */ +typedef struct HeapFreezeTuple +{ + /* Fields describing how to process tuple */ + uint16 t_infomask2; + uint16 t_infomask; + TransactionId xmax; + uint8 frzflags; + + /* Page offset number for tuple */ + OffsetNumber offset; +} HeapFreezeTuple; + /* * function prototypes for heap access method * @@ -164,6 +177,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer, struct TM_FailureData *tmfd); extern void heap_inplace_update(Relation relation, HeapTuple tuple); +extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple, + TransactionId relfrozenxid, + TransactionId relminmxid, + TransactionId cutoff_xid, + TransactionId cutoff_multi, + HeapFreezeTuple *frz, + bool *totally_frozen, + TransactionId *relfrozenxid_out, + MultiXactId *relminmxid_out); +extern void heap_freeze_prepared_tuples(Relation rel, Buffer buffer, + TransactionId latestRemovedXid, + HeapFreezeTuple *tuples, int ntuples); extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId relfrozenxid, TransactionId relminmxid, TransactionId cutoff_xid, TransactionId cutoff_multi); diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h index 34220d93c..4e3a023eb 100644 --- a/src/include/access/heapam_xlog.h +++ b/src/include/access/heapam_xlog.h @@ -315,34 +315,36 @@ typedef struct xl_heap_inplace /* * This struct represents a 'freeze plan', which is what we need to know about - * a single tuple being frozen during vacuum. + * a group of tuples being frozen from the same page. */ /* 0x01 was XLH_FREEZE_XMIN */ #define XLH_FREEZE_XVAC 0x02 #define XLH_INVALID_XVAC 0x04 -typedef struct xl_heap_freeze_tuple +typedef struct xl_heap_freeze_plan { TransactionId xmax; - OffsetNumber offset; + uint16 ntuples; uint16 t_infomask2; uint16 t_infomask; uint8 frzflags; -} xl_heap_freeze_tuple; +} xl_heap_freeze_plan; /* * This is what we need to know about a block being frozen during vacuum * - * Backup block 0's data contains an array of xl_heap_freeze_tuple structs, - * one for each tuple. + * Backup block 0's data contains an array of xl_heap_freeze_plan structs. 
*/ typedef struct xl_heap_freeze_page { - TransactionId cutoff_xid; - uint16 ntuples; + TransactionId latestRemovedXid; + uint16 nplans; + + /* FREEZE PLANS FOLLOW */ + /* OFFSET NUMBERS FOLLOW */ } xl_heap_freeze_page; -#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, ntuples) + sizeof(uint16)) +#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, nplans) + sizeof(uint16)) /* * This is what we need to know about setting a visibility map bit @@ -401,20 +403,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record); extern const char *heap2_identify(uint8 info); extern void heap_xlog_logical_rewrite(XLogReaderState *r); -extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer, - TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples, - int ntuples); -extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple, - TransactionId relfrozenxid, - TransactionId relminmxid, - TransactionId cutoff_xid, - TransactionId cutoff_multi, - xl_heap_freeze_tuple *frz, - bool *totally_frozen, - TransactionId *relfrozenxid_out, - MultiXactId *relminmxid_out); -extern void heap_execute_freeze_tuple(HeapTupleHeader tuple, - xl_heap_freeze_tuple *frz); extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer, Buffer vm_buffer, TransactionId cutoff_xid, u
Re: Auto explain after query timeout
On Tue, Sep 20, 2022 at 5:08 PM James Coleman wrote: > - A safe explain (e.g., disallow catalog access) that is potentially > missing information. This would be pretty useless I think, because you'd be missing all relation names. > - A safe way to interrupt queries such as "safe shutdown" of a node > (e.g., a seq scan could stop returning tuples early) and allow a > configurable buffer of time after the statement timeout before firing > a hard abort of the query (and transaction). This might be useful, but it seems extremely difficult to get working. You'd not only have to design the safe shutdown mechanism itself, but also find a way to safely engage it at the right times. > Alternatively I wonder if it's possible (this would maybe assume no > catalog changes in the current transaction -- or at least none that > would be referenced by the current query) to open a new transaction > (with the same horizon information) and duplicate the plan over to > that transaction and run the explain there. This way you do it *after* > the error is raised. That's some serious spit-balling -- I'm not > saying that's doable, just trying to imagine how one might > comprehensively address the concerns. Doesn't work, because the new transaction's snapshot wouldn't be the same as that of the old one. Imagine that you create a table and run a query on it in the same transaction. Then you migrate the plan tree to a new transaction and try to find out the table name. But in the new transaction, that table doesn't exist: it was destroyed by the previous rollback. Honestly I have no very good ideas about how to create the feature you want here. I guess the only thing I can think of is to separate the EXPLAIN process into two phases: a first phase that runs when the plan tree is set up and gathers all of the information that we might need later, like relation names, and then a second phase that runs later when you want to generate the output and does nothing that can fail, or at least no database access: maybe it's allowed to allocate memory, for example. But that sounds like a big and perhaps painful refactoring exercise, and I can imagine that there might be reasons why it doesn't work out. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote: > I'm attaching v5 patch-set. I've addressed review comments received so > far and fixed a compiler warning that CF bot complained about. > > Please review it further. 0001 looks reasonable to me. +errno = 0; +rc = pg_pwritev_zeros(fd, pad_to_size); Do we need to reset errno? pg_pwritev_zeros() claims to set errno appropriately. +/* + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if + * writing more bytes per pg_pwritev_with_retry() call is proven to be more + * performant. + */ +#define PWRITEV_BLCKSZ XLOG_BLCKSZ This seems like something we should sort out now instead of leaving as future work. Given your recent note, I think we should just use XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance findings with different buffer sizes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Sep 20, 2022 at 04:00:26PM -0700, Nathan Bossart wrote: > On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote: >> I'm attaching v5 patch-set. I've addressed review comments received so >> far and fixed a compiler warning that CF bot complained about. >> >> Please review it further. > > 0001 looks reasonable to me. > > +errno = 0; > +rc = pg_pwritev_zeros(fd, pad_to_size); > > Do we need to reset errno? pg_pwritev_zeros() claims to set errno > appropriately. > > +/* > + * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if > + * writing more bytes per pg_pwritev_with_retry() call is proven to be more > + * performant. > + */ > +#define PWRITEV_BLCKSZ XLOG_BLCKSZ > > This seems like something we should sort out now instead of leaving as > future work. Given your recent note, I think we should just use > XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance > findings with different buffer sizes. I also noticed that the latest patch set no longer applies, so I've marked the commitfest entry as waiting-on-author. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [PoC] Federated Authn/z with OAUTHBEARER
Hi Mahendrakar, thanks for your interest and for the patch! On Mon, Sep 19, 2022 at 10:03 PM mahendrakar s wrote: > The changes for each component are summarized below. > > 1. Provider-specific extension: > Each OAuth provider implements their own token validator as an > extension. Extension registers an OAuth provider hook which is matched > to a line in the HBA file. How easy is it to write a Bearer validator using C? My limited understanding was that most providers were publishing libraries in higher-level languages. Along those lines, sample validators will need to be provided, both to help in review and to get the pytest suite green again. (And coverage for the new code is important, too.) > 2. Add support to pass on the OAuth bearer token. In this > obtaining the bearer token is left to 3rd party application or user. > > ./psql -U -d 'dbname=postgres > oauth_client_id= oauth_bearer_token= This hurts, but I think people are definitely going to ask for it, given the frightening practice of copy-pasting these (incredibly sensitive secret) tokens all over the place... Ideally I'd like to implement sender constraints for the Bearer token, to *prevent* copy-pasting (or, you know, outright theft). But I'm not sure that sender constraints are well-implemented yet for the major providers. > 3. HBA: An additional param ‘provider’ is added for the oauth method. > Defining "oauth" as method + passing provider, issuer endpoint > and expected audience > > * * * * oauth provider= > issuer= scope= Naming aside (this conflicts with Samay's previous proposal, I think), I have concerns about the implementation. There's this code: > + if (oauth_provider && oauth_provider->name) > + { > + ereport(ERROR, > + (errmsg("OAuth provider \"%s\" is already > loaded.", > + oauth_provider->name))); > + } which appears to prevent loading more than one global provider. But there's also code that deals with a provider list? (Again, it'd help to have test code covering the new stuff.) >b) libpq optionally compiled for the clients which > explicitly need libpq to orchestrate OAuth communication with the > issuer (it depends heavily on 3rd party library iddawc as Jacob > already pointed out. The library seems to be supporting all the OAuth > flows.) Speaking of iddawc, I don't think it's a dependency we should choose to rely on. For all the code that it has, it doesn't seem to provide compatibility with several real-world providers. Google, for one, chose not to follow the IETF spec it helped author, and iddawc doesn't support its flavor of Device Authorization. At another point, I think iddawc tried to decode Azure's Bearer tokens, which is incorrect... I haven't been able to check if those problems have been fixed in a recent version, but if we're going to tie ourselves to a huge dependency, I'd at least like to believe that said dependency is battle-tested and solid, and personally I don't feel like iddawc is. > - auth_method = I_TOKEN_AUTH_METHOD_NONE; > - if (conn->oauth_client_secret && *conn->oauth_client_secret) > - auth_method = I_TOKEN_AUTH_METHOD_SECRET_BASIC; This code got moved, but I'm not sure why? It doesn't appear to have made a change to the logic. > + if (conn->oauth_client_secret && *conn->oauth_client_secret) > + { > + session_response_type = I_RESPONSE_TYPE_CLIENT_CREDENTIALS; > + } Is this an Azure-specific requirement? Ideally a public client (which psql is) shouldn't have to provide a secret to begin with, if I understand that bit of the protocol correctly. 
I think Google also required provider-specific changes in this part of the code, and unfortunately I don't think they looked the same as yours. We'll have to figure all that out... Standards are great; everyone has one of their own. :) Thanks, --Jacob
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote: > On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: >> Any impact for the column sizes of the catalogs holding ACL >> information? Just asking while browsing the patch set. > > Since each aclitem requires 16 bytes instead of 12, I assume so. However, > in my testing, I hit a "row is too big" error with the same number of > aclitems in a pg_class row before and after the change. I might be missing > something in my patch, or maybe I am misunderstanding how arrays of > aclitems are stored on disk. Ah, it looks like relacl is compressed. The column is marked "extended," but pg_class doesn't appear to have a TOAST table, so presumably no out-of-line storage can be used. I found a couple of threads about this [0] [1] [2]. [0] https://postgr.es/m/17245.964897719%40sss.pgh.pa.us [1] https://postgr.es/m/200309040531.h845ViP05881%40candle.pha.pa.us [2] https://postgr.es/m/29061.1265327626%40sss.pgh.pa.us -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
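For what it's worth, both observations can be checked directly from SQL, e.g.:

SELECT attstorage FROM pg_attribute
 WHERE attrelid = 'pg_class'::regclass AND attname = 'relacl';   -- 'x', i.e. extended
SELECT reltoastrelid FROM pg_class
 WHERE oid = 'pg_class'::regclass;                               -- 0, i.e. no TOAST table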
Re: default sorting behavior for index
Zhihong Yu writes: > I was looking at this check in src/backend/parser/parse_utilcmd.c w.r.t. > constraint: > ... > If the index has DESC sorting order, why it cannot be used to back a > constraint ? > Some concrete sample would help me understand this. Please read the nearby comments, particularly * Insist on default opclass, collation, and sort options. * While the index would still work as a constraint with * non-default settings, it might not provide exactly the same * uniqueness semantics as you'd get from a normally-created * constraint; and there's also the dump/reload problem * mentioned above. The "mentioned above" refers to this: * Insist on it being a btree. That's the only kind that supports * uniqueness at the moment anyway; but we must have an index that * exactly matches what you'd get from plain ADD CONSTRAINT syntax, * else dump and reload will produce a different index (breaking * pg_upgrade in particular). The concern about whether the uniqueness semantics are the same probably mostly applies to just the opclass and collation properties. However, rd_indoption contains AM-specific options, and we have little ability to be sure in this code exactly what those bits might do. In any case we'd definitely have a risk of things breaking during pg_upgrade if we ignore rd_indoption. regards, tom lane
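A quick way to see the restriction in action (a sketch; exact error wording aside):

CREATE TABLE t (a int);
CREATE UNIQUE INDEX t_a_desc ON t (a DESC);
ALTER TABLE t ADD CONSTRAINT t_a_unique UNIQUE USING INDEX t_a_desc;
-- rejected: the index does not use the default sort options, so it cannot
-- back the unique constraint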
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 04:31:17PM -0700, Nathan Bossart wrote: > On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote: >> On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: >>> Any impact for the column sizes of the catalogs holding ACL >>> information? Just asking while browsing the patch set. >> >> Since each aclitem requires 16 bytes instead of 12, I assume so. However, >> in my testing, I hit a "row is too big" error with the same number of >> aclitems in a pg_class row before and after the change. I might be missing >> something in my patch, or maybe I am misunderstanding how arrays of >> aclitems are stored on disk. > > Ah, it looks like relacl is compressed. The column is marked "extended," > but pg_class doesn't appear to have a TOAST table, so presumably no > out-of-line storage can be used. I found a couple of threads about this > [0] [1] [2]. I suppose there is some risk that folks with really long aclitem arrays might be unable to pg_upgrade to a version with uint64 AclModes, but I suspect that risk is limited to extreme cases (i.e., multiple thousands of aclitems). I'm not sure whether that's worth worrying about too much. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [RFC] building postgres with meson - v13
Hi, On 2022-09-19 19:16:30 -0700, Andres Freund wrote: > To have some initial "translation" for other developers I've started a wiki > page with a translation table. Still very WIP: > https://wiki.postgresql.org/wiki/Meson > > For now, a bit of polishing aside, I'm just planning to add a minimal > explanation of what's happening, and a reference to this thread. I added installation instructions for meson for a bunch of platforms, but failed to figure out how to do so in a rhel9 container. I don't have a rhel subscription, and apparently the repos with developer tools now require a subscription. Great way to make it easy for projects to test anything on RHEL. Greetings, Andres Freund
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Onder, Thanks for addressing all my previous feedback. I checked the latest v12-0001, and have no more comments at this time. One last thing - do you think there is any need to mention this behaviour in the pgdocs, or is it OK just to be a hidden performance improvement? -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Summary function for pg_buffercache
Hi,

On 2022-09-20 12:45:24 +0300, Aleksander Alekseev wrote:
> > I'm not sure how to avoid any undefined behaviour without locks though.
> > Even with locks, performance is much better. But is it good enough for
> > production?
>
> Potentially you could avoid taking locks by utilizing atomic
> operations and lock-free algorithms. But these algorithms are
> typically error-prone and not always produce a faster code than the
> lock-based ones. I'm pretty confident this is out of scope of this
> particular patch.

Why would you need lockfree operations? All you need to do is to read BufferDesc->state into a local variable and then make decisions based on that?

> +	for (int i = 0; i < NBuffers; i++)
> +	{
> +		BufferDesc *bufHdr;
> +		uint32		buf_state;
> +
> +		bufHdr = GetBufferDescriptor(i);
> +
> +		/* Lock each buffer header before inspecting. */
> +		buf_state = LockBufHdr(bufHdr);
> +
> +		/* Invalid RelFileNumber means the buffer is unused */
> +		if (RelFileNumberIsValid(BufTagGetRelNumber(&bufHdr->tag)))
> +		{
> +			buffers_used++;
> +			usagecount_avg += BUF_STATE_GET_USAGECOUNT(buf_state);
> +
> +			if (buf_state & BM_DIRTY)
> +				buffers_dirty++;
> +		}
> +		else
> +			buffers_unused++;
> +
> +		if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
> +			buffers_pinned++;
> +
> +		UnlockBufHdr(bufHdr, buf_state);
> +	}

I.e. instead of locking the buffer header as done above, this could just do something along these lines:

	BufferDesc *bufHdr;
	uint32		buf_state;

	bufHdr = GetBufferDescriptor(i);

	buf_state = pg_atomic_read_u32(&bufHdr->state);

	if (buf_state & BM_VALID)
	{
		buffers_used++;
		usagecount_avg += BUF_STATE_GET_USAGECOUNT(buf_state);

		if (buf_state & BM_DIRTY)
			buffers_dirty++;
	}
	else
		buffers_unused++;

	if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
		buffers_pinned++;

Without a memory barrier you can get very slightly "out-of-date" values of the state, but that's fine in this case.

Greetings, Andres Freund
Re: Support pg_attribute_aligned and noreturn in MSVC
On Tue, Sep 20, 2022 at 08:01:20AM -0400, James Coleman wrote: > I don't have access to a Windows machine for testing, but re-reading > the documentation it looks like the issue is that our noreturn macro > is used after the definition while the MSVC equivalent is used before. A CI setup would do the job, for example; see src/tools/ci/README, which explains how to set things up. > I've removed that for now (and commented about it); it's not as > valuable anyway since it's mostly an indicator for code analysis > (human and machine). Except for the fact that the patch missed defining pg_attribute_noreturn() in the MSVC branch, this looks fine to me. I have been looking at what you meant with packing, and I can see the business with PACK(), something actually doable with gcc. That's a first step, at least, and I see no reason not to do it, so applied. -- Michael signature.asc Description: PGP signature
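For readers without a Windows box handy, the placement difference being discussed looks roughly like this (a sketch only, not the committed macros):

    /* GCC/Clang accept the attribute trailing the declaration, which is
     * how the existing macro is used, e.g.:
     *     extern void proc_exit(int code) pg_attribute_noreturn();
     */
    #define pg_attribute_noreturn() __attribute__((noreturn))

    /*
     * MSVC's equivalent must come before the declaration, so the trailing
     * placement cannot simply be mapped onto it:
     *     __declspec(noreturn) void proc_exit(int code);
     */

    /* Alignment, the part that was applied, maps more directly: */
    #if defined(_MSC_VER)
    #define pg_attribute_aligned(a) __declspec(align(a))
    #else
    #define pg_attribute_aligned(a) __attribute__((aligned(a)))
    #endif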
Re: predefined role(s) for VACUUM and ANALYZE
On Tue, Sep 20, 2022 at 04:50:10PM -0700, Nathan Bossart wrote: > On Tue, Sep 20, 2022 at 04:31:17PM -0700, Nathan Bossart wrote: >> On Tue, Sep 20, 2022 at 11:05:33AM -0700, Nathan Bossart wrote: >>> On Tue, Sep 20, 2022 at 02:45:52PM +0900, Michael Paquier wrote: Any impact for the column sizes of the catalogs holding ACL information? Just asking while browsing the patch set. >>> >>> Since each aclitem requires 16 bytes instead of 12, I assume so. However, >>> in my testing, I hit a "row is too big" error with the same number of >>> aclitems in a pg_class row before and after the change. I might be missing >>> something in my patch, or maybe I am misunderstanding how arrays of >>> aclitems are stored on disk. >> >> Ah, it looks like relacl is compressed. The column is marked "extended," >> but pg_class doesn't appear to have a TOAST table, so presumably no >> out-of-line storage can be used. I found a couple of threads about this >> [0] [1] [2]. Adding a toast table to pg_class has been a sensitive topic over the years. Based on my recollection of the events, there were worries about the potential cross-dependencies with pg_class and pg_attribute that this would create. > I suppose there is some risk that folks with really long aclitem arrays > might be unable to pg_upgrade to a version with uint64 AclModes, but I > suspect that risk is limited to extreme cases (i.e., multiple thousands of > aclitems). I'm not sure whether that's worth worrying about too much. Did you just run an aclupdate()? 4% for aclitem[] sounds like quite a number to me :/ It may be worth looking at if these operations could be locally optimized more, as well. I'd like to think that we could live with that to free up enough bits in AclItems for the next 20 years, anyway. Any opinions? For the column sizes of the catalogs, I was wondering about how pg_column_size() changes when they hold ACL information. Unoptimized alignment could cause an unnecessary increase in the structure sizes, so the addition of new fields or changes in object size could have unexpected side effects. -- Michael signature.asc Description: PGP signature
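For the pg_column_size() question, something like the following could be used to watch how relacl grows with the number of entries, before and after widening AclMode (just a sketch; role and table names are made up):

    CREATE TABLE acl_bloat_test (a int);

    DO $$
    BEGIN
        FOR i IN 1..1000 LOOP
            EXECUTE format('CREATE ROLE acl_test_role_%s', i);
            EXECUTE format('GRANT SELECT ON acl_bloat_test TO acl_test_role_%s', i);
        END LOOP;
    END $$;

    -- stored (possibly compressed) size of the ACL column vs. entry count
    SELECT cardinality(relacl) AS aclitems,
           pg_column_size(relacl) AS stored_bytes
      FROM pg_class
     WHERE relname = 'acl_bloat_test';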
Re: Hash index build performance tweak from sorting
On Tue, 2 Aug 2022 at 03:37, Simon Riggs wrote:
> Using the above test case, I'm getting a further 4-7% improvement on
> already committed code with the attached patch, which follows your
> proposal.
>
> The patch passes info via a state object, useful to avoid API churn in
> later patches.

Hi Simon,

I took this patch for a spin and saw a 2.5% performance increase using the random INT test that Tom posted. The index took an average of 7227.47 milliseconds on master and 7045.05 with the patch applied.

On making a pass of the changes, I noted down a few things.

1. In _h_spoolinit() the HSpool is allocated with palloc and then you're setting the istate field to a pointer to the HashInsertState which is allocated on the stack by the only calling function (hashbuild()). Looking at hashbuild(), it looks like the return value of _h_spoolinit is never put anywhere to make it available outside of the function, so it does not seem like there is an actual bug there. However, it just seems like a bug waiting to happen. If _h_spoolinit() is pallocing memory, then we really shouldn't be setting pointer fields in that memory to point to something on the stack. It might be nicer if the istate field in HSpool was a HashInsertStateData and _h_spoolinit() just memcpy'd the contents of that parameter. That would make HSpool 4 bytes smaller and save additional pointer dereferences in _hash_doinsert().

2. There are quite a few casts that are not required. e.g:

   _hash_doinsert(rel, itup, heapRel, (HashInsertState) &istate);
   buildstate.spool = _h_spoolinit(heap, index, num_buckets, (HashInsertState) &insertstate);
   buildstate.istate = (HashInsertState) &insertstate;

This is just my opinion, but I don't really see the value in having a typedef for a pointer to HashInsertStateData. I can understand that if the struct was local to a .c file, but you've got the struct and pointer typedef in the same header. I understand we often do this in the code, but I feel like we do it less often in newer code. e.g we do it in aset.c but not generation.c (which is much newer than aset.c). My personal preference would be just to name the struct HashInsertState and have no extra pointer typedefs.

3. Just a minor nitpick. Line wraps at 80 chars. You're doing this sometimes but not others. This seems just to be due to the additional function parameters that have been added.

4. I added the following Assert to _hash_pgaddtup() as I expected the itup_off to be set to the same thing before and after this change. I see the Assert is failing in the regression tests.

   Assert(PageGetMaxOffsetNumber(page) + 1 ==
          _hash_binsearch(page, _hash_get_indextuple_hashkey(itup)));

I think this is because _hash_binsearch() returns the offset with the first tuple with the given hashkey, so if there are duplicate hashkey values then it looks like PageAddItemExtended() will set needshuffle and memmove() the existing item(s) up one slot. I don't know this hash index building code very well, but I wonder if it's worth having another version of _hash_binsearch() that can be used to make _hash_pgaddtup() put any duplicate hashkeys after the existing ones rather than before and shuffle the others up? It sounds like that might speed up normal insertions when there are many duplicate values to hash. I wonder if this might be the reason the random INT test didn't come out as good as your original test which had unique values. The unique values test would do less shuffling during PageAddItemExtended(). If so, that implies that skipping the binary search is only part of the gains here and that not shuffling tuples accounts for quite a bit of the gain you're seeing. If so, then it would be good to not have to shuffle duplicate hashkey tuples up in the page during normal insertions as well as when building the index. In any case, it would be nice to have some way to assert that we don't accidentally pass sorted==true to _hash_pgaddtup() when there's an existing item on the page with a higher hash value. Maybe we could just look at the hash value of the last tuple on the page and ensure it's <= to the current one?

5. I think it would be nicer to move the insertstate.sorted = false; into the else branch in hashbuild(). However, you might have to do that anyway if you were to do what I mentioned in #1.

David
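The check suggested in point 4 above could look something like this (an untested sketch that assumes the patch's sorted flag is visible inside _hash_pgaddtup()):

    #ifdef USE_ASSERT_CHECKING
    /* If the caller claims pre-sorted input, appending must be correct:
     * the incoming hash key may not be smaller than the page's last key. */
    if (sorted)
    {
        OffsetNumber	maxoff = PageGetMaxOffsetNumber(page);

        if (maxoff >= FirstOffsetNumber)
        {
            IndexTuple	lasttup = (IndexTuple)
                PageGetItem(page, PageGetItemId(page, maxoff));

            Assert(_hash_get_indextuple_hashkey(lasttup) <=
                   _hash_get_indextuple_hashkey(itup));
        }
    }
    #endif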
Re: Perform streaming logical transactions by background workers and parallel apply
FYI - The latest patch 30-0001 fails to apply, it seems due to a recent commit [1].

[postgres@CentOS7-x64 oss_postgres_misc]$ git apply ../patches_misc/v30-0001-Perform-streaming-logical-transactions-by-parall.patch
error: patch failed: src/include/replication/logicalproto.h:246
error: src/include/replication/logicalproto.h: patch does not apply

--
[1] https://github.com/postgres/postgres/commit/bfcf1b34805f70df48eedeec237230d0cc1154a6

Kind Regards, Peter Smith. Fujitsu Australia
RE: Perform streaming logical transactions by background workers and parallel apply
> FYI -
>
> The latest patch 30-0001 fails to apply, it seems due to a recent commit [1].
>
> [postgres@CentOS7-x64 oss_postgres_misc]$ git apply ../patches_misc/v30-0001-Perform-streaming-logical-transactions-by-parall.patch
> error: patch failed: src/include/replication/logicalproto.h:246
> error: src/include/replication/logicalproto.h: patch does not apply

Thanks for the kind reminder. I rebased the patch set and attached it in [1].

[1] - https://www.postgresql.org/message-id/OS3PR01MB6275298521AE1BBEF5A055EE9E4F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards, Wang wei
Re: why can't a table be part of the same publication as its schema
[personal views, not RMT]

On 9/20/22 4:06 PM, Mark Dilger wrote:
> I don't complain that it is building on the existing behavior. I'm *only*
> concerned about the keywords we're using for this. Consider the following:
>
>     -- AS ADMIN
>     CREATE USER bob NOSUPERUSER;
>     GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA foo TO bob;
>     SET ROLE bob;
>     CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;
>
> We're going to have that fail in pg15 because the FOR ALL TABLES IN SCHEMA
> option is reserved to superusers. But we agreed that was a stop-gap solution
> that we'd potentially loosen in the future. Certainly we'll need wiggle room
> in the syntax to perform that loosening:
>
>     --- Must be superuser for this in pg15, and in subsequent releases.
>     CREATE PUBLICATION bobs_pub FOR ALL FUTURE TABLES IN SCHEMA foo;
>
>     --- Not supported in pg15, but reserved for some future pg versions to allow
>     --- non-superusers to create publications on tables currently in schema foo,
>     --- assuming they have sufficient privileges on those tables
>     CREATE PUBLICATION bobs_pub FOR ALL TABLES IN SCHEMA foo;
>
> Doing it this way makes the syntax consistent between the GRANT...TO bob and
> the CREATE PUBLICATION bobs_pub. Surely this makes more sense?

When you put it that way, I see your point. However, for the lesser-privileged user, will the behavior be that it will continue to add all future tables in a schema to the publication so long as they have sufficient privileges on those tables? Or would that mirror the current behavior with GRANT? While I understand it makes it consistent, the one concern I raise is that it means the less privileged user could have a less convenient user experience than the privileged user. Perhaps that's OK, but worth noting.

> I'm not a huge fan of the keyword "FUTURE" here, but I found a reference to
> another database that uses that keyword for what I think is a similar purpose.

I did try doing research on this beforehand, but hadn't thought to incorporate "future" into my searches. Doing so, I probably found the same database that you did that used the "FUTURE" word for adding permissions to future objects (and this is fresh, as the docs for it were published last week). That's definitely interesting. I did see some notes on a legacy database system that offered similar advice to what we do for GRANT if you're not using ALTER DEFAULT PRIVILEGES.

> We should choose *something* for this, though, if we want things to be
> rational going forward.

That all said, while I understand your point and am open to the suggestion on "FUTURE", I'm not convinced on the syntax change. But I'll sleep on it.

Jonathan

OpenPGP_signature Description: OpenPGP digital signature
RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
> One last thing - do you think there is any need to mention this
> behaviour in the pgdocs, or is it OK just to be a hidden performance
> improvement?

FYI - let me share my opinion. We have the following sentence in logical-replication.sgml:

```
...
If the table does not have any suitable key, then it can be set to replica
identity full, which means the entire row becomes the key. This, however, is
very inefficient and should only be used as a fallback if no other solution
is possible.
...
```

Here the wording "very inefficient" may suggest that sequential scans will be executed every time, so I think a short description of the new behaviour could be added around here.

Best Regards, Hayato Kuroda FUJITSU LIMITED
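As a concrete illustration that could sit next to that sentence (only a sketch of the idea; the table name is just an example):

    -- fallback when the published table has no suitable key
    ALTER TABLE pgbench_history REPLICA IDENTITY FULL;

    -- with the patch, the subscriber may use a suitable index on the target
    -- table for UPDATE/DELETE lookups instead of a per-row sequential scan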
Re: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory
On Fri, Aug 19, 2022 at 5:27 PM Bharath Rupireddy wrote: > > Please review the attached v6 patch. I'm attaching the v7 patch rebased on to the latest HEAD. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 5ccee7e8ed82952bad6ef410176952aec2733e4f Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 21 Sep 2022 01:44:44 + Subject: [PATCH v7] Make WAL file initialization by pg_receivewal atomic While initializing WAL files with zeros (i.e. padding) the pg_receivewal can fail for any reason, for instance, no space left on device or host server crash etc., which may leave partially padded WAL files due to which the pg_receivewal won't come up after the hostserver restart. The error it emits is "error: write-ahead log file "foo.partial" has x bytes, should be 0 or y" (where y is size of WAL file typically and x < y). To fix this, one needs to manually intervene and delete the partially padded WAL file which requires an internal understanding of what the issue is and often a time-consuming process in production environments. The above problem can also exist for pg_basebackup as it uses the same walmethods.c function for initialization. To address the above problem, make WAL file initialization atomic i.e. first create a temporary file, pad it and then rename it to the target file. This is similar to what postgres does to make WAL file initialization atomic. Authors: Bharath Rupireddy, SATYANARAYANA NARLAPURAM Reviewed-by: Cary Huang, mahendrakar s Reviewed-by: Kyotaro Horiguchi, Michael Paquier Discussion: https://www.postgresql.org/message-id/CAHg%2BQDcVUss6ocOmbLbV5f4YeGLhOCt%2B1x2yLNfG2H_eDwU8tw%40mail.gmail.com --- src/bin/pg_basebackup/pg_receivewal.c | 22 ++ src/bin/pg_basebackup/walmethods.c| 96 --- 2 files changed, 109 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c index f98ec557db..1cbc992e67 100644 --- a/src/bin/pg_basebackup/pg_receivewal.c +++ b/src/bin/pg_basebackup/pg_receivewal.c @@ -237,6 +237,28 @@ is_xlogfilename(const char *filename, bool *ispartial, return true; } + /* + * File looks like a temporary partial uncompressed WAL file that was + * leftover during pre-padding phase from a previous crash, delete it now + * as we don't need it. 
+ */ + if (fname_len == XLOG_FNAME_LEN + strlen(".partial.tmp") && + strcmp(filename + XLOG_FNAME_LEN, ".partial.tmp") == 0) + { + /* 12 is length of string ".partial.tmp" */ + char path[MAXPGPATH + XLOG_FNAME_LEN + 12]; + + snprintf(path, sizeof(path), "%s/%s", basedir, filename); + + if (unlink(path) < 0) + pg_log_error("could not remove file \"%s\"", path); + + if (verbose) + pg_log_info("removed file \"%s\"", path); + + return false; + } + /* File does not look like something we know */ return false; } diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index bc2e83d02b..4d34d4346a 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -118,7 +118,10 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, const char *temp_suffix, size_t pad_to_size) { DirectoryMethodData *dir_data = (DirectoryMethodData *) wwmethod; - char tmppath[MAXPGPATH]; + char targetpath[MAXPGPATH]; + char tmpsuffixpath[MAXPGPATH]; + bool shouldcreatetempfile = false; + int flags; char *filename; int fd; DirectoryMethodFile *f; @@ -134,7 +137,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, clear_error(wwmethod); filename = dir_get_file_name(wwmethod, pathname, temp_suffix); - snprintf(tmppath, sizeof(tmppath), "%s/%s", + snprintf(targetpath, sizeof(targetpath), "%s/%s", dir_data->basedir, filename); pg_free(filename); @@ -143,12 +146,57 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname, * the file descriptor is important for dir_sync() method as gzflush() * does not do any system calls to fsync() to make changes permanent on * disk. + * + * Create a temporary file for padding and no compression cases to ensure + * a fully initialized file is created. */ - fd = open(tmppath, O_WRONLY | O_CREAT | PG_BINARY, pg_file_create_mode); + if (pad_to_size > 0 && + wwmethod->compression_algorithm == PG_COMPRESSION_NONE) + { + shouldcreatetempfile = true; + flags = O_WRONLY | PG_BINARY; + } + else + { + shouldcreatetempfile = false; + flags = O_WRONLY | O_CREAT | PG_BINARY; + } + + fd = open(targetpath, flags, pg_file_create_mode); if (fd < 0) { - wwmethod->lasterrno = errno; - return NULL; + if (errno == ENOENT && shouldcreatetempfile) + { + /* + * Actual file doesn't exist. Now, create a temporary file pad it + * and rename to the target file. The temporary file may exist from + * the last failed attempt (failed during partial padding or + * renaming or some ot