Re: Logging which local address was connected to in log_line_prefix
On 5/24/24 22:33, Greg Sabino Mullane wrote:
> Peter, thank you for the feedback. Attached is a new patch with "address"
> rather than "interface", plus a new default of "local" if there is no
> address. I also removed the questionable comment, and updated the
> commitfest title.

I tried the updated patch and it behaved as expected with [local] being
logged for peer connections and an IP being logged for host connections.

One thing -- the changes in postgresql.conf.sample should use tabs to match
the other lines. The patch uses spaces. I also find the formatting in
log_status_format() pretty awkward but I guess that will be handled by
pgindent.

Regards,
-David
Address the -Wuse-after-free warning in ATExecAttachPartition()
In [1], Andres reported a -Wuse-after-free bug in the ATExecAttachPartition()
function. I've created a patch to address it with pointers from Amit offlist.

The issue was that the partBoundConstraint variable was used after the
list_concat() call. This could potentially lead to accessing the
partBoundConstraint variable after its memory has been freed.

The issue was resolved by using the return value of the list_concat()
function, instead of using the list1 argument of list_concat(). I copied the
partBoundConstraint variable to a new variable named partConstraint and used
it for the previous references before invoking
get_proposed_default_constraint(). I confirmed that the
eval_const_expressions(), make_ands_explicit(), map_partition_varattnos(), and
QueuePartitionConstraintValidation() functions do not modify the memory
location pointed to by the partBoundConstraint variable. Therefore, it is safe
to use it for the next reference in get_proposed_default_constraint().

Attaching the patch. Please review and share comments, if any. Thanks to
Andres for spotting the bug and for some off-list advice on how to reproduce
it.

[1]: https://www.postgresql.org/message-id/flat/202311151802.ngj2la66jwgi%40alvherre.pgsql#4fc5622772ba0244c1ad203f5fc56701

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

v1-0001-Address-the-Wuse-after-free-warning-in-ATExecAttachP.patch
Description: Binary data
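To see the hazard in isolation, here is a minimal, self-contained C sketch
using plain realloc() (illustrative only; it is not code from the patch, but
list_concat() is analogous in that it may free and move the memory behind its
first argument):

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int
main(void)
{
	char	   *buf = malloc(6);
	char	   *grown;

	if (buf == NULL)
		return 1;
	memcpy(buf, "hello", 6);

	/*
	 * realloc() may free 'buf' and return a different block.  Reading 'buf'
	 * afterwards is the classic -Wuse-after-free pattern; the fix is to use
	 * only the returned pointer, just as the patch switches from the
	 * list_concat() argument to its return value.
	 */
	grown = realloc(buf, 12);
	if (grown == NULL)
	{
		free(buf);
		return 1;
	}
	strcat(grown, " world");	/* OK: use 'grown', never 'buf', from here on */
	puts(grown);

	free(grown);
	return 0;
}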
Re: Pluggable cumulative statistics
Hi,

On Mon, Jul 08, 2024 at 03:49:34PM +0900, Michael Paquier wrote:
> On Mon, Jul 08, 2024 at 06:39:56AM +, Bertrand Drouvot wrote:
> > + for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
> > + {
> > + char *ptr;
> > + const PgStat_KindInfo *info = pgstat_get_kind_info(kind);
> >
> > I wonder if we could avoid going through stats that are not fixed ones.
> > What about doing something like? Same would apply for the read part
> > added in 9004abf6206e.
>
> This becomes more relevant when the custom stats are added, as this
> performs a scan across the full range of IDs supported. So this
> choice is here for consistency, and to ease the pluggability.

Gotcha.

> > 2 ===
> >
> > + pgstat_build_snapshot_fixed(kind);
> > + ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
> > + write_chunk(fpout, ptr, info->shared_data_len);
> >
> > I think that using "shared_data_len" is confusing here (was not the case
> > in the context of 9004abf6206e). I mean it is perfectly correct but the
> > wording "shared" looks weird to me when being used here. What it really
> > is, is the size of the stats. What about renaming shared_data_len with
> > stats_data_len?
>
> It is the stats data associated to a shared entry. I think that's OK,
> but perhaps I'm just used to it as I've been staring at this area for
> days.

Yeah, what I meant to say is that one could think for example that's the
PgStatShared_Archiver size while in fact it's the PgStat_ArchiverStats size.

I think it's more confusing when writing the stats. Here we are manipulating
"snapshot" and "snapshot" offsets. It was not that confusing when reading as
we are manipulating "shmem" and "shared" offsets.

As I said, the code is fully correct, that's just the wording here that
sounds weird to me in the "snapshot" context.

Except the above (which is just a Nit), 0001 LGTM.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Re: speed up pg_upgrade with large number of tables
> Thanks! Since you mentioned that you have multiple databases with 1M+
> databases, you might also be interested in commit 2329cad. That should
> speed up the pg_dump step quite a bit.

Wow, I noticed this commit (2329cad) when it appeared in the commitfest. It
has doubled the speed of pg_dump in this scenario. Thank you for your effort!

Besides, https://commitfest.postgresql.org/48/4995/ seems insufficient for
this situation. Some time-consuming functions like check_for_data_types_usage
are not yet able to run in parallel. But these patches could be a great
starting point for a more efficient parallelism implementation. Maybe we can
do it later.
Re: Doc Rework: Section 9.16.13 SQL/JSON Query Functions
Another tiny issue:

-select json_query(jsonb '{"a": "[1, 2]"}', 'lax $.a' OMIT QUOTES);
+JSON_QUERY(jsonb '{"a": "[1, 2]"}', 'lax $.a' OMIT QUOTES);
 [1, 2]

-select json_query(jsonb '{"a": "[1, 2]"}', 'lax $.a' RETURNING int[] OMIT QUOTES ERROR ON ERROR);
+JSON_QUERY(jsonb '{"a": "[1, 2]"}', 'lax $.a' RETURNING int[] OMIT QUOTES ERROR ON ERROR);

These two example queries don't need semicolons at the end?
Re: Pgoutput not capturing the generated columns
Hi Shlok,

Here are some review comments for patch v15-0003.

==
src/backend/catalog/pg_publication.c

1. publication_translate_columns

The function comment says:

 * Translate a list of column names to an array of attribute numbers
 * and a Bitmapset with them; verify that each attribute is appropriate
 * to have in a publication column list (no system or generated attributes,
 * no duplicates). Additional checks with replica identity are done later;
 * see pub_collist_contains_invalid_column.

That part about "[no] generated attributes" seems to have gone stale -- e.g.
not quite correct anymore. Should it say no VIRTUAL generated attributes?

==
src/backend/replication/logical/proto.c

2. logicalrep_write_tuple and logicalrep_write_attrs

I thought all the code fragments like this:

+ if (att->attgenerated && att->attgenerated != ATTRIBUTE_GENERATED_STORED)
+ continue;
+

don't need to be in the code anymore, because of the Bitmapset (BMS)
processing done to make the "column list" for the publication, where
disallowed generated columns should already be excluded from the BMS, right?
So shouldn't all these be detected by the following statement:

if (!column_in_column_list(att->attnum, columns))
 continue;

==
src/backend/replication/logical/tablesync.c

3.

+ if(server_version >= 12)
+ {
+ bool gencols_allowed = server_version >= 17 && MySubscription->includegencols;
+
+ if (gencols_allowed)
+ {

Should say server_version >= 18, instead of 17.

==
src/backend/replication/pgoutput/pgoutput.c

4. send_relation_and_attrs

(this is a similar comment to #2 above)

IIUC, one of the advantages of the Bitmapset (BMS) idea in patch 0001, which
processes the generated columns up-front, is that there is no need to check
them again in code like this. They should be discovered anyway in the
subsequent check:

/* Skip this attribute if it's not present in the column list */
if (columns != NULL && !bms_is_member(att->attnum, columns))
 continue;

==
src/test/subscription/t/011_generated.pl

5.

AFAICT there are still multiple comments (e.g. for the "TEST tab" comments)
where it still says "generated" instead of "stored generated". I did not make
a "nitpicks" diff for these because those comments are inherited from the
prior patch 0002, which still has outstanding review comments on it too.
Please just search/replace them.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
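To make the point about up-front filtering concrete, here is a small,
self-contained C sketch (a plain uint64 bitmask standing in for PostgreSQL's
Bitmapset, with made-up attribute data; illustrative only): once disallowed
columns are excluded while the column list is built, the membership test
alone covers them in the send paths.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Stand-in for a relation's attributes: attnum plus whether the column is
 * disallowed (e.g. a virtual generated column).  Purely illustrative data. */
typedef struct
{
	int		attnum;
	bool	disallowed;
} AttrExample;

static bool
in_column_list(uint64_t columns, int attnum)
{
	return (columns & (UINT64_C(1) << attnum)) != 0;
}

int
main(void)
{
	AttrExample attrs[] = {{1, false}, {2, true}, {3, false}};
	uint64_t	columns = 0;

	/* Build the "column list" up front, excluding disallowed columns ... */
	for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++)
		if (!attrs[i].disallowed)
			columns |= UINT64_C(1) << attrs[i].attnum;

	/* ... so the send path only needs the membership check. */
	for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++)
	{
		if (!in_column_list(columns, attrs[i].attnum))
			continue;			/* also covers the excluded generated column */
		printf("would send attnum %d\n", attrs[i].attnum);
	}
	return 0;
}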
Re: Parallel CREATE INDEX for GIN indexes
On Sun, 7 Jul 2024, 18:11 Tomas Vondra, wrote:
>
> On 7/3/24 20:36, Matthias van de Meent wrote:
>> On Mon, 24 Jun 2024 at 02:58, Tomas Vondra wrote:
>>> So I've switched this to use the regular data-type comparisons, with
>>> SortSupport etc. There's a bit more cleanup remaining and testing
>>> needed, but I'm not aware of any bugs.
>>
>> A review of patch 0001:
>>
>> ---
>>
>>> src/backend/access/gin/gininsert.c | 1449 +++-
>>
>> The nbtree code has `nbtsort.c` for its sort- and (parallel) build
>> state handling, which is exclusively used during index creation. As
>> the changes here seem to be largely related to bulk insertion, how
>> much effort would it be to split the bulk insertion code path into a
>> separate file?
>>
>
> Hmmm. I haven't tried doing that, but I guess it's doable. I assume we'd
> want to do the move first, because it involves pre-existing code, and
> then do the patch on top of that.
>
> But what would be the benefit of doing that? I'm not sure doing it just
> to make it look more like the btree code is really worth it. Do you expect
> the result to be clearer?

It was mostly a consideration of file size and separation of concerns. The
sorted build path is quite different from the unsorted build, after all.

>> I noticed that new fields in GinBuildState do get to have a
>> bs_*-prefix, but none of the other added or previous fields of the
>> modified structs in gininsert.c have such prefixes. Could this be
>> unified?
>>
>
> Yeah, these are inconsistencies from copying the infrastructure code to
> make the parallel builds work, etc.
>
>>> +/* Magic numbers for parallel state sharing */
>>> +#define PARALLEL_KEY_GIN_SHARED		UINT64CONST(0xB001)
>>> ...
>>
>> These overlap with BRIN's keys; can we make them unique while we're at it?
>>
>
> We could, and I recall we had a similar discussion in the parallel BRIN
> thread, right? But I'm somewhat unsure why we would actually want/need
> these keys to be unique. Surely we don't need to mix those keys in a
> single shm segment, right? So it seems more like an aesthetic thing. Or
> is there some policy to have unique values for these keys?

Uniqueness would be mostly useful for debugging shared memory issues, but
indeed, in a correctly working system we wouldn't have to worry about
parallel state key type confusion.

>> ---
>>
>>> +++ b/src/include/access/gin_tuple.h
>>> + typedef struct GinTuple
>>
>> I think this needs some more care: currently, each GinTuple is at
>> least 36 bytes in size on 64-bit systems. By using int instead of Size
>> (no normal indexable tuple can be larger than MaxAllocSize), and
>> packing the fields better we can shave off 10 bytes; or 12 bytes if
>> GinTuple.keylen is further adjusted to (u)int16: a key needs to fit on
>> a page, so we can probably safely assume that the key size fits in
>> (u)int16.
>
> Yeah, I guess using int64 is a bit excessive - you're right about that.
> I'm not sure this is necessarily about "indexable tuples" (GinTuple is
> not indexed, it's more an intermediate representation).

Yes, but even if the GinTuple itself isn't stored on disk in the index, a
GinTuple's key *is* part of the primary GIN btree's keys somewhere down the
line, and thus must fit on a page somewhere. That's what I was referring to.

> But if we can make it smaller, that probably can't hurt.
>
> I don't have a great intuition on how beneficial this might be. For
> cases with many TIDs per index key, it probably won't matter much. But
> if there's many keys (so that GinTuples store only very few TIDs), it
> might make a difference.

This will indeed matter most when small TID lists are common. I suspect it's
not uncommon to find ourselves in such a situation when users have a low
maintenance_work_mem (and thus don't have much space to buffer and combine
index tuples before they're flushed), or when the generated index keys can't
be stored in the available memory (such as in my example; I didn't tune
m_w_m at all, and the table I tested had ~15GB of data).

> I'll try to measure the impact on the same "realistic" cases I used for
> the earlier steps.

That would be greatly appreciated, thanks!

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
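As a rough, self-contained illustration of the kind of packing being
discussed (hypothetical field names and layout; this is not the actual
GinTuple from gin_tuple.h):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of a packed intermediate-tuple header: using 16/32-bit fields where
 * the value range allows, and ordering members from widest to narrowest,
 * avoids padding holes on 64-bit systems.
 */
typedef struct PackedGinTupleSketch
{
	int32_t		tuplen;		/* total tuple length; fits in 32 bits */
	uint16_t	attrnum;	/* index attribute number */
	uint16_t	keylen;		/* key must fit on a page, so 16 bits suffice */
	int16_t		typlen;		/* key datatype length */
	bool		typbyval;	/* key datatype passed by value? */
	/* variable-length key bytes and a compressed TID list would follow */
} PackedGinTupleSketch;

int
main(void)
{
	/* On typical 64-bit platforms this prints 12, versus the 36+ bytes
	 * mentioned above for the current layout. */
	printf("sizeof(PackedGinTupleSketch) = %zu\n",
		   sizeof(PackedGinTupleSketch));
	return 0;
}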
a potential typo in comments of pg_parse_json
Not 100% sure, sorry if this doesn't make sense.

--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -514,7 +514,7 @@ freeJsonLexContext(JsonLexContext *lex)
  *
  * If FORCE_JSON_PSTACK is defined then the routine will call the non-recursive
  * JSON parser. This is a useful way to validate that it's doing the right
- * think at least for non-incremental cases. If this is on we expect to see
+ * thing at least for non-incremental cases. If this is on we expect to see
  * regression diffs relating to error messages about stack depth, but no
  * other differences.
  */

--
Regards
Junwang Zhao
Re: Partial aggregates pushdown
On Sun, 7 Jul 2024 at 23:46, fujii.y...@df.mitsubishielectric.co.jp wrote:
> In my mind, Approach1 is superior. Therefore, if there are no objections this
> week, I plan to resume implementing Approach1 next week. I would appreciate
> it if anyone could discuss the topic with me or ask questions.

Honestly, the more I think about this, the more I like Approach2. Not because
I disagree with you about some of the limitations of Approach2, but because
I'd rather see those limitations fixed in CREATE TYPE, instead of working
around these limitations in CREATE AGGREGATE. That way more usages can
benefit. Detailed explanation and arguments below.

> 1. Extendability
> I believe it is crucial to support scenarios where the local and remote major
> versions may differ in the future (see the below).
>
> https://www.postgresql.org/message-id/4012625.1701120204%40sss.pgh.pa.us

From my reading, Tom's concern is that different server versions cannot talk
to each other anymore. So as long as this perf optimization is only enabled
when server versions are the same, I don't think there is a huge problem if
we never implement this. Honestly, I even think making this optimization
opt-in at the FDW server creation level would already solve Tom's concern.

I do agree that it would be good if we could have cross-version partial
aggregates though, so it's definitely something to consider.

> Regarding this aspect, I consider Approach1 superior to Approach2. The reason
> is that:
> ・The data type of an aggregate function's state value may change with each
> major version increment.
> ・In Approach1, by extending the export/import functionalities to include the
> major version in which the state value was created (refer to p.16 and p.17 of
> [1]), I can handle such situations.
> ・On the other hand, it appears that Approach2 fundamentally lacks the
> capability to support these scenarios.

Approach 2 definitely has some cross-version capabilities, e.g. jsonb_send
includes a version. Such an approach can be used to solve a newer coordinator
talking to an older worker, if the transtypes are the same.

I personally don't think it's worth supporting this optimization for an older
coordinator talking to a newer worker. Using the binary protocol to talk from
an older server to a newer server doesn't work either.

Finally, based on p.16 & p.17 it's unclear to me how cross-version with
different transtypes would work. That situation seems inherently incompatible
to me.

> 2. Amount of codes
> Regarding this aspect, I find Approach1 to be better than Approach2.
> In Approach1, developers only need to export/import functions and can use a
> standardized format for transmitting state values.
> In Approach2, developers have two options:
> Option1: Adding typinput/typoutput and typsend/typreceive.
> Option2: Adding typinput/typoutput only.
> Option1 requires more lines of code, which may be seen as cumbersome by some
> developers.
> Option2 restricts developers to using only text representation for
> transmitting state values, which I consider limiting.

In my opinion this is your strongest argument for Approach1. But you didn't
answer my previous two questions yet:

On Tue, 25 Jun 2024 at 11:28, Jelte Fennema-Nio wrote:
> So that leaves me two questions:
> 1. Maybe CREATE TYPE should allow types without input/output functions
> as long as send/receive are defined. For these types text
> representation could fall back to the hex representation of bytea.
> 2. If for some reason 1 is undesired, then why are transtypes so
> special. Why is it fine for them to only have send/receive functions
> and not for other types?

Basically: I agree with this argument, but I feel like this being a problem
for this use case is probably a sign that we should take the solution a step
further and solve this at the CREATE TYPE level, instead of allowing people
to hack around CREATE TYPE's limitations just for these partial aggregates.

> 3. Catalog size
> Regarding this point, I believe Approach1 is better than Approach2.
> In Approach1, theoretically, it is necessary to add export/import functions
> to pg_proc for each aggregate.
> In Approach2, theoretically, it is necessary to add typoutput/typinput
> functions (and typsend/typreceive if necessary) to pg_proc and add a native
> type to pg_type for each aggregate.
> I would like to emphasize that we should consider user-defined functions in
> addition to built-in aggregate functions.
> I think most developers prefer to avoid bloating catalogs, even if they may
> not be able to specify exact reasons.
> In fact, in Robert's previous review, he expressed a similar concern (see
> below).
>
> https://www.postgresql.org/message-id/CA%2BTgmobvja%2Bjytj5zcEcYgqzOaeJiqrrJxgqDf1q%3D3k8FepuWQ%40mail.gmail.com

So, to summarize the difference (assuming we change CREATE TYPE to allow only
typsend/typreceive): "Approach 2 adds an additional pg_type entry per
aggregate". IMHO this is fine, especially sin
Re: pgsql: Add pg_get_acl() to get the ACL for a database object
On Fri, Jul 05, 2024 at 10:40:39AM +0200, Joel Jacobson wrote:
> OK, I made an attempt to implement this, based on adapting code from
> recordExtObjInitPriv() in aclchk.c, which retrieves ACL based on ATTNUM.
>
> There are now two different code paths for columns and non-columns.
>
> It sounds like a bigger refactoring in this area would be nice,
> which would enable cleaning up code in other functions as well,
> but maybe that's better to do as a separate project.

Thanks for the patch. I have been looking at it for a few hours, eyeing a bit
at the ObjectProperty parts if we were to extend it for sub-object IDs, and
did not like the complexity this introduces, so I'd be OK to live with the
extra handling in pg_get_acl() itself.

+ /* ignore dropped columns */
+ if (atttup->attisdropped)
+ {
+ ReleaseSysCache(tup);
+ PG_RETURN_NULL();
+ }

Hmm. This is an important bit and I did not consider it at first. That makes
the use of ObjectProperty less flexible, because we want to look at the data
in the pg_attribute tuple to be able to return NULL in this case.

I've tweaked a bit what you are proposing, simplifying the code and removing
the first batch of queries in the tests as these were less interesting. How
does that look?
--
Michael

From 07b65a78fb660f6a27ef464870a3e27f788c5e28 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Mon, 8 Jul 2024 16:24:42 +0900
Subject: [PATCH v2] Extend pg_get_acl to handle sub-object IDs.

This patch modifies the pg_get_acl function to accept a third objsubid
param. This enables the retrieval of ACLs for columns within a table.
---
 src/include/catalog/pg_proc.dat          |  2 +-
 src/backend/catalog/objectaddress.c      | 60 +++-
 src/test/regress/expected/privileges.out | 32 +++--
 src/test/regress/sql/privileges.sql      | 12 +++--
 doc/src/sgml/func.sgml                   |  6 +--
 5 files changed, 87 insertions(+), 25 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0bf413fe05..0a0f6aa198 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6364,7 +6364,7 @@
 { oid => '8730', descr => 'get ACL for SQL object',
   proname => 'pg_get_acl', provolatile => 's', prorettype => '_aclitem',
-  proargtypes => 'oid oid', proargnames => '{classid,objid}',
+  proargtypes => 'oid oid int4', proargnames => '{classid,objid,objsubid}',
   prosrc => 'pg_get_acl' },
 { oid => '3839',
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 2983b9180f..fac3922964 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -4364,19 +4364,19 @@ pg_identify_object_as_address(PG_FUNCTION_ARGS)
 
 /*
  * SQL-level callable function to obtain the ACL of a specified object, given
- * its catalog OID and object OID.
+ * its catalog OID, object OID and sub-object ID.
  */
 Datum
 pg_get_acl(PG_FUNCTION_ARGS)
 {
 	Oid			classId = PG_GETARG_OID(0);
 	Oid			objectId = PG_GETARG_OID(1);
+	int32		objsubid = PG_GETARG_INT32(2);
 	Oid			catalogId;
 	AttrNumber	Anum_acl;
-	Relation	rel;
-	HeapTuple	tup;
 	Datum		datum;
 	bool		isnull;
+	HeapTuple	tup;
 
 	/* for "pinned" items in pg_depend, return null */
 	if (!OidIsValid(classId) && !OidIsValid(objectId))
@@ -4391,18 +4391,52 @@ pg_get_acl(PG_FUNCTION_ARGS)
 	if (Anum_acl == InvalidAttrNumber)
 		PG_RETURN_NULL();
 
-	rel = table_open(catalogId, AccessShareLock);
-
-	tup = get_catalog_object_by_oid(rel, get_object_attnum_oid(catalogId),
-									objectId);
-	if (!HeapTupleIsValid(tup))
+	/*
+	 * If dealing with a relation's attribute (objsubid is set), the ACL is
+	 * retrieved from pg_attribute.
+	 */
+	if (classId == RelationRelationId && objsubid != 0)
 	{
-		table_close(rel, AccessShareLock);
-		PG_RETURN_NULL();
-	}
+		AttrNumber	attnum = (AttrNumber) objsubid;
+		Form_pg_attribute atttup;
 
-	datum = heap_getattr(tup, Anum_acl, RelationGetDescr(rel), &isnull);
-	table_close(rel, AccessShareLock);
+		tup = SearchSysCache2(ATTNUM, ObjectIdGetDatum(objectId),
+							  Int16GetDatum(attnum));
+
+		if (!HeapTupleIsValid(tup))
+			PG_RETURN_NULL();
+
+		atttup = (Form_pg_attribute) GETSTRUCT(tup);
+
+		/* ignore dropped columns */
+		if (atttup->attisdropped)
+		{
+			ReleaseSysCache(tup);
+			PG_RETURN_NULL();
+		}
+
+		datum = SysCacheGetAttr(ATTNUM, tup,
+								Anum_pg_attribute_attacl,
+								&isnull);
+		ReleaseSysCache(tup);
+	}
+	else
+	{
+		Relation	rel;
+
+		rel = table_open(catalogId, AccessShareLock);
+
+		tup = get_catalog_object_by_oid(rel, get_object_attnum_oid(catalogId),
+										objectId);
+		if (!HeapTupleIsValid(tup))
+		{
+			table_close(rel, AccessShareLock);
+			PG_RETURN_NULL();
+		}
+
+		datum = heap_getattr(tup, Anum_acl, RelationGetDescr(rel), &isnull);
+		table_close(rel, AccessShareLock);
+	}
 
 	if (isnull)
 		PG_RETURN_NULL();
diff --git a/src/test/regress/expected/privileges.ou
Re: Partial aggregates pushdown
On Sun, 30 Jun 2024 at 23:42, fujii.y...@df.mitsubishielectric.co.jp wrote: > Instead, I can make PARTIAL_AGGREGATE an unreserved word by placing it after > the FILTER clause, like avg(c1) FILTER (WHERE c2 > 0) PARTIAL_AGGREGATE, and > by marking it as an ASLABEL word like FILTER. > I attached the patch of the method. > If there are no objections, I would like to proceed with the method described > above. > I'd appreciate it if anyone comment the method. I like this approach of using PARTIAL_AGGREGATE in the same place as the FILTER clause.
Re: Partial aggregates pushdown
On Sun, 7 Jul 2024 at 23:52, fujii.y...@df.mitsubishielectric.co.jp wrote: > My plan for advancing the patch involves the following steps: > Step1. Decide the approach on transmitting state value. > Step2. Implement code (including comments) and tests to support a subset of > aggregate functions. > Specifically, I plan to support avg, sum, and other aggregate functions > like min and max which don't need export/import functions. > Step3. Add documentations. > > To clarify my approach, should I proceed with Step 3 before Step2? (my opinion, Bruce might have a different one) I think it's good that you split the original patch in two: 0001: non-internal partial aggregates 0002: internal partial aggregates I think we're aligned on the general design of 0001. So I think now is definitely the time to include documentation there, so we can discuss this patch in more detail, and move it forward. I think generally for 0002 it would also be useful to have documentation, I personally like reading it to understand the general design and then comparing that to the code. But I also understand that the language differences between Japanese and English, makes writing such docs a significant effort for you. So I think it would be fine to skip docs for 0002 for now until we decide on the approach we want to take for internal partial aggregates.
Re: Internal error codes triggered by tests
Hello Michael,

05.07.2024 03:57, Michael Paquier wrote:
> On Thu, Jul 04, 2024 at 11:00:01AM +0300, Alexander Lakhin wrote:
>> Could you please share your thoughts regarding other error cases, which are
>> not triggered by existing tests, but still can be easily reached by users?
>>
>> For example:
>> SELECT satisfies_hash_partition(1, 1, 0, 0);
>>
>> ERROR:  XX000: could not open relation with OID 1
>> LOCATION:  relation_open, relation.c:61
>>
>> or:
>> CREATE TABLE t (b bytea);
>> INSERT INTO t SELECT ''::bytea;
>> CREATE INDEX brinidx ON t USING brin (b bytea_bloom_ops(n_distinct_per_range = -1.0));
>>
>> ERROR:  XX000: the bloom filter is too large (44629 > 8144)
>> LOCATION:  bloom_init, brin_bloom.c:344
>>
>> Should such cases be corrected too?
>
> This is a case-by-case. satisfies_hash_partition() is undocumented, so
> doing nothing is fine by me. The second one, though, is something that
> can be triggered with rather normal DDL sequences. That's more annoying.

Thank you for the answer!

Let me show you other error types for discussion/classification:

SELECT pg_describe_object(1::regclass, 0, 0);
ERROR:  XX000: unsupported object class: 1
LOCATION:  getObjectDescription, objectaddress.c:4016

or
SELECT pg_identify_object_as_address('1'::regclass, 1, 1);
ERROR:  XX000: unsupported object class: 1
LOCATION:  getObjectTypeDescription, objectaddress.c:4597
--
SELECT format('BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET TRANSACTION SNAPSHOT ''%s''', repeat('-', 1000)) \gexec
ERROR:  XX000: could not open file "pg_snapshots/-...---" for reading: File name too long
LOCATION:  ImportSnapshot, snapmgr.c:1428
--
CREATE OPERATOR === (leftarg = int4, rightarg = int4, procedure = int4eq, commutator = ===, hashes);
CREATE TABLE t1 (a int);
ANALYZE t1;
CREATE TABLE t2 (a int);
SELECT * FROM t1, t2 WHERE t1.a === t2.a;
ERROR:  XX000: could not find hash function for hash operator 16385
LOCATION:  ExecHashTableCreate, nodeHash.c:560
--
WITH RECURSIVE oq(x) AS (
 WITH iq as (SELECT * FROM oq) SELECT * FROM iq
 UNION
 SELECT * from iq
)
SELECT * FROM oq;
ERROR:  XX000: missing recursive reference
LOCATION:  checkWellFormedRecursion, parse_cte.c:896
(commented as "should not happen", but still...)
--
CREATE EXTENSION ltree;
SELECT '1' ::ltree @ (repeat('!', 100)) ::ltxtquery;
ERROR:  XX000: stack too short
LOCATION:  makepol, ltxtquery_io.c:252
--
There are also a couple of dubious ereport(DEBUG1, (errcode(ERRCODE_INTERNAL_ERROR), ...) calls like:

    /*
     * User-defined picksplit failed to create an actual split, ie it put
     * everything on the same side. Complain but cope.
     */
    ereport(DEBUG1,
            (errcode(ERRCODE_INTERNAL_ERROR),
             errmsg("picksplit method for column %d of index \"%s\" failed",
                    attno + 1, RelationGetRelationName(r)),

I'm not mentioning errors that require more analysis and maybe correcting the
surrounding logic, not the ERRCODE only.

Maybe it makes sense to separate the discussion of such errors, which are not
triggered by tests / not covered; I'm just not sure how to handle them
efficiently.

Best regards,
Alexander
Re: [PATCH] Add min/max aggregate functions to BYTEA
Hi, > What part of commitfest should I put the current patch to: "SQL > Commands", "Miscellaneous" or something else? I can't figure it out. Personally I qualified a similar patch [1] as "Server Features", although I'm not 100% sure if this was the best choice. [1]: https://commitfest.postgresql.org/48/4905/ -- Best regards, Aleksander Alekseev
Re: Removing unneeded self joins
On Thu, Jul 4, 2024 at 11:40 AM jian he wrote:
> On Thu, Jul 4, 2024 at 11:04 AM Alexander Korotkov wrote:
> >
> > On Thu, Jul 4, 2024 at 5:15 AM jian he wrote:
> > > in remove_self_join_rel, i have
> > > ```ChangeVarNodes((Node *) root->parse, toRemove->relid, toKeep->relid, 0);```
> > > which will change the joinlist(RangeTblRef) from (1,2) to (2,2).
> > > Immediately after this call, I wrote a function (restore_rangetblref)
> > > to restore the joinlist as original (1,2).
> > > then remove_rel_from_joinlist won't error out.
> > > see remove_self_join_rel, restore_rangetblref.
> >
> > Thank you, now this is clear. Could we add additional parameters to
> > ChangeVarNodes() instead of adding a new function which reverts part
> > of changes.
> >
>
> I didn't dare to. we have 42 occurrences of ChangeVarNodes.
> adding a parameter to it only for one location seems not intuitive.
>
> Now I have tried. changing to
> `ChangeVarNodes(Node *node, int rt_index, int new_index, int sublevels_up, bool change_RangeTblRef)`
>
> /* Replace varno in all the query structures */
> ChangeVarNodes((Node *) root->parse, toRemove->relid, toKeep->relid, 0, false);
> ```
>
> it seems to work, pass the regression test.
> ```ChangeVarNodes((Node *) root->parse, toRemove->relid, toKeep->relid, 0, false);```
> is in remove_self_join_rel, remove_self_joins_one_group, remove_self_joins_recurse.
> all other places are ```ChangeVarNodes((Node *) root->parse, toRemove->relid, toKeep->relid, 0, true);```
> so ChangeVarNodes add a parameter will only influence the SJE feature.

Good. But I think it's not necessary to replace the function signature in all
42 occurrences. This will make our patch unnecessarily conflict with others.
Instead we can have two functions: ChangeVarNodes() (the original function
signature) and ChangeVarNodesExtended() (the extended function signature).
Then existing occurrences can still use ChangeVarNodes(), which will be just
a shortcut for ChangeVarNodesExtended().

--
Regards,
Alexander Korotkov
Supabase
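A minimal sketch of that wrapper shape (hypothetical signatures and stubbed
bodies, not committed code) could look like this:

#include <stdbool.h>
#include <stddef.h>

typedef struct Node Node;		/* stand-in for the planner's Node type */

/* Extended variant carries the new flag (hypothetical signature). */
static void
ChangeVarNodesExtended(Node *node, int rt_index, int new_index,
					   int sublevels_up, bool change_RangeTblRef)
{
	(void) node;
	(void) rt_index;
	(void) new_index;
	(void) sublevels_up;
	(void) change_RangeTblRef;
	/* ...walk the tree, skipping RangeTblRef nodes when the flag is false... */
}

/* The existing signature stays a thin shortcut, so current call sites are untouched. */
static void
ChangeVarNodes(Node *node, int rt_index, int new_index, int sublevels_up)
{
	ChangeVarNodesExtended(node, rt_index, new_index, sublevels_up, true);
}

int
main(void)
{
	ChangeVarNodes(NULL, 1, 2, 0);				/* old-style call keeps working */
	ChangeVarNodesExtended(NULL, 1, 2, 0, false);	/* SJE-style call */
	return 0;
}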
Re: Adminpack removal
On 01/07/2024 at 10:07, Daniel Gustafsson wrote:
>> On 28 Jun 2024, at 09:06, Philippe BEAUDOIN wrote:
>> So just looking in public repos covers probably less than 1% of the code.
>> However, this may give a first idea, especially if a feature use is
>> already detected.
>
> Searching for anything on GitHub is essentially a dead end since it reports
> so many duplicates in forks etc. That being said, I did a lot of searching
> and browsing to find users [0], but came up empty (apart from forks which
> already maintain their own copy). A more targeted search is the Debian Code
> Search, which at the time of removal (and well before then) showed zero
> occurrences of adminpack functions in any packaged software, and no
> extensions which had adminpack as a dependency. While not an exhaustive
> search by any means, it does provide a good hint.
>
> Since you list no other extensions using adminpack to support keeping it, I
> assume you also didn't find any when searching?

I just said that there is much more code in private repos (so not analyzable)
than in the public ones.

> --
> Daniel Gustafsson
>
> [0] https://www.postgresql.org/message-id/B07CC211-DE35-4AC5-BD4E-0C6466700B06%40yesql.se
Re: MIN/MAX functions for a record
Hi, > Many thanks. Here is the corrected patch. Now it also includes MIN() > support and tests. Michael Paquier (cc:'ed) commented offlist that I forgot to change the documentation. Here is the corrected patch. -- Best regards, Aleksander Alekseev v3-0001-Support-min-record-and-max-record-aggregates.patch Description: Binary data
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
Hi, > Here is the corrected patchset. TWIMC this is currently listed as an open item for PG17 [1]. Sorry if everyone interested is already aware. [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items -- Best regards, Aleksander Alekseev
Re: Use pgBufferUsage for block reporting in analyze
Hi,

Thanks for the review, I've updated the patches with the suggestions:
- moved renaming of misses to reads to the first patch
- added intermediate variables for total blks usage

I've also done some additional tests using the provided
vacuum_analyze_buffer_usage.sql script. It relies on pg_stat_statements to
check the results (only pgss gives information on dirtied buffers). It gives
the following output:

psql:vacuum_analyze_buffer_usage.sql:21: INFO:  vacuuming "postgres.pg_temp_7.vacuum_blks_stat_test"
...
buffer usage: 105 hits, 3 reads, 6 dirtied
...
       query        | sum_hit | sum_read | sum_dirtied
--------------------+---------+----------+-------------
 VACUUM (VERBOSE... |     105 |        3 |           6

For vacuum, we have the same results with SKIP_DATABASE_STATS. Without this
setting, we would have block usage generated by vac_update_datfrozenxid
outside of vacuum_rel and therefore not tracked by the verbose output.

For the second test, the second patch is needed to have ANALYZE (VERBOSE)
output the block usage. It will output the following:

psql:vacuum_analyze_buffer_usage.sql:29: INFO:  analyzing "pg_temp_7.vacuum_blks_stat_test"
...
buffer usage: 84 hits, 33 reads, 2 dirtied
...
        query        | sum_hit | sum_read | sum_dirtied
---------------------+---------+----------+-------------
 ANALYZE (VERBOSE... |      91 |       38 |           2

There are additional buffer hits/reads reported by pgss; those are from
analyze_rel opening the relations in try_relation_open and are not tracked by
the ANALYZE VERBOSE output.

v3-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data

v3-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data

vacuum_analyze_buffer_usage.sql
Description: Binary data
Re: Interrupts vs signals
On 08/07/2024 05:56, Thomas Munro wrote:
> Here's an updated version of this patch. The main idea is that
> SendProcSignal(pid, PROCSIGNAL_XXX, procno) becomes
> SendInterrupt(INTERRUPT_XXX, procno), and all the pending interrupt global
> variables and pss_procsignalFlags[] go away, along with the SIGUSR1
> handler. The interrupts are compressed into a single bitmap. See commit
> message for more details. The patch is failing on Windows CI for reasons I
> haven't debugged yet, but I wanted to share what I have so far. Work in
> progress!
>
> Here is my attempt to survey the use of signals and write down what I think
> we might do about them all so far, to give the context for this patch:
> https://wiki.postgresql.org/wiki/Signals
>
> Comments, corrections, edits very welcome.

Nice, thanks!

> Background worker state notifications are also changed from raw
> kill(SIGUSR1) to SetLatch(). That means that SetLatch() is now called from
> the postmaster. The main purpose of including that change is to be able to
> remove procsignal_sigusr1_handler completely and set SIGUSR1 to SIG_IGN,
> and show the system working.
>
> XXX Do we need to invent SetLatchRobust() that doesn't trust anything in
> shared memory, to be able to set latches from the postmaster?

The patch actually does both: it still does kill(SIGUSR1) and also sets the
latch.

I think it would be nice if RegisterDynamicBackgroundWorker() had a "bool
notify_me" argument, instead of requiring the caller to set "bgw_notify_pid =
MyProcPid" before the call. That's a backwards-compatibility break, but maybe
we should bite the bullet and do it. Or we could do this in
RegisterDynamicBackgroundWorker():

if (worker->bgw_notify_pid == MyProcPid)
    worker->bgw_notify_pgprocno = MyProcNumber;

I think we can forbid setting bgw_notify_pid to anything else than 0 or
MyProcPid.

A SetLatchRobust would be nice. Looking at SetLatch(), I don't think it can
do any damage if you called it on a pointer to garbage, except if the pointer
itself is bogus, then just dereferencing it can cause a segfault. So it would
be nice to have a version specifically designed with that in mind. For
example, it could assume that the Latch's pid is never legally equal to
MyProcPid, because postmaster cannot own any latches.

Another approach would be to move the responsibility of background worker
state notifications out of postmaster completely. When a new background
worker is launched, the worker process itself could send the notification
that it has started. And similarly, when a worker exits, it could send the
notification just before exiting. There's a little race condition with
exiting: if a process is waiting for the bgworker to exit, and launches a new
worker immediately when the old one exits, there will be a brief period when
the old and new process are alive at the same time. The old worker wouldn't
be doing anything interesting anymore since it's exiting, but it still counts
towards max_worker_processes, so launching the new process might fail because
of hitting the limit. Maybe we should just bump up max_worker_processes. Or
postmaster could check PMChildFlags and not count processes that have already
deregistered from PMChildFlags towards the limit.

-volatile uint32 InterruptHoldoffCount = 0;
-volatile uint32 QueryCancelHoldoffCount = 0;
-volatile uint32 CritSectionCount = 0;
+uint32 InterruptHoldoffCount = 0;
+uint32 QueryCancelHoldoffCount = 0;
+uint32 CritSectionCount = 0;

I wondered if these are used in PG_TRY-CATCH blocks in a way that would still
require volatile. I couldn't find any such usage by some quick grepping, so I
think we're good, but I thought I'd mention it.

+/*
+ * The set of "standard" interrupts that CHECK_FOR_INTERRUPTS() and
+ * ProcessInterrupts() handle. These perform work that is safe to run whenever
+ * interrupts are not "held". Other kinds of interrupts are only handled at
+ * more restricted times.
+ */
+#define INTERRUPT_STANDARD_MASK \

Some interrupts are missing from this mask:

- INTERRUPT_PARALLEL_APPLY_MESSAGE
- INTERRUPT_IDLE_STATS_UPDATE_TIMEOUT
- INTERRUPT_SINVAL_CATCHUP
- INTERRUPT_NOTIFY

Is that on purpose?

-/*
- * Because backends sitting idle will not be reading sinval events, we
- * need a way to give an idle backend a swift kick in the rear and make
- * it catch up before the sinval queue overflows and forces it to go
- * through a cache reset exercise. This is done by sending
- * PROCSIG_CATCHUP_INTERRUPT to any backend that gets too far behind.
- *
- * The signal handler will set an interrupt pending flag and will set the
- * processes latch. Whenever starting to read from the client, or when
- * interrupted while doing so, ProcessClientReadInterrupt() will call
- * ProcessCatchupEvent().
- */
-volatile sig_atomic_t catchupInterruptPending = false;

Would be nice to move that comment somewhere else rather than remove it
completely.

--- a/src/backe
RE: Conflict Detection and Resolution
On Monday, July 8, 2024 12:32 PM Zhijie Hou (Fujitsu) wrote:
>
> I researched how to detect and resolve update_deleted and thought about one
> idea: which is to maintain the xmin in the logical slot to preserve the dead
> row and support latest_timestamp_xmin resolution for update_deleted to
> maintain data consistency.
>
> Here are details of the xmin idea and resolution of update_deleted:
>
> 1. how to preserve the dead row so that we can detect update_delete
>    conflict correctly. (In the following explanation, let's assume there is
>    a multi-master setup with nodes A and B.)
>
> To preserve the dead row on node A, I think we could maintain the "xmin"
> in the logical replication slot on Node A to prevent VACUUM from removing
> the dead row in the user table. The walsender that acquires the slot is
> responsible for advancing the xmin. (Note that I am trying to explore the
> xmin idea as it could be more efficient than using commit_timestamp, and
> the logic could be simpler as we are already maintaining catalog_xmin in
> the logical slot and xmin in the physical slot.)
>
> - Strategy for advancing xmin:
>
> The xmin can be advanced if a) a transaction (xid:1000) has been flushed
> to the remote node (Node B in this case). *AND* b) On Node B, the local
> transactions that happened before applying the remote transaction (xid:1000)
> were also sent and flushed to Node A.
>
> - The implementation:
>
> Condition a) can be achieved with existing code; the walsender can advance
> the xmin similarly to the catalog_xmin.
>
> For condition b), we can add a subscription option (say 'feedback_slot').
> The feedback_slot indicates the replication slot that will send changes to
> the origin (on Node B, the slot should be subBA). The apply worker will
> check the status (confirmed flush LSN) of the 'feedback slot' and send
> feedback to the walsender about the WAL position that has been sent and
> flushed via the feedback_slot.

The above are some initial thoughts on how to preserve the dead row for
update_deleted conflict detection.

After thinking more, I have identified a few additional cases that I missed
to analyze regarding the design. One aspect that needs more thought is the
possibility of multiple slots on each node. In this scenario, the
'feedback_slot' subscription option would need to be structured as a list.
However, requiring users to specify all the slots may not be user-friendly. I
will explore whether this process can be automated.

In addition, I will think more about the potential impact of re-using the
existing 'xmin' of the slot, which may affect existing logic that relies on
'xmin'. I will analyze more and reply about these points.

Best Regards,
Hou zj
Re: Incorrect results from numeric round() and trunc()
On Mon, 8 Jul 2024 at 00:40, Joel Jacobson wrote:
>
> On Sun, Jul 7, 2024, at 13:28, Dean Rasheed wrote:
> > I've also tidied up a bit by replacing all instances of SHRT_MAX with
> > a new constant NUMERIC_WEIGHT_MAX, whose name more accurately
> > describes the limit, as used in various other overflow checks.
>
> Having thought a bit more on this, I think we probably need a
> DEC_DIGITS-sensitive definition of NUMERIC_WEIGHT_MAX,
> since per spec the max range for numeric is 0x20000 (131072)
> decimal digits.
>

No, the maximum weight is determined by the use of int16 to store the
weight. Therefore, if you did reduce DEC_DIGITS to 1 or 2, the number
of decimal digits allowed before the decimal point would be reduced too.

Regards,
Dean
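To make the relationship concrete, here is a small, self-contained
back-of-the-envelope program (an illustration of the reasoning above, not
code from the patch): the weight is an int16 counting NBASE digits, so the
number of decimal digits available before the decimal point scales with
DEC_DIGITS.

#include <stdio.h>

int
main(void)
{
	const int	weight_max = 0x7FFF;	/* int16 limit, i.e. NUMERIC_WEIGHT_MAX */
	const int	dec_digits_options[] = {1, 2, 4};

	for (int i = 0; i < 3; i++)
	{
		int		dec_digits = dec_digits_options[i];
		long	max_int_digits = (long) (weight_max + 1) * dec_digits;

		/* For DEC_DIGITS = 4 this prints 131072, matching the figure above. */
		printf("DEC_DIGITS=%d -> up to ~%ld decimal digits before the point\n",
			   dec_digits, max_int_digits);
	}
	return 0;
}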
Re: Parallel CREATE INDEX for GIN indexes
On Sun, 7 Jul 2024, 18:26 Tomas Vondra, wrote:
>
> On 7/5/24 21:45, Matthias van de Meent wrote:
>> On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent wrote:
>>>
>>> On Mon, 24 Jun 2024 at 02:58, Tomas Vondra wrote:
>>>> So I've switched this to use the regular data-type comparisons, with
>>>> SortSupport etc. There's a bit more cleanup remaining and testing
>>>> needed, but I'm not aware of any bugs.
>>>
>>> ---
>>> +++ b/src/backend/utils/sort/tuplesortvariants.c
>>>
>>> I was thinking some more about merging tuples inside the tuplesort. I
>>> realized that this could be implemented by allowing buffering of tuple
>>> writes in writetup. This would require adding a flush operation at the
>>> end of mergeonerun to store the final unflushed tuple on the tape, but
>>> that shouldn't be too expensive. This buffering, when implemented
>>> through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
>>> merge the TID lists of same-valued GIN tuples while they're getting
>>> stored and re-sorted, thus reducing the temporary space usage of the
>>> tuplesort by some amount with limited overhead for other
>>> non-deduplicating tuplesorts.
>>>
>>> I've not yet spent the time to get this to work though, but I'm fairly
>>> sure it'd use less temporary space than the current approach with the
>>> 2 tuplesorts, and could have lower overall CPU overhead as well
>>> because the number of sortable items gets reduced much earlier in the
>>> process.
>>
>> I've now spent some time on this. Attached the original patchset, plus
>> 2 incremental patches, the first of which implements the above design
>> (patch no. 8).
>>
>> Local tests show it's significantly faster: for the below test case
>> I've seen reindex time reduced from 777455ms to 626217ms, or ~20%
>> improvement.
>>
>> After applying the 'reduce the size of GinTuple' patch, index creation
>> time is down to 551514ms, or about 29% faster total. This all was
>> tested with a fresh stock postgres configuration.
>>
>> """
>> CREATE UNLOGGED TABLE testdata
>> AS SELECT sha256(i::text::bytea)::text
>> FROM generate_series(1, 1500) i;
>> CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops);
>> """
>>
>
> Those results look really promising. I certainly would not have expected
> such improvements - 20-30% speedup on top of the already parallel run
> seems great. I'll do a bit more testing to see how much this helps for
> the "more realistic" data sets.
>
> I can't say I 100% understand how the new stuff in tuplesortvariants.c
> works, but that's mostly because my knowledge of tuplesort internals is
> fairly limited (and I managed to survive without that until now).
>
> What makes me a bit unsure/uneasy is that this seems to "inject" custom
> code fairly deep into the tuplesort logic. I'm not quite sure if this is
> a good idea ...

I thought it was still fairly high-level: it adds (what seems to me) a
natural extension to the pre-existing "write a tuple to the tape" API,
allowing the Tuplesort (TS) implementation to further optimize its ordered
tape writes through buffering (and combining) of tuple writes. While it does
remove the current 1:1 relation of TS tape writes to tape reads for the GIN
case, there is AFAIK no code in TS that relies on that 1:1 relation.

As to the GIN TS code itself; yes, it's more complicated, mainly because it
uses several optimizations to reduce unnecessary allocations and
(de)serializations of GinTuples, and I'm aware of even more such
optimizations that can be added at some point.

As an example: I suspect the use of GinBuffer in writetup_index_gin to be a
significant resource drain in my patch because it lacks "freezing" in the
tuplesort buffer. When no duplicate key values are encountered, the code is
nearly optimal (except for a full tuple copy to get the data into the
GinBuffer's memory context), but if more than one GinTuple has the same key
in the merge phase we deserialize both tuples' posting lists and merge the
two. I suspect that merge to be more expensive than operating on the
compressed posting lists of the GinTuples themselves, so that's something I
think could be improved. I suspect/guess it could save another 10% in select
cases, and it will definitely reduce the memory footprint of the buffer.

Another thing that can be optimized is the current approach of inserting
data into the index: I think it's kind of wasteful to decompress and later
re-compress the posting lists once we start storing the tuples on disk.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
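For readers less familiar with this area, the merge step mentioned above
boils down to combining two sorted TID lists into one. A minimal,
self-contained C sketch of that operation (illustrative only -- the real
code operates on GinTuple posting lists, possibly compressed) is:

#include <stdint.h>
#include <stdio.h>

/* Merge two ascending arrays of "TIDs" (plain uint64 keys here) into dst,
 * dropping duplicates.  Returns the number of entries written. */
static int
merge_tid_lists(const uint64_t *a, int na, const uint64_t *b, int nb,
				uint64_t *dst)
{
	int		ia = 0, ib = 0, n = 0;

	while (ia < na || ib < nb)
	{
		if (ib >= nb || (ia < na && a[ia] < b[ib]))
			dst[n++] = a[ia++];
		else if (ia >= na || b[ib] < a[ia])
			dst[n++] = b[ib++];
		else
		{
			dst[n++] = a[ia++];	/* equal: keep one copy */
			ib++;
		}
	}
	return n;
}

int
main(void)
{
	uint64_t	a[] = {1, 4, 7};
	uint64_t	b[] = {2, 4, 9, 11};
	uint64_t	out[7];
	int			n = merge_tid_lists(a, 3, b, 4, out);

	for (int i = 0; i < n; i++)
		printf("%llu ", (unsigned long long) out[i]);
	printf("\n");				/* prints: 1 2 4 7 9 11 */
	return 0;
}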
Re: Improving the latch handling between logical replication launcher and worker processes.
On Fri, Jul 5, 2024 at 6:38 PM Heikki Linnakangas wrote:
>
> Your solution with an additional latch seems better because it makes
> WaitForReplicationWorkerAttach() react more quickly, without the 10 s
> wait. I'm surprised we have that in the first place, 10 s seems like a
> pretty long time to wait for a parallel apply worker to start. Why was
> that ever OK?
>

Doesn't that call wait for 10 milliseconds? The comment atop WaitLatch ("The
"timeout" is given in milliseconds...") indicates the timeout is in
milliseconds.

--
With Regards,
Amit Kapila.
Re: pgsql: Add pg_get_acl() to get the ACL for a database object
On Mon, Jul 8, 2024, at 10:34, Michael Paquier wrote: > Thanks for the patch. I have been looking at it for a few hours, > eyeing a bit on the ObjectProperty parts a bit if we were to extend it > for sub-object IDs, and did not like the complexity this introduces, > so I'd be OK to live with the extra handling in pg_get_acl() itself. > > + /* ignore dropped columns */ > + if (atttup->attisdropped) > + { > + ReleaseSysCache(tup); > + PG_RETURN_NULL(); > + } > > Hmm. This is an important bit and did not consider it first. That > makes the use of ObjectProperty less flexible because we want to look > at the data in the pg_attribute tuple to be able to return NULL in > this case. > > I've tweaked a bit what you are proposing, simplifying the code and > removing the first batch of queries in the tests as these were less > interesting. How does that look? Thanks, nice simplifications. I agree the tests you removed are not that interesting. Looks good to me. Patch didn't apply to HEAD nor on top of any of the previous commits either, but I couldn't figure out why based on the .rej files, strange. I copy/pasted the parts from the patch to test it. Let me know if you need a rebased version of it, in case you will need to do the same, to save some work. Also noted the below in your last commit: a6417078c414 has introduced as project policy that new features committed during the development cycle should use new OIDs in the [8000,] range. 4564f1cebd43 did not respect that rule, so let's renumber pg_get_acl() to use an OID in the correct range. Thanks for the info! Will make sure to use the `src/include/catalog/renumber_oids.pl` tool for future patches. Regards, Joel
Re: Incorrect results from numeric round() and trunc()
On Mon, Jul 8, 2024, at 11:45, Dean Rasheed wrote:
> On Mon, 8 Jul 2024 at 00:40, Joel Jacobson wrote:
>>
>> On Sun, Jul 7, 2024, at 13:28, Dean Rasheed wrote:
>> > I've also tidied up a bit by replacing all instances of SHRT_MAX with
>> > a new constant NUMERIC_WEIGHT_MAX, whose name more accurately
>> > describes the limit, as used in various other overflow checks.
>>
>> Having thought a bit more on this, I think we probably need a
>> DEC_DIGITS-sensitive definition of NUMERIC_WEIGHT_MAX,
>> since per spec the max range for numeric is 0x20000 (131072)
>> decimal digits.
>>
>
> No, the maximum weight is determined by the use of int16 to store the
> weight. Therefore if you did reduce DEC_DIGITS to 1 or 2, the number
> of decimal digits allowed before the decimal point would be reduced
> too.

OK, that can actually be seen as a feature, especially since it's of course
more likely that DEC_DIGITS could increase in the future than decrease.

For example, let's say we would double it to 8; then, if NUMERIC_WEIGHT_MAX
were still 0x7FFF (32767), the maximum range for numeric would increase from
131072 to 262144 decimal digits allowed before the decimal point.

LGTM.

Regards,
Joel
Re: Address the -Wuse-after-free warning in ATExecAttachPartition()
On Mon, Jul 8, 2024 at 3:22 PM Nitin Jadhav wrote: > > In [1], Andres reported a -Wuse-after-free bug in the > ATExecAttachPartition() function. I've created a patch to address it > with pointers from Amit offlist. > > The issue was that the partBoundConstraint variable was utilized after > the list_concat() function. This could potentially lead to accessing > the partBoundConstraint variable after its memory has been freed. > > The issue was resolved by using the return value of the list_concat() > function, instead of using the list1 argument of list_concat(). I > copied the partBoundConstraint variable to a new variable named > partConstraint and used it for the previous references before invoking > get_proposed_default_constraint(). I confirmed that the > eval_const_expressions(), make_ands_explicit(), > map_partition_varattnos(), QueuePartitionConstraintValidation() > functions do not modify the memory location pointed to by the > partBoundConstraint variable. Therefore, it is safe to use it for the > next reference in get_proposed_default_constraint() > > Attaching the patch. Please review and share the comments if any. > Thanks to Andres for spotting the bug and some off-list advice on how > to reproduce it. The patch LGTM. Curious how to reproduce the bug ;) > > [1]: > https://www.postgresql.org/message-id/flat/202311151802.ngj2la66jwgi%40alvherre.pgsql#4fc5622772ba0244c1ad203f5fc56701 > > Best Regards, > Nitin Jadhav > Azure Database for PostgreSQL > Microsoft -- Regards Junwang Zhao
Re: Test to dump and restore objects left behind by regression
On Fri, Jul 5, 2024 at 10:59 AM Michael Paquier wrote: > On Fri, Jun 28, 2024 at 06:00:07PM +0530, Ashutosh Bapat wrote: > > Here's a description of patches and some notes > > 0001 > > --- > > 1. Per your suggestion the logic to handle dump output differences is > > externalized in PostgreSQL::Test::AdjustDump. Instead of eliminating > those > > differences altogether from both the dump outputs, the corresponding DDL > in > > the original dump output is adjusted to look like that from the restored > > database. Thus we retain full knowledge of what differences to expect. > > 2. I have changed the name filter_dump to filter_dump_for_upgrade so as > to > > differentiate between two adjustments 1. for upgrade and 2. for > > dump/restore. Ideally the name should have been > adjust_dump_for_ugprade() . > > It's more of an adjustment than filtering as indicated by the function it > > calls. But I haven't changed that. The new function to adjust dumps for > > dump and restore tests is named adjust_dump_for_restore() however. > > 3. As suggested by Daniel upthread, the test for dump and restore happens > > before upgrade which might change the old cluster thus changing the state > > of objects left behind by regression. The test is not executed if > > regression is not used to create the old cluster. > > 4. The code to compare two dumps and report differences if any is moved > to > > its own function compare_dumps() which is used for both upgrade and > > dump/restore tests. > > The test uses the custom dump format for dumping and restoring the > > database. > > At quick glance, that seems to be going in the right direction. Note > that you have forgotten install and uninstall rules for the new .pm > file. > Before submitting the patch, I looked for all the places which mention AdjustUpgrade or AdjustUpgrade.pm to find places where the new module needs to be mentioned. But I didn't find any. AdjustUpgrade is not mentioned in src/test/perl/Makefile or src/test/perl/meson.build. Do we want to also add AdjustUpgrade.pm in those files? > > 0002 increases more the runtime of a test that's already one of the > longest ones in the tree is not really appealing, I am afraid. > We could forget 0002. I am fine with that. But I can change the code such that formats other than "plain" are tested when PG_TEST_EXTRAS contains "regress_dump_formats". Would that be acceptable? -- Best Wishes, Ashutosh Bapat
Re: tests fail on windows with default git settings
Hi

On Sun, 7 Jul 2024 at 07:07, Andres Freund wrote:
> Hi,
>
> On 2024-07-07 01:26:13 -0400, Tom Lane wrote:
> > Andres Freund writes:
> > > Do we want to support checking out with core.autocrlf?
> >
> > -1. It would be a constant source of breakage, and you could never
> > expect that (for example) making a tarball from such a checkout
> > would match anyone else's results.
>
> WFM.
>
> > > If we do not want to support that, ISTM we ought to raise an error
> > > somewhere?
> >
> > +1, if we can figure out how.
>
> I can see two paths:
>
> 1) we prevent eol conversion, by using the right magic incantation in
>    .gitattributes
>
> 2) we check that some canary file is correctly encoded, e.g. during meson
>    configure (should suffice, this is realistically only a windows issue)
>
> It seems that the only realistic way to achieve 1) is to remove the "text"
> attribute from all files. That had me worried for a bit, thinking that might
> have a larger blast radius. However, it looks like this is solely used for
> line-ending conversion. The man page says:
> "This attribute marks the path as a text file, which enables end-of-line
> conversion:"
>
> Which sounds like it'd work well - except that it appears to behave oddly
> when updating to such a change in an existing repo -
>
> cd /tmp/;
> rm -rf pg-eol;
> git -c core.eol=crlf -c core.autocrlf=true clone ~/src/postgresql pg-eol;
> cd pg-eol;
> git config core.eol crlf; git config core.autocrlf true;
> stat src/test/modules/test_json_parser/tiny.json -> 6748 bytes
>
> cd ~/src/postgresql
> stat src/test/modules/test_json_parser/tiny.json -> 6604 bytes
> echo '* -text' >> .gitattributes
> git commit -a -m tmp
>
> cd /tmp/pg-eol
> git pull
> git st
> ...
> nothing to commit, working tree clean
> stat src/test/modules/test_json_parser/tiny.json -> 6748 bytes
>
> I.e. the repo still is in CRLF state.
>
> But if I reclone at that point, the line endings are in a sane state.
>
> IIUC this is because line-ending conversion is done only during
> checkout/checkin.
>
> There are ways to get git to redo the normalization, but it's somewhat
> awkward [1].

Yeah, I vaguely remember playing with core.autocrlf many years ago and
running into similar issues.

> OTOH, given that the tests already fail, I assume our windows contributors
> already have disabled autocrlf?

I can't speak for others of course, but at least as far as building of
installers is concerned, we use tarballs, not git checkouts.

For my own work; well, I've started playing with PG17 on Windows just in the
last month or so and have noticed a number of test failures, as well as a
weird meson issue that only shows up on a GitHub Actions runner. I was hoping
to look into those issues this week, as I've been somewhat sidetracked with
other work of late.

--
Dave Page
pgAdmin: https://www.pgadmin.org
PostgreSQL: https://www.postgresql.org
EDB: https://www.enterprisedb.com
array_in sub function ReadArrayDimensions error message
While reviewing the json query doc, I noticed that the following error message is not quite right. select '[1,2]'::int[]; ERROR: malformed array literal: "[1,2]" LINE 1: select '[1,2]'::int[]; ^ DETAIL: Missing "]" after array dimensions. Should the detail be: "Missing delimiter ":" while specifying array dimensions." ? Because here, the first problem is that the comma should be a colon. Also, shouldn't "DETAIL: Missing "]" after array dimensions." rather read "DETAIL: Missing "]" while specifying array dimensions." ?
Re: Thoughts on NBASE=100000000
On Sun, 7 Jul 2024, 22:40 Joel Jacobson, wrote: > > Hello hackers, > > I'm not hopeful this idea will be fruitful, but maybe we can find solutions > to the problems together. > > The idea is to increase the numeric NBASE from 1e4 to 1e8, which could > possibly > give a significant performance boost of all operations across the board, > on 64-bit architectures, for many inputs. > > Last time numeric's base was changed was back in 2003, when d72f6c75038 > changed > it from 10 to 10000. Back then, 32-bit architectures were still dominant, > so base-10000 was clearly the best choice at this time. > > Today, since 64-bit architectures are dominant, NBASE=1e8 seems like it would > have been the best choice, since the square of that still fits in > a 64-bit signed int. Back then 64-bit was by far not as dominant (server and consumer chips with AMD64 ISA only got released that year after the commit), so I don't think 1e8 would have been the best choice at that point in time. Would be better now, yes, but not back then. > Changing NBASE might seem impossible at first, due to the existing numeric > data > on disk, and incompatibility issues when numeric data is transferred on the > wire. > > Here are some ideas on how to work around some of these: > > - Incrementally changing the data on disk, e.g. upon UPDATE/INSERT > and supporting both NBASE=1e4 (16-bit) and NBASE=1e8 (32-bit) > when reading data. I think that a dynamic decision would make more sense here. At low precision, the overhead of 4+1 bytes vs 2 bytes is quite significant. This sounds important for overall storage concerns, especially if the padding bytes (mentioned below) are added to indicate types. > - Due to the lack of a version field in the NumericVar struct, > we need a way to detect if a Numeric value on disk uses > the existing NBASE=1e4, or NBASE=1e8. > One hack I've thought about is to exploit the fact that NUMERIC_NBYTES, > defined as: > #define NUMERIC_NBYTES(num) (VARSIZE(num) - NUMERIC_HEADER_SIZE(num)) > will always be divisible by two, since a NumericDigit is an int16 (2 bytes). > The idea is then to let "NUMERIC_NBYTES divisible by three" > indicate NBASE=1e8, at the cost of one to three extra padding bytes. Do you perhaps mean NUMERIC_NBYTES *not divisible by 2*, i.e. an uneven NUMERIC_NBYTES as indicator for NBASE=1e8, rather than only multiples of 3? I'm asking because there are many integers divisible by both 2 and 3 (all integer multiples of 6; that's 50% of the multiples of 3), so with the multiple-of-3 scheme we might need up to 5 pad bytes to get to the next multiple of 3 that isn't also a multiple of 2. Additionally, if the last digit would've fit in NBASE_1e4, then the 1e8-based numeric value could even be 7 bytes larger than the equivalent 1e4-based numeric. While I don't think this is worth implementing for general usage, it could be worth exploring for the larger numeric values, where the relative overhead of the larger representation is lower. Kind regards, Matthias van de Meent Neon (https://neon.tech)
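For illustration only, here is a minimal sketch of how the parity trick could be used when reading a value. It assumes the NUMERIC_NBYTES macro quoted above; the NUMERIC_IS_NBASE_1E8 macro and the numeric_digit_width helper are invented for this sketch, not taken from any patch:

#include "utils/numeric.h"

/*
 * Sketch: with NBASE=1e4 the digits are int16, so the data area is always an
 * even number of bytes.  An odd data length (forced by writing one pad byte)
 * would mark a value stored with NBASE=1e8 and int32 digits.
 */
#define NUMERIC_IS_NBASE_1E8(num) \
	((NUMERIC_NBYTES(num) & 1) != 0)

static int
numeric_digit_width(Numeric num)
{
	/* 4-byte digits for base 1e8, 2-byte digits for the existing base 1e4 */
	return NUMERIC_IS_NBASE_1E8(num) ? sizeof(int32) : sizeof(int16);
}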
Is it expected behavior index only scan shows "OPERATOR(pg_catalog." for EXPLAIN?
Hi, While I'm researching about [1], I found there are inconsistent EXPLAIN outputs. Here is an example which shows " OPERATOR(pg_catalog.". Though it's not wrong, I feel like there is no consistency in the output format. -- A reproduce procedure create temp table btree_bpchar (f1 text collate "C"); create index on btree_bpchar(f1 bpchar_ops); insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux'); set enable_seqscan to false; set enable_bitmapscan to false; set enable_indexonlyscan to false; -- or true explain (costs off) select * from btree_bpchar where f1::bpchar like 'foo'; -- Index Scan result QUERY PLAN -- Index Scan using btree_bpchar_f1_idx on btree_bpchar Index Cond: ((f1)::bpchar = 'foo'::bpchar) Filter: ((f1)::bpchar ~~ 'foo'::text) (3 rows) -- Index Only Scan result which has 'OPERATOR' QUERY PLAN --- Index Only Scan using btree_bpchar_f1_idx on btree_bpchar Index Cond: (f1 OPERATOR(pg_catalog.=) 'foo'::bpchar)-- Here is the point. Filter: ((f1)::bpchar ~~ 'foo'::text) (3 rows) IIUC, the index only scan use fixed_indexquals, which is removed "RelabelType" nodes, for EXPLAIN so that get_rule_expr() could not understand the left argument of the operator (f1 if the above case) can be displayed with arg::resulttype and it doesn't need to show "OPERATOR(pg_catalog.)". I've attached PoC patch to show a simple solution. It just adds a new member "indexqualorig" to the index only scan node like the index scan and the bitmap index scan. But, since I'm a beginner about the planner, I might have misunderstood something or there should be better ways. BTW, I'm trying to add a new index AM interface for EXPLAIN on the thread([1]). As the aspect, my above solution might not be ideal because AMs can only know index column ids (varattno) from fixed_indexquals. In that case, to support "fixed_indexquals" as argument of deparse_expression() is better. [1] Improve EXPLAIN output for multicolumn B-Tree Index https://www.postgresql.org/message-id/flat/TYWPR01MB1098260B694D27758FE2BA46FB1C92%40TYWPR01MB10982.jpnprd01.prod.outlook.com Regards, -- Masahiro Ikeda NTT DATA CORPORATION v1-0001-Fix-inconsistent-explain-output-for-index-only-sc.patch Description: v1-0001-Fix-inconsistent-explain-output-for-index-only-sc.patch
Re: Is it expected behavior index only scan shows "OPERATOR(pg_catalog." for EXPLAIN?
On 7/8/24 13:03, masahiro.ik...@nttdata.com wrote: > Hi, > > While I'm researching about [1], I found there are inconsistent EXPLAIN > outputs. > Here is an example which shows " OPERATOR(pg_catalog.". Though it's not wrong, > I feel like there is no consistency in the output format. > > -- A reproduce procedure > create temp table btree_bpchar (f1 text collate "C"); > create index on btree_bpchar(f1 bpchar_ops); > insert into btree_bpchar values ('foo'), ('fool'), ('bar'), ('quux'); > set enable_seqscan to false; > set enable_bitmapscan to false; > set enable_indexonlyscan to false; -- or true > explain (costs off) > select * from btree_bpchar where f1::bpchar like 'foo'; > > -- Index Scan result > QUERY PLAN > -- > Index Scan using btree_bpchar_f1_idx on btree_bpchar >Index Cond: ((f1)::bpchar = 'foo'::bpchar) >Filter: ((f1)::bpchar ~~ 'foo'::text) > (3 rows) > > -- Index Only Scan result which has 'OPERATOR' > QUERY PLAN > --- > Index Only Scan using btree_bpchar_f1_idx on btree_bpchar >Index Cond: (f1 OPERATOR(pg_catalog.=) 'foo'::bpchar)-- Here is > the point. >Filter: ((f1)::bpchar ~~ 'foo'::text) > (3 rows) > This apparently comes from generate_operator_name() in ruleutils.c, where the OPERATOR() decorator is added if: /* * The idea here is to schema-qualify only if the parser would fail to * resolve the correct operator given the unqualified op name with the * specified argtypes. */ So clearly, the code believes just the operator name could be ambiguous, so it adds the namespace too. Why exactly it is considered ambiguous I don't know, but perhaps you have other applicable operators in the search_path, or something like that? > > IIUC, the index only scan use fixed_indexquals, which is removed > "RelabelType" nodes, > for EXPLAIN so that get_rule_expr() could not understand the left argument of > the operator > (f1 if the above case) can be displayed with arg::resulttype and it doesn't > need to > show "OPERATOR(pg_catalog.)". > > I've attached PoC patch to show a simple solution. It just adds a new member > "indexqualorig" > to the index only scan node like the index scan and the bitmap index scan. > But, since I'm > a beginner about the planner, I might have misunderstood something or there > should be better > ways. > I honestly don't know if this is the correct solution. It seems to me handling this at the EXPLAIN level might just mask the issue - it's not clear to me why adding "indexqualorig" would remove the ambiguity (if there's one). Perhaps it might be better to find why the ruleutils.c code thinks the OPERATOR() is necessary, and then improve/fix that? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel CREATE INDEX for GIN indexes
On 7/8/24 11:45, Matthias van de Meent wrote: > On Sun, 7 Jul 2024, 18:26 Tomas Vondra, wrote: >> >> On 7/5/24 21:45, Matthias van de Meent wrote: >>> On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent >>> wrote: On Mon, 24 Jun 2024 at 02:58, Tomas Vondra wrote: > So I've switched this to use the regular data-type comparisons, with > SortSupport etc. There's a bit more cleanup remaining and testing > needed, but I'm not aware of any bugs. --- > +++ b/src/backend/utils/sort/tuplesortvariants.c I was thinking some more about merging tuples inside the tuplesort. I realized that this could be implemented by allowing buffering of tuple writes in writetup. This would require adding a flush operation at the end of mergeonerun to store the final unflushed tuple on the tape, but that shouldn't be too expensive. This buffering, when implemented through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to merge the TID lists of same-valued GIN tuples while they're getting stored and re-sorted, thus reducing the temporary space usage of the tuplesort by some amount with limited overhead for other non-deduplicating tuplesorts. I've not yet spent the time to get this to work though, but I'm fairly sure it'd use less temporary space than the current approach with the 2 tuplesorts, and could have lower overall CPU overhead as well because the number of sortable items gets reduced much earlier in the process. >>> >>> I've now spent some time on this. Attached the original patchset, plus >>> 2 incremental patches, the first of which implement the above design >>> (patch no. 8). >>> >>> Local tests show it's significantly faster: for the below test case >>> I've seen reindex time reduced from 777455ms to 626217ms, or ~20% >>> improvement. >>> >>> After applying the 'reduce the size of GinTuple' patch, index creation >>> time is down to 551514ms, or about 29% faster total. This all was >>> tested with a fresh stock postgres configuration. >>> >>> """ >>> CREATE UNLOGGED TABLE testdata >>> AS SELECT sha256(i::text::bytea)::text >>> FROM generate_series(1, 1500) i; >>> CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops); >>> """ >>> >> >> Those results look really promising. I certainly would not have expected >> such improvements - 20-30% speedup on top of the already parallel run >> seems great. I'll do a bit more testing to see how much this helps for >> the "more realistic" data sets. >> >> I can't say I 100% understand how the new stuff in tuplesortvariants.c >> works, but that's mostly because my knowledge of tuplesort internals is >> fairly limited (and I managed to survive without that until now). >> >> What makes me a bit unsure/uneasy is that this seems to "inject" custom >> code fairly deep into the tuplesort logic. I'm not quite sure if this is >> a good idea ... > > I thought it was still fairly high-level: it adds (what seems to me) a > natural extension to the pre-existing "write a tuple to the tape" API, > allowing the Tuplesort (TS) implementation to further optimize its > ordered tape writes through buffering (and combining) of tuple writes. > While it does remove the current 1:1 relation of TS tape writes to > tape reads for the GIN case, there is AFAIK no code in TS that relies > on that 1:1 relation. > > As to the GIN TS code itself; yes it's more complicated, mainly > because it uses several optimizations to reduce unnecessary > allocations and (de)serializations of GinTuples, and I'm aware of even > more such optimizations that can be added at some point. 
> > As an example: I suspect the use of GinBuffer in writetup_index_gin to > be a significant resource drain in my patch because it lacks > "freezing" in the tuplesort buffer. When no duplicate key values are > encountered, the code is nearly optimal (except for a full tuple copy > to get the data into the GinBuffer's memory context), but if more than > one GinTuple has the same key in the merge phase we deserialize both > tuple's posting lists and merge the two. I suspect that merge to be > more expensive than operating on the compressed posting lists of the > GinTuples themselves, so that's something I think could be improved. I > suspect/guess it could save another 10% in select cases, and will > definitely reduce the memory footprint of the buffer. > Another thing that can be optimized is the current approach of > inserting data into the index: I think it's kind of wasteful to > decompress and later re-compress the posting lists once we start > storing the tuples on disk. > I need to experiment with this a bit more, to better understand the behavior and pros/cons. But one thing that's not clear to me is why would this be better than simply increasing the amount of memory for the initial BuildAccumulator buffer ... Wouldn't that have pretty much the same effect? regards -- Tomas Vondra EnterpriseD
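To make the buffered-writetup idea a bit more concrete, here is a rough sketch of its shape. This is illustration only, not code from the patch set: the gin_buffer_* helpers and the gin_buffer_flush entry point are invented names, while Tuplesortstate, TuplesortPublic, LogicalTape, SortTuple and GinTuple are the types discussed above.

static void
writetup_index_gin_buffered(Tuplesortstate *state, LogicalTape *tape,
							SortTuple *stup)
{
	TuplesortPublic *base = TuplesortstateGetPublic(state);
	GinBuffer  *buffer = (GinBuffer *) base->arg;
	GinTuple   *tup = (GinTuple *) stup->tuple;

	if (gin_buffer_holds_same_key(buffer, tup))
	{
		/* equal keys: merge the TID lists instead of writing two tuples */
		gin_buffer_merge_postings(buffer, tup);
		return;
	}

	/* key changed: emit the merged tuple built so far, then start over */
	if (!gin_buffer_is_empty(buffer))
		gin_buffer_write_and_reset(buffer, state, tape);

	gin_buffer_store(buffer, tup);
}

/*
 * Called once at the end of mergeonerun() so that the last buffered group of
 * tuples is not lost.
 */
static void
gin_buffer_flush(GinBuffer *buffer, Tuplesortstate *state, LogicalTape *tape)
{
	if (!gin_buffer_is_empty(buffer))
		gin_buffer_write_and_reset(buffer, state, tape);
}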
Re: Thoughts on NBASE=100000000
On Mon, Jul 8, 2024, at 12:45, Matthias van de Meent wrote: > On Sun, 7 Jul 2024, 22:40 Joel Jacobson, wrote: >> Today, since 64-bit architectures are dominant, NBASE=1e8 seems like it would >> have been the best choice, since the square of that still fits in >> a 64-bit signed int. > > Back then 64-bit was by far not as dominant (server and consumer chips > with AMD64 ISA only got released that year after the commit), so I > don't think 1e8 would have been the best choice at that point in time. > Would be better now, yes, but not back then. Oh, grammar mistake by me! I meant to say it "would be the best choice", in line with what I wrote above: >> Last time numeric's base was changed was back in 2003, when d72f6c75038 >> changed >> it from 10 to 1. Back then, 32-bit architectures were still dominant, >> so base-1 was clearly the best choice at this time. >> Changing NBASE might seem impossible at first, due to the existing numeric >> data >> on disk, and incompatibility issues when numeric data is transferred on the >> wire. >> >> Here are some ideas on how to work around some of these: >> >> - Incrementally changing the data on disk, e.g. upon UPDATE/INSERT >> and supporting both NBASE=1e4 (16-bit) and NBASE=1e8 (32-bit) >> when reading data. > > I think that a dynamic decision would make more sense here. At low > precision, the overhead of 4+1 bytes vs 2 bytes is quite significant. > This sounds important for overall storage concerns, especially if the > padding bytes (mentioned below) are added to indicate types. Right, I agree. Another idea: It seems possible to reduce the disk space for numerics that fit into one byte, i.e. 0 <= val <= 255, which could be communicated via NUMERIC_NBYTES=1. At least the value 0 should be quite common. >> - Due to the lack of a version field in the NumericVar struct, >> we need a way to detect if a Numeric value on disk uses >> the existing NBASE=1e4, or NBASE=1e8. >> One hack I've thought about is to exploit the fact that NUMERIC_NBYTES, >> defined as: >> #define NUMERIC_NBYTES(num) (VARSIZE(num) - NUMERIC_HEADER_SIZE(num)) >> will always be divisible by two, since a NumericDigit is an int16 (2 bytes). >> The idea is then to let "NUMERIC_NBYTES divisible by three" >> indicate NBASE=1e8, at the cost of one to three extra padding bytes. > > Do you perhaps mean NUMERIC_NBYTES *not divisible by 2*, i.e. an > uneven NUMERIC_NBYTES as indicator for NBASE=1e8, rather than only > multiples of 3? Oh, yes of course! Thinko. > While I don't think this is worth implementing for general usage, it > could be worth exploring for the larger numeric values, where the > relative overhead of the larger representation is lower. Yes, I agree it's definitively seems like a win for larger numeric values. Not sure about smaller numeric values, maybe it's possible to improve upon. Regards, Joel
Re: doc: modify the comment in function libpqrcv_check_conninfo()
On 2024-07-08 15:28, Fujii Masao wrote: On 2024/07/01 18:15, Jelte Fennema-Nio wrote: On Thu, 27 Jun 2024 at 12:27, ikedarintarof wrote: Thanks for your suggestion. I used ChatGPT to choose the wording, but it's still difficult for me. Looks good to me now (but obviously biased since you took my wording). LGTM, too. * Validate connection info string, and determine whether it might cause * local filesystem access to be attempted. The later part of the above comment for libpqrcv_check_conninfo() seems unclear. My understanding is that if must_use_password is true and this function completes without error, the password must be in the connection string, so there's no need to read the .pgpass file (i.e., no local filesystem access). That part seems to be trying to explaing something like that. Is this correct? Anyway, wouldn't it be better to either remove this unclear part or rephrase it for clarity? Regards, Thank you for your comment. I agree "local filesystem access" is vague. I think the reference to .pgpass (local file system) is not necessary in the comment for libpqrcv_check_conninfo() because it is explained in libpqrcv_connect() as shown below. /* * Re-validate connection string. The validation already happened at DDL * time, but the subscription owner may have changed. If we don't recheck * with the correct must_use_password, it's possible that the connection * will obtain the password from a different source, such as PGPASSFILE or * PGPASSWORD. */ libpqrcv_check_conninfo(conninfo, must_use_password); I remove the unclear part from the previous patch and add some explanation for later part of the function. Or, I think it is also good to make the comment more simple: * The function checks that * 1. connection info string is properly parsed and * 2. a password is provided if must_use_password is true. * If the check is failed, the it will raise an error. */ static void libpqrcv_check_conninfo(const char *conninfo, bool must_use_password) Regards, Rintaro IkedaFrom 514bd81f7d1b7104e6d2d92cb072a84096cd3885 Mon Sep 17 00:00:00 2001 From: Rintaro Ikeda <51394766+rinik...@users.noreply.github.com> Date: Thu, 27 Jun 2024 16:04:08 +0900 Subject: [PATCH v2] modify the commnet in libpqrcv_check_conninfo() --- .../replication/libpqwalreceiver/libpqwalreceiver.c| 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index 02f12f2921..e6a4e4cc1c 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -305,12 +305,14 @@ bad_connection: } /* - * Validate connection info string, and determine whether it might cause - * local filesystem access to be attempted. + * The function + * 1. validates connection info string and + * 2. checks a password is provided if must_use_password is true. * * If the connection string can't be parsed, this function will raise - * an error and will not return. If it can, it will return true if this - * connection string specifies a password and false otherwise. + * an error. If must_use_password is true, the function raises an error + * if no password is provided in the connection string. In any other case + * it successfully completes. */ static void libpqrcv_check_conninfo(const char *conninfo, bool must_use_password) -- 2.39.3 (Apple Git-146)
Re: Doc Rework: Section 9.16.13 SQL/JSON Query Functions
On Fri, Jul 5, 2024 at 10:16 PM Erik Rijkers wrote: > Op 7/5/24 om 14:35 schreef Amit Langote: > > Hi Jian, > > > > Thanks for the reviews. > > > > [v3-0001-SQL-JSON-Various-improvements-to-SQL-JSON-query-f.patch] >i.e., from the patch for doc/src/sgml/func.sgml > > > Small changes: > > 4x: > 'a SQL' should be > 'an SQL' > ('a SQL' does never occur in the docs; it's always 'an SQL'; apperently > the 'sequel' pronunciation is out) > > 'some other type to which can be successfully coerced' > 'some other type to which it can be successfully coerced' > > > 'specifies the behavior behavior' > 'specifies the behavior' > > > In the following sentence: > > "By default, the result is returned as a value of type jsonb, > though the RETURNING clause can be used to return > the original jsonb value as some other type to which it > can be successfully coerced." > > it seems to me that this phrase is better removed: > "the original jsonb value as" Thanks, I've addressed all these in the next patch I'll send. -- Thanks, Amit Langote
Re: Slow catchup of 2PC (twophase) transactions on replica in LR
On Mon, Jul 8, 2024 at 12:34 PM Hayato Kuroda (Fujitsu) wrote: > > > Another possible problem is related to my use case. I haven't reproduced > > this > > case, just some thoughts. I guess, when two_phase is ON, the PREPARE > > statement > > may be truncated from the WAL at checkpoint, but COMMIT PREPARED is still > > kept > > in the WAL. On catchup, I would ask the master to send transactions from > > some > > restart LSN. I would like to get all such transactions completely, with > > their > > bodies, not only COMMIT PREPARED messages. > > I don't think it is a real issue. WALs for prepared transactions will be retained > until they are committed/aborted. > When the two_phase is on and transactions are PREPAREd, they will not be > cleaned up from the memory (See ReorderBufferProcessTXN()). Then, > RUNNING_XACT > record leads to updating the restart_lsn of the slot, but it cannot be moved > forward > because ReorderBufferGetOldestTXN() returns the prepared transaction (See > SnapBuildProcessRunningXacts()). restart_decoding_lsn of each transaction, > which > is a candidate for the restart_lsn of the slot, is always behind the startpoint of > its txn. > I see that in 0003/0004, the patch first aborts pending prepared transactions, updates the catalog, and then changes the slot's property via walrcv_alter_slot. What if there is any ERROR (say the remote node is not reachable or there is an error while updating the catalog) after we abort the pending prepared transaction? Won't we end up with lost prepared transactions in such a case? A few other comments: = The code to handle SUBOPT_TWOPHASE_COMMIT should be after failover option handling for the sake of code symmetry. Also, the checks should be in the same order: first for slot_name, then enabled, then for PreventInTransactionBlock(); after those, we can have other checks for two_phase. If possible, we can move common checks in both failover and two_phase options into a common function. What should be the behavior if one tries to set slot_name to NONE and also tries to toggle the two_phase option? I feel both options together don't make sense because there is no use in changing two_phase for some slot which we are disassociating the subscription from. The same could be said for the failover option as well, so if we agree with some different behavior here, we can follow the same for failover option as well. -- With Regards, Amit Kapila.
Re: ssl tests fail due to TCP port conflict
Hello, 07.06.2024 17:25, Tom Lane wrote: Andrew Dunstan writes: I still think my patch to force TCP mode for the SSL test makes sense as well. +1 to both things. If that doesn't get the failure rate down to an acceptable level, we can look at the retry idea. I'd like to add that the kerberos/001_auth test also suffers from the port conflict, but slightly differently. Look for example at [1]: krb5kdc.log contains: Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](info): setting up network... Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Cannot bind server socket on 127.0.0.1.55853 Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Failed setting up a UDP socket (for 127.0.0.1.55853) Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Error setting up network As far as I can see, the port for kdc is chosen by PostgreSQL::Test::Kerberos, via PostgreSQL::Test::Cluster::get_free_port(), which checks only for TCP port availability (with can_bind()), but not for UDP, so this increases the probability of the conflict for this test (a similar failure: [2]). Although we can also find a failure with TCP: [3] (It's not clear to me, what processes can use UDP ports while testing, but maybe those buildfarm animals are running on the same logical machine simultaneously?) [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-02%2009%3A27%3A15 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-05-15%2001%3A25%3A07 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-07-04%2008%3A28%3A19 Best regards, Alexander
Re: Pgoutput not capturing the generated columns
On Mon, 8 Jul 2024 at 13:20, Peter Smith wrote: > > Hi Shlok, Here are some review comments for patch v15-0003. > > == > src/backend/catalog/pg_publication.c > > 1. publication_translate_columns > > The function comment says: > * Translate a list of column names to an array of attribute numbers > * and a Bitmapset with them; verify that each attribute is appropriate > * to have in a publication column list (no system or generated attributes, > * no duplicates). Additional checks with replica identity are done later; > * see pub_collist_contains_invalid_column. > > That part about "[no] generated attributes" seems to have gone stale > -- e.g. not quite correct anymore. Should it say no VIRTUAL generated > attributes? Yes, we should use VIRTUAL generated attributes, I have modified it. > == > src/backend/replication/logical/proto.c > > 2. logicalrep_write_tuple and logicalrep_write_attrs > > I thought all the code fragments like this: > > + if (att->attgenerated && att->attgenerated != ATTRIBUTE_GENERATED_STORED) > + continue; > + > > don't need to be in the code anymore, because of the BitMapSet (BMS) > processing done to make the "column list" for publication where > disallowed generated cols should already be excluded from the BMS, > right? > > So shouldn't all these be detected by the following statement: > if (!column_in_column_list(att->attnum, columns)) > continue; The current BMS logic do not handle the Virtual Generated Columns. There can be cases where we do not want a virtual generated column but it would be present in BMS. To address this I have added the above logic. I have added this logic similar to the checks of 'attr->attisdropped'. > == > src/backend/replication/pgoutput/pgoutput.c > > 4. send_relation_and_attrs > > (this is a similar comment for #2 above) > > IIUC of the advantages of the BitMapSet (BMS) idea in patch 0001 to > process the generated columns up-front means there is no need to check > them again in code like this. > > They should be discovered anyway in the subsequent check: > /* Skip this attribute if it's not present in the column list */ > if (columns != NULL && !bms_is_member(att->attnum, columns)) > continue; Same explanation as above. I have addressed all the comments in v16-0003 patch. Please refer [1]. [1]: https://www.postgresql.org/message-id/CANhcyEXw%3DBFFVUqohWES9EPkdq-ZMC5QRBVQqQPzrO%3DQ7uzFQw%40mail.gmail.com Thanks and Regards, Shlok Kyal
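For readers following along, the combined per-attribute filtering being discussed looks roughly like this; it is a sketch assembled from the fragments above, with desc and columns standing in for the relation's tuple descriptor and the publication column list:

for (int i = 0; i < desc->natts; i++)
{
	Form_pg_attribute att = TupleDescAttr(desc, i);

	if (att->attisdropped)
		continue;

	/* virtual generated columns are never published */
	if (att->attgenerated && att->attgenerated != ATTRIBUTE_GENERATED_STORED)
		continue;

	/* stored generated and ordinary columns still obey the column list */
	if (columns != NULL && !bms_is_member(att->attnum, columns))
		continue;

	/* ... send this attribute ... */
}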
RE: Partial aggregates pushdown
Hi Jelte, Thank you for comments and advises. > From: Jelte Fennema-Nio > Sent: Monday, July 8, 2024 5:31 PM > On Sun, 7 Jul 2024 at 23:46, fujii.y...@df.mitsubishielectric.co.jp > wrote: > > In my mind, Approach1 is superior. Therefore, if there are no objections > > this week, I plan to resume implementing > Approach1 next week. I would appreciate it if anyone could discuss the topic > with me or ask questions. > > Honestly, the more I think about this, the more I like Approach2. Not because > I disagree with you about some of the > limitations of Approach2, but because I'd rather see those limitations fixed > in CREATE TYPE, instead of working around > these limitations in CREATE AGGREGATE. That way more usages can benefit. > Detailed explanation and arguments below. Firstly, I may have jumped to conclusions too quickly. I apologize that. I would appreciate it if we clarify Approach 1 and Approach 2 more precisely and we can proceed with the discussion. Before we get into the details, let me break down the main differences between Approach 1 and Approach 2. The best thing about Approach2 is that it lets us send state values using the existing data type system. I'm worried that if we choose Approach2, we might face some limits because we have to create new types. But, we might be able to fix these limits if we look into it more. Approach1 doesn't make new types, so we can avoid these limits. But, it means we have to make export/import functions that are similar to the typsend/typreceive functions. So, we need to make sure if we really need this method. Is this the right understanding? > From: Jelte Fennema-Nio > Sent: Monday, July 8, 2024 5:59 PM > On Sun, 30 Jun 2024 at 23:42, fujii.y...@df.mitsubishielectric.co.jp > wrote: > > Instead, I can make PARTIAL_AGGREGATE an unreserved word by placing it > > after the FILTER clause, like avg(c1) FILTER > (WHERE c2 > 0) PARTIAL_AGGREGATE, and by marking it as an ASLABEL word like > FILTER. > > I attached the patch of the method. > > If there are no objections, I would like to proceed with the method > > described above. > > I'd appreciate it if anyone comment the method. > > I like this approach of using PARTIAL_AGGREGATE in the same place as the > FILTER clause. Thank you for comment. > On Sun, 7 Jul 2024 at 23:52, fujii.y...@df.mitsubishielectric.co.jp > wrote: > > My plan for advancing the patch involves the following steps: > > Step1. Decide the approach on transmitting state value. > > Step2. Implement code (including comments) and tests to support a subset > > of aggregate functions. > > Specifically, I plan to support avg, sum, and other aggregate functions > > like min and max which don't need > export/import functions. > > Step3. Add documentations. > > > > To clarify my approach, should I proceed with Step 3 before Step2? > > (my opinion, Bruce might have a different one) > > I think it's good that you split the original patch in two: > 0001: non-internal partial aggregates > 0002: internal partial aggregates > > I think we're aligned on the general design of 0001. So I think now is > definitely the time to include documentation there, so > we can discuss this patch in more detail, and move it forward. > > I think generally for 0002 it would also be useful to have documentation, I > personally like reading it to understand the > general design and then comparing that to the code. But I also understand > that the language differences between Japanese > and English, makes writing such docs a significant effort for you. 
> So I think > it would be fine to skip docs for 0002 for now > until we decide on the approach we want to take for internal partial > aggregates. At least for 0001, it seems like it would be a good idea to attach a document at this stage. Best regards, Yuki Fujii -- Yuki Fujii Information Technology R&D Center, Mitsubishi Electric Corporation
Re: Improving the latch handling between logical replication launcher and worker processes.
On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas wrote: > > On 05/07/2024 14:07, vignesh C wrote: > > On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas wrote: > >> > >> I'm don't quite understand the problem we're trying to fix: > >> > >>> Currently the launcher's latch is used for the following: a) worker > >>> process attach b) worker process exit and c) subscription creation. > >>> Since this same latch is used for multiple cases, the launcher process > >>> is not able to handle concurrent scenarios like: a) Launcher started a > >>> new apply worker and waiting for apply worker to attach and b) create > >>> subscription sub2 sending launcher wake up signal. In this scenario, > >>> both of them will set latch of the launcher process, the launcher > >>> process is not able to identify that both operations have occurred 1) > >>> worker is attached 2) subscription is created and apply worker should > >>> be started. As a result the apply worker does not get started for the > >>> new subscription created immediately and gets started after the > >>> timeout of 180 seconds. > >> > >> I don't see how we could miss a notification. Yes, both events will set > >> the same latch. Why is that a problem? > > > > The launcher process waits for the apply worker to attach at > > WaitForReplicationWorkerAttach function. The launcher's latch is > > getting set concurrently for another create subscription and apply > > worker attached. The launcher now detects the latch is set while > > waiting at WaitForReplicationWorkerAttach, it resets the latch and > > proceed to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE(as > > the latch has already been reset). Further details are provided below. > > > > The loop will see that the new > >> worker has attached, and that the new subscription has been created, and > >> process both events. Right? > > > > Since the latch is reset at WaitForReplicationWorkerAttach, it skips > > processing the create subscription event. > > > > Slightly detailing further: > > In the scenario when we execute two concurrent create subscription > > commands, first CREATE SUBSCRIPTION sub1, followed immediately by > > CREATE SUBSCRIPTION sub2. > > In a few random scenarios, the following events may unfold: > > After the first create subscription command(sub1), the backend will > > set the launcher's latch because of ApplyLauncherWakeupAtCommit. > > Subsequently, the launcher process will reset the latch and identify > > the addition of a new subscription in the pg_subscription list. The > > launcher process will proceed to request the postmaster to start the > > apply worker background process (sub1) and request the postmaster to > > notify the launcher once the apply worker(sub1) has been started. > > Launcher will then wait for the apply worker(sub1) to attach in the > > WaitForReplicationWorkerAttach function. > > Meanwhile, the second CREATE SUBSCRIPTION command (sub2) which was > > executed concurrently, also set the launcher's latch(because of > > ApplyLauncherWakeupAtCommit). > > Simultaneously when the launcher remains in the > > WaitForReplicationWorkerAttach function waiting for apply worker of > > sub1 to start, the postmaster will start the apply worker for > > subscription sub1 and send a SIGUSR1 signal to the launcher process > > via ReportBackgroundWorkerPID. Upon receiving this signal, the > > launcher process will then set its latch. 
> > > > Now, the launcher's latch has been set concurrently after the apply > > worker for sub1 is started and the execution of the CREATE > > SUBSCRIPTION sub2 command. > > > > At this juncture, the launcher, which had been awaiting the attachment > > of the apply worker, detects that the latch is set and proceeds to > > reset it. The reset operation of the latch occurs within the following > > section of code in WaitForReplicationWorkerAttach: > > ... > > rc = WaitLatch(MyLatch, > > WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > > 10L, WAIT_EVENT_BGWORKER_STARTUP); > > > > if (rc & WL_LATCH_SET) > > { > > ResetLatch(MyLatch); > > CHECK_FOR_INTERRUPTS(); > > } > > ... > > > > After resetting the latch here, the activation signal intended to > > start the apply worker for subscription sub2 is no longer present. The > > launcher will return to the ApplyLauncherMain function, where it will > > await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before > > processing the create subscription request (i.e., creating a new apply > > worker for sub2). > > The issue arises from the latch being reset in > > WaitForReplicationWorkerAttach, which can occasionally delay the > > synchronization of table data for the subscription. > > Ok, I see it now. Thanks for the explanation. So the problem isn't using > the same latch for different purposes per se. It's that we're trying to > use it in a nested fashion, resetting it in the inner loop. > > Looking at the proposed patch more closely: > > > @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(Logi
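To illustrate the general shape of a fix for this kind of race (a sketch only, not the attached patch; check_worker_attached() is a stand-in for the real attach test): a nested wait that has to reset the shared latch can remember that it consumed a wakeup and re-arm the latch before returning, so the outer loop still notices the concurrent CREATE SUBSCRIPTION without waiting for DEFAULT_NAPTIME_PER_CYCLE.

static void
wait_for_attach_sketch(void)
{
	bool		latch_was_set = false;

	while (!check_worker_attached())
	{
		int			rc;

		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					   10L, WAIT_EVENT_BGWORKER_STARTUP);

		if (rc & WL_LATCH_SET)
		{
			ResetLatch(MyLatch);
			latch_was_set = true;	/* remember the consumed wakeup */
			CHECK_FOR_INTERRUPTS();
		}
	}

	/* hand the wakeup back to the launcher's main loop */
	if (latch_was_set)
		SetLatch(MyLatch);
}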
Re: Partial aggregates pushdown
On Mon, 8 Jul 2024 at 14:12, fujii.y...@df.mitsubishielectric.co.jp wrote: > The best thing about Approach2 is that it lets us send state values using the > existing data type system. > I'm worried that if we choose Approach2, we might face some limits because we > have to create new types. > But, we might be able to fix these limits if we look into it more. > > Approach1 doesn't make new types, so we can avoid these limits. > But, it means we have to make export/import functions that are similar to the > typsend/typreceive functions. > So, we need to make sure if we really need this method. > > Is this the right understanding? Yeah, correct. To clarify my reasoning a bit more: IMHO, the main downside of implementing Approach1 is that we then end up with two different mechanisms to "take data from memory and serialize it in a way in which it can be sent over the network". I'd very much prefer if we could have a single system responsible for that task. So if there are issues with the current system (e.g. having to implement typinput/typoutput), then I'd rather address these problems in the existing system. Only if that turns out to be impossible for some reason would I prefer Approach1. Personally, even if Approach2 requires a bit more code, I'd still prefer a single serialization system over having two serialization systems.
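To make the trade-off a bit more concrete: what Approach2 amounts to is giving the transition state a real type with its own binary I/O, so the existing wire-format machinery can carry it between nodes. A minimal sketch for something like avg(float8) follows; the AvgState struct and the function names are invented for illustration, while the pq_* routines are the existing pqformat ones.

typedef struct AvgState
{
	int64		count;
	float8		sum;
} AvgState;

Datum
avg_state_send(PG_FUNCTION_ARGS)
{
	AvgState   *state = (AvgState *) PG_GETARG_POINTER(0);
	StringInfoData buf;

	pq_begintypsend(&buf);
	pq_sendint64(&buf, state->count);
	pq_sendfloat8(&buf, state->sum);
	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}

Datum
avg_state_recv(PG_FUNCTION_ARGS)
{
	StringInfo	buf = (StringInfo) PG_GETARG_POINTER(0);
	AvgState   *state = (AvgState *) palloc(sizeof(AvgState));

	state->count = pq_getmsgint64(buf);
	state->sum = pq_getmsgfloat8(buf);
	PG_RETURN_POINTER(state);
}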
Re: Improving the latch handling between logical replication launcher and worker processes.
On Mon, Jul 8, 2024 at 5:47 PM vignesh C wrote: > > On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas wrote: > > > > > As an alternative, smaller fix, I think we could do the attached. It > > forces the launcher's main loop to do another iteration, if it calls > > logicalrep_worker_launch(). That extra iteration should pick up any > > missed notifications. > > This also works. > The minor drawback would be that in many cases the extra iteration would not lead to anything meaningful. -- With Regards, Amit Kapila.
Re: Wrong security context for deferred triggers?
On 7/8/24 04:14, Joseph Koshakow wrote: Given the above and the fact that the patch is a breaking change, my vote would still be to keep the current behavior and update the documentation. Though I'd be happy to be overruled by someone with more knowledge of triggers. Thanks for that feedback. Based on that, the patch should be rejected. Since there were a couple of other opinions early in the thread, I'll let it sit like that for now, and judgement can be passed at the end of the commitfest. Perhaps somebody else wants to chime in. Yours, Laurenz Albe
Re: Is it expected behavior index only scan shows "OPERATOR(pg_catalog." for EXPLAIN?
Tomas Vondra writes: > I honestly don't know if this is the correct solution. It seems to me > handling this at the EXPLAIN level might just mask the issue - it's not > clear to me why adding "indexqualorig" would remove the ambiguity (if > there's one). Perhaps it might be better to find why the ruleutils.c > code thinks the OPERATOR() is necessary, and then improve/fix that? As Masahiro-san already said, the root cause is that the planner removes the RelabelType that is on the indexed variable. So ruleutils sees that the operator is not the one that would be chosen by the parser given those input expressions (cf generate_operator_name) and it concludes it'd better schema-qualify the operator. While that doesn't really make any difference in this particular case, it is the right thing in typical rule-deparsing cases. I don't think this output is really wrong, and I'm not in favor of adding nontrivial overhead to make it better, so I don't like the proposed patch. Maybe generate_operator_name could use some other heuristic, but I'm unsure what. The case it wants to cover is that the operator is in the search path but is masked by some operator in an earlier schema, so simply testing if the operator's schema is in the path at all would be insufficient. regards, tom lane
Re: MergeJoin beats HashJoin in the case of multiple hash clauses
On 3/11/2023 23:43, Tomas Vondra wrote: On 9/11/23 10:04, Lepikhov Andrei wrote: * Determine bucketsize fraction and MCV frequency for the inner * relation. We use the smallest bucketsize or MCV frequency estimated * for any individual hashclause; this is undoubtedly conservative. I'm sure this may lead to inflated cost for "good" cases (where the actual bucket size really is a product), which may push the optimizer to use the less efficient/slower join method. Yes, It was contradictory idea, though. IMHO the only principled way forward is to get a better ndistinct estimate (which this implicitly does), perhaps by using extended statistics. I haven't tried, but I guess it'd need to extract the clauses for the inner side, and call estimate_num_groups() on it. And I've done it. Sorry for so long response. This patch employs of extended statistics for estimation of the HashJoin bucket_size. In addition, I describe the idea in more convenient form here [1]. Obviously, it needs the only ndistinct to make a prediction that allows to reduce computational cost of this statistic. [1] https://open.substack.com/pub/danolivo/p/why-postgresql-prefers-mergejoin?r=34q1yy&utm_campaign=post&utm_medium=web -- regards, Andrei Lepikhov From b4b007970b1d9b99602b8422f2122c8e5738828e Mon Sep 17 00:00:00 2001 From: "Andrei V. Lepikhov" Date: Mon, 13 May 2024 16:15:02 +0700 Subject: [PATCH] Use extended statistics for precise estimation of bucket size in hash join. Recognizing the real-life complexity where columns in the table often have functional dependencies, PostgreSQL's estimation of the number of distinct values over a set of columns can be underestimated (or much rarely, overestimated) when dealing with multi-clause JOIN. In the case of hash join, it can end up with a small number of predicted hash buckets and, as a result, picking non-optimal merge join. To improve the situation, we introduce one additional stage of bucket size estimation - having two or more join clauses estimator lookup for extended statistics and use it for multicolumn estimation. Clauses are grouped into lists, each containing expressions referencing the same relation. The result of the multicolumn estimation made over such a list is combined with others according to the caller's logic. Clauses that are not estimated are returned to the caller for further estimation. --- src/backend/optimizer/path/costsize.c | 12 +- src/backend/utils/adt/selfuncs.c | 171 ++ src/include/utils/selfuncs.h | 4 + 3 files changed, 186 insertions(+), 1 deletion(-) diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index ee23ed7835..eafde90d4c 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -4250,9 +4250,19 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path, } else { + List *otherclauses; + innerbucketsize = 1.0; innermcvfreq = 1.0; - foreach(hcl, hashclauses) + + /* At first, try to estimate bucket size using extended statistics. 
*/ + otherclauses = estimate_multivariate_bucketsize(root, + inner_path->parent, + hashclauses, + &innerbucketsize); + + /* Pass through the remaining clauses */ + foreach(hcl, otherclauses) { RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl); Selectivity thisbucketsize; diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 5f5d7959d8..89ed2d136a 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -3751,6 +3751,177 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows, return numdistinct; } +/* + * Try to estimate the bucket size of the hash join inner side when the join + * condition contains two or more clauses by employing extended statistics. + * + * The main idea of this approach is that the distinct value generated by + * multivariate estimation on two or more columns would provide less bucket size + * than estimation on one separate column. + * + * IMPORTANT: It is crucial to synchronise the approach of combining different + * estimations with the caller's method. + * Return a list of clauses that didn't fetch any extended statistics. + */ +List * +estimate_multivariate_bucketsize(PlannerInfo *root, RelOptInfo *inner, +List *hashclauses, +
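As a made-up example of the estimation gap this targets: for a two-clause join condition on independent inner columns a and b with ndistinct(a) = ndistinct(b) = 1000, the existing "smallest per-clause bucketsize" rule yields innerbucketsize = min(1/1000, 1/1000) = 1/1000, while the real number of groups is about 1000 * 1000 = 1,000,000, i.e. a true bucket fraction near 1/1,000,000. A multivariate ndistinct statistic on (a, b) lets estimate_num_groups() report that group count directly, so the hash join is no longer costed with a bucket size inflated by three orders of magnitude.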
Re: Doc Rework: Section 9.16.13 SQL/JSON Query Functions
Thanks for the readthrough. On Sat, Jul 6, 2024 at 11:56 AM jian he wrote: > json_exists > "Returns true if the SQL/JSON path_expression applied to the > context_item using the PASSING values yields any items." > now you changed to > << > Returns true if the SQL/JSON path_expression applied to the > context_item. path_expression can reference variables named in the > PASSING clause. > << > Is it wrong? Yes, I mistakenly dropped "doesn't yield any items." > For both ON EMPTYand ON ERROR, > specifying ERROR will cause an error to be > thrown with > the appropriate message. Other options include returning a SQL NULL, > an > need one more whitespace. should be: > For both ON EMPTY and ON ERROR, Fixed. > Note that an error will be thrown if multiple values are returned and > WITHOUT WRAPPER is specified. > since not specify error on error, then no error will be thrown. maybe > rephrase to > It will be evaulated as error if multiple values are returned and > WITHOUT WRAPPER is specified. Done. > > For both ON EMPTY and ON ERROR, > specifying ERROR will cause an error to be > thrown with > the appropriate message. Other options include returning a SQL NULL, > an > empty array or object (array by default), or a user-specified > expression > that can be coerced to jsonb or the type specified in > RETURNING. > The default when ON EMPTY or ON > ERROR > is not specified is to return a SQL NULL value when the respective > situation occurs. > > in here, "empty array or object (array by default)", > I don't think people can understand the meaning of "(array by default)" . > > "or a user-specified expression" > maybe we can change to > "or a user-specified > DEFAULT expression" > I think "user-specified expression" didn't have much link with > "DEFAULT expression" Yes, specifying each option's syntax in parenthesis makes sense. > path_expression can reference variables named > in the PASSING clause. > do we need "The path_expression"? Fixed. > also maybe we can add > + In PASSING clause, varname is > +the variables name, value is the > + variables' value. Instead of expanding the description of the PASSING clause in each function's description, I've moved its description to the top paragraph with slightly different text. > we can add a PASSING clause example from sqljson_queryfuncs.sql , > since all three functions can use it. Done. > JSON_VALUE: > < of ON ERROR below). > then > << The ON ERROR and ON EMPTY clauses have similar semantics as > mentioned in the description of JSON_QUERY, except the set of values > returned in lieu of throwing an error is different. > > you first refer "below", then director to JSON_QUERY on error, on > empty description. > is the correct usage of "below"? > "(though see the discussion of ON ERROR below)." > i am not sure the meaning of "though" even watched this > https://www.youtube.com/watch?v=r-LphuCKQ0Q I've replaced the sentence with "(though see the discussion of ON ERROR below)" with this: "getting multiple values will be treated as an error." No need to reference the ON ERROR clause with that wording. > + context_item can be any character string that > + can be succesfully cast to jsonb. > > typo: "succesfully", should be "successfully" Fixed. > maybe rephrase it to: > + context_item can be jsonb type or any > character string that > + can be successfully cast to jsonb. Done. > +ON EMPTY expression (that is caused by empty > result > +of path_expressionevaluation). > need extra white space, should be > +of path_expression evaluation). Fixed. 
> +The default when ON EMPTY or ON > ERROR > +is not specified is to return a SQL NULL value when the respective > +situation occurs. > Correct me if I'm wrong. > we can just say: > +The default when ON EMPTY or ON > ERROR > +is not specified is to return an SQL NULL value. Agreed. > another tiny issue. > > -select json_query(jsonb '{"a": "[1, 2]"}', 'lax $.a' > OMIT QUOTES); > +JSON_QUERY(jsonb '{"a": "[1, 2]"}', 'lax $.a' OMIT > QUOTES); > [1, 2] > > > -select json_query(jsonb '{"a": "[1, 2]"}', 'lax $.a' > RETURNING int[] OMIT QUOTES ERROR ON ERROR); > +JSON_QUERY(jsonb '{"a": "[1, 2]"}', 'lax $.a' > RETURNING int[] OMIT QUOTES ERROR ON ERROR); > > These two example queries don't need semicolons at the end? Fixed. Also, I've removed the following sentence in the description of JSON_EXISTS, because a) it seems out of place - -Note that if the path_expression is -strict and ON ERROR behavior is -ERROR, an error is generated if it yields no items. - and b) does not seem correct: SELECT JSON_EXISTS(jsonb '{"key1": [1,2,3]}', 'strict $.key1[*] ? (@ > 3)' ERROR ON ERROR); json_ex
Re: a potential typo in comments of pg_parse_json
On Mon, Jul 8, 2024 at 5:25 PM Junwang Zhao wrote: > Not 100% sure, sorry if this doesn't make sense. > > --- a/src/common/jsonapi.c > +++ b/src/common/jsonapi.c > @@ -514,7 +514,7 @@ freeJsonLexContext(JsonLexContext *lex) > * > * If FORCE_JSON_PSTACK is defined then the routine will call the > non-recursive > * JSON parser. This is a useful way to validate that it's doing the right > - * think at least for non-incremental cases. If this is on we expect to see > + * thing at least for non-incremental cases. If this is on we expect to see > * regression diffs relating to error messages about stack depth, but no > * other differences. > */ Good catch. Fixed. -- Thanks, Amit Langote
Re: Injection point locking
On 25/06/2024 05:25, Noah Misch wrote: On Mon, Jun 24, 2024 at 11:03:09AM -0400, Tom Lane wrote: Heikki Linnakangas writes: ... I can't do that, because InjectionPointRun() requires a PGPROC entry, because it uses an LWLock. That also makes it impossible to use injection points in the postmaster. Any chance we could allow injection points to be triggered without a PGPROC entry? Could we use a simple spinlock instead? That sounds fine to me. If calling hash_search() with a spinlock feels too awful, a list to linear-search could work. With a fast path for the case that no injection points are attached or something? Even taking a spinlock in the postmaster is contrary to project policy. Maybe we could look the other way for debug-only code, but it seems like a pretty horrible precedent. If you're actually using an injection point in the postmaster, that would be the least of concerns. It is something of a concern for running an injection point build while not attaching any injection point. One solution could be a GUC to control whether the postmaster participates in injection points. Another could be to make the data structure readable with atomics only. I came up with the attached. It replaces the shmem hash table with an array that's scanned linearly. On each slot in the array, there's a generation number that indicates whether the slot is in use, and allows detecting concurrent modifications without locks. The attach/detach operations still hold the LWLock, but InjectionPointRun() is now lock-free, so it can be used without a PGPROC entry. It's now usable from postmaster too. However, it's theoretically possible that if shared memory is overwritten with garbage, the garbage looks like a valid injection point with a name that matches one of the injection points that postmaster looks at. That seems unlikely enough that I think we can accept the risk. To close that gap 100% I think a GUC is the only solution. Note that until we actually add an injection point to a function that runs in the postmaster, there's no risk. If we're uneasy about that, we could add an assertion to InjectionPointRun() to prevent it from running in the postmaster, so that we don't cross that line inadvertently. Thoughts? -- Heikki Linnakangas Neon (https://neon.tech) From ca331988b67937b942db5c55770919306931d1ac Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Mon, 8 Jul 2024 16:09:46 +0300 Subject: [PATCH 1/1] Use atomics to avoid locking InjectionPointRun() --- src/backend/utils/misc/injection_point.c | 359 +++ src/tools/pgindent/typedefs.list | 1 + 2 files changed, 243 insertions(+), 117 deletions(-) diff --git a/src/backend/utils/misc/injection_point.c b/src/backend/utils/misc/injection_point.c index 48f29e9b60..c5b2310532 100644 --- a/src/backend/utils/misc/injection_point.c +++ b/src/backend/utils/misc/injection_point.c @@ -21,7 +21,6 @@ #include "fmgr.h" #include "miscadmin.h" -#include "port/pg_bitutils.h" #include "storage/fd.h" #include "storage/lwlock.h" #include "storage/shmem.h" @@ -31,22 +30,35 @@ #ifdef USE_INJECTION_POINTS -/* - * Hash table for storing injection points. - * - * InjectionPointHash is used to find an injection point by name. 
- */ -static HTAB *InjectionPointHash; /* find points from names */ - /* Field sizes */ #define INJ_NAME_MAXLEN 64 #define INJ_LIB_MAXLEN 128 #define INJ_FUNC_MAXLEN 128 #define INJ_PRIVATE_MAXLEN 1024 -/* Single injection point stored in InjectionPointHash */ +/* Single injection point stored in shared memory */ typedef struct InjectionPointEntry { + /* + * Because injection points need to be usable without LWLocks, we use a + * generation counter on each entry to to allow safe, lock-free reading. + * + * To read an entry, first read the current 'generation' value. If it's + * even, then the slot is currently unused, and odd means it's in use. + * When reading the other fields, beware that they may change while + * reading them, if the entry is released and reused! After reading the + * other fields, read 'generation' again: if its value hasn't changed, you + * can be certain that the other fields you read are valid. Otherwise, + * the slot was concurrently recycled, and you should ignore it. + * + * When adding an entry, you must store all the other fields first, and + * then update the generation number, with an appropriate memory barrier + * in between. In addition to that protocol, you must also hold + * InjectionPointLock, to protect two backends modifying the array at the + * same time. + */ + pg_atomic_uint64 generation; + char name[INJ_NAME_MAXLEN]; /* hash key */ char library[INJ_LIB_MAXLEN]; /* library */ char function[INJ_FUNC_MAXLEN]; /* function */ @@ -58,8 +70,22 @@ typedef struct InjectionPointEntry char private_data[INJ_PRIVATE_MAXLEN]; } InjectionPointEntry; -#define INJECTION_POINT_HASH_INIT_SIZE 16 -#define INJECTION_POINT_
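To spell out the reader side of the generation protocol described in that comment (a sketch only; the helper name and exact structure differ from the patch):

static bool
injection_point_read_entry(InjectionPointEntry *entry, InjectionPointEntry *copy)
{
	uint64		gen_before;
	uint64		gen_after;

	gen_before = pg_atomic_read_u64(&entry->generation);
	if ((gen_before & 1) == 0)
		return false;			/* even generation: slot currently unused */

	pg_read_barrier();

	/* copy out the fields we need (private_data omitted for brevity) */
	memcpy(copy->name, entry->name, INJ_NAME_MAXLEN);
	memcpy(copy->library, entry->library, INJ_LIB_MAXLEN);
	memcpy(copy->function, entry->function, INJ_FUNC_MAXLEN);

	pg_read_barrier();
	gen_after = pg_atomic_read_u64(&entry->generation);

	/* if the slot was recycled while we were reading, discard the copy */
	return gen_before == gen_after;
}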
Re: pg_wal_summary_contents() and pg_walsummary may return different results on the same WAL summary file
On Thu, Jul 4, 2024 at 6:16 AM Fujii Masao wrote: > Yes, so I updated the commit message. I borrowed your description and used it > in the message. Attached is the revised version of the patch. > > If there are no objections, I will commit and backpatch it. +1. Maybe change "Fix bugs in pg_wal_summary_contents()" to "Fix limit block handling in pg_wal_summary_contents()". -- Robert Haas EDB: http://www.enterprisedb.com
Re: Pluggable cumulative statistics
Hi, On Mon, Jul 08, 2024 at 07:22:32AM +, Bertrand Drouvot wrote: > Except the above (which is just a Nit), 0001 LGTM. > Looking at 0002: It looks pretty straightforward, just one comment: + ptr = ((char *) ctl) + kind_info->shared_ctl_off; + kind_info->init_shmem_cb((void *) ptr); I don't think we need to cast ptr to void when calling init_shmem_cb(). Looking at some examples in the code, it does not look like we cast the argument to void when a function has (void *) as parameter (also there is examples in 0003 where it's not done, see next comments for 0003). So I think removing the cast here would be more consistent. Looking at 0003: It looks pretty straightforward. Also for example, here: + fputc(PGSTAT_FILE_ENTRY_FIXED, fpout); + write_chunk_s(fpout, &kind); write_chunk(fpout, ptr, info->shared_data_len); ptr is not casted to void when calling write_chunk() while its second parameter is a "void *". + ptr = ((char *) shmem) + info->shared_ctl_off + + info->shared_data_off; + + if (!read_chunk(fpin, ptr, Same here for read_chunk(). I think that's perfectly fine and that we should do the same in 0002 when calling init_shmem_cb() for consistency. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
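For what it's worth, the style point is simply that the explicit cast adds nothing once the parameter is already void *, e.g.:

	char	   *ptr = ((char *) ctl) + kind_info->shared_ctl_off;

	kind_info->init_shmem_cb(ptr);			/* implicit conversion to void * */
	kind_info->init_shmem_cb((void *) ptr); /* equivalent, just noisier */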
Re: Optimize commit performance with a large number of 'on commit delete rows' temp tables
Hi wenhui, I carefully analyzed the reason for the performance regression with fewer temporary tables in the previous patch (v1-0002-): the k_hash_funcs determined by the bloom_create function were 10(MAX_HASH_FUNCS), which led to an excessive calculation overhead for the bloom filter. Based on the calculation formula for the bloom filter, when the number of items is 100 and k_hash_funcs is 2, the false positive rate for a 1KB bloom filter is 0.0006096; when the number of items is 1000, the false positive rate is 0.048929094. Therefore, k_hash_funcs of 2 can already achieve a decent false positive rate, while effectively reducing the computational overhead of the bloom filter. I have re-implemented a bloom_create_v2 function to create a bloom filter with a specified number of hash functions and specified memory size. From the test data below, it can be seen that the new patch in the attachment (v1-0003-) does not lead to performance regression in any scenario. Furthermore, the default threshold value can be lowered to 2. Here is the TPS performance data for different numbers of temporary tables under different thresholds, as compared with the head (98347b5a). The testing tool used is pgbench, with the workload being to insert into one temporary table (when the number of temporary tables is 0, the workload is SELECT 1): |tablenum |0 |1 |2 |5 |10 |100 |1000 | |--|||---|---|---|---|---| |head(98347b5a)|39912.722209|10064.306268|9183.871298|7452.071689|5641.487369|1073.203851|114.530958 | |threshold-2 |40097.047974|10009.598155|9982.172866|9955.651235|.338901|9785.626296|8278.828828| Here is the TPS performance data for different numbers of temporary tables at a threshold of 2, compared with the head (commit 98347b5a). The testing tool is pgbench, with the workload being to insert into all temporary tables: |table num |1 |2 | 5 |10 |100 |1000 | |--|---|---|---|---|--|-| |head(98347b5a)|7243.945042|5734.545012|3627.290594|2262.594766|297.856756|27.745808| |threshold-2 |7289.171381|5740.849676|3626.135510|2207.439931|293.145036|27.020953| I have previously researched the implementation of the Global Temp Table (GTT) you mentioned, and it have been used in Alibaba Cloud's PolarDB (Link [1]). GTT can prevent truncation operations on temporary tables that have not been accessed by the current session (those not in the OnCommitItem List), but GTT that have been accessed by the current session still need to be truncated at commit time.Therefore, GTT also require the optimizations mentioned in the above patch. [1] https://www.alibabacloud.com/help/en/polardb/polardb-for-oracle/using-global-temporary-tables?spm=a3c0i.23458820.2359477120.1.66e16e9bUpV7cK Best Regards, Fei Changhong v1-0003-Optimize-commit-with-temp-tables-guc.patch Description: Binary data v1-0003-Optimize-commit-with-temp-tables-without-guc.patch Description: Binary data
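For readers who want to check the false-positive figures quoted above, they follow from the standard Bloom filter approximation p = (1 - e^(-k*n/m))^k; the quoted numbers match a 1KB filter counted as m = 8000 bits. The small, self-contained calculation below is illustrative only (bloom_fpr is just a helper name, not part of the patch):

    #include <math.h>
    #include <stdio.h>

    static double
    bloom_fpr(double m_bits, double k_hashes, double n_items)
    {
        /* standard approximation of the Bloom filter false-positive rate */
        return pow(1.0 - exp(-k_hashes * n_items / m_bits), k_hashes);
    }

    int
    main(void)
    {
        double      m = 8000.0;     /* 1KB filter, taken as 1000 bytes * 8 bits */

        printf("k=2, n=100:  %.7f\n", bloom_fpr(m, 2, 100));    /* ~0.0006096 */
        printf("k=2, n=1000: %.7f\n", bloom_fpr(m, 2, 1000));   /* ~0.0489291 */
        return 0;
    }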
Re: Injection point locking
Heikki Linnakangas writes: > Note that until we actually add an injection point to a function that > runs in the postmaster, there's no risk. If we're uneasy about that, we > could add an assertion to InjectionPointRun() to prevent it from running > in the postmaster, so that we don't cross that line inadvertently. As long as we consider injection points to be a debug/test feature only, I think it's a net positive that one can be set in the postmaster. I'd be considerably more uncomfortable if somebody wanted to do that in production, but maybe it'd be fine even then. regards, tom lane
Re: 回复:Re: 回复:Re: speed up pg_upgrade with large number of tables
On Mon, Jul 08, 2024 at 03:22:36PM +0800, 杨伯宇(长堂) wrote: > Besides, https://commitfest.postgresql.org/48/4995/ seems insufficient to > this situation. Some time-consuming functions like check_for_data_types_usage > are not yet able to run in parallel. But these patches could be a great > starting point for a more efficient parallelism implementation. Maybe we can > do it later. I actually just wrote up the first version of the patch for parallelizing the data type checks over the weekend. I'll post it shortly. -- nathan
Re: array_in sub function ReadArrayDimensions error message
jian he writes: > while reviewing the json query doc, > I found out the following error message was not quite right. > select '[1,2]'::int[]; > ERROR: malformed array literal: "[1,2]" > LINE 1: select '[1,2]'::int[]; >^ > DETAIL: Missing "]" after array dimensions. > should it be: > "Missing delimiter ":" while specifying array dimensions." That's presuming quite a lot about the nature of the error. All the code knows is that what follows the "1" should be either ":" or "]", and when it sees "," instead it throws this error. I agree the existing message isn't great, but trying to be more specific could confuse people even more if the more-specific message doesn't apply either. One possibility could be

    if (*p != ']')
        ereturn(escontext, false,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("malformed array literal: \"%s\"", origStr),
+                (strchr(p, ']') != NULL) ?
+                errdetail("Array dimensions have invalid syntax.") :
                 errdetail("Missing \"%s\" after array dimensions.", "]")));

that is, only say "Missing "]"" if there's no ']' anywhere, and otherwise just say the dimensions are wrong. This could be fooled by a ']' that's part of some string in the data, but even then the errdetail isn't wrong. Or we could just say "Array dimensions have invalid syntax." unconditionally. regards, tom lane
Re: 回复: An implementation of multi-key sort
On 7/4/24 14:45, Yao Wang wrote: > Hi John, > > Thanks for your kind message. I talked to Heikki before getting Tomas's > response, and he said "no promise but I will take a look". That's why I > added his email. I have updated the CF entry and added Tomas as reviewer. > > Hi Tomas, > > Again, I'd say a big thank to you. The report and script are really, really > helpful. And your ideas are very valuable. > > Firstly, the expectation of mksort performance: > > 1. When mksort works well, it should be faster than qsort because it saves > the cost of comparing duplicated values every time. > 2. When all values are distinct at a particular column, the comparison > will finish immediately, and mksort will actually fall back to qsort. For > the case, mksort should be equal or a bit slower than qsort because it need > to maintain more complex state. > > Generally, the benefit of mksort is mainly from duplicated values and sort > keys: the more duplicated values and sort keys are, the bigger benefit it > gets. > > Analysis on the report in your previous mail > > > 1. It seems the script uses $count to specify the duplicated values: > > number of repetitions for each value (ndistinct = nrows/count) > > However, it is not always correct. For type text, the script generates > values like this: > > expr="md5(((i / $count) + random())::text)" > > But md5() generates totally random values regardless of $count. Some cases > of timestamptz have the same problem. > > For all distinct values, the sort will finish at first depth and fall to > qsort actually. > You're right, thanks for noticing / fixing this. > 2. Even for the types with correct duplicated setting, the duplicated ratio > is very small: e.g. say $nrows = 1 and $count = 100, only 1% duplicated > rows can go to depth 2, and only 0.01% of them can go to depth 3. So it still > works on nearly all distinct values. > True, but that's why the scripts test with much larger data sets too, with more comparisons needing to look at other columns. It's be possible to construct data sets that are likely to benefit more from mksort - I'm not against doing that, but then there's the question of what data sets are more representative of what users actually do. I'd say a random data set like the ones I used are fairly common - it's fine to not improve them, but we should not regress them. > 3. Qsort of PG17 uses kind of specialization for tuple comparator, i.e. it > uses specialized functions for different types, e.g. qsort_tuple_unsigned() > for unsigned int. The specialized comparators avoid all type related checks > and are much faster than regular comparator. That is why we saw 200% or more > regression for the cases. > OK, I'm not familiar with this code enough to have an opinion. > > Code optimizations I did for mk qsort > - > > 1. Adapted specialization for tuple comparator. > 2. Use kind of "hybrid" sort: when we actually adapt bubble sort due to > limited sort items, use bubble sort to check datums since specified depth. > 3. Other other optimizations such as pre-ordered check. > > > Analysis on the new report > -- > > I also did some modifications to your script about the issues of data types, > plus an output about distinct value count/distinct ratio, and an indicator > for improvement/regression. I attached the new script and a report on a > data set with 100,000 rows and 2, 5, 8 columns. > OK, but I think a report for a single data set size is not sufficient to evaluate a patch like this, it can easily miss various caching effects etc. 
The results I shared a couple minutes ago are from 1000 to 10M rows, and it's much more complete view. > 1. Generally, the result match the expectation: "When mksort works well, it > should be faster than qsort; when mksort falls to qsort, it should be equal > or a bit slower than qsort." The challenge is how to know in advance if mksort is likely to work well. > 2. For all values of "sequential" (except text type), mksort is a bit slower > than qsort because no actual sort is performed due to the "pre-ordered" > check. OK > 3. For int and bigint type, mksort became faster and faster when > there were more and more duplicated values and sort keys. Improvement of > the best cases is about 58% (line 333) and 57% (line 711). I find it hard to interpret the text-only report, but I suppose these are essentially the "green" patches in the PDF report I attached to my earlier message. And indeed, there are nice improvements, but only with cases with very many duplicates, and the price for that is 10-20% regressions in the other cases. That does not seem like a great trade off to me. > 4. For timestamptz type, mksort is a bit slower than qsort because the > distinct ratio is always 1 for almost all cases. I think more benefit is > available by increasing the duplicated values. Yeah, this was a b
Re: optimizing pg_upgrade's once-in-each-database steps
As I mentioned elsewhere [0], here's a first attempt at parallelizing the data type checks. I was worried that I might have to refactor Daniel's work in commit 347758b quite significantly, but I was able to avoid that by using a set of generic callbacks and providing each task step an index to the data_types_usage_checks array. [0] https://postgr.es/m/Zov5kHbEyDMuHJI_%40nathan -- nathan >From 0d3353c318ca4dcd582d67ca25c47b9e3cf24921 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 28 Jun 2024 11:02:44 -0500 Subject: [PATCH v3 1/7] introduce framework for parallelizing pg_upgrade tasks --- src/bin/pg_upgrade/Makefile | 1 + src/bin/pg_upgrade/async.c | 323 +++ src/bin/pg_upgrade/meson.build | 1 + src/bin/pg_upgrade/pg_upgrade.h | 16 ++ src/tools/pgindent/typedefs.list | 4 + 5 files changed, 345 insertions(+) create mode 100644 src/bin/pg_upgrade/async.c diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile index bde91e2beb..3bc4f5d740 100644 --- a/src/bin/pg_upgrade/Makefile +++ b/src/bin/pg_upgrade/Makefile @@ -12,6 +12,7 @@ include $(top_builddir)/src/Makefile.global OBJS = \ $(WIN32RES) \ + async.o \ check.o \ controldata.o \ dump.o \ diff --git a/src/bin/pg_upgrade/async.c b/src/bin/pg_upgrade/async.c new file mode 100644 index 00..7df1e7712d --- /dev/null +++ b/src/bin/pg_upgrade/async.c @@ -0,0 +1,323 @@ +/* + * async.c + * + * parallelization via libpq's async APIs + * + * Copyright (c) 2024, PostgreSQL Global Development Group + * src/bin/pg_upgrade/async.c + */ + +#include "postgres_fe.h" + +#include "common/connect.h" +#include "fe_utils/string_utils.h" +#include "pg_upgrade.h" + +static int dbs_complete; +static int dbs_processing; + +typedef struct AsyncTaskCallbacks +{ + AsyncTaskGetQueryCB query_cb; + AsyncTaskProcessCB process_cb; + boolfree_result; + void *arg; +} AsyncTaskCallbacks; + +typedef struct AsyncTask +{ + AsyncTaskCallbacks *cbs; + int num_cb_sets; +} AsyncTask; + +typedef enum +{ + FREE, + CONNECTING, + SETTING_SEARCH_PATH, + RUNNING_QUERY, +} AsyncSlotState; + +typedef struct +{ + AsyncSlotState state; + int db; + int query; + PGconn *conn; +} AsyncSlot; + +AsyncTask * +async_task_create(void) +{ + return pg_malloc0(sizeof(AsyncTask)); +} + +void +async_task_free(AsyncTask *task) +{ + if (task->cbs) + pg_free(task->cbs); + + pg_free(task); +} + +void +async_task_add_step(AsyncTask *task, + AsyncTaskGetQueryCB query_cb, + AsyncTaskProcessCB process_cb, bool free_result, + void *arg) +{ + AsyncTaskCallbacks *new_cbs; + + task->cbs = pg_realloc(task->cbs, + ++task->num_cb_sets * sizeof(AsyncTaskCallbacks)); + + new_cbs = &task->cbs[task->num_cb_sets - 1]; + new_cbs->query_cb = query_cb; + new_cbs->process_cb = process_cb; + new_cbs->free_result = free_result; + new_cbs->arg = arg; +} + +static void +conn_failure(PGconn *conn) +{ + pg_log(PG_REPORT, "%s", PQerrorMessage(conn)); + printf(_("Failure, exiting\n")); + exit(1); +} + +static void +start_conn(const ClusterInfo *cluster, AsyncSlot *slot) +{ + PQExpBufferData conn_opts; + + /* Build connection string with proper quoting */ + initPQExpBuffer(&conn_opts); + appendPQExpBufferStr(&conn_opts, "dbname="); + appendConnStrVal(&conn_opts, cluster->dbarr.dbs[slot->db].db_name); + appendPQExpBufferStr(&conn_opts, " user="); + appendConnStrVal(&conn_opts, os_info.user); + appendPQExpBuffer(&conn_opts, " port=%d", cluster->port); + if (cluster->sockdir) + { + appendPQExpBufferStr(&conn_opts, " host="); + appendConnStrVal(&conn_opts, cluster->sockdir); + } + + slot->conn = 
PQconnectStart(conn_opts.data); + termPQExpBuffer(&conn_opts); + + if (!slot->conn) + conn_failure(slot->conn); +} + +static void +dispatch_query(const ClusterInfo *cluster, AsyncSlot *slot, + const AsyncTask *task) +{ + AsyncTaskCallbacks *cbs = &task->cbs[slot->query]; + AsyncTaskGetQueryCB get_query = cbs->query_cb; + DbInfo *dbinfo = &cluster->dbarr.dbs[slot->db]; + char *query = (*get_query) (dbinfo, cbs->arg); + + if (!PQsendQuery(slot->conn, query)) + conn_failure(slot->conn); + + pg_free(query); +} + +static PGresult * +get_last_result(PGconn *conn) +{ + PGresult *tmp; + PGresult *res = NULL; + + while ((tmp = PQgetResult(conn)) != NULL) + { + PQclear(res);
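As a rough illustration of how the task API introduced above might be used by a caller: the callback names, the per-check state, and the exact signature of the process callback are assumptions made for this sketch (that part of the series is not shown here), and it presumes the pg_upgrade.h declarations from the patch plus libpq-fe.h.

    /* Hypothetical callbacks; the query callback signature follows
     * dispatch_query() above, the process callback signature is a guess. */
    static char *
    data_type_check_query_cb(DbInfo *dbinfo, void *arg)
    {
        /* build and return the per-database SQL to send (placeholder here) */
        return pg_strdup("SELECT 1;");
    }

    static void
    data_type_check_process_cb(DbInfo *dbinfo, PGresult *res, void *arg)
    {
        /* inspect the query result for this database */
    }

    static void
    run_data_type_checks(void)
    {
        AsyncTask  *task = async_task_create();
        int         check_state = 0;    /* whatever per-check state is needed */

        async_task_add_step(task,
                            data_type_check_query_cb,
                            data_type_check_process_cb,
                            true,   /* let the framework free each PGresult */
                            &check_state);

        /* the framework then runs the registered step against every database
         * over a pool of non-blocking libpq connections */

        async_task_free(task);
    }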
Re: 回复: An implementation of multi-key sort
On 7/7/24 08:32, Konstantin Knizhnik wrote: > > On 04/07/2024 3:45 pm, Yao Wang wrote: >> Generally, the benefit of mksort is mainly from duplicated values and >> sort >> keys: the more duplicated values and sort keys are, the bigger benefit it >> gets. > ... >> 1. Use distinct stats info of table to enable mksort >> >> It's kind of heuristics: in optimizer, check >> Form_pg_statistic->stadistinct >> of a table via pg_statistics. Enable mksort only when it is less than a >> threshold. >> >> The hacked code works, which need to modify a couple of interfaces of >> optimizer. In addition, a complete solution should consider types and >> distinct values of all columns, which might be too complex, and the >> benefit >> seems not so big. > > > If mksort really provides advantage only when there are a lot of > duplicates (for prefix keys?) and of small fraction of duplicates there > is even some (small) regression > then IMHO taking in account in planner information about estimated > number of distinct values seems to be really important. What was a > problem with accessing this statistics and why it requires modification > of optimizer interfaces? There is `get_variable_numdistinct` function > which is defined and used only in selfuncs.c > Yeah, I've been wondering about that too. But I'm also a bit unsure if using this known-unreliable statistics (especially with filters and multiple columns) would actually fix the regressions. > Information about values distribution seems to be quite useful for > choosing optimal sort algorithm. Not only for multi-key sort > optimization. For example if we know min.max value of sort key and it is > small, we can use O(N) algorithm for sorting. Also it can help to > estimate when TOP-N search is preferable. > This assumes the information is accurate / reliable, and I'm far from sure about that. > Right now Posgres creates special path for incremental sort. I am not > sure if we also need to be separate path for mk-sort. > But IMHO if we need to change some optimizer interfaces to be able to > take in account statistic and choose preferred sort algorithm at > planning time, then it should be done. > If mksort can increase sort more than two times (for large number of > duplicates), it will be nice to take it in account when choosing optimal > plan. > I did commit the incremental sort patch, and TBH I'm not convinced I'd do that again. It's a great optimization when it works (and it seems to work in plenty of cases), but we've also had a number of reports about significant regressions, where the incremental sort costing is quite off. Granted, it's often about cases where we already had issues and incremental sort just "exacerbates" that (say, with LIMIT queries), but that's kinda the point I'm trying to make - stats are inherently incomplete / simplified, and some plans are more sensitive to that. Which is why I'm wondering if we might do the decision based on some information collected at runtime. > Also in this case we do not need extra GUC for explicit enabling of > mksort. There are too many parameters for optimizer and adding one more > will make tuning more complex. So I prefer that decision is take buy > optimizer itself based on the available information, especially if > criteria seems to be obvious. > The GUC is very useful for testing, so let's keep it for now. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
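For readers following the thread, here is a small, self-contained sketch of the multi-key sort idea under discussion (purely illustrative, with invented names, not the proposed patch): sort by the first key, then re-sort only the runs of duplicate values by the next key, so later keys are never compared for rows that earlier keys already distinguish. With mostly-distinct leading keys the recursion does almost nothing, which is why the open question above is how to predict duplication cheaply.

    #include <stdio.h>

    #define NCOLS 2
    #define NROWS 8

    typedef struct Row { int col[NCOLS]; } Row;

    /* simple insertion sort of rows[lo..hi) by a single column */
    static void
    sort_by_column(Row *rows, int lo, int hi, int c)
    {
        for (int i = lo + 1; i < hi; i++)
        {
            Row     key = rows[i];
            int     j = i - 1;

            while (j >= lo && rows[j].col[c] > key.col[c])
            {
                rows[j + 1] = rows[j];
                j--;
            }
            rows[j + 1] = key;
        }
    }

    static void
    mk_sort(Row *rows, int lo, int hi, int depth)
    {
        if (hi - lo < 2 || depth >= NCOLS)
            return;

        sort_by_column(rows, lo, hi, depth);

        /* recurse only into runs of equal values: later keys are never
         * compared for rows already distinguished by this key */
        int     run_start = lo;

        for (int i = lo + 1; i <= hi; i++)
        {
            if (i == hi || rows[i].col[depth] != rows[run_start].col[depth])
            {
                mk_sort(rows, run_start, i, depth + 1);
                run_start = i;
            }
        }
    }

    int
    main(void)
    {
        Row     rows[NROWS] = {{{2, 4}}, {{1, 9}}, {{2, 1}}, {{1, 3}},
                               {{2, 2}}, {{1, 1}}, {{2, 3}}, {{1, 2}}};

        mk_sort(rows, 0, NROWS, 0);
        for (int i = 0; i < NROWS; i++)
            printf("(%d, %d)\n", rows[i].col[0], rows[i].col[1]);
        return 0;
    }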
Re: ❓ JSON Path Dot Precedence
Hi, following up on some old threads. > On Apr 10, 2024, at 16:44, David E. Wheeler wrote: > > That makes sense, thanks. It’s just a little odd to me that the resulting > path isn’t a query at all. To Erik’s point: what path can `'0x2.p10` even > select? I’m wondering whether the jsonpath parser should be updated to reject cases like this. I think it will always return no results. AFAICT, there’s no way to navigate to an object identifier immediately after a number: david=# select '0x2.p10'::jsonpath; jsonpath --- (2)."p10" (1 row) david=# select jsonb_path_query(target => '[0, 1, {"p10": true}]', path => '0x2.p10'); jsonb_path_query -- (0 rows) david=# select jsonb_path_query(target => '{"0x2": {"p10": true}}', path => '0x2.p10'); jsonb_path_query -- (0 rows) It’s just inherently meaningless. BTW, it’s not limited to hex numbers: david=# select '(2).p10'::jsonpath; jsonpath --- (2)."p10" OTOH, maybe that’s a corner case we can live with. Best, David
Re: Parallel CREATE INDEX for GIN indexes
On Mon, 8 Jul 2024, 13:38 Tomas Vondra, wrote: > > On 7/8/24 11:45, Matthias van de Meent wrote: > > As to the GIN TS code itself; yes it's more complicated, mainly > > because it uses several optimizations to reduce unnecessary > > allocations and (de)serializations of GinTuples, and I'm aware of even > > more such optimizations that can be added at some point. > > > > As an example: I suspect the use of GinBuffer in writetup_index_gin to > > be a significant resource drain in my patch because it lacks > > "freezing" in the tuplesort buffer. When no duplicate key values are > > encountered, the code is nearly optimal (except for a full tuple copy > > to get the data into the GinBuffer's memory context), but if more than > > one GinTuple has the same key in the merge phase we deserialize both > > tuple's posting lists and merge the two. I suspect that merge to be > > more expensive than operating on the compressed posting lists of the > > GinTuples themselves, so that's something I think could be improved. I > > suspect/guess it could save another 10% in select cases, and will > > definitely reduce the memory footprint of the buffer. > > Another thing that can be optimized is the current approach of > > inserting data into the index: I think it's kind of wasteful to > > decompress and later re-compress the posting lists once we start > > storing the tuples on disk. > > > > I need to experiment with this a bit more, to better understand the > behavior and pros/cons. But one thing that's not clear to me is why > would this be better than simply increasing the amount of memory for the > initial BuildAccumulator buffer ... > > Wouldn't that have pretty much the same effect? I don't think so: The BuildAccumulator buffer will probably never be guaranteed to have space for all index entries, though it does use memory more efficiently than Tuplesort. Therefore, it will likely have to flush keys multiple times into sorted runs, with likely duplicate keys existing in the tuplesort. My patch 0008 targets the reduction of IO and CPU once BuildAccumulator has exceeded its memory limits. It reduces the IO and computational requirement of Tuplesort's sorted-run merging by merging the tuples in those sorted runs in that merge process, reducing the number of tuples, bytes stored, and number of compares required in later operations. It enables some BuildAccumulator-like behaviour inside the tuplesort, without needing to assign huge amounts of memory to the BuildAccumulator by allowing efficient spilling to disk of the incomplete index data. And, during merging, it can work with O(n_merge_tapes * tupsz) of memory, rather than O(n_distinct_keys * tupsz). This doesn't make BuildAccumulator totally useless, but it does reduce the relative impact of assigning more memory. One significant difference between the modified Tuplesort and BuildAccumulator is that the modified Tuplesort only merges the entries once they're getting written, i.e. flushed from the in-memory structure; while BuildAccumulator merges entries as they're being added to the in-memory structure. Note that this difference causes BuildAccumulator to use memory more efficiently during in-memory workloads (it doesn't duplicate keys in memory), but as BuildAccumulator doesn't have spilling it doesn't handle full indexes' worth of data (it does duplciate keys on disk). I hope this clarifies things a bit. 
I'd be thrilled if we'd be able to put BuildAccumulator-like behaviour into the in-memory portion of Tuplesort, but that'd require a significantly deeper understanding of the Tuplesort internals than what I currently have, especially in the area of its memory management. Kind regards Matthias van de Meent Neon (https://neon.tech)
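To make the duplicate-key merging described above a bit more concrete, here is a simplified, self-contained sketch (not the GinTuple/GinBuffer code itself; the Entry type and function names are invented, posting lists are reduced to plain ints, and error handling and freeing are omitted): when a key-sorted stream is consumed, consecutive entries with equal keys are combined by merging their sorted posting lists, so later passes see fewer, larger entries.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Entry
    {
        int     key;
        int    *items;      /* sorted item pointers, simplified to ints */
        int     nitems;
    } Entry;

    /* merge two sorted int arrays into a freshly allocated sorted array */
    static int *
    merge_items(const int *a, int na, const int *b, int nb, int *nout)
    {
        int    *out = malloc(sizeof(int) * (na + nb));
        int     i = 0, j = 0, k = 0;

        while (i < na && j < nb)
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < na)
            out[k++] = a[i++];
        while (j < nb)
            out[k++] = b[j++];
        *nout = k;
        return out;
    }

    /* collapse a key-sorted stream in place; returns the new entry count */
    static int
    collapse_duplicates(Entry *in, int n)
    {
        int     nout = 0;

        for (int i = 0; i < n; i++)
        {
            if (nout > 0 && in[nout - 1].key == in[i].key)
            {
                int     merged_n;
                int    *merged = merge_items(in[nout - 1].items, in[nout - 1].nitems,
                                             in[i].items, in[i].nitems, &merged_n);

                in[nout - 1].items = merged;
                in[nout - 1].nitems = merged_n;
            }
            else
                in[nout++] = in[i];
        }
        return nout;
    }

    int
    main(void)
    {
        int     a[] = {1, 5}, b[] = {2, 3}, c[] = {4};
        Entry   stream[] = {{10, a, 2}, {10, b, 2}, {20, c, 1}};
        int     n = collapse_duplicates(stream, 3);

        for (int i = 0; i < n; i++)
        {
            printf("key %d:", stream[i].key);
            for (int j = 0; j < stream[i].nitems; j++)
                printf(" %d", stream[i].items[j]);
            printf("\n");
        }
        return 0;
    }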
Re: Commitfest manager for July 2024
On 7/3/24 12:51, Andrey M. Borodin wrote: On 3 Jul 2024, at 01:08, Corey Huinker wrote: I'll give it a shot. Great, thank you! Do you have extended access to CF? Like activity log and mass-mail functions? If no I think someone from PG_INFRA can grant you necessary access. I can do that, although given that Corey and I are colleagues, it might be better if someone else on pginfra did. Or at least if a few other hackers tell me to "just do it"... -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: RFC: Additional Directory for Extensions
On Jun 25, 2024, at 18:31, David E. Wheeler wrote: > For those who prefer a GitHub patch review experience, see this PR: > > https://github.com/theory/postgres/pull/3/files Rebased and restored PGC_SUSET in the attached v5 patch, plus noted the required privileges in the docs. Best, David v5-0001-Add-extension_destdir-GUC.patch Description: Binary data
Re: ❓ JSON Path Dot Precedence
On Mon, Jul 8, 2024 at 8:27 AM David E. Wheeler wrote: > Hi, following up on some old threads. > > > On Apr 10, 2024, at 16:44, David E. Wheeler > wrote: > > > > That makes sense, thanks. It’s just a little odd to me that the > resulting path isn’t a query at all. To Erik’s point: what path can > `'0x2.p10` even select? > > I’m wondering whether the jsonpath parser should be updated to reject > cases like this. I think it will always return no results. AFAICT, there’s > no way to navigate to an object identifier immediately after a number: > > If we go down this path wouldn't the correct framing be: do not allow accessors after scalars ? The same argument applies to false/"john" and other scalar types since by definition none of them have subcomponents to be accessed. That said, the parser has a lax mode which somewhat implies it doesn't expect the jsonpath type to perform much in the way of validation of the semantic correctness of the encoded path expression. I like the idea of a smarter expression-holding type and would even wish to have had this on day one. Retrofitting is less appealing. We document a similarity with regular expressions here where we, for better and worse, have lived without a regexppath data type forever and leave it to the executor to tell the user their pattern is invalid. Leaning on that precedence here makes accepting the status quo more reasonable. Though strict/lax modes and, I think, variables, motivates me to put my vote toward the "do more validation" group. Does the standard even have a separate type here or is that our implementation detail invention? David J.
Re: Patch bug: Fix jsonpath .* on Arrays
On Jun 27, 2024, at 04:17, Michael Paquier wrote: > The tests of jsonb_jsonpath.sql include a lot of patterns for @? and > jsonb_path_query with the lax and strict modes, so shouldn't these > additions be grouped closer to the existing tests rather than added at > the end of the file? I’ve moved them closer to other tests for unwrapping behavior in the attached updated and rebased v3 patch. Best, David v3-0001-Add-tests-for-jsonpath-.-on-arrays.patch Description: Binary data
Re: ❓ JSON Path Dot Precedence
On Jul 8, 2024, at 12:05, David G. Johnston wrote: > If we go down this path wouldn't the correct framing be: do not allow > accessors after scalars ? The same argument applies to false/"john" and > other scalar types since by definition none of them have subcomponents to be > accessed. Yes, excellent point. > That said, the parser has a lax mode which somewhat implies it doesn't expect > the jsonpath type to perform much in the way of validation of the semantic > correctness of the encoded path expression. My understanding is that lax mode means it ignores where the JSON doesn’t abide by expectations of the path expression, not that the path parsing is lax. > I like the idea of a smarter expression-holding type and would even wish to > have had this on day one. Retrofitting is less appealing. We document a > similarity with regular expressions here where we, for better and worse, have > lived without a regexppath data type forever and leave it to the executor to > tell the user their pattern is invalid. Leaning on that precedence here > makes accepting the status quo more reasonable. Though strict/lax modes and, > I think, variables, motivates me to put my vote toward the "do more > validation" group. This feels different from a documented difference in behavior as an implementation choice, like path regex vs. Spencer. In this case, the expression is technically meaningless, but there’s never so much as an error thrown. > Does the standard even have a separate type here or is that our > implementation detail invention? Sorry, separate type for what? Best, David
Re: ❓ JSON Path Dot Precedence
On Mon, Jul 8, 2024 at 9:12 AM David E. Wheeler wrote: > On Jul 8, 2024, at 12:05, David G. Johnston > wrote: > > > Does the standard even have a separate type here or is that our > implementation detail invention? > > Sorry, separate type for what? > > We created a data type named: jsonpath. Does the standard actually have that data type and defined parsing behavior or does it just have functions where one of the inputs is text whose contents are a path expression? David J.
Re: Add GiST support for mixed-width integer operators
On 7/6/24 05:04, Andrey M. Borodin wrote:>> On 5 Jul 2024, at 23:46, Paul Jungwirth wrote: this commit adds support for all combinations of int2/int4/int8 for all five btree operators (=/>). Looks like a nice feature to have. Would it make sense to do something similar to float8? Or, perhaps, some other types from btree_gist? Here is another patch adding float4/8 and also date/timestamp/timestamptz, in the same combinations as btree. No other types seem like they deserve this treatment. For example btree doesn't mix oids with ints. Also, are we sure such tests will be stable? You're right, it was questionable. We hadn't analyzed the table, and after doing that the plan changes from a bitmap scan to an index-only scan. That makes more sense, and I doubt it will change now that it's based on statistics. Yours, -- Paul ~{:-) p...@illuminatedcomputing.comFrom c3b5f0c135b1d430f9ebf1367644f70726f5bb41 Mon Sep 17 00:00:00 2001 From: "Paul A. Jungwirth" Date: Fri, 5 Jul 2024 11:16:09 -0700 Subject: [PATCH v2] Use GiST index with mixed integer/float/timestamp widths A GiST index on a bigint will not be used when compared to an int literal, unless the user casts the literal to a bigint. This is surprising and causes invisible slowness. To fix that we can add pg_amop entries for mixed-width arguments. We do that for all int2/int4/int8 combinations, and also for float4/float8 and date/timestamp/timestamptz (not that timestamptz is wider than timestamp, but we can follow btree in allowing mixed operators for them too). With btree, we have one opfamily for all integer types, and it supports =(int2,int8), etc. With GiST we have a separate opfamily for each width, so it's less obvious where to put the mixed-width operators. But it makes sense for wider opfamilies to include support for smaller types, so I added =(int2,int8) and =(int4,int8) to gist_int8_ops, and =(int2,int4) to gist_int4_ops. Also =(float4,float8) to gist_float8_ops, =(date,timestamptz) and =(timestamp,timestamptz to gist_timestamptz_ops, and =(date,timestamp) to gist_timestamp_ops. And as long as we're doing this, we might as well support all five btree operators (<, <=, =, >=, >). 
--- contrib/btree_gist/Makefile | 3 +- contrib/btree_gist/btree_gist--1.7--1.8.sql | 101 contrib/btree_gist/btree_gist.control | 2 +- contrib/btree_gist/expected/float8.out | 12 +++ contrib/btree_gist/expected/int4.out| 12 +++ contrib/btree_gist/expected/int8.out| 21 contrib/btree_gist/expected/timestamp.out | 12 +++ contrib/btree_gist/expected/timestamptz.out | 22 + contrib/btree_gist/meson.build | 1 + contrib/btree_gist/sql/float8.sql | 8 ++ contrib/btree_gist/sql/int4.sql | 7 ++ contrib/btree_gist/sql/int8.sql | 10 ++ contrib/btree_gist/sql/timestamp.sql| 7 ++ contrib/btree_gist/sql/timestamptz.sql | 11 +++ 14 files changed, 227 insertions(+), 2 deletions(-) create mode 100644 contrib/btree_gist/btree_gist--1.7--1.8.sql diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile index 073dcc745c4..965f0559d64 100644 --- a/contrib/btree_gist/Makefile +++ b/contrib/btree_gist/Makefile @@ -33,7 +33,8 @@ EXTENSION = btree_gist DATA = btree_gist--1.0--1.1.sql \ btree_gist--1.1--1.2.sql btree_gist--1.2.sql btree_gist--1.2--1.3.sql \ btree_gist--1.3--1.4.sql btree_gist--1.4--1.5.sql \ - btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql + btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql \ + btree_gist--1.7--1.8.sql PGFILEDESC = "btree_gist - B-tree equivalent GiST operator classes" REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \ diff --git a/contrib/btree_gist/btree_gist--1.7--1.8.sql b/contrib/btree_gist/btree_gist--1.7--1.8.sql new file mode 100644 index 000..221df3cd6f8 --- /dev/null +++ b/contrib/btree_gist/btree_gist--1.7--1.8.sql @@ -0,0 +1,101 @@ +/* contrib/btree_gist/btree_gist--1.6--1.7.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "ALTER EXTENSION btree_gist UPDATE TO '1.7'" to load this file. \quit + +-- Add mixed integer operators. +-- This lets Postgres use the index without casting literals to bigints, etc. +-- Whereas btree has one integer_ops opfamily, +-- GiST has gist_int2_ops, gist_int4_ops, and gist_int8_ops. +-- We add the mixed operators to whichever opfamily matches the larger type, +-- sort of like "promoting". +ALTER OPERATOR FAMILY gist_int8_ops USING gist ADD + OPERATOR 1 < (int8, int4), + OPERATOR 2 <= (int8, int4), + OPERATOR 3 = (int8, int4), + OPERATOR 4 >= (int8, int4), + OPERATOR 5 > (int8, int4), + + OPERATOR 1 < (int4, int8), + OPERATOR 2 <= (int4, int8), + OPERATOR 3 = (int4, int8), + OPERATOR 4 >= (int4, int8), + OPERATOR 5 > (int4, int8), + + OPERATOR 1 < (int8, int2), + OPERATOR 2 <
Re: Detoasting optionally to make Explain-Analyze less misleading
> > So thanks again! and this will really help a lot of people. I'd like to echo this thanks to you all. While looking to add support for SERIALIZE in an explain visualisation tool I work on, I realised there isn't yet an equivalent auto_explain parameter for SERIALIZE. I'm not sure if this is a deliberate omission (perhaps for a similar reason planning time is not included in auto_explain?), but I didn't see it mentioned above, so I thought best to ask in case not. Thanks again, Michael
Re: ❓ JSON Path Dot Precedence
On Jul 8, 2024, at 12:17, David G. Johnston wrote: > We created a data type named: jsonpath. Does the standard actually have that > data type and defined parsing behavior or does it just have functions where > one of the inputs is text whose contents are a path expression? Ah, got it. D
Re: Detoasting optionally to make Explain-Analyze less misleading
Michael Christofides writes: > While looking to add support for SERIALIZE in an explain visualisation tool > I work on, I realised there isn't yet an equivalent auto_explain parameter > for SERIALIZE. I'm not sure if this is a deliberate omission (perhaps for a > similar reason planning time is not included in auto_explain?), but I > didn't see it mentioned above, so I thought best to ask in case not. I'm not sure there's a need for it. When a query runs under auto_explain, the output values will be sent to the client, so those cycles should be accounted for anyway, no? (Perhaps the auto_explain documentation should mention this?) regards, tom lane
Re: Windows: openssl & gssapi dislike each other
Hi, On 2024-06-13 00:12:51 +0900, Imran Zaheer wrote: > I removed the macro from the sslinfo.c as suggested by Andrew. Then I > was thinking maybe we can undo some other similar code. What precisely do you mean by that? Just getting rid of the "ordered include" of openssl headers in {fe,be}-secure-openssl.h? > I rearranged the headers to their previous position in > be-secure-openssl.c and in fe-secure-openssl.c. I was able to compile > with gssapi and openssl enabled. You can look into the original commits. [1, > 2] > Is it ok if we undo the changes from these commits? > > I am attaching two new patches. > One with macro guards removed from sslinfo.c. > Second patch will additionally rearrange headers for > be-secure-openssl.c and in fe-secure-openssl.c to their previous > position. One thing that concerns me with this is that there are other includes of gssapi/gssapi.h (e.g. in , which haven't been changed here. ISTM we ought to apply the same change to all of those, otherwise we're just waiting for the problem to re-appear. I wonder if we should add a src/include/libpq/pg-gssapi.h or such, which could wrap the entire ifdeferry for gss support. Something like

#ifdef ENABLE_GSS

#if defined(HAVE_GSSAPI_H)
#include <gssapi.h>
#include <gssapi_ext.h>
#else
#include <gssapi/gssapi.h>
#include <gssapi/gssapi_ext.h>
#endif

/*
 * On Windows, <wincrypt.h> includes a #define for X509_NAME, which breaks our
 * ability to use OpenSSL's version of that symbol if <wincrypt.h> is pulled
 * in after <openssl/ssl.h> ... and, at least on some builds, it is. We
 * can't reliably fix that by re-ordering #includes, because libpq/libpq-be.h
 * #includes <openssl/ssl.h>. Instead, just zap the #define again here.
 */
#ifdef X509_NAME
#undef X509_NAME
#endif

#endif							/* ENABLE_GSS */

Which'd allow the various places using gss (libpq-be.h, be-gssapi-common.h, libpq-int.h) to just include pg-gssapi.h and get all of the above without redundancy? Another thing that concerns me about this approach is that it seems to assume that the only source of such conflicting includes is gssapi. What if some other header pulls in wincrypt.h? But I can't really see a way out of that... Greetings, Andres Freund
Re: jsonpath: Inconsistency of timestamp_tz() Output
On Jul 2, 2024, at 10:53, David E. Wheeler wrote: > ``` patch > --- a/src/test/regress/expected/jsonb_jsonpath.out > +++ b/src/test/regress/expected/jsonb_jsonpath.out > @@ -2914,7 +2914,7 @@ HINT: Use *_tz() function for time zone support. > select jsonb_path_query_tz('"2023-08-15"', '$.timestamp_tz()'); -- should work > jsonb_path_query_tz > - > - "2023-08-15T07:00:00+00:00" > + "2023-08-15T00:00:00-07:00" > (1 row) > > select jsonb_path_query('"12:34:56"', '$.timestamp_tz()'); > @@ -3003,7 +3003,7 @@ HINT: Use *_tz() function for time zone support. > select jsonb_path_query_tz('"2023-08-15 12:34:56"', '$.timestamp_tz()'); -- > should work > jsonb_path_query_tz > - > - "2023-08-15T12:34:56+00:00" > + "2023-08-15T12:34:56+10:00" > (1 row) > > select jsonb_path_query('"10-03-2017 12:34"', '$.datetime("dd-mm- > HH24:MI")'); > ``` FWIW I fixed this issue in my jsonpath port, which I released over the weekend.[1] You can see what I think should be the proper output for these two examples in these Playground links, where the response will use your browser’s time zone: [2], [3]. Best, David [1]: https://justatheory.com/2024/07/go-sqljson-path/ [2]: https://theory.github.io/sqljson/playground/?p=%2524.timestamp_tz%28%29&j=%25222023-08-15%2522&a=&o=49 [3]: https://theory.github.io/sqljson/playground/?p=%2524.timestamp_tz%28%29&j=%25222023-08-15%252012%253A34%253A56%2522&a=&o=49
Should we work around msvc failing to compile tab-complete.c?
Hi, Compiling postgres on windows with tab-completion support fails either with "fatal error C1026: parser stack overflow, program too complex”. or (in recent versions) with "…/src/bin/psql/tab-complete.c(4023): fatal error C1001: Internal compiler error." I've reported this to the visual studio folks at [1] and supposedly a fix is waiting to be released. I don't know if that's just for the internal compiler error in 2022 or lifting the limitation. It's pretty easy to work around the error [2]. I wonder if we should just do that, then we don't have to insist on a very new msvc version being used and it'll work even if they just decide to fix the internal compiler error. Besides just breaking the if-else-if chain at some arbitrary point, we could alternatively make a more general efficiency improvement, and add a bit of nesting at a few places. E.g. instead of having ~33 checks for COMMENT being the first word, we could use "outer" else-if for COMMENT and check the sub-variants inside that branch: else if (HeadMatches("COMMENT")) { if (Matches("COMMENT")) COMPLETE_WITH("ON"); else if (Matches("COMMENT", "ON")) ... } if we do that to one or two common keywords (COMMENT and DROP seems easiest) we'd be out from that limitation, and would probably reduce the number of cycles for completions noticeably. OTOH, that'd be a more noisy change and it'd be less defensible to apply it to 17 - which IMO would be nice to do. Greetings, Andres Freund [1] https://developercommunity.visualstudio.com/t/tab-completec4023:-fatal-error-C1001:/10685868 [2] https://postgr.es/m/CACLU5mRRLAW2kca3k2gVDM8kow6wgvT_BCaeg37jz%3DKyj1afvw%40mail.gmail.com
Re: debugging what might be a perf regression in 17beta2
My results have too much variance so this is a false alarm. One day I might learn whether the noise is from HW, Postgres or my test method. I ended up trying 10 builds between 17beta1 and 17beta2, but even with that I don't have a clear signal. On Fri, Jul 5, 2024 at 8:48 PM David Rowley wrote: > On Sat, 6 Jul 2024 at 15:11, MARK CALLAGHAN wrote: > > On small servers I have at home I can reproduce the problem without > concurrent queries and 17beta2 is 5% to 10% slower there. > > > > The SQL statement for the scan microbenchmark is: > > SELECT * from %s WHERE LENGTH(c) < 0 > > Can you share the CREATE TABLE and script to populate so others can try? > > Also, have you tried with another compiler? It does not seem > impossible that the refactoring done in heapam.c or the read stream > code might have changed something to make the binary more sensitive to > caching effects in this area. One thing I often try when I can't > pinpoint the exact offending commit is to write a script to try the > first commit of each day for, say, 30 days to see if there is any > saw-toothing in performance numbers over that period. > > David > -- Mark Callaghan mdcal...@gmail.com
Re: debugging what might be a perf regression in 17beta2
A writeup for the benchmark results is here - https://smalldatum.blogspot.com/2024/07/postgres-17beta2-vs-sysbench-looking.html pg17beta2 and pg17beta1 look good so far On Mon, Jul 8, 2024 at 10:49 AM MARK CALLAGHAN wrote: > My results have too much variance so this is a false alarm. One day I > might learn whether the noise is from HW, Postgres or my test method. > I ended up trying 10 builds between 17beta1 and 17beta2, but even with > that I don't have a clear signal. > > On Fri, Jul 5, 2024 at 8:48 PM David Rowley wrote: > >> On Sat, 6 Jul 2024 at 15:11, MARK CALLAGHAN wrote: >> > On small servers I have at home I can reproduce the problem without >> concurrent queries and 17beta2 is 5% to 10% slower there. >> > >> > The SQL statement for the scan microbenchmark is: >> > SELECT * from %s WHERE LENGTH(c) < 0 >> >> Can you share the CREATE TABLE and script to populate so others can try? >> >> Also, have you tried with another compiler? It does not seem >> impossible that the refactoring done in heapam.c or the read stream >> code might have changed something to make the binary more sensitive to >> caching effects in this area. One thing I often try when I can't >> pinpoint the exact offending commit is to write a script to try the >> first commit of each day for, say, 30 days to see if there is any >> saw-toothing in performance numbers over that period. >> >> David >> > > > -- > Mark Callaghan > mdcal...@gmail.com > -- Mark Callaghan mdcal...@gmail.com
Re: Detoasting optionally to make Explain-Analyze less misleading
> I'm not sure there's a need for it. When a query runs under > auto_explain, the output values will be sent to the client, > so those cycles should be accounted for anyway, no? > Yes, great point, the total duration reported by auto_explain includes it. Explicit serialization stats might still be helpful for folks when it is the bottleneck, but less useful for sure (especially if nothing else causes big discrepancies between the duration reported by auto_explain and the "actual total time" of the root node). (Perhaps the auto_explain documentation should mention this?) > I'd value this. I notice the folks working on the other new explain parameter (memory) opted to add a comment to the auto_explain source code to say it wasn't supported. Thanks again, Michael
Re: Confine vacuum skip logic to lazy_scan_skip
On Sun, Jul 7, 2024 at 10:49 AM Noah Misch wrote: > > On Fri, Jun 28, 2024 at 05:36:25PM -0400, Melanie Plageman wrote: > > I've attached a WIP v11 streaming vacuum patch set here that is > > rebased over master (by Thomas), so that I could add a CF entry for > > it. It still has the problem with the extra WAL write and fsync calls > > investigated by Thomas above. Thomas has some work in progress doing > > streaming write-behind to alleviate the issues with the buffer access > > strategy and streaming reads. When he gets a version of that ready to > > share, he will start a new "Streaming Vacuum" thread. > > To avoid reviewing the wrong patch, I'm writing to verify the status here. > This is Needs Review in the commitfest. I think one of these two holds: > > 1. Needs Review is valid. > 2. It's actually Waiting on Author. You're commissioning a review of the >future-thread patch, not this one. > > If it's (1), given the WIP marking, what is the scope of the review you seek? > I'm guessing performance is out of scope; what else is in or out of scope? Ah, you're right. I moved it to "Waiting on Author" as we are waiting on Thomas' version which has a fix for the extra WAL write/sync behavior. Sorry for the "Needs Review" noise! - Melanie
Re: Should we work around msvc failing to compile tab-complete.c?
Andres Freund writes: > Compiling postgres on windows with tab-completion support fails either with > "fatal error C1026: parser stack overflow, program too complex”. > or (in recent versions) with > "…/src/bin/psql/tab-complete.c(4023): fatal error C1001: Internal compiler > error." > It's pretty easy to work around the error [2]. I wonder if we should just do > that, then we don't have to insist on a very new msvc version being used and > it'll work even if they just decide to fix the internal compiler error. I'm on board with doing something here, but wouldn't we want to back-patch at least a minimal fix to all supported branches? As for the specific thing to do, that long if-chain seems horrid from an efficiency standpoint as well as stressing compiler implementations. I realize that this pretty much only needs to run at human-reaction-time speed, but it still offends my inner nerd. I also wonder when we are going to hit problems with some earlier test unexpectedly pre-empting some later one. Could we perhaps have a table of first words of each interesting match, and do a lookup in that before dispatching to code segments that are individually similar to what's there now? regards, tom lane
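A tiny sketch of the lookup-table idea, with invented names rather than psql's real Matches()/HeadMatches() machinery: keep a table keyed by the first word and dispatch to a per-command routine, so each routine stays small and the top-level else-if chain disappears.

    #include <stdio.h>
    #include <string.h>

    typedef const char **(*CompleteFn) (const char **words, int nwords);

    static const char **
    complete_comment(const char **words, int nwords)
    {
        static const char *on[] = {"ON", NULL};

        return (nwords == 1) ? on : NULL;   /* "COMMENT" -> "ON", etc. */
    }

    static const char **
    complete_drop(const char **words, int nwords)
    {
        static const char *objs[] = {"TABLE", "INDEX", "VIEW", NULL};

        return (nwords == 1) ? objs : NULL;
    }

    static const struct
    {
        const char *first_word;
        CompleteFn  fn;
    }           dispatch[] = {
        {"COMMENT", complete_comment},
        {"DROP", complete_drop},
    };

    static const char **
    complete(const char **words, int nwords)
    {
        for (size_t i = 0; i < sizeof(dispatch) / sizeof(dispatch[0]); i++)
        {
            if (strcmp(words[0], dispatch[i].first_word) == 0)
                return dispatch[i].fn(words, nwords);
        }
        return NULL;                /* fall back to generic handling */
    }

    int
    main(void)
    {
        const char *words[] = {"COMMENT"};
        const char **matches = complete(words, 1);

        for (int i = 0; matches && matches[i]; i++)
            printf("%s\n", matches[i]);
        return 0;
    }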
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Tue, Jul 2, 2024 at 7:07 PM Melanie Plageman wrote: > > On Mon, Jun 24, 2024 at 4:27 AM Heikki Linnakangas wrote: > > > > Would it be possible to make it robust so that we could always run it > > with "make check"? This seems like an important corner case to > > regression test. > > Okay, I've attached a new version of the patch and a new version of > the repro that may be fast and stable enough to commit. It is more > minimal than the previous version. I made the table as small as I > could to still trigger two rounds of index vacuuming. I tried to make > it as stable as possible. I also removed the cursor on the standby > that could trigger recovery conflicts. It would be super helpful if > someone could take a look at the test and point out any ways I could > make it even more likely to be stable. Attached v3 has one important additional component in the test -- I use pg_stat_progress_vacuum to confirm that we actually do more than one pass of index vacuuming. Otherwise, it would have been trivial for the test to incorrectly pass. I could still use another pair of eyes on the test (looking out for stability enhancing measures I could take). I also would be happy if someone looked at the commit message and/or comments to let me know if they make sense. I'll finish with versions of the patch and test targeted at v14-16 and propose those before committing this. - Melanie From b83a5e4ab84cb4b4de104fee05daf0bf88fe70ff Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Thu, 20 Jun 2024 16:38:30 -0400 Subject: [PATCH v3 1/2] Test that vacuum removes tuples older than OldestXmin If vacuum fails to prune a tuple killed before OldestXmin, it will decide to freeze its xmax and later error out in pre-freeze checks. --- src/test/recovery/meson.build | 1 + .../recovery/t/043_vacuum_horizon_floor.pl| 236 ++ 2 files changed, 237 insertions(+) create mode 100644 src/test/recovery/t/043_vacuum_horizon_floor.pl diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build index b1eb77b1ec1..1d55d6bf560 100644 --- a/src/test/recovery/meson.build +++ b/src/test/recovery/meson.build @@ -51,6 +51,7 @@ tests += { 't/040_standby_failover_slots_sync.pl', 't/041_checkpoint_at_promote.pl', 't/042_low_level_backup.pl', + 't/043_vacuum_horizon_floor.pl', ], }, } diff --git a/src/test/recovery/t/043_vacuum_horizon_floor.pl b/src/test/recovery/t/043_vacuum_horizon_floor.pl new file mode 100644 index 000..1b82d935be5 --- /dev/null +++ b/src/test/recovery/t/043_vacuum_horizon_floor.pl @@ -0,0 +1,236 @@ +use strict; +use warnings; +use PostgreSQL::Test::Cluster; +use Test::More; + +# Test that vacuum prunes away all dead tuples killed before OldestXmin + +# Set up nodes +my $node_primary = PostgreSQL::Test::Cluster->new('primary'); +$node_primary->init(allows_streaming => 'physical'); + +$node_primary->append_conf( + 'postgresql.conf', qq[ +hot_standby_feedback = on +log_recovery_conflict_waits = true +autovacuum = off +maintenance_work_mem = 1024 +]); +$node_primary->start; + +my $node_replica = PostgreSQL::Test::Cluster->new('standby'); + +$node_primary->backup('my_backup'); +$node_replica->init_from_backup($node_primary, 'my_backup', + has_streaming => 1); + +$node_replica->start; + +my $test_db = "test_db"; +$node_primary->safe_psql('postgres', "CREATE DATABASE $test_db"); + +# Save the original connection info for later use +my $orig_conninfo = $node_primary->connstr(); + +my $table1 = "vac_horizon_floor_table"; + +# Long-running Primary Session A +my $psql_primaryA = + 
$node_primary->background_psql($test_db, on_error_stop => 1); + +# Long-running Primary Session B +my $psql_primaryB = + $node_primary->background_psql($test_db, on_error_stop => 1); + +# Insert and update enough rows that we force at least one round of index +# vacuuming before getting to a dead tuple which was killed after the standby +# is disconnected. +# +# We need multiple index vacuuming passes to repro because after the standby +# reconnects to the primary, our backend's GlobalVisStates will not have been +# updated with the new horizon until something forces them to be updated. +# +# _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId() at the +# end of a round of index vacuuming, updating the backend's GlobalVisState +# and, in our case, moving maybe_needed backwards. +# +# Then vacuum's first pass will continue and pruning will find our later +# inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to +# maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin. +$node_primary->safe_psql($test_db, qq[ + CREATE TABLE ${table1}(col1 int) with (autovacuum_enabled=false, fillfactor=10); + INSERT INTO $table1 VALUES(7); + INSERT INTO $table1 SELECT generate_series(1, 40) % 3; + CREATE INDEX on ${table1}(col1); + UPDATE $table1 SET col1 = 3 WHERE col1 = 0; + INSERT INTO $table1 VALUES(7); +]); + +# We will later move
Re: Allow logical failover slots to wait on synchronous replication
Hi, Thanks Bertrand for taking a look at the patch. On Mon, Jun 17, 2024 at 8:19 AM Bertrand Drouvot wrote: > > + int mode = SyncRepWaitMode; > > It's set to SyncRepWaitMode and then never change. Worth to get rid of "mode"? > I took a deeper look at this with GDB and I think it's necessary to cache the value of "mode". We first check: if (mode == SYNC_REP_NO_WAIT) return true; However after this check it's possible to receive a SIGHUP changing SyncRepWaitMode to SYNC_REP_NO_WAIT (e.g. synchronous_commit = 'on' -> 'off'), leading to lsn[-1]. > 2 > > + static XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE] = > {InvalidXLogRecPtr}; > > I did some testing and saw that the lsn[] values were not always set to > InvalidXLogRecPtr right after. It looks like that, in that case, we should > avoid setting the lsn[] values at compile time. Then, what about? > > 1. remove the "static". > > or > > 2. keep the static but set the lsn[] values after its declaration. I'd prefer to keep the static because it reduces unnecessary contention on SyncRepLock if logical client has fallen behind. I'll add a change with your second suggestion. > 3 > > - /* > -* Return false if not all the standbys have caught up to the > specified > -* WAL location. > -*/ > - if (caught_up_slot_num != standby_slot_names_config->nslotnames) > - return false; > + if (!XLogRecPtrIsInvalid(lsn[mode]) && lsn[mode] >= > wait_for_lsn) > + return true; > > lsn[] values are/(should have been, see 2 above) just been initialized to > InvalidXLogRecPtr so that XLogRecPtrIsInvalid(lsn[mode]) will always return > true. I think this check is not needed then. Removed. > Out of curiosity, did you compare with standby_slot_names_from_syncrep set to > off > and standby_slot_names not empty? I didn't think 'standby_slot_names' would impact TPS as much since it's not grabbing the SyncRepLock but here's a quick test. Writer with 5 synchronous replicas, 10 pg_recvlogical clients and pgbench all running from the same server. Command: pgbench -c 4 -j 4 -T 600 -U "ec2-user" -d postgres -r -P 5 Result with: standby_slot_names = 'replica_1,replica_2,replica_3,replica_4,replica_5' latency average = 5.600 ms latency stddev = 2.854 ms initial connection time = 5.503 ms tps = 714.148263 (without initial connection time) Result with: standby_slot_names_from_syncrep = 'true', synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' latency average = 5.740 ms latency stddev = 2.543 ms initial connection time = 4.093 ms tps = 696.776249 (without initial connection time) Result with nothing set: latency average = 5.090 ms latency stddev = 3.467 ms initial connection time = 4.989 ms tps = 785.665963 (without initial connection time) Again I think it's possible to improve the synchronous numbers if we cache but I'll try that out in a bit. -- John Hsu - Amazon Web Services
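To spell out the race being described, here is a sketch with stand-in globals (not the patch itself): SYNC_REP_NO_WAIT is -1, so if the function re-read SyncRepWaitMode after the early-return check, a reload that flips synchronous_commit could turn the later array access into lsn[-1]. Taking a local snapshot once means every later use sees one consistent mode.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for the real typedef */

    #define SYNC_REP_NO_WAIT        (-1)    /* as in syncrep.h */
    #define NUM_SYNC_REP_WAIT_MODE  3

    /* stand-ins for the real globals; a SIGHUP reload can rewrite the first
     * one between any two reads of it */
    static volatile int SyncRepWaitMode = SYNC_REP_NO_WAIT;
    static XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE];

    static bool
    standby_confirmed_up_to(XLogRecPtr wait_for_lsn)
    {
        int         mode = SyncRepWaitMode;     /* snapshot once */

        if (mode == SYNC_REP_NO_WAIT)
            return true;

        /* all later uses go through the snapshot, so a concurrent change
         * cannot turn this into lsn[-1] */
        return lsn[mode] >= wait_for_lsn;
    }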
Re: Allow logical failover slots to wait on synchronous replication
Hi Amit, Thanks for taking a look. On Tue, Jun 18, 2024 at 10:34 PM Amit Kapila wrote: > > > The reading indicates when you set 'standby_slot_names_from_syncrep', > the TPS reduces as compared to when it is not set. It would be better > to see the data comparing 'standby_slot_names_from_syncrep' and the > existing parameter 'standby_slot_names'. I added new benchmark numbers in the reply to Bertrand, but I'll include in this thread for posterity. Writer with 5 synchronous replicas, 10 pg_recvlogical clients and pgbench all running from the same server. Command: pgbench -c 4 -j 4 -T 600 -U "ec2-user" -d postgres -r -P 5 Result with: standby_slot_names = 'replica_1,replica_2,replica_3,replica_4,replica_5' latency average = 5.600 ms latency stddev = 2.854 ms initial connection time = 5.503 ms tps = 714.148263 (without initial connection time) Result with: standby_slot_names_from_syncrep = 'true', synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' latency average = 5.740 ms latency stddev = 2.543 ms initial connection time = 4.093 ms tps = 696.776249 (without initial connection time) Result with nothing set: latency average = 5.090 ms latency stddev = 3.467 ms initial connection time = 4.989 ms tps = 785.665963 (without initial connection time) > Can we make it a default > behavior that logical slots marked with a failover option will wait > for 'synchronous_standby_names' as per your patch's idea unless > 'standby_slot_names' is specified? I don't know if there is any value > in setting the 'failover' option for a slot without specifying > 'standby_slot_names', so was wondering if we can additionally tie it > to 'synchronous_standby_names'. Any better ideas? > No, I think that works pretty cleanly actually. Reserving some special keyword isn't great which is the only other thing I can think of. I've updated the patch and tests to reflect that. Attached the patch that addresses these changes. -- John Hsu - Amazon Web Services 0002-Wait-on-synchronous-replication-by-default-for-logic.patch Description: Binary data
Re: allow changing autovacuum_max_workers without restarting
Here is a rebased patch. One thing that still bugs me is that there is no feedback sent to the user when autovacuum_max_workers is set higher than autovacuum_worker_slots. I think we should at least emit a WARNING, perhaps from the autovacuum launcher, i.e., once when the launcher starts and then again as needed via HandleAutoVacLauncherInterrupts(). Or we could fail to start in PostmasterMain() and then ignore later misconfigurations via a GUC check hook. I'm not too thrilled about adding more GUC check hooks that depend on the value of other GUCs, but I do like the idea of failing instead of silently proceeding with a different value than the user configured. Any thoughts? -- nathan >From bd486d1ab302c4654b9cfbc57230bcf9b140711e Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 22 Jun 2024 15:05:44 -0500 Subject: [PATCH v7 1/1] allow changing autovacuum_max_workers without restarting --- doc/src/sgml/config.sgml | 28 +++- doc/src/sgml/runtime.sgml | 12 ++--- src/backend/access/transam/xlog.c | 2 +- src/backend/postmaster/autovacuum.c | 44 --- src/backend/postmaster/postmaster.c | 2 +- src/backend/storage/lmgr/proc.c | 6 +-- src/backend/utils/init/postinit.c | 6 +-- src/backend/utils/misc/guc_tables.c | 11 - src/backend/utils/misc/postgresql.conf.sample | 3 +- src/include/postmaster/autovacuum.h | 1 + src/test/perl/PostgreSQL/Test/Cluster.pm | 1 + 11 files changed, 83 insertions(+), 33 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index f627a3e63c..da30d1ea4f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -8544,6 +8544,25 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; + + autovacuum_worker_slots (integer) + + autovacuum_worker_slots configuration parameter + + + + +Specifies the number of backend slots to reserve for autovacuum worker +processes. The default is 16. This parameter can only be set at server +start. + + +When changing this value, consider also adjusting +. + + + + autovacuum_max_workers (integer) @@ -8554,7 +8573,14 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; Specifies the maximum number of autovacuum processes (other than the autovacuum launcher) that may be running at any one time. The default -is three. This parameter can only be set at server start. +is three. This parameter can only be set in the +postgresql.conf file or on the server command line. + + +Note that a setting for this value which is higher than + will have no effect, +since autovacuum workers are taken from the pool of slots established +by that setting. 
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index 2f7c618886..4bb37faffe 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -781,13 +781,13 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such SEMMNI Maximum number of semaphore identifiers (i.e., sets) -at least ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 7) / 16) plus room for other applications +at least ceil((max_connections + autovacuum_worker_slots + max_wal_senders + max_worker_processes + 7) / 16) plus room for other applications SEMMNS Maximum number of semaphores system-wide -ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 7) / 16) * 17 plus room for other applications +ceil((max_connections + autovacuum_worker_slots + max_wal_senders + max_worker_processes + 7) / 16) * 17 plus room for other applications @@ -838,7 +838,7 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such When using System V semaphores, PostgreSQL uses one semaphore per allowed connection (), allowed autovacuum worker process -(), allowed WAL sender process +(), allowed WAL sender process (), and allowed background process (), in sets of 16. Each such set will @@ -847,13 +847,13 @@ psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such other applications. The maximum number of semaphores in the system is set by SEMMNS, which consequently must be at least as high as max_connections plus -autovacuum_max_workers plus max_wal_senders, +autovacuum_worker_slots plus max_wal_senders, plus max_worker_processes, plus one extra for each 16 allowed connections plus workers (see t
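Returning to the WARNING idea mentioned before the patch: a minimal sketch of the launcher-side check is below. This is illustration only; autovacuum_worker_slots is the GUC added by the patch, while the function name, message wording, and errhint are made up. The intended call sites would be launcher startup and HandleAutoVacLauncherInterrupts(), as suggested above.

/*
 * Sketch only: warn when autovacuum_max_workers exceeds the number of
 * reserved worker slots, so the silent capping is at least visible in the log.
 */
static void
warn_if_av_workers_capped(void)
{
    if (autovacuum_max_workers > autovacuum_worker_slots)
        ereport(WARNING,
                (errmsg("autovacuum_max_workers (%d) exceeds autovacuum_worker_slots (%d)",
                        autovacuum_max_workers, autovacuum_worker_slots),
                 errhint("At most %d autovacuum workers will run concurrently.",
                         autovacuum_worker_slots)));
}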
Re: Improve the connection failure error messages
Nisha Moond writes: > Attached v5 patch with the translator comments as suggested. I looked at this, and I agree with the goal, but I find just about all of the translator comments unnecessary. The ones that are useful are useful only because the message is violating one of our message style guidelines [1]: When citing the name of an object, state what kind of object it is. Rationale: Otherwise no one will know what “foo.bar.baz” refers to. So, for example, where you have + +/* + * translator: first %s is the subscription name, second %s is the + * error + */ + errmsg("subscription \"%s\" could not connect to the publisher: %s", stmt->subname, err))); I don't find that that translator comment is adding anything. But there are a couple of places like +/* + * translator: first %s is the slotsync worker name, second %s is the + * error + */ +errmsg("\"%s\" could not connect to the primary server: %s", app_name.data, err)); I think that the right cure for the ambiguity here is not to add a translator comment, but to label the name properly, perhaps like errmsg("synchronization worker \"%s\" could not connect to the primary server: %s", app_name.data, err)); regards, tom lane [1] https://www.postgresql.org/docs/devel/error-style-guide.html#ERROR-STYLE-GUIDE-OBJECT-TYPE
Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal
On Thu, Jul 4, 2024 at 10:35 AM Fujii Masao wrote: > +1 for v18 or later. However, since the reported issue is in v17, > it needs to be addressed without such a improved check mechanism. Here is a draft patch for that. This is only lightly tested at this point, so further testing would be greatly appreciated, as would code review. NOTE THAT this seems to require bumping XLOG_PAGE_MAGIC and PG_CONTROL_VERSION, which I imagine won't be super-popular considering we've already shipped beta2. However, I don't see another principled way to fix the reported problem. We do currently log an XLOG_PARAMETER_CHANGE message at startup when wal_level has changed, but that's useless for this purpose. Consider that WAL summarization on a standby can start from any XLOG_CHECKPOINT_SHUTDOWN or XLOG_CHECKPOINT_REDO record, and the most recent XLOG_PARAMETER_CHANGE may be arbitrarily far behind that point, and replay may be arbitrarily far ahead of that point by which things may have changed again. The only way to know for certain what wal_level was in effect at the time one of those records was generated is if the record itself contains that information, so that's what I did. If I'm reading the code correctly, this doesn't increase the size of the WAL, because XLOG_CHECKPOINT_REDO was previously storing a dummy integer, and CheckPoint looks to have an alignment padding hole where I inserted the new member. Even if the size did increase, I don't think it would matter: these records aren't emitted frequently enough for it to be a problem. But I think it doesn't. However, it does change the format of WAL, and because the checkpoint data is also stored in the control file, it changes the format of that, too. If we really, really don't want to let those values change at this point in the release cycle, then we could alternatively just document the kinds of scenarios that this protects against as unsupported and admonish people sternly not to do that kind of thing. I think it's actually sort of rare, because you have to do something like: enable WAL summarization, do some stuff, restart with wal_level=minimal and summarize_wal=off, do some more stuff but not so much stuff that any of the WAL you generate gets removed, then restart again with wal_level=replica/logical and summarize_wal=on again, so that the WAL generated at the lower WAL level gets summarized before it gets removed. Then, an incremental backup that you took after doing that dance against a prior backup taken before doing all that might get confused about what to do. Although that does not sound like a terribly likely or advisable sequence of steps, I bet some people will do it and then it will be hard to figure out what went wrong (and even after we do, it won't make those people whole). And on the flip side I don't think that bumping XLOG_PAGE_MAGIC or PG_CONTROL_VERSION will cause huge inconvenience, because people don't tend to put beta2 into production -- the early adopters tend to go for .0 or .1, not betaX. -- Robert Haas EDB: http://www.enterprisedb.com v1-0001-Do-not-summarize-WAL-if-generated-with-wal_level-.patch Description: Binary data
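Put another way, once the generating wal_level travels in the checkpoint/redo records themselves, the summarizer's decision becomes a trivial comparison. The following is an illustrative sketch only (the function name is invented and the patch's actual structure may differ); WAL_LEVEL_REPLICA is the existing WalLevel enum value from access/xlog.h:

#include "access/xlog.h"        /* WalLevel: WAL_LEVEL_MINIMAL, _REPLICA, _LOGICAL */

/*
 * Sketch: WAL is only safe to summarize if it was generated at
 * wal_level >= replica, as recorded in the checkpoint/redo record.
 */
static bool
wal_summarizable(int record_wal_level)
{
    return record_wal_level >= WAL_LEVEL_REPLICA;
}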
Re: ssl tests fail due to TCP port conflict
On 2024-07-08 Mo 8:00 AM, Alexander Lakhin wrote: Hello, 07.06.2024 17:25, Tom Lane wrote: Andrew Dunstan writes: I still think my patch to force TCP mode for the SSL test makes sense as well. +1 to both things. If that doesn't get the failure rate down to an acceptable level, we can look at the retry idea. I have pushed patches for both of those (i.e. start SSL test nodes in TCP mode and change the range of ports we allocate server ports from). I didn't see this email until after I had pushed them. I'd like to add that the kerberos/001_auth test also suffers from the port conflict, but slightly differently. Look for example at [1]: krb5kdc.log contains: Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](info): setting up network... Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Cannot bind server socket on 127.0.0.1.55853 Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Failed setting up a UDP socket (for 127.0.0.1.55853) Jul 02 09:29:41 andres-postgres-buildfarm-v5 krb5kdc[471964](Error): Address already in use - Error setting up network As far as I can see, the port for kdc is chosen by PostgreSQL::Test::Kerberos, via PostgreSQL::Test::Cluster::get_free_port(), which checks only for TCP port availability (with can_bind()), but not for UDP, so this increases the probability of the conflict for this test (a similar failure: [2]). Although we can also find a failure with TCP: [3] (It's not clear to me what processes can use UDP ports while testing, but maybe those buildfarm animals are running on the same logical machine simultaneously?) [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-02%2009%3A27%3A15 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-05-15%2001%3A25%3A07 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-07-04%2008%3A28%3A19 Let's see if this persists now that we are testing for free ports in a different range, not the range usually used for ephemeral ports. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
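The port probe itself lives in Perl (PostgreSQL::Test::Cluster::get_free_port() and its can_bind() helper), but the missing piece is simply attempting a UDP bind in addition to the TCP one before declaring a port free. In C terms the idea is roughly the following; this is an illustrative sketch, not project code:

#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Return true if a UDP socket can currently be bound to host:port. */
static bool
udp_port_is_free(const char *host, unsigned short port)
{
    struct sockaddr_in addr;
    int         sock = socket(AF_INET, SOCK_DGRAM, 0);
    bool        ok;

    if (sock < 0)
        return false;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, host, &addr.sin_addr) != 1)
    {
        close(sock);
        return false;
    }
    ok = (bind(sock, (struct sockaddr *) &addr, sizeof(addr)) == 0);
    close(sock);
    return ok;
}

A candidate port would then only be accepted if both the TCP and the UDP bind succeed.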
Re: Should we work around msvc failing to compile tab-complete.c?
Hi, On 2024-07-08 14:18:03 -0400, Tom Lane wrote: > Andres Freund writes: > > Compiling postgres on windows with tab-completion support fails either with > > "fatal error C1026: parser stack overflow, program too complex”. > > or (in recent versions) with > > "…/src/bin/psql/tab-complete.c(4023): fatal error C1001: Internal compiler > > error." > > > It's pretty easy to work around the error [2]. I wonder if we should just do > > that, then we don't have to insist on a very new msvc version being used and > > it'll work even if they just decide to fix the internal compiler error. > > I'm on board with doing something here, but wouldn't we want to > back-patch at least a minimal fix to all supported branches? I think we'd need to backpatch more for older branches; at least commit 3f28bd7337d (Thomas Munro, 2022-12-22 17:14:23 +1300, "Add work-around for VA_ARGS_NARGS() on MSVC"). Given that - afaict - tab completion never used to work with msvc, I think it'd be ok to just do it in 17 or 16+17 or such. Obviously nobody is currently building with readline support for windows - not sure if any packager is going to go back and add support for it in older branches. > As for the specific thing to do, that long if-chain seems horrid > from an efficiency standpoint as well as stressing compiler > implementations. Indeed. Even with gcc it's one of the slowest files to compile in our codebase. At -O2, tab-complete.c takes 7.3s with gcc 14. Just adding an else if (HeadMatches("ALTER")) { } around all the ALTER branches reduces that to 4.5s for me. Doing that for COMMENT and CREATE gets it down to 3.2s. > I realize that this pretty much only needs to run > at human-reaction-time speed, but it still offends my inner nerd. Same. > Could we perhaps have a table of first words of each interesting > match, and do a lookup in that before dispatching to code segments > that are individually similar to what's there now? Eventually, yes, it ought to be table driven in some form. But I wonder if that's setting the bar too high for now. Even just doing some manual re-nesting seems like it could get us quite far? I'm thinking of something like an outer if-else-if chain for CREATE, ALTER, DROP, and COMMENT, with everything that doesn't fit within those (e.g. various TailMatches() rules, backslash commands, etc.) handled in a final branch; that would reduce the number of redundant checks a lot. The amount of whitespace changes that'd imply isn't great, but I don't really see a way around that. Greetings, Andres Freund
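For illustration, the manual re-nesting described above amounts to a fragment like the one below inside psql_completion(). The macros (HeadMatches, Matches, MatchAny, COMPLETE_WITH) are the existing ones from tab-complete.c; the particular completions shown are abbreviated stand-ins, not the real rule set:

/* Rough sketch: dispatch once on the first word, then test the detailed
 * patterns inside that branch instead of in one flat chain. */
if (HeadMatches("ALTER"))
{
    if (Matches("ALTER", "TABLE", MatchAny))
        COMPLETE_WITH("ADD", "ALTER", "DROP", "RENAME");
    /* ... all other ALTER rules, nested here ... */
}
else if (HeadMatches("CREATE"))
{
    /* ... CREATE rules ... */
}
else if (HeadMatches("DROP"))
{
    /* ... DROP rules ... */
}
else
{
    /* TailMatches() rules, backslash commands, etc. */
}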
Re: tests fail on windows with default git settings
On 2024-07-07 06:30:57 -0400, Andrew Dunstan wrote: > > On 2024-07-07 Su 1:26 AM, Tom Lane wrote: > > Andres Freund writes: > > > Do we want to support checking out with core.autocrlf? > > -1. It would be a constant source of breakage, and you could never > > expect that (for example) making a tarball from such a checkout > > would match anyone else's results. > Yeah, totally agree. > > > > > If we do not want to support that, ISTM we ought to raise an error > > > somewhere? > > +1, if we can figure out how. > > > > > > > > ISTM the right fix is probably to use PG_BINARY_R mode instead of "r" when > opening the files, at least in the case of the test_json_parser tests. That does seem like it'd fix this issue, assuming the parser can cope with \r\n. I'm actually mildly surprised that the tests don't fail when *not* using autocrlf, because afaict test_json_parser_incremental.c doesn't set stdout to binary and thus we presumably end up with \r\n in the output? Except that that can't be true, because the test does pass on repos without autocrlf... That approach does seem to mildly conflict with Tom's and your preference for fixing this by disallowing core.autocrlf? If we do so, the test never ought to see a crlf? Greetings, Andres Freund
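(As an aside on the stdout question above: if the test program ever did need to guarantee \n-only output on Windows, putting stdout into binary mode is a one-liner. The following standalone sketch is for illustration only and is not part of any posted patch.)

#include <stdio.h>

#ifdef WIN32
#include <fcntl.h>
#include <io.h>
#endif

int
main(void)
{
#ifdef WIN32
    /* Stop the C runtime from translating \n into \r\n on stdout. */
    _setmode(_fileno(stdout), _O_BINARY);
#endif

    /* ... the rest of the test program would go here ... */
    printf("no carriage returns added on Windows\n");
    return 0;
}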
Re: Interrupts vs signals
On Mon, Jul 8, 2024 at 5:38 AM Heikki Linnakangas wrote: > Another approach would be to move the responsibility of background > worker state notifications out of postmaster completely. When a new > background worker is launched, the worker process itself could send the > notification that it has started. And similarly, when a worker exits, it > could send the notification just before exiting. There's a little race > condition with exiting: if a process is waiting for the bgworker to > exit, and launches a new worker immediately when the old one exits, > there will be a brief period when the old and new process are alive at > the same time. The old worker wouldn't be doing anything interesting > anymore since it's exiting, but it still counts towards > max_worker_processes, so launching the new process might fail because of > hitting the limit. Maybe we should just bump up max_worker_processes. Or > postmaster could check PMChildFlags and not count processes that have > already deregistered from PMChildFlags towards the limit. I can testify that the current system is the result of a lot of trial and error. I'm not saying it can't be made better, but my initial attempts at getting this to work (back in the 9.4 era) resembled what you proposed here, were consequently a lot simpler than what we have now, and also did not work. Race conditions like you mention here were part of that. Another consideration is that fork() can fail, and in that case, the process that tried to register the new background worker needs to find out that the background worker won't ever be starting. Yet another problem is that, even if fork() succeeds, the new process might fail before it executes any of our code e.g. because it seg faults very early, a case that actually happened to me - inadvertently - while I was testing these facilities. I ended up deciding that we can't rely on the new process to do anything until it's given us some signal that it is alive and able to carry out its duties. If it dies before telling us that, or never starts in the first place, we have to have some other way of finding that out, and it's difficult to see how that can happen without postmaster involvement. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Should we work around msvc failing to compile tab-complete.c?
Andres Freund writes: > Given that - afaict - tab completion never used to work with msvc, I think > it'd be ok to just do it in 17 or 16+17 or such. Obviously nobody is currently > building with readline support for windows - not sure if any packager is going > to go back and add support for it in older branches. If it doesn't even compile in existing branches, that seems to me to be plenty enough reason to call it a new HEAD-only feature. That gives us some leeway for experimentation and a more invasive patch than we might want to risk in stable branches. > On 2024-07-08 14:18:03 -0400, Tom Lane wrote: >> Could we perhaps have a table of first words of each interesting >> match, and do a lookup in that before dispatching to code segments >> that are individually similar to what's there now? > Eventually, yes, it ought to be table driven in some form. But I > wonder if that's setting the bar too high for now. Even just doing some manual > re-nesting seems like it could get us quite far? What I'm thinking is that (as you say) any fix that's not as ugly as Kirk's hack is going to involve massive reindentation, with corresponding breakage of any pending patches. It would be a shame to do that twice, so I'd rather look for a long-term fix instead of putting in a stop-gap. Let me play with the "table" idea a little bit and see what I get. regards, tom lane
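As a very rough illustration of the "table" idea: none of the handler functions below exist today; they stand in for code blocks individually similar to what tab-complete.c already contains, and dispatch_completion() and its first_word argument are likewise invented. lengthof() and pg_strcasecmp() are existing helpers.

typedef void (*CompletionHandler) (const char *text);

static const struct
{
    const char *first_word;
    CompletionHandler handler;
} completion_table[] = {
    {"ALTER", complete_alter},          /* hypothetical handlers, one per verb */
    {"COMMENT", complete_comment},
    {"CREATE", complete_create},
    {"DROP", complete_drop},
};

/*
 * Dispatch on the first word of the input; return false if nothing matched
 * and the caller should fall back to the existing flat if-chain.
 */
static bool
dispatch_completion(const char *first_word, const char *text)
{
    for (size_t i = 0; i < lengthof(completion_table); i++)
    {
        if (pg_strcasecmp(first_word, completion_table[i].first_word) == 0)
        {
            completion_table[i].handler(text);
            return true;
        }
    }
    return false;
}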
Re: 010_pg_basebackup.pl vs multiple filesystems
Hi, On 2024-07-07 09:10:48 -0400, Andrew Dunstan wrote: > On 2024-07-07 Su 7:28 AM, Andrew Dunstan wrote: > > I'll be happy to hear of one. I agree it's a mess. Maybe we could test > > that the temp directory is on the same device on Windows and skip the > > test if not? You could still get the test to run by setting TMPDIR > > and/or friends. > Maybe we should just not try to rename the directory. Looking at the test > I'm pretty sure the directory should be empty. Instead of trying to move it, > let's just remove it, and recreate it in the tmp location. Good catch, yes, that'd be much better! > diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl > b/src/bin/pg_basebackup/t/010_pg_basebackup.pl > index 489dde4adf..c0c334c6fc 100644 > --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl > +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl > @@ -363,8 +363,8 @@ my $sys_tempdir = > PostgreSQL::Test::Utils::tempdir_short; > # Elsewhere use $tempdir to avoid file system boundary issues with moving. > my $tmploc = $windows_os ? $sys_tempdir : $tempdir; The comment would need a bit of editing, I guess. I think we should consider just getting rid of the OS-dependent switch now; it shouldn't be needed anymore? > -rename("$pgdata/pg_replslot", "$tmploc/pg_replslot") > - or BAIL_OUT "could not move $pgdata/pg_replslot"; > +rmtree("$pgdata/pg_replslot"); Should this perhaps be an rmdir, to ensure that we're not removing something we don't want (e.g. somebody adding an earlier test for slots that then gets broken by the rmtree)? Greetings, Andres Freund
Re: tests fail on windows with default git settings
Andres Freund writes: > On 2024-07-07 06:30:57 -0400, Andrew Dunstan wrote: >> ISTM the right fix is probably to use PG_BINARY_R mode instead of "r" when >> opening the files, at least in the case if the test_json_parser tests. > That approach does seem to mildly conflict with Tom and your preference for > fixing this by disallowing core.autocrlf? If we do so, the test never ought to > see a crlf? Is this code that will *never* be applied to user-supplied files? We certainly should tolerate \r\n in the general case (we even have written-down project policy about that!). While I wouldn't complain too hard about assuming that our own test files don't contain \r\n, if the code might get copied into a non-test scenario then it could create problems later. regards, tom lane
Re: 010_pg_basebackup.pl vs multiple filesystems
On Sun, Jul 7, 2024 at 3:02 AM Andres Freund wrote: > While working on [1] I encountered the issue that, on github-actions, > 010_pg_basebackup.pl fails on windows. > > The reason for that is that github actions uses two drives, with TMP/TEMP > located on c:, the tested code on d:. This causes the following code to fail: > > # Create a temporary directory in the system location. > my $sys_tempdir = PostgreSQL::Test::Utils::tempdir_short; Whatever we end up doing about this, it would be a good idea to check the other places that use tempdir_short and see if they also need adjustment. -- Robert Haas EDB: http://www.enterprisedb.com
Re: tests fail on windows with default git settings
On 2024-07-08 Mo 4:16 PM, Andres Freund wrote: On 2024-07-07 06:30:57 -0400, Andrew Dunstan wrote: On 2024-07-07 Su 1:26 AM, Tom Lane wrote: Andres Freund writes: Do we want to support checking out with core.autocrlf? -1. It would be a constant source of breakage, and you could never expect that (for example) making a tarball from such a checkout would match anyone else's results. Yeah, totally agree. If we do not want to support that, ISTM we ought to raise an error somewhere? +1, if we can figure out how. ISTM the right fix is probably to use PG_BINARY_R mode instead of "r" when opening the files, at least in the case of the test_json_parser tests. That does seem like it'd fix this issue, assuming the parser can cope with \r\n. Yes, the parser can handle \r\n. Note that they can only appear as whitespace in JSON; inside string values they can only be present via escapes. I'm actually mildly surprised that the tests don't fail when *not* using autocrlf, because afaict test_json_parser_incremental.c doesn't set stdout to binary and thus we presumably end up with \r\n in the output? Except that that can't be true, because the test does pass on repos without autocrlf... That approach does seem to mildly conflict with Tom's and your preference for fixing this by disallowing core.autocrlf? If we do so, the test never ought to see a crlf? IDK. I normally use core.autocrlf=false core.eol=lf on Windows. The editors I use are reasonably well behaved ;-) What I suggest (see attached) is we run the diff command with --strip-trailing-cr on Windows. Then we just won't care if the expected file and/or the output file has CRs. Not sure what the issue is with pg_bsd_indent, though. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com diff --git a/src/test/modules/test_json_parser/t/003_test_semantic.pl b/src/test/modules/test_json_parser/t/003_test_semantic.pl index 74e0fa5bb1..2a430ad0c4 100644 --- a/src/test/modules/test_json_parser/t/003_test_semantic.pl +++ b/src/test/modules/test_json_parser/t/003_test_semantic.pl @@ -29,7 +29,9 @@ print $fh $stdout, "\n"; close($fh); -($stdout, $stderr) = run_command([ "diff", "-u", $fname, $test_out ]); +my @diffopts = ("-u"); +push(@diffopts, "--strip-trailing-cr") if $windows_os; +($stdout, $stderr) = run_command([ "diff", @diffopts, $fname, $test_out ]); is($stdout, "", "no output diff"); is($stderr, "", "no diff error"); diff --git a/src/test/modules/test_json_parser/test_json_parser_incremental.c b/src/test/modules/test_json_parser/test_json_parser_incremental.c index f4c442ac36..47040e1e42 100644 --- a/src/test/modules/test_json_parser/test_json_parser_incremental.c +++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c @@ -124,7 +124,7 @@ main(int argc, char **argv) makeJsonLexContextIncremental(&lex, PG_UTF8, need_strings); initStringInfo(&json); - if ((json_file = fopen(testfile, "r")) == NULL) + if ((json_file = fopen(testfile, PG_BINARY_R)) == NULL) pg_fatal("error opening input: %m"); if (fstat(fileno(json_file), &statbuf) != 0) diff --git a/src/test/modules/test_json_parser/test_json_parser_perf.c b/src/test/modules/test_json_parser/test_json_parser_perf.c index ea85626cbd..74cc5f3f54 100644 --- a/src/test/modules/test_json_parser/test_json_parser_perf.c +++ b/src/test/modules/test_json_parser/test_json_parser_perf.c @@ -55,7 +55,7 @@ main(int argc, char **argv) sscanf(argv[1], "%d", &iter); - if ((json_file = fopen(argv[2], "r")) == NULL) + if ((json_file = fopen(argv[2], PG_BINARY_R)) == NULL) pg_fatal("Could not open input file '%s': %m", argv[2]); while ((n_read = fread(buff, 1, 6000, json_file)) > 0)
Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
On Sun, Jun 30, 2024 at 10:48 AM Noah Misch wrote: > That looks like a reasonable user experience. Is any field newly-nullable? Technically I think the answer is no, since backends such as walwriter already have null database and user fields. It's new for a client backend to have nulls there, though. > That said, it > may be more fruitful to arrange for authentication timeout to cut through PAM > etc. That seems mostly out of our hands -- the misbehaving modules are free to ignore our signals (and do). Is there another way to force the issue? > Hanging connection slots hurt even if they lack an xmin. Oh, would releasing the xmin not really move the needle, then? > I assume it > takes an immediate shutdown to fix them? That's my understanding, yeah. > > Would anyone like me to be more aggressive, and create a pgstat entry > > as soon as we have the opening transaction? Or... as soon as a > > connection is made? > > All else being equal, I'd like backends to have one before taking any lmgr > lock or snapshot. I can look at this for the next patchset version. > > I haven't decided how to test these patches. Seems like a potential > > use case for injection points, but I think I'd need to preload an > > injection library rather than using the existing extension. Does that > > seem like an okay way to go? > > Yes. I misunderstood how injection points worked. No preload module needed, so v2 adds a waitpoint and a test along with a couple of needed tweaks to BackgroundPsql. I think 0001 should probably be applied independently. Thanks, --Jacob v2-0001-BackgroundPsql-handle-empty-query-results.patch Description: Binary data v2-0002-Test-Cluster-let-background_psql-work-asynchronou.patch Description: Binary data v2-0004-WIP-report-external-auth-calls-as-wait-events.patch Description: Binary data v2-0003-pgstat-report-in-earlier-with-STATE_AUTHENTICATIN.patch Description: Binary data
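To sketch the idea in 0004 concretely: around an external authentication call, the backend could report a wait event so pg_stat_activity shows it as blocked rather than apparently idle. pgstat_report_wait_start()/pgstat_report_wait_end() are the existing wait-event hooks; the event name WAIT_EVENT_AUTH_PAM below is invented for illustration and is not necessarily what the patch uses:

/* Sketch: surface time spent inside an external auth library (here PAM)
 * as a wait event.  WAIT_EVENT_AUTH_PAM is an illustrative placeholder. */
pgstat_report_wait_start(WAIT_EVENT_AUTH_PAM);
retval = pam_authenticate(pamh, 0);
pgstat_report_wait_end();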
Re: 010_pg_basebackup.pl vs multiple filesystems
On 2024-07-08 Mo 4:45 PM, Robert Haas wrote: On Sun, Jul 7, 2024 at 3:02 AM Andres Freund wrote: While working on [1] I encountered the issue that, on github-actions, 010_pg_basebackup.pl fails on windows. The reason for that is that github actions uses two drives, with TMP/TEMP located on c:, the tested code on d:. This causes the following code to fail: # Create a temporary directory in the system location. my $sys_tempdir = PostgreSQL::Test::Utils::tempdir_short; Whatever we end up doing about this, it would be a good idea to check the other places that use tempdir_short and see if they also need adjustment. I don't think it's a problem. There are lots of systems that have tempdir on a different device. That's why we previously saw lots of errors from this code, resulting in the present incomplete workaround. The fact that we haven't seen such errors from other tests means we should be ok. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
Committed. -- nathan
Re: 010_pg_basebackup.pl vs multiple filesystems
Hi, On 2024-07-08 16:45:42 -0400, Robert Haas wrote: > On Sun, Jul 7, 2024 at 3:02 AM Andres Freund wrote: > > While working on [1] I encountered the issue that, on github-actions, > > 010_pg_basebackup.pl fails on windows. > > > > The reason for that is that github actions uses two drives, with TMP/TEMP > > located on c:, the tested code on d:. This causes the following code to > > fail: > > > > # Create a temporary directory in the system location. > > my $sys_tempdir = PostgreSQL::Test::Utils::tempdir_short; > > Whatever we end up doing about this, it would be a good idea to check > the other places that use tempdir_short and see if they also need > adjustment. FWIW, this was the only test that failed due to TMP being on a separate partition. I couldn't quite enable all tests (PG_TEST_EXTRA including libpq_encryption fails due to some gssapi issue, kerberos fails due to paths being wrong / not set up for windows, load_balance because I haven't set up hostnames), but I don't think any of the issues around those tests are related. I did also grep for other uses of tmpdir_short, they all seem to be differently motivated. I also looked for uses of rename() in tests. While src/bin/pg_verifybackup/t/007_wal.pl does move stuff around, it afaict is always going to be on the same filesystem. Greetings, Andres Freund
Re: Document use of ldapurl with LDAP simple bind
On Fri, Jun 28, 2024 at 12:11 AM Peter Eisentraut wrote: > This appears to imply that specifying ldapurl is only applicable for > search+bind. Maybe that whole message should be simplified to something > like > > "configuration mixes arguments for simple bind and search+bind" > > (The old wording also ignores that the error might arise via "ldapsuffix".) I kept the imperative "cannot" and tried to match the terminology with our documentation at [1]: cannot mix options for simple bind and search+bind modes WDYT? --Jacob [1] https://www.postgresql.org/docs/17/auth-ldap.html v2-0001-docs-explain-how-to-use-ldapurl-with-simple-bind.patch Description: Binary data v2-0002-ldap-test-ldapurl-with-simple-bind.patch Description: Binary data v2-0003-hba-improve-error-when-mixing-LDAP-bind-modes.patch Description: Binary data
Re: tests fail on windows with default git settings
Hi, On 2024-07-08 16:56:10 -0400, Andrew Dunstan wrote: > On 2024-07-08 Mo 4:16 PM, Andres Freund wrote: > > I'm actually mildly surprised that the tests don't fail when *not* using > > autocrlf, because afaict test_json_parser_incremental.c doesn't set stdout > > to > > binary and thus we presumably end up with \r\n in the output? Except that > > that > > can't be true, because the test does pass on repos without autocrlf... > > > > > > That approach does seem to mildly conflict with Tom and your preference for > > fixing this by disallowing core.autocrlf? If we do so, the test never ought > > to > > see a crlf? > > > > IDK. I normally use core.autocrlf=false core.eol=lf on Windows. The editors > I use are reasonably well behaved ;-) :) > What I suggest (see attached) is we run the diff command with > --strip-trailing-cr on Windows. Then we just won't care if the expected file > and/or the output file has CRs. I was wondering about that too, but I wasn't sure we can rely on that flag being supported... > Not sure what the issue is with pg_bsd_indent, though. I think it's purely that we *read* with fopen("r") and write with fopen("wb"). Which means that any \r\n in the input will be converted to \n in the output. That's not a problem if the repo has been cloned without autocrlf, as there are no crlf in the expected files, but if autocrlf has been used, the expected files don't match. It doesn't look like it'd be trivial to make indent remember what was used in the input. So I think for now the best path is to just use .gitattributes to exclude the expected files from crlf conversion. If we don't want to do so repo wide, we can do so just for these files. Greetings, Andres Freund