RE: [Proposal] Add foreign-server health checks infrastructure

2023-03-13 Thread Hayato Kuroda (Fujitsu)
Dear Katsuragi-san,

> Hi,
> 
> I updated the status of the patch to ready for committer.
> 
> regards,

Thank you so much for your review!
Now we can wait for comments from senior members and committers.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Add macros for ReorderBufferTXN toptxn

2023-03-13 Thread vignesh C
On Fri, 10 Mar 2023 at 04:36, Peter Smith  wrote:
>
> Hi,
>
> During a recent code review, I was confused multiple times by the
> toptxn member of ReorderBufferTXN, which is defined only for
> sub-transactions.
>
> e.g. txn->toptxn member == NULL means the txn is a top level txn.
> e.g. txn->toptxn member != NULL means the txn is not a top level txn
>
> It makes sense if you squint and read it slowly enough, but IMO it's
> too easy to accidentally misinterpret the meaning when reading code
> that uses this member.
>
> ~
>
> Such code can be made easier to read just by introducing some simple macros:
>
> #define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
> #define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
> #define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)
>
> ~
>
> PSA a small patch that does this.
>
> (Tests OK using make check-world)
>
> Thoughts?

A few comments:
1) Can we move the macros along with the other macros present in this
file, just above this structure, similar to the macros added for
txn_flags:
/* Toplevel transaction for this subxact (NULL for top-level). */
+#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
+#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
+#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)

2) The macro names can be changed to rbtxn_is_toptxn, rbtxn_is_subtxn
and rbtxn_get_toptxn to keep them consistent with the others:
/* Toplevel transaction for this subxact (NULL for top-level). */
+#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
+#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
+#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)

3) We could add separate comments for each of the macros:
/* Toplevel transaction for this subxact (NULL for top-level). */
+#define isa_toptxn(rbtxn) (rbtxn->toptxn == NULL)
+#define isa_subtxn(rbtxn) (rbtxn->toptxn != NULL)
+#define get_toptxn(rbtxn) (isa_subtxn(rbtxn) ? rbtxn->toptxn : rbtxn)

4) We check whether txn->toptxn is not null twice here, both in the if
condition and in the assignment; we could retain the assignment
operation as it was earlier to remove the second check:
-   if (txn->toptxn)
-   txn = txn->toptxn;
+   if (isa_subtxn(txn))
+   txn = get_toptxn(txn);

We could avoid one check again by:
+   if (isa_subtxn(txn))
+   txn = txn->toptxn;
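
Putting 2) and 3) together, the result might look like this (just a
sketch of the suggestions above, not committed code):

    /* Is this a top-level transaction (not a subtransaction)? */
    #define rbtxn_is_toptxn(rbtxn) ((rbtxn)->toptxn == NULL)

    /* Is this a subtransaction? */
    #define rbtxn_is_subtxn(rbtxn) ((rbtxn)->toptxn != NULL)

    /* Get the top-level transaction of this (sub)transaction. */
    #define rbtxn_get_toptxn(rbtxn) \
        (rbtxn_is_subtxn(rbtxn) ? (rbtxn)->toptxn : (rbtxn))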

Regards,
Vignesh




Re: ICU locale validation / canonicalization

2023-03-13 Thread Peter Eisentraut

On 09.03.23 21:17, Jeff Davis wrote:

Personally, I'm not on board with this behavior:

=> CREATE COLLATION test (provider = icu, locale =
'de@collation=phonebook');
NOTICE:  using language tag "de-u-co-phonebk" for locale
"de@collation=phonebook"

I mean, maybe that is a thing we want to do somehow sometime, to
migrate
people to the "new" spellings, but the old ones aren't wrong.


I see what you mean; I'm not sure of the best thing to do here. We are
adjusting the string passed by the user, and it feels like some users
might want to know that. It's a NOTICE, not a WARNING, so it's not
meant to imply that it's wrong.


For clarification, I wasn't complaining about the notice, but about the 
automatic conversion from old-style ICU locale ID to language tag.



It also doesn't appear to address
how to handle ICU before version 54.


Do you have a specific concern here?


What we had discussed a while ago in one of these threads is that ICU
versions before 54 do not support keyword lists, and we have custom code
to do that parsing ourselves, but we don't have code to do the same for
language tags.  Therefore, if I understand this right, if we
automatically convert ICU locale IDs to language tags, as shown above,
then we break support for such locales in those older ICU versions.
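
For reference, the conversion in question maps onto ICU's public API
like this (a standalone sketch against ICU, not PostgreSQL code):

    #include <stdio.h>
    #include <unicode/uloc.h>

    int
    main(void)
    {
        char        tag[ULOC_FULLNAME_CAPACITY];
        UErrorCode  status = U_ZERO_ERROR;

        /* Old-style ICU locale ID in, BCP 47 language tag out. */
        uloc_toLanguageTag("de@collation=phonebook", tag, sizeof(tag),
                           1 /* strict */, &status);
        if (U_SUCCESS(status))
            printf("%s\n", tag);    /* prints "de-u-co-phonebk" */
        return 0;
    }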





Re: [PATCH] Align GSS and TLS error handling in PQconnectPoll()

2023-03-13 Thread Michael Paquier
On Fri, Mar 10, 2023 at 10:42:08AM +0900, Michael Paquier wrote:
> Perhaps you are right and there is no actual reason to worry here.

I have been thinking about that for the last few days, and yes a
backpatch should be OK, so done now down to 12.
--
Michael




Re: pg_dump versus hash partitioning

2023-03-13 Thread Julien Rouhaud
On Sun, Mar 12, 2023 at 03:46:52PM -0400, Tom Lane wrote:
> Julien Rouhaud  writes:
> > The BEGIN + TRUNCATE is only there to avoid generating WAL records just
> > in case the wal_level is minimal.  I don't remember if that optimization
> > still exists, but if yes we could avoid doing that if the server's
> > wal_level is replica or higher?  That's not perfect but it would help in
> > many cases.
>
> After thinking about this, it seems like a better idea is to skip the
> TRUNCATE if we're doing load-via-partition-root.  In that case it's
> clearly a dangerous thing to do regardless of deadlock worries, since
> it risks discarding previously-loaded data that came over from another
> partition.  (IOW this seems like an independent, pre-existing bug in
> load-via-partition-root mode.)

It seems quite unlikely that one could actually truncate already-restored
data without eventually running into a deadlock, but it's still possible,
so agreed.

> The trick is to detect in pg_restore whether pg_dump chose to do
> load-via-partition-root.  If we have a COPY statement we can fairly
> easily examine it to see if the target table is what we expect or
> something else.  However, if the table was dumped as INSERT statements
> it'd be far messier; the INSERTs are not readily accessible from the
> code that needs to make the decision.
>
> What I propose we do about that is further tweak things so that
> load-via-partition-root forces dumping via COPY.  AFAIK the only
> compelling use-case for dump-as-INSERTs is in transferring data
> to a non-Postgres database, which is a context in which dumping
> partitioned tables as such is pretty hopeless anyway.

It seems acceptable to me.

> (I wonder if
> we should have some way to dump all the contents of a partitioned
> table as if it were unpartitioned, to support such migration.)

(this would be nice to have)

> An alternative could be to extend the archive TOC format to record
> directly whether a given TABLE DATA object loads data via partition
> root or normally.  Potentially we could do that without an archive
> format break by defining te->defn for TABLE DATA to be empty for
> normal dumps (as it is now) or something like "-- load via partition root"
> for the load-via-partition-root case.  However, depending on examination
> of the COPY command would already work for the vast majority of existing
> archive files, so I feel like it might be the preferable choice.

Given that this approach wouldn't help with existing dump files (at least
the ones using COPY; in any case the ones using INSERT are doomed), I'm
slightly in favor of the first approach, plus later adding an easy,
non-magical way to produce dumps that don't depend on partitioning.  It
would mean that you could only produce such dumps using pg16 client
binaries, but such binaries would also work with older server versions,
so it doesn't seem like a huge problem in the long run.
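
For what it's worth, the restore-side test for the TOC-marker variant
could be as trivial as something like this (hypothetical helper;
TocEntry/defn as in pg_backup_archiver.h, marker string as suggested
above):

    static bool
    load_via_partition_root(TocEntry *te)
    {
        /* Normal dumps leave defn empty for TABLE DATA entries. */
        return te->defn != NULL &&
               strcmp(te->defn, "-- load via partition root") == 0;
    }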




Re: Assert failure of the cross-check for nullingrels

2023-03-13 Thread Richard Guo
On Fri, Mar 10, 2023 at 4:13 PM Richard Guo  wrote:

> I wonder if we should consider syn_xxxhand rather than min_xxxhand in
> clause_is_computable_at when we check if clause mentions any nullable
> Vars.  But I'm not sure about that.
>

No, considering syn_xxxhand is not right.  After some join order
commutation we may form the join with only its min_lefthand and
min_righthand.  In this case, if we check against syn_xxxhand rather than
min_xxxhand in clause_is_computable_at, we may end up being unable
to find a proper place for some quals.  I can see this problem in the
query below.

select * from t1 left join ((select t2.x from t2 left join t3 on t2.x = t3.x
where t3.x is null) s left join t4 on s.x = t4.x) on s.x = t1.x;

Suppose we've formed join t1/t2 and go ahead to form the join of t1/t2
to t3.  If we consider t1/t2 join's syn_xxxhand, then the pushed down
qual 't3.x is null' would not be computable at this level because it
mentions nullable Vars from t1/t2 join's syn_righthand and meanwhile is
not marked with t1/t2 join.  This is not correct and would trigger an
Assert.

Back to the original issue: if a join has more than one qual, we actually
treat them as a whole when we check if identity 3 applies, as well as
when we adjust them to be suitable for commutation according to identity
3.  So when we check if a qual is computable at a given level, I think
we should also consider the join's quals as a whole.  I'm thinking that
we use a 'group' notion for RestrictInfos and then use the clause_relids
of the 'group' in clause_is_computable_at.  Does this make sense?

Thanks
Richard


RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-13 Thread houzj.f...@fujitsu.com
On Monday, March 13, 2023 2:23 PM Önder Kalacı   wrote:
Hi,

> > >
> > >
> > > Reading [1], I think I can follow what you suggest. So, basically,
> > > if the leftmost column is not filtered, we have the following:
> > >
> > >> but the entire index would have to be scanned, so in most cases the
> > >> planner would prefer a sequential table scan over using the index.
> > >
> > >
> > > So, in our case, we could follow a similar approach. If the leftmost
> > > column of the index is not sent over the wire from the pub, we can
> > > prefer the sequential scan.
> > >
> > > Is my understanding of your suggestion accurate?
> > >
> > 
> > Yes. I request an opinion from Shi-San who has reported the problem.
> > 
> 
> I also agree with this.
> And I think we can mention this in the comments if we do so.
> 
> Already commented on FindUsableIndexForReplicaIdentityFull() on v44.

Thanks for updating the patch.

I noticed one problem:

+static bool
+RemoteRelContainsLeftMostColumnOnIdx(IndexInfo *indexInfo,
+                                     LogicalRepRelation *remoterel)
+{
+    AttrNumber  keycol;
+
+    if (indexInfo->ii_NumIndexAttrs < 1)
+        return false;
+
+    keycol = indexInfo->ii_IndexAttrNumbers[0];
+    if (!AttributeNumberIsValid(keycol))
+        return false;
+
+    return bms_is_member(keycol - 1, remoterel->attkeys);
+}

In this function, the local column number (keycol) is used to match the
remote column numbers (attkeys). I think this will cause a problem if the
column order between pub/sub doesn't match. For example:

---
-- pub
CREATE TABLE test_replica_id_full (x int, y int);
ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL;
CREATE PUBLICATION tap_pub_rep_full FOR TABLE test_replica_id_full;
-- sub
CREATE TABLE test_replica_id_full (z int, y int, x int);
CREATE unique INDEX idx ON test_replica_id_full(z);
CREATE SUBSCRIPTION tap_sub_rep_full_0 CONNECTION 'dbname=postgres port=5432'
PUBLICATION tap_pub_rep_full;
---

I think we need to use the attrmap->attnums to convert the column number
before comparing.  Just for reference, attached is a diff (0001) that I
noted down while trying to fix the problem.
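
To spell out the mapping direction, the conversion I have in mind is
roughly the following (a sketch only, assuming attrmap is the AttrMap
from the relation map entry, whose attnums[] maps local column offsets
to remote column numbers, with -1 meaning "no matching remote column";
names may differ from the actual 0001):

    AttrNumber  keycol = indexInfo->ii_IndexAttrNumbers[0];
    int         remote_attnum;

    if (AttrNumberGetAttrOffset(keycol) >= attrmap->maplen)
        return false;

    /* Convert the local column number to the remote one. */
    remote_attnum = attrmap->attnums[AttrNumberGetAttrOffset(keycol)];
    if (remote_attnum < 0)
        return false;

    return bms_is_member(remote_attnum, remoterel->attkeys);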

Besides, I also looked at the "WIP: Optimize for non-pkey / non-RI unique
indexes" patch, and I think it has a similar problem with the column
matching. Another thing I think we can improve in this WIP patch is that
we can cache the result of IsIdxSafeToSkipDuplicates() instead of computing
it for each UPDATE, because the cost of this function becomes bigger after
applying this patch. For reference, I tried to improve the WIP along those
lines; here is a slightly modified version of the WIP (0002). Feel free to
modify or merge it if needed.
Thanks to Shi-san for helping to finish these fixes.


Best Regards,
Hou zj
From 3e7f94ff0492a98d2d7970146d3a9a1a43cecd92 Mon Sep 17 00:00:00 2001
From: Hou Zhijie 
Date: Mon, 13 Mar 2023 17:36:55 +0800
Subject: [PATCH 2/2] WIP: Optimize for non-pkey / non-RI unique indexes

---
 src/backend/executor/execReplication.c |   7 +-
 src/backend/replication/logical/relation.c | 150 +
 src/backend/replication/logical/worker.c   |  40 --
 src/include/executor/executor.h|   2 +
 src/include/replication/logicalrelation.h  |   4 +-
 5 files changed, 158 insertions(+), 45 deletions(-)

diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index cd17be0681..ba19ef2bf8 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -135,6 +135,7 @@ build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
  */
 bool
 RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
+                             bool isIdxSafeToSkipDuplicates,
                              LockTupleMode lockmode,
                              TupleTableSlot *searchslot,
                              TupleTableSlot *outslot)
@@ -147,13 +148,10 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
     Relation    idxrel;
     bool        found;
     TypeCacheEntry **eq = NULL;
-    bool        isIdxSafeToSkipDuplicates;
 
     /* Open the index. */
     idxrel = index_open(idxoid, RowExclusiveLock);
 
-    isIdxSafeToSkipDuplicates = IsIdxSafeToSkipDuplicates(rel, idxoid);
-
     InitDirtySnapshot(snap);
 
     /* Build scan key. */
@@ -171,8 +169,7 @@ retry:
     while (index_getnext_slot(scan, ForwardScanDirection, outslot))
     {
         /*
-         * Avoid expensive equality check if the index is primary key or
-         * replica identity index.
+         * Avoid expensive equality check if the index is unique.
          */
         if (!isIdxSafeToSkipDuplicates)

Re: Assert failure of the cross-check for nullingrels

2023-03-13 Thread Richard Guo
On Mon, Mar 13, 2023 at 5:03 PM Richard Guo  wrote:

> Back to the original issue, if a join has more than one quals, actually
> we treat them as a whole when we check if identity 3 applies as well as
> when we adjust them to be suitable for commutation according to identity
> 3.  So when we check if a qual is computable at a given level, I think
> we should also consider the join's quals as a whole.  I'm thinking that
> we use a 'group' notion for RestrictInfos and then use the clause_relids
> of the 'group' in clause_is_computable_at.  Does this make sense?
>

I'm imagining something like attached (no comments and test cases yet).

Thanks
Richard


v1-0001-Draft-group-RestrictInfos.patch
Description: Binary data


Re: [BUG] pg_stat_statements and extended query protocol

2023-03-13 Thread Drouvot, Bertrand

Hi,

On 3/2/23 8:27 AM, Michael Paquier wrote:

On Wed, Jan 25, 2023 at 11:22:04PM +, Imseih (AWS), Sami wrote:

Doing some work with extended query protocol, I encountered the same
issue that was discussed in [1]. It appears when a client is using
extended query protocol and sends an Execute message to a portal with
max_rows, and a portal is executed multiple times,
pg_stat_statements does not correctly track rows and calls.


Well, it is one of these areas where it seems to me we have never been
able to put a definition on what should be the correct behavior when
it comes to pg_stat_statements.  Could it be possible to add some
regression tests using the recently-added \bind command and see how
this affects things?  I would suggest splitting these into their own
SQL file, following an effort I have been doing recently for the
regression tests of pg_stat_statements.  It would be good to know the
effects of this change for pg_stat_statements.track = (top|all), as
well.

@@ -657,7 +657,9 @@ typedef struct EState
  
 List   *es_tupleTable;  /* List of TupleTableSlots */
  
-   uint64  es_processed;   /* # of tuples processed */
+   uint64  es_processed;   /* # of tuples processed at the top level only */
+   uint64  es_calls;   /* # of calls */
+   uint64  es_total_processed; /* total # of tuples processed */

So the root of the logic is here.  Anything that makes the executor
structures larger freaks me out, FWIW, and that's quite an addition.
--
Michael


I wonder if we can't "just" make use of the "count" parameter passed to the
ExecutorRun_hook.

Something like:

- Increment an "es_total_processed" counter in pgss based on the count
received in pgss_ExecutorRun().
- In pgss_ExecutorEnd(): subtract the last count we received in
pgss_ExecutorRun() and add queryDesc->estate->es_processed?  (We'd need
to be able to distinguish when we should apply this rule or not.)
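
A minimal sketch of that idea (hypothetical file-scope counters standing
in for whatever per-query state pg_stat_statements would actually keep):

    static uint64 total_processed = 0;
    static uint64 last_run_count = 0;

    static void
    pgss_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction,
                     uint64 count, bool execute_once)
    {
        /* Count what this Execute message asked for... */
        total_processed += count;
        last_run_count = count;

        if (prev_ExecutorRun)
            prev_ExecutorRun(queryDesc, direction, count, execute_once);
        else
            standard_ExecutorRun(queryDesc, direction, count, execute_once);
    }

    static void
    pgss_ExecutorEnd(QueryDesc *queryDesc)
    {
        /*
         * ...then correct it with what the executor actually processed.
         * (When exactly this correction should apply is the open
         * question above.)
         */
        total_processed -= last_run_count;
        total_processed += queryDesc->estate->es_processed;

        if (prev_ExecutorEnd)
            prev_ExecutorEnd(queryDesc);
        else
            standard_ExecutorEnd(queryDesc);
    }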

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Allow tests to pass in OpenSSL FIPS mode

2023-03-13 Thread Peter Eisentraut

On 06.03.23 17:06, Daniel Gustafsson wrote:

On 6 Mar 2023, at 15:55, Tom Lane  wrote:
Daniel Gustafsson  writes:



For readers without all context, wouldn't it be better to encode in the
function name why we're not just calling a hash like md5?  Something like
fips_allowed_hash() or similar?


I'd prefer shorter than that --- all these queries are laid out on the
expectation of a very short function name.  Maybe "fipshash()"?

We could make the comment introducing the function declarations more
elaborate, too.


fipshash() with an explanatory comments sounds like a good idea.


committed like that

(I'm going to close the CF item and revisit the other test suites for 
the next release.)





Re: Allow tests to pass in OpenSSL FIPS mode

2023-03-13 Thread Daniel Gustafsson
> On 13 Mar 2023, at 11:06, Peter Eisentraut 
>  wrote:
> On 06.03.23 17:06, Daniel Gustafsson wrote:

>> fipshash() with an explanatory comments sounds like a good idea.
> 
> committed like that

+1. Looks like there is just a slight diff in the compression.sql test suite.

--
Daniel Gustafsson





Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-13 Thread Bharath Rupireddy
On Mon, Mar 13, 2023 at 12:26 PM Michael Paquier  wrote:
>
> On Fri, Mar 10, 2023 at 04:45:06PM +0530, Bharath Rupireddy wrote:
>
> After that comes the rest of the patch, and I have found a couple of
> mistakes.
>
> -  pg_get_wal_records_info(start_lsn pg_lsn, end_lsn pg_lsn)
> +  pg_get_wal_records_info(start_lsn pg_lsn, end_lsn pg_lsn DEFAULT NULL)
>returns setof record
> [...]
> -  pg_get_wal_stats(start_lsn pg_lsn, end_lsn pg_lsn, per_record boolean 
> DEFAULT false)
> +  pg_get_wal_stats(start_lsn pg_lsn, end_lsn pg_lsn DEFAULT NULL, 
> per_record boolean DEFAULT false)
>
> This part of the documentation is now incorrect.

Oh, yeah. Thanks for fixing it.

> +-- Make sure checkpoints don't interfere with the test.
> +SELECT 'init' FROM 
> pg_create_physical_replication_slot('regress_pg_walinspect_slot', true, 
> false);
>
> Adding a physical slot is better for stability of course, but the test
> also has to drop it or installcheck would cause an existing cluster to
> have that still around.  The third argument could be true, as well, so
> as we'd use a temporary slot.

# Disabled because these tests require "wal_level=replica", which
# some installcheck users do not have (e.g. buildfarm clients).
NO_INSTALLCHECK = 1

pg_walinspect can't be run under installcheck, so I don't think dropping
the slot at the end is needed. I saw oldextversions.sql using the same
replication slot name; no problem there, but I changed it to a unique
name anyway.

> Hmm.  I would simplify that, and just mention that an error is raised
> when the start LSN is available, without caring about the rest (valid
> end LSN being past the current insert LSN, and error if start > end,
> the second being obvious).

Okay.

> + 
> +  
> +   Note that pg_get_wal_records_info_till_end_of_wal and
> +   pg_get_wal_stats_till_end_of_wal functions have been
> +   removed in the pg_walinspect version
> +   1.1. The same functionality can be achieved with
> +   pg_get_wal_records_info and
> +   pg_get_wal_stats functions by specifying a future
> +   end_lsn. However, 
> till_end_of_wal
> +   functions will still work if the extension is installed explicitly with
> +   version 1.0.
> +  
> + 
>
> Not convinced that this is necessary.

As hackers, we know that these functions have been removed and how to
achieve till_end_of_wal with the other functions. I noticed that
you've removed my changes (see below) from the docs that described
how to get info/stats till the end of WAL. That leaves end users
confused as to how they can achieve the till_end_of_wal functionality.
Not all users can dig through the commit history/messages, but they can
easily read the docs. I prefer to have the following (did so in the
attached v7) and to get rid of the above note, if you don't feel
strongly about it.

+ If a future end_lsn
+  (i.e., an LSN the server doesn't know about) is specified, it returns
+  information till the end of WAL.

> +   GetInputLSNs(fcinfo, &start_lsn, &end_lsn, till_end_of_wal);
> +
> +   stats_per_record = PG_GETARG_BOOL(2);
>
> This code in GetWalStats() is incorrect.
> pg_get_wal_stats_till_end_of_wal() has a stats_per_record, but as
> *second* argument, so this would be broken.

Oh, yeah. Thanks for fixing it.

> +   GetInputLSNs(fcinfo, &start_lsn, &end_lsn, till_end_of_wal);
>
> Coming from the last point, I think that this interface is confusing,
> and actually incorrect.  From what I can see, we should be doing what
> ~15 has by grepping the argument values within the main function
> calls, and just pass them down to the internal routines GetWalStats()
> and GetWALRecordsInfo().

Hm, what you have in v6 works for me.

> At the end, I am finishing with the attached.  ValidateInputLSNs()
> ought to be called, IMO, when the caller of the SQL functions can
> directly specify an end_lsn.  This means that there is no point to do
> this check in the two till_end_* functions.  This has as cost two
> extra checks to make sure that the start_lsn is not higher than the
> current LSN, but I am fine to live with that.  It seemed rather
> natural to me to let ValidateInputLSNs() do a refresh of the end_lsn
> if it sees that it is higher than the current LSN.  And if you look
> closely, you will see that we only call *once* GetCurrentLSN() for
> each function call, so the maths are more precise.
>
> I have cleaned up the comments of the modules, while on it, as there
> was not much value in copy-pasting how a function fails while there is
> a centralized validation code path.  The tests for the till_end()
> functions have been moved to the test path where we install 1.0.

I have some comments and fixed them in the attached v7 patch:

1.
+ * pg_get_wal_records_info
  *
+ * pg_get_wal_stats
  *
I think you wanted to make the function comments consistent by putting
the function name atop each one, but missed doing so for all functions.
Actually, I don't have a strong opinion on these changes as they
unnecessarily bloat the diff, so I removed them.

2.

Re: Allow logical replication to copy tables in binary format

2023-03-13 Thread Melih Mutlu
Hi,

Attached v12 with a unified option.

Setting binary = true now allows the initial sync to happen in binary
format.

Thanks,
-- 
Melih Mutlu
Microsoft


v12-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-13 Thread Önder Kalacı
Hi Hou zj,  Shi-san, all


> In this function, it used the local column number(keycol) to match the
> remote
> column number(attkeys), I think it will cause problem if the column order
> between pub/sub doesn't match. Like:
>
> ---
> - pub
> CREATE TABLE test_replica_id_full (x int, y int);
> ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL;
> CREATE PUBLICATION tap_pub_rep_full FOR TABLE test_replica_id_full;
> - sub
> CREATE TABLE test_replica_id_full (z int, y int, x int);
> CREATE unique INDEX idx ON test_replica_id_full(z);
> CREATE SUBSCRIPTION tap_sub_rep_full_0 CONNECTION 'dbname=postgres
> port=5432' PUBLICATION tap_pub_rep_full;
> ---
>
> I think we need to use the attrmap->attnums to convert the column number
> before
> comparing. Just for reference, attach a diff(0001) which I noted down when
> trying to
> fix the problem.
>

I'm always afraid of these types of last-minute additions to the patch,
and here we have this issue in one of the latest additions :(

Thanks for reporting the problem and also providing guidance on the fix.
After reading the attrMap code and debugging this case further, I think
your suggestion makes sense.

I only made some small changes, and included them in the patch.


> Besides, I also look at the "WIP: Optimize for non-pkey / non-RI unique
> indexes" patch, I think it also had a similar problem about the column
> matching


Right, I'll incorporate this fix to that one as well.


> . And another thing I think we can improved in this WIP patch is that
> we can cache the result of IsIdxSafeToSkipDuplicates() instead of doing it
> for
> each UPDATE, because the cost of this function becomes bigger after
> applying
> this patch


Yes, it makes sense.


>
> Thanks to Shi-san for helping to finish these fixes.
>

Thank you both!


Onder Kalaci
From 115999a134635ff6bef4caab434e6ece5d77c1b8 Mon Sep 17 00:00:00 2001
From: Onder Kalaci 
Date: Wed, 22 Feb 2023 14:12:56 +0300
Subject: [PATCH v45] Allow the use of indexes other than PK and REPLICA
 IDENTITY on the subscriber.

Using REPLICA IDENTITY FULL on the publisher can lead to a full table
scan per tuple change on the subscription when REPLICA IDENTITY or PK
index is not available. This makes REPLICA IDENTITY FULL impractical
to use apart from some small number of use cases.

This patch allows using indexes other than PRIMARY KEY or REPLICA
IDENTITY on the subscriber during apply of update/delete. The index
that can be used must be a btree index, not a partial index, and it
must have at least one column reference (i.e. cannot consist of only
expressions). We can uplift these restrictions in the future. There is
no smart mechanism to pick the index. If there is more than one index
that satisfies these requirements, we just pick the first one. We discussed
using some of the optimizer's low-level APIs for this but ruled it out
as that can be a maintenance burden in the long run.

This patch improves the performance in the vast majority of cases and
the improvement is proportional to the amount of data in the table.
However, there could be some regression in a small number of cases
where the indexes have a lot of duplicate and dead rows. It was
discussed that those are mostly impractical cases but we can provide a
table or subscription level option to disable this feature if
required.

Author: Onder Kalaci
Reviewed-by: Peter Smith, Shi yu, Hou Zhijie, Vignesh C, Kuroda
Hayato, Amit Kapila
Discussion: https://postgr.es/m/CACawEhVLqmAAyPXdHEPv1ssU2c=dqOniiGz7G73HfyS7+nGV4w@mail.gmail.com
---
 doc/src/sgml/logical-replication.sgml |   9 +-
 src/backend/executor/execReplication.c| 111 ++--
 src/backend/replication/logical/relation.c| 227 +++-
 src/backend/replication/logical/worker.c  |  56 +-
 src/include/replication/logicalrelation.h |   5 +
 src/test/subscription/meson.build |   1 +
 .../subscription/t/032_subscribe_use_index.pl | 514 ++
 7 files changed, 854 insertions(+), 69 deletions(-)
 create mode 100644 src/test/subscription/t/032_subscribe_use_index.pl

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 1bd5660c87..6b0e300adc 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -132,7 +132,14 @@
certain additional requirements) can also be set to be the replica
identity.  If the table does not have any suitable key, then it can be set
to replica identity full, which means the entire row becomes
-   the key.  This, however, is very inefficient and should only be used as a
+   the key.  When replica identity full is specified,
+   indexes can be used on the subscriber side for searching the rows.  Candidate
+   indexes must be btree, non-partial, and have at least one column reference
+   (i.e. cannot consist of only expressions).  These restrictions
+   on the non-unique index properties adhere to some of the restrictions that
+   are enforced for p

RE: Dropped and generated columns might cause wrong data on subs when REPLICA IDENTITY FULL

2023-03-13 Thread shiy.f...@fujitsu.com
On Sun, Mar 12, 2023 4:00 AM Önder Kalacı  wrote:
> 
> Attaching a patch that could possibly solve the problem. 
> 

Thanks for your patch. I tried it and it worked well.
Here are some minor comments.

1.
@@ -243,6 +243,17 @@ tuples_equal(TupleTableSlot *slot1, TupleTableSlot *slot2,
Form_pg_attribute att;
TypeCacheEntry *typentry;
 
+
+   Form_pg_attribute attr = TupleDescAttr(slot1->tts_tupleDescriptor, attrnum);
+

I think we can use "att" instead of a new variable. They have the same value.

2. 
+# The bug was that when when the REPLICA IDENTITY FULL is used with dropped

There is an extra "when".

Regards,
Shi Yu


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-13 Thread Amit Kapila
On Mon, Mar 13, 2023 at 2:44 AM Peter Smith  wrote:
>
> On Sat, Mar 11, 2023 at 6:05 PM Amit Kapila  wrote:
> >
> > On Fri, Mar 10, 2023 at 7:58 PM Önder Kalacı  wrote:
> > >
> > >
> > > I think one option could be to drop some cases altogether, but not sure 
> > > we'd want that.
> > >
> > > As a semi-related question: Are you aware of any setting that'd make 
> > > pg_stat_all_indexes
> > > reflect the changes sooner? It is hard to debug what is the bottleneck in 
> > > the tests, but
> > > I have a suspicion that there might be several poll_query_until() calls on
> > > pg_stat_all_indexes, which might be the reason?
> > >
> >
> > Yeah, I also think poll_query_until() calls on pg_stat_all_indexes is
> > the main reason for these tests taking more time. When I commented
> > those polls, it drastically reduces the test time. On looking at
> > pgstat_report_stat(), it seems we don't report stats sooner than 1s
> > and as most of this patch's test relies on stats, it leads to taking
> > more time. I don't have a better idea to verify this patch without
> > checking whether the index scan is really used by referring to
> > pg_stat_all_indexes. I think trying to reduce the poll calls may help
> > in reducing the test timings further. Some ideas on those lines are as
> > follows:
>
> If the reason for the stats polling was only to know if some index is
> chosen or not, I was wondering if you can just convey the same
> information to the TAP test via some conveniently placed (DEBUG?)
> logging.
>

I had thought about it but didn't convince myself that it would be a
better approach because it would LOG a lot of messages for bulk
updates/deletes. Note that for each row update on the publisher, a new
index/sequence scan will happen on the subscriber. So, instead, I tried
to further change the test cases to remove unnecessary parts. I have
changed the below tests:

1.
+# subscriber gets the missing table information
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION tap_sub_rep_full REFRESH PUBLICATION");

This and the follow-on test was not required after we have removed
Dropped columns test.

2. Reduce the number of updates/deletes in the first test to two rows.

3. Removed the cases for dropping the index. This ensures that after
dropping the index on the table we switch to either an index scan (if
a new index is created) or to a sequence scan. It doesn't seem like a
very interesting case to me.

Apart from the above, I have removed the explicit setting of
'wal_retrieve_retry_interval = 1ms' as the same is not done for any
other subscription tests. I know setting wal_retrieve_retry_interval
avoids the launcher sometimes taking more time to launch apply worker
but it is better to be consistent. See the changes in
changes_amit_1.patch, if you agree with the same then please include
them in the next version.

After doing the above, the test time on my machine is closer to what
other tests take which is ~5s.


--
With Regards,
Amit Kapila.


v45_0001_use_index_on_subs_when_pub_rep_ident_full.patch
Description: Binary data


changes_amit_1.patch
Description: Binary data


Re: Ability to reference other extensions by schema in extension scripts

2023-03-13 Thread 'Sandro Santilli'
On Sat, Mar 11, 2023 at 03:18:18AM -0500, Regina Obe wrote:
> Attached is a revised patch with these changes in place.

I've given a try to this patch. It builds and regresses fine.

My own tests also worked fine. As long as ext1 was found
in ext2's no_relocate list it could not be relocated, and
a proper error message is given to the user trying it.

Nitpicking, there are a few things that are weird to me:

1) I don't get any error/warning if I put an arbitrary
string into no_relocate (there's no check to verify that
no_relocate is a subset of requires).

2) An extension can still reference extensions it depends on
without putting them in no_relocate. This may be intentional,
as some substitutions may not require blocking relocation, but
it felt inconsistent with the normal @extschema@, which is never
replaced unless an extension is marked as non-relocatable.

--strk;

  Libre GIS consultant/developer
  https://strk.kbt.io/services.html




Re: [PATCH] Add pretty-printed XML output option

2023-03-13 Thread Jim Jones

On 10.03.23 15:32, Tom Lane wrote:

Jim Jones  writes:

On 09.03.23 21:21, Tom Lane wrote:

I've looked through this now, and have some minor complaints and a major
one.  The major one is that it doesn't work for XML that doesn't satisfy
IS DOCUMENT.  For example,

How do you suggest the output should look like?

I'd say a series of node trees, each starting on a separate line.


v22 attached enables the usage of INDENT with non singly-rooted xml.

postgres=# SELECT xmlserialize (CONTENT 'x="y">4273' AS text INDENT);

 xmlserialize
---
 +
   42+
    +
 73
(1 row)

I tried several libxml2 dump functions and none of them could cope very
well with an xml string without a root node. So I added the nodes under a
temporary root node, so that I could iterate over its children and add
them one by one (formatted) to the output buffer.


I slightly modified the existing xml_parse() function to return the list 
of nodes parsed by xmlParseBalancedChunkMemory:


xml_parse(text *data, XmlOptionType xmloption_arg, bool preserve_whitespace,
          int encoding, Node *escontext, xmlNodePtr *parsed_nodes)

res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0,
                                       utf8string + count, parsed_nodes);
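
For reference, libxml2 hands the parsed sibling list back through the
last argument of xmlParseBalancedChunkMemory(), so iterating over the
fragments can look roughly like this (a sketch against the public
libxml2 API, not the patch itself):

    #include <libxml/parser.h>

    /* Parse a rootless XML chunk and visit each top-level fragment. */
    static void
    walk_fragments(const char *chunk)
    {
        xmlNodePtr  list = NULL;

        if (xmlParseBalancedChunkMemory(NULL, NULL, NULL, 0,
                                        (const xmlChar *) chunk, &list) == 0)
        {
            for (xmlNodePtr node = list; node != NULL; node = node->next)
            {
                /* each 'node' is one top-level fragment; format it here */
            }
            xmlFreeNodeList(list);
        }
    }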


I was mistakenly calling xml_parse with GetDatabaseEncoding(). It now
uses the encoding of the given doc and UTF8 if not provided.

Mmmm  doing this differently from what we do elsewhere does not
sound like the right path forward.  The input *is* (or had better be)
in the database encoding.

I changed that behavior. It now uses GetDatabaseEncoding();

Thanks!

Best, Jim
From 85873e505aa04dea4ed92267dd07160d39460a59 Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Fri, 10 Mar 2023 13:47:16 +0100
Subject: [PATCH v22] Add pretty-printed XML output option

This patch implements the XML/SQL:2011 feature 'X069, XMLSERIALIZE: INDENT.'
It adds the options INDENT and NO INDENT (default) to the existing
xmlserialize function. It uses the indentation feature of xmlSaveToBuffer
from libxml2 to indent XML strings - see option XML_SAVE_FORMAT.

Although the INDENT feature is designed to work with xml strings of type
DOCUMENT, this implementation also allows the usage of CONTENT type strings
as long as it contains a well balanced xml.

This patch also includes documentation, regression tests and their three
possible output files xml.out, xml_1.out and xml_2.out.
---
 doc/src/sgml/datatype.sgml|   8 +-
 src/backend/catalog/sql_features.txt  |   2 +-
 src/backend/executor/execExprInterp.c |   9 +-
 src/backend/parser/gram.y |  14 ++-
 src/backend/parser/parse_expr.c   |   1 +
 src/backend/utils/adt/xml.c   | 137 +--
 src/include/nodes/parsenodes.h|   1 +
 src/include/nodes/primnodes.h |   4 +-
 src/include/parser/kwlist.h   |   1 +
 src/include/utils/xml.h   |   1 +
 src/test/regress/expected/xml.out | 153 ++
 src/test/regress/expected/xml_1.out   |  84 ++
 src/test/regress/expected/xml_2.out   | 153 ++
 src/test/regress/sql/xml.sql  |  32 ++
 14 files changed, 582 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 467b49b199..53d59662b9 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4460,14 +4460,18 @@ xml 'bar'
 xml, uses the function
 xmlserialize:xmlserialize
 
-XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type )
+XMLSERIALIZE ( { DOCUMENT | CONTENT } value AS type [ [NO] INDENT ] )
 
 type can be
 character, character varying, or
 text (or an alias for one of those).  Again, according
 to the SQL standard, this is the only way to convert between type
 xml and character types, but PostgreSQL also allows
-you to simply cast the value.
+you to simply cast the value. The option INDENT allows
+indenting the serialized XML output; the default is NO INDENT.
+It is designed to indent XML strings of type DOCUMENT, but it can also
+be used with CONTENT as long as value
+contains well-formed XML.

 

diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt
index 0fb9ab7533..bb4c135a7f 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -621,7 +621,7 @@ X061	XMLParse: character string input and DOCUMENT option			YES
 X065	XMLParse: binary string input and CONTENT option			NO	
 X066	XMLParse: binary string input and DOCUMENT option			NO	
 X068	XMLSerialize: BOM			NO	
-X069	XMLSerialize: INDENT			NO	
+X069	XMLSerialize: INDENT			YES	
 X070	XMLSerialize: character string serialization and CONTENT option			YES	
 X071	XMLSerialize: character string serialization and DOCUMENT option			YES	
 X072	XMLSerialize: character string serialization			YES	
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/ex

Re: Lock mode in ExecMergeMatched()

2023-03-13 Thread Dean Rasheed
> On Fri, 10 Mar 2023 at 21:42, Alexander Korotkov  wrote:
> >
> > I wonder why does ExecMergeMatched() determine the lock mode using
> > ExecUpdateLockMode().  Why don't we use lock mode set by
> > table_tuple_update() like ExecUpdate() does?  I skim through the
> > MERGE-related threads, but didn't find an answer.
> >
> > I also noticed that we use ExecUpdateLockMode() even for CMD_DELETE.
> > That ends up by usage of LockTupleNoKeyExclusive for CMD_DELETE, which
> > seems plain wrong for me.
>

I pushed the patch for bug #17809, which in the end didn't directly
touch this code. I considered including the change of lockmode in that
patch, but in the end decided against it, since it wasn't directly
related to the issues being fixed there, and I wanted more time to
think about what changing the lockmode here really means.

I'm wondering now if it really matters what lock mode we use here. If
the point of calling table_tuple_lock() after a concurrent update is
detected is to prevent more concurrent updates, so that the retry is
guaranteed to succeed, then wouldn't even LockTupleNoKeyExclusive be
sufficient in all cases? After all, that does block concurrent updates
and deletes.

Perhaps there is an issue with using LockTupleNoKeyExclusive, and then
having to upgrade it to LockTupleExclusive later? But I wonder if that
can already happen -- consider a regular UPDATE (not via MERGE) of
non-key columns on a partitioned table, that initially does a simple
update, but upon retrying needs to do a cross-partition update (DELETE
+ INSERT).

But perhaps I'm thinking about this in the wrong way. Do you have an
example test case where this makes a difference?

Regards,
Dean




Re: [PATCH] Support % wildcard in extension upgrade filenames

2023-03-13 Thread Sandro Santilli
On Wed, Mar 08, 2023 at 08:27:29AM -0500, Robert Haas wrote:

> I wonder if a solution to this problem might be to provide some kind
> of a version map file. Let's suppose that the map file is a bunch of
> lines of the form X Y Z, where X, Y, and Z are version numbers. The
> semantics could be: we (the extension authors) promise that if you
> want to upgrade from X to Z, it suffices to run the script that knows
> how to upgrade from Y to Z.
> This would address Tom's concern, because
> if you write a master upgrade script, you have to explicitly declare
> the versions to which it applies.

This would just move the problem from having 1968 files to having
to write 1968 lines in control files, and it does not solve the problem
that people running a hot-patched old version cannot upgrade to a newer
(bug-free) version released before the hot-patch.

We maintain multiple stable branches (5, at the moment: 2.5, 3.0, 3.1,
3.2, 3.3) and would like to allow anyone running an old patched
version to still upgrade to a newer version released earlier.

--strk;

  Libre GIS consultant/developer
  https://strk.kbt.io/services.html




Re: [PATCH] Support % wildcard in extension upgrade filenames

2023-03-13 Thread 'Sandro Santilli'
On Wed, Mar 08, 2023 at 03:18:06PM -0500, Regina Obe wrote:
> 
> Then the question arises of what happens if you have such a map file and
> you have files as well (the old way).

One idea I had in the past about the .control file was to
advertise an executable that would take the current version and next
version and return a list of paths to SQL files to use for the upgrade.

Then our script could decide what to do, w/out Tom looking :P

--strk;




Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-13 Thread Ants Aasma
On Sat, 11 Mar 2023 at 16:55, Melanie Plageman wrote:
>
> > On Tue, Feb 28, 2023 at 3:16 AM Bharath Rupireddy
> >  wrote:
> >
> > > On Thu, Jan 12, 2023 at 6:06 AM Andres Freund  wrote:
> > > >
> > > > On 2023-01-11 17:26:19 -0700, David G. Johnston wrote:
> > > > > Should we just add "ring_buffers" to the existing "shared_buffers" and
> > > > > "temp_buffers" settings?
> > > >
> > > > The different types of ring buffers have different sizes, for good 
> > > > reasons. So
> > > > I don't see that working well. I also think it'd be more often useful to
> > > > control this on a statement basis - if you have a parallel import tool 
> > > > that
> > > > starts NCPU COPYs you'd want a smaller buffer than a single threaded 
> > > > COPY. Of
> > > > course each session can change the ring buffer settings, but still.
> > >
> > > How about having GUCs for each ring buffer (bulk_read_ring_buffers,
> > > bulk_write_ring_buffers, vacuum_ring_buffers - ah, 3 more new GUCs)?
> > > These options can help especially when statement level controls aren't
> > > easy to add (COPY, CREATE TABLE AS/CTAS, REFRESH MAT VIEW/RMV)? If
> > > needed users can also set them at the system level. For instance, one
> > > can set bulk_write_ring_buffers to other than 16MB or -1 to disable
> > > the ring buffer to use shared_buffers and run a bunch of bulk write
> > > queries.
>
In the attached v3, I've changed the name of the GUC from buffer_usage_limit
to vacuum_buffer_usage_limit, since it is only used for vacuum and
autovacuum.

Sorry for arriving late to this thread, but what about sizing the ring
dynamically? From what I gather, the primary motivation for a larger ring
size is avoiding WAL flushes due to dirty buffer writes. We already
catch that event with StrategyRejectBuffer(). So maybe a dynamic
sizing algorithm could be applied to the ring buffer. Make the buffers
array in the strategy capable of holding up to the limit of buffers, but
set the ring size conservatively. If we have to flush WAL, double the ring
size (up to the limit). If we loop around the ring without flushing,
decrease the ring size by a small amount to let the clock sweep reclaim
buffers for use by other backends.
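
A rough sketch of that feedback loop (all names are hypothetical, just
to illustrate the shape of the algorithm):

    /* Returns the new ring size after one buffer-allocation event. */
    static int
    adapt_ring_size(int ring_size, int max_ring_size, bool flushed_wal)
    {
        if (flushed_wal)
        {
            /* A dirty buffer forced a WAL flush: double the ring, capped. */
            ring_size *= 2;
            if (ring_size > max_ring_size)
                ring_size = max_ring_size;
        }
        else if (ring_size > 1)
        {
            /*
             * Completed a lap without flushing: shrink a little so the
             * clock sweep can reclaim buffers for other backends.
             */
            ring_size--;
        }
        return ring_size;
    }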

-- 
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com




Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-13 Thread Önder Kalacı
Hi Amit, Peter, all


> > If the reason for the stats polling was only to know if some index is
> > chosen or not, I was wondering if you can just convey the same
> > information to the TAP test via some conveniently placed (DEBUG?)
> > logging.
> >
>
> I had thought about it but didn't convince myself that it would be a
> better approach because it would LOG a lot of messages for bulk
> updates/deletes.


I'm also hesitant to add any log messages for testing purposes, especially
something like this one, where a single UPDATE on the source table
leads to an unbounded number of log messages.


>
> 1.
> +# subscriber gets the missing table information
> +$node_subscriber->safe_psql('postgres',
> + "ALTER SUBSCRIPTION tap_sub_rep_full REFRESH PUBLICATION");
>
> This and the follow-on test was not required after we have removed
> Dropped columns test.
>
>
Right, I kept it with the idea that we might get the dropped column changes
in earlier, so that I could rebase and add the dropped-column tests.

But, sure, we can add that later to other tests.


> 2. Reduce the number of updates/deletes in the first test to two rows.
>

We don't have any particular reason to have more tuples. Given the
time constraints, I don't have any objections to changing this.


>
> 3. Removed the cases for dropping the index. This ensures that after
> dropping the index on the table we switch to either an index scan (if
> a new index is created) or to a sequence scan. It doesn't seem like a
> very interesting case to me.
>

For that test, my goal was to ensure/show that the invalidation callback
is triggered after `DROP / CREATE INDEX` commands.

Can we always assume that this would never change? Because if this
behavior ever changes, users would be stuck with the wrong/old
index until VACUUM happens.


>
> Apart from the above, I have removed the explicit setting of
> 'wal_retrieve_retry_interval = 1ms' as the same is not done for any
> other subscription tests. I know setting wal_retrieve_retry_interval
> avoids the launcher sometimes taking more time to launch apply worker
> but it is better to be consistent


Hmm, I cannot remember why I added that. It was probably to make
poll_query_until/wait_for_catchup happen faster.

But, running the test with/without this setting, I cannot observe any
noticeable difference. So, it's probably fine to remove.


> . See the changes in
> changes_amit_1.patch, if you agree with the same then please include
> them in the next version.
>

I included all of them, but I'm not very sure about removing (3). If you
think we have coverage for that in other cases, I'm fine with that.


>
> After doing the above, the test time on my machine is closer to what
> other tests take, which is ~5s.
>

Yes, same for me.

Thanks, attaching v46
From ad08f37ac79d5b68f0335cb785820db50c55b85a Mon Sep 17 00:00:00 2001
From: Onder Kalaci 
Date: Wed, 22 Feb 2023 14:12:56 +0300
Subject: [PATCH v46] Allow the use of indexes other than PK and REPLICA
 IDENTITY on the subscriber.

Using REPLICA IDENTITY FULL on the publisher can lead to a full table
scan per tuple change on the subscription when REPLICA IDENTITY or PK
index is not available. This makes REPLICA IDENTITY FULL impractical
to use apart from some small number of use cases.

This patch allows using indexes other than PRIMARY KEY or REPLICA
IDENTITY on the subscriber during apply of update/delete. The index
that can be used must be a btree index, not a partial index, and it
must have at least one column reference (i.e. cannot consist of only
expressions). We can uplift these restrictions in the future. There is
no smart mechanism to pick the index. If there is more than one index
that satisfies these requirements, we just pick the first one. We discussed
using some of the optimizer's low-level APIs for this but ruled it out
as that can be a maintenance burden in the long run.

This patch improves the performance in the vast majority of cases and
the improvement is proportional to the amount of data in the table.
However, there could be some regression in a small number of cases
where the indexes have a lot of duplicate and dead rows. It was
discussed that those are mostly impractical cases but we can provide a
table or subscription level option to disable this feature if
required.

Author: Onder Kalaci
Reviewed-by: Peter Smith, Shi yu, Hou Zhijie, Vignesh C, Kuroda
Hayato, Amit Kapila
Discussion: https://postgr.es/m/CACawEhVLqmAAyPXdHEPv1ssU2c=dqOniiGz7G73HfyS7+nGV4w@mail.gmail.com
---
 doc/src/sgml/logical-replication.sgml |   9 +-
 src/backend/executor/execReplication.c| 111 +++--
 src/backend/replication/logical/relation.c| 227 -
 src/backend/replication/logical/worker.c  |  56 +--
 src/include/replication/logicalrelation.h |   5 +
 src/test/subscription/meson.build |   1 +
 .../subscription/t/032_subscribe_use_index.pl | 435 ++
 7 files changed, 775 insertions(+), 69 deletions(-)
 create mode 100644 src/test/subscription/t/032_subscribe_us

Re: Dropped and generated columns might cause wrong data on subs when REPLICA IDENTITY FULL

2023-03-13 Thread Önder Kalacı
Hi Shi Yu,


> 1.
> @@ -243,6 +243,17 @@ tuples_equal(TupleTableSlot *slot1, TupleTableSlot
> *slot2,
> Form_pg_attribute att;
> TypeCacheEntry *typentry;
>
> +
> +   Form_pg_attribute attr =
> TupleDescAttr(slot1->tts_tupleDescriptor, attrnum);
> +
>
> I think we can use "att" instead of a new variable. They have the same
> value.
>

ah, of course :)


>
> 2.
> +# The bug was that when when the REPLICA IDENTITY FULL is used with
> dropped
>
> There is an extra "when".
>
>
Fixed, thanks


Attaching v2
From caae511206ee904db54d0a9300ecef04c86aadb6 Mon Sep 17 00:00:00 2001
From: Onder Kalaci 
Date: Sat, 11 Mar 2023 12:09:31 +0300
Subject: [PATCH v2] Skip dropped and generated columns when REPLICA IDENTITY
 FULL

Dropped columns and generated columns are filled with NULL values on
slot_store_data() but not on table_scan_getnextslot(). With this commit,
we skip such columns while checking tuple equality.
---
 src/backend/executor/execReplication.c | 10 
 src/test/subscription/t/100_bugs.pl| 67 ++
 2 files changed, 77 insertions(+)

diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 4f5083a598..9d447b3c00 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -243,6 +243,16 @@ tuples_equal(TupleTableSlot *slot1, TupleTableSlot *slot2,
 		Form_pg_attribute att;
 		TypeCacheEntry *typentry;
 
+		att = TupleDescAttr(slot1->tts_tupleDescriptor, attrnum);
+
+		/*
+		 * Dropped columns and generated columns are filled with NULL values on
+		 * slot_store_data() but not on table_scan_getnextslot(), so ignore
+		 * them for the comparison.
+		 */
+		if (att->attisdropped || att->attgenerated)
+			continue;
+
 		/*
 		 * If one value is NULL and other is not, then they are certainly not
 		 * equal
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index 036576acab..5f902d26b6 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -373,4 +373,71 @@ is( $result, qq(1
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
+# https://www.postgresql.org/message-id/CACawEhUu6S8E4Oo7%2Bs5iaq%3DyLRZJb6uOZeEQSGJj-7NVkDzSaw%40mail.gmail.com
+
+# The bug was that when the REPLICA IDENTITY FULL is used with dropped
+# or generated columns, the equality checks for the tuples were trying to
+# compare NULL Datum values (coming from slot_store_data()) with the
+# non-visible actual Datum values (coming from table_scan_getnextslot()).
+my $node_publisher_d_cols = PostgreSQL::Test::Cluster->new('node_publisher_d_cols');
+$node_publisher_d_cols->init(allows_streaming => 'logical');
+$node_publisher_d_cols->start;
+
+my $node_subscriber_d_cols = PostgreSQL::Test::Cluster->new('node_subscriber_d_cols');
+$node_subscriber_d_cols->init(allows_streaming => 'logical');
+$node_subscriber_d_cols->start;
+
+$node_publisher_d_cols->safe_psql(
+	'postgres', qq(
+	CREATE TABLE dropped_cols (a int, b_drop int, c int);
+	ALTER TABLE dropped_cols REPLICA IDENTITY FULL;
+
+	CREATE TABLE generated_cols (a int, b_gen int GENERATED ALWAYS AS (5 * a) STORED, c int);
+	ALTER TABLE generated_cols REPLICA IDENTITY FULL;
+
+	CREATE PUBLICATION pub_dropped_cols FOR TABLE dropped_cols, generated_cols;
+
+	-- some initial data
+	INSERT INTO dropped_cols VALUES (1, 1, 1);
+	INSERT INTO generated_cols (a,c) VALUES (1, 1);
+));
+
+$node_subscriber_d_cols->safe_psql(
+	'postgres', qq(
+	 CREATE TABLE dropped_cols (a int, b_drop int, c int);
+	 CREATE TABLE generated_cols (a int, b_gen int GENERATED ALWAYS AS (5 * a) STORED, c int);
+
+));
+
+my $publisher_connstr_d_cols = $node_publisher_d_cols->connstr . ' dbname=postgres';
+$node_subscriber_d_cols->safe_psql('postgres',
+	"CREATE SUBSCRIPTION sub_dropped_cols CONNECTION '$publisher_connstr_d_cols' PUBLICATION pub_dropped_cols"
+);
+$node_subscriber_d_cols->wait_for_subscription_sync;
+
+$node_publisher_d_cols->safe_psql(
+	'postgres', qq(
+		ALTER TABLE dropped_cols DROP COLUMN b_drop;
+));
+$node_subscriber_d_cols->safe_psql(
+	'postgres', qq(
+		ALTER TABLE dropped_cols DROP COLUMN b_drop;
+));
+
+$node_publisher_d_cols->safe_psql(
+	'postgres', qq(
+		UPDATE dropped_cols SET a = 100;
+		UPDATE generated_cols SET a = 100;
+));
+$node_publisher_d_cols->wait_for_catchup('sub_dropped_cols');
+
+is($node_subscriber_d_cols->safe_psql('postgres', "SELECT count(*) FROM dropped_cols WHERE a = 100"),
+	qq(1), 'replication with RI FULL and dropped columns');
+
+is($node_subscriber_d_cols->safe_psql('postgres', "SELECT count(*) FROM generated_cols WHERE a = 100"),
+	qq(1), 'replication with RI FULL and generated columns');
+
+$node_publisher_d_cols->stop('fast');
+$node_subscriber_d_cols->stop('fast');
+
 done_testing();
-- 
2.34.1



Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-13 Thread Amit Kapila
On Mon, Mar 13, 2023 at 6:14 PM Önder Kalacı  wrote:
>
>>
>>
>> 3. Removed the cases for dropping the index. This ensures that after
>> dropping the index on the table we switch to either an index scan (if
>> a new index is created) or to a sequence scan. It doesn't seem like a
>> very interesting case to me.
>
>
> For that test, my goal was to ensure/show that the invalidation callback
> is triggered after `DROP / CREATE INDEX` commands.
>

Fair point. In that case, I suggest just keeping one of the tests for Drop
Index, such that after it the apply worker picks up a sequence scan.
However, just do the poll for the number-of-index-scans stat once. I think
that will cover the case you are worried about without having a noticeable
impact on test timing.

-- 
With Regards,
Amit Kapila.




Re: Refactoring SysCacheGetAttr to know when attr cannot be NULL

2023-03-13 Thread Daniel Gustafsson
> On 2 Mar 2023, at 15:44, Tom Lane  wrote:
> 
> Peter Eisentraut  writes:
>> I think an error message like
>> "unexpected null value in system cache %d column %d"
>> is sufficient.  Since these are "can't happen" errors, we don't need to 
>> spend too much extra effort to make it prettier.
> 
> I'd at least like to see it give the catalog's OID.  That's easily
> convertible to a name, and it doesn't tend to move around across PG
> versions, neither of which are true for syscache IDs.
> 
> Also, I'm fairly unconvinced that it's a "can't happen" --- this
> would be very likely to fire as a result of catalog corruption,
> so it would be good if it's at least minimally interpretable
> by a non-expert.  Given that we'll now have just one copy of the
> code, ISTM there's a good case for doing the small extra work
> to report catalog and column by name.

Rebased v3 on top of the recent conflicting ICU changes that caused the
patch to no longer apply.  Also took another look around the tree to see
if there were missed callsites, but found no new ones.

--
Daniel Gustafsson



v3-0001-Add-SysCacheGetAttrNotNull-for-guaranteed-not-nul.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-03-13 Thread John Naylor
On Mon, Mar 13, 2023 at 8:41 AM Masahiko Sawada wrote:
>
> On Sun, Mar 12, 2023 at 12:54 AM John Naylor
>  wrote:
> >
> > > On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada wrote:

> > > * Additional size classes. It's important for an alternative of path
> > > compression as well as supporting our decoupling approach. Middle
> > > priority.
> >
> > I'm going to push back a bit and claim this doesn't bring much gain,
> > while it does have a complexity cost. The node1 from Andres's prototype
> > is 32 bytes in size, same as our node3, so it's roughly equivalent as a
> > way to ameliorate the lack of path compression.
>
> But does it mean that our node1 would help reduce the memory further,
> since our base node type (i.e. RT_NODE) is smaller than the base
> node type of Andres's prototype? The result I shared before showed
> 1.2GB vs. 1.9GB.

The benefit is found in a synthetic benchmark with random integers. I
highly doubt that anyone would be willing to force us to keep
binary-searching the 1GB array for one more cycle on account of not adding
a size class here. I'll repeat myself and say that there are also
maintenance costs.

In contrast, I'm fairly certain that our attempts thus far at memory
accounting/limiting are not quite up to par, and lacking enough to
jeopardize the feature. We're already discussing that, so I'll say no more.

> > I say "roughly" because the loop in node3 is probably noticeably
slower. A new size class will by definition still use that loop.
>
> I've evaluated the performance of node1 but the result seems to show
> the opposite.

As an aside, I meant the loop in our node3 might make your node1 slower
than the prototype's node1, which was coded for 1 member only.

> > > * Node shrinking support. Low priority.
> >
> > This is an architectural wart that's been neglected since the tid store
doesn't perform deletion. We'll need it sometime. If we're not going to
make this work, why ship a deletion API at all?
> >
> > I took a look at this a couple weeks ago, and fixing it wouldn't be
that hard. I even had an idea of how to detect when to shrink size class
within a node kind, while keeping the header at 5 bytes. I'd be willing to
put effort into that, but to have a chance of succeeding, I'm unwilling to
make it more difficult by adding more size classes at this point.
>
> I think that the deletion (and locking support) doesn't have use cases
> in the core (i.e. tidstore) but is implemented so that external
> extensions can use it.

I think these cases are a bit different: Doing anything with a data
structure stored in shared memory without a synchronization scheme is
completely unthinkable and insane. I'm not yet sure if
deleting-without-shrinking is a showstopper, or if it's preferable in v16
to having no deletion at all.

Anything we don't implement now is a limit on future use cases, and thus a
cause for objection. On the other hand, anything we implement also
represents more stuff that will have to be rewritten for high-concurrency.

> FYI, I've run TPC-C workload over the weekend, and didn't get any
> failures of the assertion proving tidstore and the current tid lookup
> return the same result.

Great!

--
John Naylor
EDB: http://www.enterprisedb.com


Re: MERGE ... RETURNING

2023-03-13 Thread Dean Rasheed
On Sun, 26 Feb 2023 at 09:50, Dean Rasheed  wrote:
>
> Another rebase.
>

And another rebase.

Regards,
Dean
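
As a reminder of the intended usage, here is a minimal sketch combining the
two support functions documented in the patch below (table and column names
are invented):

MERGE INTO target t USING source s ON t.id = s.id
  WHEN MATCHED AND s.deleted THEN DELETE
  WHEN MATCHED THEN UPDATE SET val = s.val
  WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val)
  RETURNING pg_merge_action(), pg_merge_when_clause(), t.*;

-- pg_merge_action() returns 'DELETE', 'UPDATE', or 'INSERT' for each row;
-- pg_merge_when_clause() returns 1, 2, or 3 here, the index of the WHEN
-- clause that fired.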
diff --git a/doc/src/sgml/dml.sgml b/doc/src/sgml/dml.sgml
new file mode 100644
index cbbc5e2..ff2a827
--- a/doc/src/sgml/dml.sgml
+++ b/doc/src/sgml/dml.sgml
@@ -283,10 +283,15 @@ DELETE FROM products;
RETURNING
   
 
+  
+   MERGE
+   RETURNING
+  
+
   
Sometimes it is useful to obtain data from modified rows while they are
being manipulated.  The INSERT, UPDATE,
-   and DELETE commands all have an
+   DELETE, and MERGE commands all have an
optional RETURNING clause that supports this.  Use
of RETURNING avoids performing an extra database query to
collect the data, and is especially valuable when it would otherwise be
@@ -339,6 +344,24 @@ DELETE FROM products
 
   
 
+  
+   In a MERGE, the data available to RETURNING is
+   the content of the source row plus the content of the inserted, updated, or
+   deleted target row.  In addition, the merge support functions
+and 
+   may be used to return information about which merge action was executed for
+   each row.  Since it is quite common for the source and target to have many
+   of the same columns, specifying RETURNING * can lead to a
+   lot of duplicated columns, so it is often more useful to just return the
+   target row.  For example:
+
+MERGE INTO products p USING new_products n ON p.product_no = n.product_no
+  WHEN NOT MATCHED THEN INSERT VALUES (n.product_no, n.name, n.price)
+  WHEN MATCHED THEN UPDATE SET name = n.name, price = n.price
+  RETURNING pg_merge_action(), p.*;
+
+  
+
   
If there are triggers () on the target table,
the data available to RETURNING is the row as modified by
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index 9c6107f..d07f90a
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21216,6 +21216,99 @@ SELECT count(*) FROM sometable;
 
  
 
+ 
+  Merge Support Functions
+
+  
+   MERGE
+   RETURNING
+  
+
+  
+   The merge support functions shown in
+may be used in the
+   RETURNING list of a  command
+   to return additional information about the action taken for each row.
+  
+
+  
+Merge Support Functions
+
+
+ 
+  
+   
+Function
+   
+   
+Description
+   
+  
+ 
+
+ 
+  
+   
+
+ pg_merge_action
+
+pg_merge_action ( )
+text
+   
+   
+Returns the action performed on the current row ('INSERT',
+'UPDATE', or 'DELETE').
+   
+  
+
+  
+   
+
+ pg_merge_when_clause
+
+pg_merge_when_clause ( )
+integer
+   
+   
+Returns the 1-based index of the WHEN clause executed
+for the current row.
+   
+  
+ 
+
+  
+
+  
+   Example:
+
+  
+
+  
+   Note that it is an error to use these functions anywhere other than the
+   RETURNING list of a MERGE command.
+  
+
+ 
+
  
   Subquery Expressions
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
new file mode 100644
index 7c01a54..4317cfc
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1323,9 +1323,9 @@
  to a client upon the
  completion of an SQL command, usually a
  SELECT but it can be an
- INSERT, UPDATE, or
- DELETE command if the RETURNING
- clause is specified.
+ INSERT, UPDATE,
+ DELETE, or MERGE command if the
+ RETURNING clause is specified.
 
 
  The fact that a result set is a relation means that a query can be used
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index 7c8a49f..b839695
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -1020,8 +1020,8 @@ INSERT INTO mytable VALUES (1,'one'), (2
 
 
 
- If the command does return rows (for example SELECT,
- or INSERT/UPDATE/DELETE
+ If the command does return rows (for example SELECT, or
+ INSERT/UPDATE/DELETE/MERGE
  with RETURNING), there are two ways to proceed.
  When the command will return at most one row, or you only care about
  the first row of output, write the command as usual but add
@@ -1149,6 +1149,7 @@ SELECT select_expressionsexpressions INTO STRICT target;
 UPDATE ... RETURNING expressions INTO STRICT target;
 DELETE ... RETURNING expressions INTO STRICT target;
+MERGE ... RETURNING expressions INTO STRICT target;
 
 
  where target can be a record variable, a row
@@ -1159,8 +1160,8 @@ DELETE ... RETURNING expres
  INTO clause) just as described above,
  and the plan is cached in the same way.
  This works for SELECT,
- INSERT/UPDATE/DELETE with
- RETURNING, and certain utility commands
+ INSERT/UPDATE/DELETE/MERGE
+ with RETURNING, and certain utility commands
  that return row sets, such as EXPLAIN.
  Except for the INTO clause, the SQL command is the same
  a

Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-13 Thread Justin Pryzby
On Sun, Mar 12, 2023 at 06:25:13PM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > On Sun, Mar 12, 2023 at 04:14:06PM -0400, Tom Lane wrote:
> >> Hm.  Could we get rid of count_leaf_partitions by doing the work in
> >> ProcessUtilitySlow?  Or at least passing that OID list forward instead
> >> of recomputing it?
> 
> > count_leaf_partitions() is called in two places:
> 
> > Once to get PROGRESS_CREATEIDX_PARTITIONS_TOTAL.  It'd be easy enough to
> > pass an integer total via IndexStmt (but I think we wanted to avoid
> > adding anything there, since it's not a part of the statement).
> 
> I agree that adding such a field to IndexStmt would be a very bad idea.
> However, adding another parameter to DefineIndex doesn't seem like a
> problem.

It's a problem since this is a bug and it's desirable to backpatch a
fix, right?

> Or could we just call pgstat_progress_update_param directly from
> ProcessUtilitySlow, after counting the partitions in the existing loop?

That'd be fine if it was only needed for TOTAL, but it doesn't handle
the 2nd call to count_leaf_partitions().

On Sun, Mar 12, 2023 at 08:15:28PM -0400, Tom Lane wrote:
> I wrote:
> > Justin Pryzby  writes:
> >> count_leaf_partitions() is also called for sub-partitions, in the case
> >> that a matching "partitioned index" already exists, and the progress
> >> report needs to be incremented by the number of leaves for which indexes
> >> were ATTACHED.
> 
> > Can't you increment progress by one at the point where the actual attach
> > happens?
> 
> Oh, never mind; now I realize that the point is that you didn't ever
> iterate over those leaf indexes.
> 
> However, consider a thought experiment: assume for whatever reason that
> all the actual index builds happen first, then all the cases where you
> succeed in attaching a sub-partitioned index happen at the end of the
> command.  In that case, the percentage-done indicator would go from
> some-number to 100% more or less instantly.
> 
> What if we simply do nothing at sub-partitioned indexes?  Or if that's
> slightly too radical, just increase the PARTITIONS_DONE counter by 1?
> That would look indistinguishable from the case where all the attaches
> happen at the end.

Incrementing by 0 sounds terrible, since someone who has intermediate
partitioned tables is likely to always see 0% done.  (It's true that
intermediate partitioned tables don't seem to have been considered by
the original patch, and it's indisputable that progress reporting
currently misbehaves in that case).

And incrementing PARTITIONS_DONE by 1 could lead to bogus progress
reporting with "N_done > N_Total" if an intermediate partitioned table
had no leaf partitions at all.  That's one of the problems this thread
is trying to fix (the other being "total changing in the middle of the
command").

Maybe your idea is usable though, since indirect partitioned indexes
*can* be counted correctly during recursion.  What's hard to fix is the
case that an index is both *partitioned* and *attached*.  Maybe it's
okay to count that case as 0.  The consequence is that the command would
end before the progress report got to 100%.

The other option seems to be to define the progress report to count only
*direct* children.
https://www.postgresql.org/message-id/20221213044331.GJ27893%40telsasoft.com

> The reason I find this annoying is that the non-optional nature of the
> progress reporting mechanism was sold on the basis that it would add
> only negligible overhead.  Adding extra pass(es) over pg_inherits
> breaks that promise.  Maybe it's cheap enough to not matter in the
> big scheme of things, but we should not be having to make arguments
> like that one.

If someone is running a DDL command involving nested partitions, I'm not
so concerned about the cost of additional scans of pg_inherits.  They
either have enough data to justify partitioning partitions, or they're
doing something silly.

-- 
Justin




Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

2023-03-13 Thread Alexander Lakhin

12.03.2023 10:18, Thomas Munro wrote:

And again:

TRAP: failed Assert("PMSignalState->PMChildFlags[slot] ==
PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsigna...

https://cirrus-ci.com/task/6558324615806976
https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/002_pg_upgrade_old_node.log
https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/crashlog/crashlog-postgres.exe_0974_2023-03-11_13-57-27-982.txt


Here we have duplicate PIDs too:
...
2023-03-11 13:57:21.277 GMT [2152][client backend] [pg_regress/union][:0] LOG:  
disconnection: session time: 0:00:00.268 user=SYSTEM database=regression 
host=[local]

...
2023-03-11 13:57:22.320 GMT [4340][client backend] [pg_regress/join][8/947:0] 
LOG:  statement: set enable_hashjoin to 0;
TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), 
File: "../src/backend/storage/ipc/pmsignal.c", Line: 329, PID: 2152


And I see the following code in postmaster.c:
CleanupBackend(int pid,
               int exitstatus)    /* child's exit status. */
{
...
    dlist_foreach_modify(iter, &BackendList)
    {
        Backend    *bp = dlist_container(Backend, elem, iter.cur);

        if (bp->pid == pid)
        {
            if (!bp->dead_end)
            {
                if (!ReleasePostmasterChildSlot(bp->child_slot))
...

So if a backend with the same PID happens to have started (but not reached
InitProcess() yet) when CleanupBackend() is called to clean up after a
just-finished backend, the slot of the starting one will be released.

I am yet to construct a reproduction of the case, but it seems to me that
the race condition is not impossible here.
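
If I read the code right, the suspected interleaving is roughly this (a
sketch, using the PIDs from the log above):

1. Backend A (pid 2152) exits; the postmaster has not reaped it yet, so
   A's entry is still in BackendList.
2. The OS reuses pid 2152 for a newly forked backend B, which has its own
   child slot assigned but has not reached InitProcess() yet.
3. The postmaster reaps A and calls CleanupBackend(2152, ...); the loop
   above matches a BackendList entry by pid -- possibly B's entry -- and
   releases B's slot.
4. B then fails the Assert("PMSignalState->PMChildFlags[slot] ==
   PM_CHILD_ASSIGNED") when it finally initializes.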

Best regards,
Alexander




Re: Reducing connection overhead in pg_upgrade compat check phase

2023-03-13 Thread Daniel Gustafsson
> On 23 Feb 2023, at 15:12, Daniel Gustafsson  wrote:
> 
>> On 22 Feb 2023, at 20:20, Nathan Bossart  wrote:
> 
>> One thing I noticed is that the
>> "failed check" log is only printed once, even if multiple data type checks
>> failed.  I believe this is because this message uses PG_STATUS.  If I
>> change it to PG_REPORT, all of the "failed check" messages appear.  TBH I'm
>> not sure we need this message at all since a more detailed explanation will
>> be printed afterwards.  If we do keep it around, I think it should be
>> indented so that it looks more like this:
>> 
>>   Checking for data type usage                      checking all databases
>>     failed check: incompatible aclitem data type in user tables
>>     failed check: reg* data types in user tables
> 
> Thats a good point, that's better.  I think it makes sense to keep it around.
> 
>>> One change this brings is that check.c contains version specific checks in 
>>> the
>>> struct.  Previously these were mostly contained in version.c (some, like the
>>> 9.4 jsonb check was in check.c) which maintained some level of separation.
>>> Splitting the array init is of course one option but it also seems a tad 
>>> messy.
>>> Not sure what's best, but for now I've documented it in the array comment at
>>> least.
>> 
>> Hm.  We could move check_for_aclitem_data_type_usage() and
>> check_for_jsonb_9_4_usage() to version.c since those are only used for
>> determining whether the check applies now.  Otherwise, IMO things are in
>> roughly the right place.  I don't think it's necessary to split the array.
> 
> Will do, thanks.

The attached v3 is a rebase to handle conflicts, with the above comments
addressed.

--
Daniel Gustafsson



v3-0001-pg_upgrade-run-all-data-type-checks-per-connectio.patch
Description: Binary data


Re: Add support for DEFAULT specification in COPY FROM

2023-03-13 Thread Andrew Dunstan



On 2022-12-02 Fr 09:11, Israel Barth Rubio wrote:

Hello all,

I'm submitting a new version of the patch. Instead of changing the
signature of several functions in order to use the defaults parameter, it
is now stored in the cstate structure, which is already passed to all
functions that were previously modified.



Thanks, committed.


cheers


andrew


--

Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2023-03-13 Thread Önder Kalacı
Hi Amit, all

> >
> > For that test, my goal was to ensure/show that the invalidation callback
> > is triggered after `DROP / CREATE INDEX` commands.
> >
>
> Fair point. I suggest in that case just keep one of the tests for Drop
> Index such that after that it will pick up a sequence scan. However,
> just do the poll for the number of index scans stat once. I think that
> will cover the case you are worried about without having a noticeable
> impact on test timing.
>
>
So, after dropping the index, it is not possible to poll for the idxscan.

But I think that after the DROP INDEX it is enough to check that the
modification is applied properly on the target (wait_for_catchup +
safe_psql). If the worker were still caching the old indexOid, the
update/delete would fail anyway.
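
In other words, something of this shape (a sketch; the node, table, and
subscription names are placeholders):

$node_subscriber->safe_psql('postgres',
	"DROP INDEX test_replica_id_full_idx");
$node_publisher->safe_psql('postgres',
	"UPDATE test_replica_id_full SET x = 10 WHERE x = 1");
$node_publisher->wait_for_catchup('tap_sub_rep_full');
is($node_subscriber->safe_psql('postgres',
	"SELECT count(*) FROM test_replica_id_full WHERE x = 10"),
	qq(1), 'update applied via sequential scan after DROP INDEX');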

Attaching v47.


Thanks,
Onder KALACI


v47_0001_use_index_on_subs_when_pub_rep_ident_full.patch
Description: Binary data


Bug in jsonb_in function (14 & 15 version are affected)

2023-03-13 Thread Nikolay Shaplov
Hi!

I found a bug in the jsonb_in function (it converts JSON from its string
representation into the internal jsonb representation).

To reproduce this bug (the way I found it), you need an 8-bit-encoded
instance of a Postgres DB:

1. add en_US locale (dpkg-reconfigure locales in debian)
2. initdb with latin1 encoding: 

LANG=en_US ./initdb --encoding=LATIN1 -D my_pg_data

3. run database and execute the query:

SELECT 
E'{\x0a"\x5cb\x5c"\x5c\x5c\x5c/\x5cb\x5cf\x5cn\x5cr\x5ct\x5c"\x5c\x5c\x5c\x5crZt\x5c"\x5c\x5c\x5c/\x5cb\x5c"\x5c\x5c\x5c/\x5cb\x5c"\x5cu000f0\x5cu000f000\x5cuDFFF"000\x5cu000\xb4\x5cuDBFF\x5cuDFFF0002000\x5cuDBFF'::jsonb;

In postgres 14 and 15, the backend will crash.

The backtrace produced with ASan is in the attached file.

This bug was found while fuzzing Postgres input functions using AFL++.
For now we are using a lightweight wrapper around the input functions that
creates a minimal environment for them to run the conversion, and runs
them in the fuzzer.


My colleagues (they will join this thread shortly) have narrowed this query down to

SELECT E'\n"\\u0"'::jsonb;

and they say that it crashes even in a UTF-8 locale.

They also have a preliminary version of a patch to fix it. They will
describe it here soon, I hope.

-- 
Nikolay Shaplov aka Nataraj
Fuzzing Engineer at Postgres Professional
Matrix IM: @dhyan:nataraj.su
=
==90012==ERROR: AddressSanitizer: negative-size-param: (size=-1)
#0 0x533f04 in __asan_memcpy 
(/home/nataraj/dev/.install/default/bin/postgres+0x533f04)
#1 0x102145a in report_json_context 
/home/nataraj/dev/postgrespro/src/backend/utils/adt/jsonfuncs.c:685:2
#2 0x1020a32 in json_ereport_error 
/home/nataraj/dev/postgrespro/src/backend/utils/adt/jsonfuncs.c:631:3
#3 0x1020976 in pg_parse_json_or_ereport 
/home/nataraj/dev/postgrespro/src/backend/utils/adt/jsonfuncs.c:511:3
#4 0x1009176 in jsonb_from_cstring 
/home/nataraj/dev/postgrespro/src/backend/utils/adt/jsonb.c:265:2
#5 0x1008f3b in jsonb_in 
/home/nataraj/dev/postgrespro/src/backend/utils/adt/jsonb.c:81:9
#6 0x121d547 in InputFunctionCall 
/home/nataraj/dev/postgrespro/src/backend/utils/fmgr/fmgr.c:1533:11
#7 0x121dd10 in OidInputFunctionCall 
/home/nataraj/dev/postgrespro/src/backend/utils/fmgr/fmgr.c:1636:9
#8 0x8cbd0e in stringTypeDatum 
/home/nataraj/dev/postgrespro/src/backend/parser/parse_type.c:662:9
#9 0x87eb68 in coerce_type 
/home/nataraj/dev/postgrespro/src/backend/parser/parse_coerce.c
#10 0x87dec1 in coerce_to_target_type 
/home/nataraj/dev/postgrespro/src/backend/parser/parse_coerce.c:105:11
#11 0x891b40 in transformTypeCast 
/home/nataraj/dev/postgrespro/src/backend/parser/parse_expr.c:2717:11
#12 0x88f0b8 in transformExprRecurse 
/home/nataraj/dev/postgrespro/src/backend/parser/parse_expr.c:168:13
#13 0x88efb1 in transformExpr 
/home/nataraj/dev/postgrespro/src/backend/parser/parse_expr.c:126:11
#14 0x8c358b in transformTargetEntry 
/home/nataraj/dev/postgrespro/src/backend/parser/parse_target.c:96:11
#15 0x8c37c2 in transformTargetList 
/home/nataraj/dev/postgrespro/src/backend/parser/parse_target.c:184:10
#16 0x83a583 in transformSelectStmt 
/home/nataraj/dev/postgrespro/src/backend/parser/analyze.c:1317:20
#17 0x836ad4 in transformStmt 
/home/nataraj/dev/postgrespro/src/backend/parser/analyze.c:366:15
#18 0x836d5e in transformOptionalSelectInto 
/home/nataraj/dev/postgrespro/src/backend/parser/analyze.c:306:9
#19 0x8364d4 in transformTopLevelStmt 
/home/nataraj/dev/postgrespro/src/backend/parser/analyze.c:256:11
#20 0x83640e in parse_analyze_fixedparams 
/home/nataraj/dev/postgrespro/src/backend/parser/analyze.c:124:10
#21 0xef9c4b in pg_analyze_and_rewrite_fixedparams 
/home/nataraj/dev/postgrespro/src/backend/tcop/postgres.c:659:10
#22 0xefe818 in exec_simple_query 
/home/nataraj/dev/postgrespro/src/backend/tcop/postgres.c:1169:20
#23 0xefd92d in PostgresMain 
/home/nataraj/dev/postgrespro/src/backend/tcop/postgres.c
#24 0xd837f0 in BackendRun 
/home/nataraj/dev/postgrespro/src/backend/postmaster/postmaster.c:4538:2
#25 0xd82e6f in BackendStartup 
/home/nataraj/dev/postgrespro/src/backend/postmaster/postmaster.c:4266:3
#26 0xd816b3 in ServerLoop 
/home/nataraj/dev/postgrespro/src/backend/postmaster/postmaster.c:1833:7
#27 0xd7e8e0 in PostmasterMain 
/home/nataraj/dev/postgrespro/src/backend/postmaster/postmaster.c:1505:11
#28 0xb7ddc2 in main 
/home/nataraj/dev/postgrespro/src/backend/main/main.c:209:3
#29 0x77190d09 in __libc_start_main csu/../csu/libc-start.c:308:16
#30 0x4baa99 in _start 
(/home/nataraj/dev/.install/default/bin/postgres+0x4baa99)

0x62506e9a is located 7578 bytes inside of 8192-byte reg

Re: buildfarm + meson

2023-03-13 Thread Andrew Dunstan


On 2023-03-10 Fr 18:05, Andres Freund wrote:

Hi,

On 2023-03-09 11:55:57 -0800, Andres Freund wrote:

On 2023-03-09 14:47:36 -0500, Andrew Dunstan wrote:

On 2023-03-09 Th 08:28, Andrew Dunstan wrote:

At this stage I think I'm prepared to turn this loose on a couple of my
buildfarm animals, and if nothing goes awry for the remainder of the
month merge the dev/meson branch and push a new release.

Cool!

I moved a few of my animals over to it too; so far no problems.

The only other thing I noticed so far is that the status page doesn't yet know
how to generate the right "flags", but that's fairly minor...



The status page should be fixed now. Still a bit of work to do for the 
failures page.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


RE: Ability to reference other extensions by schema in extension scripts

2023-03-13 Thread Regina Obe
> I've given a try to this patch. It builds and regresses fine.
> 
> My own tests also worked fine. As long as ext1 was found in the ext2's
> no_relocate list it could not be relocated, and a proper error message is
> given to the user trying it.
> 
> Nitpicking, there are a few things that are weird to me:
> 
> 1) I don't get any error/warning if I put an arbitrary string into
> no_relocate (there's no check to verify the no_relocate is a subset of
> the requires).
> 

I thought about that and decided it wasn't worth checking for.  If an
extension author lists an extension that is not in requires, that's on
them, as the docs say it should be in requires.

The entry is then simply ignored, as if the extension were not listed in
no_relocate.

> 2) An extension can still reference extensions it depends on without
> putting them in no_relocate. This may be intentional, as some
> substitutions may not require blocking relocation, but felt inconsistent
> with the normal @extschema@, which is never replaced unless an extension
> is marked as non-relocatable.
> 
> --strk;
> 
>   Libre GIS consultant/developer
>   https://strk.kbt.io/services.html


Yes, this is intentional.  As Tom mentioned, if for example an extension
author decides to schema-qualify @extschema:foo@ in their table
definitions, and they marked their extension as requiring foo, such a
reference is captured by a schema move, so there is no need for them to
prevent relocation of the foo extension (assuming foo was relocatable to
begin with).
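
To make the distinction concrete, a sketch with invented extension and
object names (not taken from the patch):

-- In ext2's install script, with foo listed in ext2's "requires":

-- No no_relocate needed: the type reference is resolved to an OID when
-- the table is created, so a later ALTER EXTENSION foo SET SCHEMA is
-- harmless.
CREATE TABLE ext2_tab (v @extschema:foo@.foo_type);

-- no_relocate = 'foo' needed: the substituted schema name is stored as
-- plain text in the function body, and would go stale if foo were
-- relocated afterwards.
CREATE FUNCTION ext2_fn() RETURNS text LANGUAGE sql
  AS $$ SELECT @extschema:foo@.foo_func() $$;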





Re: Progress report of CREATE INDEX for nested partitioned tables

2023-03-13 Thread Tom Lane
Justin Pryzby  writes:
> On Sun, Mar 12, 2023 at 06:25:13PM -0400, Tom Lane wrote:
>> I agree that adding such a field to IndexStmt would be a very bad idea.
>> However, adding another parameter to DefineIndex doesn't seem like a
>> problem.

> It's a problem since this is a bug and it's desirable to backpatch a
> fix, right?

I do not think this is important enough to justify a back-patch.

> Incrementing by 0 sounds terrible, since someone who has intermediate
> partitioned tables is likely to always see 0% done.

How so?  The counter will increase after there's some actual work done,
ie building an index.  If there's no indexes to build then it hardly
matters, because the command will complete in very little time.

> And incrementing PARTITIONS_DONE by 1 could lead to bogus progress
> reporting with "N_done > N_Total" if an intermediate partitioned table
> had no leaf partitions at all.

Well, we could fix that if we made TOTAL be the total number of
descendants rather than just the leaves ;-).  But I think not
incrementing is probably better.

regards, tom lane




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-03-13 Thread Masahiko Sawada
On Mon, Mar 13, 2023 at 10:28 PM John Naylor
 wrote:
>
> On Mon, Mar 13, 2023 at 8:41 AM Masahiko Sawada  wrote:
> >
> > On Sun, Mar 12, 2023 at 12:54 AM John Naylor
> >  wrote:
> > >
> > > On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada  
> > > wrote:
>
> > > > * Additional size classes. It's important for an alternative of path
> > > > compression as well as supporting our decoupling approach. Middle
> > > > priority.
> > >
> > > I'm going to push back a bit and claim this doesn't bring much gain, 
> > > while it does have a complexity cost. The node1 from Andres's prototype 
> > > is 32 bytes in size, same as our node3, so it's roughly equivalent as a 
> > > way to ameliorate the lack of path compression.
> >
> > But does it mean that our node1 would help reduce the memory further
> > since our base node type (i.e. RT_NODE) is smaller than the base
> > node type of Andres's prototype? The result I shared before showed
> > 1.2GB vs. 1.9GB.
>
> The benefit is found in a synthetic benchmark with random integers. I highly 
> doubt that anyone would be willing to force us to keep binary-searching the 
> 1GB array for one more cycle on account of not adding a size class here. I'll 
> repeat myself and say that there are also maintenance costs.
>
> In contrast, I'm fairly certain that our attempts thus far at memory 
> accounting/limiting are not quite up to par, and lacking enough to jeopardize 
> the feature. We're already discussing that, so I'll say no more.

I agree that memory accounting/limiting stuff is the highest priority.
So what kinds of size classes do you think we need? node3, 15, 32, 61
and 256?

>
> > > I say "roughly" because the loop in node3 is probably noticeably slower. 
> > > A new size class will by definition still use that loop.
> >
> > I've evaluated the performance of node1 but the result seems to show
> > the opposite.
>
> As an aside, I meant the loop in our node3 might make your node1 slower than 
> the prototype's node1, which was coded for 1 member only.

Agreed.

>
> > > > * Node shrinking support. Low priority.
> > >
> > > This is an architectural wart that's been neglected since the tid store 
> > > doesn't perform deletion. We'll need it sometime. If we're not going to 
> > > make this work, why ship a deletion API at all?
> > >
> > > I took a look at this a couple weeks ago, and fixing it wouldn't be that 
> > > hard. I even had an idea of how to detect when to shrink size class 
> > > within a node kind, while keeping the header at 5 bytes. I'd be willing 
> > > to put effort into that, but to have a chance of succeeding, I'm 
> > > unwilling to make it more difficult by adding more size classes at this 
> > > point.
> >
> > I think that the deletion (and locking support) doesn't have use cases
> > in the core (i.e. tidstore) but is implemented so that external
> > extensions can use it.
>
> I think these cases are a bit different: Doing anything with a data structure 
> stored in shared memory without a synchronization scheme is completely 
> unthinkable and insane.

Right.

> I'm not yet sure if deleting-without-shrinking is a showstopper, or if it's 
> preferable in v16 than no deletion at all.
>
> Anything we don't implement now is a limit on future use cases, and thus a 
> cause for objection. On the other hand, anything we implement also represents 
> more stuff that will have to be rewritten for high-concurrency.

Okay. Given that adding shrinking support also incurs maintenance costs
(and probably new test cases?) and there are no use cases in core, I'm not
sure it's worth supporting at this stage. So I prefer either shipping the
deletion API as it is or removing it. I think that's a point on which we'd
like to hear feedback from other hackers.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




[PATCH] Extend the length of BackgroundWorker.bgw_library_name

2023-03-13 Thread Yurii Rashkovskii
Hi,

I want to suggest a patch against master (it may also be worth backporting
it) that makes it possible to use longer filenames (such as those with
absolute paths) in `BackgroundWorker.bgw_library_name`.

`BackgroundWorker.bgw_library_name` currently allows names up to
BGW_MAXLEN-1, which is generally sufficient if `$libdir` expansion is used.

However, there are use cases where [potentially] longer names are
expected/desired; for example, test benches (where library files may not
[or can not] be copied to Postgres installation) or alternative library
installation methods that do not put them into $libdir.

The patch is backwards-compatible and ensures that bgw_library_name stays
*at least* as long as BGW_MAXLEN. Existing external code that uses
BGW_MAXLEN as a length boundary (for example, in `strncpy`) will continue
to work as expected.

The trade-off of this patch is that the `BackgroundWorker` structure
becomes larger. From my perspective, this is a reasonable cost (less than a
kilobyte of extra space per worker).
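
For illustration, one possible shape of the change, assuming the field is
widened to MAXPGPATH (a sketch only; see the attached patch for the actual
change):

typedef struct BackgroundWorker
{
    char        bgw_name[BGW_MAXLEN];
    char        bgw_type[BGW_MAXLEN];
    int         bgw_flags;
    BgWorkerStartTime bgw_start_time;
    int         bgw_restart_time;   /* in seconds, or BGW_NEVER_RESTART */
    char        bgw_library_name[MAXPGPATH];    /* was BGW_MAXLEN */
    char        bgw_function_name[BGW_MAXLEN];
    Datum       bgw_main_arg;
    char        bgw_extra[BGW_EXTRALEN];
    pid_t       bgw_notify_pid;
} BackgroundWorker;

Since MAXPGPATH (1024) is well above BGW_MAXLEN (96), code that treats the
field as at least BGW_MAXLEN bytes long keeps working unchanged.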

The patch builds and `make check` succeeds.

Any feedback is welcome!

-- 
http://omnigres.org
Yurii


v1-0001-Extend-the-length-of-BackgroundWorker.bgw_library_name.patch
Description: Binary data


Re: [Proposal] Allow pg_dump to include all child tables with the root table

2023-03-13 Thread Gilles Darold

On 12/03/2023 at 19:05, Stéphane Tachoires wrote:


Hi Gilles,

On Ubuntu 22.04.2 with all debs updated, I get an error on one test.
I'll try to find where and why, but I thought you should know first.

1/1 postgresql:pg_dump / pg_dump/002_pg_dump        ERROR            24.40s   exit status 1
―――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――

stderr:
#   Failed test 'only_dump_measurement: should dump CREATE TABLE 
test_compression'
#   at 
/media/hddisk/stephane/postgresql/src/postgresql/src/bin/pg_dump/t/002_pg_dump.pl 
 line 4729.
# Review only_dump_measurement results in 
/media/hddisk/stephane/postgresql/build/testrun/pg_dump/002_pg_dump/data/tmp_test_iJxJ

# Looks like you failed 1 test of 10264.


Hi Stephane,


Odd, I do not get this error when running make installcheck, and I have
the same OS version as you.



    +++ tap check in src/bin/pg_dump +++
    t/001_basic.pl  ok
    t/002_pg_dump.pl .. ok
    t/003_pg_dump_with_server.pl .. ok
    t/010_dump_connstr.pl . ok
    All tests successful.
    Files=4, Tests=9531, 11 wallclock secs ( 0.33 usr  0.04 sys + 3.05 
cusr  1.22 csys =  4.64 CPU)

    Result: PASS

Anyway, this test had to be fixed, and that is done in version v5 of the
patch, attached here.



Thanks for the review.

--
Gilles Darold
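
For reference, hypothetical invocations of the three new options added by
the patch (database and table names invented):

pg_dump --table-and-children='measurement' mydb
pg_dump --exclude-table-and-children='log_*' mydb
pg_dump --exclude-table-data-and-children='audit' mydb

The first dumps the named table plus all of its partitions and child
tables, the second excludes the matching tables and their children
entirely, and the third keeps their schema but skips their data.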
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 334e4b7fd1..8ce4d5ff41 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -775,6 +775,16 @@ PostgreSQL documentation
   
  
 
+ 
+  --exclude-table-and-children=pattern
+  
+   
+Same as -T/--exclude-table option but automatically
+exclude partitions or child tables of the specified tables if any.
+   
+  
+ 
+
  
   --exclude-table-data=pattern
   
@@ -793,6 +803,17 @@ PostgreSQL documentation
   
  
 
+ 
+  --exclude-table-data-and-children=pattern
+  
+   
+Same as --exclude-table-data  but automatically
+exclude partitions or child tables of the specified tables if any.
+   
+  
+ 
+
+
  
   --extra-float-digits=ndigits
   
@@ -1158,6 +1179,16 @@ PostgreSQL documentation
   
  
 
+ 
+  --table-and-children=pattern
+  
+   
+Same as -t/--table option but automatically
+include partitions or child tables of the specified tables if any.
+   
+  
+ 
+
  
   --use-set-session-authorization
   
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4217908f84..09d37991d6 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -130,6 +130,10 @@ static SimpleOidList foreign_servers_include_oids = {NULL, NULL};
 static SimpleStringList extension_include_patterns = {NULL, NULL};
 static SimpleOidList extension_include_oids = {NULL, NULL};
 
+static SimpleStringList table_include_patterns_and_children = {NULL, NULL};
+static SimpleStringList table_exclude_patterns_and_children = {NULL, NULL};
+static SimpleStringList tabledata_exclude_patterns_and_children = {NULL, NULL};
+
 static const CatalogId nilCatalogId = {0, 0};
 
 /* override for standard extra_float_digits setting */
@@ -180,7 +184,8 @@ static void expand_foreign_server_name_patterns(Archive *fout,
 static void expand_table_name_patterns(Archive *fout,
 	   SimpleStringList *patterns,
 	   SimpleOidList *oids,
-	   bool strict_names);
+	   bool strict_names,
+	   bool with_child_tables);
 static void prohibit_crossdb_refs(PGconn *conn, const char *dbname,
   const char *pattern);
 
@@ -421,6 +426,9 @@ main(int argc, char **argv)
 		{"on-conflict-do-nothing", no_argument, &dopt.do_nothing, 1},
 		{"rows-per-insert", required_argument, NULL, 10},
 		{"include-foreign-data", required_argument, NULL, 11},
+		{"table-and-children", required_argument, NULL, 12},
+		{"exclude-table-and-children", required_argument, NULL, 13},
+		{"exclude-table-data-and-children", required_argument, NULL, 14},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -631,6 +639,19 @@ main(int argc, char **argv)
 		  optarg);
 break;
 
+			case 12:			/* include table(s) with children */
+simple_string_list_append(&table_include_patterns_and_children, optarg);
+dopt.include_everything = false;
+break;
+
+			case 13:			/* exclude table(s) with children */
+simple_string_list_append(&table_exclude_patterns_and_children, optarg);
+break;
+
+			case 14:/* exclude table(s) data */
+simple_string_list_append(&tabledata_exclude_patterns_and_children, optarg);
+break;
+
 			default:
 /* getopt_long already emitted a complaint */
 pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -814,17 +835,34 @@ main(int argc

Re: ICU locale validation / canonicalization

2023-03-13 Thread Jeff Davis
On Mon, 2023-03-13 at 08:25 +0100, Peter Eisentraut wrote:
> For clarification, I wasn't complaining about the notice, but about
> the 
> automatic conversion from old-style ICU locale ID to language tag.

Canonicalization means that we pick one format, and automatically
convert to it, right?

> What we had discussed a while ago in one of these threads is that ICU
> before version 54 do not support keyword lists, and we have custom
> code 
> to do that parsing ourselves, but we don't have code to do the same
> for 
> language tags.  Therefore, if I understand this right, if we 
> automatically convert ICU locale IDs to language tags, as shown
> above, 
> then we break support for such locales in those older ICU versions.

Right. In versions 53 and earlier, and during pg_upgrade, we would just
preserve the locale string as entered.

Alternatively, we could canonicalize to the ICU format locale IDs. Or
add something to parse out the attributes from a language tag.

Regards,
Jeff Davis





Re: pg_dump/pg_restore: Fix stdin/stdout handling of custom format on Win32

2023-03-13 Thread Juan José Santamaría Flecha
On Fri, Mar 10, 2023 at 2:37 AM Michael Paquier  wrote:

> On Fri, Mar 10, 2023 at 12:12:37AM +0100, Juan José Santamaría Flecha
> wrote:
> > I've broken the patch in two:
> > 1. fixes the detection of unseekable files in checkSeek(), using logic
> > that
> > hopefully is backpatchable,
> > 2. the improvements on file type detection for stat() proposed by the OP.
>
> I am OK with 0002, so I'll try to get this part backpatched down to
> where the implementation of stat() has been added.  I am not
> completely sure that 0001 is the right way forward, though,
> particularly with the long-term picture..  In the backend, we have one
> caller of fseeko() as of read_binary_file(), so we would never pass
> down a pipe to that.  However, there could be a risk of some silent
> breakages on Windows if some new code relies on that?
>
> There is a total of 11 callers of fseeko() in pg_dump, so rather than
> relying on checkSeek() to see if it actually works, I'd like to think
> that we should have a central policy to make this code more
> bullet-proof in the future.
>

WFM, making fseek() behaviour more resilient seems like a good improvement
overall.

Should we open a new thread to make that part more visible?

Regards,

Juan José Santamaría Flecha


Re: [PATCH] Extend the length of BackgroundWorker.bgw_library_name

2023-03-13 Thread Nathan Bossart
On Mon, Mar 13, 2023 at 07:57:47AM -0700, Yurii Rashkovskii wrote:
> However, there are use cases where [potentially] longer names are
> expected/desired; for example, test benches (where library files may not
> [or can not] be copied to Postgres installation) or alternative library
> installation methods that do not put them into $libdir.
> 
> The patch is backwards-compatible and ensures that bgw_library_name stays
> *at least* as long as BGW_MAXLEN. Existing external code that uses
> BGW_MAXLEN as a length boundary (for example, in `strncpy`) will continue
> to work as expected.

I see that BGW_MAXLEN was originally set to 64 in 2013 (7f7485a) [0], but
was increased to 96 in 2018 (3a4b891) [1].  It seems generally reasonable
to me to increase the length of bgw_library_name further for the use-case
you describe, but I wonder if it'd be better to simply increase BGW_MAXLEN
again.  However, IIUC bgw_library_name is the only field that is likely to
be used for absolute paths, so only increasing that one to MAXPGPATH makes
sense.

[0] 
https://postgr.es/m/CA%2BTgmoYtQQ-JqAJPxZg3Mjg7EqugzqQ%2BZBrpnXo95chWMCZsXw%40mail.gmail.com
[1] https://postgr.es/m/304a21ab-a9d6-264a-f688-912869c0d7c6%402ndquadrant.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Lock mode in ExecMergeMatched()

2023-03-13 Thread Alvaro Herrera
On 2023-Mar-13, Dean Rasheed wrote:

> I'm wondering now if it really matters what lock mode we use here. If
> the point of calling table_tuple_lock() after a concurrent update is
> detected is to prevent more concurrent updates, so that the retry is
> guaranteed to succeed, then wouldn't even LockTupleNoKeyExclusive be
> sufficient in all cases? After all, that does block concurrent updates
> and deletes.

The difference in lock mode should be visible relative to concurrent
transactions that try to SELECT FOR KEY SHARE the affected row.  If you
are updating a row but not changing the key-columns, then a KEY SHARE
against the same tuple should work concurrently without blocking.  If
you *are* changing the key columns, then such a lock should be made to
wait.

DELETE should be exactly equivalent to an update that changes any
columns in the "key".  After all, the point is that the previous key (as
referenced via a FK from another table) is now gone, which happens in
both these operations, but does not happen when an update only touches
other columns.

Two UPDATEs of the same row should always block each other.
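
A two-session illustration of the intended behavior (a sketch, assuming a
table t whose key is the primary key column id):

-- session 1: update that does not touch the key
BEGIN;
UPDATE t SET val = 'x' WHERE id = 1;         -- takes LockTupleNoKeyExclusive

-- session 2: does not block, because the key is unchanged
SELECT * FROM t WHERE id = 1 FOR KEY SHARE;

-- had session 1 instead changed the key, e.g.
--   UPDATE t SET id = 2 WHERE id = 1;       -- takes LockTupleExclusive
-- or run a DELETE, session 2's FOR KEY SHARE would have to wait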


Note that the code to determine which columns are part of the key is not
very careful: IIRC any column part of a unique index is considered part
of the key.  I don't think this has any implications for the discussion
here, but I thought I'd point it out just in case.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: [PATCH] Extend the length of BackgroundWorker.bgw_library_name

2023-03-13 Thread Yurii Rashkovskii
Nathan,

Thank you for your review.

Indeed, my motivation for doing the change this way was that only
bgw_library_name is expected to be longer, whereas length is much less of
a concern for the other fields. If we had increased BGW_MAXLEN, it would
have increased the size of BackgroundWorker for little to no benefit.

On Mon, Mar 13, 2023 at 10:35 AM Nathan Bossart 
wrote:

> On Mon, Mar 13, 2023 at 07:57:47AM -0700, Yurii Rashkovskii wrote:
> > However, there are use cases where [potentially] longer names are
> > expected/desired; for example, test benches (where library files may not
> > [or can not] be copied to Postgres installation) or alternative library
> > installation methods that do not put them into $libdir.
> >
> > The patch is backwards-compatible and ensures that bgw_library_name stays
> > *at least* as long as BGW_MAXLEN. Existing external code that uses
> > BGW_MAXLEN as a length boundary (for example, in `strncpy`) will continue
> > to work as expected.
>
> I see that BGW_MAXLEN was originally set to 64 in 2013 (7f7485a) [0], but
> was increased to 96 in 2018 (3a4b891) [1].  It seems generally reasonable
> to me to increase the length of bgw_library_name further for the use-case
> you describe, but I wonder if it'd be better to simply increase BGW_MAXLEN
> again.  However, IIUC bgw_library_name is the only field that is likely to
> be used for absolute paths, so only increasing that one to MAXPGPATH makes
> sense.
>
> [0]
> https://postgr.es/m/CA%2BTgmoYtQQ-JqAJPxZg3Mjg7EqugzqQ%2BZBrpnXo95chWMCZsXw%40mail.gmail.com
> [1]
> https://postgr.es/m/304a21ab-a9d6-264a-f688-912869c0d7c6%402ndquadrant.com
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


--
http://omnigres.org
Yurii


Re: Bug in jsonb_in function (14 & 15 version are affected)

2023-03-13 Thread Tom Lane
Nikolay Shaplov  writes:
> I found a bug in jsonb_in function (it converts json from sting representation
>  into jsonb internal representation).

Yeah.  Looks like json_lex_string is failing to honor the invariant
that it needs to set token_terminator ... although the documentation
of the function certainly isn't helping.  I think we need the attached.

A nice side benefit is that the error context reports get a lot more
useful --- somebody should have inquired before as to why they were
so bogus.

regards, tom lane

diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index bdfc48cdf5..7a36f74dad 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -675,6 +675,7 @@ report_json_context(JsonLexContext *lex)
 	line_start = lex->line_start;
 	context_start = line_start;
 	context_end = lex->token_terminator;
+	Assert(context_end >= context_start);
 
 	/* Advance until we are close enough to context_end */
 	while (context_end - context_start >= 50)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index e4ff3f3602..4e2a664603 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -697,6 +697,14 @@ json_lex(JsonLexContext *lex)
 
 /*
  * The next token in the input stream is known to be a string; lex it.
+ *
+ * If lex->strval isn't NULL, fill it with the decoded string.
+ * Set lex->token_terminator to the end of the decoded input, and in
+ * success cases, transfer its previous value to lex->prev_token_terminator.
+ * Return JSON_SUCCESS or an error code.
+ *
+ * Note: be careful that all error exits advance lex->token_terminator
+ * to the point after the character we detected the error on.
  */
 static inline JsonParseErrorType
 json_lex_string(JsonLexContext *lex)
@@ -705,6 +713,19 @@ json_lex_string(JsonLexContext *lex)
 	char	   *const end = lex->input + lex->input_length;
 	int			hi_surrogate = -1;
 
+	/* Convenience macros */
+#define FAIL_AT_CHAR_START(code) \
+	do { \
+		lex->token_terminator = s; \
+		return code; \
+	} while (0)
+#define FAIL_AT_CHAR_END(code) \
+	do { \
+		lex->token_terminator = \
+			s + pg_encoding_mblen_bounded(lex->input_encoding, s); \
+		return code; \
+	} while (0)
+
 	if (lex->strval != NULL)
 		resetStringInfo(lex->strval);
 
@@ -715,10 +736,7 @@ json_lex_string(JsonLexContext *lex)
 		s++;
 		/* Premature end of the string. */
 		if (s >= end)
-		{
-			lex->token_terminator = s;
-			return JSON_INVALID_TOKEN;
-		}
+			FAIL_AT_CHAR_START(JSON_INVALID_TOKEN);
 		else if (*s == '"')
 			break;
 		else if (*s == '\\')
@@ -726,10 +744,7 @@ json_lex_string(JsonLexContext *lex)
 			/* OK, we have an escape character. */
 			s++;
 			if (s >= end)
-			{
-lex->token_terminator = s;
-return JSON_INVALID_TOKEN;
-			}
+FAIL_AT_CHAR_START(JSON_INVALID_TOKEN);
 			else if (*s == 'u')
 			{
 int			i;
@@ -739,10 +754,7 @@ json_lex_string(JsonLexContext *lex)
 {
 	s++;
 	if (s >= end)
-	{
-		lex->token_terminator = s;
-		return JSON_INVALID_TOKEN;
-	}
+		FAIL_AT_CHAR_START(JSON_INVALID_TOKEN);
 	else if (*s >= '0' && *s <= '9')
 		ch = (ch * 16) + (*s - '0');
 	else if (*s >= 'a' && *s <= 'f')
@@ -750,10 +762,7 @@ json_lex_string(JsonLexContext *lex)
 	else if (*s >= 'A' && *s <= 'F')
 		ch = (ch * 16) + (*s - 'A') + 10;
 	else
-	{
-		lex->token_terminator = s + pg_encoding_mblen_bounded(lex->input_encoding, s);
-		return JSON_UNICODE_ESCAPE_FORMAT;
-	}
+		FAIL_AT_CHAR_END(JSON_UNICODE_ESCAPE_FORMAT);
 }
 if (lex->strval != NULL)
 {
@@ -763,20 +772,20 @@ json_lex_string(JsonLexContext *lex)
 	if (is_utf16_surrogate_first(ch))
 	{
 		if (hi_surrogate != -1)
-			return JSON_UNICODE_HIGH_SURROGATE;
+			FAIL_AT_CHAR_END(JSON_UNICODE_HIGH_SURROGATE);
 		hi_surrogate = ch;
 		continue;
 	}
 	else if (is_utf16_surrogate_second(ch))
 	{
 		if (hi_surrogate == -1)
-			return JSON_UNICODE_LOW_SURROGATE;
+			FAIL_AT_CHAR_END(JSON_UNICODE_LOW_SURROGATE);
 		ch = surrogate_pair_to_codepoint(hi_surrogate, ch);
 		hi_surrogate = -1;
 	}
 
 	if (hi_surrogate != -1)
-		return JSON_UNICODE_LOW_SURROGATE;
+		FAIL_AT_CHAR_END(JSON_UNICODE_LOW_SURROGATE);
 
 	/*
 	 * Reject invalid cases.  We can't have a value above
@@ -786,7 +795,7 @@ json_lex_string(JsonLexContext *lex)
 	if (ch == 0)
 	{
 		/* We can't allow this, since our TEXT type doesn't */
-		return JSON_UNICODE_CODE_POINT_ZERO;
+		FAIL_AT_CHAR_END(JSON_UNICODE_CODE_POINT_ZERO);
 	}
 
 	/*
@@ -800,7 +809,7 @@ json_lex_string(JsonLexContext *lex)
 		char		cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1];
 
 		if (!pg_unicode_to_server_noerror(ch, (unsigned char *) cbuf))
-			return JSON_UNICODE_UNTRANSLATABLE;
+			FAIL_AT_CHAR_END(JSON_UNICODE_UNTRANSLATABLE);
 		appendStringInfoString(lex->strval, cbuf);
 	}
 #else

Re: meson: Non-feature feature options

2023-03-13 Thread Nathan Bossart
On Mon, Mar 13, 2023 at 07:27:18AM +0100, Peter Eisentraut wrote:
> I have committed it like this.

I noticed that after 6a30027, if you don't have the OpenSSL headers
installed, 'meson setup' will fail:

meson.build:1195:4: ERROR: C header 'openssl/ssl.h' not found

Shouldn't "auto" cause Postgres to be built without OpenSSL if the required
headers are not present?
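
For reference, the semantics I would expect from the tri-state option (a
sketch of the intended behavior):

meson setup build -Dssl=auto     # build without OpenSSL if headers are missing
meson setup build -Dssl=openssl  # hard requirement; setup fails without headers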

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Lock mode in ExecMergeMatched()

2023-03-13 Thread Alvaro Herrera
On 2023-Mar-11, Alexander Korotkov wrote:

> I wonder why does ExecMergeMatched() determine the lock mode using
> ExecUpdateLockMode().  Why don't we use lock mode set by
> table_tuple_update() like ExecUpdate() does?  I skim through the
> MERGE-related threads, but didn't find an answer.
> 
> I also noticed that we use ExecUpdateLockMode() even for CMD_DELETE.
> That ends up by usage of LockTupleNoKeyExclusive for CMD_DELETE, which
> seems plain wrong for me.

I agree that in the case of CMD_DELETE it should not run
ExecUpdateLockMode() --- that part seems like a bug.

As I recall, ExecUpdateLockMode is newer code that should do the same as
table_tuple_update does to determine the lock mode ... and looking at
the code, I see that both do a bms_overlap operation on "columns in the
key" vs. "columns modified", so I'm not sure why you say they would
behave differently.

Thinking about Dean's comment downthread, where an UPDATE could be
turned into a DELETE, I wonder if trying to be selective would lead us
to deadlock, in case a concurrent SELECT FOR KEY SHARE is able to
lock the tuple while we're doing UPDATE, and then lock out the MERGE
when the DELETE is retried.

If this is indeed a problem, then I can think of two ways out:

1. if MERGE contains any DELETE, then always use LockTupleExclusive:
otherwise, use LockTupleNoKeyExclusive.  This is best for concurrency
when MERGE does no delete and the key columns are not modified.

2. always use LockTupleExclusive.  This is easier, but does not allow
MERGE to run concurrently with SELECT FOR KEY SHARE on the same tuples.
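
In C-like pseudocode, option 1 might look like this (a sketch;
mergeActionsContainDelete is an invented helper, ExecUpdateLockMode is the
existing function):

LockTupleMode lockmode;

if (mergeActionsContainDelete(mtstate))    /* hypothetical helper */
    lockmode = LockTupleExclusive;         /* a DELETE may run */
else
    lockmode = ExecUpdateLockMode(estate, resultRelInfo);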

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Reducing connection overhead in pg_upgrade compat check phase

2023-03-13 Thread Nathan Bossart
On Mon, Mar 13, 2023 at 03:10:58PM +0100, Daniel Gustafsson wrote:
> The attached v3 is a rebase to handle conflicts and with the above comments
> adressed.

Thanks for the new version of the patch.

I noticed that git-am complained when I applied the patch:

Applying: pg_upgrade: run all data type checks per connection
.git/rebase-apply/patch:1023: new blank line at EOF.
+
warning: 1 line adds whitespace errors.

+   for (int rowno = 0; rowno < ntups; rowno++)
+   {
+   found = true;

It looks like "found" is set unconditionally a few lines above, so I think
this is redundant.

Also, I think it would be worth breaking check_for_data_types_usage() into
a few separate functions (or doing some other similar refactoring) to
improve readability.  At this point, the function is quite lengthy, and I
count 6 levels of indentation at some lines.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: [PATCH] Support % wildcard in extension upgrade filenames

2023-03-13 Thread Regina Obe
> > I wonder if a solution to this problem might be to provide some kind
> > of a version map file. Let's suppose that the map file is a bunch of
> > lines of the form X Y Z, where X, Y, and Z are version numbers. The
> > semantics could be: we (the extension authors) promise that if you
> > want to upgrade from X to Z, it suffices to run the script that knows
> > how to upgrade from Y to Z.
> > This would address Tom's concern, because if you write a master
> > upgrade script, you have to explicitly declare the versions to which
> > it applies.
> 
> This would just move the problem from having 1968 files to having to write
> 1968 lines in control files, 

1968 lines in one control file is still much nicer than 1968 files, in my
book.
From a packaging standpoint it is also much cleaner, as that single file
gets replaced with each upgrade, with no need to uninstall leftover files
from the prior install.

> We maintain multiple stable branches (5, at the moment: 2.5, 3.0, 3.1,
> 3.2, 3.3) and would like to allow anyone running an old patched version
> to still upgrade to a newer version released earlier.
> 
> --strk;
> 
>   Libre GIS consultant/developer
>   https://strk.kbt.io/services.html

That said, I really would like a wildcard option for the issues strk
mentioned.

I still see the main use case as being for those who micro-version; for
that use case they would need, not necessarily a single upgrade script,
but a script for each minor version.

So something like 

3.2.%--3.4.0 = 3.2--3.4.0

In many cases, they don't even need anything done in an upgrade step,
aside from bumping the extension's version number to match the lib
version.

In our case, ours would be the one below, because we already block
downgrades internally in our scripts:

%--3.4.0 = ANY--3.4.0

Or we could move to:

2.%--3.4.0 = 2--3.4.0
3.%--3.4.0 = 3--3.4.0

Thanks,
Regina








Re: meson: Non-feature feature options

2023-03-13 Thread Nazir Bilal Yavuz
Hi,

On Mon, 13 Mar 2023 at 21:04, Nathan Bossart  wrote:
>
> On Mon, Mar 13, 2023 at 07:27:18AM +0100, Peter Eisentraut wrote:
> > I have committed it like this.
>
> I noticed that after 6a30027, if you don't have the OpenSSL headers
> installed, 'meson setup' will fail:
>
> meson.build:1195:4: ERROR: C header 'openssl/ssl.h' not found
>
> Shouldn't "auto" cause Postgres to be built without OpenSSL if the required
> headers are not present?

Yes, I tested again and it is working as expected on my end. It
shouldn't fail like that unless the 'ssl' option is set to 'openssl'.
Is it possible that it has been set to 'openssl' without you noticing?

Regards,
Nazir Bilal Yavuz
Microsoft




Re: meson: Non-feature feature options

2023-03-13 Thread Nathan Bossart
On Mon, Mar 13, 2023 at 09:57:22PM +0300, Nazir Bilal Yavuz wrote:
> On Mon, 13 Mar 2023 at 21:04, Nathan Bossart  wrote:
>> Shouldn't "auto" cause Postgres to be built without OpenSSL if the required
>> headers are not present?
> 
> Yes, I tested again and it is working as expected on my end. It
> shouldn't fail like that unless the 'ssl' option is set to 'openssl'.
> Is it possible that it has been set to 'openssl' without you noticing?

I do not believe so.  For the test in question, here are the build options
reported in meson-log.txt:

Build Options: -Dlz4=enabled -Dplperl=enabled -Dplpython=enabled 
-Dpltcl=enabled -Dlibxml=enabled -Duuid=ossp -Dlibxslt=enabled -Ddebug=true 
-Dcassert=true -Dtap_tests=enabled -Dwerror=True

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Let libpq reject unexpected authentication requests

2023-03-13 Thread Jacob Champion
On Fri, Mar 10, 2023 at 3:16 PM Jacob Champion  wrote:
> > Could you send a new patch with all these adjustments?  That would
> > help a lot.
>
> Will do!

Here's a v16:
- updated 0001 patch message
- all test names should have commas rather than colons now
- new test for an empty require_auth
- new SSPI suite (note that it doesn't run by default on Cirrus, due
to the use of PG_TEST_USE_UNIX_SOCKETS)
- fixed errant comma at EOL

Thanks,
--Jacob
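
As a usage reminder, connection strings would look like this (a sketch;
host and database names invented):

psql "host=example.com dbname=postgres require_auth=scram-sha-256"  # accept only SCRAM
psql "host=example.com dbname=postgres require_auth=!password"      # anything but plaintext password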
From 8cc020598c7c939e3a45088f18ca04ea801fc87e Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Tue, 18 Oct 2022 16:55:36 -0700
Subject: [PATCH v16 3/3] require_auth: decouple SASL and SCRAM

Rather than assume that an AUTH_REQ_SASL* code refers to SCRAM-SHA-256,
future-proof by separating the single allowlist into a list of allowed
authentication request codes and a list of allowed SASL mechanisms.

The require_auth check is now separated into two tiers. The
AUTH_REQ_SASL* codes are always allowed. If the server sends one,
responsibility for the check then falls to pg_SASL_init(), which
compares the selected mechanism against the list of allowed mechanisms.
(Other SCRAM code is already responsible for rejecting unexpected or
out-of-order AUTH_REQ_SASL_* codes, so that's not explicitly handled
with this addition.)

Since there's only one recognized SASL mechanism, conn->sasl_mechs
currently only points at static hardcoded lists. Whenever a second
mechanism is added, the list will need to be managed dynamically.
---
 src/interfaces/libpq/fe-auth.c| 34 +++
 src/interfaces/libpq/fe-connect.c | 41 +++
 src/interfaces/libpq/libpq-int.h  |  3 +-
 src/test/authentication/t/001_password.pl | 14 +---
 src/test/ssl/t/002_scram.pl   |  6 
 5 files changed, 86 insertions(+), 12 deletions(-)

diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 53c7d30eff..7927aebed8 100644
--- a/src/interfaces/libpq/fe-auth.c
+++ b/src/interfaces/libpq/fe-auth.c
@@ -522,6 +522,40 @@ pg_SASL_init(PGconn *conn, int payloadlen)
 		goto error;
 	}
 
+	/*
+	 * Before going ahead with any SASL exchange, ensure that the user has
+	 * allowed (or, alternatively, has not forbidden) this particular mechanism.
+	 *
+	 * In a hypothetical future where a server responds with multiple SASL
+	 * mechanism families, we would need to instead consult this list up above,
+	 * during mechanism negotiation. We don't live in that world yet. The server
+	 * presents one auth method and we decide whether that's acceptable or not.
+	 */
+	if (conn->sasl_mechs)
+	{
+		const char **mech;
+		bool		found = false;
+
+		Assert(conn->require_auth);
+
+		for (mech = conn->sasl_mechs; *mech; mech++)
+		{
+			if (strcmp(*mech, selected_mechanism) == 0)
+			{
+found = true;
+break;
+			}
+		}
+
+		if ((conn->sasl_mechs_denied && found)
+			|| (!conn->sasl_mechs_denied && !found))
+		{
+			libpq_append_conn_error(conn, "auth method \"%s\" requirement failed: server requested unacceptable SASL mechanism \"%s\"",
+	conn->require_auth, selected_mechanism);
+			goto error;
+		}
+	}
+
 	if (conn->channel_binding[0] == 'r' &&	/* require */
 		strcmp(selected_mechanism, SCRAM_SHA_256_PLUS_NAME) != 0)
 	{
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index cbadb3f6af..a048793b46 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -1258,12 +1258,25 @@ connectOptions2(PGconn *conn)
 	more;
 		bool		negated = false;
 
+		static const uint32 default_methods = (
+			1 << AUTH_REQ_SASL
+			| 1 << AUTH_REQ_SASL_CONT
+			| 1 << AUTH_REQ_SASL_FIN
+		);
+		static const char *no_mechs[] = { NULL };
+
 		/*
-		 * By default, start from an empty set of allowed options and add to
+		 * By default, start from a minimum set of allowed options and add to
 		 * it.
+		 *
+		 * NB: The SASL method codes are always "allowed" here. If the server
+		 * requests SASL auth, pg_SASL_init() will enforce adherence to the
+		 * sasl_mechs list, which by default is empty.
 		 */
 		conn->auth_required = true;
-		conn->allowed_auth_methods = 0;
+		conn->allowed_auth_methods = default_methods;
+		conn->sasl_mechs = no_mechs;
+		conn->sasl_mechs_denied = false;
 
 		for (first = true, more = true; more; first = false)
 		{
@@ -1290,6 +1303,9 @@ connectOptions2(PGconn *conn)
 	 */
 	conn->auth_required = false;
 	conn->allowed_auth_methods = -1;
+
+	/* conn->sasl_mechs is now a list of denied mechanisms. */
+	conn->sasl_mechs_denied = true;
 }
 else if (!negated)
 {
@@ -1334,10 +1350,23 @@ connectOptions2(PGconn *conn)
 			}
 			else if (strcmp(method, "scram-sha-256") == 0)
 			{
-/* This currently assumes that SCRAM is the only SASL method. */
-bits = (1 << AUTH_REQ_SASL);
-bits |= (1 << AUTH_REQ_SASL_CONT);
-bits |= (1 << AUTH_REQ_SASL_FIN);
+static const char *scram_mechs[] = {
+	SCRAM_SHA_256_NAME,
+			

Re: Improve logging when using Huge Pages

2023-03-13 Thread Justin Pryzby
On Thu, Mar 09, 2023 at 03:02:29PM -0500, Stephen Frost wrote:
> * Justin Pryzby (pry...@telsasoft.com) wrote:
> > On Thu, Mar 09, 2023 at 09:34:10AM -0500, Stephen Frost wrote:
> > > * Nathan Bossart (nathandboss...@gmail.com) wrote:
> > > > On Wed, Feb 15, 2023 at 10:13:17AM -0800, Nathan Bossart wrote:
> > > > > On Tue, Feb 14, 2023 at 07:32:56PM -0600, Justin Pryzby wrote:
> > > > >> On Mon, Feb 13, 2023 at 08:18:52PM -0800, Nathan Bossart wrote:
> > > > >>> I'm curious why you chose to make this a string instead of an enum. 
> > > > >>>  There
> > > > >>> might be little practical difference, but since there are only three
> > > > >>> possible values, I wonder if it'd be better form to make it an enum.
> > > > >> 
> > > > >> It takes more code to write as an enum - see 002.txt.  I'm not 
> > > > >> convinced
> > > > >> this is better.
> > > > >> 
> > > > >> But your comment made me fix its , and reconsider the strings,
> > > > >> which I changed to active={unknown/true/false} rather than 
> > > > >> {unk/on/off}.
> > > > >> It could also be active={unknown/yes/no}...
> > > > > 
> > > > > I think unknown/true/false is fine.  I'm okay with using a string if 
> > > > > no one
> > > > > else thinks it should be an enum (or a bool).
> > > > 
> > > > There's been no response for this, so I guess we can proceed with a 
> > > > string
> > > > GUC.
> > > 
> > > Just happened to see this and I'm not really a fan of this being a
> > > string when it's pretty clear that's not what it actually is.
> > 
> > I don't understand what you mean by that.
> > Why do you say it isn't a string ?
> 
> Because it's clearly a yes/no, either the server is currently running
> with huge pages, or it isn't.  That's the definition of a boolean.

I originally implemented it as a boolean, and I changed it in response
to the feedback that postgres -C huge_pages_active should return
"unknown".

> > Is there an agreement to use a function, instead ?

Alvaro was -1 on using a function
Andres is +0 (?)
Nathan is +1
Stephen is +1

I'm -0.5, but I reimplemented it as a function.  I hope that helps it to
progress.  Please include a suggestion if there's better place for the
function or global var.

-- 
Justin
>From cd171da2150e1ee8cfc6ca4bdee0a591df6f566b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 23 Jan 2023 18:33:51 -0600
Subject: [PATCH v5] add pg_huge_pages_active()

This is useful to show the current state of huge pages when
huge_pages=try.  The effective status is not otherwise visible without
OS level tools like gdb or /proc/N/smaps.

https://www.postgresql.org/message-id/flat/tu4pr8401mb1152ebb0d271f827e2e37a01ee...@tu4pr8401mb1152.namprd84.prod.outlook.com
---
 doc/src/sgml/config.sgml|  3 ++-
 doc/src/sgml/func.sgml  | 14 ++
 src/backend/port/sysv_shmem.c   |  2 ++
 src/backend/port/win32_shmem.c  |  2 ++
 src/backend/utils/adt/misc.c| 11 +++
 src/include/catalog/pg_proc.dat |  5 +
 src/include/miscadmin.h |  3 +++
 7 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e5c41cc6c6d..d66f73a494a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1700,23 +1700,24 @@ include_dir 'conf.d'
   
   
   

 Controls whether huge pages are requested for the main shared memory
 area. Valid values are try (the default),
 on, and off.  With
 huge_pages set to try, the
 server will try to request huge pages, but fall back to the default if
 that fails. With on, failure to request huge pages
 will prevent the server from starting up. With off,
-huge pages will not be requested.
+huge pages will not be requested.  The actual state of huge pages is
+indicated by pg_huge_pages_active().

 

 At present, this setting is supported only on Linux and Windows. The
 setting is ignored on other systems when set to
 try.  On Linux, it is only supported when
 shared_memory_type is set to mmap
 (the default).

 

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a5bf2f01b57..4e99d5aff5c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -22516,22 +22516,36 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
 To request information about a specific log file format, supply
 either csvlog, jsonlog or
 stderr as the
 value of the optional parameter. The result is NULL
 if the log format requested is not configured in
 .
 The result reflects the contents of
 the current_logfiles file.

   
 
+  
+   
+
+ pg_huge_pages_active
+
+pg_huge_pages_active ()
+bool
+   
+   
+Reports whether huge pages are in use by the current instance.
+See  for more information.

Re: meson: Non-feature feature options

2023-03-13 Thread Andres Freund
Hi,

On 2023-03-13 11:04:32 -0700, Nathan Bossart wrote:
> On Mon, Mar 13, 2023 at 07:27:18AM +0100, Peter Eisentraut wrote:
> > I have committed it like this.
> 
> I noticed that after 6a30027, if you don't have the OpenSSL headers
> installed, 'meson setup' will fail:
> 
>   meson.build:1195:4: ERROR: C header 'openssl/ssl.h' not found
> 
> Shouldn't "auto" cause Postgres to be built without OpenSSL if the required
> headers are not present?

Yea. I found another thing: When dependency() found something, but the headers
weren't present, ssl_int wouldn't exist.

Maybe something like the attached?

Greetings,

Andres Freund
>From b8bd0200667bac16674e40e769a9fb4bc4f54306 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 13 Mar 2023 13:11:37 -0700
Subject: [PATCH v1] meson: fix openssl detection issues in 6a30027

Reported-by: Nathan Bossart 
Discussion: https://postgr.es/m/20230313180432.GA246741@nathanxps13
---
 meson.build | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/meson.build b/meson.build
index 8208815c96c..2ebdf914c1b 100644
--- a/meson.build
+++ b/meson.build
@@ -1189,23 +1189,29 @@ if sslopt in ['auto', 'openssl']
 
   # via pkg-config et al
   ssl = dependency('openssl', required: false)
+  # only meson >= 0.57 supports declare_dependency() in cc.has_function(), so
+  # we pass cc.find_library() results if necessary
+  ssl_int = []
 
   # via library + headers
   if not ssl.found()
 ssl_lib = cc.find_library('ssl',
   dirs: test_lib_d,
   header_include_directories: postgres_inc,
-  has_headers: ['openssl/ssl.h', 'openssl/err.h'])
+  has_headers: ['openssl/ssl.h', 'openssl/err.h'],
+  required: openssl_required)
 crypto_lib = cc.find_library('crypto',
   dirs: test_lib_d,
-  header_include_directories: postgres_inc)
-ssl_int = [ssl_lib, crypto_lib]
-
-ssl = declare_dependency(dependencies: ssl_int,
- include_directories: postgres_inc)
+  required: openssl_required)
+if ssl_lib.found() and crypto_lib.found()
+  ssl_int = [ssl_lib, crypto_lib]
+  ssl = declare_dependency(dependencies: ssl_int, include_directories: postgres_inc)
+endif
   elif cc.has_header('openssl/ssl.h', args: test_c_args, dependencies: ssl, required: openssl_required) and \
cc.has_header('openssl/err.h', args: test_c_args, dependencies: ssl, required: openssl_required)
 ssl_int = [ssl]
+  else
+ssl = not_found_dep
   endif
 
   if ssl.found()
-- 
2.38.0



Re: meson: Non-feature feature options

2023-03-13 Thread Andres Freund
Hi,

On 2023-03-13 21:57:22 +0300, Nazir Bilal Yavuz wrote:
> On Mon, 13 Mar 2023 at 21:04, Nathan Bossart  wrote:
> >
> > On Mon, Mar 13, 2023 at 07:27:18AM +0100, Peter Eisentraut wrote:
> > > I have committed it like this.
> >
> > I noticed that after 6a30027, if you don't have the OpenSSL headers
> > installed, 'meson setup' will fail:
> >
> > meson.build:1195:4: ERROR: C header 'openssl/ssl.h' not found
> >
> > Shouldn't "auto" cause Postgres to be built without OpenSSL if the required
> > headers are not present?
> 
> Yes, I tested again and it is working as expected on my end. It
> shouldn't fail like that unless the 'ssl' option is set to 'openssl'.
> Is it possible that it has been set to 'openssl' without you noticing?

It worked for the dependency() path, but not the cc.find_library() path. See
the patch I just sent.

Greetings,

Andres Freund




Re: Add LZ4 compression in pg_dump

2023-03-13 Thread Tomas Vondra


On 3/11/23 11:50, gkokola...@pm.me wrote:
> --- Original Message ---
> On Saturday, March 11th, 2023 at 7:00 AM, Alexander Lakhin 
>  wrote:
> 
>> Hello,
>> 23.02.2023 23:24, Tomas Vondra wrote:
>>
>>> On 2/23/23 16:26, Tomas Vondra wrote:
>>>
>>>> Thanks for v30 with the updated commit messages. I've pushed 0001 after
>>>> fixing a comment typo and removing (I think) an unnecessary change in an
>>>> error message.
>>>>
>>>> I'll give the buildfarm a bit of time before pushing 0002 and 0003.
>>>
>>> I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
>>> and marked the CF entry as committed. Thanks for the patch!
>>>
>>> I wonder how difficult would it be to add the zstd compression, so that
>>> we don't have the annoying "unsupported" cases.
>>
>>
>> With the patch 0003 committed, a single warning -Wtype-limits appeared in the
>> master branch:
>> $ CPPFLAGS="-Og -Wtype-limits" ./configure --with-lz4 -q && make -s -j8
>> compress_lz4.c: In function ‘LZ4File_gets’:
>> compress_lz4.c:492:19: warning: comparison of unsigned expression in ‘< 0’ is
>> always false [-Wtype-limits]
>> 492 | if (dsize < 0)
>> |
> 
> Thank you Alexander. Please find attached an attempt to address it.
> 
>> (I wonder, is it accidental that there no other places that triggers
>> the warning, or some buildfarm animals had this check enabled before?)
> 
> I can not answer about the buildfarms. Do you think that adding an explicit
> check for this warning in meson would help? I am a bit uncertain as I think
> that type-limits are included in extra.
> 
> @@ -1748,6 +1748,7 @@ common_warning_flags = [
>'-Wshadow=compatible-local',
># This was included in -Wall/-Wformat in older GCC versions
>'-Wformat-security',
> +  '-Wtype-limits',
>  ]
> 
>>
>> It is not a false positive as can be proved by the 002_pg_dump.pl modified as
>> follows:
>> - program => $ENV{'LZ4'},
>>
>> + program => 'mv',
>>
>> args => [
>>
>> - '-z', '-f', '--rm',
>> "$tempdir/compression_lz4_dir/blobs.toc",
>> "$tempdir/compression_lz4_dir/blobs.toc.lz4",
>> ],
>> },
> 
> Correct, it is not a false positive. The existing testing framework provides
> limited support for exercising error branches. Especially so when those are
> dependent on generated output. 
> 
>> A diagnostic logging added shows:
>> LZ4File_gets() after LZ4File_read_internal; dsize: 18446744073709551615
>>
>> and pg_restore fails with:
>> error: invalid line in large object TOC file
>> ".../src/bin/pg_dump/tmp_check/tmp_test_22ri/compression_lz4_dir/blobs.toc": 
>> ""
> 
> It is a good thing that the restore fails with bad input. Yet it should
> have failed earlier. The attached makes certain it does fail earlier. 
> 

Thanks for the patch.

I did look if there are other places that might have the same issue, and
I think there are - see attached 0002. For example LZ4File_write is
declared to return size_t, but then it also does

if (LZ4F_isError(status))
{
fs->errcode = status;
return -1;
}

That won't work :-(
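
To make the width problem concrete, here is a standalone sketch (not
pg_dump code; the function name is made up):

#include <stdbool.h>
#include <stdio.h>

/*
 * A write callback declared to return size_t cannot signal failure with
 * -1: the value wraps to SIZE_MAX, and a caller's "< 0" check is always
 * false -- exactly what the -Wtype-limits warning above points at.
 */
static size_t
broken_write(size_t size)
{
	bool		failed = true;	/* pretend the compression call failed */

	if (failed)
		return -1;				/* becomes 18446744073709551615 */
	return size;
}

int
main(void)
{
	size_t		n = broken_write(8);

	if (n < 0)					/* always false: size_t is unsigned */
		printf("error\n");
	else
		printf("wrote %zu bytes\n", n);		/* prints SIZE_MAX on failure */
	return 0;
}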

And these issues may not be restricted to lz4 code - Gzip_write is
declared to return size_t, but it does

return gzwrite(gzfp, ptr, size);

and gzwrite returns int. Although, maybe that's correct, because
gzwrite() is "0 on error" so maybe this is fine ...

However, Gzip_read assigns gzread() to size_t, and that does not seem
great. It probably will still trigger the following pg_fatal() because
it'd be very lucky to match the expected 'size', but it's confusing.


I wonder whether CompressorState should use int or size_t for the
read_func/write_func callbacks. I guess no option is perfect, i.e. no
data type will work for all compression libraries we might use (lz4 uses
int while lz4f uses size_t, so there's that).

It's a bit weird the "open" functions return int and the read/write
size_t. Maybe we should stick to int, which is what the old functions
(cfwrite etc.) did.


But I think the actual problem here is that the API does not clearly
define how errors are communicated. I mean, it's nice to return the
value returned by the library function without "mangling" it by
conversion to size_t, but what if the libraries communicate errors in
different way? Some may return "0" while others may return "-1".

I think the right approach is to handle all library errors and not just
let them through. So Gzip_write() needs to check the return value, and
either call pg_fatal() or translate it to an error defined by the API.

For example we might say "returns 0 on error" and then translate all
library-specific errors to that.
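
For instance, a gzip-side wrapper under that convention might look like
this (just a sketch, not committed code; the wrapper name is made up):

#include <zlib.h>

/*
 * gzwrite() reports failure as 0 and success as the byte count, so
 * normalizing it to a "0 means error" API needs only a range check.
 * An lz4f-side wrapper would instead test LZ4F_isError() and likewise
 * map failures to 0.
 */
static size_t
Gzip_write_normalized(gzFile gzfp, const void *ptr, size_t size)
{
	int			written = gzwrite(gzfp, ptr, (unsigned) size);

	if (written <= 0)
		return 0;				/* API rule: 0 means error */
	return (size_t) written;
}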


While looking at the code I realized a couple function comments don't
say what's returned in case of error, etc. So 0004 adds those.

0003 is a couple minor assorted comments/questions:

- Should we move ZLIB_OUT_SIZE/ZLIB_IN_SIZE to compress_gzip.c?

- Why are LZ4 buffer sizes different (ZLIB has both 4kB)?

- I wonder if we actually need LZ4F_HEADER_SIZE_MAX? Is it even possible
for

Re: Transparent column encryption

2023-03-13 Thread Peter Eisentraut

On 12.03.23 01:11, Andres Freund wrote:

Have you done benchmarks of some simple workloads to verify this doesn't cause
slowdowns (when not using encryption, obviously)? printtup.c is a performance
sensitive portion for simple queries, particularly when they return multiple
columns.


The additional code isn't used when column encryption is off, so there 
shouldn't be any impact.






Re: meson: Non-feature feature options

2023-03-13 Thread Nathan Bossart
On Mon, Mar 13, 2023 at 01:13:31PM -0700, Andres Freund wrote:
> On 2023-03-13 11:04:32 -0700, Nathan Bossart wrote:
>> I noticed that after 6a30027, if you don't have the OpenSSL headers
>> installed, 'meson setup' will fail:
>> 
>>  meson.build:1195:4: ERROR: C header 'openssl/ssl.h' not found
>> 
>> Shouldn't "auto" cause Postgres to be built without OpenSSL if the required
>> headers are not present?
> 
> Yea. I found another thing: When dependency() found something, but the headers
> weren't present, ssl_int wouldn't exist.
> 
> Maybe something like the attached?

>  ssl_lib = cc.find_library('ssl',
>dirs: test_lib_d,
>header_include_directories: postgres_inc,
> -  has_headers: ['openssl/ssl.h', 'openssl/err.h'])
> +  has_headers: ['openssl/ssl.h', 'openssl/err.h'],
> +  required: openssl_required)
>  crypto_lib = cc.find_library('crypto',
>dirs: test_lib_d,
> -  header_include_directories: postgres_inc)
> -ssl_int = [ssl_lib, crypto_lib]
> -
> -ssl = declare_dependency(dependencies: ssl_int,
> - include_directories: postgres_inc)
> +  required: openssl_required)
> +if ssl_lib.found() and crypto_lib.found()
> +  ssl_int = [ssl_lib, crypto_lib]
> +  ssl = declare_dependency(dependencies: ssl_int, include_directories: 
> postgres_inc)
> +endif

I was just about to post a patch to set "required" like you have for
ssl_lib and crypto_lib.  It seemed to work alright without the 'if
ssl_lib.found() and crypto_lib.found()' line, but I haven't worked with
these meson files very much, so what you have is probably better form.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Request for comment on setting binary format output per session

2023-03-13 Thread Dave Cramer
Dave Cramer


On Sat, 4 Mar 2023 at 19:39, Dave Cramer  wrote:

>
>
> On Sat, 4 Mar 2023 at 19:06, Tom Lane  wrote:
>
>> Jeff Davis  writes:
>> > On Sat, 2023-03-04 at 18:04 -0500, Dave Cramer wrote:
>> >> Most of the clients know how to decode the builtin types. I'm not
>> >> sure there is a use case for binary encode types that the clients
>> >> don't have a priori knowledge of.
>>
>> > The client could, in theory, have a priori knowledge of a non-builtin
>> > type.
>>
>> I don't see what's "in theory" about that.  There seems plenty of
>> use for binary I/O of, say, PostGIS types.  Even for built-in types,
>> do we really want to encourage people to hard-wire their OIDs into
>> applications?
>>
>
> How does a client read these? I'm pretty narrowly focussed. The JDBC API
> doesn't really have a way to read a non built-in type.  There is a facility
> to read a UDT, but the user would have to provide that transcoder. I guess
> I'm curious how other clients read binary UDT's ?
>
>>
>> I don't see a big problem with driving this off a GUC, but I think
>> it should be a list of type names not OIDs.  We already have plenty
>> of precedent for dealing with that sort of thing; see search_path
>> for the canonical example.  IIRC, there's similar caching logic
>> for temp_tablespaces.
>>
>
> I have no issue with allowing names, OID's were compact, but we could
> easily support both
>

Attached is a preliminary patch that takes a list of OIDs. I'd like to
know if this is going in the right direction.

Next step would be to deal with type names as opposed to OIDs.
This will be a bit more challenging as type names are schema-specific.
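
For illustration, usage might look like this (the OID form follows the
attached patch; the name-based form is hypothetical, per Tom's
search_path suggestion):

-- as in the attached patch: a comma-separated list of type OIDs
SET format_binary = '23, 25';   -- int4, text

-- hypothetical name-based form, resolved the way search_path is:
SET format_binary = 'int4, text, public.mytype';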

Dave

>


0001-Add-a-GUC-format_binary-takes-a-comma-separated-list.patch
Description: Binary data


Re: Improve logging when using Huge Pages

2023-03-13 Thread Stephen Frost
Greetings,

On Mon, Mar 13, 2023 at 21:03 Justin Pryzby  wrote:

> On Thu, Mar 09, 2023 at 03:02:29PM -0500, Stephen Frost wrote:
> > * Justin Pryzby (pry...@telsasoft.com) wrote:
> > > On Thu, Mar 09, 2023 at 09:34:10AM -0500, Stephen Frost wrote:
> > > > * Nathan Bossart (nathandboss...@gmail.com) wrote:
> > > > > On Wed, Feb 15, 2023 at 10:13:17AM -0800, Nathan Bossart wrote:
> > > > > > On Tue, Feb 14, 2023 at 07:32:56PM -0600, Justin Pryzby wrote:
> > > > > >> On Mon, Feb 13, 2023 at 08:18:52PM -0800, Nathan Bossart wrote:
> > > > > >>> I'm curious why you chose to make this a string instead of an
> enum.  There
> > > > > >>> might be little practical difference, but since there are only
> three
> > > > > >>> possible values, I wonder if it'd be better form to make it an
> enum.
> > > > > >>
> > > > > >> It takes more code to write as an enum - see 002.txt.  I'm not
> convinced
> > > > > >> this is better.
> > > > > >>
> > > > > >> But your comment made me fix its , and reconsider the
> strings,
> > > > > >> which I changed to active={unknown/true/false} rather than
> {unk/on/off}.
> > > > > >> It could also be active={unknown/yes/no}...
> > > > > >
> > > > > > I think unknown/true/false is fine.  I'm okay with using a
> string if no one
> > > > > > else thinks it should be an enum (or a bool).
> > > > >
> > > > > There's been no response for this, so I guess we can proceed with
> a string
> > > > > GUC.
> > > >
> > > > Just happened to see this and I'm not really a fan of this being a
> > > > string when it's pretty clear that's not what it actually is.
> > >
> > > I don't understand what you mean by that.
> > > Why do you say it isn't a string ?
> >
> > Because it's clearly a yes/no, either the server is currently running
> > with huge pages, or it isn't.  That's the definition of a boolean.
>
> I originally implemented it as a boolean, and I changed it in response
> to the feedback that postgres -C huge_pages_active should return
> "unknown".


I really don’t see how that’s at all useful.

> > Is there an agreement to use a function, instead ?
>
> Alvaro was -1 on using a function


I don’t entirely get that argument (select thisfunc(); is much worse than
show thisguc; ..?   Also, the former is easier to use with other functions
and such, as you don’t have to do current_setting(‘thisguc’)…).

> Andres is +0 (?)

Kinda felt like this was closer to +0.5 or more, for my part anyway.

> Nathan is +1
> Stephen is +1
>
> I'm -0.5,

Why..?

> but I reimplemented it as a function.

Thanks!

> I hope that helps it to
> progress.  Please include a suggestion if there's better place for the
> function or global var.


Will try to give it a look tomorrow.

Thanks again!

Stephen

>


Re: Transparent column encryption

2023-03-13 Thread Andres Freund
Hi,

On 2023-03-13 21:22:29 +0100, Peter Eisentraut wrote:
> On 12.03.23 01:11, Andres Freund wrote:
> > Have you done benchmarks of some simple workloads to verify this doesn't 
> > cause
> > slowdowns (when not using encryption, obviously)? printtup.c is a 
> > performance
> > sensitive portion for simple queries, particularly when they return multiple
> > columns.
> 
> The additional code isn't used when column encryption is off, so there
> shouldn't be any impact.

It adds branches, and it makes tupledescs wider. In tight spots, such as
printtup, that can hurt, even if the branches aren't ever entered.

Greetings,

Andres Freund




Re: meson: Non-feature feature options

2023-03-13 Thread Nazir Bilal Yavuz
Hi,

On Mon, 13 Mar 2023 at 23:14, Andres Freund  wrote:
>
> Hi,
>
> On 2023-03-13 21:57:22 +0300, Nazir Bilal Yavuz wrote:
> > On Mon, 13 Mar 2023 at 21:04, Nathan Bossart  
> > wrote:
> > >
> > > On Mon, Mar 13, 2023 at 07:27:18AM +0100, Peter Eisentraut wrote:
> > > > I have committed it like this.
> > >
> > > I noticed that after 6a30027, if you don't have the OpenSSL headers
> > > installed, 'meson setup' will fail:
> > >
> > > meson.build:1195:4: ERROR: C header 'openssl/ssl.h' not found
> > >
> > > Shouldn't "auto" cause Postgres to be built without OpenSSL if the 
> > > required
> > > headers are not present?
> >
> > Yes, I tested again and it is working as expected on my end. It
> > shouldn't fail like that unless the 'ssl' option is set to 'openssl'.
> > Is it possible that it has been set to 'openssl' without you noticing?
>
> It worked for the dependency() path, but not the cc.find_library() path. See
> the patch I just sent.

Thanks for the patch, I understand the problem now and your patch fixes this.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: Transparent column encryption

2023-03-13 Thread Andres Freund
Hi,

On 2023-03-13 13:41:19 -0700, Andres Freund wrote:
> On 2023-03-13 21:22:29 +0100, Peter Eisentraut wrote:
> > On 12.03.23 01:11, Andres Freund wrote:
> > > Have you done benchmarks of some simple workloads to verify this doesn't 
> > > cause
> > > slowdowns (when not using encryption, obviously)? printtup.c is a 
> > > performance
> > > sensitive portion for simple queries, particularly when they return 
> > > multiple
> > > columns.
> >
> > The additional code isn't used when column encryption is off, so there
> > shouldn't be any impact.
>
> It adds branches, and it makes tupledescs wider. In tight spots, such as
> printtup, that can hurt, even if the branches aren't ever entered.

In fact, I do see a noticable, but not huge, regression:

$ cat /tmp/test.sql
SELECT * FROM pg_class WHERE oid = 1247;

c=1;taskset -c 10 pgbench -n -M prepared -c$c -j$c -f /tmp/test.sql -P1 -T10

with the server also pinned to core 1, and turbo boost disabled. Nothing else
is allowed to run on the core, or its hyperthread sibling. This is my setup
for comparing performance with the least noise in general, not related to this
patch.

head:  28495.858509 28823.055643 28731.074311
patch: 28298.498851 28285.426532 28489.359569

A ~1.1% loss.

pipelined 50 statements (pgbench pinned to a different otherwise unused core)
head:  1147.404506 1147.587475 1151.976547
patch: 1126.525708 1122.375337 1119.088734

A ~2.2% loss.

That might not be prohibitive, but it does seem worth analyzing.

Greetings,

Andres Freund




Re: pg_usleep for multisecond delays

2023-03-13 Thread Nathan Bossart
On Fri, Mar 10, 2023 at 12:28:28PM -0500, Tom Lane wrote:
> A quick grep for pg_usleep with large intervals finds rather more
> than you touched:
> 
> [...]
> 
> Did you have reasons for excluding the rest of these?

I'm still looking into each of these, but my main concerns were 1) ensuring
latch support had been set up and 2) figuring out the right interrupt
function to use.  Thus far, I don't think latch support is a problem
because InitializeLatchSupport() is called quite early.  However, I'm not
as confident with the interrupt functions yet.  Some of these multisecond
sleeps are done very early before the signal handlers are set up.  Others
are done within the sigsetjmp() block.  And at least one is in a code path
that both the startup process and single-user mode touch.

At the moment, I'm thinking about either removing the check_interrupts
function pointer argument altogether or making it optional for code paths
where we might not want any interrupt handling to run.  In the former
approach, we could simply call WaitLatch() without a latch.  While this
wouldn't do any interrupt handling, we'd still get custom wait event
support and better responsiveness when the postmaster dies, which is
strictly better than what's there today.  And many of these sleeps are in
uncommon or debug paths, so delaying interrupt handling might be okay.  In
the latter approach, we would have interrupt handling, but I'm worried that
would be easy to get wrong.  I think it would be nice to have interrupt
handling if possible, so I'm still (over)thinking about this.
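
For the latch-less variant, a replacement could be as simple as this
(a sketch assuming backend context; the five-second interval and the
reuse of WAIT_EVENT_PG_SLEEP are placeholders, not a proposal):

/* before: pg_usleep(5 * USECS_PER_SEC); */
(void) WaitLatch(NULL,
				 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
				 5000L,					/* timeout in milliseconds */
				 WAIT_EVENT_PG_SLEEP);	/* or a new custom wait event */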

I agree with the rest of your comments.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [BUG] pg_stat_statements and extended query protocol

2023-03-13 Thread David Zhang

It appears you must "make clean; make install" to correctly compile after
applying the patch.

In a git repository, I've learnt to rely on this simple formula, even
if it means extra cycles when running ./configure:
git clean -d -x -f

Thank you all for pointing out that it needs make clean first. After 
running make clean and recompiling with the patch, make check passes 
for both the regression tests and the pg_stat_statements extension. So 
the existing test cases can't really detect any change from this patch, 
and it would be better to add some test cases to cover this.


Best regards,

David






Re: Operation log for major operations

2023-03-13 Thread Dmitry Koval

Kirk, I'm sorry about the long pause in my reply.

>We need some kind of semaphore flag that tells us something awkward
>happened. When it happened, and a little bit of extra information.

I agree that we do not have this kind of information.
Additionally, legitimate events like the start of pg_rewind, 
pg_resetwal, ... are interesting.



>You also make the point that if such things have happened, it would
>probably be a good idea to NOT allow pg_upgrade to run.
>It might even be a reason to constantly bother someone until
>the issue is repaired.

I think there is no reason to forbid running pg_upgrade for the user 
(especially in automatic mode).
If we automatically do NOT allow pg_upgrade, what should the user do to 
make pg_upgrade possible?
Unfortunately, PostgreSQL does not have utilities to correct errors in 
the database (in case of errors, users use copies of the DB or correct 
errors manually).

An ordinary user cannot correct errors on his own ...
So we cannot REQUIRE the user to correct database errors, we can only 
INFORM about them.



>To that point, this feels like a "postgresql_panic.log" file (within
>the postgresql files?)...  Something that would prevent pg_upgrade,
>etc. That everyone recognizes is serious. Especially 3rd party vendors.
>I see the need for such a thing. I have to agree with others about
>questioning the proper place to write this.
>Are there other places that make sense, that you could use, especially
>if knowing it exists means there was a serious issue?

The location of the operation log (like a "postgresql_panic.log") is not 
an easy question.
Our technical support is sure that a large number of failures are 
caused by the "human factor" (actions of database administrators).
It is not difficult for database administrators to delete the 
"postgresql_panic.log" file or edit it (for example, replacing it with 
an old version; a CRC will not protect against such an action).


Therefore, our technical support decided to place the operation log at 
the end of the pg_control file, at an offset of 8192 bytes (and to 
protect this log with a CRC).
Regarding writing to the pg_control file, which worries Tom Lane: the 
information in pg_control is written once at system startup (twice in 
the case of "promote").
Also, some utilities write information to the operation log too - 
pg_resetwal, pg_rewind, pg_upgrade (these utilities already modify the 
pg_control file even without the operation log).


If you are interested, I can attach the current patch (for info - I 
think it makes no sense to offer this patch at the commitfest).


--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: optimize several list functions with SIMD intrinsics

2023-03-13 Thread Nathan Bossart
Thanks for taking a look.

On Sat, Mar 11, 2023 at 09:41:18AM +, Ankit Kumar Pandey wrote:
> 1. In list_member_ptr, will it be okay to bring `const ListCell *cell` from 
> #ifdef USE_NO_SIMD
>   const ListCell *cell;
> #endif
> to #else like as mentioned below? This will make visual separation between 
> #if cases more cleaner

I would expect to see -Wdeclaration-after-statement warnings if we did
this.

> 2. Lots of duplicated
> if (list == NIL) checks before calling  list_member_inline_internal(list, 
> datum)
> Can we not add this check in list_member_inline_internal itself?

We probably could.  I only extracted the NIL checks to simplify the code in
list_member_inline_internal() a bit.

> 3. if (cell)
>   return list_delete_cell(list, cell) in list_delete_int/oid
> can we change if (cell) to if (cell != NULL) as used elsewhere?

Sure.

> 4. list_member_inline_interal_idx , there is typo in name.

Will fix.

> 5. list_member_inline_interal_idx and list_member_inline_internal are 
> practically same with almost 90+ % duplication.
> can we not refactor both into one and allow cell or true/false as return? 
> Although I see usage of const ListCell so this might be problematic.

The idea was to skip finding the exact match if all we care about is
whether the element exists.  This micro-optimization might be negligible,
in which case we could use list_member_inline_internal_idx() for both
cases.

> 6. Loop for (i = 0; i < tail_idx; i += nelem_per_iteration) for Vector32 in 
> list.c looks duplicated from pg_lfind32 (in pg_lfind.h), can we not reuse 
> that?

The list.c version is slightly different because we need to disregard any
matches found in the unused bytes in between the values.  For example, in
an integer List, the integer takes up only 4 bytes of the 8-byte ListCell
(on 64-bit platforms).

typedef union ListCell
{
void   *ptr_value;
int int_value;
Oid oid_value;
TransactionId xid_value;
} ListCell;
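
As a rough illustration of the required filtering, here is a scalar
sketch (not the actual SIMD code; the function name is made up, and it
assumes a 64-bit little-endian build where int_value occupies the first
32-bit lane of each cell):

static bool
list_member_int_lanes(const ListCell *cells, int ncells, int key)
{
	const uint32 *lanes = (const uint32 *) cells;
	int			lanes_per_cell = sizeof(ListCell) / sizeof(uint32);
	int			nlanes = ncells * lanes_per_cell;

	for (int i = 0; i < nlanes; i++)
	{
		/*
		 * A raw lane match only counts if it falls on the int_value
		 * field, i.e. the first lane of a cell; matches "in between"
		 * (the unused upper half of each 8-byte cell) are discarded.
		 */
		if (lanes[i] == (uint32) key && i % lanes_per_cell == 0)
			return true;
	}
	return false;
}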

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: meson: Non-feature feature options

2023-03-13 Thread Andres Freund
Hi,

On 2023-03-13 23:46:41 +0300, Nazir Bilal Yavuz wrote:
> Thanks for the patch, I understand the problem now and your patch fixes this.

Pushed the patch.

Greetings,

Andres Freund




Re: Add LZ4 compression in pg_dump

2023-03-13 Thread Tomas Vondra
Hi Justin,

Thanks for the patch.

On 3/8/23 02:45, Justin Pryzby wrote:
> On Wed, Mar 01, 2023 at 04:52:49PM +0100, Tomas Vondra wrote:
>> Thanks. That seems correct to me, but I find it somewhat confusing,
>> because we now have
>>
>>  DeflateCompressorInit vs. InitCompressorGzip
>>
>>  DeflateCompressorEnd vs. EndCompressorGzip
>>
>>  DeflateCompressorData - The name doesn't really say what it does (would
>>  be better to have a verb in there, I think).
>>
>> I wonder if we can make this somehow clearer?
> 
> To move things along, I updated Georgios' patch:
> 
> Rename DeflateCompressorData() to DeflateCompressorCommon();

Hmmm, I don't find "common" any clearer than "data" :-( There needs to
at least be a comment explaining what "common" does.

> Rearrange functions to their original order allowing a cleaner diff to the 
> prior code;

OK. I wasn't very enthusiastic about this initially, but after thinking
about it a bit I think it's meaningful to make diffs clearer. But I
don't see much difference with/without the patch. The

git diff --diff-algorithm=minimal -w
e9960732a~:src/bin/pg_dump/compress_io.c src/bin/pg_dump/compress_gzip.c

Produces ~25k diff with/without the patch. What am I doing wrong?

> Change pg_fatal() to an assertion+comment;

Yeah, that's reasonable. I'd even ditch the assert/comment, TBH. We
could add such protections against "impossible" stuff to a zillion other
places and the confusion likely outweighs the benefits.

> Update the commit message and fix a few typos;
> 

Thanks. I don't want to annoy you too much, but could you split the
patch into the "empty-data" fix and all the other changes (rearranging
functions etc.)? I'd rather not mix those in the same commit.


>> Also, InitCompressorGzip says this:
>>
>>/*
>> * If the caller has defined a write function, prepare the necessary
>> * state. Avoid initializing during the first write call, because End
>> * may be called without ever writing any data.
>> */
>> if (cs->writeF)
>> DeflateCompressorInit(cs);
>>
>> Does it actually make sense to not have writeF defined in some cases?
> 
> InitCompressor is being called for either reading or writing, either of
> which could be null:
> 
> src/bin/pg_dump/pg_backup_custom.c: ctx->cs = 
> AllocateCompressor(AH->compression_spec,
> src/bin/pg_dump/pg_backup_custom.c-   
>NULL,
> src/bin/pg_dump/pg_backup_custom.c-   
>_CustomWriteFunc);
> --
> src/bin/pg_dump/pg_backup_custom.c: cs = 
> AllocateCompressor(AH->compression_spec,
> src/bin/pg_dump/pg_backup_custom.c-   
>   _CustomReadFunc, NULL);
> 
> It's confusing that the comment says "Avoid initializing...".  What it
> really means is "Initialize eagerly...".  But that makes more sense in
> the context of the commit message for this bugfix than in a comment.  So
> I changed that too.
> 
> +   /* If deflation was initialized, finalize it */   
>
> +   if (cs->private_data) 
>
> +   DeflateCompressorEnd(AH, cs); 
>
> 
> Maybe it'd be more clear if this used "if (cs->writeF)", like in the
> init function ?
> 

Yeah, if the two checks are equivalent, it'd be better to stick to the
same check everywhere.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: Ability to reference other extensions by schema in extension scripts

2023-03-13 Thread Regina Obe
> On Sat, Mar 11, 2023 at 03:18:18AM -0500, Regina Obe wrote:
> > Attached is a revised patch with these changes in place.
> 
> I've given a try to this patch. It builds and regresses fine.
> 
> My own tests also worked fine. As long as ext1 was found in the ext2's
> no_relocate list it could not be relocated, and proper error message is
> given to user trying it.
> 
> Nitpicking, there are a few things that are weird to me:
> 
> 1) I don't get any error/warning if I put an arbitrary string into
> no_relocate (there's no check to verify the no_relocate is a subset of
> the requires).
> 
> 2) An extension can still reference extensions it depends on without
> putting them in no_relocate. This may be intentional, as some
> substitutions may not require blocking relocation, but felt
> inconsistent with the normal @extschema@ which is never replaced
> unless an extension is marked as non-relocatable.
> 
> --strk;
> 
>   Libre GIS consultant/developer
>   https://strk.kbt.io/services.html

Attached is a slightly revised patch to fix the extra whitespace in the
extend.gml 
document that Sandro noted to me.

Thanks,
Regina


0007-Allow-use-of-extschema-reqextname-to-reference.patch
Description: Binary data


Re: Microsecond-based timeouts

2023-03-13 Thread Andres Freund
Hi,

On 2023-03-13 18:23:02 +1300, Thomas Munro wrote:
> One question is whether it'd be better to use nanoseconds instead,
> since the relevant high-resolution primitives use those under the
> covers (struct timespec).  On the other hand, microseconds are a good
> match for our TimestampTz which is the ultimate source of many of our
> timeout decisions.

It's hard to believe we'll need nanosecond sleeps anytime soon, given that
even very trivial syscalls take on the order of 100ns.

It's not like we couldn't add another function for waiting for nanoseconds at
a later point.


> I suppose we could also consider an interface with an absolute timeout
> instead, and then stop thinking about the units so much.

That seems pretty awful to use, and we'd just end up with the same question at
the callsites.

Greetings,

Andres Freund




Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

2023-03-13 Thread Andres Freund
Hi,

On 2023-03-13 17:00:00 +0300, Alexander Lakhin wrote:
> 12.03.2023 10:18, Thomas Munro wrote:
> > And again:
> >
> > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] ==
> > PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsigna...
> >
> > https://cirrus-ci.com/task/6558324615806976
> > https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/002_pg_upgrade_old_node.log
> > https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/crashlog/crashlog-postgres.exe_0974_2023-03-11_13-57-27-982.txt
>
> Here we have duplicate PIDs too:
> ...
> 2023-03-11 13:57:21.277 GMT [2152][client backend] [pg_regress/union][:0]
> LOG:  disconnection: session time: 0:00:00.268 user=SYSTEM
> database=regression host=[local]
> ...
> 2023-03-11 13:57:22.320 GMT [4340][client backend]
> [pg_regress/join][8/947:0] LOG:  statement: set enable_hashjoin to 0;
> TRAP: failed Assert("PMSignalState->PMChildFlags[slot] ==
> PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c", Line:
> 329, PID: 2152
>
> And I see the following code in postmaster.c:
> CleanupBackend(int pid,
>                int exitstatus)    /* child's exit status. */
> {
> ...
>     dlist_foreach_modify(iter, &BackendList)
>     {
>         Backend    *bp = dlist_container(Backend, elem, iter.cur);
>
>         if (bp->pid == pid)
>         {
>             if (!bp->dead_end)
>             {
>                 if (!ReleasePostmasterChildSlot(bp->child_slot))
> ...
>
> so if a backend with the same PID happened to start (but has not reached
> InitProcess() yet) when CleanupBackend() is called to clean up after a
> just-finished backend, the slot of the starting one will be released.

On unix that ought to be unreachable, because we haven't yet reaped the dead
process. But I suspect that there currently is no such guarantee on
windows. Which seems broken.

On windows it looks like pids can't be reused as long as there are handles for
the process. Unfortunately, we close the handle for the process in
pgwin32_deadchild_callback(), which runs in a separate thread, so the pid can
be reused before we get to waitpid(). And thus it can happen while we start
new children.

I think we need to remove the CloseHandle() from
pgwin32_deadchild_callback(). Likely pgwin32_deadchild_callback() shouldn't
do anything other than UnregisterWaitEx(),
PostQueuedCompletionStatus(key=childinfo) and pg_queue_signal(), with
everything else moved to waitpid().


> I am yet to construct a reproduction of the case, but it seems to me that
> the race condition is not impossible here.

I suspect the issue could be made much more likely by adding a sleep before
the pg_queue_signal(SIGCHLD) in pgwin32_deadchild_callback().

Greetings,

Andres Freund




Re: Testing autovacuum wraparound (including failsafe)

2023-03-13 Thread Jacob Champion
On Sat, Mar 11, 2023 at 8:47 PM Peter Geoghegan  wrote:
> I was joking. But I did have a real point: once we have tests for the
> xidStopLimit mechanism, why not take the opportunity to correct the
> long standing issue with the documentation advising the use of single
> user mode?

Does https://commitfest.postgresql.org/42/4128/ address that
independently enough?

--Jacob




Re: meson: Non-feature feature options

2023-03-13 Thread Nathan Bossart
On Mon, Mar 13, 2023 at 02:45:29PM -0700, Andres Freund wrote:
> Pushed the patch.

Thanks for the prompt fix.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Show various offset arrays for heap WAL records

2023-03-13 Thread Melanie Plageman
Thanks for the various perspectives and feedback.

Attached v2 has additional info for xl_btree_vacuum and xl_btree_delete.

I've quoted various emails by various senders below and replied.

On Fri, Jan 27, 2023 at 3:02 PM Robert Haas  wrote:
>
> On Fri, Jan 27, 2023 at 12:24 PM Melanie Plageman
>  wrote:
> > I believe I have addressed this in the attached patch.
>
> I'm not sure what's best in terms of formatting details but I
> definitely like the idea of making pg_waldump show more details. I'd
> even like to have a way to extract the tuple data, when it's
> operations on tuples and we have those tuples in the payload. That'd
> be a lot more verbose than what you are doing here, though, and I'm
> not saying you should go do it right now or anything like that.

If I'm not mistaken, this would be quite difficult without changing
rm_desc to return some kind of self-describing data type.

On Tue, Jan 31, 2023 at 4:52 PM Peter Geoghegan  wrote:
> On Fri, Jan 27, 2023 at 9:24 AM Melanie Plageman
>  wrote:
> > I started documenting it in the rmgr_utils.h header file in a comment,
> > however it may be worth a README?
> >
> > I haven't polished this description of the format (or added examples,
> > etc) or used it in the btree-related functions because I assume the
> > format and helper function API will need more discussion.
>
> I think that standardization is good, but ISTM that we need clarity on
> what the scope is -- what is *not* being standardized? It may or may
> not be useful to call the end result an API. Or it may not make sense
> to do so in the first committed version, even though we may ultimately
> end up as something that deserves to be called an API. The obligation
> to not break tools that are scraping the output in whatever way seems
> kind of onerous right now -- just not having any gratuitous
> inconsistencies (e.g., fixing totally inconsistent punctuation, making
> the names for fields across WAL records consistent when they serve
> exactly the same purpose) would be a big improvement.

So, we can scrap any README or big comment, but are there other changes
to the code or structure you think would avoid it being seen as an
API?

> As I mentioned in passing already, I actually don't think that the
> B-Tree WAL records are all that special, as far as this stuff goes.
> For example, the DELETE Btree record type is very similar to heapam's
> PRUNE record type, and practically identical to Btree's VACUUM record
> type. All of these record types use the same basic conventions, like a
> snapshotConflictHorizon field for recovery conflicts (which is
> generated in a very similar way during original execution, and
> processed in precisely the same way during REDO), and arrays of page
> offset numbers sorted in ascending order.
>
> There are some remaining details where things from an index AM WAL
> record aren't directly analogous (or pretty much identical) to some
> other heapam WAL records, such as the way that the DELETE Btree record
> type deals with deleting a subset of TIDs from a posting list index
> tuple (generated by B-Tree deduplication). But even these exceptions
> don't require all that much discussion. You could either choose to
> only display the array of deleted index tuple page offset numbers, as
> well as the similar array of "updated" index tuple page offset numbers
> from xl_btree_delete, in which case you just display two arrays of
> page offset numbers, in the same standard way. You may or may not want
> to also show each individual xl_btree_update entry -- doing so would
> be kinda like showing the details of individual freeze plans, except
> that you'd probably display something very similar to the page offset
> number display here too (even though these aren't page offset numbers,
> they're 0-based offsets into the posting list's item pointer data
> array).

I have added detail to xl_btree_delete and xl_btree_vacuum. I have added
the updated/deleted target offset numbers and the updated tuples
metadata.

I wondered if there was any reason to show xl_btree_dedup deduplication
intervals.

> BTW, there is also a tendency for non-btree index AM WAL records to be
> fairly similar or even near-identical to the B-Tree WAL records. While
> Hash indexes are very different to B-Tree indexes at a high level, it
> is nevertheless the case that xl_hash_vacuum_one_page is directly
> based on xl_btree_delete/xl_btree_vacuum, and that xl_hash_insert is
> directly based on xl_btree_insert. There are some other WAL record
> types that are completely different across hash and B-Tree, which is a
> reflection of the fact that the index grows using a totally different
> approach in each AM -- but that doesn't seem like something that
> throws up any roadblocks for you (these can all be displayed as simple
> structs anyway).

I chose not to take on any other index types until I saw if this was viable.

> > Perhaps there should also be example output of the offset arrays in
> > pgwalinspect docs?
>

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

2023-03-13 Thread Nathan Bossart
>   * NOTE: although the delay is specified in microseconds, the effective
> - * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
> - * the requested delay to be rounded up to the next resolution boundary.
> + * resolution is only 1/HZ on systems that use periodic kernel ticks to wake
> + * up.  This may cause sleeps to be rounded up by 1-20 milliseconds on older
> + * Unixen and Windows.

nitpick: Could the 1/HZ versus 20 milliseconds discrepancy cause confusion?
Otherwise, I think this is the right idea.

> + * CAUTION: if interrupted by a signal, this function will return, but its
> + * interface doesn't report that.  It's not a good idea to use this
> + * for long sleeps in the backend, because backends are expected to respond 
> to
> + * interrupts promptly.  Better practice for long sleeps is to use 
> WaitLatch()
> + * with a timeout.

I'm not sure this argument follows.  If pg_usleep() returns if interrupted,
then why are we concerned about delayed responses to interrupts?

> - delay.tv_usec = microsec % 1000000L;
> - (void) select(0, NULL, NULL, NULL, &delay);
> + delay.tv_nsec = (microsec % 1000000L) * 1000;
> + (void) nanosleep(&delay, NULL);

Using nanosleep() seems reasonable to me.
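
One property worth noting: unlike the select() call, nanosleep() reports
the unslept remainder, so a caller that really needed the full interval
despite signals could loop. That is not what pg_usleep() does -- per the
CAUTION above it deliberately returns early -- just the standard POSIX
idiom, sketched here:

#include <errno.h>
#include <time.h>

struct timespec delay = {0, 500 * 1000 * 1000};		/* 500 ms */

while (nanosleep(&delay, &delay) == -1 && errno == EINTR)
	continue;			/* 'delay' now holds only the remaining time */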

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: lz4 --rm on Ubuntu 18.04 (Add LZ4 compression to pg_dump)

2023-03-13 Thread Tomas Vondra


On 3/9/23 19:00, Tomas Vondra wrote:
> 
> 
> On 3/9/23 01:30, Michael Paquier wrote:
>> On Thu, Mar 09, 2023 at 12:39:08AM +0100, Tomas Vondra wrote:
>>> IMO we should fix that. We have a bunch of buildfarm members running on
>>> Ubuntu 18.04 (or older) - it's true none of them seems to be running TAP
>>> tests. But considering how trivial the fix is ...
>>>
>>> Barring objections, I'll push a fix early next week.
>>
>> +1, better to change that, thanks.  Actually, would --rm be OK even on
>> Windows?  As far as I can see, the CI detects a LZ4 command for the
>> VS2019 environment but not the liblz4 libraries that would be needed
>> to trigger the set of tests.
> 
> Thanks for noticing that. I'll investigate next week.
> 

So, here's a fix that should (I think) replace the 'lz4 --rm' with a
simple 'rm'. I have two doubts about this, though:


1) I haven't found a simple way to inject additional command into the
test. The pg_dump runs have a predefined list of "steps" to run:

  -- compress_cmd
  -- glob_patterns
  -- command_like
  -- restore_cmd

and I don't think there's a good place to inject the 'rm' so I ended up
adding a 'cleanup_cmd' right after 'compress_cmd'. But it seems a bit
strange / hacky. Maybe there's a better way?


2) I wonder if Windows will know what 'rm' means. I haven't found any
TAP test doing 'rm' and don't see 'rm' in any $ENV either.


That being said, I have no idea how to make this work on our Windows CI.
As mentioned, the environment is missing the lz4 library - there's a

  setup_additional_packages_script: |
REM choco install -y --no-progress ...

in the .yml file, but AFAICS the chocolatey does not have lz4 :-/


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 9c354213ce..b3dcf6ff6d 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -178,11 +178,18 @@ my %pgdump_runs = (
 		compress_cmd => {
 			program => $ENV{'LZ4'},
 			args=> [
-'-z', '-f', '--rm',
+'-z', '-f',
 "$tempdir/compression_lz4_dir/blobs.toc",
 "$tempdir/compression_lz4_dir/blobs.toc.lz4",
 			],
 		},
+		# remove the source (uncompressed) .toc file
+		cleanup_cmd => {
+			program => 'rm',
+			args=> [
+"$tempdir/compression_lz4_dir/blobs.toc",
+			],
+		},
 		# Verify that data files were compressed
 		glob_patterns => [
 			"$tempdir/compression_lz4_dir/toc.dat",
@@ -4274,6 +4281,20 @@ foreach my $run (sort keys %pgdump_runs)
 		command_ok(\@full_compress_cmd, "$run: compression commands");
 	}
 
+	if ($pgdump_runs{$run}->{cleanup_cmd})
+	{
+		my ($cleanup_cmd) = $pgdump_runs{$run}->{cleanup_cmd};
+		my $cleanup_program = $cleanup_cmd->{program};
+
+		# Skip the rest of the test if the cleanup program is
+		# not defined.
+		next if (!defined($cleanup_program) || $cleanup_program eq '');
+
+		my @full_cleanup_cmd =
+		  ($cleanup_cmd->{program}, @{ $cleanup_cmd->{args} });
+		command_ok(\@full_cleanup_cmd, "$run: cleanup commands");
+	}
+
 	if ($pgdump_runs{$run}->{glob_patterns})
 	{
 		my $glob_patterns = $pgdump_runs{$run}->{glob_patterns};


Re: Testing autovacuum wraparound (including failsafe)

2023-03-13 Thread Peter Geoghegan
On Mon, Mar 13, 2023 at 3:25 PM Jacob Champion  wrote:
> Does https://commitfest.postgresql.org/42/4128/ address that
> independently enough?

I wasn't aware of that patch. It looks like it does exactly what I was
arguing in favor of. So yes.

-- 
Peter Geoghegan




Re: Combine pg_walinspect till_end_of_wal functions with others

2023-03-13 Thread Michael Paquier
On Mon, Mar 13, 2023 at 03:53:37PM +0530, Bharath Rupireddy wrote:
> On Mon, Mar 13, 2023 at 12:26 PM Michael Paquier  wrote:
>> +-- Make sure checkpoints don't interfere with the test.
>> +SELECT 'init' FROM 
>> pg_create_physical_replication_slot('regress_pg_walinspect_slot', true, 
>> false);
>>
>> Adding a physical slot is better for stability of course, but the test
>> also has to drop it or installcheck would cause an existing cluster to
>> have that still around.  The third argument could be true, as well, so
>> as we'd use a temporary slot.
> 
> # Disabled because these tests require "wal_level=replica", which
> # some installcheck users do not have (e.g. buildfarm clients).
> NO_INSTALLCHECK = 1
> 
> pg_walinspect can't be run under installcheck. I don't think dropping
> the slot at the end is needed, it's unnecessary. I saw
> oldextversions.sql using the same replication slot name, well no
> problem, but I changed it to a unique name.

-SELECT pg_drop_replication_slot('regress_pg_walinspect_slot');
-
+-- Clean up

In my opinion, it is an incorrect practice to assume that nobody will
ever run these tests on a running instance.  FWIW, I have managed
QE/QA flows in the past that did exactly that.  I cannot speak for
already-deployed clusters that could be used for production, but I
don't feel comfortable with the idea of assuming that nobody would
ever do that, and calls of pg_drop_replication_slot() are not a
bottleneck.  So let's be clean and drop these slots to keep the tests
self-contained.  pg_walinspect in REL_15_STABLE gets that right, IMV,
and that's no different from the role cleanup, as one example.

> As hackers we know that these functions have been removed and how to
> achieve till_end_of_wal with the other functions. I noticed that
> you've removed my changes (see below) from the docs that were saying
> how to get info/stats till_end_of_wal. That leaves end users confused
> as to how they can achieve till_end_of_wal functionality. All users
> can't look for commit history/message but they can easily read the
> docs. I prefer to have the following (did so in the attached v7) and
> get rid of the above note if you don't feel strongly about it.
> 
> + If a future end_lsn
> +  (i.e. an LSN the server doesn't know about) is specified, it returns
> +  information till the end of WAL.

FWIW, I don't see a strong need for that, because this documents a
behavior where we would not fail.  And FWIW, it just feels natural to
me because the process stops the scan up to where it can.  In short,
it should be enough for the docs to mention the error patterns,
nothing else.

> I have some comments and fixed them in the attached v7 patch:
> 
> 1.
> + * pg_get_wal_records_info
>   *
> + * pg_get_wal_stats
>   *
> I think you wanted to be consistent with function comments with
> function names atop, but missed adding for all functions. Actually, I
> don't have a strong opinion on these changes as they unnecessarily
> bloat the changes, so I removed them.

Either is fine if you feel strongly on this one, I am just used to
doing that.

> 2.
> +if (start_lsn > curr_lsn)
>  ereport(ERROR,
>  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> - errmsg("cannot accept future start LSN"),
> - errdetail("Last known WAL LSN on the database system
> is at %X/%X.",
> -   LSN_FORMAT_ARGS(curr_lsn;
> -}
> + errmsg("WAL start LSN must be less than current LSN")));
> 
> I don't like this inconsistency much, especially when
> pg_get_wal_record_info emits "cannot accept future input LSN" with the
> current LSN details (this current LSN will give a bit more information
> to the user). Also, let's be consistent across returning NULLs if
> input LSN/start LSN equal to the current LSN. I've done these changes
> in the attached v7 patch.

No arguments against that, consistency is good.

> 3. I wanted COUNT(*) >= 0 for successful function execution to be
> COUNT(*) >= 1 so that we will check for at least the functions
> returning 1 record. And failures to be SELECT * FROM. This was my
> intention but I don't see that in this patch or in the previous
> test-refactoring commit. I added that in the attached v7 patch again.
> Also, made test comments consistent.

Noticed that as well, still it feels to me that these had better be
separated from the rest, and be in their own patch, perhaps *after*
the main patch discussed on this thread, or just moved into their own
threads.  If a commit finishes with a list of bullet points referring
to a list of completely different things than the subject, there may
be a problem.  In this v7, we have:
- Change the behavior of the functions for end LSNs, tweaking the
tests to do so.
- Adjust more comments and formats in the tests.
- Adjust some tests to be pickier with detection of generated WAL
records.
- Remove the drop slot calls.
But what we need to care most here is the first point.

I am not argu

ICU 54 and earlier are too dangerous

2023-03-13 Thread Jeff Davis


In ICU 54 and earlier, if ucol_open() is unable to find a matching
locale, it will fall back to the *environment*.

Using ICU 54:

  initdb -D data -N --locale="en_US.UTF-8"
  pg_ctl -D data -l logfile start
  psql postgres -c "create collation asdf(provider=icu, locale='asdf')"
  # returns true
  psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"
  psql postgres -c "alter system set lc_messages='C'"
  pg_ctl -D data -l logfile restart
  # returns false and warns about collation version mismatch
  psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"

This was fixed in ICU 55 to fall back to the root locale instead[1],
which is stable, has a collator version, and is not dependent on the
environment. As far as I can tell, 55 and later never fall back to the
environment when opening a collator (unless you explicitly pass NULL to
ucol_open(), which is documented).

It would be nice if we could detect when this fallback-to-environment
happens, so that we could just refuse to create the bogus collation.
But I didn't find a good way. There are non-error return codes from
ucol_open() that seem promising[2], but they aren't actually very
useful to distinguish the fallback-to-environment case as far as I can
tell.
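
For reference, the sort of probe I was using looks roughly like the
standalone sketch below (probe_locale is just a throwaway name of mine,
not proposed code).  It opens a collator and prints both the status
code and the locale ICU reports through ucol_getLocaleByType(); as said
above, neither reliably gives away the environment fallback on 54 as
far as I can tell.

#include <stdio.h>
#include <unicode/ucol.h>
#include <unicode/uloc.h>

static void
probe_locale(const char *wanted)
{
	UErrorCode	status = U_ZERO_ERROR;
	UCollator  *coll = ucol_open(wanted, &status);

	if (U_FAILURE(status) || coll == NULL)
	{
		printf("%s: open failed: %s\n", wanted, u_errorName(status));
		return;
	}

	/* warnings such as U_USING_DEFAULT_WARNING may be left in status */
	printf("%s: status=%s, actual locale=%s\n",
		   wanted, u_errorName(status),
		   ucol_getLocaleByType(coll, ULOC_ACTUAL_LOCALE, &status));
	ucol_close(coll);
}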

Unless someone has a better idea, I think we need to bump the minimum
required ICU version to 55. That would solve the issue in v16 and
later, but those using old versions of ICU and old versions of postgres
would still be vulnerable to these kinds of typos.

Regards,
Jeff Davis


[1] https://icu.unicode.org/download/55m1
[2]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utypes_8h.html#a3343c1c8a8377277046774691c98d78c




Re: pg_dump versus hash partitioning

2023-03-13 Thread Tom Lane
Julien Rouhaud  writes:
> On Sun, Mar 12, 2023 at 03:46:52PM -0400, Tom Lane wrote:
>> The trick is to detect in pg_restore whether pg_dump chose to do
>> load-via-partition-root.

> Given that this approach wouldn't help with existing dump files (at least if
> using COPY, in any case the one using INSERT are doomed), so I'm slightly in
> favor of the first approach, and later add an easy and non magic incantation
> way to produce dumps that don't depend on partitioning.

Yeah, we need to do both.  Attached find an updated patch series:

0001: TAP test that exhibits both this deadlock problem and the
different-hash-codes problem.  I'm not sure if we want to commit
this, or if it should be in exactly this form --- the second set
of tests with a manual --load-via-partition-root switch will be
pretty redundant after this patch series.

0002: Make pg_restore detect load-via-partition-root by examining the
COPY commands embedded in the dump, and skip the TRUNCATE if so,
thereby fixing the deadlock issue.  This is the best we can do for
legacy dump files, I think, but it should be good enough.  (The shape
of that check is sketched below, after the patch list.)

0003: Also detect load-via-partition-root by adding a label in the
dump.  This is a more bulletproof solution going forward.

0004-0006: same as previous patches, but rebased over these.
This gets us to a place where the new TAP test passes.
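
In miniature, the 0002 detection amounts to something like this sketch
(looks_like_load_via_root is a made-up name, and the real code has to
parse quoted identifiers and schema qualification properly; this only
shows the shape of the test):

#include <stdbool.h>
#include <string.h>

static bool
looks_like_load_via_root(const char *copyStmt, const char *tableName)
{
	const char *p;
	const char *dot;

	/* copyStmt looks like "COPY public.tab (a, b) FROM stdin;\n" */
	if (copyStmt == NULL || strncmp(copyStmt, "COPY ", 5) != 0)
		return false;

	p = copyStmt + 5;
	dot = strchr(p, '.');
	if (dot != NULL)
		p = dot + 1;			/* crudely skip any schema qualification */

	/* if the COPY target isn't this entry's own table, it's the root */
	return strncmp(p, tableName, strlen(tableName)) != 0;
}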

I've not done anything about modifying the documentation, but I still
think we could remove the warning label on --load-via-partition-root.

regards, tom lane

From 33acd81f90ccd1ebdd75fbdf58024edb8f10f6a1 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Mon, 13 Mar 2023 19:01:42 -0400
Subject: [PATCH v2 1/6] Create a test case for dump-and-restore of hashed-enum
 partitioning.

I'm not sure we want to commit this, at least not in this form,
but it's able to demonstrate both the deadlock problem (when using
--load-via-partition-root) and the change-of-hash-code problem
(when not).
---
 src/bin/pg_dump/meson.build   |  1 +
 src/bin/pg_dump/t/004_pg_dump_parallel.pl | 66 +++
 2 files changed, 67 insertions(+)
 create mode 100644 src/bin/pg_dump/t/004_pg_dump_parallel.pl

diff --git a/src/bin/pg_dump/meson.build b/src/bin/pg_dump/meson.build
index ab4c25c781..b2fb7ac77f 100644
--- a/src/bin/pg_dump/meson.build
+++ b/src/bin/pg_dump/meson.build
@@ -96,6 +96,7 @@ tests += {
   't/001_basic.pl',
   't/002_pg_dump.pl',
   't/003_pg_dump_with_server.pl',
+  't/004_pg_dump_parallel.pl',
   't/010_dump_connstr.pl',
 ],
   },
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
new file mode 100644
index 00..c3f7d20b13
--- /dev/null
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -0,0 +1,66 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $dbname1 = 'regression_src';
+my $dbname2 = 'regression_dest1';
+my $dbname3 = 'regression_dest2';
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+my $backupdir = $node->backup_dir;
+
+$node->run_log([ 'createdb', $dbname1 ]);
+$node->run_log([ 'createdb', $dbname2 ]);
+$node->run_log([ 'createdb', $dbname3 ]);
+
+$node->safe_psql(
+	$dbname1,
+	qq{
+create type digit as enum ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9');
+create table t0 (en digit, data int) partition by hash(en);
+create table t0_p1 partition of t0 for values with (modulus 3, remainder 0);
+create table t0_p2 partition of t0 for values with (modulus 3, remainder 1);
+create table t0_p3 partition of t0 for values with (modulus 3, remainder 2);
+insert into t0 select (x%10)::text::digit, x from generate_series(1,1000) x;
+	});
+
+$node->command_ok(
+	[
+		'pg_dump', '-Fd', '--no-sync', '-j2', '-f', "$backupdir/dump1",
+		$node->connstr($dbname1)
+	],
+	'parallel dump');
+
+$node->command_ok(
+	[
+		'pg_restore', '-v',
+		'-d', $node->connstr($dbname2),
+		'-j3',"$backupdir/dump1"
+	],
+	'parallel restore');
+
+$node->command_ok(
+	[
+		'pg_dump', '-Fd', '--no-sync', '-j2', '-f', "$backupdir/dump2",
+		'--load-via-partition-root', $node->connstr($dbname1)
+	],
+	'parallel dump via partition root');
+
+$node->command_ok(
+	[
+		'pg_restore', '-v',
+		'-d', $node->connstr($dbname3),
+		'-j3',"$backupdir/dump2"
+	],
+	'parallel restore via partition root');
+
+done_testing();
-- 
2.31.1

From 685b4a1c0aaa59b2b6047051d082838b88528587 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Mon, 13 Mar 2023 19:05:05 -0400
Subject: [PATCH v2 2/6] Avoid TRUNCATE when restoring load-via-partition-root
 data.

This fixes the deadlock problem, allowing 004_pg_dump_parallel.pl to
succeed in the test where there's a manual --load-via-partition-root
switch.  It relies on the dump having used COPY commands, though.
---
 src/bin/pg_dump/pg_backup_archiver.c | 61 

Re: pg_dump/pg_restore: Fix stdin/stdout handling of custom format on Win32

2023-03-13 Thread Michael Paquier
On Mon, Mar 13, 2023 at 05:49:41PM +0100, Juan José Santamaría Flecha wrote:
> WFM, making fseek() behaviour more resilient seems like a good improvement
> overall.

I have not looked in details, but my guess would be to add a
win32seek.c similar to win32stat.c with a port of fseek() that's more
resilient to the definitions in POSIX.
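
Something in this shape, perhaps (untested sketch, and pgwin32_fseek is
a placeholder name): fail with ESPIPE when the underlying handle is not
a seekable disk file, as POSIX requires, instead of letting fseek()
"succeed" on a pipe.

#include <windows.h>
#include <io.h>
#include <stdio.h>
#include <errno.h>

int
pgwin32_fseek(FILE *stream, long offset, int origin)
{
	HANDLE		h = (HANDLE) _get_osfhandle(_fileno(stream));

	/* pipes and character devices are not seekable */
	if (h == INVALID_HANDLE_VALUE || GetFileType(h) != FILE_TYPE_DISK)
	{
		errno = ESPIPE;
		return -1;
	}
	return fseek(stream, offset, origin);
}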

> Should we open a new thread to make that part more visible?

Yes, perhaps it makes sense to do so to attract the correct audience.
There may be a few things we are missing.

When it comes to pg_dump, both fixes are required, still it seems to
me that adjusting the fstat() port and the fseek() ports are two
different bugs, as they influence different parts of the code base
when taken individually (aka this fseek() port for WIN32 would need
fstat() to properly detect a pipe, as far as I understand).

Meanwhile, I'll go apply and backpatch 0001 to fix the first bug at
hand with the fstat() port, if there are no objections.
--
Michael




Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

2023-03-13 Thread Thomas Munro
On Tue, Mar 14, 2023 at 11:20 AM Andres Freund  wrote:
> On windows it looks like pids can't be reused as long as there are handles for
> the process. Unfortunately, we close the handle for the process in
> pgwin32_deadchild_callback(), which runs in a separate thread, so the pid can
> be reused before we get to waitpid(). And thus it can happen while we start
> new children.

Ahhh.  Right, of course.  The handle thing makes total sense now that
you point it out, and although I couldn't find it in the fine manual,
a higher authority has it in black and white[1].  Even without knowing
which of those calls is releasing the process table entry, we're doing
all of them on the wrong side of that IOCP.  Alright, here is a patch
to schlep most of that code over into waitpid(), where it belongs.

[1] https://devblogs.microsoft.com/oldnewthing/20110107-00/?p=11803
From cd83520c7d9ca214c98b9c0d4c74c9371fb0 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 14 Mar 2023 11:53:26 +1300
Subject: [PATCH] Fix waitpid() emulation on Windows.

Unix doesn't allow a PID to be recycled for a new process until after
the earlier process is "reaped" by waitpid() (or similar functions). Our
emulation didn't guarantee that, and the postmaster could finish up
tracking multiple children with the same PID, and get very confused.

Fix, by moving the guts of pgwin32_deadchild_callback() into waitpid(),
so that resources are released synchronously.  The process and PID
continue to exist until we close the process handle, when we're ready to
adjust our own bookkeeping of running children.

This seems to explain a couple of failures on CI.  It had never been
reported before, despite the code being as old as the Windows port.
Perhaps Windows started recycling PIDs more rapidly, or perhaps timing
changes due to commit 7389aad6 made it more likely to break.

Diagnosed-by: Andres Freund 
Discussion: https://postgr.es/m/20230208012852.bvkn2am4h4iqjogq%40awork3.anarazel.de
---
 src/backend/postmaster/postmaster.c | 65 -
 1 file changed, 36 insertions(+), 29 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2552327d90..3da18ab6ac 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -6421,36 +6421,24 @@ ShmemBackendArrayRemove(Backend *bn)
 static pid_t
 waitpid(pid_t pid, int *exitstatus, int options)
 {
+	win32_deadchild_waitinfo *childinfo;
+	DWORD		exitcode;
 	DWORD		dwd;
 	ULONG_PTR	key;
 	OVERLAPPED *ovl;
 
-	/*
-	 * Check if there are any dead children. If there are, return the pid of
-	 * the first one that died.
-	 */
-	if (GetQueuedCompletionStatus(win32ChildQueue, &dwd, &key, &ovl, 0))
+	Assert(pid == -1);
+	Assert(options == WNOHANG);
+
+	/* Try to consume one win32_deadchild_waitinfo from the queue. */
+	if (!GetQueuedCompletionStatus(win32ChildQueue, &dwd, &key, &ovl, 0))
 	{
-		*exitstatus = (int) key;
-		return dwd;
+		errno = EAGAIN;
+		return -1;
 	}
 
-	return -1;
-}
-
-/*
- * Note! Code below executes on a thread pool! All operations must
- * be thread safe! Note that elog() and friends must *not* be used.
- */
-static void WINAPI
-pgwin32_deadchild_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
-{
-	win32_deadchild_waitinfo *childinfo = (win32_deadchild_waitinfo *) lpParameter;
-	DWORD		exitcode;
-
-	if (TimerOrWaitFired)
-		return;	/* timeout. Should never happen, since we use
- * INFINITE as timeout value. */
+	childinfo = (win32_deadchild_waitinfo *) key;
+	pid = childinfo->procId;
 
 	/*
 	 * Remove handle from wait - required even though it's set to wait only
@@ -6466,13 +6454,11 @@ pgwin32_deadchild_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
 		write_stderr("could not read exit code for process\n");
 		exitcode = 255;
 	}
-
-	if (!PostQueuedCompletionStatus(win32ChildQueue, childinfo->procId, (ULONG_PTR) exitcode, NULL))
-		write_stderr("could not post child completion status\n");
+	*exitstatus = exitcode;
 
 	/*
-	 * Handle is per-process, so we close it here instead of in the
-	 * originating thread
+	 * Close the process handle.  Only after this point can the PID be
+	 * recycled by the kernel.
 	 */
 	CloseHandle(childinfo->procHandle);
 
@@ -6482,7 +6468,28 @@ pgwin32_deadchild_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
 	 */
 	free(childinfo);
 
-	/* Queue SIGCHLD signal */
+	return pid;
+}
+
+/*
+ * Note! Code below executes on a thread pool! All operations must
+ * be thread safe! Note that elog() and friends must *not* be used.
+ */
+static void WINAPI
+pgwin32_deadchild_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
+{
+	/* Should never happen, since we use INFINITE as timeout value. */
+	if (TimerOrWaitFired)
+		return;
+
+	/* Post the win32_deadchild_waitinfo object for waitpid() to deal with. */
+	if (!PostQueuedCompletionStatus(win32ChildQueue,
+	0,
+	(ULONG_PTR) lpParameter,
+	NULL))
+		write_stderr("could not post child completion status\n");

Re: Allow logical replication to copy tables in binary format

2023-03-13 Thread Peter Smith
Here are some review comments for patch v12-0001

==
General

1.
There is no new test code. Are we sure that there are already
sufficient TAP tests doing binary testing with/without copy_data and
covering all the necessary combinations?

==
Commit Message

2.
Without this patch, table are copied in text format even if the
subscription is created with binary option enabled. This patch allows
logical replication
to perform in binary format starting from initial sync. When binary
format is beneficial
to use, allowing the subscription to copy tables in binary in table
sync phase may
reduce the time spent on copy depending on column types.

~

a. "table" -> "tables"

b. I don't think we need to keep referring to the initial table
synchronization many times.

SUGGESTION
Without this patch, table synchronization COPY uses text format even
if the subscription is created with the binary option enabled. Copying
tables in binary format may reduce the time spent depending on column
types.

==
doc/src/sgml/logical-replication.sgml

3.
@@ -241,10 +241,11 @@
types of the columns do not need to match, as long as the text
representation of the data can be converted to the target type.  For
example, you can replicate from a column of type integer to a
-   column of type bigint.  The target table can also have
-   additional columns not provided by the published table.  Any such columns
-   will be filled with the default value as specified in the definition of the
-   target table.
+   column of type bigint.  However, replication in binary format is
+   type specific and does not allow to replicate data between different types
+   according to its restrictions.  The target table can also have additional
+   columns not provided by the published table.  Any such columns will be filled
+   with the default value as specified in the definition of the target table.
   

I am not sure if there is enough information here about the binary restrictions.
- e.g. does the column order matter for tablesync COPY binary?
- e.g. no mention of the send/receive function requirements of tablesync COPY.

But maybe here is not the place to write all such details anyway;
Instead of duplicating information IMO here should give a link to the
CREATE SUBSCRIPTION notes -- something like:

SUGGESTION
Note that replication in binary format is more restrictive. See CREATE
SUBSCRIPTION binary subscription parameter for details.

==
doc/src/sgml/ref/create_subscription.sgml

4.
@@ -189,11 +189,17 @@ CREATE SUBSCRIPTION subscription_namebinary (boolean)
 
  
-  Specifies whether the subscription will request the publisher to
-  send the data in binary format (as opposed to text).
-  The default is false.
-  Even when this option is enabled, only data types having
-  binary send and receive functions will be transferred in binary.
+  Specifies whether the subscription will both copy the initial data to
+  synchronize relations and request the publisher to send the data in
+  binary format (as opposed to text). The default is false.
+  Binary format can be faster than the text format, but it is less portable
+  across machine architectures and PostgreSQL versions. Binary format is
+  also very data type specific, it will not allow copying between different
+  column types as opposed to text format. Even when this option is enabled,
+  only data types having binary send and receive functions will be
+  transferred in binary. Note that the initial synchronization requires
+  all data types to have binary send and receive functions, otherwise
+  the synchronization will fail.
  

There seems to be a small ambiguity because this wording comes more
from our code-level understanding, rather than what the user sees.
e.g. I think "will be transferred" could mean also the COPY phase as
far as the user is concerned. Maybe some slight rewording can help.

There is also some use of "copy" (e.g. "will not allow copying") which
can be confused with the initial tablesync phase which is not what was
intended.

SUGGESTION
Specifies whether the subscription will request the publisher to send
the data in binary format (as opposed to text). The default is
false. Any initial table synchronization copy [link
to copy_data] also uses the same format. Using binary format can be
faster than the text format, but it is less portable across machine
architectures and PostgreSQL versions. Binary format is also data type
specific; it will not allow transfer between different column types as
opposed to text format. Even when the binary option is enabled, only
data types having binary send/receive functions can be transferred in
binary format. If these functions don't exist then the publisher
will revert to sending in text format. Note that the binary initial table
synchronization copy requires all data types to have binary send and
receive functions, otherwise the synchronization will fail.

Re: psql \watch 2nd argument: iteration count

2023-03-13 Thread Michael Paquier
On Sun, Mar 12, 2023 at 08:59:44PM -0700, Andrey Borodin wrote:
> I've tried this approach, but could not come up with sufficiently
> different error messages...
> 
>>  Wouldn't it be better to have a couple of regression
>> tests, as well?
> Added two tests.

It should have three tests with one for ERANGE on top of the other
two.  Passing down a value like "10e400" should be enough to cause
strtod() to fail, as far as I know.
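
(A quick standalone check of that assumption, compiled outside the
tree:)

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	double		d;

	errno = 0;
	d = strtod("10e400", NULL);
	/* on overflow, strtod() returns HUGE_VAL and sets errno to ERANGE */
	printf("%g, ERANGE: %d\n", d, errno == ERANGE);
	return 0;
}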

+   if (sleep == 0)
+   continue;

While on it, forgot to comment on this one..  Indeed, this choice to
authorize 0 and not wait between two commands is more natural.

I have tweaked things a bit as of the attached, and ran pgindent.
What do you think?
--
Michael
From f6dad7551c04c3e7b280c310a744f8d4dfc10d19 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 14 Mar 2023 09:21:32 +0900
Subject: [PATCH v7] Fix incorrect argument handling in psql \watch

An incorrectly parsed argument was silently substituted with 1 second.
This is changed to raise a proper error instead.

Author: Andrey Borodin 
Reviewed-by: Kyotaro Horiguchi 
Reviewed-by: Nathan Bossart 
Reviewed-by: Michael Paquier 
Thread: https://postgr.es/m/CAAhFRxiZ2-n_L1ErMm9AZjgmUK%3DqS6VHb%2B0SaMn8sqqbhF7How%40mail.gmail.com
---
 src/bin/psql/command.c  | 21 ++---
 src/bin/psql/t/001_basic.pl | 17 +
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 955397ee9d..60cabfd2c8 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -2776,9 +2776,21 @@ exec_command_watch(PsqlScanState scan_state, bool active_branch,
 		/* Convert optional sleep-length argument */
 		if (opt)
 		{
-			sleep = strtod(opt, NULL);
-			if (sleep <= 0)
-sleep = 1;
+			char *opt_end;
+
+			errno = 0;
+			sleep = strtod(opt, &opt_end);
+			if (sleep < 0 || *opt_end || errno == ERANGE)
+			{
+if (*opt_end)
+	pg_log_error("\\watch: incorrect interval value '%s'", opt);
+else if (errno == ERANGE)
+	pg_log_error("\\watch: out-of-range interval value '%s'", opt);
+else
+	pg_log_error("\\watch: interval value '%s' less than zero", opt);
+free(opt);
+resetPQExpBuffer(query_buf);
+psql_scan_reset(scan_state);
+return PSQL_CMD_ERROR;
+			}
 			free(opt);
 		}
 
@@ -5183,6 +5195,9 @@ do_watch(PQExpBuffer query_buf, double sleep)
 		if (pagerpipe && ferror(pagerpipe))
 			break;
 
+		if (sleep == 0)
+			continue;
+
 #ifdef WIN32
 
 		/*
diff --git a/src/bin/psql/t/001_basic.pl b/src/bin/psql/t/001_basic.pl
index 0f394420b2..b86ff99a63 100644
--- a/src/bin/psql/t/001_basic.pl
+++ b/src/bin/psql/t/001_basic.pl
@@ -350,4 +350,21 @@ psql_like(
 	'\copy from with DEFAULT'
 );
 
+# Check \watch errors
+psql_fails_like(
+	$node,
+	'SELECT 1;\watch -10',
+	qr/interval value '-10' less than zero/,
+	'\watch, negative interval');
+psql_fails_like(
+	$node,
+	'SELECT 1;\watch 10ab',
+	qr/incorrect interval value '10ab'/,
+	'\watch incorrect interval');
+psql_fails_like(
+	$node,
+	'SELECT 1;\watch 10e400',
+	qr/out-of-range interval value '10e400'/,
+	'\watch out-of-range interval');
+
 done_testing();
-- 
2.39.2





Re: ICU 54 and earlier are too dangerous

2023-03-13 Thread Tom Lane
Jeff Davis  writes:
> In ICU 54 and earlier, if ucol_open() is unable to find a matching
> locale, it will fall back to the *environment*.

That's not great, but ...

> Unless someone has a better idea, I think we need to bump the minimum
> required ICU version to 55. That would solve the issue in v16 and
> later, but those using old versions of ICU and old versions of postgres
> would still be vulnerable to these kinds of typos.

... that seems like an overreaction.  We know from the buildfarm
that there's still a lot of old ICU out there.  Is it really improving
anybody's life to try to forbid them from using such a version?

regards, tom lane




Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

2023-03-13 Thread Michael Paquier
On Tue, Mar 14, 2023 at 01:01:28PM +1300, Thomas Munro wrote:
> Ahhh.  Right, of course.  The handle thing makes total sense now that
> you point it out, and although I couldn't find it in the fine manual,
> a higher authority has it in black and white[1].  Even without knowing
> which of those calls is releasing the process table entry, we're doing
> all of them on the wrong side of that IOCP.  Alright, here is a patch
> to schlep most of that code over into waitpid(), where it belongs.
> 
> [1] https://devblogs.microsoft.com/oldnewthing/20110107-00/?p=11803

I have a small question here..

The emulation of waitpid() for WIN32 is now in postmaster.c.  Could it
make sense for some of the frontend code to be able to rely on that,
as well?
--
Michael




Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

2023-03-13 Thread Andres Freund
On 2023-03-14 09:29:56 +0900, Michael Paquier wrote:
> The emulation of waitpid() for WIN32 is now in postmaster.c.  Could it
> make sense for some of the frontend code to be able to rely on that,
> as well?

Please not as part of this bugfix. It's intricately tied to postmaster.c
specific code, as it is.




Re: Allow logical replication to copy tables in binary format

2023-03-13 Thread Peter Smith
On Tue, Mar 14, 2023 at 11:06 AM Peter Smith  wrote:
>
> Here are some review comments for patch v12-0001
>
> ==
> General
>
> 1.
> There is no new test code. Are we sure that there are already
> sufficient TAP tests doing binary testing with/without copy_data and
> covering all the necessary combinations?
>

Oops. Please ignore this comment. Somehow I missed seeing those
032_binary_copy.pl tests earlier.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: ICU 54 and earlier are too dangerous

2023-03-13 Thread Andres Freund
Hi,

On 2023-03-13 16:39:04 -0700, Jeff Davis wrote:
> In ICU 54 and earlier, if ucol_open() is unable to find a matching
> locale, it will fall back to the *environment*.
> 
> Using ICU 54:
> 
>   initdb -D data -N --locale="en_US.UTF-8"
>   pg_ctl -D data -l logfile start
>   psql postgres -c "create collation asdf(provider=icu, locale='asdf')"
>   # returns true
>   psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"
>   psql postgres -c "alter system set lc_messages='C'"
>   pg_ctl -D data -l logfile restart
>   # returns false and warns about collation version mismatch
>   psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"
> 
> This was fixed in ICU 55 to fall back to the root locale instead[1],
> which is stable, has a collator version, and is not dependent on the
> environment. As far as I can tell, 55 and later never fall back to the
> environment when opening a collator (unless you explicitly pass NULL to
> ucol_open(), which is documented).

> It would be nice if we could detect when this fallback-to-environment
> happens, so that we could just refuse to create the bogus collation.
> But I didn't find a good way. There are non-error return codes from
> ucol_open() that seem promising[2], but they aren't actually very
> useful to distinguish the fallback-to-environment case as far as I can
> tell.

What non-error code is returned in the above example?

Can we query the returned collator and see if it matches what we were looking
for?


> Unless someone has a better idea, I think we need to bump the minimum
> required ICU version to 55. That would solve the issue in v16 and
> later, but those using old versions of ICU and old versions of postgres
> would still be vulnerable to these kinds of typos.

I'm a bit confused by the dates. https://icu.unicode.org/download/55m1 says
that version was released 2014-12-17, but the linked issue around root locales
is from 2018: https://unicode-org.atlassian.net/browse/ICU-10823  - I guess
the issue tracker was migrated at some point or such...

If indeed 2014 is the correct year of release, then it might be ok to increase
the minimum version...

Greetings,

Andres Freund




Re: psql \watch 2nd argument: iteration count

2023-03-13 Thread Andrey Borodin
On Mon, Mar 13, 2023 at 5:26 PM Michael Paquier  wrote:
>
> I have tweaked things a bit as of the attached, and ran pgindent.
> What do you think?
>

Looks good to me.
Thanks!

Best regards, Andrey Borodin.




Re: Show various offset arrays for heap WAL records

2023-03-13 Thread Peter Geoghegan
On Mon, Mar 13, 2023 at 4:01 PM Melanie Plageman
 wrote:
> On Fri, Jan 27, 2023 at 3:02 PM Robert Haas  wrote:
> > I'm not sure what's best in terms of formatting details but I
> > definitely like the idea of making pg_waldump show more details.

> If I'm not mistaken, this would be quite difficult without changing
> rm_desc to return some kind of self-describing data type.

I'd say that it would depend on how far you went with it. Basic
information about the tuple wouldn't require any of that. I suggest
leaving this part out for now, though.

> So, we can scrap any README or big comment, but are there other changes
> to the code or structure you think would avoid it being seen as an
> API?

I think that it would be good to try to build something that looks
like an API, while making zero promises about its stability -- at
least until further notice. Kind of like how there are no guarantees
about the stability of internal interfaces within the Linux kernel.

There is no reason to not take a firm position on some things now.
Things like punctuation, and symbol names for generic cross-record
symbols like snapshotConflictHorizon. Many of the differences that
exist now are wholly gratuitous -- just accidents. It would make sense
to standardize-away these clearly unnecessary variations. And to
document the new standard. I'd be surprised if anybody disagreed with
me on this point.

> I have added detail to xl_btree_delete and xl_btree_vacuum. I have added
> the updated/deleted target offset numbers and the updated tuples
> metadata.
>
> I wondered if there was any reason to do xl_btree_dedup deduplication
> intervals.

No reason. It wouldn't be hard to cover xl_btree_dedup deduplication
intervals -- each element is a page offset number, and a corresponding
count of index tuples to merge together in the REDO routine. That's
slightly different to anything else, but not in a way that seems like
it requires very much additional effort.
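
Untested, but I'd expect the desc code to look about like this
(btree_dedup_desc is only a placeholder name):

#include "access/nbtree.h"
#include "access/nbtxlog.h"
#include "access/xlogreader.h"
#include "lib/stringinfo.h"

static void
btree_dedup_desc(StringInfo buf, XLogReaderState *record)
{
	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
	BTDedupInterval *intervals;

	/* one (baseoff, nitems) pair per interval, stored as block data */
	intervals = (BTDedupInterval *) XLogRecGetBlockData(record, 0, NULL);
	appendStringInfo(buf, "nintervals: %u, intervals:", xlrec->nintervals);
	for (int i = 0; i < xlrec->nintervals; i++)
		appendStringInfo(buf, " (%u,%u)",
						 intervals[i].baseoff, intervals[i].nitems);
}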

> I wanted to include at least a minimal example for those following along
> with this thread that would cause creation of one of the record types
> which I have enhanced, but I had a little trouble making a reliable
> example.
>
> Below is my strategy for getting a Heap PRUNE record with redirects, but
> it occasionally doesn't end up working and I wasn't sure why (I can do
> more investigation if we think that having some kind of test for this is
> useful).

I'm not sure, but offhand I think that there could be a number of
annoying little implementation details that make it hard to come up
with a perfectly reliable test case. Perhaps try it while using VACUUM
VERBOSE, with the proviso that we should only expect the revised
example workflow to show a redirect record as intended when the
VERBOSE output confirms that VACUUM actually ran as expected, in
whatever way. For example, VACUUM can't have failed to acquire a
cleanup lock on a heap page due to the current phase of the moon.
VACUUM shouldn't have its "removable cutoff" held back by
who-knows-what when the test case is run, either.

Some of the tests for VACUUM use a temp table, since they conveniently
cannot have their "removable cutoff" held back -- not since commit
a7212be8. Of course, that strategy won't help you here. Getting VACUUM
to behave very predictably for testing purposes has proven tricky at
times.

> > I agree, in general, though long term the best approach is one that
> > has a configurable level of verbosity, with some kind of roughly
> > uniform definition of verbosity (kinda like DEBUG1 - DEBUG5, though
> > probably with only 2 or 3 distinct levels).
>
> Given this comment and Robert's concern quoted below, I am wondering if
> the consensus is that a lack of verbosity control is a dealbreaker for
> adding offsets or not.

There are several different things that seem important to me
personally. These are in tension with each other, to a degree. These
are:

1. Like Andres, I'd really like to have some way of inspecting things
like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant
detail. These record types happen to be very important in general, and
the ability to see detailed information about the WAL record would
definitely help with some debugging scenarios. I've really missed
stuff like this while debugging serious issues under time pressure.

2. To a lesser extent I would like to see similar detailed information
for nbtree's DELETE, VACUUM, and possibly DEDUPLICATE record types.
They might also come in handy for debugging, in about the same way.

3. More manageable verbosity.

I think that it would be okay to put off coming up with a solution to
3, for now. 1 and 2 seem more important than 3.

> I think if there was a more structured output of rmgrdesc, then this
> would also solve the verbosity level problem. Consumers could decide on
> their verbosity level -- in various pg_walinspect function outputs, that
> would probably just be column selection. For pg_waldump, I imagine that
> some kind of p

Re: Allow logical replication to copy tables in binary format

2023-03-13 Thread Peter Smith
Here are some review comments for patch v12-0001 (test code only)

==
src/test/subscription/t/014_binary.pl

# Check the synced data on subscribers

~

There are a couple of comments like the above that say: "on
subscribers" instead of "on subscriber".

~~~

I wondered if it might be useful to also include another test case
that demonstrates you can still use COPY with binary format even when
the table column orders are different, so long as the same names have
the same data types. In other words, it shows apparently, the binary
row COPY processes per column; not one single binary data copy
spanning all the replicated columns.

For example,

# 
# Test syncing tables with different column order
$node_publisher->safe_psql(
'postgres', qq(
CREATE TABLE public.test_col_order (
a bigint, b int
);
INSERT INTO public.test_col_order (a,b)
VALUES (1,2),(3,4);
));

$node_subscriber->safe_psql(
'postgres', qq(
CREATE TABLE public.test_col_order (
b int, a bigint
);
ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
));

# Ensure nodes are in sync with each other
$node_subscriber->wait_for_subscription_sync($node_publisher, 'tsub');

# Check the synced data on subscribers
$result = $node_subscriber->safe_psql('postgres', 'SELECT a,b FROM
public.test_col_order;');

is( $result, '1|2
3|4', 'check synced data on subscriber for different column order and
binary = true');
# 

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: psql \watch 2nd argument: iteration count

2023-03-13 Thread Michael Paquier
On Mon, Mar 13, 2023 at 06:14:18PM -0700, Andrey Borodin wrote:
> Looks good to me.

Ok, thanks for looking.  Let's wait a bit and see if others have an
opinion to offer.  At least, the CI is green.
--
Michael




Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

2023-03-13 Thread Thomas Munro
On Tue, Mar 14, 2023 at 12:10 PM Nathan Bossart
 wrote:
> >   * NOTE: although the delay is specified in microseconds, the effective
> > - * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
> > - * the requested delay to be rounded up to the next resolution boundary.
> > + * resolution is only 1/HZ on systems that use periodic kernel ticks to 
> > wake
> > + * up.  This may cause sleeps to be rounded up by 1-20 milliseconds on 
> > older
> > + * Unixen and Windows.
>
> nitpick: Could the 1/HZ versus 20 milliseconds discrepancy cause confusion?
> Otherwise, I think this is the right idea.

Better words welcome; 1-20ms summarises the range I actually measured,
and if reports are correct about Windows' HZ=64 (1/HZ = 15.625ms) then
it neatly covers that too, so I don't feel too bad about not chasing
down the reason for that 10ms/20ms discrepancy; maybe I looked at the
wrong HZ number (which you can change, anyway), I'm not too used to
NetBSD...  BTW they have a project plan to fix that
https://wiki.netbsd.org/projects/project/tickless/

> > + * CAUTION: if interrupted by a signal, this function will return, but its
> > + * interface doesn't report that.  It's not a good idea to use this
> > + * for long sleeps in the backend, because backends are expected to respond to
> > + * interrupts promptly.  Better practice for long sleeps is to use WaitLatch()
> > + * with a timeout.
>
> I'm not sure this argument follows.  If pg_usleep() returns if interrupted,
> then why are we concerned about delayed responses to interrupts?

Because you can't rely on it:

1.  Maybe the signal is delivered just before pg_usleep() begins, and
a handler sets some flag we would like to react to.  Now pg_usleep()
will not be interrupted.  That problem is solved by using latches
instead.
2.  Maybe the signal is one that is no longer handled by a handler at
all; these days, latches use SIGURG, which pops out when you read a
signalfd or kqueue, so pg_usleep() will not wake up.  That problem is
solved by using latches instead.

(The word "interrupt" is a bit overloaded, which doesn't help with
this discussion.)
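
For contrast, the usual latch idiom for a long sleep looks roughly
like this (simplified sketch, sleep_interruptibly is a made-up name;
real callers pick an appropriate wait event):

#include "miscadmin.h"
#include "storage/latch.h"
#include "utils/wait_event.h"

static void
sleep_interruptibly(long msec)
{
	/* wakes early on latch set (e.g. on signal) or postmaster death */
	(void) WaitLatch(MyLatch,
					 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					 msec,
					 WAIT_EVENT_PG_SLEEP);
	ResetLatch(MyLatch);
	CHECK_FOR_INTERRUPTS();
}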

> > - delay.tv_usec = microsec % 1000000L;
> > - (void) select(0, NULL, NULL, NULL, &delay);
> > + delay.tv_nsec = (microsec % 1000000L) * 1000;
> > + (void) nanosleep(&delay, NULL);
>
> Using nanosleep() seems reasonable to me.

Thanks for looking!




Re: Allow logical replication to copy tables in binary format

2023-03-13 Thread Amit Kapila
On Tue, Mar 14, 2023 at 6:18 AM Peter Smith  wrote:
>
> On Tue, Mar 14, 2023 at 11:06 AM Peter Smith  wrote:
> >
> > Here are some review comments for patch v12-0001
> >
> > ==
> > General
> >
> > 1.
> > There is no new test code. Are we sure that there are already
> > sufficient TAP tests doing binary testing with/without copy_data and
> > covering all the necessary combinations?
> >
>
> Oops. Please ignore this comment. Somehow I missed seeing those
> 032_binary_copy.pl tests earlier.
>

I think it would be better to write the tests for this feature in the
existing test file 014_binary as that would save some time for node
setup/shutdown and also that would be a more appropriate place for
these tests.

-- 
With Regards,
Amit Kapila.




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-13 Thread Bharath Rupireddy
On Sun, Mar 12, 2023 at 11:00 PM Bharath Rupireddy
 wrote:
>
> I have some review comments to fix on
> v8-0001, so, I'll be sending out v9 patch-set soon.

Please find the attached v9 patch set for further review. I moved the
check for just-initialized WAL buffer pages before reading the page.
Until now, it was the other way around, that is, read the page and
then check the header to see if it is just-initialized, which is wrong. The
attached v9 patch set corrects it.
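
To paraphrase the idea in miniature (a sketch, not the patch itself;
wal_buffer_holds_page is a made-up name, and XLogCtl/xlblocks are
private to xlog.c):

/*
 * xlblocks[i] stores the end+1 address of the page held in WAL buffer
 * i, so a mismatch means the page was evicted or never loaded and the
 * caller must fall back to reading the WAL file.  Note that a
 * just-initialized page *passes* this test, which is why that case
 * has to be checked separately, before reading any data off the page.
 */
static bool
wal_buffer_holds_page(XLogRecPtr recptr)
{
	int			idx = XLogRecPtrToBufIdx(recptr);
	XLogRecPtr	expected = recptr - (recptr % XLOG_BLCKSZ) + XLOG_BLCKSZ;

	return XLogCtl->xlblocks[idx] == expected;
}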

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v9-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch
Description: Binary data


v9-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch
Description: Binary data


Re: CI and test improvements

2023-03-13 Thread Justin Pryzby
On Mon, Mar 13, 2023 at 07:39:52AM +0100, Peter Eisentraut wrote:
> On 03.02.23 15:26, Justin Pryzby wrote:
> > rebased, and re-including a patch to show code coverage of changed
> > files.
> 
> This constant flow of patches under one subject doesn't lend itself well to
> the commit fest model of trying to finish things up.
> I can't quite tell which of these patches are ready and agreed upon,
> and which ones are work in progress or experimental.

I'm soliciting feedback on those patches that I've sent recently - I've
elided patches if they have some unresolved issue.

I'm not aware of any loose ends other than what's updated here:

- cirrus: code coverage

I changed this to also run an "initial" coverage report before running
tests.  It's not clear to me what effect that has, though...

Andres seems to think it's a problem that this shows coverage only for
files that were actually changed.  But that's what's intended; it's
sufficient to see if new code is being hit by tests.  It would be slow
and take a lot of extra space to upload a coverage report for every
patch, every day.  It might be nice for cfbot to show how test coverage
changed in the affected files: -15% / +25%.

- cirrus: upload changed html docs as artifacts

Fixed an "only_if" line so cfbot will run the "warnings" task.

Maybe this patch is waiting on Andres' patch to "move CompilerWarnings to
meson" ?

> > 7e09035f588 WIP: ci/meson: allow showing only failed tests ..
> 
> I'm not sure I like this one.  I sometimes look up the logs of non-failed
> tests to compare them with failed tests, to get context to could lead to
> failures.  Maybe we can make this behavior adjustable. But I've not been
> bothered by the current behavior.

It's adjustable by un/setting the environment variable.

I'm surprised to hear that anyone using cirrusci (with or without cfbot)
wouldn't prefer the behavior this patch implements.  It's annoying to
search for the logs of the (typically exactly one) failing test in
cirrus' directory of 200-some test artifacts.  We're also uploading a lot
of logs for every failure.  (But I suppose this might break cfbot's new
client side parsing of things like build logs...)

-- 
Justin
>From f6174c446b0ee4f69239524ecc1506edaf41b33b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 25 May 2022 21:53:22 -0500
Subject: [PATCH 1/8] cirrus/windows: add compiler_warnings_script

I'm not sure how to write this test in windows shell; it's also easy to
write something that doesn't work in posix sh, since windows shell is
interpreting && and ||...

https://www.postgresql.org/message-id/20220212212310.f645c6vw3njkgxka%40alap3.anarazel.de

See also:
8a1ce5e54f6d144e4f8e19af7c767b026ee0c956
https://cirrus-ci.com/task/6241060062494720
https://cirrus-ci.com/task/6496366607204352

ci-os-only: windows
---
 .cirrus.yml | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index 505c50f3285..60c0efc2e63 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -563,28 +563,37 @@ task:
 
   setup_additional_packages_script: |
 REM choco install -y --no-progress ...
 
   # Use /DEBUG:FASTLINK to avoid high memory usage during linking
   configure_script: |
 vcvarsall x64
 meson setup --backend ninja --buildtype debug -Dc_link_args=/DEBUG:FASTLINK -Dcassert=true -Db_pch=true -Dextra_lib_dirs=c:\openssl\1.1\lib -Dextra_include_dirs=c:\openssl\1.1\include -DTAR=%TAR% -DPG_TEST_EXTRA="%PG_TEST_EXTRA%" build
 
   build_script: |
 vcvarsall x64
-ninja -C build
+ninja -C build |tee build.txt
+REM Since pipes lose the exit status of the preceding command, rerun the compilation
+REM without the pipe, exiting now if it fails, to avoid trying to run checks
+ninja -C build > nul
 
   check_world_script: |
 vcvarsall x64
 meson test %MTEST_ARGS% --num-processes %TEST_JOBS%
 
+  # This should be last, so check_world is run even if there are warnings
+  always:
+compiler_warnings_script:
+  # this avoids using metachars which would be interpreted by the windows shell
+  - sh -c 'if grep ": warning " build.txt; then exit 1; fi; exit 0'
+
   on_failure:
 <<: *on_failure_meson
 crashlog_artifacts:
   path: "crashlog-*.txt"
   type: text/plain
 
 
 task:
   << : *WINDOWS_ENVIRONMENT_BASE
   name: Windows - Server 2019, MinGW64 - Meson
 
-- 
2.34.1

>From 60378805b5b2d634f426850b6ce4d1c64b8aabf4 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 24 Jun 2022 00:09:12 -0500
Subject: [PATCH 2/8] cirrus/freebsd: run with more CPUs+RAM and do not
 repartition

There was some historic problem where tests under freebsd took 8+ minutes (and
before 4a288a37f took 15 minutes).

This reduces test time from 10min to 3min.
4 CPUs 4 tests https://cirrus-ci.com/task/4880240739614720
4 CPUs 6 tests https://cirrus-ci.com/task/4664440120410112 https://cirrus-ci.com/task/4586784884523008
4 CPUs 8 tests https://cirrus-ci.com/task/5001995491737600

6 CPUs https://cirr

Re: psql \watch 2nd argument: iteration count

2023-03-13 Thread Kyotaro Horiguchi
At Tue, 14 Mar 2023 11:36:17 +0900, Michael Paquier  wrote in
> Ok, thanks for looking.  Let's wait a bit and see if others have an
> opinion to offer.  At least, the CI is green.

+   if (*opt_end)
+   pg_log_error("\\watch: incorrect interval value '%s'", opt);
+   else if (errno == ERANGE)
+   pg_log_error("\\watch: out-of-range interval value '%s'", opt);
+   else
+   pg_log_error("\\watch: interval value '%s' less than zero", opt);

I'm not sure if we need error messages for that resolution and I'm a
bit happier to have fewer messages to translate:p. Merging the cases
of ERANGE and negative values might be better. And I think we usually
refer to unparsable input as "invalid".

if (*opt_end)
   pg_log_error("\\watch: invalid interval value '%s'", opt);
else
   pg_log_error("\\watch: interval value '%s' out of range", opt);


It looks good other than that.

By the way, I noticed that \watch erases the query buffer. That
behavior differs from other commands, such as \g. And the difference
is not documented. Why do we erase the query buffer only in the case
of \watch?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



