Re: meson: Specify -Wformat as a common warning flag for extensions

2024-06-07 Thread Peter Eisentraut

On 29.05.24 08:47, Sutou Kouhei wrote:

In <4707d4ed-f268-43c0-b4dd-cdbc7520f...@eisentraut.org>
   "Re: meson: Specify -Wformat as a common warning flag for extensions" on 
Tue, 28 May 2024 23:31:05 -0700,
   Peter Eisentraut  wrote:


On 07.04.24 18:01, Sutou Kouhei wrote:

+# We don't have "warning_level == 3" and "warning_level ==
+# 'everything'" here because we don't use these warning levels.
+if warning_level == '1'
+  common_builtin_flags += ['-Wall']
+elif warning_level == '2'
+  common_builtin_flags += ['-Wall', '-Wextra']
+endif


I would trim this even further and always export just '-Wall'.  The
other options aren't really something we support.


OK. How about the v6 patch? It always uses '-Wall'.


I have committed this.  Thanks.





Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-07 Thread Alvaro Herrera
On 2024-Jun-07, Thomas Munro wrote:

>  static void
> -ZeroBuffer(Buffer buffer, ReadBufferMode mode)
> +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)

This change makes the API very strange.  Should the function be called
ZeroAndLockBuffer() instead?  Then the addition of a "bool zero"
argument makes a lot more sense.

In passing, I noticed that WaitReadBuffers has zero comments, which
seems an insufficient number of them.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




using __func__ to locate and distinguish some error messages

2024-06-07 Thread jian he
hi.

we have 450 appearances of
`cache lookup failed .*`

we have 141 appearances of
`could not open file .*`

so when such an error actually happens, it is hard to quickly locate the
function where it occurred.
maybe under certain conditions (e.g. certain build type or certain
log_min_messages),
we can also print out the function name by using gcc __func__.

or we can just do something like:
if (!HeapTupleIsValid(tuple))
    elog(ERROR, "cache lookup failed for relation %u %s",
         RelationGetRelid(rel), __func__);

given that these errors are very unlikely to happen, printing out the
function name when one does occur seems not that invasive?
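
A minimal sketch of the "certain build type" idea (purely illustrative; the
conditional and the message wording are just one possible shape):

#ifdef USE_ASSERT_CHECKING
    elog(ERROR, "cache lookup failed for relation %u %s",
         RelationGetRelid(rel), __func__);
#else
    elog(ERROR, "cache lookup failed for relation %u",
         RelationGetRelid(rel));
#endif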




Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade

2024-06-07 Thread Dilip Kumar
On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent
 wrote:
>
> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar  wrote:
> >
> > On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent
> >  wrote:
> >>
> >> On Wed, 5 Jun 2024 at 18:47, Ranier Vilela  wrote:
> >>>
> >>> Why not use it too, if not binary_upgrade?
> >>
> >> Because in the normal case (not during binary_upgrade) you don't want
> >> to have to generate 2 checkpoints for every created database,
> >> especially not when your shared buffers are large. Checkpoints' costs
> >> scale approximately linearly with the size of shared buffers, so being
> >> able to skip those checkpoints (with strategy=WAL_LOG) will save a lot
> >> of performance in the systems where this performance impact matters
> >> most.
> >
> > I agree with you that we introduced the WAL_LOG strategy to avoid
> > these force checkpoints. However, in binary upgrade cases where no
> > operations are happening in the system, the FILE_COPY strategy should
> > be faster.
>
> While you would be correct if there were no operations happening in
> the system, during binary upgrade we're still actively modifying
> catalogs; and this is done with potentially many concurrent jobs. I
> think it's not unlikely that this would impact performance.

Maybe, but generally, long checkpoints are problematic because they
involve a lot of I/O, which hampers overall system performance.
However, in the case of a binary upgrade, the concurrent operations
are only performing a schema restore, not a real data restore.
Therefore, it shouldn't have a significant impact, and the checkpoints
should also not do a lot of I/O during binary upgrade, right?

> Now that I think about it, arguably, we shouldn't need to run
> checkpoints during binary upgrade for the FILE_COPY strategy after
> we've restored the template1 database and created a checkpoint after
> that: All other databases use template1 as their template database,
> and the checkpoint is there mostly to guarantee the FS knows about all
> changes in the template database before we task it with copying the
> template database over to our new database, so the protections we get
> from more checkpoints are practically useless.
> If such a change were implemented (i.e. no checkpoints for FILE_COPY
> in binary upgrade, with a single manual checkpoint after restoring
> template1 in create_new_objects) I think most of my concerns with this
> patch would be alleviated.

Yeah, I think that's a valid point. The second checkpoint is to ensure
that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for
binary upgrades, we don't need that guarantee because a checkpoint
will be performed during shutdown at the end of the upgrade anyway.
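
For reference, the strategy in question is chosen per CREATE DATABASE
statement, roughly like this (database names are made up):

CREATE DATABASE db_wal  TEMPLATE template1 STRATEGY = WAL_LOG;   -- default; blocks are WAL-logged, no forced checkpoint
CREATE DATABASE db_copy TEMPLATE template1 STRATEGY = FILE_COPY; -- copies files directly, checkpoints around the copy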

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: using __func__ to locate and distinguish some error messages

2024-06-07 Thread Alvaro Herrera
On 2024-Jun-07, jian he wrote:

> so when such an error actually happens, it is hard to quickly locate the
> function where it occurred.
> maybe under certain conditions (e.g. certain build type or certain
> log_min_messages),
> we can also print out the function name by using gcc __func__.

That information is already in the error data, so you don't need it in
the message text.  You can change your log_error_verbosity if you want
it to show up in the log; in psql you can use \errverbose to have it
shown to you after the error is thrown, or you can use
  \pset VERBOSITY verbose
to have it printed for every error message.  Tools other than psql would
need their own specific ways to display those.
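
For example, a quick psql session (the function, file and line shown will of
course vary):

  \pset VERBOSITY verbose
  SELECT 1/0;
  ERROR:  22012: division by zero
  LOCATION:  int4div, int.c:846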

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Hay que recordar que la existencia en el cosmos, y particularmente la
elaboración de civilizaciones dentro de él no son, por desgracia,
nada idílicas" (Ijon Tichy)




Re: Bug: PGTYPEStimestamp_from_asc() in ECPG pgtypelib

2024-06-07 Thread Ryo Matsumura (Fujitsu)
# I'm sorry for my late response.

I confirmed that the regression error is caused by my code, which inserts
setlocale() into ecpglib in my local branch.
No other tests fail in a non-C locale.

The following is about other topics.


1. About regression test

We should test the following:
- PGTYPEStimestamp_from_asc("1994-02-11 26:10:35", NULL) returns 0.
- PGTYPEStimestamp_fmt_asc() can accept format string including %x and %X.

ecpglib should be affected only by setlocale() called by the user application,
and dt_test.pgc does not call it. So the following test is the best, I think.
Please see the attached patch for details (fix_pgtypeslib_regress.patch).

ts1 = PGTYPEStimestamp_from_asc("1994-02-11 3:10:35", NULL);
text = PGTYPEStimestamp_to_asc(ts1);
printf("timestamp_to_asc2: %s\n", text);
PGTYPESchar_free(text);

/*  abc-03:10:35-def-02/11/94-gh  */
/*  12345678901234567890123456789 */

out = (char*) malloc(32);
i = PGTYPEStimestamp_fmt_asc(&ts1, out, 31, "abc-%X-def-%x-ghi%%");
printf("timestamp_fmt_asc: %d: %s\n", i, out);
free(out);

ts1 = PGTYPEStimestamp_from_asc("1994-02-11 26:10:35", NULL);
text = PGTYPEStimestamp_to_asc(ts1);
printf("timestamp_to_asc3: %s\n", text);
PGTYPESchar_free(text);

We should also add tests that check that PGTYPEStimestamp_*() sets errno
correctly for invalid input,
but I want to leave that improvement for the next time the timestamp
implementation is changed.
(That time may never come.)


2. About document of PGTYPEStimestamp_from_asc() and PGTYPESInvalidTimestamp

As you commented, 0 returned by PGTYPEStimestamp_from_asc() is a valid timestamp,
and we should not break compatibility.
So we should remove the documentation for PGTYPESInvalidTimestamp and add a note
about checking errno
to the description of PGTYPEStimestamp_from_asc().
Please see the attached patch for details (fix_PGTYPESInvalidTimestamp_doc.patch).


3. About endptr of *_from_asc()
> PGTYPESdate_from_asc(ParseDate)
> PGTYPEStimestamp_from_asc(ParseDate)
> PGTYPESinterval_from_asc(ParseDate)
> PGTYPESnumeric_from_asc

Basically, they return immediately after detecting an invalid format.
However, even after passing that initial parse, they can still fail (e.g. a
failure of DecodeInterval(), DecodeISO8601Interval(), malloc(), and so on).

So we should write something like the following:
   If the function detects an invalid format,
   then it stores the address of the first invalid character in
   endptr. However, do not assume the call succeeded just because
   endptr points to the end of the input, because other
   processing (e.g. memory allocation) can still fail.
   Therefore, you should check the return value and errno to detect errors.
   You can safely set endptr to NULL.
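
A minimal sketch of that calling pattern (illustrative only; assumes errno.h,
stdio.h and pgtypes_timestamp.h are included):

    timestamp ts;

    errno = 0;
    ts = PGTYPEStimestamp_from_asc("1994-02-11 26:10:35", NULL); /* endptr may be NULL */
    if (errno != 0)
        printf("invalid timestamp input\n"); /* don't rely on the return value alone */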

I also opened a Pandora's box: the descriptions of the following functions do
not show their behavior on failure.
I fixed the docs for them as well. Please see the attached
patch (fix_pgtypeslib_funcs_docs.patch).
- PGTYPESdate_from_asc()# sets errno. (can not check return value)
- PGTYPESdate_defmt_asc()   # returns -1 and sets errno
- PGTYPEStimestamp_to_asc() # returns NULL and sets errno
- PGTYPEStimestamp_defmt_asc()  # just returns 1 and doesn't set errno!
- PGTYPESinterval_new() # returns NULL and sets errno
- PGTYPESinterval_from_asc()# returns NULL and sets errno
- PGTYPESinterval_to_asc()  # returns NULL and sets errno
- PGTYPESinterval_copy()# currently always returns 0
- PGTYPESdecimal_new()  # returns NULL and sets errno


4. Bug of PGTYPEStimestamp_defmt_asc()
PGTYPEStimestamp_defmt_asc() doesn't set errno on failure.
I didn't make a patch for it yet.

Best Regards
Ryo Matsumura


fix_pgtypeslib_regress.patch
Description: fix_pgtypeslib_regress.patch


fix_pgtypeslib_funcs_docs.patch
Description: fix_pgtypeslib_funcs_docs.patch


fix_PGTYPESInvalidTimestamp_doc.patch
Description: fix_PGTYPESInvalidTimestamp_doc.patch


Re: using __func__ to locate and distinguish some error messages

2024-06-07 Thread jian he
On Fri, Jun 7, 2024 at 4:28 PM Alvaro Herrera  wrote:
>
> On 2024-Jun-07, jian he wrote:
>
> > so when such an error actually happens, it is hard to quickly locate the
> > function where it occurred.
> > maybe under certain conditions (e.g. certain build type or certain
> > log_min_messages),
> > we can also print out the function name by using gcc __func__.
>
> That information is already in the error data, so you don't need it in
> the message text.  You can change your log_error_verbosity if you want
> it to show up in the log; in psql you can use \errverbose to have it
> shown to you after the error is thrown, or you can use
>   \pset VERBOSITY verbose
> to have it printed for every error message.  Tools other than psql would
> need their own specific ways to display those.
>

Thanks for pointing this out.




Re: Compress ReorderBuffer spill files using LZ4

2024-06-07 Thread Dilip Kumar
On Thu, Jun 6, 2024 at 7:54 PM Alvaro Herrera  wrote:
>
> On 2024-Jun-06, Amit Kapila wrote:
>
> > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires  wrote:
> > >
> > > When the content of a large transaction (size exceeding
> > > logical_decoding_work_mem) and its sub-transactions has to be
> > > reordered during logical decoding, then, all the changes are written
> > > on disk in temporary files located in pg_replslot/.
> > > Decoding very large transactions by multiple replication slots can
> > > lead to disk space saturation and high I/O utilization.
>
> I like the general idea of compressing the output of logical decoding.
> It's not so clear to me that we only want to do so for spilling to disk;
> for instance, if the two nodes communicate over a slow network, it may
> even be beneficial to compress when streaming, so to this question:
>
> > Why can't one use 'streaming' option to send changes to the client
> > once it reaches the configured limit of 'logical_decoding_work_mem'?
>
> I would say that streaming doesn't necessarily have to mean we don't
> want compression, because for some users it might be beneficial.

+1

> I think a GUC would be a good idea.  Also, what if for whatever reason
> you want a different compression algorithm or different compression
> parameters?  Looking at the existing compression UI we offer in
> pg_basebackup, perhaps you could add something like this:
>
> compress_logical_decoding = none
> compress_logical_decoding = lz4:42
> compress_logical_decoding = spill-zstd:99
>
> "none" says to never use compression (perhaps should be the default),
> "lz4:42" says to use lz4 with parameters 42 on both spilling and
> streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but
> only for spilling to disk.
>

I think the compression option should be supported at the CREATE
SUBSCRIPTION level instead of being controlled by a GUC. This way, we
can decide on compression for each subscription individually rather
than applying it to all subscribers. It makes more sense for the
subscriber to control this, especially when we are planning to
compress the data sent downstream.
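
To make that concrete, it could look something like the following (note that no
such subscription option exists today; the option name is purely hypothetical):

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (spill_compression = 'lz4');  -- hypothetical option name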

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Avoid orphaned objects dependencies, take 3

2024-06-07 Thread Bertrand Drouvot
Hi,

On Thu, Jun 06, 2024 at 04:00:23PM -0400, Robert Haas wrote:
> On Thu, Jun 6, 2024 at 1:56 AM Bertrand Drouvot
>  wrote:
> > v9 is more invasive (as it changes code in much more places) than v8 but it 
> > is
> > easier to follow (as it is now clear where the new lock is acquired).
> 
> Hmm, this definitely isn't what I had in mind. Possibly that's a sign
> that what I had in mind was dumb, but for sure it's not what I
> imagined. What I thought you were going to do was add calls like
> LockDatabaseObject(NamespaceRelationId, schemaid, 0, AccessShareLock)
> in various places, or perhaps LockRelationOid(reloid,
> AccessShareLock), or whatever the case may be.

I see what you’re saying, doing things like:

LockDatabaseObject(TypeRelationId, returnType, 0, AccessShareLock);
in ProcedureCreate() for example.

> Here you've got stuff
> like this:
> 
> - record_object_address_dependencies(&conobject, addrs_auto,
> -DEPENDENCY_AUTO);
> + lock_record_object_address_dependencies(&conobject, addrs_auto,
> + DEPENDENCY_AUTO);
> 
> ...which to me looks like the locking is still pushed down inside the
> dependency code.

Yes, but it’s now located in places where, I think, it’s easier to understand
what’s going on (as compared to v8), except maybe for:

recordDependencyOnExpr()
makeOperatorDependencies()
GenerateTypeDependencies()
makeParserDependencies()
makeDictionaryDependencies()
makeTSTemplateDependencies()
makeConfigurationDependencies()

but probably for:

heap_create_with_catalog()
StorePartitionKey()
index_create()
AggregateCreate()
CastCreate()
CreateConstraintEntry()
ProcedureCreate()
RangeCreate()
InsertExtensionTuple()
CreateTransform()
CreateProceduralLanguage()

The reasons I keep it linked to the dependency code are:

- To ensure we don’t miss anything (well, with the new Assert in place that’s
probably a tangential argument)

- It’s not only about locking the object: it’s also about 1) verifying that the
object is pinned, 2) checking that it still exists, and 3) providing a description
in the error message if we can (in case the object does not exist anymore).
Relying on an already built object (in the dependency code) avoids having to
1) define the object(s) one more time or 2) create new functions that would do
the same as isObjectPinned() and getObjectDescription() with a different set of
arguments.

That may sound like a weak argument, but it has been my reasoning.

Do you still find the code hard to maintain with v9?

> 
> And you also have stuff like this:
> 
>   ObjectAddressSet(referenced, RelationRelationId, childTableId);
> + depLockAndCheckObject(&referenced);
>   recordDependencyOn(&depender, &referenced, DEPENDENCY_PARTITION_SEC);
> 
> But in depLockAndCheckObject you have:
> 
> + if (object->classId == RelationRelationId || object->classId ==
> AuthMemRelationId)
> + return;
> 
> That doesn't seem right, because then it seems like the call isn't
> doing anything, but there isn't really any reason for it to not be
> doing anything. If we're dropping a dependency on a table, then it
> seems like we need to have a lock on that table. Presumably the reason
> why we don't end up with dangling dependencies in such cases now is
> because we're careful about doing LockRelation() in the right places,

Yeah, that's what I think: we're already careful when we deal with relations.

> but we're not similarly careful about other operations e.g.
> ConstraintSetParentConstraint is called by DefineIndex which calls
> table_open(childRelId, ...) first, but there's no logic in DefineIndex
> to lock the constraint.

table_open(childRelId, ...) would lock any "ALTER TABLE  DROP 
CONSTRAINT"
already. Not sure I understand your concern here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add support to TLS 1.3 cipher suites and curves lists

2024-06-07 Thread Peter Eisentraut

On 07.06.24 08:10, Erica Zhang wrote:
I’m a Postgres user and I’m looking into restricting the set of allowed 
ciphers on Postgres and configure a concrete set of curves on our 
postgres instances.


Out of curiosity, why is this needed in practice?

Could you please help to review to see if you are interested in having 
this change in upcoming Postgres major release(It's should be PG17)?


It would be targeting PG18 now.





Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-07 Thread Alexander Lakhin

Hello Thomas,

07.06.2024 09:06, Thomas Munro wrote:

On Fri, Jun 7, 2024 at 3:06 PM Thomas Munro  wrote:

On Fri, Jun 7, 2024 at 3:00 PM Alexander Lakhin  wrote:

My bisect run ended with:
210622c60e1a9db2e2730140b8106ab57d259d15 is the first bad commit

Author: Thomas Munro 
Date:   Wed Apr 3 00:03:08 2024 +1300

  Provide vectored variant of ReadBuffer().

Other buildfarm failures with this Assert I could find kind of confirm this:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2003%3A32%3A18
(presumably a first failure of this sort)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua&dt=2024-04-04%2015%3A38%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2024-05-07%2004%3A00%3A08

Looking...

What Noah described[1] is what should be happening already, I think,
but 210622c6 unconditionally zeroed the page.  Oops.  The attached
seems to cure his repro for me.  Does it also cure your test?  I
couldn't see that variant myself for some reason, but it seems to make
sense as the explanation.  I would probably adjust the function name
or perhaps consider refactoring slightly, but first let's confirm that
this is the same issue and fix.


Thank you for looking and for the fix!

Using the same testing procedure (applying the patch for checking lpp,
multiplying the 026_overwrite_contrecord.pl tests, and running 30 tests in
parallel with fsync=on) that I used for bisecting, I got failures on
iterations 8, 19, and 4 without the fix; with the fix applied, 125
iterations passed. I think The Cure is sound.

Best regards,
Alexander




Re: relfilenode statistics

2024-06-07 Thread Bertrand Drouvot
Hi,

On Thu, Jun 06, 2024 at 08:38:06PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-06-03 11:11:46 +, Bertrand Drouvot wrote:
> > The main argument is that we currently don’t have writes counters for 
> > relations.
> > The reason is that we don’t have the relation OID when writing buffers out.
> > Tracking writes per relfilenode would allow us to track/consolidate writes 
> > per
> > relation (example in the v1 patch and in the message up-thread).
> > 
> > I think that adding instrumentation in this area (writes counters) could be
> > beneficial (like it is for the ones we currently have for reads).
> > 
> > Second argument is that this is also beneficial for the "Split index and
> > table statistics into different types of stats" thread (mentioned in the 
> > previous
> > message). It would allow us to avoid additional branches in some situations 
> > (like
> > the one mentioned by Andres in the link I provided up-thread).
> 
> I think there's another *very* significant benefit:
> 
> Right now physical replication doesn't populate statistics fields like
> n_dead_tup, which can be a huge issue after failovers, because there's little
> information about what autovacuum needs to do.
> 
> Auto-analyze *partially* can fix it at times, if it's lucky enough to see
> enough dead tuples - but that's not a given and even if it works, is often
> wildly inaccurate.
> 
> 
> Once we put things like n_dead_tup into per-relfilenode stats,

Hm - I had in mind to populate relfilenode stats only with stats that are
somehow related to I/O activities. Which ones do you have in mind to put in 
relfilenode stats?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-06-07 Thread Amit Kapila
On Fri, Jun 7, 2024 at 7:57 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Thanks for the comments! Here is the V6 patch that addressed the these.
>

I have pushed this after making minor changes in the wording. I have
also changed one of the queries in docs to ignore the NULL slot_name
values.

-- 
With Regards,
Amit Kapila.




Re: Compress ReorderBuffer spill files using LZ4

2024-06-07 Thread Alvaro Herrera
On 2024-Jun-07, Dilip Kumar wrote:

> I think the compression option should be supported at the CREATE
> SUBSCRIPTION level instead of being controlled by a GUC. This way, we
> can decide on compression for each subscription individually rather
> than applying it to all subscribers. It makes more sense for the
> subscriber to control this, especially when we are planning to
> compress the data sent downstream.

True.  (I think we have some options that are in GUCs for the general
behavior and can be overridden by per-subscription options for specific
tailoring; would that make sense here?  I think it does, considering
that what we mostly want is to save disk space in the publisher when
spilling to disk.)

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu.  Five minutes later I realize that it's also talking
about food" (Donald Knuth)




Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade

2024-06-07 Thread Matthias van de Meent
On Fri, 7 Jun 2024 at 10:28, Dilip Kumar  wrote:
>
> On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent
>  wrote:
>>
>> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar  wrote:
>>>
>>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent
>>>  wrote:
>>>
>>> I agree with you that we introduced the WAL_LOG strategy to avoid
>>> these force checkpoints. However, in binary upgrade cases where no
>>> operations are happening in the system, the FILE_COPY strategy should
>>> be faster.
>>
>> While you would be correct if there were no operations happening in
>> the system, during binary upgrade we're still actively modifying
>> catalogs; and this is done with potentially many concurrent jobs. I
>> think it's not unlikely that this would impact performance.
>
> Maybe, but generally, long checkpoints are problematic because they
> involve a lot of I/O, which hampers overall system performance.
> However, in the case of a binary upgrade, the concurrent operations
> are only performing a schema restore, not a real data restore.
> Therefore, it shouldn't have a significant impact, and the checkpoints
> should also not do a lot of I/O during binary upgrade, right?

My primary concern isn't the IO, but the O(shared_buffers) that we
have to go through during a checkpoint. As I mentioned upthread, it is
reasonably possible that the new cluster is already set up with a good
fraction of the old system's shared_buffers configured. Every
checkpoint has to scan all those buffers, which IMV can get (much)
more expensive than the IO overhead caused by the WAL_LOG strategy. It
may be a baseless fear as I haven't done the performance benchmarks
for this, but I wouldn't be surprised if shared_buffers=8GB would
measurably impact the upgrade performance in the current patch (vs the
default 128MB).

I'll note that the documentation for upgrading with pg_upgrade has the
step for updating postgresql.conf / postgresql.auto.conf only after
pg_upgrade has run already, but that may not be how it's actually
used: after all, we don't have full control in this process, the user
is the one who provides the new cluster with initdb.

>> If such a change were implemented (i.e. no checkpoints for FILE_COPY
>> in binary upgrade, with a single manual checkpoint after restoring
>> template1 in create_new_objects) I think most of my concerns with this
>> patch would be alleviated.
>
> Yeah, I think that's a valid point. The second checkpoint is to ensure
> that the XLOG_DBASE_CREATE_FILE_COPY never gets replayed. However, for
> binary upgrades, we don't need that guarantee because a checkpoint
> will be performed during shutdown at the end of the upgrade anyway.

Indeed.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




XACT_EVENT for 'commit prepared'

2024-06-07 Thread Xiaoran Wang
Hi hackers,

I found that in enum XactEvent, there is 'XACT_EVENT_PREPARE' for
'prepare transaction', but there is no event for 'commit prepared' or
'rollback prepared'.

For the following SQL:

begin;
create table test(a int);
PREPARE TRANSACTION 'foo';
rollback prepared 'foo';
-
When executing ' rollback prepared 'foo'; ', I expected to get
'XACT_EVENT_ABORT', but actually,
the event type is 'XACT_EVENT_COMMIT'.

I think XACT_EVENT_COMMIT_PREPARED and XACT_EVENT_ROLLBACK_PREPARED could be
added in the function 'FinishPreparedTransaction()'.

I'm confused why there are no related events for them.
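
For context, the existing callback events in xact.h are roughly the following;
the last two entries are only the proposed additions and do not exist today:

typedef enum
{
    XACT_EVENT_COMMIT,
    XACT_EVENT_PARALLEL_COMMIT,
    XACT_EVENT_ABORT,
    XACT_EVENT_PARALLEL_ABORT,
    XACT_EVENT_PREPARE,
    XACT_EVENT_PRE_COMMIT,
    XACT_EVENT_PARALLEL_PRE_COMMIT,
    XACT_EVENT_PRE_PREPARE,
    /* proposed additions, to be fired from FinishPreparedTransaction() */
    XACT_EVENT_COMMIT_PREPARED,
    XACT_EVENT_ROLLBACK_PREPARED,
} XactEvent;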


Re: Compress ReorderBuffer spill files using LZ4

2024-06-07 Thread Dilip Kumar
On Fri, Jun 7, 2024 at 2:39 PM Alvaro Herrera  wrote:
>
> On 2024-Jun-07, Dilip Kumar wrote:
>
> > I think the compression option should be supported at the CREATE
> > SUBSCRIPTION level instead of being controlled by a GUC. This way, we
> > can decide on compression for each subscription individually rather
> > than applying it to all subscribers. It makes more sense for the
> > subscriber to control this, especially when we are planning to
> > compress the data sent downstream.
>
> True.  (I think we have some options that are in GUCs for the general
> behavior and can be overridden by per-subscription options for specific
> tailoring; would that make sense here?  I think it does, considering
> that what we mostly want is to save disk space in the publisher when
> spilling to disk.)

Yeah, that makes sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade

2024-06-07 Thread Dilip Kumar
On Fri, Jun 7, 2024 at 2:40 PM Matthias van de Meent
 wrote:
>
> On Fri, 7 Jun 2024 at 10:28, Dilip Kumar  wrote:
> >
> > On Fri, Jun 7, 2024 at 11:57 AM Matthias van de Meent
> >  wrote:
> >>
> >> On Fri, 7 Jun 2024 at 07:18, Dilip Kumar  wrote:
> >>>
> >>> On Wed, Jun 5, 2024 at 10:59 PM Matthias van de Meent
> >>>  wrote:
> >>>
> >>> I agree with you that we introduced the WAL_LOG strategy to avoid
> >>> these force checkpoints. However, in binary upgrade cases where no
> >>> operations are happening in the system, the FILE_COPY strategy should
> >>> be faster.
> >>
> >> While you would be correct if there were no operations happening in
> >> the system, during binary upgrade we're still actively modifying
> >> catalogs; and this is done with potentially many concurrent jobs. I
> >> think it's not unlikely that this would impact performance.
> >
> > Maybe, but generally, long checkpoints are problematic because they
> > involve a lot of I/O, which hampers overall system performance.
> > However, in the case of a binary upgrade, the concurrent operations
> > are only performing a schema restore, not a real data restore.
> > Therefore, it shouldn't have a significant impact, and the checkpoints
> > should also not do a lot of I/O during binary upgrade, right?
>
> My primary concern isn't the IO, but the O(shared_buffers) that we
> have to go through during a checkpoint. As I mentioned upthread, it is
> reasonably possible the new cluster is already setup with a good
> fraction of the old system's shared_buffers configured. Every
> checkpoint has to scan all those buffers, which IMV can get (much)
> more expensive than the IO overhead caused by the WAL_LOG strategy. It
> may be a baseless fear as I haven't done the performance benchmarks
> for this, but I wouldn't be surprised if shared_buffers=8GB would
> measurably impact the upgrade performance in the current patch (vs the
> default 128MB).

Okay, that's a valid point.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: speed up a logical replica setup

2024-06-07 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Thanks for making the follow-up patch! I was looking forward to your updates.
I think this patch set is the solution for the buildfarm error that was found.
However, there are remaining claims raised by others. You should reply with
what you think about them. At least:

1) There are some misleading messages [1]. I think v3-0005 patch set can solve
   the issue.
2) pg_createsubscriber may fail If the primary has subscriptions [2]. IIUC
   possible approaches are A)"keep subscriptions disabled at the end",
   B)"by default drop the pre-existing subscriptions",
   C) "do nothing, just document the risk".


> Before sending this email I realized that I did nothing about physical
> replication slots on the standby. I think we should also remove them too
> unconditionally.

I also thought about this, but it might be difficult to predict what users
expect. Can we say for sure that such a slot is not an intentional one? Regarding
the failover slot, it is OK because that's meaningful only on the standby,
but I'm not sure about other slots. I personally think we can keep the current
spec, but what do others think?


Below parts are comments for each patches.

0001
Basically LGTM. I was a bit confused because the default timeout is not set, but
it seemed to follow the suggestion by Tomas [3].

0002
If you want to improve the commit message, please add that 
sync_replication_slots
is disabled during the conversion.

0003
Confirmed it followed the discussion [4].

0004
Basically LGTM.

Other minor comments are included in the attached diff file. It contains changes
to follow conventions and pgindent/pgperltidy.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1J2fAvsJ2HihbWJ_GxETd6sdqSMrZdCVJEutRZRpm1MEQ%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CANhcyEWvimA1-f6hSrA%3D9qkfR5SonFb56b36M%2B%2BvT%3DLiFj%3D76g%40mail.gmail.com
[3]: 
https://www.postgresql.org/message-id/5d5dd4cd-6359-4109-88e8-c8e13035ae16%40enterprisedb.com
[4]: 
https://www.postgresql.org/message-id/CAA4eK1LZxYxcbeiOn3Q5hjXVtZKhJWj-fQtndAeTCvZrPev8BA%40mail.gmail.com


Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



minor_fix_by_kuroda.diff
Description: minor_fix_by_kuroda.diff


Re: relfilenode statistics

2024-06-07 Thread Bertrand Drouvot
Hi,

On Thu, Jun 06, 2024 at 08:17:36PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-06-06 12:27:49 -0400, Robert Haas wrote:
> > On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot
> >  wrote:
> > > I think we should keep the stats in the relation during relfilenode 
> > > changes.
> > > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes 
> > > in
> > > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you 
> > > can
> > > see in the example provided up-thread the new heap_blks_written statistic 
> > > has
> > > been preserved during the TRUNCATE.
> >
> > Yeah, I think there's something weird about this design. Somehow we're
> > ending up with both per-relation and per-relfilenode counters:
> >
> > +   pg_stat_get_blocks_written(C.oid) +
> > pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN
> > C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END,
> > C.relfilenode) AS heap_blks_written,
> >
> > I'll defer to Andres if he thinks that's awesome, but to me it does
> > not seem right to track some blocks written in a per-relation counter
> > and others in a per-relfilenode counter.
> 
> It doesn't immediately sound awesome. Nor really necessary?
> 
> If we just want to keep prior stats upon arelation rewrite, we can just copy
> the stats from the old relfilenode.

Agreed, that's another option. But I think that would go in another field like
"cumulative_XXX", to ensure one could still retrieve stats that are "dedicated"
to this particular "new" relfilenode. Thoughts?

> Or we can decide that those stats don't
> really make sense anymore, and start from scratch.
> 
> 
> I *guess* I could see an occasional benefit in having both counter for "prior
> relfilenodes" and "current relfilenode" - except that stats get reset manually
> and upon crash anyway, making this less useful than if it were really
> "lifetime" stats.

Right, but currently they are not lost during a relation rewrite. If we decide
not to keep the relfilenode stats during a rewrite, then things like heap_blks_read
(if we move it to relfilenode stats) would stop surviving a rewrite, while it
currently does.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Compress ReorderBuffer spill files using LZ4

2024-06-07 Thread Amit Kapila
On Thu, Jun 6, 2024 at 7:54 PM Alvaro Herrera  wrote:
>
> On 2024-Jun-06, Amit Kapila wrote:
>
> > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires  wrote:
> > >
> > > When the content of a large transaction (size exceeding
> > > logical_decoding_work_mem) and its sub-transactions has to be
> > > reordered during logical decoding, then, all the changes are written
> > > on disk in temporary files located in pg_replslot/.
> > > Decoding very large transactions by multiple replication slots can
> > > lead to disk space saturation and high I/O utilization.
>
> I like the general idea of compressing the output of logical decoding.
> It's not so clear to me that we only want to do so for spilling to disk;
> for instance, if the two nodes communicate over a slow network, it may
> even be beneficial to compress when streaming, so to this question:
>
> > Why can't one use 'streaming' option to send changes to the client
> > once it reaches the configured limit of 'logical_decoding_work_mem'?
>
> I would say that streaming doesn't necessarily have to mean we don't
> want compression, because for some users it might be beneficial.
>

Fair enough. It would be an interesting feature if we see the wider
usefulness of compression/decompression of logical changes. For
example, if this can improve the performance of applying large
transactions (aka reduce the apply lag for them) even when the
'streaming' option is 'parallel' then it would have a much wider
impact.

-- 
With Regards,
Amit Kapila.




Re: Compress ReorderBuffer spill files using LZ4

2024-06-07 Thread Amit Kapila
On Fri, Jun 7, 2024 at 2:08 PM Dilip Kumar  wrote:
>
> I think the compression option should be supported at the CREATE
> SUBSCRIPTION level instead of being controlled by a GUC. This way, we
> can decide on compression for each subscription individually rather
> than applying it to all subscribers. It makes more sense for the
> subscriber to control this, especially when we are planning to
> compress the data sent downstream.
>

Yes, that makes sense. However, we then need to provide this option
via SQL APIs as well for other plugins.

-- 
With Regards,
Amit Kapila.




Re: Reordering DISTINCT keys to match input path's pathkeys

2024-06-07 Thread Richard Guo
On Mon, Feb 5, 2024 at 11:18 AM Richard Guo  wrote:
> cfbot reminds that this patch does not apply any more.  So I've rebased
> it on master, and also adjusted the test cases a bit.

This patch does not apply any more, so here is a new rebase, with some
tweaks to the comments.

Thanks
Richard


v3-0001-Reordering-DISTINCT-keys-to-match-input-path-s-pathkeys.patch
Description: Binary data


Re:Re: Add support to TLS 1.3 cipher suites and curves lists

2024-06-07 Thread Erica Zhang
Hi Peter,
Thanks a lot for the quick response. We are using Postgres instances in our
product. For security considerations, we prefer to use TLS 1.3 cipher suites
in our product with some customized values instead of the default value
"HIGH:MEDIUM:+3DES:!aNULL". Moreover, we prefer to set a group of ECDH keys
instead of a single value.


I see that https://commitfest.postgresql.org/48/ is still open; could it be
possible to target PG17? As far as I know, PG17 is going to be released this
year, so that we could upgrade our instances to this new version accordingly.
   
Original Email

Sender: "Peter Eisentraut" < pe...@eisentraut.org >
Sent Time: 2024/6/7 16:55
To: "Erica Zhang" < ericazhangy2...@qq.com >; "pgsql-hackers" < pgsql-hackers@lists.postgresql.org >
Subject: Re: Add support to TLS 1.3 cipher suites and curves lists


On 07.06.24 08:10, Erica Zhang wrote:
> I’m a Postgres user and I’m looking into restricting the set of allowed 
> ciphers on Postgres and configure a concrete set of curves on our 
> postgres instances.

Out of curiosity, why is this needed in practice?

> Could you please help to review to see if you are interested in having 
> this change in upcoming Postgres major release(It's should be PG17)?

It would be targetting PG18 now.

Re: Proposal to include --exclude-extension Flag in pg_dump

2024-06-07 Thread Dean Rasheed
On Tue, 19 Mar 2024 at 11:53, Daniel Gustafsson  wrote:
>
> I did notice a few mistakes in the --filter
> documentation portion for other keywords but that's unrelated to this patch,
> will fix them once this is in to avoid conflicts.
>

Attached is a patch for the --filter docs, covering the omissions I can see.

Regards,
Dean
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
new file mode 100644
index 08d7753..b1dfa21
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -866,13 +866,14 @@ PostgreSQL documentation
 same rules as the corresponding options:
 -t/--table,
 --table-and-children,
---exclude-table-and-children or
--T for tables,
--n/--schema for schemas,
+-T/--exclude-table, and
+--exclude-table-and-children for tables,
+-n/--schema and
+-N/--exclude-schema for schemas,
 --include-foreign-data for data on foreign servers,
---exclude-table-data,
+--exclude-table-data and
 --exclude-table-data-and-children for table data, and
--e/--extension or
+-e/--extension and
 --exclude-extension for extensions.
 To read from STDIN, use - as the
 filename.  The --filter option can be specified in
@@ -895,34 +896,37 @@ PostgreSQL documentation
 
  
   
-   extension: extensions, works like the
-   -e/--extension option.
+   extension: extensions. This works like the
+   -e/--extension and
+   --exclude-extension options.
   
  
  
   
-   foreign_data: data on foreign servers, works like
+   foreign_data: data on foreign servers. This works like
the --include-foreign-data option. This keyword can
only be used with the include keyword.
   
  
  
   
-   table: tables, works like the
-   -t/--table option.
+   table: tables. This works like the
+   -t/--table and
+   -T/--exclude-table options.
   
  
  
   
table_and_children: tables including any partitions
-   or inheritance child tables, works like the
-   --table-and-children option.
+   or inheritance child tables. This works like the
+   --table-and-children and
+   --exclude-table-and-children options.
   
  
  
   
table_data: table data of any tables matching
-   pattern, works like the
+   pattern. This works like the
--exclude-table-data option. This keyword can only
be used with the exclude keyword.
   
@@ -931,15 +935,16 @@ PostgreSQL documentation
   
table_data_and_children: table data of any tables
matching pattern as well as any partitions
-   or inheritance children of the table(s), works like the
+   or inheritance children of the table(s). This works like the
--exclude-table-data-and-children option. This
keyword can only be used with the exclude keyword.
   
  
  
   
-   schema: schemas, works like the
-   -n/--schema option.
+   schema: schemas. This works like the
+   -n/--schema and
+   -N/--exclude-schema options.
   
  
 

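For reference, a filter file exercising these keywords might look like the
following (object names are placeholders), passed as pg_dump --filter=FILENAME:

include table public.orders
exclude table_data public.audit_log
exclude extension postgis
exclude schema archive
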

Re: Logical Replication of sequences

2024-06-07 Thread Amit Kapila
On Fri, Jun 7, 2024 at 7:55 AM Masahiko Sawada  wrote:
>
> On Thu, Jun 6, 2024 at 6:40 PM Amit Kapila  wrote:
> >
> > On Thu, Jun 6, 2024 at 11:10 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Jun 5, 2024 at 9:30 PM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > > To achieve this, we can allow sequences to be copied during
> > > > the initial CREATE SUBSCRIPTION command similar to what we do for
> > > > tables. And then later by new/existing command, we re-copy the already
> > > > existing sequences on the subscriber.
> > > >
> > > > The options for the new command could be:
> > > > Alter Subscription ... Refresh Sequences
> > > > Alter Subscription ... Replicate Sequences
> > > >
> > > > In the second option, we need to introduce a new keyword Replicate.
> > > > Can you think of any better option?
> > >
> > > Another idea is doing that using options. For example,
> > >
> > > For initial sequences synchronization:
> > >
> > > CREATE SUBSCRIPTION ... WITH (copy_sequence = true);
> > >
> >
> > How will it interact with the existing copy_data option? So copy_data
> > will become equivalent to copy_table_data, right?
>
> Right.
>
> >
> > > For re-copy (or update) sequences:
> > >
> > > ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_sequence = true);
> > >
> >
> > Similar to the previous point it can be slightly confusing w.r.t
> > copy_data. And would copy_sequence here mean that it would copy
> > sequence values of both pre-existing and newly added sequences, if so,
> > that would make it behave differently than copy_data?  The other
> > possibility in this direction would be to introduce an option like
> > replicate_all_sequences/copy_all_sequences which indicates a copy of
> > both pre-existing and new sequences, if any.
>
> Copying sequence data works differently than replicating table data
> (initial data copy and logical replication). So I thought the
> copy_sequence option (or whatever better name) always does both
> updating pre-existing sequences and adding new sequences. REFRESH
> PUBLICATION updates the tables to be subscribed, so we also update or
> add sequences associated to these tables.
>

Are you imagining the behavior for sequences associated with tables to be
different from that for sequences defined by the CREATE SEQUENCE .. command? I
was thinking that users would associate sequences with publications
similarly to what we do for tables, in both cases. For example, they
would need to explicitly mention the sequences they want to replicate with
commands like CREATE PUBLICATION ... FOR SEQUENCE s1, s2, ...; CREATE
PUBLICATION ... FOR ALL SEQUENCES, or CREATE PUBLICATION ... FOR
SEQUENCES IN SCHEMA sch1;

Here, the variants FOR ALL SEQUENCES and SEQUENCES IN SCHEMA sch1
should copy both the explicitly defined sequences and the sequences
defined with the tables. Do you think we need a different variant for just
copying sequences implicitly associated with tables (say, for identity
columns)?

>
> >
> > > >
> > > > In addition to the above, the command Alter Subscription .. Refresh
> > > > Publication will fetch any missing sequences similar to what it does
> > > > for tables.
> > >
> > > On the subscriber side, do we need to track which sequences are
> > > created via CREATE/ALTER SUBSCRIPTION?
> > >
> >
> > I think so unless we find some other way to know at refresh
> > publication time which all new sequences need to be part of the
> > subscription. What should be the behavior w.r.t sequences when the
> > user performs ALTER SUBSCRIPTION ... REFRESH PUBLICATION? I was
> > thinking similar to tables, it should fetch any missing sequence
> > information from the publisher.
>
> It seems to make sense to me. But I have one question: do we want to
> support replicating sequences that are not associated with any tables?
>

Yes, unless we see a problem with it.

> if yes, what if we refresh two different subscriptions that subscribe
> to different tables on the same database?

What problem do you see with it?

>
> On the other hand, if no
> (i.e. replicating only sequences owned by tables), can we know which
> sequences to replicate by checking the subscribed tables?
>

Sorry, I didn't understand your question. Can you please try to
explain in more words or use some examples?

-- 
With Regards,
Amit Kapila.




Re: Re: Add support to TLS 1.3 cipher suites and curves lists

2024-06-07 Thread Michael Paquier
On Fri, Jun 07, 2024 at 06:02:37PM +0800, Erica Zhang wrote:
> I see the https://commitfest.postgresql.org/48/ is still open, could
> it be possible to target for PG17? As I know PG17 is going to be
> release this year so that we can upgrade our instances to this new
> version accodingly.

Echoing with Peter, https://commitfest.postgresql.org/48/ is planned
to be the first commit fest of the development cycle for Postgres 18.
v17 is in feature freeze state and beta, where only bug fixes are
accepted, and not new features.
--
Michael


signature.asc
Description: PGP signature


Re: Conflict Detection and Resolution

2024-06-07 Thread Ashutosh Bapat
On Thu, Jun 6, 2024 at 5:16 PM Nisha Moond  wrote:

> >
>
> Here are more use cases of the "earliest_timestamp_wins" resolution method:
> 1) Applications where the record of first occurrence of an event is
> important. For example, sensor based applications like earthquake
> detection systems, capturing the first seismic wave's time is crucial.
> 2) Scheduling systems, like appointment booking, prioritize the
> earliest request when handling concurrent ones.
> 3) In contexts where maintaining chronological order is important -
>   a) Social media platforms display comments ensuring that the
> earliest ones are visible first.
>   b) Finance transaction processing systems rely on timestamps to
> prioritize the processing of transactions, ensuring that the earliest
> transaction is handled first
>

Thanks for sharing examples. However, these scenarios would be handled by
the application and not during replication. What we are discussing here is
the timestamp when a row was updated/inserted/deleted (or rather, when the
transaction that updated the row committed/became visible), not DML on a
column of type timestamp. Some implementations use a hidden timestamp column,
but that's different from a user column which captures the timestamp of (say)
an event. The conflict resolution will be based on the timestamp when that
column's value was recorded in the database, which may be different from the
value of the column itself.

If we use the transaction commit timestamp as the basis for resolution, a
transaction where multiple rows conflict may end up with different rows
affected by that transaction being resolved differently. Say three
transactions T1, T2 and T3 on separate origins, with commit timestamps t1, t2
and t3 respectively, change rows (r1, r2), (r2, r3) and (r1, r4) respectively.
Changes to r1 and r2 will conflict. Let's say T2 and T3 are applied first
and then T1 is applied. If t2 < t1 < t3, r1 will end up with the version from
T3 and r2 will end up with the version from T1 after applying all three
transactions. Would that introduce an inconsistency between r1 and r2?

-- 
Best Wishes,
Ashutosh Bapat


Re: Things I don't like about \du's "Attributes" column

2024-06-07 Thread Robert Haas
On Thu, Jun 6, 2024 at 5:10 PM Pavel Luzanov  wrote:
> Agree.
> There is an additional technical argument for removing this replacement.
> I don't like explicit cast to text of the "Connection limit" column.
> Without 'Not allowed' it is no longer required.
> Value -1 can be replaced by NULL with an implicit cast to integer.

Yeah, +1 for that idea.

> Example output:
>
> \du+ regress_du*
>                                               List of roles
>     Role name     | Login |  Attributes  |         Valid until          | Connection limit |   Description
> ------------------+-------+--------------+------------------------------+------------------+------------------
>  regress_du_admin | yes   | Superuser   +|                              |                  | some description
>                   |       | Create DB   +|                              |                  |
>                   |       | Create role +|                              |                  |
>                   |       | Inherit     +|                              |                  |
>                   |       | Replication +|                              |                  |
>                   |       | Bypass RLS   |                              |                  |
>  regress_du_role0 | yes   | Inherit      | Tue Jun 04 00:00:00 2024 PDT |                0 |
>  regress_du_role1 | no    | Create role +| infinity                     |                  |
>                   |       | Inherit      |                              |                  |
>  regress_du_role2 | yes   | Inherit     +|                              |               42 |
>                   |       | Replication +|                              |                  |
>                   |       | Bypass RLS   |                              |                  |
> (4 rows)

This seems unobjectionable to me. I am not sure whether it is better
than the current version, or whether it is what we want. But it seems
reasonable.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Conflict Detection and Resolution

2024-06-07 Thread Tomas Vondra
On 5/27/24 07:48, shveta malik wrote:
> On Sat, May 25, 2024 at 2:39 AM Tomas Vondra
>  wrote:
>>
>> On 5/23/24 08:36, shveta malik wrote:
>>> Hello hackers,
>>>
>>> Please find the proposal for Conflict Detection and Resolution (CDR)
>>> for Logical replication.
>>> >> below details.>
>>>
>>> Introduction
>>> 
>>> In case the node is subscribed to multiple providers, or when local
>>> writes happen on a subscriber, conflicts can arise for the incoming
>>> changes.  CDR is the mechanism to automatically detect and resolve
>>> these conflicts depending on the application and configurations.
>>> CDR is not applicable for the initial table sync. If locally, there
>>> exists conflicting data on the table, the table sync worker will fail.
>>> Please find the details on CDR in apply worker for INSERT, UPDATE and
>>> DELETE operations:
>>>
>>
>> Which architecture are you aiming for? Here you talk about multiple
>> providers, but the wiki page mentions active-active. I'm not sure how
>> much this matters, but it might.
> 
> Currently, we are working for multi providers case but ideally it
> should work for active-active also. During further discussion and
> implementation phase, if we find that, there are cases which will not
> work in straight-forward way for active-active, then our primary focus
> will remain to first implement it for multiple providers architecture.
> 
>>
>> Also, what kind of consistency you expect from this? Because none of
>> these simple conflict resolution methods can give you the regular
>> consistency models we're used to, AFAICS.
> 
> Can you please explain a little bit more on this.
> 

I was referring to the well established consistency models / isolation
levels, e.g. READ COMMITTED or SNAPSHOT ISOLATION. This determines what
guarantees the application developer can expect, what anomalies can
happen, etc.

I don't think any such isolation level can be implemented with a simple
conflict resolution methods like last-update-wins etc. For example,
consider an active-active where both nodes do

  UPDATE accounts SET balance=balance+1000 WHERE id=1

This will inevitably lead to a conflict, and while the last-update-wins
resolves this "consistently" on both nodes (e.g. ending with the same
result), it's essentially a lost update.

This is a very simplistic example of course, I recall there are various
more complex examples involving foreign keys, multi-table transactions,
constraints, etc. But in principle it's a manifestation of the same
inherent limitation of conflict detection and resolution etc.

Similarly, I believe this affects not just active-active, but also the
case where one node aggregates data from multiple publishers. Maybe not
to the same extent / it might be fine for that use case, but you said
the end goal is to use this for active-active. So I'm wondering what's
the plan, there.

If I'm writing an application for active-active using this conflict
handling, what assumptions can I make? Can I just do stuff as if on
a single node, or do I need to be super conscious about the zillion ways
things can misbehave in a distributed system?

My personal opinion is that the closer this will be to the regular
consistency levels, the better. If past experience taught me anything,
it's very hard to predict how distributed systems with eventual
consistency behave, and even harder to actually test the application in
such environment.

In any case, if there are any differences compared to the usual
behavior, it needs to be very clearly explained in the docs.

>>
>>> INSERT
>>> 
>>> To resolve INSERT conflict on subscriber, it is important to find out
>>> the conflicting row (if any) before we attempt an insertion. The
>>> indexes or search preference for the same will be:
>>> First check for replica identity (RI) index.
>>>   - if not found, check for the primary key (PK) index.
>>> - if not found, then check for unique indexes (individual ones or
>>> added by unique constraints)
>>>  - if unique index also not found, skip CDR
>>>
>>> Note: if no RI index, PK, or unique index is found but
>>> REPLICA_IDENTITY_FULL is defined, CDR will still be skipped.
>>> The reason being that even though a row can be identified with
>>> REPLICAT_IDENTITY_FULL, such tables are allowed to have duplicate
>>> rows. Hence, we should not go for conflict detection in such a case.
>>>
>>
>> It's not clear to me why would REPLICA_IDENTITY_FULL mean the table is
>> allowed to have duplicate values? It just means the upstream is sending
>> the whole original row, there can still be a PK/UNIQUE index on both the
>> publisher and subscriber.
> 
> Yes, right. Sorry for confusion. I meant the same i.e. in absence of
> 'RI index, PK, or unique index', tables can have duplicates. So even
> in presence of Replica-identity (FULL in this case) but in absence of
> unique/primary index, CDR will be skipped for INSERT.
> 
>>
>>> In case of replica identity ‘nothing’ and in absence

Re: Conflict Detection and Resolution

2024-06-07 Thread Tomas Vondra
On 5/28/24 11:17, Nisha Moond wrote:
> On Mon, May 27, 2024 at 11:19 AM shveta malik  wrote:
>>
>> On Sat, May 25, 2024 at 2:39 AM Tomas Vondra
>>  wrote:
>>>
>>> ...
>>>
>>> I don't understand the why should update_missing or update_deleted be
>>> different, especially considering it's not detected reliably. And also
>>> that even if we happen to find the row the associated TOAST data may
>>> have already been removed. So why would this matter?
>>
>> Here, we are trying to tackle the case where the row is 'recently'
>> deleted i.e. concurrent UPDATE and DELETE on pub and sub. User may
>> want to opt for a different resolution in such a case as against the
>> one where the corresponding row was not even present in the first
>> place. The case where the row was deleted long back may not fall into
>> this category as there are higher chances that they have been removed
>> by vacuum and can be considered equivalent to the update_ missing
>> case.
>>
>> Regarding "TOAST column" for deleted row cases, we may need to dig
>> more. Thanks for bringing this case. Let me analyze more here.
>>
> I tested a simple case with a table with one TOAST column and found
> that when a tuple with a TOAST column is deleted, both the tuple and
> corresponding pg_toast entries are marked as ‘deleted’ (dead) but not
> removed immediately. The main tuple and respective pg_toast entry are
> permanently deleted only during vacuum. First, the main table’s dead
> tuples are vacuumed, followed by the secondary TOAST relation ones (if
> available).
> Please let us know if you have a specific scenario in mind where the
> TOAST column data is deleted immediately upon ‘delete’ operation,
> rather than during vacuum, which we are missing.
> 

I'm pretty sure you can vacuum the TOAST table directly, which means
you'll end up with a deleted tuple with TOAST pointers, but with the
TOAST entries already gone.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: question regarding policy for patches to out-of-support branches

2024-06-07 Thread Robert Haas
On Thu, Jun 6, 2024 at 10:04 PM Tom Lane  wrote:
> > I added them here with minimal copy editing an no attempt to organize or
> > sort into groups:
> > https://wiki.postgresql.org/wiki/Committing_checklist#Policies
> > If someone has thoughts on how to improve I am happy to make more changes.
>
> Thanks!  I summoned the energy to make a few more improvements,
> particularly updating stuff that seemed out-of-date.  I'm sure
> there's more that could be added here.

This is nice! I wonder if we could interest anyone in creating tooling
that could be used to check some of this stuff -- ideally run as part
of the regular build process, so that you can't fail to notice that you
did it wrong.

Not all of these rules are subject to automatic verification, e.g. it's
hard to enforce that a change to an out-of-support branch makes no
functional change. But an awful lot of them could be, and I would
personally be significantly happier and less stressed if I knew that
'ninja && meson test' was going to tell me that I did it wrong before
I pushed, instead of finding out afterward and then having to drop
everything to go clean it up.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-07 Thread Robert Haas
On Fri, Jun 7, 2024 at 4:05 AM Alvaro Herrera  wrote:
> >  static void
> > -ZeroBuffer(Buffer buffer, ReadBufferMode mode)
> > +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)
>
> This change makes the API very strange.  Should the function be called
> ZeroAndLockBuffer() instead?  Then the addition of a "bool zero"
> argument makes a lot more sense.

I agree that's better, but it still looks a bit weird. You have to
realize that 'bool zero' means 'is already zeroed' here -- or at
least, I guess that's the intention. But then I wonder why you'd call
a function called ZeroAndLockBuffer if all you need to do is
LockBuffer.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Conflict Detection and Resolution

2024-06-07 Thread Tomas Vondra



On 6/3/24 09:30, Amit Kapila wrote:
> On Sat, May 25, 2024 at 2:39 AM Tomas Vondra
>  wrote:
>>
>> On 5/23/24 08:36, shveta malik wrote:
>>>
>>> Conflict Resolution
>>> 
>>> a) latest_timestamp_wins:The change with later commit timestamp wins.
>>> b) earliest_timestamp_wins:   The change with earlier commit timestamp wins.
>>> c) apply:   Always apply the remote change.
>>> d) skip:Remote change is skipped.
>>> e) error:   Error out on conflict. Replication is stopped, manual
>>> action is needed.
>>>
>>
>> Why not to have some support for user-defined conflict resolution
>> methods, allowing to do more complex stuff (e.g. merging the rows in
>> some way, perhaps even with datatype-specific behavior)?
>>
>>> The change will be converted to 'UPDATE' and applied if the decision
>>> is in favor of applying remote change.
>>>
>>> It is important to have commit timestamp info available on subscriber
>>> when latest_timestamp_wins or earliest_timestamp_wins method is chosen
>>> as resolution method.  Thus ‘track_commit_timestamp’ must be enabled
>>> on subscriber, in absence of which, configuring the said
>>> timestamp-based resolution methods will result in error.
>>>
>>> Note: If the user has chosen the latest or earliest_timestamp_wins,
>>> and the remote and local timestamps are the same, then it will go by
>>> system identifier. The change with a higher system identifier will
>>> win. This will ensure that the same change is picked on all the nodes.
>>
>> How is this going to deal with the fact that commit LSN and timestamps
>> may not correlate perfectly? That is, commits may happen with LSN1 <
>> LSN2 but with T1 > T2.
>>
> 
> One of the possible scenarios discussed at pgconf.dev with Tomas for
> this was as follows:
> 
> Say there are two publisher nodes PN1, PN2, and subscriber node SN3.
> The logical replication is configured such that a subscription on SN3
> has publications from both PN1 and PN2. For example, SN3 (sub) -> PN1,
> PN2 (p1, p2)
> 
> Now, on PN1, we have the following operations that update the same row:
> 
> T1
> Update-1 on table t1 at LSN1 (1000) on time (200)
> 
> T2
> Update-2 on table t1 at LSN2 (2000) on time (100)
> 
> Then in parallel, we have the following operation on node PN2 that
> updates the same row as Update-1, and Update-2 on node PN1.
> 
> T3
> Update-3 on table t1 at LSN(1500) on time (150)
> 
> By theory, we can have a different state on subscribers depending on
> the order of updates arriving at SN3 which shouldn't happen. Say, the
> order in which they reach SN3 is: Update-1, Update-2, Update-3 then
> the final row we have is by Update-3 considering we have configured
> last_update_wins as a conflict resolution method. Now, consider the
> other order:  Update-1, Update-3, Update-2, in this case, the final
> row will be by Update-2 because when we try to apply Update-3, it will
> generate a conflict and as per the resolution method
> (last_update_wins) we need to retain Update-1.
> 
> On further thinking, the operations on node-1 PN-1 as defined above
> seem impossible because one of the Updates needs to wait for the other
> to write a commit record. So the commits may happen with LSN1 < LSN2
> but with T1 > T2 but they can't be on the same row due to locks. So,
> the order of apply should still be consistent. Am I missing
> something?
> 

Sorry, I should have read your message before responding a couple
minutes ago. I think you're right this exact example can't happen, due
to the dependency between transactions.

But as I wrote, I'm not quite convinced this means there are not other
issues with this way of resolving conflicts. It's more likely a more
complex scenario is required.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Compress ReorderBuffer spill files using LZ4

2024-06-07 Thread Tomas Vondra
On 6/6/24 16:24, Alvaro Herrera wrote:
> On 2024-Jun-06, Amit Kapila wrote:
> 
>> On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires  wrote:
>>>
>>> When the content of a large transaction (size exceeding
>>> logical_decoding_work_mem) and its sub-transactions has to be
>>> reordered during logical decoding, then, all the changes are written
>>> on disk in temporary files located in pg_replslot/.
>>> Decoding very large transactions by multiple replication slots can
>>> lead to disk space saturation and high I/O utilization.
> 
> I like the general idea of compressing the output of logical decoding.
> It's not so clear to me that we only want to do so for spilling to disk;
> for instance, if the two nodes communicate over a slow network, it may
> even be beneficial to compress when streaming, so to this question:
> 
>> Why can't one use 'streaming' option to send changes to the client
>> once it reaches the configured limit of 'logical_decoding_work_mem'?
> 
> I would say that streaming doesn't necessarily have to mean we don't
> want compression, because for some users it might be beneficial.
> 
> I think a GUC would be a good idea.  Also, what if for whatever reason
> you want a different compression algorithm or different compression
> parameters?  Looking at the existing compression UI we offer in
> pg_basebackup, perhaps you could add something like this:
> 
> compress_logical_decoding = none
> compress_logical_decoding = lz4:42
> compress_logical_decoding = spill-zstd:99
> 
> "none" says to never use compression (perhaps should be the default),
> "lz4:42" says to use lz4 with parameters 42 on both spilling and
> streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but
> only for spilling to disk.
> 
> (I don't mean to say that you should implement Zstd compression with
> this patch, only that you should choose the implementation so that
> adding Zstd support (or whatever) later is just a matter of adding some
> branches here and there.  With the current #ifdef you propose, it's hard
> to do that.  Maybe separate the parts that depend on the specific
> algorithm to algorithm-agnostic functions.)
> 

I haven't been following the "libpq compression" thread, but wouldn't
that also do compression for the streaming case? That was my assumption,
at least, and it seems like the right way - we probably don't want to
patch every place that sends data over the network independently, right?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Compress ReorderBuffer spill files using LZ4

2024-06-07 Thread Tomas Vondra
On 6/6/24 12:58, Julien Tachoires wrote:
> ...
>
> When compiled with LZ4 support (--with-lz4), this patch enables data
> compression/decompression of these temporary files. Each transaction
> change that must be written on disk (ReorderBufferDiskChange) is now
> compressed and encapsulated in a new structure.
> 

I'm a bit confused, but why tie this to having lz4? Why shouldn't this
be supported even for pglz, or whatever algorithms we add in the future?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: race condition in pg_class

2024-06-07 Thread Robert Haas
On Thu, Jun 6, 2024 at 7:20 PM Michael Paquier  wrote:
> On Thu, Jun 06, 2024 at 09:48:51AM -0400, Robert Haas wrote:
> > It's not this patch set's fault, but I'm not very pleased to see that
> > the injection point wait events have been shoehorned into the
> > "Extension" category - which they are not - instead of being a new
> > wait_event_type. That would have avoided the ugly wait-event naming
> > pattern, inconsistent with everything else, introduced by
> > inplace050-tests-inj-v1.patch.
>
> Not sure to agree with that.  The set of core backend APIs supporting
> injection points have nothing to do with wait events.  The library
> attached to one or more injection points *may* decide to use a wait
> event like what the wait/wakeup calls in modules/injection_points do,
> but that's entirely optional.  These rely on custom wait events,
> plugged into the Extension category as the code run is itself in an
> extension.  I am not arguing against the point that it may be
> interesting to plug in custom wait event categories, but the current
> design of wait events makes that much harder than what core is
> currently able to handle, and I am not sure that this brings much at
> the end as long as the wait event strings can be customized.
>
> I've voiced upthread concerns over the naming enforced by the patch
> and the way it plugs the namings into the isolation functions, by the
> way.

I think the core code should provide an "Injection Point" wait event
type and let extensions add specific wait events there, just like you
did for "Extension". Then this ugly naming would go away. As I see it,
"Extension" is only supposed to be used as a catch-all when we have no
other information, but here we do. If we refuse to use the
wait_event_type field to categorize waits, then people are going to
have to find some other way to get that data into the system, as Noah
has done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Compress ReorderBuffer spill files using LZ4

2024-06-07 Thread Julien Tachoires
Le ven. 7 juin 2024 à 05:59, Tomas Vondra
 a écrit :
>
> On 6/6/24 12:58, Julien Tachoires wrote:
> > ...
> >
> > When compiled with LZ4 support (--with-lz4), this patch enables data
> > compression/decompression of these temporary files. Each transaction
> > change that must be written on disk (ReorderBufferDiskChange) is now
> > compressed and encapsulated in a new structure.
> >
>
> I'm a bit confused, but why tie this to having lz4? Why shouldn't this
> be supported even for pglz, or whatever algorithms we add in the future?

That's right, reworking this patch in that sense.

Regards,

JT




Re: relfilenode statistics

2024-06-07 Thread Robert Haas
On Thu, Jun 6, 2024 at 11:17 PM Andres Freund  wrote:
> If we just want to keep prior stats upon arelation rewrite, we can just copy
> the stats from the old relfilenode.  Or we can decide that those stats don't
> really make sense anymore, and start from scratch.

I think we need to think carefully about what we want the user
experience to be here. "Per-relfilenode stats" could mean "sometimes I
don't know the relation OID so I want to use the relfilenumber
instead, without changing the user experience" or it could mean "some
of these stats actually properly pertain to the relfilenode rather
than the relation so I want to associate them with the right object
and that will affect how the user sees things." We need to decide
which it is. If it's the former, then we need to examine whether the
goal of hiding the distinction between relfilenode stats and relation
stats from the user is in fact feasible. If it's the latter, then we
need to make sure the whole patch reflects that design, which would
include e.g. NOT copying stats from the old to the new relfilenode,
and which would also include documenting the behavior in a way that
will be understandable to users.

In my experience, the worst thing you can do in cases like this is be
somewhere in the middle. Then you tend to end up with stuff like: the
difference isn't supposed to be something that the user knows or cares
about, except that they do have to know and care because you haven't
thoroughly covered up the deception, and often they have to reverse
engineer the behavior because you didn't document what was really
happening because you imagined that they wouldn't notice.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: ssl tests fail due to TCP port conflict

2024-06-07 Thread Andrew Dunstan



On 2024-06-06 Th 18:02, Jelte Fennema-Nio wrote:

On Wed, 5 Jun 2024 at 23:37, Tom Lane  wrote:

Andrew Dunstan  writes:

On 2024-06-05 We 16:00, Alexander Lakhin wrote:

That is, psql from the test instance 001_ssltests_34 opened a
connection to
the test server with the client port 50072 and it made using the port by
the server from the test instance 001_ssltests_30 impossible.

Oh. (kicks self)

D'oh.


Should we really be allocating ephemeral server ports in the range
41952..65535? Maybe we should be looking for an unallocated port
somewhere below 41952, and above, say, 32767, so we couldn't have a
client socket collision.

Hmm, are there really any standards about how these port numbers
are used?

I wonder if we don't need to just be prepared to retry the whole
thing a few times.  Even if it's true that "clients" shouldn't
choose ports below 41952, we still have a small chance of failure
against a non-Postgres server starting up at the wrong time.

My suggestion would be to not touch the ephemeral port range at all
for these ports. In practice the ephemeral port range is used for
cases where the operating system assigns the port and the application
doesn't care what it is, not for when you want to get a free port but
need to know in advance which one it is.

For the PgBouncer test suite we do something similar as the PG its
perl tests do, but there we allocate a port between 10200 and 32768:
https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215

Sure theoretically it's possible to hit a rare case where another
server starts up at the wrong time, but that chance seems way lower
than a client starting up at the wrong time. Especially since there
aren't many servers that use a port with 5 digits.

Attached is a patch that updates the port numbers.



Makes sense to me.

I still think my patch to force TCP mode for the SSL test makes sense as 
well.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: 回复: An implementation of multi-key sort

2024-06-07 Thread Yao Wang
To be accurate, "multi-key sort" includes both "multi-key quick sort"
and "multi-key heap sort". This patch includes code change related to
only "multi-key quick sort" which is used to replace standard quick
sort for tuplesort. The "multi-key heap sort" is about an implementation
of multi-key heap and should be treated as a separated task. We need
to clarify the naming to avoid confusion.

I updated the code; the changes involve only function/variable renaming and
relevant comments, plus some minor assertion changes. Please see the
attachment.


Thanks,

Yao Wang

On Fri, May 31, 2024 at 8:09 PM Yao Wang  wrote:
>
> I added two optimizations to mksort which exist on qsort_tuple():
>
> 1. When selecting the pivot, always pick the item in the middle of the array
> rather than at random. Theoretically it has the same effect as the old
> approach, but it can eliminate some unstable perf test results, plus a small
> perf benefit from removing the random value generator.
> 2. Always check whether the array is ordered already, and return
> immediately if it is. The pre-ordered check requires extra cost and
> impacts perf numbers on some data sets, but can improve perf
> significantly on other data sets.
>
> By now, mksort has perf results equal or better than qsort on all data
> sets I ever used.
>
> I also updated test case. Please see v3 code as attachment.
>
> Perf test results:
>
> Data set 1 (with mass duplicate values):
> -
>
> create table t1 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
> insert into t1 values (generate_series(1,49), 0, 0, 0, 0,
> 'aaabbb');
> update t1 set c2 = c1 % 100, c3 = c1 % 50, c4 = c1 % 10, c5 = c1 % 3;
> update t1 set c6 = 'aaabbb'
> || (c1 % 5)::text;
>
> Query 1:
>
> explain analyze select c1 from t1 order by c6, c5, c4, c3, c2, c1;
>
> Disable Mksort
>
> 3021.636 ms
> 3014.669 ms
> 3033.588 ms
>
> Enable Mksort
>
> 1688.590 ms
> 1686.956 ms
> 1688.567 ms
>
> The improvement is 78.9%, which is reduced from the previous version
> (129%). Most of the extra cost should come from the pre-ordered check.
>
> Query 2:
>
> create index idx_t1_mk on t1 (c6, c5, c4, c3, c2, c1);
>
> Disable Mksort
>
> 1674.648 ms
> 1680.608 ms
> 1681.373 ms
>
> Enable Mksort
>
> 1143.341 ms
> 1143.462 ms
> 1143.894 ms
>
> The improvement is ~47%, which is also reduced a bit (52%).
>
> Data set 2 (with distinct values):
> --
>
> create table t2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
> insert into t2 values (generate_series(1,49), 0, 0, 0, 0, '');
> update t2 set c2 = 90 - c1, c3 = 91 - c1, c4 = 92 - c1, c5
> = 93 - c1;
> update t2 set c6 = 'aaabbb'
> || (94 - c1)::text;
>
> Query 1:
>
> explain analyze select c1 from t2 order by c6, c5, c4, c3, c2, c1;
>
> Disable Mksort
>
> 12199.963 ms
> 12197.068 ms
> 12191.657 ms
>
> Enable Mksort
>
> 9538.219 ms
> 9571.681 ms
> 9536.335 ms
>
> The improvement is 27.9%, which is much better than the old approach (-6.2%).
>
> Query 2 (the data is pre-ordered):
>
> explain analyze select c1 from t2 order by c6 desc, c5, c4, c3, c2, c1;
>
> Enable Mksort
>
> 768.191 ms
> 768.079 ms
> 767.026 ms
>
> Disable Mksort
>
> 768.757 ms
> 766.166 ms
> 766.149 ms
>
> They are almost the same since no actual sort was performed, and much
> better than the old approach (-1198.1%).
>
>
> Thanks,
>
> Yao Wang
>
> On Fri, May 24, 2024 at 8:50 PM Yao Wang  wrote:
> >
> > When all leading keys are different, mksort will finish the entire sort at
> > the first sort key and never touch the other keys. In that case, mksort
> > effectively falls back to a kind of qsort.
> >
> > I created another data set with distinct values in all sort keys:
> >
> > create table t2 (c1 int, c2 int, c3 int, c4 int, c5 int, c6 varchar(100));
> > insert into t2 values (generate_series(1,49), 0, 0, 0, 0, '');
> > update t2 set c2 = 90 - c1, c3 = 91 - c1, c4 = 92 - c1, c5
> > = 93 - c1;
> > update t2 set c6 = 'aaabbb'
> >   || (94 - c1)::text;
> > explain analyze select c1 from t2 order by c6, c5, c4, c3, c2, c1;
> >
> > Results:
> >
> > MKsort:
> > 12374.427 ms
> > 12528.068 ms
> > 12554.718 ms
> >
> > qsort:
> > 12251.422 ms
> > 12279.938 ms
> > 12280.254 ms
> >
> > MKsort is a bit slower than qsort, which can be explained by extra
> > checks of MKsort.
> >
> > Yao Wang
> >
> > On Fri, May 24, 2024 at 8:36 PM Wang Yao  wrote:
> > >
> > >
> > >
> > > 获取Outlook for Android
> > > 
> > > From: Heikki Linnakangas 
> > > Sent: Thursday, May 23, 2024 8:47:29 PM
> > > To: Wang Yao ; PostgreSQL Hackers 
> > > 
> > > Cc: inte...@outlook.com 
> > > Subject: Re: 回复: An implementation of multi-key sort
> > >
> > > On 23/05/2024 15:39, Wang Yao wrote:
> > > > No obvious perf regression is expected

Re: Patch bug: Fix jsonpath .* on Arrays

2024-06-07 Thread David E. Wheeler
On Jun 4, 2024, at 20:45, David E. Wheeler  wrote:

> Oh FFS, unwrapping still breaks my brain. You’re right, of course. Here’s a 
> new patch that demonstrates that behavior, since that code path is not 
> currently represented in tests AFAICT (I would have expected to have broken 
> it with this patch).

Rebased and moved the new tests to the end of the file.

D


v2-0001-Add-tests-for-jsonpath-.-on-arrays.patch
Description: Binary data


Re: Patch bug: Fix jsonpath .* on Arrays

2024-06-07 Thread David E. Wheeler
On Jun 7, 2024, at 10:23, David E. Wheeler  wrote:

> Rebased and moved the new tests to the end of the file.

Bah, sorry, that was the previous patch. Here’s v3.

D



v3-0001-Add-tests-for-jsonpath-.-on-arrays.patch
Description: Binary data


Re: ssl tests fail due to TCP port conflict

2024-06-07 Thread Tom Lane
Andrew Dunstan  writes:
> On 2024-06-06 Th 18:02, Jelte Fennema-Nio wrote:
>> For the PgBouncer test suite we do something similar as the PG its
>> perl tests do, but there we allocate a port between 10200 and 32768:
>> https://github.com/pgbouncer/pgbouncer/blob/master/test/utils.py#L192-L215

> Makes sense to me.

> I still think my patch to force TCP mode for the SSL test makes sense as 
> well.

+1 to both things.  If that doesn't get the failure rate down to an
acceptable level, we can look at the retry idea.

regards, tom lane




Re: Postgresql OOM

2024-06-07 Thread Radu Radutiu
>
>
>
>> The planner should recognize this situation and avoid use of hash
>> join in such cases, but maybe the statistics aren't reflecting the
>> problem, or maybe there's something wrong with the logic specific
>> to parallel hash join.  You've not really provided enough information
>> to diagnose why the poor choice of plan.
>>
>> regards, tom lane
>>
>
> Thanks for looking into this. I'm not sure what information would be
> needed to look at the choice of plan.
> The statistics for the join conditions in the query would be:
>  join_condition | min_count | max_count |       avg_count
> ----------------+-----------+-----------+------------------------
>  snd_tro        |         0 |         0 | 0.
>  rpl_rec_tro    |         0 |         2 | 0.99869222814474470477
>  rec_tro        |         0 |         2 | 0.99869222814474470477
>  rpl_snd_tro    |         0 |         0 | 0.
>  r              |         0 |         1 | 0.49850916663490161653
>
>
> The relevant columns for the tables are:
> postgres=# \d inputrequest
>                          Table "public.inputrequest"
>          Column           |            Type             | Collation | Nullable | Default
> --------------------------+-----------------------------+-----------+----------+---------
>  input_sequence           | bigint                      |           | not null |
>  msg_type                 | character varying(8)        |           | not null |
>  msg_content              | text                        |           | not null |
>  msg_reference            | character varying(35)       |           |          |
>  originalrequest_id       | bigint                      |           |          |
>  receive_time             | timestamp without time zone |           | not null |
>  related_output_sequence  | bigint                      |           |          |
>  msg_status               | character varying(15)       |           |          |
> Indexes:
>     "inputrequest_pkey" PRIMARY KEY, btree (input_sequence)
>     "inputrequest_originalrequest_id_idx" btree (originalrequest_id)
>
> postgres=# \d outputrequest
>                       Table "public.outputrequest"
>          Column          |         Type          | Collation | Nullable | Default
> -------------------------+-----------------------+-----------+----------+---------
>  output_sequence         | bigint                |           | not null |
>  input_sequence          | bigint                |           |          |
>  msg_type                | character varying(8)  |           |          |
>  msg_content             | text                  |           | not null |
>  msg_reference           | character varying(35) |           |          |
>  reply_input_sequence    | bigint                |           |          |
>  status                  | integer               |           | not null |
>  related_input_sequence  | bigint                |           |          |
> Indexes:
>     "outputrequest_pkey" PRIMARY KEY, btree (output_sequence)
>     "outputrequest_input_sequence_idx" btree (input_sequence)
>     "outputrequest_reply_input_sequence_idx" btree (reply_input_sequence)
>
>
I wonder if our choice of primary keys (input_sequence and output_sequence)
has something to do with the skew in the hash bucket distribution. We use
the following format: mmddxx , where  is more or less a
sequence and xx is the node generating the id, i.e. 01,02,etc (with only
one or two values in the dataset).

I wonder if it would be difficult to have an upper limit on the private
memory that can be allocated by one process (or all processes similar to
Oracle's pga_aggregate_limit). I would rather have one query failing with
an error message instead of postgres eating up all memory and swap on the
server.

Best regards,
Radu


Re: XACT_EVENT for 'commit prepared'

2024-06-07 Thread Tom Lane
Xiaoran Wang  writes:
> I found that in enum XactEvent, there is  'XACT_EVENT_PREPARE'  for
> 'prepare transaction', but there is no event for 'commit prepared' or
> 'rollback prepared'.

On the whole, it seems like a good idea to me that those commands
don't invoke event triggers.  It is a core principle of 2PC that
if 'prepare' succeeded, 'commit prepared' must not fail.  Invoking a
trigger during the second step would add failure cases and I'm not
sure what value it has.
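
That is, in the usual two-phase flow (transaction id and table are just
illustrative):

    BEGIN;
    INSERT INTO t VALUES (1);
    PREPARE TRANSACTION 'tx1';   -- phase one: this step is allowed to fail
    -- later, possibly from another session or after a restart:
    COMMIT PREPARED 'tx1';       -- phase two: must not fail once prepare succeeded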

regards, tom lane




Re: PgStat_KindInfo.named_on_disk not required in shared stats

2024-06-07 Thread Andres Freund
Hi,

On 2024-06-07 14:07:33 +0900, Michael Paquier wrote:
> While hacking on the area of pgstat_*.c, I have noticed the existence
> of named_on_disk in PgStat_KindInfo, that is here to track the fact
> that replication slots are a particular case in the PgStat_HashKey for
> the dshash table of the stats because this kind of stats requires a
> mapping between the replication slot name and the hash key.
> 
> As far as I can see, this field is not required and is used nowhere,
> because the code relies on the existence of the to_serialized_name and
> from_serialized_name callbacks to do the mapping.
> 
> Wouldn't it make sense to remove it?  This field is defined since
> 5891c7a8ed8f that introduced the shmem stats, and has never been used
> since.

Yes, makes sense. Looks like we changed direction during development a bunch of
times...


> This frees an extra bit in PgStat_KindInfo, which is going to help me
> a bit with what I'm doing with this area of the code while keeping the
> structure size the same.

Note it's just a single bit, not a full byte. So unless you need precisely 30
bits, rather than 29, I don't really see why it'd help? And I don't see a
reason to strictly keep the structure size the same.

Greetings,

Andres Freund




RE: AIX support

2024-06-07 Thread Srirama Kucherlapati
Hi Team, we are working on trimming the changes wrt AIX. As of now we have
trimmed the changes with respect to XLC, and with the trimmed changes the
buildfarm script passed (build and all the regression tests).
The XLC changes were trimmed only in the files below:
modified: configure
modified: configure.ac
We are looking further into the other file changes as well.

Warm regards,
Sriram.




Re: XACT_EVENT for 'commit prepared'

2024-06-07 Thread Andres Freund
Hi,

On 2024-06-07 11:19:40 -0400, Tom Lane wrote:
> Xiaoran Wang  writes:
> > I found that in enum XactEvent, there is  'XACT_EVENT_PREPARE'  for
> > 'prepare transaction', but there is no event for 'commit prepared' or
> > 'rollback prepared'.
> 
> On the whole, it seems like a good idea to me that those commands
> don't invoke event triggers.  It is a core principle of 2PC that
> if 'prepare' succeeded, 'commit prepared' must not fail.  Invoking a
> trigger during the second step would add failure cases and I'm not
> sure what value it has.

Event triggers? Isn't this about RegisterXactCallback?

XACT_EVENT_COMMIT is called after the commit record has been flushed and the
procarray has been modified. Thus a failure in the hook has somewhat limited
consequences. I'd assume XACT_EVENT_COMMIT_PREPARED would do something
similar.

I suspect the reason we don't invoke callbacks for 2pc commit/rollback prepared is
simple: The code for doing a 2pc commit prepared lives in twophase.c, not
xact.c...

Greetings,

Andres Freund




Re: Postgresql OOM

2024-06-07 Thread Andres Freund
Hi,

On 2024-06-06 15:25:25 +0300, Radu Radutiu wrote:
> I have a query that forces an out of memory error, where the OS will kill
> the postgresql process.

FWIW, it can be useful to configure the OS with strict memory overcommit. That
causes postgres to fail more gracefully, because the OOM killer won't be
invoked.


> The query plan (run immediately after a vacuum analyze) is at
> https://explain.depesz.com/s/ITQI#html .

Can you get EXPLAIN (ANALYZE, BUFFERS) to complete if you reduce the number of
workers? It'd be useful to get some of the information about the actual
numbers of tuples etc.
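
For instance, something like this in the session running the query (the
setting value is only an illustration):

    SET max_parallel_workers_per_gather = 0;   -- or 1, to reduce the worker count
    EXPLAIN (ANALYZE, BUFFERS) <the problematic query>;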


Greetings,

Andres Freund




Re: Postgresql OOM

2024-06-07 Thread Andres Freund
Hi,

On 2024-06-06 13:58:24 +0100, Pantelis Theodosiou wrote:
> I am not qualified to answer on the OOM issue but why are you joining the
> same table (outputrequest) 4 times (using an identical join condition)?

The conditions aren't actually the same
rpl_rec_tro.  input_sequence = r.input_sequence
rpl_snd_tro.reply_input_sequence = r.input_sequence
snd_tro.reply_input_sequence = t.input_sequence

First two are r.input_sequence to different columns, the third one also uses
reply_input_sequence but joins to t, not r.

Greetings,

Andres Freund




Re: Re: Add support to TLS 1.3 cipher suites and curves lists

2024-06-07 Thread Jacob Champion
On Fri, Jun 7, 2024 at 3:02 AM Erica Zhang  wrote:
>
> For security considerations, we prefer to use TLS 1.3 cipher suites in our
> product with some customized values instead of the default value
> "HIGH:MEDIUM:+3DES:!aNULL". Moreover, we prefer to set a group of ECDH keys
> instead of a single value.

+1 for the curve list feature, at least. No opinions on the 1.3
ciphersuites half, yet.

I've added this patch to my planned review for the v18 cycle. Some
initial notes:

- Could you separate the two features into two patches? That would
make it easier for reviewers. (They can still share the same thread
and CF entry.)
- The "curve" APIs have been renamed "group" in newer OpenSSLs for a
while now, and we should probably use those if possible.
- I think parsing apart the groups list to check NIDs manually could
lead to false negatives. From a docs skim, 3.0 allows providers to add
their own group names, and 3.3 now supports question marks in the
string to allow graceful fallbacks.
- I originally thought it'd be better to just stop calling
SSL_set_tmp_ecdh() entirely by default, so we could use OpenSSL's
builtin list of groups. But that may have denial-of-service concerns
[1]?
- We should maybe look into SSL_CTX_config(), if we haven't discussed
that already on the list, but that's probably a bigger tangent and
doesn't need to be part of this patch.

Thanks,
--Jacob

[1] 
https://www.openssl.org/blog/blog/2022/10/21/tls-groups-configuration/index.html




WIP: parallel GiST index builds

2024-06-07 Thread Tomas Vondra
Hi,

After looking into parallel builds for BRIN and GIN indexes, I was
wondering if there's a way to do parallel builds for GiST too. I knew
next to nothing about how GiST works, but I gave it a shot and here's
what I have - the attached patch allows parallel GiST builds for the
"unsorted" case (i.e. when the opclass does not include sortsupport),
and does not support buffered builds.
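
For context, a parallel build is requested the same way as for the other
index AMs - e.g. something like this (table and column names are made up,
and the opclass is assumed to have no sortsupport):

    SET max_parallel_maintenance_workers = 4;
    CREATE INDEX ON points_tbl USING gist (geom);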


unsorted builds only


Addressing only the unsorted case may seem a bit weird, but I did it
this way for two reasons - parallel sort is a solved problem, and adding
this to the patch seems quite straightforward. It's what btree does, for
example. But I also was not very sure how common this is - we do have
sort for points, but I have no idea if the PostGIS indexes define
sorting etc. My guess was "no" but I've been told that's no longer true,
so I guess sorted builds are more widely applicable than I thought.

In any case, I'm not in a rush to parallelize sorted builds. It can be
added later, as an improvement, IMHO. In fact, it's a well isolated part
of the patch, which might make it a good choice for someone looking for
an idea for their first patch ...


buffered builds
---

The lack of support for buffered builds is a very different thing. The
basic idea is that we don't push the index entries all the way to the
leaf pages right away, but accumulate them in buffers half-way through.
This combines writes and reduces random I/O, which is nice.

Unfortunately, the way it's implemented does not work with parallel
builds at all - all the state is in private memory, and it assumes the
worker is the only possible backend that can split the page (at which
point the buffers need to be split too, etc.). But for parallel builds
this is obviously not true.

I'm not saying parallel builds can't do similar buffering, but it
requires moving the buffers into shared memory, and introducing locking
to coordinate accesses to the buffers. (Or perhaps it might be enough to
only "notify" the workers about page splits, with buffers still in
private memory?). Anyway, it seems far too complicated for v1.

In fact, I'm not sure the buffering is entirely necessary - maybe the
increase in amount of RAM makes this less of an issue? If the index can
fit into shared buffers (or at least page cache), maybe the amount of
extra I/O is not that bad? I'm sure there may be cases really affected
by this, but maybe it's OK to tell people to disable parallel builds in
those cases?


gistGetFakeLSN
--

One more thing - GiST disables WAL-logging during the build, and only
logs it once at the end. For serial builds this is fine, because there
are no concurrent splits, and so we don't need to rely on page LSNs to
detect these cases (in fact, it uses a bogus value).

But for parallel builds this would not work - we need page LSNs that
actually change, otherwise we'd miss page splits, and the index build
would either fail or produce a broken index. But the existing is_build
flag affects both things, so I had to introduce a new "is_parallel" flag
which only affects the page LSN part, using the gistGetFakeLSN()
function, previously used only for unlogged indexes.

This means we'll produce WAL during the index build (because
gistGetFakeLSN() writes a trivial message into WAL). Compared to the
serial builds this produces maybe 25-75% more WAL, but it's an order of
magnitude less than with "full" WAL logging (is_build=false).

For example, serial build of 5GB index needs ~5GB of WAL. A parallel
build may need ~7GB, while a parallel build with "full" logging would
use 50GB. I think this is a reasonable trade off.

There's one "strange" thing, though - the amount of WAL decreases with
the number of parallel workers. Consider for example an index on a
numeric field, where the index is ~9GB, but the amount of WAL changes
like this (0 workers means serial builds):

  parallel workers     0     1     3     5     7
  WAL (GB)           5.7   9.2   7.6   7.0   6.8

The explanation for this is fairly simple (AFAIK) - gistGetFakeLSN
determines if it needs to actually assign a new LSN (and write stuff to
WAL) by comparing the last LSN assigned (in a given worker) to the
current insert LSN. But the current insert LSN might have been updated
by some other worker, in which case we simply use that. Which means that
multiple workers may use the same fake LSN, and the likelihood increases
with the number of workers - and this is consistent with the observed
behavior of the WAL decreasing as the number of workers increases
(because more workers use the same LSN).

I'm not entirely sure if this is OK or a problem. I was worried two
workers might end up using the same LSN for the same page, leading to
other workers not noticing the split. But after a week of pretty
intensive stress testing, I haven't seen a single such failure ...

If this turns out to be a problem, the fix is IMHO quite simple - it
should be enough to force gistGetFak

Re: Optimizing COPY with SIMD

2024-06-07 Thread Neil Conway
On Wed, Jun 5, 2024 at 3:05 PM Nathan Bossart 
wrote:

> For pg_lfind32(), we ended up using an overlapping approach for the
> vectorized case (see commit 7644a73).  That appeared to help more than it
> harmed in the many (admittedly branch predictor friendly) tests I ran.  I
> wonder if you could do something similar here.
>

I didn't entirely follow what you are suggesting here -- seems like we
would need to do strlen() for the non-SIMD case if we tried to use a
similar approach.

It'd be interesting to see the threshold where your patch starts winning.
> IIUC the vector stuff won't take effect until there are 16 bytes to
> process.  If we don't expect attributes to ordinarily be >= 16 bytes, it
> might be worth trying to mitigate this ~3% regression.  Maybe we can find
> some other small gains elsewhere to offset it.
>

For the particular short-strings benchmark I have been using (3 columns
with 8-character ASCII strings in each), I suspect the regression is caused
by the need to do a strlen(), rather than the vectorized loop itself (we
skip the vectorized loop anyway because sizeof(Vector8) == 16 on this
machine). (This explains why we see a regression on short strings for text
but not CSV: CSV needed to do a strlen() for the non-quoted-string case
regardless). Unfortunately this makes it tricky to make the optimization
conditional on the length of the string. I suppose we could play some games
where we start with a byte-by-byte loop and then switch over to the
vectorized path (and take a strlen()) if we have seen more than, say,
sizeof(Vector8) bytes so far. Seems a bit kludgy though.
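
For reference, the short-strings output benchmark described above is roughly
of this shape (row count and file path are made up):

    CREATE TABLE short_strings AS
      SELECT repeat('x', 8) AS a, repeat('y', 8) AS b, repeat('z', 8) AS c
      FROM generate_series(1, 10000000);
    \timing on
    COPY short_strings TO '/tmp/short_strings.txt' (FORMAT text);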

I will do some more benchmarking and report back. For the time being, I'm
not inclined to push to get the CopyAttributeOutTextVector() into the tree
in its current state, as I agree that the short-attribute case is quite
important.

In the meantime, attached is a revised patch series. This uses SIMD to
optimize CopyReadLineText in COPY FROM. Performance results:


master @ 8fea1bd5411b:

Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-long-strings.sql
  Time (mean ± σ):  1.944 s ±  0.013 s[User: 0.001 s, System: 0.000
s]
  Range (min … max):1.927 s …  1.975 s10 runs

Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-short-strings.sql
  Time (mean ± σ):  1.021 s ±  0.017 s[User: 0.002 s, System: 0.001
s]
  Range (min … max):1.005 s …  1.053 s10 runs

master + SIMD patches:

Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-long-strings.sql
  Time (mean ± σ):  1.513 s ±  0.022 s[User: 0.001 s, System: 0.000
s]
  Range (min … max):1.493 s …  1.552 s10 runs

Benchmark 1: ./psql -f /Users/neilconway/copy-from-large-short-strings.sql
  Time (mean ± σ):  1.032 s ±  0.032 s[User: 0.002 s, System: 0.001
s]
  Range (min … max):1.009 s …  1.113 s10 runs


Neil


v4-0005-Optimize-COPY-TO-in-text-format-using-SIMD.patch
Description: Binary data


v4-0003-Cosmetic-code-cleanup-for-CopyReadLineText.patch
Description: Binary data


v4-0004-Optimize-COPY-TO-in-CSV-format-using-SIMD.patch
Description: Binary data


v4-0002-Improve-COPY-test-coverage-for-handling-of-contro.patch
Description: Binary data


v4-0001-Adjust-misleading-comment-placement.patch
Description: Binary data


v4-0006-Optimize-COPY-FROM-using-SIMD.patch
Description: Binary data


Re: Postgresql OOM

2024-06-07 Thread Radu Radutiu
On Fri, Jun 7, 2024 at 7:59 PM Andres Freund  wrote:

> Hi,
>
> On 2024-06-06 15:25:25 +0300, Radu Radutiu wrote:
> > I have a query that forces an out of memory error, where the OS will kill
> > the postgresql process.
>
> FWIW, it can be useful to configure the OS with strict memory overcommit.
> That
> causes postgres to fail more gracefully, because the OOM killer won't be
> invoked.
>
>
> > The query plan (run immediately after a vacuum analyze) is at
> > https://explain.depesz.com/s/ITQI#html .
>
> Can you get EXPLAIN (ANALYZE, BUFFERS) to complete if you reduce the
> number of
> workers? It'd be useful to get some of the information about the actual
> numbers of tuples etc.
>
>
> Hi,
I first tried giving more memory to the OS and mounting a tmpfs
in pgsql_tmp. It didn't work; I got
ERROR:  invalid DSA memory alloc request size 1140850688
CONTEXT:  parallel worker
I saw around 2 million temporary files created before the crash.
With work_mem at 100MB I was not able to get it to work with 2 parallel
workers.
Next, I increased work_mem to 200MB, and now (with extra memory and
tmpfs) it finished: https://explain.depesz.com/s/NnRC

Radu


Re: Sort functions with specialized comparators

2024-06-07 Thread Stepan Neretin
Hello all.

I am interested in the proposed patch and would like to propose some
additional changes that would complement it. My changes would introduce
similar optimizations when working with a list of integers or object
identifiers. Additionally, my patch includes an extension for benchmarking,
which shows an average speedup of 30-40%.

postgres=# SELECT bench_oid_sort(100);
 bench_oid_sort


 Time taken by list_sort: 116990848 ns, Time taken by list_oid_sort:
80446640 ns, Percentage difference: 31.24%
(1 row)

postgres=# SELECT bench_int_sort(100);
 bench_int_sort


 Time taken by list_sort: 118168506 ns, Time taken by list_int_sort:
80523373 ns, Percentage difference: 31.86%
(1 row)

What do you think about these changes?

Best regards, Stepan Neretin.

On Fri, Jun 7, 2024 at 11:08 PM Andrey M. Borodin 
wrote:

> Hi!
>
> In a thread about sorting comparators[0] Andres noted that we have
> infrastructure to help compiler optimize sorting. PFA attached PoC
> implementation. I've checked that it indeed works on the benchmark from
> that thread.
>
> postgres=# CREATE TABLE arrays_to_sort AS
>SELECT array_shuffle(a) arr
>FROM
>(SELECT ARRAY(SELECT generate_series(1, 100)) a),
>generate_series(1, 10);
>
> postgres=# SELECT (sort(arr))[1] FROM arrays_to_sort; -- original
> Time: 990.199 ms
> postgres=# SELECT (sort(arr))[1] FROM arrays_to_sort; -- patched
> Time: 696.156 ms
>
> The benefit seems to be on the order of magnitude with 30% speedup.
>
> There's plenty of sorting by TransactionId, BlockNumber, OffsetNumber, Oid
> etc. But this sorting routines never show up in perf top or something like
> that.
>
> Seems like in most cases we do not spend much time in sorting. But
> specialization does not cost us much too, only some CPU cycles of a
> compiler. I think we can further improve speedup by converting inline
> comparator to value extractor: more compilers will see what is actually
> going on. But I have no proofs for this reasoning.
>
> What do you think?
>
>
> Best regards, Andrey Borodin.
>
> [0]
> https://www.postgresql.org/message-id/flat/20240209184014.sobshkcsfjix6u4r%40awork3.anarazel.de#fc23df2cf314bef35095b632380b4a59
>
From 74bad4bbcff9ea4a9a68f91618c84854dab24701 Mon Sep 17 00:00:00 2001
From: Stepan Neretin 
Date: Sat, 8 Jun 2024 01:29:42 +0700
Subject: [PATCH v42 6/6] Implemented benchmarking for optimized sorting

This commit adds benchmarking functions to compare the performance of two list sorting operations: bench_int_sort and bench_oid_sort. These functions measure the execution time of sorting lists of integers and OIDs, respectively, using different algorithms (list_sort and custom sorting functions). Random lists of specified sizes are generated, sorted using both methods, and their execution times are recorded. The percentage difference in execution time between the two methods is also calculated. This commit aims to provide insights into the efficiency of the sorting algorithms used.
---
 contrib/Makefile  |   1 +
 contrib/bench_sort_improvements/Makefile  |  20 
 contrib/bench_sort_improvements/bench.c   | 105 ++
 .../bench_sort_improvements--1.0.sql  |   3 +
 .../bench_sort_improvements.control   |   5 +
 5 files changed, 134 insertions(+)
 create mode 100644 contrib/bench_sort_improvements/Makefile
 create mode 100644 contrib/bench_sort_improvements/bench.c
 create mode 100644 contrib/bench_sort_improvements/bench_sort_improvements--1.0.sql
 create mode 100644 contrib/bench_sort_improvements/bench_sort_improvements.control

diff --git a/contrib/Makefile b/contrib/Makefile
index abd780f277..a1ee9defc2 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -10,6 +10,7 @@ SUBDIRS = \
 		auto_explain	\
 		basic_archive	\
 		basebackup_to_shell	\
+		bench_sort_improvements \
 		bloom		\
 		btree_gin	\
 		btree_gist	\
diff --git a/contrib/bench_sort_improvements/Makefile b/contrib/bench_sort_improvements/Makefile
new file mode 100644
index 00..46458ee76c
--- /dev/null
+++ b/contrib/bench_sort_improvements/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = bench_sort_improvements
+
+OBJS = \
+	$(WIN32RES) \
+	bench.o
+
+EXTENSION = bench_sort_improvements
+
+DATA = bench_sort_improvements--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_sort_improvements
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_sort_improvements/bench.c b/contrib/bench_sort_improvements/bench.c
new file mode 

Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-06-07 Thread Robert Haas
On Thu, Jun 6, 2024 at 3:27 PM Jelte Fennema-Nio  wrote:
> Of course there's always the possibility to review more. But I don't
> really agree with this summary of my review activity.

Nonetheless, I need to take a break from this to work on some of my
own stuff. I'll circle back around to it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: altering a column's collation leaves an invalid foreign key

2024-06-07 Thread Tom Lane
jian he  writes:
>> * in TryReuseForeignKey, we can pass the information that our primary
>> key old collation is nondeterministic
>> and old collation != new collation to the foreign key constraint.

I have a basic question about this: why are we allowing FKs to be
based on nondeterministic collations at all?  ISTM that that breaks
the assumption that there is exactly one referenced row for any
referencing row.

regards, tom lane




Re: PgStat_KindInfo.named_on_disk not required in shared stats

2024-06-07 Thread Michael Paquier
On Fri, Jun 07, 2024 at 08:30:06AM -0700, Andres Freund wrote:
> Yes, makes sense. Looks like we changed direction during development a bunch of
> times...

Thanks for looking, Andres!  I guess I'll just apply that once v18
opens up.
--
Michael


signature.asc
Description: PGP signature


Re: Assert in heapgettup_pagemode() fails due to underlying buffer change

2024-06-07 Thread Thomas Munro
On Sat, Jun 8, 2024 at 12:47 AM Robert Haas  wrote:
>
> On Fri, Jun 7, 2024 at 4:05 AM Alvaro Herrera  wrote:
> > >  static void
> > > -ZeroBuffer(Buffer buffer, ReadBufferMode mode)
> > > +ZeroBuffer(Buffer buffer, ReadBufferMode mode, bool zero)
> >
> > This change makes the API very strange.  Should the function be called
> > ZeroAndLockBuffer() instead?  Then the addition of a "bool zero"
> > argument makes a lot more sense.
>
> I agree that's better, but it still looks a bit weird. You have to
> realize that 'bool zero' means 'is already zeroed' here -- or at
> least, I guess that's the intention. But then I wonder why you'd call
> a function called ZeroAndLockBuffer if all you need to do is
> LockBuffer.

The name weirdness comes directly from RBM_ZERO_AND_LOCK (the fact
that it doesn't always zero despite shouting ZERO is probably what
temporarily confused me).  But coming up with a better name is hard
and I certainly don't propose to change it now.  I think it's
reasonable for this internal helper function to have that matching
name as Alvaro suggested, with a good comment about that.

Even though that quick-demonstration change fixed the two reported
repros, I think it is still probably racy (or if it isn't, it relies
on higher level interlocking that I don't want to rely on).  This case
really should be using the standard StartBufferIO/TerminateBufferIO
infrastructure as it was before.  I had moved that around to deal with
multi-block I/O, but dropped the ball on the zero case... sorry.

Here's a version like that.  The "zero" argument (yeah that was not a
good name) is now inverted and called "already_valid", but it's only a
sort of optimisation for the case where we already know for sure that
it's valid.  If it isn't, we do the standard
BM_IO_IN_PROGRESS/BM_VALID/CV dance, for correct interaction with any
concurrent read or zero operation.


v2-0001-Fix-RBM_ZERO_AND_LOCK.patch
Description: Binary data


Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-06-07 Thread Alexander Lakhin

Hello Bertrand and Michael,

23.01.2024 11:07, Bertrand Drouvot wrote:

On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:


Anyway, that's not the end of it.  What should we do for snapshot
records coming from the bgwriter?

What about?

3) depending on how stabilized this test (and others that suffer from "random"
xl_running_xacts) is, then think about the bgwriter.


A recent buildfarm failure [1] reminds me of that remaining question.
Here we have a slow machine (a successful run, for example [2], shows
541.13s duration of the test) and the following information logged:

[13:55:13.725](34.411s) ok 25 - inactiveslot slot invalidation is logged with 
vacuum on pg_class
[13:55:13.727](0.002s) not ok 26 - activeslot slot invalidation is logged with 
vacuum on pg_class
[13:55:13.728](0.001s) #   Failed test 'activeslot slot invalidation is logged 
with vacuum on pg_class'
#   at 
C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl 
line 229.
[14:27:42.995](1949.267s) # poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where 
datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
[14:27:42.999](0.004s) not ok 27 - confl_active_logicalslot updated
[14:27:43.000](0.001s) #   Failed test 'confl_active_logicalslot updated'
#   at 
C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl 
line 235.
Timed out waiting confl_active_logicalslot to be updated at 
C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.


---
035_standby_logical_decoding_standby.log:
2024-06-06 13:55:07.715 UTC [9172:7] LOG:  invalidating obsolete replication slot 
"row_removal_inactiveslot"
2024-06-06 13:55:07.715 UTC [9172:8] DETAIL:  The slot conflicted with xid 
horizon 754.
2024-06-06 13:55:07.715 UTC [9172:9] CONTEXT:  WAL redo at 0/4020A80 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 
754, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 1, nunused: 0, dead: [48]; blkref #0: rel 1663/16384/2610, blk 0

2024-06-06 13:55:14.372 UTC [7532:1] [unknown] LOG:  connection received: 
host=127.0.0.1 port=55328
2024-06-06 13:55:14.381 UTC [7532:2] [unknown] LOG:  connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" 
method=sspi 
(C:/prog/bf/root/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_hba.conf:2)
2024-06-06 13:55:14.381 UTC [7532:3] [unknown] LOG:  connection authorized: user=pgrunner database=postgres 
application_name=035_standby_logical_decoding.pl
2024-06-06 13:55:14.443 UTC [7532:4] 035_standby_logical_decoding.pl LOG:  statement: select (confl_active_logicalslot = 
1) from pg_stat_database_conflicts where datname = 'testdb'
2024-06-06 13:55:14.452 UTC [7532:5] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.090 
user=pgrunner database=postgres host=127.0.0.1 port=55328

# (there is no `invalidating obsolete replication slot 
"row_removal_activeslot"` message)
...
2024-06-06 14:27:42.675 UTC [4032:4] 035_standby_logical_decoding.pl LOG:  statement: select (confl_active_logicalslot = 
1) from pg_stat_database_conflicts where datname = 'testdb'
2024-06-06 14:27:42.681 UTC [4032:5] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.080 
user=pgrunner database=postgres host=127.0.0.1 port=58713
2024-06-06 14:27:43.095 UTC [7892:2] FATAL:  could not receive data from WAL stream: server closed the connection 
unexpectedly

    This probably means the server terminated abnormally
    before or while processing the request.

It's hard to determine from this info why row_removal_activeslot was not
invalidated, but running this test on a slowed-down Windows VM, I (still)
get the same-looking failures, caused by a RUNNING_XACTS record appearing just
before `invalidating obsolete replication slot "row_removal_inactiveslot"`.
So I would consider this failure yet another result of bgwriter activity
and add it to the list of known failures as such...

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-06-06%2012%3A36%3A11
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2024-06-05%2017%3A03%3A13&stg=misc-check

Best regards,
Alexander




Re: altering a column's collation leaves an invalid foreign key

2024-06-07 Thread jian he
On Sat, Jun 8, 2024 at 4:12 AM Tom Lane  wrote:
>
> jian he  writes:
> >> * in TryReuseForeignKey, we can pass the information that our primary
> >> key old collation is nondeterministic
> >> and old collation != new collation to the foreign key constraint.
>
> I have a basic question about this: why are we allowing FKs to be
> based on nondeterministic collations at all?  ISTM that that breaks
> the assumption that there is exactly one referenced row for any
> referencing row.
>

for FKs based on nondeterministic collations,
I think that would require the PRIMARY KEY collation to be
nondeterministic as well.

for example:
CREATE COLLATION ignore_accent_case (provider = icu, deterministic =
false, locale = 'und-u-ks-level1');
DROP TABLE IF EXISTS fktable, pktable;
CREATE TABLE pktable (x text COLLATE ignore_accent_case PRIMARY KEY);
CREATE TABLE fktable (x text REFERENCES pktable on update cascade on
delete cascade);
INSERT INTO pktable VALUES ('A');
INSERT INTO fktable VALUES ('a');
INSERT INTO fktable VALUES ('A');
update pktable set x  = 'Å';
table fktable;



if the FK is nondeterministic, then the PK starts to look more like an FK. In
the following example, one FK row is referenced by two PK rows.

DROP TABLE IF EXISTS fktable, pktable;
CREATE TABLE pktable (x text COLLATE "C" PRIMARY KEY);
CREATE TABLE fktable (x text COLLATE ignore_accent_case REFERENCES
pktable on update cascade on delete cascade);
INSERT INTO pktable VALUES ('A'), ('Å');
INSERT INTO fktable VALUES ('A');

begin; delete from pktable where x = 'Å'; TABLE fktable; rollback;
begin; delete from pktable where x = 'A'; TABLE fktable; rollback;




New function normal_rand_array function to contrib/tablefunc.

2024-06-07 Thread Andy Fan

Here is a new function which can produce arrays of numbers with a
controllable array length and duplicated elements in those arrays. I
used it when working with GIN indexes, and I think it would be helpful for
others as well, so here it is.

select * from normal_rand_array(5, 10, 1.8::numeric, 3.5::numeric);
   normal_rand_array   
---
 {3.3,2.3,2.7,3.2,2.0,2.7,3.4,2.7,2.3,2.9}
 {3.3,1.8,2.9,3.4,2.0,1.8,2.0,3.5,2.8,2.5}
 {2.1,1.9,2.3,1.9,2.5,2.7,2.4,2.9,1.8}
 {2.3,2.5,2.4,2.7,2.7,2.3,2.9,3.3,3.3,1.9,3.5}
 {2.8,3.4,2.7,1.8,3.3,2.3,2.2,3.5,2.6,2.5}
(5 rows)

select * from normal_rand_array(5, 10, 1.8::int4, 3.5::int4);
  normal_rand_array  
-
 {3,2,2,3,4,2}
 {2,4,2,3,3,3,3,2,2,3,3,2,3,2}
 {2,4,3}
 {4,2,3,4,2,4,2,2,3,4,3,3,2,4,4,2,3}
 {4,3,3,4,3,3,4,2,4}
(5 rows)

The 5 means it needs to produce 5 rows in total, the 10 is the
average array length, 1.8 is the minimum value for the random function,
and 3.5 is the maximum value.
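
For example, the sort of GIN test data set mentioned above can be built
like this (table name, index name and sizes are made up):

    CREATE TABLE gin_test AS
      SELECT i AS id, arr
      FROM normal_rand_array(100000, 10, 1::int4, 1000::int4)
           WITH ORDINALITY AS t(arr, i);
    CREATE INDEX gin_test_arr_idx ON gin_test USING gin (arr);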

-- 
Best Regards
Andy Fan

>From 397dcaf67f29057b80aebbb6116b49ac8344547c Mon Sep 17 00:00:00 2001
From: Andy Fan 
Date: Sat, 8 Jun 2024 13:21:08 +0800
Subject: [PATCH v20240608 1/1] Add function normal_rand_array function to
 contrib/tablefunc.

It can produce an array of numbers with a controllable array length and
duplicated elements in these arrays.
---
 contrib/tablefunc/Makefile|   2 +-
 contrib/tablefunc/expected/tablefunc.out  |  26 
 contrib/tablefunc/sql/tablefunc.sql   |  10 ++
 contrib/tablefunc/tablefunc--1.0--1.1.sql |   7 ++
 contrib/tablefunc/tablefunc.c | 140 ++
 contrib/tablefunc/tablefunc.control   |   2 +-
 doc/src/sgml/tablefunc.sgml   |  10 ++
 src/backend/utils/adt/arrayfuncs.c|   7 ++
 8 files changed, 202 insertions(+), 2 deletions(-)
 create mode 100644 contrib/tablefunc/tablefunc--1.0--1.1.sql

diff --git a/contrib/tablefunc/Makefile b/contrib/tablefunc/Makefile
index 191a3a1d38..f0c67308fd 100644
--- a/contrib/tablefunc/Makefile
+++ b/contrib/tablefunc/Makefile
@@ -3,7 +3,7 @@
 MODULES = tablefunc
 
 EXTENSION = tablefunc
-DATA = tablefunc--1.0.sql
+DATA = tablefunc--1.0.sql tablefunc--1.0--1.1.sql
 PGFILEDESC = "tablefunc - various functions that return tables"
 
 REGRESS = tablefunc
diff --git a/contrib/tablefunc/expected/tablefunc.out b/contrib/tablefunc/expected/tablefunc.out
index ddece79029..9f0cbbfbbe 100644
--- a/contrib/tablefunc/expected/tablefunc.out
+++ b/contrib/tablefunc/expected/tablefunc.out
@@ -12,6 +12,32 @@ SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
 -- negative number of tuples
 SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
 ERROR:  number of rows cannot be negative
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::numeric, 8::numeric) as i;
+ count |avg 
+---+
+10 | 3.
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int4, 8::int4) as i;
+ count |avg 
+---+
+10 | 3.
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int8, 8::int8) as i;
+ count |avg 
+---+
+10 | 3.
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::float8, 8::float8) as i;
+ count |avg 
+---+
+10 | 3.
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 'abc'::text, 'def'::text) as i;
+ERROR:  unsupported type 25 in normal_rand_array.
 --
 -- crosstab()
 --
diff --git a/contrib/tablefunc/sql/tablefunc.sql b/contrib/tablefunc/sql/tablefunc.sql
index 0fb8e40de2..dec57cfc66 100644
--- a/contrib/tablefunc/sql/tablefunc.sql
+++ b/contrib/tablefunc/sql/tablefunc.sql
@@ -8,6 +8,16 @@ SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
 -- negative number of tuples
 SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
 
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::numeric, 8::numeric) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int4, 8::int4) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int8, 8::int8) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::float8, 8::float8) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 'abc'::text, 'def'::text) as i;
+
 --
 -- crosstab()
 --
diff --git a/contrib/tablefunc/tablefunc--1.0--1.1.sql b/contrib/tablefunc/tablefunc--1.0--1.1.sql
new file m