Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-07-14 Thread Peter Eisentraut

On 12.07.24 21:42, Daniel Gustafsson wrote:

On 11 Jul 2024, at 23:22, Peter Eisentraut  wrote:



The 0001 patch removes the functions pgtls_init_library() and pgtls_init() but 
keeps the declarations in libpq-int.h.  This should be fixed.


Ah, nice catch. Done in the attached rebase.


This patch version looks good to me.

Small comments on the commit message of 0002: Typo "checkig".  Also, 
maybe the commit message title can be qualified a little, since we're 
not really doing "Remove pg_strong_random initialization." but something 
like "Remove unnecessary ..."?






Re: Cannot find a working 64-bit integer type on Illumos

2024-07-14 Thread Peter Eisentraut

On 04.07.24 03:55, Thomas Munro wrote:

Unfortunately, that theory turned out to be wrong.  The usual suspect,
Windows, uses something else: "I64" or something like that.  We could
teach our snprintf to grok that, but I don't like the idea anymore.
So let's go back to INT64_MODIFIER, with just a small amount of
configure time work to pick the right value.  I couldn't figure out
any header-only way to do that.


src/port/snprintf.c used to support I64 in the past, but it was later 
removed.  We could probably put it back.



Personally, I find "PRId64" pretty unreadable. "INT64_MODIFIER" wasn't
nice either, though, and following standards is good, so I'm sure I'll
get used to it.


Using PRId64 would be very beneficial because gettext understands it, 
and so we would no longer need the various workarounds for not putting 
INT64_FORMAT into the middle of a translated string.


But this could be a separate patch.  What you have works for now.

Here are some other comments on this patch set:

* v3-0001-Use-int64_t-support-from-stdint.h.patch

- src/include/c.h:

Maybe add a comment above all the int8_t -> int8 etc. typedefs
saying that these are for backward compatibility or something like that.
Actually, just move the comment

+/* Historical names for limits in stdint.h. */

up a bit so it covers the types as well.

Also, these /* == 8 bits */ comments could be removed, I think.

- src/include/port/pg_bitutils.h:
- src/port/pg_bitutils.c:
- src/port/snprintf.c:

These changes look functionally correct, but I think I like the old
code layout better, like

#if (using long)
...
#elif (using long long)
...
#else
#error
#endif

rather than

#if (using long)
...
#else
static assertion
... // long long
#endif

which seems a bit more complicated.  I think you could leave the code
mostly alone and just change

defined(HAVE_LONG_INT_64) to SIZEOF_LONG == 8
defined(HAVE_LONG_LONG_INT_64) to SIZEOF_LONG_LONG == 8

in each case.

- src/include/postgres_ext.h:

-#define OID_MAX  UINT_MAX
-/* you will need to include <limits.h> to use the above #define */
+#define OID_MAX  UINT32_MAX

If the type Oid is unsigned int, then OID_MAX should be UINT_MAX.
So this should not be changed.  Also, is the comment about <limits.h>
no longer applicable?


- src/interfaces/ecpg/ecpglib/typename.c:
- src/interfaces/ecpg/include/sqltypes.h:
- .../test/expected/compat_informix-sqlda.c:

-#ifdef HAVE_LONG_LONG_INT_64
+#if SIZEOF_LONG < 8

These changes alter the library behavior unnecessarily.  The old code
would always prefer to report back long long (ECPGt_long_long etc.),
but the new code will report back long (ECPGt_long etc.) if it is
64-bit.  I don't know the impact of these changes, but it seems
preferable to keep the existing behavior.

- src/interfaces/ecpg/include/ecpg_config.h.in:
- src/interfaces/ecpg/include/meson.build:

In the past, we have kept obsolete symbols as always defined in
ecpg_config.h.  We ought to do the same here.


* v3-0002-Remove-traces-of-BeOS.patch

This looks ok.  This could also be committed before 0001.


* v3-0003-Allow-tzcode-to-use-stdint.h-and-inttypes.h.patch

- src/timezone/localtime.c:

Addition of #include  is unnecessary, since it's already
included in c.h, and it's also not in the upstream code.

This looks like a typo:

-* UTC months are at least 28 days long 
(minus 1 second for a
+* UTC months are at least 2 days long 
(minus 1 second for a


-getsecs(const char *strp, int32 *const secsp)
+getsecs(const char *strp, int_fast32_t * const secsp)

Need to add int_fast32_t (and maybe the other types) to typedefs.list?

- src/timezone/zic.c:

+#include 
+#include 

We don't need both of these.  Also, this is not in the upstream code
AFAICT.





Re: Cannot find a working 64-bit integer type on Illumos

2024-07-14 Thread Tom Lane
Peter Eisentraut  writes:
> On 04.07.24 03:55, Thomas Munro wrote:
>>> Personally, I find "PRId64" pretty unreadable. "INT64_MODIFIER" wasn't
>>> nice either, though, and following standards is good, so I'm sure I'll
>>> get used to it.

> Using PRId64 would be very beneficial because gettext understands it, 
> and so we would no longer need the various workarounds for not putting 
> INT64_FORMAT into the middle of a translated string.

Uh, really?  The translated strings live in /usr/share, which is
expected to be architecture-independent, so how would they make
that work?

regards, tom lane




Re: Internal error codes triggered by tests

2024-07-14 Thread Alexander Lakhin

Hello Daniel and Michael,

12.07.2024 23:41, Daniel Gustafsson wrote:

On 10 Jul 2024, at 06:42, Michael Paquier  wrote:

The rest mentioned upthread seems either not worth the effort or likely to
be bugs warranting proper investigation.



I've filed a bug report about the "WITH RECURSIVE" anomaly [1], but what
I wanted to understand when presenting the different error kinds is: what
definition could XX000 errors have in principle?

It seems to me that we can't define them as indicators of unexpected (from
the server's POV) conditions, similar to assertion failures (but produced
with no asserts enabled), which should be and mostly get fixed.

If the next thing to do is to bring backtrace_on_internal_error back and
that GUC is mainly intended for developers, then maybe having clean
regression test logs (or ones containing only expected backtraces) is the
final goal and we should stop here.  But if it's expected that that GUC
could help users analyze such errors in production and thus pay extra
attention to them, maybe reserving the XX000 status for presumably
unreachable conditions only is desirable...

[1] 
https://www.postgresql.org/message-id/18536-0a342ec07901203e%40postgresql.org

Best regards,
Alexander




Re: race condition in pg_class

2024-07-14 Thread Noah Misch
On Thu, Jul 04, 2024 at 03:08:16PM -0700, Noah Misch wrote:
> On Thu, Jul 04, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote:
> > 28.06.2024 08:13, Noah Misch wrote:
> > > Pushed. ...
> > 
> > Please look also at another anomaly, I've discovered.
> > 
> > An Assert added with d5f788b41 may be falsified with:
> > CREATE TABLE t(a int PRIMARY KEY);
> > INSERT INTO t VALUES (1);
> > CREATE VIEW v AS SELECT * FROM t;
> > 
> > MERGE INTO v USING (VALUES (1)) AS va(a) ON v.a = va.a
> >   WHEN MATCHED THEN DO NOTHING
> >   WHEN NOT MATCHED THEN DO NOTHING;
> > 
> > TRAP: failed Assert("resultRelInfo->ri_TrigDesc"), File: 
> > "nodeModifyTable.c", Line: 2891, PID: 1590670
> 
> Thanks.  When all the MERGE actions are DO NOTHING, view_has_instead_trigger()
> returns true

I've pushed the two patches for your reports.  To placate cfbot, I'm attaching
the remaining patches.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by FIXME.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 0400a50..461d925 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2256,6 +2256,11 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
locallock->numLockOwners = 0;
}
 
+#ifdef USE_ASSERT_CHECKING
+   if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+   elog(WARNING, "tuple lock held at commit");
+#endif
+
/*
 * If the lock or proclock pointers are NULL, this lock was 
taken via
 * the relation fast-path (and is not known to have been 
transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the heap_inplace_update_scan()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by FIXME and Alexander Lakhin.

Discussion: 
https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com

diff --git a/src/backend/access/heap/README.tuplock 
b/src/backend/access/heap/README.tuplock
index 6441e8b..dbfa2b7 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,56 @@ The following infomask bits are applicable:
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
+
+Locking to write inplace-updated tables
+---------------------------------------
+
+[This is the plan, but LOCKTAG_TUPLE acquisition is not yet here.]
+
+If IsInplaceUpdateRelation() returns true for a table, the table is a system
+catalog that receives heap_inplace_update_scan() calls.  Preparing a
+heap_update() of these tables follows additional locking rules, to ensure we
+don't lose the effects of an inplace update.  In particular, consider a moment
+when a backend has fetched the old tuple to modify, not yet having called
+heap_update().  Another backend's inplace update starting then can't conclude
+until the heap_update() places its new tuple in a buffer.  We enforce that
+using locktags as follows.  While DDL code is the main audience, the executor
+follows these rules to make e.g. "MERGE INTO pg_class" safer.  Locking rules
+are per-catalog:
+
+  pg_class heap_inplace_update_scan() callers: before the call, acquire
+  LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock
+  (VACUUM), or a mode with strictly more conflicts.  If the update targets a
+  row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be
+  on the table.  Locking the index rel is optional.  (This allows VACUUM to
+  overwrite per-index pg_class while holding a lock on the table alone.)  We
+  could allow weaker locks, in which case the next paragraph wou

[BUG?] check_exclusion_or_unique_constraint false negative

2024-07-14 Thread Michail Nikolaev
Hello, everyone!

While reviewing [1], I noticed that check_exclusion_or_unique_constraint
occasionally returns false negatives for btree unique indexes during UPSERT
operations.
Although this doesn't cause any real issues with INSERT ON CONFLICT, I
wanted to bring it to your attention, as it might indicate an underlying
problem.

Attached is a patch to reproduce the issue.

make -C src/test/modules/test_misc/ check PROVE_TESTS='t/006_*'

#   Failed test 'concurrent INSERTs status (got 2 vs expected 0)'
#   at t/006_concurrently_unique_fail.pl line 26.

#   Failed test 'concurrent INSERTs stderr /(?^:^$)/'
#   at t/006_concurrently_unique_fail.pl line 26.
#   'pgbench: error: client 34 script 0 aborted in command
0 query 0: ERROR:  we know 31337 in the index!

Best regards,
Mikhail

[1]:
https://www.postgresql.org/message-id/flat/CANtu0ogs10w%3DDgbYzZ8MswXE3PUC3J4SGDc0YEuZZeWbL0b6HA%40mail.gmail.com#8c01dcf6051e28c47d25e9471736947e
Subject: [PATCH] test + assert to reproduce possible issue with check_exclusion_or_unique_constraint
---
Index: src/backend/executor/execIndexing.c
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
--- a/src/backend/executor/execIndexing.c	(revision d5e6891502ca9e359aa5f5a381d904fe9d606338)
+++ b/src/backend/executor/execIndexing.c	(date 1720979367766)
@@ -889,6 +889,11 @@
 	}
 
 	index_endscan(index_scan);
+	if (!conflict && values[0] == 31337) {
+		ereport(ERROR,
+(errcode(ERRCODE_EXCLUSION_VIOLATION),
+		errmsg("we know 31337 in the index!")));
+	}
 
 	/*
 	 * Ordinarily, at this point the search should have found the originally
Index: src/test/modules/test_misc/t/006_concurrently_unique_fail.pl
diff --git a/src/test/modules/test_misc/t/006_concurrently_unique_fail.pl b/src/test/modules/test_misc/t/006_concurrently_unique_fail.pl
new file mode 100644
--- /dev/null	(date 1720979285840)
+++ b/src/test/modules/test_misc/t/006_concurrently_unique_fail.pl	(date 1720979285840)
@@ -0,0 +1,39 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key, n int)));
+
+$node->safe_psql('postgres', q(INSERT INTO tbl VALUES(31337,1)));
+
+$node->pgbench(
+	'--no-vacuum --client=40 --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent INSERTs',
+	{
+		'on_conflicts' => q(
+			INSERT INTO tbl VALUES(31337,1) on conflict(i) do update set n = EXCLUDED.n + 1;
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file


Re: Amcheck verification of GiST and GIN

2024-07-14 Thread Andrey M. Borodin
Hi Tomas!

Thank you so much for your interest in the patchset.

> On 10 Jul 2024, at 19:01, Tomas Vondra  wrote:
> 
> I realized amcheck GIN/GiST support would be useful for testing my
> patches adding parallel builds for these index types, so I decided to
> take a look at this and do an initial review today.

Great! Thank you!

> Attached is a patch series with extra commits to keep the review
> comments, plus patches adjusting the formatting by pgindent (the patch
> seems far enough along for this).

I was hoping to address your review comments this weekend, but unfortunately I
could not.  I'll do that ASAP, but in the meantime I decided to at least post
answers to your questions.

> 
> Let me quickly go through the review comments:
> 
> 1) Not sure I like 'amcheck.c' very much, I'd probably go with something
> like 'verify_common.c' to match naming of the other files. But it's just
> nitpicking and I can live with it.

Any name works for me.  We have dozens of files ending with "common.c", so I think
that's a good way to go.

> 2) amcheck_lock_relation_and_check seems to be the most important
> function, yet there's no comment explaining what it does :-(

Makes sense.

> 3) amcheck_lock_relation_and_check still has a TODO to add the correct
> name of the AM

Yes, I discovered it during the rebase and added a TODO.

> 4) Do we actually need amcheck_index_mainfork_expected as a separate
> function, or could it be a part of index_checkable?

It was a separate function before refactoring...

> 5) The comment for heaptuplespresent says "debug counter" but that does
> not really explain what it's for. (I see verify_nbtree has the same
> comment, but maybe let's improve that.)

It's there for a DEBUG1 message:
ereport(DEBUG1,
(errmsg_internal("finished verifying presence of " INT64_FORMAT 
" tuples from table \"%s\" with bitset %.2f%% set",
But the message is gone for GiST.  Perhaps we should restore it?

> 
> 6) I'd suggest moving the GISTSTATE + blocknum fields to the beginning
> of GistCheckState, it seems more natural to start with "generic" fields.

Makes sense.

> 7) I'd adjust the gist_check_parent_keys_consistency comment a bit, to
> explain what the function does first, and only then explain how.

Makes sense.

> 8) We seem to be copying PageGetItemIdCareful() around, right? And the
> copy in _gist.c still references nbtree - I guess that's not right.

The versions differ in two aspects:
1. The size of the opaque data may be different, but we can pass it as a parameter.
2. GIN's line pointer verification is slightly more strict.

> 
> 9) Why is the GIN function called gin_index_parent_check() and not
> simply gin_index_check() as for the other AMs?

AFAIR a function should be called _parent_ only if it takes ShareLock.
gin_index_parent_check() does not, so I think we should rename it.

> 10) The debug in gin_check_posting_tree_parent_keys_consistency triggers
> assert when running with client_min_messages='debug5', it seems to be
> accessing bogus item pointers.
> 
> 11) Why does it add pg_amcheck support only for GiST and not GIN?

The GiST part is by far the more polished.  When we were discussing the current
implementation with Peter G, we decided that we could finish the work on GiST
and then proceed to GIN.  The main concern is about GIN's locking model.




> On 12 Jul 2024, at 15:16, Tomas Vondra  wrote:
> 
> On 7/10/24 18:01, Tomas Vondra wrote:
>> ...
>> 
>> That's all for now. I'll add this to the stress-testing tests of my
>> index build patches, and if that triggers more issues I'll report those.
>> 
> 
> As mentioned a couple days ago, I started using this patch to validate
> the patches adding parallel builds to GIN and GiST indexes - I use scripts
> to stress-test the builds, and I added the new amcheck functions as
> another validation step.
> 
> For GIN indexes it didn't find anything new (in either this or my
> patches), aside from the assert crash I already reported.
> 
> But for GiST it turned out to be very valuable - it did actually find an
> issue in my patches, or rather confirm my hypothesis that the way the
> patch generates fake LSN may not be quite right.
> 
> In particular, I've observed these two issues:
> 
>  ERROR:  heap tuple (13315,38) from table "planet_osm_roads" lacks
>  matching index tuple within index "roads_7_1_idx"
> 
>  ERROR:  index "roads_7_7_idx" has inconsistent records on page 23723
>  offset 113
> 
> And those consistency issues are real - I've managed to reproduce issues
> with incorrect query results (by comparing the results to an index built
> without parallelism).
> 
> So that's nice - it shows the value of this patch, and I like it.

That's great!

> One thing I've been wondering about is that currently amcheck (in
> general, not just these new GIN/GiST functions) errors out on the first
> issue, because it does ereport(ERROR). Which is good enough to decide if
> there is some corruption, but a bit inconvenient if you need to assess
> how much corruption is there. For example when in

Re: CREATE INDEX CONCURRENTLY on partitioned index

2024-07-14 Thread Michael Paquier
On Fri, Jul 12, 2024 at 11:17:25PM +0100, Ilya Gladyshev wrote:
> Sure, created a separate thread [1]. Please disregard the second patch in
> this thread. Duplicating the last version of the relevant patch here to
> avoid any confusion.
> 
> [1] 
> https://www.postgresql.org/message-id/b72f2d89-820a-4fa2-9058-b155cf646f4f%40gmail.com

Thanks, will check that.
--
Michael




Re: Allow non-superuser to cancel superuser tasks.

2024-07-14 Thread Michael Paquier
On Fri, Jul 12, 2024 at 11:19:05AM -0500, Nathan Bossart wrote:
> I suppose it would be silly to allow even lower values for
> autovacuum_naptime (e.g., by moving it to ConfigureNamesReal and setting
> the minimum to 0.1).

I've thought about that as well, and did not mention it as this would
encourage insanely low naptime values resulting in fork() bursts.

> That's a neat trick.  I was confused why this test generates an autovacuum
> worker at all, but I now see that you are pausing it before we even gather
> the list of tables that need to be vacuumed.

Yep.  More aggressive signals aren't going to help.  One thing I also
considered here is to manipulate the db list timestamps inside a
USE_INJECTION_POINTS block in the launcher to make the spawn more
aggressive.  Anyway, with 600ms in detection where I've tested it, I
can live with the responsiveness of the patch as proposed.

> Looks reasonable to me.

Thanks.  I'll see about stressing the buildfarm tomorrow or so, after
looking at how the CI reacts.
--
Michael




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-07-14 Thread Alexander Korotkov
Thanks to Kyotaro for the review.  And thanks to Ivan for the patch
revision.  I made another revision of the patch.

On Tue, Jul 9, 2024 at 12:51 PM Kartyshov Ivan
 wrote:
> Thank you for your interest in the patch.
>
> On 2024-06-20 11:30, Kyotaro Horiguchi wrote:
> > Hi, I looked through the patch and have some comments.
> >
> >
> > == L68:
> > +Recovery Procedures
> >
> > It looks somewhat confusing and appears as if the section is intended
> > to explain how to perform recovery. Since this is the first built-in
> > procedure, I'm not sure how should this section be written. However,
> > the section immediately above is named "Recovery Control Functions",
> > so "Recovery Synchronization Functions" would align better with the
> > naming of the upper section. (I don't believe we need to be so strict
> > about the distinction between functions and procedures here.)
> >
> > It looks strange that the procedure signature includes the return type.
>
> Good point, change
> Recovery Procedures -> Recovery Synchronization Procedures

Thank you, looks good to me.

> > == L93:
> > +If timeout is not specified or zero,
> > this
> > +procedure returns once WAL is replayed upto
> > +target_lsn.
> > +If the timeout is specified (in
> > +milliseconds) and greater than zero, the procedure waits until
> > the
> > +server actually replays the WAL upto
> > target_lsn or
> > +until the given time has passed. On timeout, an error is
> > emitted.
> >
> > The first sentence should mention the main functionality. Following
> > precedents, it might be better to use something like "Waits until
> > recovery surpasses the specified LSN. If no timeout is specified or it
> > is set to zero, this procedure waits indefinitely for the LSN. If the
> > timeout is specified (in milliseconds) and is greater than zero, the
> > procedure waits until the LSN is reached or the specified time has
> > elapsed. On timeout, or if the server is promoted before the LSN is
> > reached, an error is emitted."
> >
> > The detailed explanation that follows the above seems somewhat too
> > verbose to me, as other functions don't have such detailed examples.
>
> Please offer your description. I think it would be better.

Kyotaro actually provided a paragraph in his message.  I've integrated
it into the patch.

> > == L484
> > /*
> > +  * Set latches for processes, whose waited LSNs are already replayed.
> > This
> > +  * involves spinlocks.  So, we shouldn't do this under a spinlock.
> > +  */
> >
> > Here, I'm not quite sure what specifically spinlock (or mutex?) is
> > referring to.  However, more importantly, shouldn't we explain that it
> > is okay not to use any locks at all, rather than mentioning that
> > spinlocks should not be used here? I found a similar case around
> > freelist.c:238, which is written as follows.
> >
> >>   * Not acquiring ProcArrayLock here which is slightly icky. 
> >> It's
> >>   * actually fine because procLatch isn't ever freed, so we 
> >> just can
> >>   * potentially set the wrong process' (or no process') latch.
> >>   */
> >>  SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
>
> ???

I've revised the comment.

> > = L518
> > +void
> > +WaitForLSN(XLogRecPtr targetLSN, int64 timeout)
> >
> > This function is only called within the same module. I'm not sure if
> > we need to expose it. If we do, the name should probably be more
> > specific. I'm not quite sure if the division of functionality between
> > this function and its only caller function is appropriate.  As a
> > possible refactoring, we could have WaitForLSN() just return the
> > result as [reached, timedout, promoted] and delegate prerequisite
> > checks and error reporting to the SQL function.

I think WaitForLSNReplay() is a better name.  And it's not clear what
API we would need.  So I would prefer to keep it static for now.

> waitLSN -> waitLSNStates
> No, waitLSNStates is not the best name, because waitLSNState is a state,
> and waitLSN is not an array of waitLSNStates. We can think about
> another name; what do you think?
>
> > = L524
> > + /* Shouldn't be called when shmem isn't initialized */
> > + Assert(waitLSN);
> >
> > Seeing this assertion, I feel that the name "waitLSN" is a bit
> > obscure. How about renaming it to "waitLSNStates"?

I agree that waitLSN is too generic.  I've renamed it to waitLSNState.

> > = L527
> > + /* Should be only called by a backend */
> > + Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends);
> >
> > This is somewhat excessive, causing a server crash when ending with an
> > error would suffice. By the way, if I execute "CALL
> > pg_wal_replay_wait('0/0')" on a logical walsender, the server crashes.
> > The condition doesn't seem appropriate.
>
> Can you give more information on your server crashes, so I could repeat
> them.

I've rech

Re: Check lateral references within PHVs for memoize cache keys

2024-07-14 Thread Richard Guo
On Thu, Jul 11, 2024 at 5:18 PM Richard Guo  wrote:
> Attached is an updated version of this patch.

I've pushed this patch.  Thanks for all the reviews.

Thanks
Richard




Re: Pgoutput not capturing the generated columns

2024-07-14 Thread Peter Smith
Hi, here are some review comments about patch v17-0003

==
1.
Missing a docs change?

Previously, (v16-0002) the patch included a change to
doc/src/sgml/protocol.sgml like below to say STORED generated instead
of just generated.


-Boolean option to enable generated columns. This option controls
-whether generated columns should be included in the string
-representation of tuples during logical decoding in PostgreSQL.
+Boolean option to enable STORED generated columns.
+This option controls whether STORED
generated columns
+should be included in the string representation of tuples
during logical
+decoding in PostgreSQL.


Why is that v16 change no longer present in patch v17-0003?

==
src/backend/catalog/pg_publication.c

2.
Previously, (v16-0003) this patch included a change to clarify the
kind of generated cols that are allowed in a column list.

  * Translate a list of column names to an array of attribute numbers
  * and a Bitmapset with them; verify that each attribute is appropriate
- * to have in a publication column list (no system or generated attributes,
- * no duplicates).  Additional checks with replica identity are done later;
- * see pub_collist_contains_invalid_column.
+ * to have in a publication column list (no system or virtual generated
+ * attributes, no duplicates). Additional checks with replica identity
+ * are done later; see pub_collist_contains_invalid_column.

Why is that v16 change no longer present in patch v17-0003?

==
src/backend/replication/logical/tablesync.c

3. make_copy_attnamelist

- if (!attr->attgenerated)
+ if (attr->attgenerated != ATTRIBUTE_GENERATED_STORED)
  continue;

IIUC this logic is checking to make sure the subscriber-side table
column was not a generated column (because we don't replicate on top
of generated columns). So, does the distinction of STORED/VIRTUAL
really matter here?

~~~

fetch_remote_table_info:
nitpick - Should not change any spaces unrelated to the patch

==

send_relation_and_attrs:

- if (att->attgenerated && !include_generated_columns)
+ if (att->attgenerated && (att->attgenerated !=
ATTRIBUTE_GENERATED_STORED || !include_generated_columns))
  continue;

nitpick - It seems over-complicated. Conditions can be split so the
code fragment looks the same as in other places in this patch.

==
99.
Please see the attached diffs patch that implements any nitpicks
mentioned above.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
diff --git a/src/backend/replication/logical/tablesync.c 
b/src/backend/replication/logical/tablesync.c
index c2a7d18..5288769 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1008,7 +1008,7 @@ fetch_remote_table_info(char *nspname, char *relname, 
bool **remotegenlist_res,
 "  LEFT JOIN pg_catalog.pg_index i"
 "   ON (i.indexrelid = 
pg_get_replica_identity_index(%u))"
 " WHERE a.attnum > 0::pg_catalog.int2"
-" AND NOT a.attisdropped", 
lrel->remoteid);
+"   AND NOT a.attisdropped", 
lrel->remoteid);
 
if(server_version >= 12)
{
diff --git a/src/backend/replication/pgoutput/pgoutput.c 
b/src/backend/replication/pgoutput/pgoutput.c
index 944554d..a256ab7 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -786,8 +786,14 @@ send_relation_and_attrs(Relation relation, TransactionId 
xid,
if (att->attisdropped)
continue;
 
-   if (att->attgenerated && (att->attgenerated != 
ATTRIBUTE_GENERATED_STORED || !include_generated_columns))
-   continue;
+   if (att->attgenerated)
+   {
+   if (!include_generated_columns)
+   continue;
+
+   if (att->attgenerated != ATTRIBUTE_GENERATED_STORED)
+   continue;
+   }
 
if (att->atttypid < FirstGenbkiObjectId)
continue;


Re: Wrong results with grouping sets

2024-07-14 Thread Richard Guo
FWIW, in addition to fixing wrong result issues for queries with
grouping sets, the changes in 0001 also improve performance for
queries that have subqueries in the grouping expressions, because
different instances of the same subquery would need to be executed
only once.  As a simple example, consider

create table t (a int, b int);
insert into t select i, i from generate_series(1,1)i;
analyze t;

-- on patched
explain (analyze, costs off)
select (select t1.b from t t2 where a = t1.a) as s1,
   (select t1.b from t t2 where a = t1.a) as s2,
   (select t1.b from t t2 where a = t1.a) as s3
from t t1
group by a, s1;
 QUERY PLAN

 Group (actual time=20475.028..20480.543 rows=1 loops=1)
   Group Key: t1.a, ((SubPlan 1))
   ->  Sort (actual time=20475.017..20475.821 rows=1 loops=1)
 Sort Key: t1.a, ((SubPlan 1))
 Sort Method: quicksort  Memory: 697kB
 ->  Seq Scan on t t1 (actual time=7.435..20468.599 rows=1 loops=1)
   SubPlan 1
 ->  Seq Scan on t t2 (actual time=1.022..2.045 rows=1
loops=1)
   Filter: (a = t1.a)
   Rows Removed by Filter: 
 Planning Time: 1.561 ms
 Execution Time: 20481.933 ms
(12 rows)

-- on master
explain (analyze, costs off)
select (select t1.b from t t2 where a = t1.a) as s1,
   (select t1.b from t t2 where a = t1.a) as s2,
   (select t1.b from t t2 where a = t1.a) as s3
from t t1
group by a, s1;
 QUERY PLAN

 Group (actual time=20779.318..62233.526 rows=1 loops=1)
   Group Key: t1.a, ((SubPlan 1))
   ->  Sort (actual time=20775.125..20777.936 rows=1 loops=1)
 Sort Key: t1.a, ((SubPlan 1))
 Sort Method: quicksort  Memory: 697kB
 ->  Seq Scan on t t1 (actual time=7.492..20770.060 rows=1 loops=1)
   SubPlan 1
 ->  Seq Scan on t t2 (actual time=1.037..2.075 rows=1
loops=1)
   Filter: (a = t1.a)
   Rows Removed by Filter: 
   SubPlan 2
 ->  Seq Scan on t t2_1 (actual time=1.037..2.071 rows=1 loops=1)
   Filter: (a = t1.a)
   Rows Removed by Filter: 
   SubPlan 3
 ->  Seq Scan on t t2_2 (actual time=1.037..2.071 rows=1 loops=1)
   Filter: (a = t1.a)
   Rows Removed by Filter: 
 Planning Time: 1.286 ms
 Execution Time: 62235.753 ms
(20 rows)

We can see that with the 0001 patch, this query runs ~3 times faster,
which is no surprise because there are 3 instances of the same
subquery in the targetlist.

Thanks
Richard




Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

2024-07-14 Thread Fujii Masao



On 2024/07/12 1:16, Robert Haas wrote:

On Thu, Jul 11, 2024 at 6:51 AM Fujii Masao  wrote:

It looks like the fast_forward field in WalSummarizerData is no longer 
necessary.

So far, I haven't found any other issues with the patch.


Thanks for reviewing. Regarding fast_forward, I think I had the idea
in mind that perhaps it should be exposed by
pg_get_wal_summarizer_state(),


Understood.



but I didn't actually implement that.
Thinking about it again, I think maybe it's fine to just remove it
from the shared memory state, as this should be a rare scenario in
practice. What is your opinion?


I don't think it's a rare scenario since summarize_wal can be enabled
after starting the server with wal_level=minimal. Therefore, I believe
such a configuration should be prohibited using a GUC check hook,
as my patch does. Alternatively, we should at least report or
log something when summarize_wal is enabled but fast_forward is also
enabled, so users can easily detect or investigate this unexpected situation.
I'm not sure if exposing fast_forward is necessary for that or not...

Regarding pg_get_wal_summarizer_state(), it is documented that
summarized_lsn indicates the ending LSN of the last WAL summary file
written to disk. However, with the patch, summarized_lsn advances
even when fast_forward is enabled. The documentation should be updated,
or summarized_lsn should be changed so it doesn't advance while
fast_forward is enabled.


On 2024/07/12 3:00, Robert Haas wrote:

On Thu, Jul 11, 2024 at 6:51 AM Fujii Masao  wrote:

So far, I haven't found any other issues with the patch.


Here is a new version that removes the hunks you highlighted and also
adds a test case.


Thanks for updating the patch! LGTM.

I have one small comment:

+# This test aims to validate that takeing an incremental backup fails when

"takeing" should be "taking"?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: s/shm_mq_iovec/struct iovec/

2024-07-14 Thread Thomas Munro
On Thu, Jul 4, 2024 at 9:26 PM Heikki Linnakangas  wrote:
> On 15/04/2024 04:20, Thomas Munro wrote:
> > I was grepping for iovec users and noticed that the shm_mq stuff
> > defines its own iovec struct.  Is there any reason not to use the
> > standard one, now that we can?  Will add to next commitfest.
>
> I think it's better to keep them separate. They serve a similar purpose,
> but they belong to completely separate APIs; I think "incidental
> deduplication" is the right term for that. shm_mq_iovec is only used by
> our shm queue implementation, while struct iovec is part of the POSIX
> API. We wouldn't want to leak IOV_MAX into how shm_mq_iovec is used, for
> example. Or as a thought experiment, if our shm_mq implementation needed
> an extra flag in the struct or something, we would be free to just add
> it. But if it we reused struct iovec, then we couldn't, or we'd need to
> create a new struct again.

Thanks for looking.  I marked this "returned with feedback".

(For future parallel query work, I have a scheme where queues can
spill to disk instead of blocking, which unblocks a bunch of parallel
execution strategies that are currently impossible due to flow control
deadlock risk.  Then the two APIs meet and it annoys me that they are
not the same, so maybe I'll talk about this again some day :-))




Re: Confine vacuum skip logic to lazy_scan_skip

2024-07-14 Thread Thomas Munro
On Mon, Jul 8, 2024 at 2:49 AM Noah Misch  wrote:
> what is the scope of the review you seek?

The patch "Refactor tidstore.c memory management." could definitely
use some review.  I wasn't sure if that should be proposed in a new
thread of its own, but then the need for it comes from this
streamifying project, so...  The basic problem was that we want to
build up a stream of block to be vacuumed (so that we can perform the
I/O combining etc) + some extra data attached to each buffer, in this
case the TID list, but the implementation of tidstore.c in master
would require us to make an extra intermediate copy of the TIDs,
because it keeps overwriting its internal buffer.  The proposal here
is to make it so that you can get get a tiny copyable object that can
later be used to retrieve the data into a caller-supplied buffer, so
that tidstore.c's iterator machinery doesn't have to have its own
internal buffer at all, and then calling code can safely queue up a
few of these at once.




Re: race condition when writing pg_control

2024-07-14 Thread Thomas Munro
On Fri, Jul 12, 2024 at 11:43 PM Noah Misch  wrote:
> On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote:
> > On Fri, May 17, 2024 at 4:46 PM Thomas Munro  wrote:
> > > The specific problem here is that LocalProcessControlFile() runs in
> > > every launched child for EXEC_BACKEND builds.  Windows uses
> > > EXEC_BACKEND, and Windows' NTFS file system is one of the two file
> > > systems known to this list to have the concurrent read/write data
> > > mashing problem (the other being ext4).
>
> > First idea I've come up with to avoid all of that: pass a copy of
> > the "proto-controlfile", to coin a term for the one read early in
> > postmaster startup by LocalProcessControlFile().  As far as I know,
> > the only reason we need it is to suck some settings out of it that
> > don't change while a cluster is running (mostly can't change after
> > initdb, and checksums can only be {en,dis}abled while down).  Right?
> > Children can just "import" that sucker instead of calling
> > LocalProcessControlFile() to figure out the size of WAL segments yada
> > yada, I think?  Later they will attach to the real one in shared
> > memory for all future purposes, once normal interlocking is allowed.
>
> I like that strategy, particularly because it recreates what !EXEC_BACKEND
> backends inherit from the postmaster.  It might prevent future bugs that would
> have been specific to EXEC_BACKEND.

Thanks for looking!  Yeah, that is a good way to put it.

The only other idea I can think of is that the Postmaster could take
all of the things that LocalProcessControlFile() wants to extract from
the file, and transfer them via that struct used for EXEC_BACKEND as
individual variables, instead of this new proto-controlfile copy.  I
think it would be a bigger change with no obvious-to-me additional
benefit, so I didn't try it.

> > I dunno.  Draft patch attached.  Better plans welcome.  This passes CI
> > on Linux systems afflicted by EXEC_BACKEND, and Windows.  Thoughts?
>
> Looks reasonable.  I didn't check over every detail, given the draft status.

I'm going to upgrade this to a proposal:

https://commitfest.postgresql.org/49/5124/

I wonder how often this happens in the wild.




Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?

2024-07-14 Thread Fujii Masao




On 2024/07/12 21:17, Masahiko Sawada wrote:

On Thu, Jul 11, 2024 at 8:14 PM Fujii Masao  wrote:




On 2024/07/10 22:35, Masahiko Sawada wrote:

BTW the new regression tests don't check the table and index names.
Isn't it better to show table and index names for better
diagnosability?


Sounds good to me. I've updated the patch as suggested.
Please see the attached patch.



Thank you for updating the patch! LGTM.


Thanks for reviewing the patch! I've pushed it.

However, some buildfarm members reported errors, so I'll investigate further.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Fix a comment error in logicalrep_write_typ()

2024-07-14 Thread Amit Kapila
On Thu, Jul 11, 2024 at 4:35 PM cca5507  wrote:
>
> Thank you for review!
>
> The commitfest link for tracking:
> https://commitfest.postgresql.org/49/5121/
>

I've pushed and closed the CF entry.

-- 
With Regards,
Amit Kapila.




Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?

2024-07-14 Thread Fujii Masao



On 2024/07/15 13:33, Fujii Masao wrote:



On 2024/07/12 21:17, Masahiko Sawada wrote:

On Thu, Jul 11, 2024 at 8:14 PM Fujii Masao  wrote:




On 2024/07/10 22:35, Masahiko Sawada wrote:

BTW the new regression tests don't check the table and index names.
Isn't it better to show table and index names for better
diagnosability?


Sounds good to me. I've updated the patch as suggested.
Please see the attached patch.



Thank you for updating the patch! LGTM.


Thanks for reviewing the patch! I've pushed it.

However, some buildfarm members reported errors, so I'll investigate further.


Attached patch fixes unstable tests. Currently testing before pushing.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From 8da2a15b983be06abe08f838ab01bfc0e492 Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Mon, 15 Jul 2024 13:55:51 +0900
Subject: [PATCH] Fix unstable tests in partition_merge.sql and
 partition_split.sql.

The tests added by commit c086896625 were unstable due to
missing schema names when checking pg_tables and pg_indexes.

Reported by buildfarm.
---
 src/test/regress/expected/partition_merge.out | 7 +--
 src/test/regress/expected/partition_split.out | 6 --
 src/test/regress/sql/partition_merge.sql  | 7 +--
 src/test/regress/sql/partition_split.sql  | 6 --
 4 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/src/test/regress/expected/partition_merge.out b/src/test/regress/expected/partition_merge.out
index 26bf58b23d..2b6e92f892 100644
--- a/src/test/regress/expected/partition_merge.out
+++ b/src/test/regress/expected/partition_merge.out
@@ -862,13 +862,15 @@ SET search_path = partitions_merge_schema, pg_temp, public;
 ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
 ROLLBACK;
 -- Check the new partition inherits parent's tablespace
+SET search_path = partitions_merge_schema, pg_temp, public;
 CREATE TABLE t (i int PRIMARY KEY USING INDEX TABLESPACE regress_tblspace)
   PARTITION BY RANGE (i) TABLESPACE regress_tblspace;
 CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
 CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);
 ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
 SELECT tablename, tablespace FROM pg_tables
-  WHERE tablename IN ('t', 'tp_0_2') ORDER BY tablename, tablespace;
+  WHERE tablename IN ('t', 'tp_0_2') AND schemaname = 'partitions_merge_schema'
+  ORDER BY tablename, tablespace;
  tablename |tablespace
 ---+--
  t | regress_tblspace
@@ -876,7 +878,8 @@ SELECT tablename, tablespace FROM pg_tables
 (2 rows)
 
 SELECT tablename, indexname, tablespace FROM pg_indexes
-  WHERE tablename IN ('t', 'tp_0_2') ORDER BY tablename, indexname, tablespace;
+  WHERE tablename IN ('t', 'tp_0_2') AND schemaname = 'partitions_merge_schema'
+  ORDER BY tablename, indexname, tablespace;
  tablename |  indexname  |tablespace
 ---+-+--
  t | t_pkey  | regress_tblspace
diff --git a/src/test/regress/expected/partition_split.out b/src/test/regress/expected/partition_split.out
index 1a8c95ad81..dc9a5130cc 100644
--- a/src/test/regress/expected/partition_split.out
+++ b/src/test/regress/expected/partition_split.out
@@ -1502,7 +1502,8 @@ ALTER TABLE t SPLIT PARTITION tp_0_2 INTO
   (PARTITION tp_0_1 FOR VALUES FROM (0) TO (1),
PARTITION tp_1_2 FOR VALUES FROM (1) TO (2));
 SELECT tablename, tablespace FROM pg_tables
-  WHERE tablename IN ('t', 'tp_0_1', 'tp_1_2') ORDER BY tablename, tablespace;
+  WHERE tablename IN ('t', 'tp_0_1', 'tp_1_2') AND schemaname = 'partition_split_schema'
+  ORDER BY tablename, tablespace;
  tablename |tablespace
 ---+--
  t | regress_tblspace
@@ -1511,7 +1512,8 @@ SELECT tablename, tablespace FROM pg_tables
 (3 rows)
 
 SELECT tablename, indexname, tablespace FROM pg_indexes
-  WHERE tablename IN ('t', 'tp_0_1', 'tp_1_2') ORDER BY tablename, indexname, tablespace;
+  WHERE tablename IN ('t', 'tp_0_1', 'tp_1_2') AND schemaname = 'partition_split_schema'
+  ORDER BY tablename, indexname, tablespace;
  tablename |  indexname  |tablespace
 ---+-+--
  t | t_pkey  | regress_tblspace
diff --git a/src/test/regress/sql/partition_merge.sql b/src/test/regress/sql/partition_merge.sql
index 200bd3e762..6feceeef6c 100644
--- a/src/test/regress/sql/partition_merge.sql
+++ b/src/test/regress/sql/partition_merge.sql
@@ -537,15 +537,18 @@ ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
 ROLLBACK;
 
 -- Check the new partition inherits parent's tablespace
+SET search_path = partitions_merge_schema, pg_temp, public;
 CREATE TABLE t (i int PRIMARY KEY USING INDEX TABLESPACE regress_tblspace)
   PARTITION BY RANGE (i) TABLESPACE regress_tblspace;
 CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
 CREATE TABLE tp_1_2 

Re: Sort functions with specialized comparators

2024-07-14 Thread Антуан Виолин
>
> Hello all.
>
> I have decided to explore more areas in which I can optimize and have added
> two new benchmarks. Do you have any thoughts on this?
>
> postgres=# select bench_int16_sort(100);
> bench_int16_sort
>
>
> -
> Time taken by usual sort: 66354981 ns, Time taken by optimized sort:
> 52151523 ns, Percentage difference: 21.41%
> (1 row)
>
> postgres=# select bench_float8_sort(100);
> bench_float8_sort
>
>
> --
> Time taken by usual sort: 121475231 ns, Time taken by optimized sort:
> 74458545 ns, Percentage difference: 38.70%
> (1 row)
>

Hello all
We would like to see the relationship between the length of the sorted array
and the performance gain, perhaps with some graphs. We would also like to see
a "worst case" test: sorting an array into ascending order when it is
initially in descending order.

Best, regards, Antoine Violin

postgres=#
>

On Mon, Jul 15, 2024 at 10:32 AM Stepan Neretin  wrote:

>
>
> On Sat, Jun 8, 2024 at 1:50 AM Stepan Neretin  wrote:
>
>> Hello all.
>>
>> I am interested in the proposed patch and would like to propose some
>> additional changes that would complement it. My changes would introduce
>> similar optimizations when working with a list of integers or object
>> identifiers. Additionally, my patch includes an extension for
>> benchmarking, which shows an average speedup of 30-40%.
>>
>> postgres=# SELECT bench_oid_sort(100);
>>  bench_oid_sort
>>
>>
>> 
>>  Time taken by list_sort: 116990848 ns, Time taken by list_oid_sort:
>> 80446640 ns, Percentage difference: 31.24%
>> (1 row)
>>
>> postgres=# SELECT bench_int_sort(100);
>>  bench_int_sort
>>
>>
>> 
>>  Time taken by list_sort: 118168506 ns, Time taken by list_int_sort:
>> 80523373 ns, Percentage difference: 31.86%
>> (1 row)
>>
>> What do you think about these changes?
>>
>> Best regards, Stepan Neretin.
>>
>> On Fri, Jun 7, 2024 at 11:08 PM Andrey M. Borodin 
>> wrote:
>>
>>> Hi!
>>>
>>> In a thread about sorting comparators[0] Andres noted that we have
>>> infrastructure to help compiler optimize sorting. PFA attached PoC
>>> implementation. I've checked that it indeed works on the benchmark from
>>> that thread.
>>>
>>> postgres=# CREATE TABLE arrays_to_sort AS
>>>SELECT array_shuffle(a) arr
>>>FROM
>>>(SELECT ARRAY(SELECT generate_series(1, 100)) a),
>>>generate_series(1, 10);
>>>
>>> postgres=# SELECT (sort(arr))[1] FROM arrays_to_sort; -- original
>>> Time: 990.199 ms
>>> postgres=# SELECT (sort(arr))[1] FROM arrays_to_sort; -- patched
>>> Time: 696.156 ms
>>>
>>> The benefit seems to be on the order of magnitude with 30% speedup.
>>>
>>> There's plenty of sorting by TransactionId, BlockNumber, OffsetNumber,
>>> Oid etc. But this sorting routines never show up in perf top or something
>>> like that.
>>>
>>> Seems like in most cases we do not spend much time in sorting. But
>>> specialization does not cost us much too, only some CPU cycles of a
>>> compiler. I think we can further improve speedup by converting inline
>>> comparator to value extractor: more compilers will see what is actually
>>> going on. But I have no proofs for this reasoning.
>>>
>>> What do you think?
>>>
>>>
>>> Best regards, Andrey Borodin.
>>>
>>> [0]
>>> https://www.postgresql.org/message-id/flat/20240209184014.sobshkcsfjix6u4r%40awork3.anarazel.de#fc23df2cf314bef35095b632380b4a59
>>
>>
> Hello all.
>
> I have decided to explore more areas in which I can optimize and have
> added two new benchmarks. Do you have any thoughts on this?
>
> postgres=# select bench_int16_sort(100);
> bench_int16_sort
>
>
> -
>  Time taken by usual sort: 66354981 ns, Time taken by optimized sort:
> 52151523 ns, Percentage difference: 21.41%
> (1 row)
>
> postgres=# select bench_float8_sort(100);
> bench_float8_sort
>
>
> --
>  Time taken by usual sort: 121475231 ns, Time taken by optimized sort:
> 74458545 ns, Percentage difference: 38.70%
> (1 row)
>
> postgres=#
>
> Best regards, Stepan Neretin.
>


Re: Pgoutput not capturing the generated columns

2024-07-14 Thread Peter Smith
Hi, I had a quick look at the patch v17-0004 which is the split-off
new BMS logic.

IIUC this 0004 is currently undergoing some refactoring and
cleaning-up, so I won't comment much about it except to give the
following observation below.

==
src/backend/replication/logical/proto.c.

I did not expect to see any code fragments that are still checking
generated columns like below:

logicalrep_write_tuple:

  if (att->attgenerated)
  {
- if (!include_generated_columns)
- continue;

  if (att->attgenerated != ATTRIBUTE_GENERATED_STORED)
  continue;
~

  if (att->attgenerated)
  {
- if (!include_generated_columns)
- continue;

  if (att->attgenerated != ATTRIBUTE_GENERATED_STORED)
  continue;

~~~

logicalrep_write_attrs:

  if (att->attgenerated)
  {
- if (!include_generated_columns)
- continue;

  if (att->attgenerated != ATTRIBUTE_GENERATED_STORED)
  continue;

~
if (att->attgenerated)
  {
- if (!include_generated_columns)
- continue;

  if (att->attgenerated != ATTRIBUTE_GENERATED_STORED)
  continue;
~~~


AFAIK, now checking support of generated columns will be done when the
BMS 'columns' is assigned, so the continuation code will be handled
like this:

if (!column_in_column_list(att->attnum, columns))
  continue;

==

BTW there is a subtle but significant difference in this 0004 patch.
IOW, we are introducing a difference between the list of published
columns VERSUS a publication column list. So please make sure that all
code comments are adjusted appropriately so they are not misleading by
still calling these "column lists".

BEFORE: BMS 'columns'  means "columns of the column list" or NULL if
there was no publication column list
AFTER: BMS 'columns' means "columns to be replicated" or NULL if all
columns are to be replicated

==
Kind Regards,
Peter Smith.




Re: on_error table, saving error info to a table

2024-07-14 Thread Nishant Sharma
Thanks for the patch!

I was not able to understand why we need a special error table for COPY.
Can't we just log the errors to a dedicated error file for COPY in a proper
format? With the table approach, PG would unnecessarily use extra resources:
CPU to format the data into pages for its table format, I/O to write the
table and keep it in its store (which may or may not be required), and the
user would be required to make sure the error table is created with the
proper columns to be used in COPY, etc.


Meanwhile, please find some quick review comments:-

1) Patch needs re-base.

2) If the columns of the error table are fixed, then why not create it
internally using
some function or something instead of making the user create the table
correctly
with all the columns?

3) I think the error messages can be improved; they look too long to me.

4) I think no need of below pstrdup, As CStringGetTextDatum creates a text
copy for
the same:-
err_code =
pstrdup(unpack_sql_state(cstate->escontext->error_data->sqlerrcode));

t_values[9] = CStringGetTextDatum(err_code);

5) Shouldn't 'on_error_rel' be checked for not being NULL along with the
below, as I can see it is passed as NULL from two locations?
+   if (cstate->opts.on_error == COPY_ON_ERROR_TABLE)
+   {

6) The below declarations can be shifted into the if block where they are
used, instead of being kept at function scope:
+   HeapTuple   on_error_tup;
+   TupleDesc   on_error_tupDesc;

+   if (cstate->opts.on_error == COPY_ON_ERROR_TABLE)
+   {

7) Below comment going beyond 80 char width:-
* if on_error is specified with 'table', then on_error_rel is the error
saving table

8) Need space after 'false'
err_detail ? false: true;

9) The below call can fit on a single line. No need to keep the 2nd and 3rd
parameters on the next lines:
+   on_error_tup = heap_form_tuple(on_error_tupDesc,
+   t_values,
+   t_isnull);

10) Variable declarations Tab Spacing issue at multiple places.

11) Comments in the patch are not matched to PG comment style.


Kindly note I have not tested the patch properly yet; I have only checked it
with a positive test case, as I will wait for a conclusion on my opinion of
the proposed patch.


Regards,
Nishant Sharma.
EnterpriseDB, Pune.


On Sat, Feb 3, 2024 at 11:52 AM jian he  wrote:

> Hi.
> I previously did some work in COPY FROM save error information to a table.
> still based on this suggestion:
> https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us
> Now I refactored it.
>
> the syntax:
> ON_ERROR 'table', TABLE 'error_saving_tbl'
>
> if ON_ERROR is not specified with 'table', TABLE is specified, then error.
> if ON_ERROR is specified with 'table', TABLE is not specified or
> error_saving_tbl does not exist, then error.
>
> In BeginCopyFrom, we check the data definition of error_saving_table,
> we also check if the user has INSERT privilege to error_saving_table
> (all the columns).
> We also did a preliminary check of the lock condition of
> error_saving_table.
>
> if it does not meet these conditions, we quickly error out.
> error_saving_table will be the same schema as the copy from table.
>
> Because "table" is a keyword, I have to add the following changes to
> gram.y.
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -3420,6 +3420,10 @@ copy_opt_item:
>   {
>   $$ = makeDefElem("null", (Node *) makeString($3), @1);
>   }
> + | TABLE opt_as Sconst
> + {
> + $$ = makeDefElem("table", (Node *) makeString($3), @1);
> + }
>
> since "table" is already a keyword, there should be no influence on the
> parsing speed?
>
> demo:
>
> create table err_tbl(
> userid oid, -- the user oid while copy generated this entry
> copy_tbl oid, --copy table
> filename text,
> lineno  int8,
> linetext,
> colname text,
> raw_field_value text,
> err_message text,
> err_detail text,
> errorcode text
> );
> create table t_copy_tbl(a int, b int, c int);
> COPY t_copy_tbl FROM STDIN WITH (delimiter ',', on_error 'table',
> table err_tbl);
> 1,2,a
> \.
>
> table err_tbl \gx
> -[ RECORD 1 ]---+---
> userid  | 10
> copy_tbl| 17920
> filename| STDIN
> lineno  | 1
> line| 1,2,a
> colname | c
> raw_field_value | a
> err_message | invalid input syntax for type integer: "a"
> err_detail  |
> errorcode   | 22P02
>


Re: Extension for PostgreSQL cast jsonb to hstore WIP

2024-07-14 Thread Антуан Виолин
On 2024-04-03 We 04:21, Andrew Dunstan

> I don't think a cast that doesn't cater for all the forms json can take is
> going to work very well. At the very least you would need to error out in
> cases you didn't want to cover, and have tests for all of those errors. But
> the above is only a tiny fraction of those. If the error cases are going to
> be so much more than the cases that work it seems a bit pointless.
>
Hi everyone
I have changed my mail account so that my name is displayed properly in the
correspondence. I have also made the cast raise an error if it is given a
value it cannot handle. I believe that such a cast is needed by PostgreSQL,
since we already have several incomplete casts, but they perform their
duties well and help in the right situations.

cheers
Antoine Violin

Antoine

On Mon, Jul 15, 2024 at 12:42 PM Andrew Dunstan  wrote:

>
> On 2024-04-02 Tu 11:43, ShadowGhost wrote:
>
> At the moment, this cast supports only these structures, as it was enough
> for my tasks:
> {str:numeric}
> {str:str}
> {str:bool}
> {str:null}
> But it's a great idea and I'll think about implementing it.
>
>
> Please don't top-post on the PostgreSQL lists. See
> 
> 
>
> I don't think a cast that doesn't cater for all the forms json can take is
> going to work very well. At the very least you would need to error out in
> cases you didn't want to cover, and have tests for all of those errors. But
> the above is only a tiny fraction of those. If the error cases are going to
> be so much more than the cases that work it seems a bit pointless.
>
>
> cheers
>
>
> andrew
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>


Support specify tablespace for each merged/split partition

2024-07-14 Thread Junwang Zhao
In [1], it is suggested that it might be a good idea to support
specifying the tablespace for each merged/split partition.

We can do the following after this feature is supported:

CREATE TABLESPACE tblspc LOCATION '/tmp/tblspc';
CREATE TABLE t (i int PRIMARY KEY) PARTITION BY RANGE (i);
CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);

ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2 TABLESPACE tblspc;

ALTER TABLE t SPLIT PARTITION tp_0_2 INTO
(PARTITION tp_0_1 FOR VALUES FROM (0) TO (1) TABLESPACE tblspc,
PARTITION tp_1_2 FOR VALUES FROM (1) TO (2));

[1] 
https://www.postgresql.org/message-id/abaf390b-3320-40a5-8815-ef476db5c...@oss.nttdata.com

-- 
Regards
Junwang Zhao


0001-support-specify-tablespace-for-each-merged-split.patch
Description: Binary data


Re: Optimize WindowAgg's use of tuplestores

2024-07-14 Thread Ashutosh Bapat
On Fri, Jul 12, 2024 at 11:59 AM Ashutosh Bapat
 wrote:
>
>
> Now that you are also seeing the slowdown with your earlier patch, I
> am wondering whether adding unlikely() by itself is a good
> optimization. There might be some other reason behind the perceived
> slowdown. How do the numbers look when you just add unlikely() without
> any other changes?

Out of curiosity, I measured the performance with just the "unlikely"
change and with the full patch. Below are the results
Testing with 100 partitions
latency average = 503.321 ms
latency average = 510.365 ms
latency average = 512.117 ms
Testing with 10 partitions
latency average = 371.764 ms
latency average = 361.202 ms
latency average = 360.529 ms
Testing with 1 partitions
latency average = 327.495 ms
latency average = 327.334 ms
latency average = 325.925 ms
Testing with 1000 partitions
latency average = 319.290 ms
latency average = 318.709 ms
latency average = 318.013 ms
Testing with 100 partitions
latency average = 317.756 ms
latency average = 318.933 ms
latency average = 316.529 ms
Testing with 10 partitions
latency average = 316.392 ms
latency average = 315.297 ms
latency average = 316.007 ms
Testing with 1 partitions
latency average = 330.978 ms
latency average = 330.529 ms
latency average = 333.538 ms

with just unlikely change
Testing with 100 partitions
latency average = 504.786 ms
latency average = 507.557 ms
latency average = 508.522 ms
Testing with 10 partitions
latency average = 316.345 ms
latency average = 315.496 ms
latency average = 326.503 ms
Testing with 1 partitions
latency average = 296.878 ms
latency average = 293.927 ms
latency average = 294.654 ms
Testing with 1000 partitions
latency average = 292.680 ms
latency average = 283.245 ms
latency average = 280.857 ms
Testing with 100 partitions
latency average = 292.569 ms
latency average = 296.330 ms
latency average = 295.389 ms
Testing with 10 partitions
latency average = 285.909 ms
latency average = 287.499 ms
latency average = 293.322 ms
Testing with 1 partitions
latency average = 305.080 ms
latency average = 309.100 ms
latency average = 307.794 ms

There's a noticeable change across all partition counts with just the
"unlikely" change. The improvement is smaller with a larger number of
partitions but quite visible with a smaller number of partitions.

full patch
Testing with 100 partitions
latency average = 356.026 ms
latency average = 375.280 ms
latency average = 374.575 ms
Testing with 10 partitions
latency average = 318.173 ms
latency average = 307.598 ms
latency average = 315.868 ms
Testing with 1 partitions
latency average = 295.541 ms
latency average = 313.317 ms
latency average = 299.936 ms
Testing with 1000 partitions
latency average = 295.082 ms
latency average = 305.204 ms
latency average = 294.702 ms
Testing with 100 partitions
latency average = 302.552 ms
latency average = 307.596 ms
latency average = 304.202 ms
Testing with 10 partitions
latency average = 295.050 ms
latency average = 291.127 ms
latency average = 299.704 ms
Testing with 1 partitions
latency average = 308.781 ms
latency average = 304.071 ms
latency average = 319.560 ms

There is significant improvement with a large number of partitions as
seen previously. But for a smaller number of partitions the
performance worsens, which needs some investigation.
--
Best Wishes,
Ashutosh Bapat




Re: Removing unneeded self joins

2024-07-14 Thread Andrei Lepikhov

On 7/15/24 12:31, jian he wrote:

hi.
Here is the latest patch (v6),
I've made the following changes.

* disallow the original Query->resultRelation from participating in SJE.
for SELECT, nothing is changed. for UPDATE/DELETE/MERGE
we can do:
EXPLAIN (COSTS OFF)
UPDATE sj sq SET b = sq.b + sz.a FROM (select s1.* from sj s1 join sj
s2 on s1.a = s2.a) as sz
WHERE sz.a = sq.a;

here, only "(select s1.* from sj s1 join sj s2 on s1.a = s2.a)" can
apply to SJE.

but for now we cannot apply SJE to
EXPLAIN (COSTS OFF)
UPDATE sj sq SET b = sq.b + sz.a FROM sj as sz WHERE sz.a = sq.a;

so the EPQ abnormality issue[1] won't happen.


* add a new function, ChangeVarNodesExtended, to address the concerns in [2]

I see you have kept the code line:
if (omark && imark && omark->markType != imark->markType)

It is definitely an error. What if omark is NULL, but imark is not? Why
not skip this pair of relids? Or, at least, insert an assertion to check
that you filtered it out earlier.


--
regards, Andrei Lepikhov





Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?

2024-07-14 Thread Fujii Masao




On 2024/07/15 14:00, Fujii Masao wrote:

Attached patch fixes unstable tests. Currently testing before pushing.


I pushed the patch at commit 4e5d6c4091, and some buildfarm animals
are back to green, but crake still reported an error. At first glance,
the error messages don't seem related to the recent patches.
I'll investigate further.

Waiting for replication conn standby_1's replay_lsn to pass 0/15428F78 on primary
[01:31:11.920](206.483s) # poll_query_until timed out executing this query:
# SELECT '0/15428F78' <= replay_lsn AND state = 'streaming'
#  FROM pg_catalog.pg_stat_replication
#  WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
timed out waiting for catchup at /home/andrew/bf/root/HEAD/pgsql/src/test/recovery/t/027_stream_regress.pl line 103.
# Postmaster PID for node "primary" is 99205

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: documentation structure

2024-07-14 Thread jian he
On Mon, Apr 29, 2024 at 1:17 PM Corey Huinker  wrote:
>>
>> I've split it into 7 patches.
>> each patch splits one  into a separate new file.
>
>
> Seems like a good start. Looking at the diffs of these, I wonder if we would 
> be better off with a func/ directory, each function gets its own file in that 
> dir, and either these files above include the individual files, or the 
> original func.sgml just becomes the organizer of all the functions. That 
> would allow us to do future reorganizations with minimal churn, make 
> validation of this patch a bit more straightforward, and make it easier for 
> future editors to find the function they need to edit.

Looking back, the patch is big and there is no convenient way to
review/validate it, so I created a Python script to automate the process.
We can review the Python script instead.
(I was just googling around; I know little about Python.)

* create new files for holding the content.
func-string.sgml
func-matching.sgml
func-datetime.sgml
func-json.sgml
func-aggregate.sgml
func-info.sgml
func-admin.sgml

* locate the parts that need to be copied into a newly created file,
based on line numbers.
The line-number pattern is mentioned here:
http://postgr.es/m/CACJufxEcMjjn-m6fpC2wXHsQbE5nyd%3Dxt6k-jDizBVUKK6O4KQ%40mail.gmail.com

* insert placeholder entity references in func.sgml:
&func-string;
&func-matching;
&func-datetime;
&func-json;
&func-aggregate;
&func-info;
&func-admin;

* copy the parts to new files.

* validate each newly created file (it must contain exactly 2 occurrences of "sect1").

* delete the parts from func.sgml, since they have already been copied to the new files.
sed --in-place "2408,4180d ; 5330,7760d ; 8942,11127d ; 15502,19436d ;
21567,22985d ; 24346,28017d ; 28020,30714d " func.sgml

* manually change doc/src/sgml/filelist.sgml:

+<!ENTITY func-string    SYSTEM "func-string.sgml">
+<!ENTITY func-matching  SYSTEM "func-matching.sgml">
+<!ENTITY func-datetime  SYSTEM "func-datetime.sgml">
+<!ENTITY func-json      SYSTEM "func-json.sgml">
+<!ENTITY func-aggregate SYSTEM "func-aggregate.sgml">
+<!ENTITY func-info      SYSTEM "func-info.sgml">
+<!ENTITY func-admin     SYSTEM "func-admin.sgml">
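The extract/validate/delete steps can be sketched end-to-end in a few lines of self-contained Python (a toy two-section file stands in for func.sgml; the real offsets are the sed ranges quoted earlier):

```python
import re

# Toy "func.sgml": two sect1 sections of five lines each.
lines = (
    ['<sect1 id="functions-string">\n']
    + [f"string content {i}\n" for i in range(3)]
    + ["</sect1>\n"]
    + ['<sect1 id="functions-matching">\n']
    + [f"matching content {i}\n" for i in range(3)]
    + ["</sect1>\n"]
)

def sect1_count(ls):
    # One opening and one closing tag per section, so a correctly
    # extracted section contains exactly 2 matches of "sect1".
    return sum(len(re.findall(r"sect1", line)) for line in ls)

# Extract lines 1..5 (1-based, inclusive), the same way "sed 'M,Nd'"
# addresses a range, then delete that range from the original.
begin, end = 1, 5
extracted = lines[begin - 1 : end]
remaining = lines[: begin - 1] + lines[end:]

assert sect1_count(extracted) == 2   # the validation rule
assert sect1_count(remaining) == 2   # the other section survives intact
```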



Two requirements:
1. manually change doc/src/sgml/filelist.sgml as mentioned before;
2. in the Python script, at line 35, I use
"os.chdir("/home/jian/Desktop/pg_src/src7/postgres/doc/src/sgml")";
you need to change it to your "doc/src/sgml" directory accordingly.
import subprocess
import os
import re
func_string_place_holder="&func-string;\n"
func_matching_place_holder="&func-matching;\n"
func_datetime_place_holder="&func-datetime;\n"
func_json_place_holder="&func-json;\n"
func_aggregate_place_holder="&func-aggregate;\n"
func_info_place_holder="&func-info;\n"
func_admin_place_holder="&func-admin;\n"

func_string="func-string.sgml"
func_matching="func-matching.sgml"
func_datetime="func-datetime.sgml"
func_json="func-json.sgml"
func_aggregate="func-aggregate.sgml"
func_info="func-info.sgml"
func_admin="func-admin.sgml"

func_string_line_begin_lineno = -1
func_string_line_end_lineno = -1
func_matching_begin_lineno = -1
func_matching_end_lineno = -1
func_datetime_begin_lineno = -1
func_datetime_end_lineno = -1
func_json_begin_lineno = -1
func_json_end_lineno = -1
func_aggregate_begin = -1
func_aggregate_end = -1
func_info_begin_lineno = -1
func_info_end_lineno = -1
func_admin_begin_lineno = -1
func_admin_end_lineno = -1

os.chdir("/home/jian/Desktop/pg_src/src7/postgres/doc/src/sgml")
target_file="func.sgml"
subprocess.call(["touch", func_string])
subprocess.call(["touch", func_matching])
subprocess.call(["touch", func_datetime])
subprocess.call(["touch", func_json])
subprocess.call(["touch", func_aggregate])
subprocess.call(["touch", func_info])
subprocess.call(["touch", func_admin])

def printall():
print(f'func_string_line_begin_lineno:{func_string_line_begin_lineno}')
print(f'func_string_line_end_lineno:{func_string_line_end_lineno}')
print(f'func_matching_begin_lineno:{func_matching_begin_lineno}')
print(f'func_matching_end_lineno:{func_matching_end_lineno}')
print(f'func_datetime_begin_lineno:{func_datetime_begin_lineno}')
print(f'func_datetime_end_lineno:{func_datetime_end_lineno}')
print(f'func_json_begin_lineno:{func_json_begin_lineno}')
print(f'func_json_end_lineno:{func_json_end_lineno}')
print(f'func_aggregate_begin:{func_aggregate_begin}')
print(f'func_aggregate_end:{func_aggregate_end}')
print(f'func_info_begin_lineno:{func_info_begin_lineno}')
print(f'func_info_end_lineno:{func_info_end_lineno}')
print(f'func_admin_begin_lineno:{func_admin_begin_lineno}')
print(f'func_admin_end_lineno:{func_admin_end_lineno}')

def get_line_number(file_name: str):
global func_string_line_begin_lineno
global func_string_line_end_lineno
global func_matching_begin_lineno
global func_matching_end_lineno
global func_datetime_begin_lineno
global func_datetime_end_lineno
global func_json_begin_lineno
global func_json_end_lineno
global func_aggregate_begin
global func_aggregate_end
global func_info_begin_lineno
global func_info_end_lineno
global func_admin_begin_lineno
global func_admin_end_lineno
with open(file_name, 'r+') as f:
for i, line in enumerate(f, 1):
if r'' in line:
func_string_line_begin_lineno = i
elif r'' in line:
   

Re: speed up a logical replica setup

2024-07-14 Thread Amit Kapila
On Fri, Jul 12, 2024 at 4:54 AM Euler Taveira  wrote:
>
> On Thu, Jul 11, 2024, at 2:00 PM, Alexander Lakhin wrote:
>
> May I ask you to look at another failure of the test that occurred today [1]?
>
>
> Thanks for the report!
>
> You are observing the same issue that Amit explained in [1]. The
> pg_create_logical_replication_slot returns the EndRecPtr (see
> slot->data.confirmed_flush in DecodingContextFindStartpoint()). EndRecPtr
> points to the next record, which is a future position on an idle server.
> That's why the recovery takes some time to finish: it is waiting for
> activity to advance the LSN position. Since you modified
> LOG_SNAPSHOT_INTERVAL_MS to create additional WAL records sooner, the
> EndRecPtr position is reached rapidly and the recovery ends quickly.
>
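The explanation above — confirmed_flush being set to a position that no record occupies yet on an idle server — can be illustrated with a toy model (small integers stand in for WAL byte positions; all values are hypothetical):

```python
# Start LSNs of the WAL records that exist so far.
wal_records = [100, 116, 124]
# EndRecPtr: the position right *after* the last record, i.e. where
# the next record will begin.  This is what confirmed_flush receives.
end_rec_ptr = 140

def recovery_done(records, target_lsn):
    # A recovery target LSN is satisfied only once a record at or
    # beyond the target has been replayed.
    return any(lsn >= target_lsn for lsn in records)

# Idle server: no record exists at end_rec_ptr yet, so recovery waits.
assert not recovery_done(wal_records, end_rec_ptr)

# Any new activity (e.g. a periodic standby-snapshot record) begins
# exactly at end_rec_ptr and unblocks the recovery.
wal_records.append(end_rec_ptr)
assert recovery_done(wal_records, end_rec_ptr)
```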

If the recovery ends quickly (which is expected due to the reduced
LOG_SNAPSHOT_INTERVAL_MS), then why do we see "error: recovery timed
out"?

> Hayato proposes a patch [2] to create an additional WAL record that has the
> same effect as your little hack: increase the LSN position to allow the
> recovery to finish sooner. I don't like that solution, although it seems
> simple to implement. As Amit said, if we knew the ReadRecPtr, we could use
> it as the consistent LSN. The problem is that it is used by logical decoding
> but is not exposed. [reading the code...] When the logical replication slot
> is created, restart_lsn points to the lastReplayedEndRecPtr (see
> ReplicationSlotReserveWal()), which is the last record replayed.
>

The 'lastReplayedEndRecPtr' would be the value of restart_lsn on a
standby (when RecoveryInProgress is true), but here we are creating
slots on the publisher/primary, so shouldn't restart_lsn point to the
"latest WAL insert pointer"?

-- 
With Regards,
Amit Kapila.