Re: Question about VACUUM behavior with sub-transactions in stored procedures
Hello everyone,

I'd like to revisit the topic of autovacuum's interaction with stored procedures that perform transactions, with a more technical clarification as suggested earlier. Let's consider the behavior of VACUUM and system table updates after transaction commits in procedures that frequently open and commit transactions.

As I understand it, the statistics updates in PostgreSQL, which autovacuum later relies on, are performed in pgstat_report_stat, called within db/src/backend/tcop/postgres.c in the PostgresMain function. Specifically:

    stats_timeout = pgstat_report_stat(false);
    if (stats_timeout > 0)
    {
        if (!get_timeout_active(IDLE_STATS_UPDATE_TIMEOUT))
            enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT, stats_timeout);
    }
    else
    {
        /* all stats flushed, no need for the timeout */
        if (get_timeout_active(IDLE_STATS_UPDATE_TIMEOUT))
            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
    }

Inside procedures, when _SPI_commit is called in db/src/backend/executor/spi.c, the main command responsible for completing a transaction is CommitTransactionCommand().

My questions are the following:

1. Is it expected behavior that system table updates are deferred until all nested transactions are complete? This would mean that autovacuum might not account for dead tuples accumulated during procedure execution until the entire main transaction is finished.
2. Is it possible or advisable to call pgstat_report_stat after each CommitTransactionCommand() within procedures so that autovacuum can track intermediate changes and prevent an accumulation of dead rows?
3. To what extent would this approach be viable in terms of performance and correctness?

I look forward to any insights and advice you can offer on this matter.

Best regards, Vyacheslav Kirillov

From: David G. Johnston
Sent: October 21, 2024, 16:55
To: Кириллов Вячеслав
Cc: pgsql-hack...@postgresql.org
Subject: Re: Question about VACUUM behavior with sub-transactions in stored procedures

On Monday, October 21, 2024, Кириллов Вячеслав <vkiril...@diasoft.ru> wrote:
> I have a question regarding the behavior of the auto VACUUM in PostgreSQL in the context of using stored procedures with sub-transactions.

This is a general usage inquiry not suited to discussion on -hackers. We have a -general mailing list to discuss how to use the product. This list is for discussing patches.

> Here is the scenario: we have several stored procedures that modify or update table data. These procedures use sub-transactions, which are committed via COMMIT.

This isn't how sub-transactions work. They are created mainly by save points and are not independently committed (by the user in SQL). What you are using are full transactions. https://www.postgresql.org/docs/17/plpgsql-transactions.html

David J.
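For anyone who wants to reproduce the workload being described, here is a minimal sketch; the table and procedure names are hypothetical, it only reproduces the pattern of many commits inside one procedure call, and it makes no claim about when the statistics actually become visible:

```
-- Hypothetical workload: a procedure that updates rows and commits
-- repeatedly within a single CALL. Each COMMIT below goes through
-- _SPI_commit / CommitTransactionCommand() as discussed above.
CREATE TABLE IF NOT EXISTS churn_tab (id int PRIMARY KEY, payload int);
INSERT INTO churn_tab SELECT g, 0 FROM generate_series(1, 10000) g
ON CONFLICT DO NOTHING;

CREATE OR REPLACE PROCEDURE churn(iterations int)
LANGUAGE plpgsql
AS $$
BEGIN
    FOR i IN 1..iterations LOOP
        UPDATE churn_tab SET payload = payload + 1;  -- creates dead tuples
        COMMIT;                                      -- intra-procedure commit
    END LOOP;
END;
$$;

CALL churn(50);
-- While churn() runs, watch n_dead_tup for churn_tab from another session:
--   SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'churn_tab';
```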
Re: Count and log pages set all-frozen by vacuum
Hi Melanie

On Wed, 30 Oct 2024 at 21:42, Melanie Plageman wrote:
> ...
> The number of pages set all-frozen in the visibility map by a given
> vacuum can be useful when analyzing which normal vacuums reduce the
> number of pages future aggressive vacuum need to scan.
>
> Also, empty pages that are set all-frozen in the VM don't show up in
> the count of pages with newly frozen tuples. When making sense of the
> result of visibilitymap_summary() for a relation, it's helpful to know
> how many pages were set all-frozen in the VM by each vacuum.
>

I agree that this data would be useful for analysing the impact of vacuum operations. The values returned in the case where pages are removed (cases where the empty pages are at the end of the table) are a bit confusing.

Here is an example similar to yours, but with a normal vacuum operation, since that seems to be the most useful case for this feature:

CREATE TABLE the_table (first int, second int) WITH (autovacuum_enabled = false);
INSERT INTO the_table SELECT generate_series(1,1000), 1;
DELETE FROM the_table WHERE first > 50;
VACUUM (VERBOSE) the_table;

pages: 4 removed, 1 remain, 5 scanned (100.00% of total)
tuples: 950 removed, 50 remain, 0 are dead but not yet removable
removable cutoff: 763, which was 1 XIDs old when operation ended
new relfrozenxid: 761, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen.
4 pages set all-frozen in the VM

It looks like 4 pages out of 1 are set all-frozen. So there are -3 to scan for the next aggressive vacuum? The four pages which were set to all-frozen are the same four which have been removed from the end of the table.

For an example where the empty pages which are marked all-frozen are at the start of the table (deleting values < 950 in the example), the results are more intuitive:

pages: 0 removed, 5 remain, 5 scanned (100.00% of total)
tuples: 949 removed, 51 remain, 0 are dead but not yet removable
removable cutoff: 768, which was 0 XIDs old when operation ended
new relfrozenxid: 766, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen.
4 pages set all-frozen in the VM

Can the pages which are removed from the end of the table not be counted towards those set all-frozen, to make the arithmetic a bit more intuitive for this edge case?

Thanks
Alastair
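As a side note, the visibility map state of the pages that remain after truncation can be inspected directly with the pg_visibility contrib module, which makes it easier to see which surviving pages actually carry the all-frozen bit (a sketch using the example table above):

```
-- Requires the pg_visibility contrib module.
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- Summary counts of all-visible / all-frozen pages for the relation.
SELECT * FROM pg_visibility_map_summary('the_table');

-- Per-page view: only pages still present in the relation are listed, so
-- pages truncated away from the end of the table no longer show up here.
SELECT blkno, all_visible, all_frozen
FROM pg_visibility_map('the_table');
```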
Re: Index AM API cleanup
> On Oct 30, 2024, at 12:54 PM, Peter Eisentraut wrote: > > So this is the idea. To take a step back, I can see the following > options: > > 1. Require all index AMs to use btree-compatible strategy numbers. > (Previously rejected, too much upgrading mess.) > > 2. Provide mapping between index AM strategy numbers and btree strategy > numbers. (This doesn't have a space for non-btree operations like > overlaps.) I agree that neither of these options are good, for the reasons you mention. > 3. Provide mapping between index AM strategy numbers and the existing > RT*StrategyNumber numbers. (We can expand the set as we want.) > (This is what the existing mapping function for gist does.) The point of such a mapping is that core code outside any index AM can know what an AM is claiming it can do when it advertises that it implements one of these strategy numbers. We don't need any new entries in that mapping until some core feature depends on the semantics of the new entry. Right now, all six of the btree comparators (including not-equal) have semantics that are used outside AMs by functions like match_clause_to_ordering_op(). If any index AM author comes along and creates an index AM which purports to provide these six semantics but actually does something semantically inconsistent with what the backend thinks these mean, that index AM is totally at fault when, for example, ORDER BY returns the wrong results. On the other hand, if we add RTOverlapStrategyNumber to the common framework of strategy numbers, without anything outside an index AM depending on that, how is an index AM author to know exactly how an "overlaps" operator is supposed to behave? No doubt, brin, gist, spgist, and friends all have their own understanding of what RTOverlapStrategyNumber means, but how is a new index AM supposed to know if it has analogized that concept correctly to its own operator? And if several major versions later, you come along to create some feature, let's say a logical replication feature depending on "overlaps" semantics, how are you to know whether all the index AMs in the wild which advertise they provide an "overlaps" operator will work correctly with your new feature? When logical replication breaks, who is at fault? Perversely, knowing that RTOverlapStrategyNumber is already in the list, you would likely implement the new logical replication feature on some new strategy number, perhaps naming it RTLogicalOverlapStrategyNumber, to avoid such conflicts. The RT*StrategyNumber list is much too long, containing many such problematic entries. We should not, in my opinion, add these to the list prior to some new feature which depends on them, such as a planner or executor optimization. > 4. Provide mapping between index AM strategy numbers and some > completely new set of numbers/symbols. This is fine, if the new set is sufficiently restricted. However, as mentioned below, the set of sufficiently restricted values is identical to what we currently define as a RowCompareType. It creates needless code churn to throw that away and replace it with a new name. > 5. Provide mapping between index AM strategy numbers and > RowCompareType (with some small extensions). This is what this > patch does. As the patch author, obviously this is the one I chose. The "small extensions" are just to handle "no such value" type cases. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Segfault in jit tuple deforming on arm64 due to LLVM issue
On Thu, Oct 31, 2024 at 6:49 AM Thomas Munro wrote: > There are a couple of cases of dual-licensed code in our tree where we > explicitly used the Boost alternative instead of Apache 2. I plead > complete ignorance of this topic and defer to those who know about > such things: can we actually do this? I guess at a minimum a copy of > the licence would need to appear somewhere -- perhaps under > src/backend/jit/llvm? I'm also not super knowledgeable about the licensing intricacies but I read it the same way - a license file has to be provided due to the 4a clause. llvmlite did this when they added the patched memory manager[1] > 4d says that if you modified the code you have > to say so prominently, but I did that at the top (and the changes are > completely trivial, just some #ifdef swizzling to massage some > function prototypes to suit older LLVMs). Otherwise I understand it > to be generally "BSD-like" (sans advert clause) but there is also some > stuff about patents, which surely aren't relevant to this in > practice... but I know that some projects object to it on principle > and because it smells like contract law, or something not an area > I am well informed about. Who should I be asking? (Naively, I > wondered: could there be some kind of fair use concept for > back-patching fixes to broken libraries that you're merely a user of > where you can be excused from the burdens of a distributor? Yeah > wishful thinking I'm sure.) You mean 4b, right? LLVM doesn't seem to have any NOTICE files so the 4d clause shouldn't apply. The top comment looks fine to notify the source of the modified file and how it was changed. But again, I don't have much experience in this so I can't be sure. [1]: https://github.com/numba/llvmlite/pull/1009/files#diff-80b149f35cebd583e21dfc49c0007a7fab89c3c6d07c028e4a87de0848aa2ed8
Re: Test to dump and restore objects left behind by regression
On Mon, Sep 09, 2024 at 03:43:58PM +0530, Ashutosh Bapat wrote: > 894be11adfa60ad1ce5f74534cf5f04e66d51c30 changed the schema in which > objects in test genereated_stored.sql are created. Because of this the > new test added by the patch was failing. Fixed the failure in the > attached. On my laptop, testing the plain format adds roughly 12s, in a test that now takes 1m20s to run vs 1m32s. Enabling regress_dump_formats and adding three more formats counts for 45s of runtime. For a test that usually shows up as the last one to finish for a heavily parallelized run. So even the default of "plain" is going to be noticeable, I am afraid. + test_regression_dump_restore($oldnode, %node_params); Why is this only done for the main regression test suite? Perhaps it could be useful as well for tests that want to check after their own custom dumps, as a shortcut? Linked to that. Could there be some use in being able to pass down a list of databases to this routine, rather than being limited only to "regression"? Think extension databases with USE_MODULE_DB that have unique names. + # Dump the original database in "plain" format for comparison later. The + # order of columns in COPY statements dumped from the original database and [...] + # Adjust the CREATE TABLE ... INHERITS statements. + if ($original) + { + $dump =~ s/(^CREATE\sTABLE\sgenerated_stored_tests\.gtestxx_4\s\() + (\n\s+b\sinteger), + (\n\s+a\sinteger)/$1$3,$2/mgx; The reason why $original exists is documented partially in both 002_pg_upgrade.pl and AdjustDump.pm. It would be better to consolidate that only in AdjustDump.pm, I guess. Isn't the name "$original" a bit too general when it comes to applying filters to the dumps to as the order of the column re-dumped is under control? Perhaps it would be adapted to use a hash that can be extended with more than one parameter to control which filters are applied? For example, imagine a %params where the caller of adjust_dumpfile() can pass in a "filter_column_order => 1". The filters applied to the dump are then self-documented. We could do with a second for the whitespaces, as well. What's the advantage of testing all the formats? Would that stuff have been able to catch up more issues related to specific format(s) when it came to the compression improvements with inheritance? I'm wondering if it would make sense to also externalize the dump comparison routine currently in the pg_upgrade script. Perhaps we should be more ambitious and move more logic into AdjustDump.pm? If we think that the full cycle of dump -> restore -> dump -> compare could be used elsewhere, this would limit the footprint of what we are doing in the pg_upgrade script in this patch and be able to do similar stuff in out-of-core extensions or other tests. Let's say a PostgreSQL::Test::Dump.pm? -- Michael signature.asc Description: PGP signature
Re: Pgoutput not capturing the generated columns
On Thu, Oct 31, 2024 at 9:55 PM Ajin Cherian wrote: > I ran some tests and verified that the patch works with previous versions > of PG12 and PG17 > 1. Verified with publications with generated columns and without generated > columns on patched code and subscriptions on PG12 and PG17 > Observations: > a. If publication is created with publish_generated_columns=true or > with generated columns mentioned explicitly, then tablesync will not copy > generated columns but post tablesync the generated columns are replicated > b. Column list override (publish_generated_columns=false) behaviour > > These seem expected. > > Currently the documentation does not talk about this behaviour; I suggest this be added, similar to how such a behaviour was documented when the original row-filter version was committed. Suggestion: "If a subscriber is a pre-18 version, the initial table synchronization won't publish generated columns even if they are defined in the publisher." regards, Ajin Cherian Fujitsu Australia
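For context, here is a sketch of the kind of setup being tested above; the publish_generated_columns option belongs to the patch under discussion rather than to any released version, and the object names are made up:

```
-- Publisher (patched build): a table with a stored generated column,
-- published with the option under discussion.
CREATE TABLE tab_gen (a int PRIMARY KEY,
                      b int GENERATED ALWAYS AS (a * 2) STORED);
CREATE PUBLICATION pub_gen FOR TABLE tab_gen
    WITH (publish_generated_columns = true);

-- Per the observations above, when the subscriber is a pre-18 version
-- (e.g. PG12 or PG17), the initial table synchronization does not copy
-- the generated column, while changes replicated after the sync do
-- include it.
```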
Time to add a Git .mailmap?
When looking at our Git tree for a recent conference presentation I happened to notice that we have recently gained duplicate names in the shortlog. Not sure if we care enough to fix that with a .mailmap, but if we do, the attached diff makes sure that all commits are attributed to a single committer entry. -- Daniel Gustafsson mailmap.diff Description: Binary data
Re: Count and log pages set all-frozen by vacuum
On Wed, Oct 30, 2024 at 2:42 PM Melanie Plageman wrote: > > Vacuum currently counts and logs the number of pages of a relation > with newly frozen tuples. It does not count the number of pages newly > set all-frozen in the visibility map. > > The number of pages set all-frozen in the visibility map by a given > vacuum can be useful when analyzing which normal vacuums reduce the > number of pages future aggressive vacuum need to scan. +1 Some small suggestions: + vmbits = visibilitymap_set(vacrel->rel, blkno, buf, + InvalidXLogRecPtr, + vmbuffer, InvalidTransactionId, + VISIBILITYMAP_ALL_VISIBLE | + VISIBILITYMAP_ALL_FROZEN); How about using "old_vmbits" or something along the same lines to make it clear that this value has the bits before visibilitymap_set()? --- + /* If it wasn't all-frozen before, count it */ + if (!(vmbits & VISIBILITYMAP_ALL_FROZEN)) To keep consistent with surrounding codes in lazyvacuum.c, I think we can write "if ((vmbits & VISIBILITYMAP_ALL_FROZEN) == 0)" instead. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: UUID v7
On Sat, Oct 26, 2024 at 10:05 AM Andrey M. Borodin wrote: > > > > > --- > > - oid | proname | oid | proname > > --+-+-+- > > -(0 rows) > > + oid | proname | oid | proname > > +--+-+--+- > > + 9896 | uuidv7 | 9897 | uuidv7 > > +(1 row) > > > > I think that we need to change these functions so that this check > > query doesn't return anything, no? > > We have 4 options: > 0. Remove uuidv7(interval). But it brings imporatne functionality to the > table: we can avoid contention points while massively insert data. > 1. Give different names to uuidv7() and uuidv7(interval). > 2. Allow importing pg_node_tree (see v7 of the patch) > 3. Change this query. Comment to this query suggest that it checks for > exactly this case: same function is declared with different number of > arguments. > > IMO approach number 3 is best. However, I do not understand why this query > check was introduced in the first place. Maybe, there are string arguments > why we should not do same-named functions with different number of arguments. > I think we typically avoid this kind of check failure by assigning uuidv7() and uuidv7(interval) different C functions that call the common function. That is, we have pg_proc entries like: { oid => '9896', descr => 'generate UUID version 7', proname => 'uuidv7', proleakproof => 't', provolatile => 'v', prorettype => 'uuid', proargtypes => '', prosrc => 'uuidv7' }, { oid => '9897', descr => 'generate UUID version 7 with a timestamp shifted on specific interval', proname => 'uuidv7', proleakproof => 't', provolatile => 'v', prorettype => 'uuid', proargtypes => 'interval', prosrc => 'uuidv7_interval' }, And then have C codes like: static Datum generate_uuidv7(FunctionCallInfo fcinfo) { static uint64 previous_ns = 0;: : PG_RETURN_UUID_P(uuid); } Datum uuidv7(PG_FUNCTION_ARGS) { return generate_uuidv7(fcinfo); } Datum uuidv7_interval(PG_FUNCTION_ARGS) { return generate_uuidv7(fcinfo); } > > > > --- > > + if (version == 6) > > + { > > + tms = ((uint64) uuid->data[0]) << 52; > > + tms += ((uint64) uuid->data[1]) << 44; > > + tms += ((uint64) uuid->data[2]) << 36; > > + tms += ((uint64) uuid->data[3]) << 28; > > + tms += ((uint64) uuid->data[4]) << 20; > > + tms += ((uint64) uuid->data[5]) << 12; > > + tms += (((uint64) uuid->data[6]) & 0xf) << 8; > > + tms += ((uint64) uuid->data[7]); > > + > > + /* convert 100-ns intervals to us, then adjust */ > > + ts = (TimestampTz) (tms / 10) - > > + ((uint64) POSTGRES_EPOCH_JDATE - GREGORIAN_EPOCH_JDATE) * > > SECS_PER_DAY * USECS_PER_SEC; > > + > > + PG_RETURN_TIMESTAMPTZ(ts); > > + } > > > > It's odd to me that only uuid_extract_timestamp() supports UUID v6 in > > spite of not supporting UUID v6 generation. I think it makes more > > sense to support UUID v6 generation as well, if the need for it is > > high. > > RFC urges to use UUIDv7 instead of UUIDv6 when possible. I'm fine with > providing implementation, it's trivial. PFA patch with implementation. > My point is that we should either support full functionality for UUIDv6 (such as generation and extraction) or none of them. I'm not really sure we want UUIDv6 as well, but if we want it, it should be implemented in a separate patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
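For reference, the user-visible behaviour these catalog entries are meant to provide is plain overloading by argument count; a sketch of the intended usage, based on the patch as posted (uuidv7() is not in any released version, while uuid_extract_timestamp() already exists):

```
-- Generate a version-7 UUID for the current time.
SELECT uuidv7();

-- Generate a version-7 UUID with the timestamp shifted by an interval,
-- e.g. to spread values and avoid contention on the same index page.
SELECT uuidv7(INTERVAL '1 hour');

-- The embedded timestamp can be recovered with the existing extraction
-- function.
SELECT uuid_extract_timestamp(uuidv7());
```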
Re: MultiXact\SLRU buffers configuration
On Thu, 31 Oct 2024 at 10:47, Andrey M. Borodin wrote: > > > > > On 29 Oct 2024, at 21:45, Thom Brown wrote: > > > > It clearly checks for interrupts, but when I saw this issue happen, it > > wasn't interruptible. > > Perhaps, that was different multixacts? > When I was observing this pathology on Standby, it was a stream of different > reads encountering different multis. > > Either way startup can cancel locking process on it's own. Or is it the case > that cancel was not enough, did you actually need termination, not cancel? Termination didn't work on either of the processes. Thom
Re: Using read stream in autoprewarm
> On 8 Aug 2024, at 11:32, Nazir Bilal Yavuz wrote: > > Any feedback would be appreciated. I've taken a look at the patch. It seems to me that you add new block numbers to the read stream as long as you have buffers. So when there are no more buffers you will still have some queued blocks. Maybe you can change the logic so that the number of free buffers must be enough to allocate all blocks in the look-ahead distance? Thanks! Best regards, Andrey Borodin.
Re: [PATCH] Improve code coverage of network address functions
Hi Stepan, > Hello Aleksander! I'm a beginner and I would like to see and try your patch. > However, I have never worked with coverage in regression tests for > PostgreSQL. Could you please tell me how it works and help me understand the > process? Thanks! You are going to need some Linux distribution, GCC stack and lcov 1.16 [1]. (Lcov doesn't work with Clang; We seem to experience some problems with lcov 2.0+, this is being investigated [2]) Apply the patch as usual ( git clone git://git.postgresql.org/git/postgresql.git ; git am ... ) and run: ``` git clean -dfx && meson setup --buildtype debug -DPG_TEST_EXTRA="kerberos ldap ssl" -Db_coverage=true -Dldap=disabled -Dssl=openssl -Dcassert=true -Dtap_tests=enabled -Dprefix=/home/eax/pginstall build && ninja -C build && PG_TEST_EXTRA=1 meson test -C build && ninja -C build coverage-html ``` Note the `-Db_coverage=true` and `ninja -C build coverage-html` parts. Change `prefix` to an appropriate one. This will take longer than usual. Your coverage report will be located in build/meson-logs/coveragereport. You can compare the reports with and without the patch to ensure that the patch improves code coverage. [1]: https://github.com/linux-test-project/lcov/releases/tag/v1.16 [2]: https://postgr.es/m/CAJ7c6TN%2BMCh99EZ8YGhXZAdnqvNQYir6E34B_mmcB5KsxCB00A%40mail.gmail.com -- Best regards, Aleksander Alekseev
Re: "command cannot affect row a second time" in INSERT ... ON CONFLICT
Hi Karthik, > I am looking to better understand the applicability of the error message > "command cannot affect row a second time". > > Consider the following table and data: > CREATE TABLE ioc (i int, UNIQUE(i)); > INSERT INTO ioc VALUES (1); > > The following two queries produce different errors: > Query 1 > postgres=# INSERT INTO ioc VALUES (1), (20) ON CONFLICT (i) DO UPDATE SET i = > 20; > ERROR: 21000: ON CONFLICT DO UPDATE command cannot affect row a second time > HINT: Ensure that no rows proposed for insertion within the same command > have duplicate constrained values. > > Query 2 > postgres=# INSERT INTO ioc VALUES (20), (1) ON CONFLICT (i) DO UPDATE SET i = > 20; > ERROR: 23505: duplicate key value violates unique constraint "ioc_i_key" > DETAIL: Key (i)=(20) already exists. Not sure if it will answer your question *entirely* but you will find a bit more detail about "cannot affect row a second time" in the discussion [1]. This error has nothing to do with unique constraints, so I think you trigger one of two errors depending on the order of inserted rows and the content of your table. This being said, I didn't investigate your scenario in much detail. [1]: https://www.postgresql.org/message-id/flat/CAJ7c6TPQJNFETz9H_qPpA3x7ybz2D1QMDtBku_iK33gT3UR34Q%40mail.gmail.com -- Best regards, Aleksander Alekseev
Re: WIP: parallel GiST index builds
> On 8 Oct 2024, at 17:05, Tomas Vondra wrote:
>
> Here's an updated patch adding the queryid.

I've taken another round of looking through the patch. Everything I see seems correct to me.

It just occurred to me that we will have: buffered build, parallel build, sorting build. All 3 aiming to speed things up. I really hope that we will find a way to parallelize the sorting build, because it will be the fastest for sure.

Currently we have some instances of such code... or similar... or related code:

    if (is_build)
    {
        if (is_parallel)
            recptr = GetFakeLSNForUnloggedRel();
        else
            recptr = GistBuildLSN;
    }
    else
    {
        if (RelationNeedsWAL(rel))
        {
            recptr = actuallyWriteWAL();
        }
        else
            recptr = gistGetFakeLSN(rel);
    }
    // Use recptr

In a previous review I was a proponent of not adding arguments to gistGetFakeLSN(). But now I see that the logic of choosing the LSN source is sprawling across various places... Now I do not have a strong point of view on this. Do you think something like the following would be clearer?

    if (!is_build && RelationNeedsWAL(rel))
    {
        recptr = actuallyWriteWAL();
    }
    else
        recptr = gistGetFakeLSN(rel, is_build, is_parallel);

Just as an idea. I'm mostly looking at the GiST-specific parts of the patch, while the things around entering parallel mode seem much more complicated. But as far as I can see, large portions of this code are taken from B-tree/BRIN.

Best regards, Andrey Borodin.
Re: [PATCH] Improve code coverage of network address functions
Hello Aleksander! I'm a beginner and I would like to see and try your patch. However, I have never worked with coverage in regression tests for PostgreSQL. Could you please tell me how it works and help me understand the process? Thanks! Best Regards, Stepan Neretin!
Re: RFC: Extension Packaging & Lookup
Fellow Humans,

I'm working on an updated proposal that is more detailed and more comprehensive. But I keep getting a bit caught up on this bit:

On Oct 28, 2024, at 18:19, David E. Wheeler wrote:
>> * Binary-only extensions might also be installed here; the difference is
>> they have no control file. The LOAD command and shared_preload_libraries
>> would need to know to look here, too.
>
> Or perhaps we should require a control file for these, too, but add a "type"
> key or some such? Maybe such a shared module could be supported by CREATE
> EXTENSION as well, but not include SQL files?

I'm trying to imagine how this ought to work. The challenge is that, with the layout I propose here, shared module files will no longer always be in `$dynamic_library_path`, but in any `$extension/pkglib` subdirectory of each subdirectory of `extension_path`, as well. Is that desirable?

Let's say we want to load a module named "semver" that's included in the semver extension. With the proposed organization up-thread, the module would be installed in:

```
$extdir_user/semver/pkglib/semver.(so|dylib|dll|etc)
```

What should be passed to preload/LOAD to load it? A few options:

Option 1

* Specify the module name "semver" in the `LOAD` command or in `*_preload_libraries` (same as in 17 and earlier)
* Teach the preload/LOAD code to search for the module file in `*/pkglib/` under each extension path

Pros:

* Follows the existing module name specification in preload/LOAD

Cons:

* Potentially huge number of directories to search, when lots of extensions are installed.
* Depending on search order, the wrong module may be loaded if two extensions have a module file with the same name

Option 2

* Specify the module name to include the extension name. Perhaps something like `$extension:$module`.
* Teach the preload/LOAD code to detect the extension name as part of the command and only look for the DSO in that extension's `pkglibdir`.

Pros:

* Searches at most the list of directories in the `extension_path`.
* No conflicts with any other module files from other extensions.

Cons:

* Overloads the meaning of preload/LOAD strings, which might be confusing to some.
* Upgrades might need these values to change from the old to the new syntax.

Other Options?
--

I kind of like Option 2, as it would allow us to eventually support non-`CREATE EXTENSION` modules as extensions, too. I imagine distributing, say, `auto_explain` in an extension directory of its own, with an `auto_explain.control` file that identifies it as a LOAD-only extension. Then specifying `auto_explain:auto_explain` would work as expected. Or perhaps just `auto_explain:` could load *all* the modules included in the auto_explain "extension".

But then maybe I'm starting to talk myself into arguing that `LOAD` ought to be deprecated, `CREATE EXTENSION` could support non-SQL extensions, and the `*preload*` GUCs would contain a list of extensions rather than module files. But I digress.

Any ideas about other options to address this design challenge?

Thanks, David
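To make Option 2 concrete, here is what the proposed spelling might look like from SQL; none of this exists today, it only illustrates the `$extension:$module` idea under the hypothetical "semver" layout above:

```
-- Hypothetical Option 2 syntax: qualify the module name with the extension
-- that ships it, so only that extension's pkglib directory is searched.
LOAD 'semver:semver';

-- The same spelling could then be used in the preload GUCs, e.g.:
--   shared_preload_libraries = 'auto_explain:auto_explain'
```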
Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description
On Thu, Oct 24, 2024 at 08:32:54AM +1100, Peter Smith wrote: > Hi. Here are a couple of minor comments. > > 1. > +The time slot synchronization (see +linkend="logicaldecoding-replication-slots-synchronization"/>) > +was most recently stopped. > > /The time slot/The time when slot/ > > ~~~ > > 2. > - /* The time since the slot has become inactive */ > + /* The time slot sychronized was stopped. */ > > > Maybe just make this comment the same as the sentence used in the docs: > - e.g. add "when"; fix, typo "sychronized", etc. > > > /The time slot sychronized was stopped./The time when slot > synchronization was most recently stopped./ Yes, all good suggestions, updated patch attached. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?" diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml index 61d28e701f2..d853f7fe49b 100644 --- a/doc/src/sgml/system-views.sgml +++ b/doc/src/sgml/system-views.sgml @@ -2435,7 +2435,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx active bool - True if this slot is currently actively being used + True if this slot is currently being streamed @@ -2444,9 +2444,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx active_pid int4 - The process ID of the session using this slot if the slot - is currently actively being used. NULL if - inactive. + The process ID of the session streaming data for this slot. + NULL if inactive. @@ -2566,15 +2565,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx inactive_since timestamptz -The time since the slot has become inactive. -NULL if the slot is currently being used. -Note that for slots on the standby that are being synced from a -primary server (whose synced field is -true), the -inactive_since indicates the last -synchronization (see -) -time. +The time when slot synchronization (see ) +was most recently stopped. NULL if the slot +has always been synchronized. This is useful for slots on the +standby that are being synced from a primary server (whose +synced field is true) +so they know when the slot stopped being synchronized. diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c index d62186a5107..f4f80b23129 100644 --- a/src/backend/replication/logical/slotsync.c +++ b/src/backend/replication/logical/slotsync.c @@ -1515,7 +1515,7 @@ update_synced_slots_inactive_since(void) * correctly interpret the inactive_since if the standby gets promoted * without a restart. We don't want the slots to appear inactive for a * long time after promotion if they haven't been synchronized recently. - * Whoever acquires the slot i.e.makes the slot active will reset it. + * Whoever acquires the slot, i.e., makes the slot active, will reset it. */ if (!StandbyMode) return; diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 45582cf9d89..90ea2979b87 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -205,7 +205,7 @@ typedef struct ReplicationSlot */ XLogRecPtr last_saved_confirmed_flush; - /* The time since the slot has become inactive */ + /* The time when slot synchronization was most recently stopped. */ TimestampTz inactive_since; } ReplicationSlot;
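For anyone verifying the behaviour being documented, the relevant columns can be inspected on the standby with a query along these lines; the interpretation of inactive_since for synced slots is exactly what the wording above tries to pin down:

```
-- On a standby with slot synchronization enabled, show which slots are
-- synced from the primary and when synchronization last stopped.
SELECT slot_name, active, active_pid, synced, inactive_since
FROM pg_replication_slots;
```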
Re: "command cannot affect row a second time" in INSERT ... ON CONFLICT
On Thu, Oct 31, 2024 at 9:52 AM Karthik Ramanathan < karthikram.3...@gmail.com> wrote: > I am looking to better understand the applicability of the error message > "command cannot affect row a second time". > > Consider the following table and data: > CREATE TABLE ioc (i int, UNIQUE(i)); > INSERT INTO ioc VALUES (1); > > The following two queries produce different errors: > *Query 1* > postgres=# INSERT INTO ioc VALUES (1), (20) ON CONFLICT (i) DO UPDATE SET > i = 20; > ERROR: 21000: ON CONFLICT DO UPDATE command cannot affect row a second > time > HINT: Ensure that no rows proposed for insertion within the same command > have duplicate constrained values. > Right, id 1 exists, you insert id 1 again, that row becomes id 20, then you attempt to insert id 20 again, which conflicts, and the system attempts to update the 1-become-20 row to 20 but fails to perform the update since that now-existing row was already modified in this statement (it was inserted). You don't get a duplicate key error because the second modification condition is more general and thus triggers first. I.e., that error has to happen regardless of whether a duplicate key error condition was going to happen or not (e.g., you could have done something like "set i = i * 20" - not tested) > *Query 2* > postgres=# INSERT INTO ioc VALUES (20), (1) ON CONFLICT (i) DO UPDATE SET > i = 20; > ERROR: 23505: duplicate key value violates unique constraint "ioc_i_key" > DETAIL: Key (i)=(20) already exists. > Here the insertion of id 20 happens just fine, then inserting id 1 conflicts, the existing row with id 1 gets updated to id 20 which results in a duplicate key violation. > 1. Is there a different reason the two queries produce a different error? > First error condition wins. Multiple modification gets tested first, before checking whether the outcome of a modification would result in a duplicate. 2. Is there a better way to think about the "command cannot affect row a > second time"? Appreciate any guidance. Thanks. > > A row inserted or updated in a statement cannot be subsequently modified in that same statement. I don't actually understand how you are presently thinking about this... Apparently the algorithm for merge is able to avoid impacting the same row twice and thus if the underlying DML is going to produce a duplicate key violation that is what you will see. I hesitate to claim you'd never see a multi-update scenario but do find it reasonable that it would be less prone to it. David J.
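A quick way to see that the 21000 error is about modifying the same row twice rather than about duplicates is to use an update expression that only collides with a row already touched earlier in the same statement, along the lines of the untested "set i = i * 20" idea above; the expected outcome is hedged here as well, since it was not run:

```
-- Starting from the same setup, where table ioc contains only the row (1).
-- Row (1) conflicts and is updated to 20; the later row (20) then finds that
-- freshly updated tuple as its conflict target, so this would presumably
-- fail with "ON CONFLICT DO UPDATE command cannot affect row a second time"
-- even though no duplicate key error is involved.
INSERT INTO ioc VALUES (1), (20) ON CONFLICT (i) DO UPDATE SET i = i * 20;
```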
Re: Changing the default random_page_cost value
On Thu, Oct 24, 2024 at 08:01:11PM -0400, Greg Sabino Mullane wrote: > On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian wrote: > > I am not a fan of this patch. I don't see why _removing_ the magnetic > part helps because you then have no logic for any 1.2 was chosen. > > > Okay, but we have no documented logic on why 4.0 was chosen either. :) Uh, we do, and it is in the docs: Random access to mechanical disk storage is normally much more expensive than four times sequential access. However, a lower default is used (4.0) because the majority of random accesses to disk, such as indexed reads, are assumed to be in cache. The default value can be thought of as modeling random access as 40 times slower than sequential, while expecting 90% of random reads to be cached. You surely saw this when you created the patch and removed the text. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Re: meson and check-tests
Hi Ashutosh and Jian, Sorry for the late reply and thanks for the feedback! On Fri, 4 Oct 2024 at 16:13, jian he wrote: > > v3, 0001 documentation: > We can at least write something on > https://wiki.postgresql.org/wiki/Meson about this feature. I agree. It seems that I need to apply for an editor access on the wiki page. I am planning to do that after the commit. > TESTS='check check_btree' meson test amcheck/regress --verbose > works, but I feel like there is a discoverability issue . > > TESTS='check check_btree' meson test amcheck/regress --list --verbose > shows: > postgresql:amcheck / amcheck/regress > > > For small tests, listing all the available tests would be great. I agree but unfortunately this seems not possible. 'meson test --list' option only shows tests that are registered to meson build (i.e. test() calls in the meson.build files). On Fri, 4 Oct 2024 at 18:40, Ashutosh Bapat wrote: > --schedule or the test selection becomes part of the test command > itself in the current master. By passing it as an argument, in these > patches, we are allowing those to be overridden if TESTS is set at the > time of running the test. I like this idea. I was pondering whether we > really need two separate arguments --schedule and --tests > respectively. IIUC, we can't pass them as a single argument (say > --test-selection or --selection) because the subsequent --schedule > would be treated as a separate argument if not quoted correctly. That > isn't a problem with value of tests. To avoid quoting '--schedule > ...', we have two separate arguments. Am I right? Yes, that is exactly why we have both '--schedule' and '--tests' flags. Also, a comment is added to clarify this. > It might be better to make this explicit in the code -- by making sure > that only one of them is passed and writing a comment about it. > ArgumentParser might have some trick to specify that passing both the > arguments is an error. I did not understand why only one of them needed to be passed at a time. For example in ecpg tests (src/interfaces/ecpg/test/meson.build), both '--schedule' and '--tests' options are passed. > Also I don't think "Note that setup > # suite tests (at least tmp_install and initdb_cache tests) may need to be run > # before running these tests." is needed. This is true irrespective of > whether we specify TESTS or not. Done. -- Regards, Nazir Bilal Yavuz Microsoft From f0e035636ae02b7fe0668a0b2246b080656a26e5 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Thu, 31 Oct 2024 16:21:23 +0300 Subject: [PATCH v4 1/2] Add 'make check-tests' behavior to the meson based builds There was no way to run specific regression tests in the regress/regress tests in the meson based builds. Add this behavior. Author: Nazir Bilal Yavuz Reviewed-by: Ashutosh Bapat Reviewed-by: Jian He Discussion: postgr.es/m/CAExHW5tK-QqayUN0%2BN3MF5bjV6vLKDCkRuGwoDJwc7vGjwCygQ%40mail.gmail.com --- meson.build| 14 -- src/tools/testwrap | 10 ++ 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/meson.build b/meson.build index bb9d7f5a8e8..c9f91a0ee22 100644 --- a/meson.build +++ b/meson.build @@ -3399,11 +3399,11 @@ foreach test_dir : tests '--dbname', dbname, ] + t.get('regress_args', []) - test_selection = [] - if t.has_key('schedule') -test_selection += ['--schedule', t['schedule'],] - endif + # To avoid '--schedule' option to be treated as a separate argument in + # the testwrap script if not quoted correctly. 
+ test_schedule = t.get('schedule', []) + test_selection = [] if kind == 'isolation' test_selection += t.get('specs', []) else @@ -3427,12 +3427,13 @@ foreach test_dir : tests testwrap_base, '--testgroup', test_group, '--testname', kind, + '--schedule', test_schedule, + '--tests', test_selection, '--', test_command_base, '--outputdir', test_output, '--temp-instance', test_output / 'tmp_check', '--port', testport.to_string(), - test_selection, ], suite: test_group, kwargs: test_kwargs, @@ -3447,10 +3448,11 @@ foreach test_dir : tests testwrap_base, '--testgroup', test_group_running, '--testname', kind, +'--schedule', test_schedule, +'--tests', test_selection, '--', test_command_base, '--outputdir', test_output_running, -test_selection, ], is_parallel: t.get('runningcheck-parallel', true), suite: test_group_running, diff --git a/src/tools/testwrap b/src/tools/testwrap index 9a270beb72d..0ab9f5dada9 100755 --- a/src/tools/testwrap +++ b/src/tools/testwrap @@ -12,6 +12,8 @@ parser.add_argument('--srcdir', help='source directory of test', type=str) parser.add_argument('--basedir', help='base directory of tes
[PATCH] Add support for displaying database service in psql prompt
Hi, I think it might be useful to sometimes display the database service (i.e. what is defined in ~/.pg_service.conf and used via psql service=foo) in the psql prompt, e.g. if you have 'test' and 'prod' services, but both have the same database name. This was also suggested to me during a recent customer training. I chose the '%s' tag for it. I had to add the service to PGConn as PQservice (first patch) to libpq and then use it in psql in the second patch. Michael >From f876195acb797a5ac58c17409fdd75d18581c292 Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Thu, 31 Oct 2024 18:27:52 +0100 Subject: [PATCH v1 1/2] Add PQservice to PGConn. This adds the content of the database service (if any) to PGConn. One use for this would be for psql to display the service as part of the prompt. --- src/interfaces/libpq/exports.txt | 1 + src/interfaces/libpq/fe-connect.c | 11 ++- src/interfaces/libpq/libpq-fe.h | 1 + src/interfaces/libpq/libpq-int.h | 1 + 4 files changed, 13 insertions(+), 1 deletion(-) diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt index 5d8213e0b5..2ad2cbf5ca 100644 --- a/src/interfaces/libpq/exports.txt +++ b/src/interfaces/libpq/exports.txt @@ -205,3 +205,4 @@ PQcancelFinish202 PQsocketPoll 203 PQsetChunkedRowsMode 204 PQgetCurrentTimeUSec 205 +PQservice 206 diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 61c025ff3b..8d809ee4cb 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -190,7 +190,8 @@ typedef struct _internalPQconninfoOption static const internalPQconninfoOption PQconninfoOptions[] = { {"service", "PGSERVICE", NULL, NULL, - "Database-Service", "", 20, -1}, + "Database-Service", "", 20, + offsetof(struct pg_conn, pgservice)}, {"user", "PGUSER", NULL, NULL, "Database-User", "", 20, @@ -7052,6 +7053,14 @@ PQdb(const PGconn *conn) return conn->dbName; } +char * +PQservice(const PGconn *conn) +{ + if (!conn) + return NULL; + return conn->pgservice; +} + char * PQuser(const PGconn *conn) { diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h index 15012c770c..5947e7c766 100644 --- a/src/interfaces/libpq/libpq-fe.h +++ b/src/interfaces/libpq/libpq-fe.h @@ -385,6 +385,7 @@ extern int PQrequestCancel(PGconn *conn); /* Accessor functions for PGconn objects */ extern char *PQdb(const PGconn *conn); +extern char *PQservice(const PGconn *conn); extern char *PQuser(const PGconn *conn); extern char *PQpass(const PGconn *conn); extern char *PQhost(const PGconn *conn); diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h index 08cc391cbd..dcebca9898 100644 --- a/src/interfaces/libpq/libpq-int.h +++ b/src/interfaces/libpq/libpq-int.h @@ -394,6 +394,7 @@ struct pg_conn char *fbappname; /* fallback application name */ char *dbName; /* database name */ char *replication; /* connect as the replication standby? */ + char *pgservice; /* Postgres service, if any */ char *pguser; /* Postgres username and password, if any */ char *pgpass; char *pgpassfile; /* path to a file containing password(s) */ -- 2.39.5 >From 3257e93d20353ff348b15df9b45207ec45839ed5 Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Thu, 31 Oct 2024 18:49:14 +0100 Subject: [PATCH v1 2/2] Add support for database service to psql prompt. This adds a new psql variable SERVICE as well as the string '%s' for the psql PROMPT, displaying the service (from PGSERVICEFILE) if so desired. 
--- src/bin/psql/command.c | 2 ++ src/bin/psql/prompt.c | 6 ++ 2 files changed, 8 insertions(+) diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 328d78c73f..e8f2573649 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -4085,6 +4085,7 @@ SyncVariables(void) pset.sversion = PQserverVersion(pset.db); SetVariable(pset.vars, "DBNAME", PQdb(pset.db)); + SetVariable(pset.vars, "SERVICE", PQservice(pset.db)); SetVariable(pset.vars, "USER", PQuser(pset.db)); SetVariable(pset.vars, "HOST", PQhost(pset.db)); SetVariable(pset.vars, "PORT", PQport(pset.db)); @@ -4118,6 +4119,7 @@ void UnsyncVariables(void) { SetVariable(pset.vars, "DBNAME", NULL); + SetVariable(pset.vars, "SERVICE", NULL); SetVariable(pset.vars, "USER", NULL); SetVariable(pset.vars, "HOST", NULL); SetVariable(pset.vars, "PORT", NULL); diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c index 0d99d00ac9..de3d976103 100644 --- a/src/bin/psql/prompt.c +++ b/src/bin/psql/prompt.c @@ -33,6 +33,7 @@ * %p - backend pid * %> - database server port number * %n - database user name + * %s - service * %/ - current database * %~ - like %/ but "~" when database name equals user name * %w - whitespace of the same width as the most recent output of PROMPT1 @@ -246,6 +247,11 @@ get_prompt(promptStatus_t status, ConditionalStack c
Re: Count and log pages set all-frozen by vacuum
On Thu, 31 Oct 2024 at 15:26, Melanie Plageman wrote: > On Thu, Oct 31, 2024 at 11:15 AM Peter Geoghegan wrote: > > > > On Thu, Oct 31, 2024 at 10:22 AM Melanie Plageman > > wrote: > > > It seems a better solution would be to find a way to document it or > > > phrase it clearly in the log. It is true that these pages were set > > > all-frozen in the VM. It is just that some of them were later removed. > > > And we don't know which ones those are. Is there a way to make this > > > clear? > > > > Probably not, but I don't think that it's worth worrying about. ISTM > > that it's inevitable that somebody might get confused whenever we > > expose implementation details such as these. This particular example > > doesn't seem particularly concerning to me. > > > > Fundamentally, the information that you're showing is a precisely > > accurate account of the work performed by VACUUM. If somebody is > > bothered by the fact that we're setting VM bits for pages that just > > get truncated anyway, then they're bothered by the reality of what > > VACUUM does -- and not by the instrumentation itself. > > Makes sense to me. Though, I'm looking at it as a developer. > For user consumption, to reduce the number of puzzled questions, I'd suggest adding a line to the log output of the form visibility map: %u pages set all frozen, up to %u may have been removed from the table rather than appending the info to the frozen: line of the output. By spelling visibility map out in full it gives the curious user something specific enough to look up. It also at least alerts the user to the fact that the number can't just be subtracted from a total.
Re: Changing the default random_page_cost value
Bruce Momjian writes: > On Thu, Oct 24, 2024 at 08:01:11PM -0400, Greg Sabino Mullane wrote: >> Okay, but we have no documented logic on why 4.0 was chosen either. :) > Uh, we do, and it is in the docs: > Random access to mechanical disk storage is normally much more > expensive > than four times sequential access. However, a lower default is used > (4.0) because the majority of random accesses to disk, such as indexed > reads, are assumed to be in cache. The default value can be thought > of > as modeling random access as 40 times slower than sequential, while > expecting 90% of random reads to be cached. Meh. Reality is that that is somebody's long-after-the-fact apologia for a number that was obtained by experimentation. regards, tom lane
Re: Fix for consume_xids advancing XIDs incorrectly
On Wed, Oct 30, 2024 at 12:00 AM Yushi Ogiwara wrote: > > I made a new patch (skip_xid_correctly.diff) that incorporates the > points we discussed: > > 1. Fix the issue that consume_xids consumes nxids+1 XIDs. > 2. Update lastxid when calling GetTopFullTransactionId() to support > nxids==1 case. > 3. Forbid consume_xids when nxids==0. > 4. Add comments explaining the return values of consume_xids and > consume_xids_until, and the rationale for incrementing consumed++ when > GetTopFullTransactionId() is called. > > Also, I made another patch (support_blksz_32k.diff) that supports the > block size == 32K case. > Thank you for making patches! Here are review comments. skip_xid_correctly.diff: - if (nxids < 0) + if (nxids <= 0) elog(ERROR, "invalid nxids argument: %lld", (long long) nxids); - - if (nxids == 0) - lastxid = ReadNextFullTransactionId(); else lastxid = consume_xids_common(InvalidFullTransactionId, (uint64) nxids); After calling elog(ERROR) we don't return here, so the "else" is not necessary. That is, we can rewrite it to: if (nxids <= 0) elog(ERROR, "invalid nxids argument: %lld", (long long) nxids); lastxid = consume_xids_common(InvalidFullTransactionId, (uint64) nxids); --- +* +* If no top-level XID is assigned, a new one is obtained, +* and the consumed XID counter is incremented. I'm not sure this comment is appropriate since what we do here is obvious and the comment doesn't explain why we do that. IMO we don't need to update these comments. support_blksz_32k.diff: - if (xids_left > 2000 && + if (xids_left > COMMIT_TS_XACTS_PER_PAGE && I think it's better to just replace 2000 with 4000 and explain why 4000 is sufficient. Probably we can define '#define XID_SHORTCUT_THRESHOLD 4000" and put an assertion to XidSkip() or consume_xids_common() to check if the number of consumed XIDs doesn't exceed XID_SHORTCUT_THRESHOLD. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
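For anyone following along, these helpers are exposed as SQL functions by the xid_wraparound test module (src/test/modules/xid_wraparound), so the behaviour changes above can be exercised roughly like this, assuming that test module has been built and installed (it is not shipped by default):

```
CREATE EXTENSION xid_wraparound;

-- Consume exactly 1000 XIDs and return the last assigned transaction id.
SELECT consume_xids(1000);

-- With the patch, asking for zero XIDs is rejected instead of silently
-- returning the next transaction id.
SELECT consume_xids(0);  -- expected to fail with "invalid nxids argument"
```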
Re: Multi delete wal IDEA
On Thu, Oct 31, 2024 at 11:42 AM Stepan Neretin wrote:
>
> Hi there, hackers! How about trying out an idea to add an analogue of the
> multi-insert optimization to save WAL volume when deleting records? This
> patch is trying to do just that.

Hi, Thanks for the patch! Could you explain a bit more how this patch works? I don't see changes in your patch to heap_delete() where WAL is emitted, so I'm not sure I understand how it works. And could you provide a small example or repro? It makes it much easier to review and understand.

- Melanie
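As a generic way to quantify what such an optimization could save, one can compare the WAL generated by a bulk DELETE on an unpatched and a patched build; this is only a measurement sketch, not the author's benchmark:

```
-- Measure WAL volume produced by a bulk DELETE.
CREATE TABLE wal_probe AS
    SELECT g AS id, repeat('x', 100) AS filler
    FROM generate_series(1, 100000) g;

CREATE TEMP TABLE lsn_before AS SELECT pg_current_wal_insert_lsn() AS lsn;

DELETE FROM wal_probe;

SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_insert_lsn(), lsn))
       AS wal_generated_by_delete
FROM lsn_before;
```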
Re: Using read stream in autoprewarm
Dear Nazir, At a first quick look it looks good. I will take a closer look at it tomorrow. Could you please let me know about the performance tests and graphs? Best regards, Stepan Neretin!
Re: UUID v7
> > I think we typically avoid this kind of check failure by assigning > > uuidv7() and uuidv7(interval) different C functions that call the > > common function. That is, we have pg_proc entries like: > > > > Done. > > >>> > >>> It's odd to me that only uuid_extract_timestamp() supports UUID v6 in > >>> spite of not supporting UUID v6 generation. I think it makes more > >>> sense to support UUID v6 generation as well, if the need for it is > >>> high. > >> > >> RFC urges to use UUIDv7 instead of UUIDv6 when possible. I'm fine with > providing implementation, it's trivial. PFA patch with implementation. > >> > > > > My point is that we should either support full functionality for > > UUIDv6 (such as generation and extraction) or none of them. I'm not > > really sure we want UUIDv6 as well, but if we want it, it should be > > implemented in a separate patch. > > Make sense. I've removed all traces of v6. > Hi there, Firstly, I'd like to discuss the increased_clock_precision variable, which currently divides the timestamp into milliseconds and nanoseconds. However, this approach only approximates the extra bits for sub-millisecond precision, leading to imprecise timestamps in high-frequency UUID generation. To address this issue, we could consider using a more accurate method for calculating the timestamp. For instance, we could utilize a higher resolution clock or implement a more precise algorithm to ensure accurate timestamps. Additionally, it would be beneficial to add validation checks for the interval argument. These checks could verify that the input interval is within reasonable bounds and that the calculated timestamp is accurate. Examples of checks could include verifying if the interval is too small, too large, or exceeds the maximum possible number of milliseconds and nanoseconds in a timestamp. By implementing these changes, we can improve the accuracy and reliability of UUID generation, making it more suitable for high-frequency usage scenarios. What do you think about these suggestions? Let me know your thoughts! Best Regards, Stepan Neretin!
Re: Use "protocol options" name instead of "protocol extensions" everywhere
On Thu, Oct 31, 2024 at 7:51 AM Heikki Linnakangas wrote: > Or keep using "protocol extension" and add a paragraph to the docs to > say explicitly that there's no support for extensions to create protocol > extensions. TLS extensions is a good comparison. Of the three proposed, this last one is my preference. I think it'd be good to draw very clear lines between the transport level, the protocol level, and the application level. Thanks, --Jacob
Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description
On Fri, Oct 18, 2024 at 04:25:33PM +0530, Amit Kapila wrote: > Few comments: > = > 1. > > - True if this slot is currently actively being used > + True if this slot is currently currently being streamed > > > currently shouldn't be used twice. > > 2. > - /* The time since the slot has become inactive */ > + /* The time slot sychronized was stopped. */ > TimestampTz inactive_since; > > Would it be better to say: "The time slot synchronization was stopped."? > > 3. > This is useful for slots on the > +standby that are intended to be synced from a primary server > > I think it is better to be explicit here and probably say: "This is > useful for slots on the standby that are being synced from a primary > server .." Agreed, fixed in the attached patch. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?" diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml index 61d28e701f2..349f8b3d064 100644 --- a/doc/src/sgml/system-views.sgml +++ b/doc/src/sgml/system-views.sgml @@ -2435,7 +2435,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx active bool - True if this slot is currently actively being used + True if this slot is currently being streamed @@ -2444,9 +2444,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx active_pid int4 - The process ID of the session using this slot if the slot - is currently actively being used. NULL if - inactive. + The process ID of the session streaming data for this slot. + NULL if inactive. @@ -2566,15 +2565,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx inactive_since timestamptz -The time since the slot has become inactive. -NULL if the slot is currently being used. -Note that for slots on the standby that are being synced from a -primary server (whose synced field is -true), the -inactive_since indicates the last -synchronization (see -) -time. +The time slot synchronization (see ) +was most recently stopped. NULL if the slot +has always been synchronized. This is useful for slots on the +standby that are being synced from a primary server (whose +synced field is true) +so they know when the slot stopped being synchronized. diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c index d62186a5107..f4f80b23129 100644 --- a/src/backend/replication/logical/slotsync.c +++ b/src/backend/replication/logical/slotsync.c @@ -1515,7 +1515,7 @@ update_synced_slots_inactive_since(void) * correctly interpret the inactive_since if the standby gets promoted * without a restart. We don't want the slots to appear inactive for a * long time after promotion if they haven't been synchronized recently. - * Whoever acquires the slot i.e.makes the slot active will reset it. + * Whoever acquires the slot, i.e., makes the slot active, will reset it. */ if (!StandbyMode) return; diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 45582cf9d89..5666fcfd3cf 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -205,7 +205,7 @@ typedef struct ReplicationSlot */ XLogRecPtr last_saved_confirmed_flush; - /* The time since the slot has become inactive */ + /* The time slot synchronization was stopped. */ TimestampTz inactive_since; } ReplicationSlot;
Re: Use "protocol options" name instead of "protocol extensions" everywhere
On Thu, 31 Oct 2024 at 17:09, Robert Haas wrote: > I don't think it's really viable to promise that we'll never talk > about extending anything other than in the context of what CREATE > EXTENSION does. Agreed, but it seems nice to avoid confusion by not overloading terminology, if we can find a good/decent alternative. When we do have an overloaded term, we should definitely document what we mean with it and how it differs from other usage of the term. But even then many people won't have read those docs and assume it means the thing that they expect it to mean. A good example of this is the fact that initdb creates a "database cluster". We definitely document that we use the term that way. But if you talk/write about a database cluster many people (understandably) expect you to mean a very different thing.
Re: MultiXact\SLRU buffers configuration
> On 31 Oct 2024, at 17:29, Thom Brown wrote: > > On Thu, 31 Oct 2024 at 10:47, Andrey M. Borodin wrote: >> >> >> >>> On 29 Oct 2024, at 21:45, Thom Brown wrote: >>> >>> It clearly checks for interrupts, but when I saw this issue happen, it >>> wasn't interruptible. >> >> Perhaps, that was different multixacts? >> When I was observing this pathology on Standby, it was a stream of different >> reads encountering different multis. >> >> Either way startup can cancel locking process on it's own. Or is it the case >> that cancel was not enough, did you actually need termination, not cancel? > > Termination didn't work on either of the processes. How did you force the process to actually terminate? Did you observe repeated read of the same multixact? Was offending process holding any locks while waiting? Best regards, Andrey Borodin.
Re: UUID v7
> On 31 Oct 2024, at 22:15, Masahiko Sawada wrote: > > On Sat, Oct 26, 2024 at 10:05 AM Andrey M. Borodin > wrote: > > I think we typically avoid this kind of check failure by assigning > uuidv7() and uuidv7(interval) different C functions that call the > common function. That is, we have pg_proc entries like: > Done. >>> >>> It's odd to me that only uuid_extract_timestamp() supports UUID v6 in >>> spite of not supporting UUID v6 generation. I think it makes more >>> sense to support UUID v6 generation as well, if the need for it is >>> high. >> >> RFC urges to use UUIDv7 instead of UUIDv6 when possible. I'm fine with >> providing implementation, it's trivial. PFA patch with implementation. >> > > My point is that we should either support full functionality for > UUIDv6 (such as generation and extraction) or none of them. I'm not > really sure we want UUIDv6 as well, but if we want it, it should be > implemented in a separate patch. Make sense. I've removed all traces of v6. Thanks! Best regards, Andrey Borodin. v29-0001-Implement-UUID-v7.patch Description: Binary data
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 11:56 AM Alastair Turner wrote: > > For user consumption, to reduce the number of puzzled questions, I'd suggest > adding a line to the log output of the form > > visibility map: %u pages set all frozen, up to %u may have been removed from > the table > > rather than appending the info to the frozen: line of the output. By spelling > visibility map out in full it gives the curious user something specific > enough to look up. It also at least alerts the user to the fact that the > number can't just be subtracted from a total. Would it also be useful to have the number set all-visible? It seems like if we added a new line prefixed with visibility map, it ought to include all-visible and all-frozen then. Something like this: visibility map: %u pages set all-visible, %u pages set all-frozen. I find it more confusing to say "up to X may have been removed from the table." It's unclear what that means -- especially since we already have "pages removed" in another part of the log message. We actually could call visibilitymap_count() at the beginning of the vacuum and then log the difference between that and the results of calling it after finishing the vacuum. We currently call it after truncating the table anyway. That won't tell you how many pages were set all-frozen by this vacuum, as pages could have been unfrozen by DML that occurred after the page was vacuumed. It might be useful in addition to the line about the visibility map. This is somewhat in conflict with Robert and Peter's points about how autovacuum logging should be about what this vacuum did. But, we do have lines that talk about the before and after values: new relfrozenxid: 748, which is 3 XIDs ahead of previous value So, we could do something like: visibility map before: %u pages all-visible, %u pages all-frozen visibility map after: %u pages all-visible, %u pages all-frozen or visibility map after: %u pages all-visible (%u more than before), %u pages all-frozen (%u more than before) I still prefer adding how many pages were set all-frozen by this vacuum, though. - Melanie
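As an aside, anyone who wants to eyeball the cumulative VM state of a table (as opposed to the per-vacuum deltas discussed above) can already do so with the pg_visibility extension, e.g. for the table from the earlier example:

  CREATE EXTENSION IF NOT EXISTS pg_visibility;
  -- total count of all-visible and all-frozen pages according to the VM
  SELECT * FROM pg_visibility_map_summary('the_table');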
Re: Time to add a Git .mailmap?
On Thu, Oct 31, 2024 at 11:37:13AM +0100, Daniel Gustafsson wrote: > When looking at our Git tree for a recent conference presentation I happened > to > notice that we have recently gained duplicate names in the shortlog. Not sure > if we care enough to fix that with a .mailmap, but if we do the attached diff > makes sure that all commits are accounted for a single committer entry. Seems reasonable to me. -- nathan
Re: make all ereport in gram.y print out relative location
jian he writes: > now changed to > static PartitionStrategy parsePartitionStrategy(char *strategy, int location, > core_yyscan_t yyscanner); I can take a look at this, since it's basically a followup to 774171c4f. regards, tom lane
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 10:22 AM Melanie Plageman wrote: > It seems a better solution would be to find a way to document it or > phrase it clearly in the log. It is true that these pages were set > all-frozen in the VM. It is just that some of them were later removed. > And we don't know which ones those are. Is there a way to make this > clear? Probably not, but I don't think that it's worth worrying about. ISTM that it's inevitable that somebody might get confused whenever we expose implementation details such as these. This particular example doesn't seem particularly concerning to me. Fundamentally, the information that you're showing is a precisely accurate account of the work performed by VACUUM. If somebody is bothered by the fact that we're setting VM bits for pages that just get truncated anyway, then they're bothered by the reality of what VACUUM does -- and not by the instrumentation itself. Why not just reuse visibilitymap_count for this (and have it count the number of all-frozen pages when called by heap_vacuum_rel)? That'll change the nature of the information shown, but that might actually make it slightly more useful. -- Peter Geoghegan
Re: Consider the number of columns in the sort cost model
I played around with the examples a bit and couldn't figure out something. When I added the same values to different columns - firstly in a, later in b, the order of the columns for sort operation doesn't change. Isn't this a mistake? create table a (x1 int, y1 int); create table b (x2 int, y2 int); insert into a values (NULL, NULL); insert into a values (NULL, 1); insert into a values (1, 1); insert into a values (1, NULL); create index a_x1_idx on a(x1); create index b_x2_idx on b(x2); create index a_y1_idx on a(y1); create index b_y2_idx on b(y2); insert into b select 1, 2 from generate_series(11,20) as id; insert into b select 1, 1 from generate_series(1,10) as id; insert into b select 1, 3 from generate_series(3,30) as id; explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2; QUERY PLAN Hash Right Join (cost=44.99..48.15 rows=5 width=12) (actual time=0.225..0.250 rows=8 loops=1) Hash Cond: (b.x2 = a.x1) -> HashAggregate (cost=43.90..46.16 rows=226 width=8) (actual time=0.117..0.123 rows=3 loops=1) Group Key: b.x2, b.y2 Batches: 1 Memory Usage: 40kB -> Seq Scan on b (cost=0.00..32.60 rows=2260 width=8) (actual time=0.030..0.044 rows=48 loops=1) -> Hash (cost=1.04..1.04 rows=4 width=4) (actual time=0.073..0.074 rows=4 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.047..0.051 rows=4 loops=1) Planning Time: 1.649 ms Execution Time: 0.485 ms (11 rows) delete from b; insert into b select 2, 1 from generate_series(11,20) as id; insert into b select 1, 1 from generate_series(1,10) as id; insert into b select 3, 1 from generate_series(3,30) as id; vacuum analyze; explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2; QUERY PLAN --- Hash Left Join (cost=1.79..2.86 rows=4 width=12) (actual time=0.083..0.090 rows=4 loops=1) Hash Cond: (a.x1 = b.x2) -> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.010..0.011 rows=4 loops=1) -> Hash (cost=1.75..1.75 rows=3 width=8) (actual time=0.067..0.068 rows=3 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> HashAggregate (cost=1.72..1.75 rows=3 width=8) (actual time=0.063..0.064 rows=3 loops=1) Group Key: b.x2, b.y2 Batches: 1 Memory Usage: 24kB -> Seq Scan on b (cost=0.00..1.48 rows=48 width=8) (actual time=0.006..0.014 rows=48 loops=1) Planning Time: 0.391 ms Execution Time: 0.151 ms (11 rows) Sorry, I missed vacuum analyze before deleting all data from table b, but after running it I still got the same plan. 
alena@postgres=# create table a (x1 int, y1 int); create table b (x2 int, y2 int); insert into a values (NULL, NULL); insert into a values (NULL, 1); insert into a values (1, 1); insert into a values (1, NULL); create index a_x1_idx on a(x1); create index b_x2_idx on b(x2); create index a_y1_idx on a(y1); create index b_y2_idx on b(y2); insert into b select 1, 2 from generate_series(11,20) as id; insert into b select 1, 1 from generate_series(1,10) as id; insert into b select 1, 3 from generate_series(3,30) as id; alena@postgres=# vacuum analyze; VACUUM alena@postgres=# explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2; QUERY PLAN --- Hash Left Join (cost=1.79..2.86 rows=4 width=12) (actual time=0.168..0.185 rows=8 loops=1) Hash Cond: (a.x1 = b.x2) -> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.027..0.029 rows=4 loops=1) -> Hash (cost=1.75..1.75 rows=3 width=8) (actual time=0.129..0.130 rows=3 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> HashAggregate (cost=1.72..1.75 rows=3 width=8) (actual time=0.119..0.123 rows=3 loops=1) Group Key: b.x2, b.y2 Batches: 1 Memory Usage: 24kB -> Seq Scan on b (cost=0.00..1.48 rows=48 width=8) (actual time=0.013..0.029 rows=48 loops=1) Planning Time: 1.464 ms Execution Time: 0.352 ms (11 rows) -- Regards, Alena Rybakina Postgres Professional
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 11:15 AM Peter Geoghegan wrote: > > On Thu, Oct 31, 2024 at 10:22 AM Melanie Plageman > wrote: > > It seems a better solution would be to find a way to document it or > > phrase it clearly in the log. It is true that these pages were set > > all-frozen in the VM. It is just that some of them were later removed. > > And we don't know which ones those are. Is there a way to make this > > clear? > > Probably not, but I don't think that it's worth worrying about. ISTM > that it's inevitable that somebody might get confused whenever we > expose implementation details such as these. This particular example > doesn't seem particularly concerning to me. > > Fundamentally, the information that you're showing is a precisely > accurate account of the work performed by VACUUM. If somebody is > bothered by the fact that we're setting VM bits for pages that just > get truncated anyway, then they're bothered by the reality of what > VACUUM does -- and not by the instrumentation itself. Makes sense to me. Though, I'm looking at it as a developer. > Why not just reuse visibilitymap_count for this (and have it count the > number of all-frozen pages when called by heap_vacuum_rel)? That'll > change the nature of the information shown, but that might actually > make it slightly more useful. I'm biased because I want the count of pages newly set all-frozen in the VM for another patch. You think exposing the total number of all-frozen pages at the end of the vacuum is more useful to users? - Melanie
Re: Making error message more user-friendly with spaces in a URI
> > I made a patch to make the error message more user-friendly when using a > URI to connect a database with psql. > Hi, Yushi! I could not find this line in the master branch and I couldn't apply this patch. I only see this error in the test (I think the test should also be changed), but the idea of fixing the error looks good to me. Best Regards, Stepan Neretin!
Re: Having problems generating a code coverage report
On Thu, Oct 31, 2024 at 10:02 PM Aleksander Alekseev wrote: > > Hi, > > > On Wed, Oct 30, 2024 at 05:26:49PM -0400, Peter Geoghegan wrote: > > > I use Debian unstable for most of my day to day work. Apparently > > > Debian unstable has exactly the same version of lcov as Ubuntu 24.04. > > > > > > I've also been unable to generate coverage reports for some time (at > > > least on Debian, with LCOV version 2.0-1). > > > > So do I, for both the debian SID and the lcov parts. > > I downgraded to lcov 1.16 [1] and it helped. This is merely a > workaround of course, not a long-time solution. > > [1]: https://github.com/linux-test-project/lcov/releases/tag/v1.16 > > -- My Ubuntu meson version is 0.61.2, which also fails with errors similar to the ones you've posted. Judging from these two posts, https://github.com/mesonbuild/meson/pull/12345 https://github.com/mesonbuild/meson/issues/11995 maybe upgrading meson can solve this problem.
Multi delete wal IDEA
Hi there, hackers! How about trying out an idea to add an analog to save memory in WAL files for deleting records, similar to multi-insert optimization? This patch is trying to do just that. Best Regards, Stepan Neretin! diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index 1161520f76..fb2d4563fd 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -3816,13 +3816,14 @@ ExecModifyTable(PlanState *pstate) CmdType operation = node->operation; ResultRelInfo *resultRelInfo; PlanState *subplanstate; - TupleTableSlot *slot; + TupleTableSlot *slot = NULL; TupleTableSlot *oldSlot; ItemPointerData tuple_ctid; HeapTupleData oldtupdata; HeapTuple oldtuple; ItemPointer tupleid; bool tuplock; + List *items_to_delete = NULL; CHECK_FOR_INTERRUPTS(); @@ -4179,8 +4180,15 @@ ExecModifyTable(PlanState *pstate) break; case CMD_DELETE: -slot = ExecDelete(&context, resultRelInfo, tupleid, oldtuple, - true, false, node->canSetTag, NULL, NULL, NULL); +ItemPointer item_ptr = (ItemPointer) palloc0(sizeof(ItemPointerData)); +item_ptr->ip_blkid = tupleid->ip_blkid; +item_ptr->ip_posid = tupleid->ip_posid; + +if (!items_to_delete) + items_to_delete = list_make1(item_ptr); +else + items_to_delete = lappend(items_to_delete, item_ptr); + break; case CMD_MERGE: @@ -4197,10 +4205,28 @@ ExecModifyTable(PlanState *pstate) * If we got a RETURNING result, return it to caller. We'll continue * the work on next call. */ - if (slot) + if (slot && !(operation == CMD_DELETE)) return slot; } + if (list_length(items_to_delete) > 0) + { + ListCell *cell; + ereport(WARNING, errmsg("NUM ITEMS TO DELETE = %d", list_length(items_to_delete))); + foreach(cell, items_to_delete) + { + ItemPointer item_ptr = (ItemPointer) lfirst(cell); + if (!slot) +slot = ExecDelete(&context, resultRelInfo, item_ptr, NULL, + true, false, node->canSetTag, NULL, NULL, NULL); + else +ExecDelete(&context, resultRelInfo, item_ptr, NULL, + true, false, node->canSetTag, NULL, NULL, NULL); + } + + return slot; + } + /* * Insert remaining tuples for batch insert. */
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 11:26 AM Melanie Plageman wrote: > I'm biased because I want the count of pages newly set all-frozen in > the VM for another patch. You think exposing the total number of > all-frozen pages at the end of the vacuum is more useful to users? The emphasis on the work that one particular VACUUM operation performed doesn't seem like the most relevant thing to users (I get why you'd care about it in the context of your work, though). What matters to users is that the overall picture over time is one where VACUUM doesn't leave an excessive number of pages not-all-frozen-in-VM. What if we're just setting the same few pages all-frozen, again and again? And what about normal (non-aggressive) VACUUMs that effectively *increase* the number of pages future aggressive VACUUMs need to scan? As you well know, by setting some pages all-visible (not all-frozen), VACUUM essentially guarantees that those same pages can only get frozen by future aggressive VACUUMs. All these factors seem to argue for using visibilitymap_count for this (which is not to say that I am opposed to instrumented pages newly marked all-frozen in the VM, if it makes sense as part of some much larger project). -- Peter Geoghegan
Re: Having problems generating a code coverage report
Hi Jian, > > I downgraded to lcov 1.16 [1] and it helped. This is merely a > > workaround of course, not a long-time solution. > > > > [1]: https://github.com/linux-test-project/lcov/releases/tag/v1.16 > > > > -- > > my ubuntu meson version: 0.61.2, which also fails. > with similar errors you've posted. > > > from these two posts, > https://github.com/mesonbuild/meson/pull/12345 > https://github.com/mesonbuild/meson/issues/11995 > > Maybe upgrading meson can solve this problem. Thanks for the hint. I'm using Meson 1.3.2. Although it's not ancient (Feb 2024) there is 1.6.0 available. I'll try upgrading and let you know the results. -- Best regards, Aleksander Alekseev
"command cannot affect row a second time" in INSERT ... ON CONFLICT
Hello hackers, I am looking to better understand the applicability of the error message "command cannot affect row a second time". Consider the following table and data: CREATE TABLE ioc (i int, UNIQUE(i)); INSERT INTO ioc VALUES (1); The following two queries produce different errors: *Query 1* postgres=# INSERT INTO ioc VALUES (1), (20) ON CONFLICT (i) DO UPDATE SET i = 20; ERROR: 21000: ON CONFLICT DO UPDATE command cannot affect row a second time HINT: Ensure that no rows proposed for insertion within the same command have duplicate constrained values. *Query 2* postgres=# INSERT INTO ioc VALUES (20), (1) ON CONFLICT (i) DO UPDATE SET i = 20; ERROR: 23505: duplicate key value violates unique constraint "ioc_i_key" DETAIL: Key (i)=(20) already exists. INSERT ... ON CONFLICT does not support deferrable unique constraints, and so the two errors appear to be logically equivalent. However, the MERGE command which does support deferring unique constraints, consistently produces the duplicate key violation for similar queries [1] but also raises "command cannot affect row a second time" in other scenarios as demonstrated by regress tests in merge.sql. Naively, it seems to me that attempting to take a tuple lock on both: 1. The conflicting tuple (i = 1 in the second tuple in Query 2) as well as 2. The tuple it updates into (i = 20 in the second tuple in Query 2) (which may or may not exist) in ExecOnConflictUpdate could yield a consistent error message in both scenarios but it offers no real functional gains. 1. Is there a different reason the two queries produce a different error? 2. Is there a better way to think about the "command cannot affect row a second time"? Appreciate any guidance. Thanks. Warm regards, Karthik Ramanathan [1] MERGE command example CREATE TABLE source (sid INT); CREATE TABLE target (tid INT, UNIQUE (tid)); INSERT INTO target VALUES (1); *Query 1a* postgres=# INSERT INTO source VALUES (20), (1); postgres=# MERGE INTO target t USING source AS s ON t.tid = s.sid WHEN MATCHED THEN UPDATE SET tid = 20 WHEN NOT MATCHED THEN INSERT VALUES (s.sid); ERROR: 23505: duplicate key value violates unique constraint "target_tid_key" DETAIL: Key (tid)=(20) already exists. *Query 1b* postgres=# INSERT INTO source VALUES (1), (20); postgres=# MERGE INTO target t USING source AS s ON t.tid = s.sid WHEN MATCHED THEN UPDATE SET tid = 20 WHEN NOT MATCHED THEN INSERT VALUES (s.sid); ERROR: 23505: duplicate key value violates unique constraint "target_tid_key" DETAIL: Key (tid)=(20) already exists.
Re: Use "protocol options" name instead of "protocol extensions" everywhere
On 30/10/2024 15:58, Jelte Fennema-Nio wrote: It was pointed out by Heikki in the thread around protocol-level wait-for-LSN that "protocol extensions" is a pretty confusing name, since it suggests a relation to Postgres extensions. Even though there is no such relationship at all. Attached is a small patch that aligns on the name "protocol options" instead. This terminology is already used in a bunch of the docs. Since no protocol options have been introduced yet, it seems like now is a good time to align on what to call them. It might even be worth backporting this to have our docs of previous versions be consistent. Bikeshedding time: "protocol option" makes me think of GUCs. "optional protocol features" perhaps. A bit long though.. Or keep using "protocol extension" and add a paragraph to the docs to say explicitly that there's no support for extensions to create protocol extensions. TLS extensions is a good comparison. I don't have a strong opinion, all of those would work for me. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 12:02 PM Robert Haas wrote: > I don't see it quite the same way. I agree that what users are really > concerned about is the excessive number of unfrozen pages in the VM. > But that's not the question here. The question is what should > log_autovacuum_min_duration log. And I think it should log what the > vacuum itself did, not what the state of the table ended up being > around the time the vacuum ended. I don't fundamentally disagree that the actual work performed by VACUUM could also be useful. It's a question of emphasis. FWIW I do disagree with the principle that log_autovacuum_min_duration should only log things that are work performed by VACUUM. While most things that it reports on currently do work that way, that in itself doesn't seem like it should preclude reporting on visibilitymap_count now. -- Peter Geoghegan
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Thu, Oct 31, 2024 at 4:05 AM Antonin Houska wrote: > > > And regardless, the library appears to be loaded by every backend during > > > authentication. Why isn't it loaded by postmaster like libraries listed > > > in > > > shared_preload_libraries? fork() would then ensure that the backends do > > > have > > > the library in their address space. > > > > It _can_ be, if you want -- there's nothing that I know of preventing > > the validator from also being preloaded with its own _PG_init(), is > > there? But I don't think it's a good idea to force that, for the same > > reason we want to allow SIGHUP. > > Loading the library by postmaster does not prevent the backends from reloading > it on SIGHUP later. I was simply concerned about performance. (I proposed > loading the library at another stage of backend initialization rather than > adding _PG_init() to it.) Okay. I think this is going to be one of the slower authentication methods by necessity: the builtin flow in libpq requires a human in the loop, and an online validator is going to be making several HTTP calls from the backend. So if it turns out later that we need to optimize the backend logic, I'd prefer to have a case study in hand; otherwise I think we're likely to optimize the wrong things. > Easiness of reading is the only "question" here :-) It's might not always be > obvious why a variable should have some particular value. In general, the > Assert() statements are almost always preceded with a comment in the PG > source. Oh, an assertion label! I can absolutely add one; I originally thought you were proposing a label for the case itself. > > > Or, doesn't the FE_OAUTH_INIT branch of the switch statement actually > > > fit > > > better into oauth_init()? > > > > oauth_init() is the mechanism initialization for the SASL framework > > itself, which is shared with SCRAM. In the current architecture, the > > init callback doesn't take the initial client response into > > consideration at all. > > Sure. The FE_OAUTH_INIT branch in oauth_exchange() (FE) also does not generate > the initial client response. It might, if it ends up falling through to FE_OAUTH_REQUESTING_TOKEN. There are two paths that can do that: the case where we have no discovery URI, and the case where a custom user flow returns a token synchronously (it was probably cached). > Anyway, are you sure that pg_SASL_continue() can also receive the SASL_ASYNC > value from oauth_exchange()? My understanding is that pg_SASL_init() receives > it if there is no token, but after that, oauth_exchange() is not called util > the token is available, and thus it should not return SASL_ASYNC anymore. Correct -- the only way for the current implementation of the OAUTHBEARER mechanism to return SASL_ASYNC is during the very first call. That's not an assumption I want to put into the higher levels, though; I think Michael will be unhappy with me if I introduce additional SASL coupling after the decoupling work that's been done over the last few releases. :D Thanks again, --Jacob
[PATCH] Improve code coverage of network address functions
Hi hackers, Recently I played with lcov [1]. In the process it was discovered that the following functions are not executed by our tests: - abbrev(inet) - set_masklen(cidr,int4) - netmask(inet) - hostmask(inet) - inet_client_addr() - inet_client_port() - inet_server_addr() - inet_server_port() The proposed patch fixes this. For the last four functions the return values are not checked, only the fact that the functions are callable and don't crash. This improves code coverage of src/backend/utils/adt/network.c from 69.8% to 80.1%. [1]: https://postgr.es/m/CAJ7c6TPYPF93%2ByWi%3DThKiOsnhqLpeTmctMrJWz3xRQobGSY6BA%40mail.gmail.com -- Best regards, Aleksander Alekseev From e3a19689a335bf5c0c6aecd19e6fe3d17d456b2b Mon Sep 17 00:00:00 2001 From: Aleksander Alekseev Date: Thu, 31 Oct 2024 17:54:56 +0300 Subject: [PATCH v1] Improve code coverage of network address functions The following functions were not covered by tests: - abbrev(inet) - set_masklen(cidr,int4) - netmask(inet) - hostmask(inet) - inet_client_addr() - inet_client_port() - inet_server_addr() - inet_server_port() Correct this. Aleksander Alekseev, reviewed by TODO FIXME Discussion: TODO FIXME --- src/test/regress/expected/inet.out | 104 + src/test/regress/sql/inet.sql | 14 2 files changed, 118 insertions(+) diff --git a/src/test/regress/expected/inet.out b/src/test/regress/expected/inet.out index b6895d9ced..c903c6c93b 100644 --- a/src/test/regress/expected/inet.out +++ b/src/test/regress/expected/inet.out @@ -190,6 +190,72 @@ SELECT c AS cidr, masklen(c) AS "masklen(cidr)", 10.0.0.0/8 | 8 | 9.1.2.3/8 | 8 (4 rows) +SELECT i AS inet, abbrev(i) AS "abbrev(inet)" FROM INET_TBL; + inet | abbrev(inet) +--+-- + 192.168.1.226/24 | 192.168.1.226/24 + 192.168.1.226| 192.168.1.226 + 192.168.1.0/24 | 192.168.1.0/24 + 192.168.1.0/25 | 192.168.1.0/25 + 192.168.1.255/24 | 192.168.1.255/24 + 192.168.1.255/25 | 192.168.1.255/25 + 10.1.2.3/8 | 10.1.2.3/8 + 10.1.2.3/8 | 10.1.2.3/8 + 10.1.2.3 | 10.1.2.3 + 10.1.2.3/24 | 10.1.2.3/24 + 10.1.2.3/16 | 10.1.2.3/16 + 10.1.2.3/8 | 10.1.2.3/8 + 11.1.2.3/8 | 11.1.2.3/8 + 9.1.2.3/8| 9.1.2.3/8 + 10:23::f1/64 | 10:23::f1/64 + 10:23:: | 10:23:: + ::4.3.2.1/24 | ::4.3.2.1/24 +(17 rows) + +SELECT i AS inet, netmask(i) AS "netmask(inet)" FROM INET_TBL; + inet | netmask(inet) +--+- + 192.168.1.226/24 | 255.255.255.0 + 192.168.1.226| 255.255.255.255 + 192.168.1.0/24 | 255.255.255.0 + 192.168.1.0/25 | 255.255.255.128 + 192.168.1.255/24 | 255.255.255.0 + 192.168.1.255/25 | 255.255.255.128 + 10.1.2.3/8 | 255.0.0.0 + 10.1.2.3/8 | 255.0.0.0 + 10.1.2.3 | 255.255.255.255 + 10.1.2.3/24 | 255.255.255.0 + 10.1.2.3/16 | 255.255.0.0 + 10.1.2.3/8 | 255.0.0.0 + 11.1.2.3/8 | 255.0.0.0 + 9.1.2.3/8| 255.0.0.0 + 10:23::f1/64 | ::::: + 10:23:: | ::::::: + ::4.3.2.1/24 | :ff00:: +(17 rows) + +SELECT i AS inet, hostmask(i) AS "hostmask(inet)" FROM INET_TBL; + inet | hostmask(inet) +--+ + 192.168.1.226/24 | 0.0.0.255 + 192.168.1.226| 0.0.0.0 + 192.168.1.0/24 | 0.0.0.255 + 192.168.1.0/25 | 0.0.0.127 + 192.168.1.255/24 | 0.0.0.255 + 192.168.1.255/25 | 0.0.0.127 + 10.1.2.3/8 | 0.255.255.255 + 10.1.2.3/8 | 0.255.255.255 + 10.1.2.3 | 0.0.0.0 + 10.1.2.3/24 | 0.0.0.255 + 10.1.2.3/16 | 0.0.255.255 + 10.1.2.3/8 | 0.255.255.255 + 11.1.2.3/8 | 0.255.255.255 + 9.1.2.3/8| 0.255.255.255 + 10:23::f1/64 | ::::: + 10:23:: | :: + ::4.3.2.1/24 | 0:ff:::::: +(17 rows) + SELECT c AS cidr, i AS inet FROM INET_TBL WHERE c = i; cidr | inet @@ -261,6 +327,44 @@ SELECT set_masklen(inet(text(i)), 24) FROM INET_TBL; ::4.3.2.1/24 (17 rows) +SELECT 
set_masklen(cidr(text(c)), 24) FROM INET_TBL; + set_masklen + + 192.168.1.0/24 + 192.168.1.0/24 + 192.168.1.0/24 + 192.168.1.0/24 + 192.168.1.0/24 + 192.168.1.0/24 + 10.0.0.0/24 + 10.0.0.0/24 + 10.1.2.0/24 + 10.1.2.0/24 + 10.1.0.0/24 + 10.0.0.0/24 + 10.0.0.0/24 + 10.0.0.0/24 + 10::/24 + 10::/24 + ::/24 +(17 rows) + +-- just making sure these functions are callable and don't crash +SELECT 'ok' AS result +FROM unnest(ARRAY[ +text(inet_client_addr()), +text(inet_client_port()), +text(inet_server_addr()), +text(inet_server_port()) + ]) AS x; + result + + ok + ok + ok + ok +(4 rows) + -- check that btree index works correctly CREATE INDEX inet_idx1 ON inet_tbl(i); SET enable_seqscan TO off; diff --git a/src/test/regress/sql/inet.sql b/src/test/regr
Re: Relcache refactoring
On 31/10/2024 12:06, Heikki Linnakangas wrote: On 31/10/2024 10:14, Heikki Linnakangas wrote: Committed with those fixes, thanks for the review! There is a failure in the buildfarm animal 'prion', which builds with -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options. I'll look into that later today. (I did test with -DRELCACHE_FORCE_RELEASE on my laptop, but must've missed something). I was finally able to reproduce this, by running the failing pg_regress tests concurrently with repeated "vacuum full pg_database" calls. It's curious that 'prion' failed on the first run after the commit, I was not able to reproduce it by just running the unmodified regression tests with -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE. Committed a fix. There was one codepath that was missing a call to RelationInitPhysicalAddr(relation) after the patch. I alluded to this earlier in this thread: ## RelationInitPhysicalAddr() call in RelationReloadNailed() One question or little doubt I have: Before these patches, RelationReloadNailed() calls RelationInitPhysicalAddr() even when it leaves the relation as invalidated because we're not in a transaction or if the relation isn't currently in use. That's a bit surprising, I'd expect it to be done when the entry is reloaded, not when it's invalidated. That's how it's done for non-nailed relations. And in fact, for a nailed relation, RelationInitPhysicalAddr() is called *again* when it's reloaded. Is it important? Before commit a54e1f1587, nailed non-index relations were not reloaded at all, except for the call to RelationInitPhysicalAddr(), which seemed consistent. I think this was unintentional in commit a54e1f1587, or perhaps just overly defensive, as there's no harm in some extra RelationInitPhysicalAddr() calls. This patch removes that extra call from the invalidation path, but if it turns out to be wrong, we can easily add it to RelationInvalidateRelation. Now this wasn't exactly that, but related. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 11:15 AM Peter Geoghegan wrote: > Probably not, but I don't think that it's worth worrying about. ISTM > that it's inevitable that somebody might get confused whenever we > expose implementation details such as these. This particular example > doesn't seem particularly concerning to me. +1. We could possibly make this less confusing by reworking the output so that we first talk about what the vacuuming did in one set of log lines and then talk about what the truncation did afterward. But that's a lot of work, and I don't feel like it's "must do" work. -- Robert Haas EDB: http://www.enterprisedb.com
Re: small pg_combinebackup improvements
On Thu, Oct 31, 2024 at 11:41 AM Bertrand Drouvot wrote: > 0001 looks pretty straightforward and good to me. Thanks for the review. > What about moving the new comment just before the new "else if"? Well, the block comment applies to the whole if-else if-else construction. If we get too many else-ifs here it may need restructuring, but I don't think it needs that now. > Yeah, "copy" is needed. I tested to "just" create an empty incremental file > and got: > > $ pg_combinebackup backup1 backup2 -o restored_full > pg_combinebackup: error: could not read file > "backup1/base/1/INCREMENTAL.6113": read 0 of 4 > > Which is not the error we want to produce. Right, it needs to look like a valid incremental file. > s/pg_combinebackup fails/pg_combinebackup fails due to an unexpected > incremental file/? OK > I'm not sure about 0001 but I think 0002 deserves a back patch as I think it > fits > into the "low-risk fixes" category [0]. I'm inclined to back-patch both, then. We might have more small fixes and they'll be less risky to back-patch if we back-patch them all. -- Robert Haas EDB: http://www.enterprisedb.com
Re: minor doc issue in 9.16.2.1.1. Boolean Predicate Check Expressions
On Fri, Oct 18, 2024 at 10:00:54AM +0800, jian he wrote: > On Fri, Oct 18, 2024 at 2:05 AM Bruce Momjian wrote: > > Yes, updated patch attached. > > > looks good. > > in the meantime, do you think it's necessary to slightly rephrase > jsonb_path_match doc entry: > > currently doc entry: > jsonb_path_match ( target jsonb, path jsonpath [, vars jsonb [, silent > boolean ]] ) → boolean > Returns the result of a JSON path predicate check for the specified JSON > value. > > > "the result of a JSON path predicate check for the specified JSON > value." is a jsonb boolean. > but jsonb_path_match returns sql boolean. > maybe add something to describe case like: "if JSON path predicate > check return jsonb null, jsonb_path_match will return SQL null". Yes, I think that is a good point, updated patch attached. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?" diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 05f630c6a6c..25e445467c3 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -17178,8 +17178,8 @@ ERROR: value too long for type character(2) boolean -Returns the result of a JSON path predicate check for the specified -JSON value. +Returns the SQL boolean result of a JSON path predicate check +for the specified JSON value. (This is useful only with predicate check expressions, not SQL-standard JSON path expressions, @@ -17646,9 +17646,9 @@ SELECT '{ Boolean predicate, whereas the SQL standard allows predicates only within filters. While SQL-standard path expressions return the relevant element(s) of the queried JSON value, predicate check expressions - return the single three-valued result of the + return the single three-valued jsonb result of the predicate: true, - false, or unknown. + false, or null. For example, we could write this SQL-standard filter expression: => select jsonb_path_query(:'json', '$.track.segments ?(@[*].HR > 130)');
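To make the mapping concrete, here is a small illustration of the three possible outcomes (my example, not part of the patch; the last case relies on the fact that comparing a number with a string yields the jsonpath "unknown" result):

  SELECT jsonb_path_match('{"a": 2}', '$.a > 1');   -- t
  SELECT jsonb_path_match('{"a": 0}', '$.a > 1');   -- f
  -- the predicate result is unknown (jsonb null), returned as SQL NULL
  SELECT jsonb_path_match('{"a": 2}', '$.a > "b"'); -- NULL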
Re: small pg_combinebackup improvements
Hi, On Wed, Oct 30, 2024 at 03:50:53PM -0400, Robert Haas wrote: > Hi, > > Here are two small patches to improve pg_combinebackup slightly. > > 0001: I noticed that there is some logic in reconstruct.c that > constructs a pathname of the form a/b//c instead of a/b/c. AFAICT, > this still works fine; it just looks ugly. It's possible to get one of > these pathnames to show up in an error message if you have a corrupted > backup, e.g.: > > pg_combinebackup: error: could not open file > "x/base/1//INCREMENTAL.6229": No such file or directory > pg_combinebackup: removing output directory "xy" > > So 0001 fixes this. Yeah, I don't think that was an issue per say but better to display a path of the form a/b/c (to not be "distracted" by the output at the first place). 0001 looks pretty straightforward and good to me. > 0002: If you try to do something like pg_combinebackup incr1 incr2 -o > result, you'll get an error saying that the first backup must be a > full backup. This is an implementation restriction that might be good > to lift, but right now that's how it is. However, if you do > pg_combinebackup full incr -o result, but the full backup happens to > contain an INCREMENTAL.* file that also exists in the incremental > backup, then you won't get an error. Instead you'll reconstruct a > completely bogus full file and write it to the result directory. Now, > this should really be impossible unless you're intentionally trying to > break pg_combinebackup, but it's also pretty lame, so 0002 fixes > things so that you get a proper error, and also adds a test case. > 1 === +* If we have only incremental files, and there's no full file at any +* point in the backup chain, something has gone wrong. Emit an error. +* * Otherwise, reconstruct. */ if (copy_source != NULL) copy_file(copy_source->filename, output_filename, &checksum_ctx, copy_method, dry_run); + else if (sidx == 0 && source[0]->header_length != 0) + { + pg_fatal("full backup contains unexpected incremental file \"%s\"", +source[0]->filename); + } What about moving the new comment just before the new "else if"? 2 === + copy("$backup2path/base/1/$iname", "$backup1path/base/1/$iname") Yeah, "copy" is needed. I tested to "just" create an empty incremental file and got: $ pg_combinebackup backup1 backup2 -o restored_full pg_combinebackup: error: could not read file "backup1/base/1/INCREMENTAL.6113": read 0 of 4 Which is not the error we want to produce. 3 === + qr/full backup contains unexpected incremental file/, + "pg_combinebackup fails"); s/pg_combinebackup fails/pg_combinebackup fails due to an unexpected incremental file/? > Review appreciated. My plan is to commit at least to master and > possibly back-patch. Opinions on whether to back-patch are also > appreciated. I'm not sure about 0001 but I think 0002 deserves a back patch as I think it fits into the "low-risk fixes" category [0]. [0]: https://www.postgresql.org/support/versioning/ Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 11:49 AM Peter Geoghegan wrote: > The emphasis on the work that one particular VACUUM operation > performed doesn't seem like the most relevant thing to users (I get > why you'd care about it in the context of your work, though). What > matters to users is that the overall picture over time is one where > VACUUM doesn't leave an excessive number of pages > not-all-frozen-in-VM. I don't see it quite the same way. I agree that what users are really concerned about is the excessive number of unfrozen pages in the VM. But that's not the question here. The question is what should log_autovacuum_min_duration log. And I think it should log what the vacuum itself did, not what the state of the table ended up being around the time the vacuum ended. And I think there is certainly a use case for knowing how much work of each particular kind vacuum did. You might for example be trying to judge whether a particular vacuum was useless. Knowing the cumulative state of the table around the time the vacuum finished doesn't help you figure that out; a count of how much work the vacuum itself did does. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use "protocol options" name instead of "protocol extensions" everywhere
On Thu, Oct 31, 2024 at 10:51 AM Heikki Linnakangas wrote: > Bikeshedding time: > > "protocol option" makes me think of GUCs. > > "optional protocol features" perhaps. A bit long though.. > > Or keep using "protocol extension" and add a paragraph to the docs to > say explicitly that there's no support for extensions to create protocol > extensions. TLS extensions is a good comparison. > > I don't have a strong opinion, all of those would work for me. I don't particularly like "optional protocol features". I find "protocol extensions" to be mildly clearer than "protocol options," but only mildly. I don't think it's really viable to promise that we'll never talk about extending anything other than in the context of what CREATE EXTENSION does. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use "protocol options" name instead of "protocol extensions" everywhere
On Thu, 31 Oct 2024 at 15:50, Heikki Linnakangas wrote: > Bikeshedding time: Another few options: 4. Protocol enhancement 5. Protocol flag 6. Protocol feature-flag 7. Protocol configuration 8. Protocol parameter One thing to consider is that there are two ways of using them: 1. Turning an optional protocol feature on/off (send LSN yes/no) 2. Configuring an optional protocol feature (compress with gzip/lz4/zstd) I think "protocol extension" is a good name for the first, but it reads/writes a bit awkwardly for the second usage, imo: 1. The wait_for_lsn protocol extension needs to be enabled. 2. I configured the compression protocol extension to be gzip. I like "protocol option" because it works for both: 1. The wait_for_lsn protocol option needs to be enabled. 2. I set the compression protocol option to gzip. I still think of these "protocol xyzs" as essentially being GUCs for the protocol, especially because they are configured the same way as GUCs in the StartupMessage. So having "protocol option" make you think of GUCs doesn't necessarily seem like a bad thing to me.
Re: Fix typos where 'the' was repeated
> On 31 Oct 2024, at 10:24, vignesh C wrote: > I noticed a couple of typos in code. "the the" should have been "the", > attached patch has the changes for the same. Fixed, thanks. -- Daniel Gustafsson
Re: Eager aggregation, take 3
hi. still trying to understand v13. found a bug. minimum test : drop table if exists t1, t2; CREATE TABLE t1 (a int, b int, c int); CREATE TABLE t2 (a int, b int, c int); SET enable_eager_aggregate TO on; explain(costs off, settings) SELECT avg(t2.a), t1.c FROM t1 JOIN t2 ON t1.b = t2.b GROUP BY t1.c having grouping(t1.c) > 0; create_agg_clause_infos foreach(lc, tlist_exprs) { Expr *expr = (Expr *) lfirst(lc); if (IsA(expr, GroupingFunc)) return; } if (root->parse->havingQual != NULL) { List *having_exprs; having_exprs = pull_var_clause((Node *) root->parse->havingQual, PVC_INCLUDE_AGGREGATES | PVC_RECURSE_PLACEHOLDERS); if (having_exprs != NIL) { tlist_exprs = list_concat(tlist_exprs, having_exprs); list_free(having_exprs); } } havingQual can have GroupingFunc. if that happens, then segmentation fault.
Re: IPC::Run::time[r|out] vs our TAP tests
> On 28 Oct 2024, at 11:56, Heikki Linnakangas wrote: > > On 09/04/2024 20:10, Daniel Gustafsson wrote: >> Turning the timeout into a timer and returning undef along with logging a >> test >> failure in case of expiration seems a bit saner (maybe Andrew can suggest an >> API which has a better Perl feel to it). Most callsites don't need any >> changes >> to accommodate for this, the attached 0002 implements this timer change and >> modify the few sites that need it, converting one to plain query() where the >> added complexity of query_until isn't required. > > +1. The patch looks good to me. I didn't comb through the tests to check for > bugs of omission, i.e. cases where the caller would need adjustments because > of this, I trust that you found them all. Thanks for review! >> =item $session->quit >> Close the session and clean up resources. Each test run must be closed with >> C. Returns TRUE when the session was cleanly terminated, otherwise >> FALSE. A testfailure will be issued in case the session failed to finish. > > What does "session failed to finish" mean? Does it mean the same as "not > cleanly terminated", i.e. a test failure is issued whenever this returns > FALSE? It was very literally referring to the finish() method. I've reworded the comment to indicated that it throws a failure in case the process returns a non-zero exit status to finish(). > typo: testfailure -> test failure Fixed. >> my $pid = $bgpsql->query('SELECT pg_backend_pid()'); >> +isnt($pid, undef, 'Get backend PID'); >> # create the database, prevent drop database via lock held by a 2PC >> transaction >> ok( $bgpsql->query_safe( > > I'm not sure I understand these changes. These can only fail if the query() > or query_until() function times out, right? In that case, the query() or > query_until() would also report a test failure, so these additional checks > after the call seem redundant. They don't do any harm either, but I wonder > why have them in these particular call sites and not in other query() or > query_until() calls. Fair point, they were added to provide additional detail in case of failure, but they are to some degree overzealous and definitely not required. Attached is a v2 with the above changes and 0003 dropped due to already being implemented. -- Daniel Gustafsson v2-0001-Configure-interactive-instance-to-restart-timer.patch Description: Binary data v2-0002-Report-test-failure-rather-than-aborting-in-case-.patch Description: Binary data
Re: MultiXact\SLRU buffers configuration
> On 29 Oct 2024, at 21:45, Thom Brown wrote: > > It clearly checks for interrupts, but when I saw this issue happen, it > wasn't interruptible. Perhaps, that was different multixacts? When I was observing this pathology on Standby, it was a stream of different reads encountering different multis. Either way startup can cancel locking process on it's own. Or is it the case that cancel was not enough, did you actually need termination, not cancel? Thanks! Best regards, Andrey Borodin.
Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Hackers, I'd like to discuss a problem with replication slots' restart LSN. Physical slots are saved to disk at the beginning of a checkpoint. At the end of the checkpoint, old WAL segments are recycled or removed from disk if they are not kept by slots' restart_lsn values. If an existing physical slot is advanced in the middle of checkpoint execution, WAL segments related to the restart LSN that was saved on disk may be removed. This is because the minimal LSN across replication slots is calculated at the end of the checkpoint, just before the old WAL segments are removed. If the postgres instance is hard-stopped (pg_ctl stop -m immediate) right after the checkpoint and then restarted, the slot's restart_lsn may point to a removed WAL segment. I believe such behaviour is not good. The doc [0] describes that restart_lsn may be set to some past value after reload. There is a discussion [1] on pgsql-hackers where this behaviour is discussed. The main reason for not flushing physical slots on advancing is performance. I'm OK with that behaviour, except that the corresponding WAL segments should not be removed. I propose to keep WAL segments based on the saved-on-disk (flushed) restart_lsn of slots: add a new field restart_lsn_flushed to the ReplicationSlot structure and copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It doesn't change the on-disk format of the slot contents. I attached a patch. It is not yet complete, but it demonstrates a way to solve the problem. I reproduced the problem in the following way: * Add some delay in CheckPointBuffers (pg_usleep) to emulate a long checkpoint. * Execute a checkpoint, and run pg_replication_slot_advance from another connection right after the checkpoint starts. * Hard-restart the server right after the checkpoint completes. * After the restart, the slot's restart_lsn may point to a removed WAL segment. The proposed patch fixes it. [0] https://www.postgresql.org/docs/current/logicaldecoding-explanation.html [1] https://www.postgresql.org/message-id/flat/059cc53a-8b14-653a-a24d-5f867503b0ee%40postgrespro.ru
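For reference, the SQL side of the reproduction above looks roughly like this (assuming a physical slot named s1 and the artificial pg_usleep delay patched into CheckPointBuffers):

  -- session 1: create the slot and start the (artificially slow) checkpoint
  SELECT pg_create_physical_replication_slot('s1', true);
  CHECKPOINT;

  -- session 2, while the checkpoint above is still running:
  SELECT pg_replication_slot_advance('s1', pg_current_wal_lsn());

  -- then: pg_ctl stop -m immediate, start the server again, and check
  -- whether the WAL segment containing restart_lsn still exists on disk
  SELECT slot_name, restart_lsn FROM pg_replication_slots;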
Re: Relcache refactoring
On 31/10/2024 10:14, Heikki Linnakangas wrote: Committed with those fixes, thanks for the review! There is a failure in the buildfarm animal 'prion', which builds with -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options. I'll look into that later today. (I did test with -DRELCACHE_FORCE_RELEASE on my laptop, but must've missed something). -- Heikki Linnakangas Neon (https://neon.tech)
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Sorry, attached the missed patch. On Thursday, October 31, 2024 13:18 MSK, "Vitaly Davydov" wrote: Dear Hackers, I'd like to discuss a problem with replication slots's restart LSN. Physical slots are saved to disk at the beginning of checkpoint. At the end of checkpoint, old WAL segments are recycled or removed from disk, if they are not kept by slot's restart_lsn values. If an existing physical slot is advanced in the middle of checkpoint execution, WAL segments, which are related to saved on disk restart LSN may be removed. It is because the calculation of the replication slot miminal LSN is occured at the end of checkpoint, prior to old WAL segments removal. If to hard stop (pg_stl -m immediate) the postgres instance right after checkpoint and to restart it, the slot's restart_lsn may point to the removed WAL segment. I believe, such behaviour is not good. The doc [0] describes that restart_lsn may be set to the some past value after reload. There is a discussion [1] on pghackers where such behaviour is discussed. The main reason of not flushing physical slots on advancing is a performance reason. I'm ok with such behaviour, except of that the corresponding WAL segments should not be removed. I propose to keep WAL segments by saved on disk (flushed) restart_lsn of slots. Add a new field restart_lsn_flushed into ReplicationSlot structure. Copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It doesn't change the format of storing the slot contents on disk. I attached a patch. It is not yet complete, but demonstate a way to solve the problem. I reproduced the problem by the following way: * Add some delay in CheckPointBuffers (pg_usleep) to emulate long checkpoint execution. * Execute checkpoint and pg_replication_slot_advance right after starting of the checkpoint from another connection. * Hard restart the server right after checkpoint completion. * After restart slot's restart_lsn may point to removed WAL segment. The proposed patch fixes it. [0] https://www.postgresql.org/docs/current/logicaldecoding-explanation.html [1] https://www.postgresql.org/message-id/flat/059cc53a-8b14-653a-a24d-5f867503b0ee%40postgrespro.ru From acae6b55fc766d2fe1bfe85eb8af85110f55dcc8 Mon Sep 17 00:00:00 2001 From: Vitaly Davydov Date: Thu, 31 Oct 2024 12:29:12 +0300 Subject: [PATCH] Keep WAL segments by slot's flushed restart LSN --- src/backend/replication/slot.c | 9 +++-- src/include/replication/slot.h | 4 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 6828100cf1..ee7ab3678e 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1148,7 +1148,9 @@ ReplicationSlotsComputeRequiredLSN(void) continue; SpinLockAcquire(&s->mutex); - restart_lsn = s->data.restart_lsn; + restart_lsn = s->restart_lsn_flushed != InvalidXLogRecPtr ? + s->restart_lsn_flushed : + s->data.restart_lsn; invalidated = s->data.invalidated != RS_INVAL_NONE; SpinLockRelease(&s->mutex); @@ -1207,7 +1209,9 @@ ReplicationSlotsComputeLogicalRestartLSN(void) /* read once, it's ok if it increases while we're checking */ SpinLockAcquire(&s->mutex); - restart_lsn = s->data.restart_lsn; + restart_lsn = s->restart_lsn_flushed != InvalidXLogRecPtr ? 
+ s->restart_lsn_flushed : + s->data.restart_lsn; invalidated = s->data.invalidated != RS_INVAL_NONE; SpinLockRelease(&s->mutex); @@ -2097,6 +2101,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel) SpinLockAcquire(&slot->mutex); memcpy(&cp.slotdata, &slot->data, sizeof(ReplicationSlotPersistentData)); + slot->restart_lsn_flushed = slot->data.restart_lsn; SpinLockRelease(&slot->mutex); diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 45582cf9d8..ca4c3aab3b 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -207,6 +207,10 @@ typedef struct ReplicationSlot /* The time since the slot has become inactive */ TimestampTz inactive_since; + + /* Latest restart LSN that was flushed to disk */ + XLogRecPtr restart_lsn_flushed; + } ReplicationSlot; #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid) -- 2.34.1
Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.
On Wed, Oct 30, 2024 at 9:36 PM Junwang Zhao wrote: > On Wed, Oct 30, 2024 at 11:58 AM jian he wrote: > > > > I missed a case when column collation and partition key collation are > > the same and indeterministic. > > that should be fine for partition-wise join. > > so v2 attached. > > > > have_partkey_equi_join, match_expr_to_partition_keys didn't do any > > collation related check. > > propose v2 change disallow partitionwise join for case when > > column collation is indeterministic *and* is differ from partition > > key's collation. > > > > the attached partition_wise_join_collation.sql is the test script. > > you may use it to compare with the master behavior. > > What if the partkey collation and column collation are both deterministic, > but with different sort order? > > I'm not familiar with this part of the code base, but it seems to me the > partition wise join should use partkey collation instead of column collation, > because it's the partkey collation that decides which partition a row to > be dispatched. > > What Jian proposed is also reasonable but seems another aspect of $subject? I think we should insist that the join key collation and the partition collation are exactly the same and refuse to match them if they are not. + { + Oid colloid = exprCollation((Node *) expr); + + if ((partcoll != colloid) && +OidIsValid(colloid) && +!get_collation_isdeterministic(colloid)) + *coll_incompatiable = true; I am not quite sure what is the point of checking whether or not the expression collation is deterministic after confirming that it's not the same as partcoll. Attached 0002 is what I came up with. One thing that's different from what Jian proposed is that match_expr_to_partition_keys() returns -1 (expr not matched to any key) when the collation is also not matched instead of using a separate output parameter for that. 0001 is the patch for the partitionwise grouping issue (bug #18568). I concluded that even partial aggregate should be disallowed when the grouping collation doesn't match partitioning collation. The result with partial partitionwise grouping is not the same as when enable_partitionwise_aggregate is off. -- Thanks, Amit Langote v3-0001-Disallow-partitionwise-grouping-when-collation-do.patch Description: Binary data v3-0002-Disallow-partitionwise-join-when-collation-doesn-.patch Description: Binary data
Re: Statistics Import and Export
> > > (c) we are considering whether to use an in-place heap update for the > relation stats, so that a large restore doesn't bloat pg_class -- I'd > like feedback on this idea > I'd also like feedback, though I feel very strongly that we should do what ANALYZE does. In an upgrade situation, nearly all tables will have stats imported, which would result in an immediate doubling of pg_class - not the end of the world, but not great either. Given the recent bugs associated with inplace updates and race conditions, if we don't want to do in-place here, we should also consider getting rid of it for ANALYZE. I briefly pondered if it would make sense to vertically partition pg_class into the stable attributes and the attributes that get modified in-place, but that list is pretty long: relpages, reltuples, relallvisible, relhasindex, reltoastrelid, relhasrules, relhastriggers, relfrozenxid, and relminmxid. If we don't want to do inplace updates in pg_restore_relation_stats(), then we could mitigate the bloat with a VACUUM FULL pg_class at the tail end of the upgrade if stats were enabled. > pg_restore_*_stats() functions. But there's a lot of overlap, so it may > be worth discussing again whether we should only have one set of > functions. > For the reason of in-place updates and error tolerance, I think they have to remain separate functions, but I'm also interested in hearing others' opinions.
Re: Count and log pages set all-frozen by vacuum
Thanks for taking a look, Alastair!

On Thu, Oct 31, 2024 at 5:47 AM Alastair Turner wrote:
>
> The values returned in a case pages are removed (cases where the empty pages
> are at the end of the table) are a bit confusing.
>
> In an example similar to yours, but with a normal vacuum operation, since
> that seems to be the most useful case for this feature:
>
> CREATE TABLE the_table (first int, second int) WITH (autovacuum_enabled = false);
> INSERT INTO the_table SELECT generate_series(1,1000), 1;
> DELETE FROM the_table WHERE first > 50;
> VACUUM (VERBOSE) the_table;
>
> pages: 4 removed, 1 remain, 5 scanned (100.00% of total)
> tuples: 950 removed, 50 remain, 0 are dead but not yet removable
> removable cutoff: 763, which was 1 XIDs old when operation ended
> new relfrozenxid: 761, which is 1 XIDs ahead of previous value
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen. 4 pages set all-frozen in the VM
>
> It looks like 4 pages out of 1 are set all-frozen. So there are -3 to scan
> for the next aggressive vacuum? The four pages which were set to all frozen
> are the same four which have been removed from the end of the table.
>
> For an example where the empty pages which are marked all frozen are at the
> start of the table (deleting values < 950 in the example), the results are
> more intuitive
>
> pages: 0 removed, 5 remain, 5 scanned (100.00% of total)
> tuples: 949 removed, 51 remain, 0 are dead but not yet removable
> removable cutoff: 768, which was 0 XIDs old when operation ended
> new relfrozenxid: 766, which is 1 XIDs ahead of previous value
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen. 4 pages set all-frozen in the VM
>
> Can the pages which are removed from the end of the table not be counted
> towards those set all frozen to make the arithmetic a bit more intuitive for
> this edge case?

This is a good point. It could be confusing to see that 1 page remains but 4
were set all-frozen in the VM.

From the perspective of the code, this is because each page is set
all-frozen/all-visible in the VM after it is pruned or vacuumed. Truncating
of the end of the table happens at the end of vacuum -- after all pages have
been processed. So, these pages are set all-frozen in the VM.

I actually don't see a good way that we could accurately decrement the
count. We have LVRelState->removed_pages but we have no idea which of those
pages are all-frozen. We could have visibilitymap_prepare_truncate() count
and return to RelationTruncate() how many of the truncated pages were
all-frozen. But we have no way of knowing how many of those pages were newly
frozen by this vacuum. And if we try to track it from the other direction,
when freezing pages, we would have to keep track of all the block numbers of
pages in the relation that were empty and set frozen and then when
truncating the relation find the overlap. That sounds hard and expensive.

It seems a better solution would be to find a way to document it or phrase
it clearly in the log. It is true that these pages were set all-frozen in
the VM. It is just that some of them were later removed. And we don't know
which ones those are. Is there a way to make this clear? And, if not, is it
worse than not having the feature at all?

- Melanie
Re: Test to dump and restore objects left behind by regression
Michael Paquier writes: > On my laptop, testing the plain format adds roughly 12s, in a test > that now takes 1m20s to run vs 1m32s. Enabling regress_dump_formats > and adding three more formats counts for 45s of runtime. For a test > that usually shows up as the last one to finish for a heavily > parallelized run. So even the default of "plain" is going to be > noticeable, I am afraid. Yeah, that's what I've been afraid of from the start. There's no way that this will buy us enough new coverage to justify that sort of addition to every check-world run. I'd be okay with adding it in a form where the default behavior is to do no additional checking. Whether that's worth maintaining is hard to say though. regards, tom lane
Re: Proper object locking for GRANT/REVOKE
Hi, On Mon, Oct 28, 2024 at 04:20:42PM +0100, Peter Eisentraut wrote: > This patch started out as a refactoring, thinking that objectNamesToOids() > in aclchk.c should really mostly be a loop around get_object_address(). > This is mostly true, with a few special cases because the node > representations are a bit different in some cases, and OBJECT_PARAMETER_ACL, > which is obviously very different. This saves a bunch of duplicative code, > which is nice. > > Additionally, get_object_address() handles locking, which > objectNamesToOids() somewhat famously does not do, and there is a code > comment about it. With this refactoring, we get the locking pretty much for > free. Thanks for the patch, this refactoring makes sense to me. A few random comments: 1 === + default: I like the idea of using default as the first "case" as a way to emphasize that this is the most "common" behavior. 2 === Nit + address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false); + Assert(relation == NULL); Worth to explain why we do expect relation to be NULL here? (the comment on top of get_object_address() says it all, but maybe a few words here could be worth it). 3 === - - /* Used GRANT DOMAIN on a non-domain? */ - if (istmt->objtype == OBJECT_DOMAIN && - pg_type_tuple->typtype != TYPTYPE_DOMAIN) - ereport(ERROR, - (errcode(ERRCODE_WRONG_OBJECT_TYPE), -errmsg("\"%s\" is not a domain", - NameStr(pg_type_tuple->typname; Yeah, get_object_address()->get_object_address_type() does take care of it. 4 === > > Interestingly, this changes the output of the intra-grant-inplace isolation > test, which is recent work. It might be good to get some review from those > who worked on that, to make sure that the new behavior is still correct > and/or to check whether those test cases are still applicable. -s4: WARNING: got: cache lookup failed for relation REDACTED +s4: WARNING: got: relation "intra_grant_inplace" does not exist I did not work on this but out of curiosity I looked at it, and IIUC this change comes from: - relOid = RangeVarGetRelid(relvar, NoLock, false); + relOid = RangeVarGetRelid(relvar, lockmode, false); So without the lock, the error is coming from ExecGrant_Relation() that way: [local] DO(+0x22529b) [0x5c56afdb629b] [local] DO(+0x2220b0) [0x5c56afdb30b0] [local] DO(ExecuteGrantStmt+0x696) [0x5c56afdb3047] [local] DO(+0x6dc8cb) [0x5c56b026d8cb] [local] DO(standard_ProcessUtility+0xd38) [0x5c56b026b9da] [local] DO(ProcessUtility+0x13a) [0x5c56b026ac9b] [local] DO(+0x460073) [0x5c56afff1073] With the lock, the error comes from RangeVarGetRelidExtended() that way: [local] DO(RangeVarGetRelidExtended+0x4e6) [0x5a1c3ac49d36] [local] DO(+0x2223fe) [0x5a1c3ac2a3fe] [local] DO(ExecuteGrantStmt+0x111) [0x5a1c3ac29ac2] [local] DO(+0x6dc1d2) [0x5a1c3b0e41d2] [local] DO(standard_ProcessUtility+0xd38) [0x5a1c3b0e22e1] [local] DO(ProcessUtility+0x13a) [0x5a1c3b0e15a2] That's due to (in RangeVarGetRelidExtended()): " * But if lockmode = NoLock, then we assume that either the caller is OK * with the answer changing under them, or that they already hold some * appropriate lock, and therefore return the first answer we get without * checking for invalidation messages. " So, in the RangeVarGetRelid(relvar, NoLock, false) case (without the patch) then if we are able to reach "relId = RelnameGetRelid(relation->relname);" before the drop gets committed then we get the "cache lookup failed" error. 
But if we are slow enough so that we don't reach "relId = RelnameGetRelid(relation->relname);" before the drop get committed then we would also get the "relation "intra_grant_inplace" does not exist" error (tested manually, attaching gdb on s4 and breakpoint before the RelnameGetRelid(relation->relname) call in RangeVarGetRelidExtended()). The fact that we use lockmode != NoLock in the patch produces a lock followed by a "retry" in RangeVarGetRelidExtended() and so we get the "relation "intra_grant_inplace" does not exist" error. I think that the new behavior is still correct and in fact is more "appropriate" ( I mean that's the kind of error I expect to see from a user point of view). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Separate memory contexts for relcache and catcache
Hi Jeff,

Jeff Davis wrote on Wed, 30 Oct 2024 at 01:00:

> On Wed, 2024-04-03 at 16:12 +0300, Melih Mutlu wrote:
> > Rebased. PSA.
>
> Thank you. I missed your patch and came up with a similar patch over
> here:
>
> https://www.postgresql.org/message-id/flat/78599c442380ddb5990117e281a4fa65a74231af.ca...@j-davis.com
>
> I closed my thread and we can continue this one.

Thanks for being interested in this patch. I simply merged your patch and
mine, which was pretty easy as your patch is quite similar to mine but
covers many more caches.

> One difference is that I tried to capture almost all uses of
> CacheMemoryContext so that it would become just a parent context
> without many allocations of its own.
>
> The plan cache and SPI caches can be important, too. Or, one of the
> other caches that we expect to be small might grow in some edge cases
> (or due to a bug), and it would be good to be able to see that.

My only concern would be allocating too much memory for each cache type
unnecessarily. Especially for the ones that are expected to be small most of
the time, would we see cases where we waste too much memory when the number
of backends increases? Or maybe having separate contexts for each cache
wouldn't hurt at all. I don't really have enough knowledge about each cache
type.

I only quickly checked the memory usage right after starting a backend to
ensure that the patch does not slow down backend starts. We now have an
additional TypCacheContext, which was previously using CacheMemoryContext,
with 8KB init size. I'm not sure how much we should be worried about this
additional 8KB. I think we can reduce the initial size of CacheMemoryContext
instead, assuming that CacheMemoryContext wouldn't have many allocations of
its own anymore.

> I agree with others that we should look at changing the initial size or
> type of the contexts, but that should be a separate commit.

Attached a separate patch to change initial sizes for relcache and catcache
contexts as they grow large from the start. This was suggested in the thread
previously [1]. Also changed CacheMemoryContext to use
ALLOCSET_START_SMALL_SIZES, so it starts from 1KB.

Regards,
--
Melih Mutlu
Microsoft

v4-0002-Adjusting-cache-memory-context-sizes.patch Description: Binary data
v4-0001-Separate-memory-contexts-for-caches.patch Description: Binary data
Re: Inval reliability, especially for inplace updates
Hello Noah, 31.10.2024 04:39, Noah Misch wrote: I had pushed this during the indicated week, before your mail. Reverting it is an option. Let's see if more opinions arrive. I've accidentally discovered an incorrect behaviour caused by commit 4eac5a1fa. Running this script: for ((j=1; j<=100; j++)); do echo "iteration $j" cat << 'EOF' | timeout 60 psql >>psql-$SID.log || { res=1; echo "hanged on iteration $j"; break; } SELECT format('CREATE TABLE t%s (a int, b text);', g) FROM generate_series(1, 50) g \gexec SELECT format('DROP TABLE t%s;', g) FROM generate_series(1, 50) g \gexec EOF done with autovacuum = on autovacuum_naptime = 1s autovacuum_vacuum_threshold = 1 autovacuum_analyze_threshold = 1 in parallel using separate servers (the full script is attached), like: parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40` I can catch the following: ... 3 hanged on iteration 51 ... 19 hanged on iteration 64 ... 39 hanged on iteration 99 And after the script run, I see the server processes hanging: law 1081433 1 0 16:22 ? 00:00:00 .../usr/local/pgsql/bin/postgres law 1081452 1081433 0 16:22 ? 00:00:00 postgres: checkpointer law 1081453 1081433 0 16:22 ? 00:00:00 postgres: background writer law 1081460 1081433 0 16:22 ? 00:00:00 postgres: walwriter law 1081462 1081433 0 16:22 ? 00:00:00 postgres: autovacuum launcher law 1081464 1081433 0 16:22 ? 00:00:00 postgres: logical replication launcher law 1143065 1081433 0 16:32 ? 00:00:00 postgres: postgres postgres [local] CREATE TABLE law 1143263 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker postgres law 1143320 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker postgres law 1143403 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker Attaching to process 1143065 ... (gdb) bt #0 __futex_abstimed_wait_common64 (private=, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:57 #1 __futex_abstimed_wait_common (cancel=true, private=, abstime=0x0, clockid=0, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:87 #2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fed9a8171b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=) at ./nptl/futex-internal.c:139 #3 0x7feda4674c5f in do_futex_wait (sem=sem@entry=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111 #4 0x7feda4674cf8 in __new_sem_wait_slow64 (sem=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183 #5 0x561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a8171b8) at pg_sema.c:327 #6 0x561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318 #7 0x561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182 #8 0x561dd6d4f506 in heapam_index_fetch_tuple (scan=0x561dd8cb6588, tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28, slot=0x561dd8cb75a0, call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6) at heapam_handler.c:146 ... (the full backtrace is attached) All three autovacuum workers (1143263, 1143320, 1143403) are also waiting for the same buffer lock: #5 0x561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327 #6 0x561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318 #7 0x561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182 Probably, this can be reproduced with VACUUM pg_class/pg_type/..., but I haven't found out the exact combination needed yet. Also as a side note, these processes can't be terminated with SIGTERM, I have to kill them. 
Initially I saw this on a slowed down VM, but with the attached patch applied I could reproduce it on my workstation too. Best regards, Alexander reproi.sh Description: application/shellscript diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 2168259247..730ef9b5a2 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -138,6 +138,10 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan, heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf); } +uint64 r = 0; +for (uint64 i = 0; i < 10; i++) r += (r ^ i); +Assert(r != 0); + /* Obtain share-lock on the buffer so we can examine visibility */ LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE); got_heap_tuple = heap_hot_search_buffer(tid, (gdb) bt #0 __futex_abstimed_wait_common64 (private=, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:57 #1 __futex_abstimed_wait_common (cancel=true, private=, abstime=0x0, clockid=0, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:87 #2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fed9a8171b
Re: Consider the number of columns in the sort cost model
On 29.10.2024 05:47, Andrei Lepikhov wrote: On 10/28/24 16:48, Alena Rybakina wrote: On 23.10.2024 04:39, Andrei Lepikhov wrote: On 15/10/2024 12:15, David Rowley wrote: And the last patch is a demo of how I'm going to use the previous three patches and add one more strategy to improve the order of columns in the GROUP-BY clause. To be honest, I didn’t find information about this in the code, but did I understand everything correctly? Yes Thanks for the confirmation. 2. I noticed that statistics of distinct values are calculated several times. Maybe I miss something, but this can be avoided if these statistics can be saved after calculation. For example, I saw that it is possible that you calculate the same statistic information for the same equivalence members in cost_incremental_sort and identify_sort_ecmember. Is it possible to store information in a structure and use it later? Hmm, I don't see multiple calculations. em_distinct has made specifically for this sake. Can you provide specific case and code lines? When you said that, I saw it. You are right, checking for 0 is what stops us from useless recalculation. I noticed it right away. 3. I think you should initialize the variable ndist in your patch. I faced the compile complaining during your code compilation. costsize.c: In function ‘identify_sort_ecmember’: costsize.c:6694:42: error: ‘ndist’ may be used uninitialized [-Werror=maybe-uninitialized] 6694 | em->em_ndistinct = ndist; | ~^ costsize.c:6680:33: note: ‘ndist’ was declared here 6680 | double ndist; | ^~~ cc1: all warnings being treated as errors gmake[4]: *** [: costsize.o] Error 1 I think you can just update your compiler. But I added the ndist initialisation to make more compilers happy :). Thank you :) + Assert (node != NULL); + examine_variable(root, node, 0, &vardata); if (!HeapTupleIsValid(vardata.statsTuple)) continue; I don't think so. At least until you provide the case when the get_sortgroupclause_expr function returns NULL. That's more, remember - the patch 0004 here is just to show the perspective and still under construction. Anyway, thanks, I found out that the patch set doesn't apply correctly because of 828e94c. So, see the new version in the attachment. Yes, you are right, if this case is possible, then it is connected with an error in the formation of the target list. I played around with the examples a bit and couldn't figure out something. When I added the same values to different columns - firstly in a, later in b, the order of the columns for sort operation doesn't change. Isn't this a mistake? 
create table a (x1 int, y1 int); create table b (x2 int, y2 int); insert into a values (NULL, NULL); insert into a values (NULL, 1); insert into a values (1, 1); insert into a values (1, NULL); create index a_x1_idx on a(x1); create index b_x2_idx on b(x2); create index a_y1_idx on a(y1); create index b_y2_idx on b(y2); insert into b select 1, 2 from generate_series(11,20) as id; insert into b select 1, 1 from generate_series(1,10) as id; insert into b select 1, 3 from generate_series(3,30) as id; explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2; QUERY PLAN Hash Right Join (cost=44.99..48.15 rows=5 width=12) (actual time=0.225..0.250 rows=8 loops=1) Hash Cond: (b.x2 = a.x1) -> HashAggregate (cost=43.90..46.16 rows=226 width=8) (actual time=0.117..0.123 rows=3 loops=1) Group Key: b.x2, b.y2 Batches: 1 Memory Usage: 40kB -> Seq Scan on b (cost=0.00..32.60 rows=2260 width=8) (actual time=0.030..0.044 rows=48 loops=1) -> Hash (cost=1.04..1.04 rows=4 width=4) (actual time=0.073..0.074 rows=4 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.047..0.051 rows=4 loops=1) Planning Time: 1.649 ms Execution Time: 0.485 ms (11 rows) delete from b; insert into b select 2, 1 from generate_series(11,20) as id; insert into b select 1, 1 from generate_series(1,10) as id; insert into b select 3, 1 from generate_series(3,30) as id; vacuum analyze; explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2; QUERY PLAN --- Hash Left Join (cost=1.79..2.86 rows=4 width=12) (actual time=0.083..0.090 rows=4 loops=1) Hash Cond: (a.x1 = b.x2) -> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.010..0.0
Re: Having problems generating a code coverage report
Hi,

> On Wed, Oct 30, 2024 at 05:26:49PM -0400, Peter Geoghegan wrote:
> > I use Debian unstable for most of my day to day work. Apparently
> > Debian unstable has exactly the same version of lcov as Ubuntu 24.04.
> >
> > I've also been unable to generate coverage reports for some time (at
> > least on Debian, with LCOV version 2.0-1).
>
> So do I, for both the debian SID and the lcov parts.

I downgraded to lcov 1.16 [1] and it helped. This is merely a workaround of
course, not a long-term solution.

[1]: https://github.com/linux-test-project/lcov/releases/tag/v1.16

--
Best regards,
Aleksander Alekseev
Re: protocol-level wait-for-LSN
On Thu, 31 Oct 2024 at 13:59, Jesper Pedersen wrote:
> And, it will be confusing to clean-room implementations: When you have
> this startup parameter then you get these message formats, when you have
> this startup parameter then you get these message formats -- and what
> about combinations ? Like Tatsuo-san stated up-thread.

I really don't understand why you think that's so difficult. To be clear, no
client is forced to implement any of these protocol options. And all of
these protocol options would be documented in the official protocol docs.
For instance the ReadyForQuery docs on the "Message Formats" page in the
docs could easily be made to look like the following, which imho would be
very clear to any implementer of the protocol about ordering of these
fields:

ReadyForQuery (B)

    Byte1('Z')
        Identifies the message type. ReadyForQuery is sent whenever the
        backend is ready for a new query cycle.

    Int32
        Length of message contents in bytes, including self.

    Int64: Only present if protocol option wait_for_lsn is set to 1 by the client
        The LSN at time of commit

    Int64: Only present if protocol option query_time is set to 1 by the client
        Time it took to run the query in microseconds

    Byte1
        Current backend transaction status indicator. Possible values are
        'I' if idle (not in a transaction block); 'T' if in a transaction
        block; or 'E' if in a failed transaction block (queries will be
        rejected until block is ended).

> You are also assuming that all PostgreSQL protocol implementations uses
> the Length (Int32) field very strict - so when one developer adds the
> startup parameter, but doesn't change the underlying implementation
> everything will break.

Yes... But that seems equivalent to saying: If a developer of a Postgres
client advertises that they support protocol v4, but don't actually
implement it, then everything will break. i.e. it's the job of the client
author to not send protocol options that it doesn't know anything about.
Just like it's the job of the client author not to request versions that it
does not know anything about.

> The protocol format must be 100% clear and well-documented in all cases.

Agreed. See above.

> If this door is open then it has to very clear how multiple startup
> parameters are handled at the protocol level, and that is a much bigger
> fish because what happens if extensions add startup parameters as well.

Postgres extensions **cannot** add such startup parameters. Heikki already
mentioned that the naming was confusing in the docs. At this point in time
we're only discussing protocol changes that are coming from Postgres core
(which is already a contentious enough topic).
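To make the field-ordering rule concrete, here is a minimal client-side
sketch (hypothetical helper names, not libpq code, and the optional-field
layout is only the draft sketched above): the client knows which optional
Int64 fields to expect in the ReadyForQuery body because it knows which
protocol options it requested.

/*
 * Sketch only: parse the body of the hypothetical ReadyForQuery layout
 * above (after the 'Z' type byte and the Int32 length have been consumed).
 * Optional Int64 fields appear, in documented order, only when the client
 * requested the corresponding protocol option.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t
get_uint64_be(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)			/* network byte order */
		v = (v << 8) | p[i];
	return v;
}

static void
parse_ready_for_query(const uint8_t *body, bool wait_for_lsn, bool query_time)
{
	if (wait_for_lsn)
	{
		printf("commit LSN: %llu\n", (unsigned long long) get_uint64_be(body));
		body += 8;
	}
	if (query_time)
	{
		printf("query time: %llu us\n", (unsigned long long) get_uint64_be(body));
		body += 8;
	}
	printf("transaction status: %c\n", (char) *body);
}

int
main(void)
{
	/* fake body: Int64 LSN = 42, Int64 duration = 7, status byte 'I' */
	uint8_t		body[17] = {0};

	body[7] = 42;
	body[15] = 7;
	body[16] = 'I';
	parse_ready_for_query(body, true, true);
	return 0;
}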
Re: UUID v7
> On 1 Nov 2024, at 03:00, Masahiko Sawada wrote:
>
> Therefore, if the system clock moves backward due to NTP, we cannot
> guarantee monotonicity and sortability. Is that right?

Not exactly. Monotonicity is ensured for a given backend. We make sure that
timestamp is advanced at least for ~250ns forward on each UUID generation.
60 bits of time are unique and ascending for a given backend.

Thanks!

Best regards, Andrey Borodin.
Re: Pgoutput not capturing the generated columns
On Thu, Oct 31, 2024 at 3:16 AM vignesh C wrote: > Thanks for committing this patch, here is a rebased version of the > remaining patches. > Hi Vignesh. Here are my review comments for the docs patch v1-0002. == Commit message 1. This patch updates docs to describe the new feature allowing replication of generated columns. This includes addition of a new section "Generated Column Replication" to the "Logical Replication" documentation chapter. ~ That first sentence was correct previously when this patch contained *all* the gencols documentation, but now some of the feature docs are already handled by previous patches, so the first sentence can be removed. Now patch 0002 is only for adding the new chapter, plus the references to it. ~ /This includes addition of a new section/This patch adds a new section/ == doc/src/sgml/protocol.sgml 2. - Next, one of the following submessages appears for each column (except generated columns): + Next, one of the following submessages appears for each column: AFAIK this simply cancels out a change from the v1-0001 patch which IMO should have not been there in the first place. Please refer to my v1-0001 review for the same. == Kind Regards, Peter Smith. Fujitsu Australia
Re: define pg_structiszero(addr, s, r)
On Thu, 31 Oct 2024 at 19:17, Michael Paquier wrote:
> Seems kind of OK seen from here. I am wondering if others more
> comments about the name of the macro, its location, the fact that we
> still have pagebytes in bufpage.c, etc.

This change looks incorrect:

@@ -126,18 +124,9 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 	}

 	/* Check all-zeroes case */
-	all_zeroes = true;
 	pagebytes = (size_t *) page;
-	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
-	{
-		if (pagebytes[i] != 0)
-		{
-			all_zeroes = false;
-			break;
-		}
-	}
-
-	if (all_zeroes)
+	if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t))))
 		return true;

I think this should be passing BLCKSZ rather than (BLCKSZ /
sizeof(size_t)), otherwise, it'll just check the first 1 kilobyte is
zero rather than the entire page.

However, I've also performance concerns as if I look at the assembly
of PageIsVerifiedExtended, I see the zero checking is now done 1 byte
at a time:

(gcc 11.4)

	leaq	1024(%rbx), %rdx	<-- 1KB bug
	.p2align 4,,10
	.p2align 3
.L60:
	cmpb	$0, (%rax)		<-- check 1 byte is zero.
	jne	.L59
	addq	$1, %rax		<-- increment by 1 byte.
	cmpq	%rax, %rdx		<-- check if we've done 1024 bytes yet.
	jne	.L60

Whereas previously it was doing:

	movq	%rbx, %rax
	leaq	8192(%rbx), %rdx	<-- 8KB
	jmp	.L60
	.p2align 4,,10
	.p2align 3
.L90:
	addq	$8, %rax		<-- increment by 8 bytes (or sizeof(size_t))
	cmpq	%rax, %rdx
	je	.L61
.L60:
	cmpq	$0, (%rax)		<-- checks an entire 8 bytes are zero.

I didn't test how performance-critical this is, but the header comment
for this function does use the words "cheaply detect".

 * This is called when a page has just been read in from disk.  The idea is
 * to cheaply detect trashed pages before we go nuts following bogus line
 * pointers, testing invalid transaction identifiers, etc.

so it seems a bit dangerous to switch this loop to byte-at-a-time
rather than doing 8 bytes at once without testing the performance
isn't affected.

David
Re: Showing applied extended statistics in explain Part 2
Hi All, I apologize for not being able to continue development due to various circumstances. The attached file is the rebased patch. I will now catch up on the discussion and try to revise the patch. Regards, Tatsuro Yamada On Fri, Jul 19, 2024 at 7:17 PM wrote: > > On 7/18/24 12:37, masahiro.ik...@nttdata.com wrote: > > >> Let me share my opinion on those questions ... > > > ...> > > >> For ndistinct, I think we don't show this because it doesn't go > > >> through clauselist_selectivity, which is the only thing I modified in > the PoC. > > >> But I guess we might improve estimate_num_groups() to track the stats > > >> in a similar way, I guess. > > > > > > Thanks. IIUC, the reason is that it doesn't go through > > > statext_clauselist_selectivity() because the number of clauses is one > though it goes > > through clauselist_selectivity(). > > > > > > > Ah, I see I misunderstood the original report. The query used was > > > > EXPLAIN (STATS, ANALYZE) SELECT * FROM t3 > > WHERE date_trunc('month', a) = '2020-01-01'::timestamp; > > > > And it has nothing to do with the number of clauses being one neither. > > > > The problem is this estimate is handled by examine_variable() matching > the expression > > to the "expression" stats, and injecting it into the variable, so that > the > > clauselist_selectivity() sees these stats. > > > > This would happen even if you build just expression statistics on each > of the > > date_trunc() calls, and then tried a query with two clauses: > > > > CREATE STATISTICS s4 ON date_trunc('day', a) FROM t3; > > CREATE STATISTICS s3 ON date_trunc('month', a) FROM t3; > > > > EXPLAIN SELECT * FROM t3 > > WHERE date_trunc('month', a) = '2020-01-01'::timestamp > > AND date_trunc('day', 'a') = '2020-01-01'::timestamp; > > > > Not sure how to handle this - we could remember when explain_variable() > injects > > statistics like this, I guess. But do we know that each call to > > examine_variable() is for estimation? And do we know for which clause? > > I see. The issue is related to extended statistics for single expression. > As a > first step, it's ok for me that we don't support it. > > The below is just an idea to know clauses... > Although I'm missing something, can callers of examine_variable() > for estimation to rebuild the clauses from partial information of "OpExpr"? > > Only clause_selectivity_ext() knows the information of actual full clauses. > But we don't need full information. It's enough to know the information > to show "OpExpr" for EXPLAIN. > > get_oper_expr() deparse "OpExpr" using only the operator oid and arguments > in get_oper_expr(). > > If so, the caller to estimate, for example eqsel_internal(), > scalarineqsel_wrapper() > and so on, seems to be able to know the "OpExpr" information, which are > operator > oid and arguments, and used extended statistics easily to show for EXPLAIN. > > # Memo: the call path of the estimation function > caller to estimate selectivity > (eqsel_internal()/scalargtjoinsel_wrappter()/...) > -> get_restriction_variable()/get_join_valiables() >-> examine_variable() > > > > >>> ERROR: unrecognized node type: 268 > > > > > > Regarding the above error, do "applied_stats" need have the list of > "StatisticExtInfo" > > > because it's enough to have the list of Oid(stat->statOid) for EXPLAIN > output in the > > current patch? > > > change_to_applied_stats_has_list_of_oids.diff is the change I assumed. > > > Do you have any plan to show extra information for example "kind" of > > "StatisticExtInfo"? 
> > > > > > The above is just one idea came up with while I read the following > > > comments of header of pathnodes.h, and to support copy > "StatisticExtInfo" will leads > > many other nodes to support copy. > > > * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes. > > > * There are some subsidiary structs that are useful to copy, though. > > > > > > > I do think tracking just the OID would work, because we already know how > to copy List > > objects. But if we want to also track the clauses, we'd have to keep > multiple lists, right? > > That seems a bit inconvenient. > > Understood. In future, we might show not only the applied_clauses but also > the clauses of > its extended statistics (StatisticExtInfo->exprs). > > > > > By the way, I found curios result while I tested with the above patch. > It shows same > > "Ext Stats" twice. > > > I think it's expected behavior because the stat is used when estimate > the cost of > > "Partial HashAggregate" and "Group". > > > I've shared the result because I could not understand soon when I saw > > > it first time. I think it's better to let users understand when the > stats are used, but I > > don't have any idea now. > > > > > > -- I tested with the example of CREATE STATISTICS documentation. > > > psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), > date_trunc('day', > > a) FROM t3 GROUP BY 1, 2; > > >
Re: Eager aggregation, take 3
On Thu, Oct 31, 2024 at 12:25 AM Robert Haas wrote: > Well, the key thing here is the reasoning, which you don't really > spell out either here or in the patch. The patch's justification for > the use of BTEQUALIMAGE_PROC Is that, if we use non-equal-image > operators, "we may lose some information that could be needed to > evaluate upper qual clauses." But there's no explanation of why that > would happen, and I think that makes it difficult to have a good > discussion. > > Here's an example from the optimizer README: > > # 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc) > # = A leftjoin (B leftjoin C on (Pbc)) on (Pab) > # > # Identity 3 only holds if predicate Pbc must fail for all-null B rows > # (that is, Pbc is strict for at least one column of B). If Pbc is not > # strict, the first form might produce some rows with nonnull C columns > # where the second form would make those entries null. > > This isn't the clearest justification for a rule that I've ever read, > but it's something. Someone reading this can think about whether the > two sentences of justification are a correct argument that Pbc must be > strict in order for the identity to hold. > > I think you should be trying to offer similar (or better) > justifications, and perhaps similar identities as well. It's a little > bit hard to think about how to write the identities for this patch, > although you could start with aggregate(X) = > finalizeagg(partialagg(X)) and then explain how the partialagg can be > commuted with certain joins and what is required for it to do so. The > argument needs to be detailed enough that we can evaluate whether it's > correct or not. > > Personally, I'm still stuck on the point I wrote about yesterday. When > you partially aggregate a set of rows, the resulting row serves as a > proxy for the original set of rows. So, all of those rows must have > the same destiny. As I mentioned then, if you ever partially aggregate > a row that should ultimately be filtered out with one that shouldn't, > that breaks everything. But also, I need all the rows that feed into > each partial group to be part of the same set of output rows. For > instance, suppose I run a report like this: > > SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l > WHERE o.order_id = l.order_id GROUP BY 1; > > If I'm thinking of executing this as FinalizeAgg(order JOIN > PartialAgg(order_lines)), what must be true in order for that to be a > valid execution plan? I certainly can't aggregate on some unrelated > column, say part_number. If I do, then first of all I don't know how > to get an order_id column out so that I can still do the join. But > even if that were no issue, you might have two rows with different > values of the order_id column and the same value for part_number, and > that the partial groups that you've created on the order_lines table > don't line up properly with the groups that you need to create on the > orders table. Some particular order_id might need some of the rows > that went into a certain part_number group and not others; that's no > good. > > After some thought, I think the process of deduction goes something > like this. We ultimately need to aggregate on a column in the orders > table, but we want to partially aggregate the order_lines table. So we > need a set of columns in the order_lines table that can act as a proxy > for o.order_number. Our proxy should be all the columns that appear in > the join clauses between orders and order_lines. 
That way, all the > rows in a single partially aggregate row will have the "same" order_id > according to the operator class implementing the = operator, so for a > given row in the "orders" table, either every row in the group has > o.order_id = l.order_id or none of them do; hence they're all part of > the same set of output rows. > > It doesn't matter, for example, whether o.order_number is unique. If > it isn't, then we'll flatten together some different rows from the > orders table -- but each of those rows will match all the rows in > partial groupings where o.order_id = partially_grouped_l.order_id and > none of the rows where that isn't the case, so the optimization is > still valid. Likewise it doesn't matter whether o.order_id is unique: > if order_id = 1 occurs twice, then both of the rows will match the > partially aggregated group with order_id = 1, but that's fine. The > only thing that's a problem is if the same row from orders was going > to match some but not all of the partially aggregate rows from some > order_lines group, and that won't happen here. > > Now consider this example: > > SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l > WHERE o.order_id = l.order_id AND o.something < l.something GROUP BY > 1; > > Here, we cannot partially group order_lines on just l.order_id, > because we might have a row in the orders table with order_id = 1 and > something = 1 and two rows in the order_lines table that both have
Re: define pg_structiszero(addr, s, r)
On Fri, Nov 01, 2024 at 05:45:44PM +1300, David Rowley wrote:
> I think this should be passing BLCKSZ rather than (BLCKSZ /
> sizeof(size_t)), otherwise, it'll just check the first 1 kilobyte is
> zero rather than the entire page.

Ugh, Friday brain fart. The attached should be able to fix that, this
brings back the movl to its correct path:
-	movl	$1024, %esi
+	movl	$8192, %esi

Does that look fine to you?

> I didn't test how performance-critical this is, but the header comment
> for this function does use the words "cheaply detect".

Under gcc -O2 or -O3, the single-byte check and the 8-byte check don't make
a difference. Please see the attached (allzeros.txt) for a quick check if
you want to check by yourself. Both average around 3ms for 1M iterations on
my laptop (not the fastest thing around).

Under -O0, though, the difference is noticeable:
- 1-byte check: 3.52s for 1M iterations, averaging one check at 3.52ns.
- 8-byte check: 0.46s for 1M iterations, averaging one check at 0.46ns.

Even for that, I doubt that this is going to be noticeable in practice,
still the difference exists.
--
Michael

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BLCKSZ 8192
#define LOOPS 1000000 /* 1M */

static inline bool
allzeros_char(const void *ptr, size_t len)
{
	const char *p = (const char *) ptr;

	for (size_t i = 0; i < len; i++)
	{
		if (p[i] != 0)
			return false;
	}
	return true;
}

static inline bool
allzeros_size_t(const void *ptr, size_t len)
{
	const size_t *p = (const size_t *) ptr;

	for (size_t i = 0; i < len / sizeof(size_t); i++)
	{
		if (p[i] != 0)
			return false;
	}
	return true;
}

int
main()
{
	size_t pagebytes[BLCKSZ] = {0};

	for (int i = 0; i < LOOPS; i++)
	{
		allzeros_char(pagebytes, BLCKSZ);
		//allzeros_size_t(pagebytes, BLCKSZ);
	}

	return 0;
}

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 5ee1e58cd4..db5954def2 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -88,7 +88,6 @@ bool
 PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 {
 	PageHeader	p = (PageHeader) page;
-	size_t	   *pagebytes;
 	bool		checksum_failure = false;
 	bool		header_sane = false;
 	uint16		checksum = 0;
@@ -124,9 +123,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 	}
 
 	/* Check all-zeroes case */
-	pagebytes = (size_t *) page;
-
-	if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t))))
+	if (pg_memory_is_all_zeros(page, BLCKSZ))
 		return true;
 
 	/*
Re: Eager aggregation, take 3
On Thu, Oct 31, 2024 at 9:16 PM jian he wrote: > > hi. > still trying to understand v13. found a bug. > > minimum test : > drop table if exists t1, t2; > CREATE TABLE t1 (a int, b int, c int); > CREATE TABLE t2 (a int, b int, c int); > SET enable_eager_aggregate TO on; > explain(costs off, settings) SELECT avg(t2.a), t1.c FROM t1 JOIN t2 ON > t1.b = t2.b GROUP BY t1.c having grouping(t1.c) > 0; > > havingQual can have GroupingFunc. > if that happens, then segmentation fault. Nice catch. Thanks. create_agg_clause_infos does check for GROUPING() expressions, but not in the right place. Will fix it in the next version. Thanks Richard
Re: UUID v7
On Thu, Oct 31, 2024 at 9:53 PM Andrey M. Borodin wrote:
>
> > On 1 Nov 2024, at 03:00, Masahiko Sawada wrote:
> >
> > Therefore, if the system clock moves backward due to NTP, we cannot
> > guarantee monotonicity and sortability. Is that right?
>
> Not exactly. Monotonicity is ensured for a given backend. We make sure that
> timestamp is advanced at least for ~250ns forward on each UUID generation.
> 60 bits of time are unique and ascending for a given backend.
>

Thank you for your explanation. I now understand this code guarantees the
monotonicity:

+/* minimum amount of ns that guarantees step of increased_clock_precision */
+#define SUB_MILLISECOND_STEP (1000000/4096 + 1)
+	ns = get_real_time_ns();
+	if (previous_ns + SUB_MILLISECOND_STEP >= ns)
+		ns = previous_ns + SUB_MILLISECOND_STEP;
+	previous_ns = ns;

I think that one of the most important parts in UUIDv7 implementation is
which method (1, 2, or 3 described in RFC 9562) we use to guarantee the
monotonicity. The current patch employs method 3 with the assumption that
12 bits of sub-millisecond information is available on most of the systems
we support. However, as far as I tested, on MacOS, values returned by
clock_gettime(CLOCK_REALTIME) are only microsecond precision, meaning that
we could waste some randomness. Has this point been considered?

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
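For anyone who wants to check this on their own platform, here is a small
standalone probe (an illustration only, not part of the patch): it reports
both the resolution advertised by clock_getres() and the smallest nonzero
step actually observed from clock_gettime(CLOCK_REALTIME). On a clock that
only ticks in microseconds, the observed deltas are multiples of 1000 ns,
so part of the 12-bit sub-millisecond field that method 3 relies on carries
no real timing information.

/*
 * Standalone probe: report the advertised and observed precision of
 * CLOCK_REALTIME.
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec res;
	struct timespec prev;
	struct timespec cur;
	int64_t		min_step = INT64_MAX;

	clock_getres(CLOCK_REALTIME, &res);
	printf("advertised resolution: %ld s %ld ns\n",
		   (long) res.tv_sec, (long) res.tv_nsec);

	clock_gettime(CLOCK_REALTIME, &prev);
	for (int i = 0; i < 1000000; i++)
	{
		int64_t		delta;

		clock_gettime(CLOCK_REALTIME, &cur);
		delta = (int64_t) (cur.tv_sec - prev.tv_sec) * 1000000000 +
			(cur.tv_nsec - prev.tv_nsec);
		if (delta > 0 && delta < min_step)
			min_step = delta;
		prev = cur;
	}
	printf("smallest observed step: %lld ns\n", (long long) min_step);

	return 0;
}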
Re: define pg_structiszero(addr, s, r)
Hi,

On Fri, Nov 01, 2024 at 05:45:44PM +1300, David Rowley wrote:
> On Thu, 31 Oct 2024 at 19:17, Michael Paquier wrote:
> > Seems kind of OK seen from here. I am wondering if others more
> > comments about the name of the macro, its location, the fact that we
> > still have pagebytes in bufpage.c, etc.
>
> This change looks incorrect:
>
> @@ -126,18 +124,9 @@ PageIsVerifiedExtended(Page page, BlockNumber
> blkno, int flags)
>  	}
>
>  	/* Check all-zeroes case */
> -	all_zeroes = true;
>  	pagebytes = (size_t *) page;
> -	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
> -	{
> -		if (pagebytes[i] != 0)
> -		{
> -			all_zeroes = false;
> -			break;
> -		}
> -	}
> -
> -	if (all_zeroes)
> +	if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t))))
>  		return true;
>
> I think this should be passing BLCKSZ rather than (BLCKSZ /
> sizeof(size_t)), otherwise, it'll just check the first 1 kilobyte is
> zero rather than the entire page.

Thanks for looking at it! Nice catch, indeed by using the new function we
are changing the pointer arithmetic here and then we should pass BLCKSZ
instead.

> However, I've also performance concerns as if I look at the assembly
> of PageIsVerifiedExtended, I see the zero checking is now done 1 byte
> at a time:
>
> (gcc 11.4)
>
> 	leaq	1024(%rbx), %rdx	<-- 1KB bug
> 	.p2align 4,,10
> 	.p2align 3
> .L60:
> 	cmpb	$0, (%rax)		<-- check 1 byte is zero.
> 	jne	.L59
> 	addq	$1, %rax		<-- increment by 1 byte.
> 	cmpq	%rax, %rdx		<-- check if we've done 1024 bytes yet.
> 	jne	.L60
>
> Whereas previously it was doing:
>
> 	movq	%rbx, %rax
> 	leaq	8192(%rbx), %rdx	<-- 8KB
> 	jmp	.L60
> 	.p2align 4,,10
> 	.p2align 3
> .L90:
> 	addq	$8, %rax		<-- increment by 8 bytes (or sizeof(size_t))
> 	cmpq	%rax, %rdx
> 	je	.L61
> .L60:
> 	cmpq	$0, (%rax)		<-- checks an entire 8 bytes are zero.
>
> I didn't test how performance-critical this is, but the header comment
> for this function does use the words "cheaply detect".
>
>  * This is called when a page has just been read in from disk.  The idea is
>  * to cheaply detect trashed pages before we go nuts following bogus line
>  * pointers, testing invalid transaction identifiers, etc.
>
> so it seems a bit dangerous to switch this loop to byte-at-a-time
> rather than doing 8 bytes at once without testing the performance
> isn't affected.

Agree, I did a quick test (see [0]) and it looks like it's significantly
slower using the new inline function.

We could try to write a more elaborate version of pg_memory_is_all_zeros(),
but as it looks like there is only one use case, then it's probably better
to not implement (revert) this change here and "just" add a comment as to
why pg_memory_is_all_zeros() should not be used here, thoughts?

[0]: https://godbolt.org/z/xqnW4MPY5

Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
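For reference, a word-at-a-time variant of the idea is not much code. The
sketch below is a standalone illustration only; the name and structure are
mine, and it is neither the in-tree pg_memory_is_all_zeros() nor a concrete
proposal for it. It compares size_t-sized chunks and falls back to single
bytes only for the tail, so the loop advances eight bytes per comparison by
construction instead of relying on the optimizer to widen a byte loop.

/*
 * Sketch only -- not the in-tree pg_memory_is_all_zeros().
 * Check size_t-sized chunks while enough bytes remain, then finish the
 * remaining tail one byte at a time, so any length is handled.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static inline bool
memory_is_all_zeros_words(const void *ptr, size_t len)
{
	const unsigned char *p = (const unsigned char *) ptr;
	const unsigned char *end = p + len;

	while ((size_t) (end - p) >= sizeof(size_t))
	{
		size_t		chunk;

		/* memcpy sidesteps alignment and aliasing concerns */
		memcpy(&chunk, p, sizeof(chunk));
		if (chunk != 0)
			return false;
		p += sizeof(chunk);
	}

	while (p < end)
	{
		if (*p++ != 0)
			return false;
	}
	return true;
}

int
main(void)
{
	unsigned char page[8192] = {0};

	printf("%d\n", (int) memory_is_all_zeros_words(page, sizeof(page)));	/* 1 */
	page[sizeof(page) - 1] = 1;
	printf("%d\n", (int) memory_is_all_zeros_words(page, sizeof(page)));	/* 0 */
	return 0;
}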
RE: Popcount optimization using AVX512
> Here is an updated patch with this change. LGTM. Raghuveer
Re: MultiXact\SLRU buffers configuration
On Thu, 31 Oct 2024, 17:33 Andrey M. Borodin, wrote:
>
> > On 31 Oct 2024, at 17:29, Thom Brown wrote:
> >
> > On Thu, 31 Oct 2024 at 10:47, Andrey M. Borodin wrote:
> >>
> >>
> >>
> >>> On 29 Oct 2024, at 21:45, Thom Brown wrote:
> >>>
> >>> It clearly checks for interrupts, but when I saw this issue happen, it
> >>> wasn't interruptible.
> >>
> >> Perhaps, that was different multixacts?
> >> When I was observing this pathology on Standby, it was a stream of
> >> different reads encountering different multis.
> >>
> >> Either way startup can cancel locking process on it's own. Or is it the
> >> case that cancel was not enough, did you actually need termination, not
> >> cancel?
> >
> > Termination didn't work on either of the processes.
>
> How did you force the process to actually terminate?
> Did you observe repeated read of the same multixact?
> Was offending process holding any locks while waiting?

Unfortunately I didn't gather much information when it was occurring, and
prioritised getting rid of the process blocking replay. I just attached gdb
to it, got a backtrace and then "print ProcessInterrupts()".

Thom
Re: Popcount optimization using AVX512
On Wed, Oct 30, 2024 at 04:10:10PM -0500, Nathan Bossart wrote: > On Wed, Oct 30, 2024 at 08:53:10PM +, Raghuveer Devulapalli wrote: >> BTW, I just realized function attributes for xsave and avx512 don't work >> on MSVC (see >> https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630). >> Not sure if you care about it. Its an easy fix (see >> https://gcc.godbolt.org/z/Pebdj3vMx). > > Oh, good catch. IIUC we only need to check for #ifndef _MSC_VER in the > configure programs for meson. pg_attribute_target will be empty on MSVC, > and I believe we only support meson builds there. Here is an updated patch with this change. -- nathan >From 8cf7c08220a9c0a1dec809794af2dfb719981923 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 16 Oct 2024 15:57:55 -0500 Subject: [PATCH v2 1/1] use __attribute__((target(...))) for AVX-512 stuff --- config/c-compiler.m4 | 60 +- configure| 163 ++- configure.ac | 17 +-- meson.build | 21 ++-- src/Makefile.global.in | 5 - src/include/c.h | 10 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 12 +- src/port/meson.build | 7 +- src/port/pg_popcount_avx512.c| 86 +- src/port/pg_popcount_avx512_choose.c | 102 - 11 files changed, 175 insertions(+), 312 deletions(-) delete mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 10f8c7bd0a..aa90f8ef33 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -700,20 +700,20 @@ undefine([Ac_cachevar])dnl # Check if the compiler supports the XSAVE instructions using the _xgetbv # intrinsic function. # -# An optional compiler flag can be passed as argument (e.g., -mxsave). If the -# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +# If the intrinsics are supported, sets pgac_xsave_intrinsics. AC_DEFUN([PGAC_XSAVE_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [return _xgetbv(0) & 0xe0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl +AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +__attribute__((target("xsave"))) +static int xsave_test(void) +{ + return _xgetbv(0) & 0xe0; +}], + [return xsave_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cachevar" = x"yes"; then - CFLAGS_XSAVE="$1" pgac_xsave_intrinsics=yes fi undefine([Ac_cachevar])dnl @@ -725,29 +725,27 @@ undefine([Ac_cachevar])dnl # _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, # _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. # -# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq -# -mavx512bw). If the intrinsics are supported, sets -# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics. 
AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [const char buf@<:@sizeof(__m512i)@:>@; - PG_INT64_TYPE popcnt = 0; - __m512i accum = _mm512_setzero_si512(); - const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); - accum = _mm512_add_epi64(accum, cnt); - popcnt = _mm512_reduce_add_epi64(accum); - /* return computed value, to prevent the above being optimized away */ - return popcnt == 0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +__attribute__((target("avx512vpopcntdq","avx512bw"))) +static int popcount_test(void) +{ + const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + return (int) popcnt; +}], + [return popcount_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cach
Re: pg_parse_json() should not leak token copies on failure
On Mon, Oct 7, 2024 at 3:45 PM Jacob Champion wrote: > I've added a 0002 as well. 0002 has since been applied [1] so I'm reattaching v4-0001 to get the cfbot happy again. --Jacob [1] https://git.postgresql.org/cgit/postgresql.git/commit/?id=41b023946d v4-0001-jsonapi-add-lexer-option-to-keep-token-ownership.patch Description: Binary data
Re: pg_upgrade check for invalid databases
On Fri, Oct 25, 2024 at 01:55:57PM +0200, Daniel Gustafsson wrote: > > On 14 Oct 2024, at 18:57, Bruce Momjian wrote: > > > What might be acceptable would be to add an option that would make > > pg_upgrade more tolerant of problems in many areas, that is a lot more > > research and discussion. > > I agree that the concept of having pg_upgrade perform (opt-in) skipping and/or > repairs of the old cluster warrants a larger discussion in its own thread. > There has been significant amount of energy spent recently to add structure to > the checks, any new feature should be properly designed for the get-go. > > In the meantime, the OP has a good point that it's a tad silly that pg_upgrade > fails hard on invalid databases instead of detecting and reporting like how > other errors are handled. The attached adds this check and expands the report > wording to cover it. Agreed. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Re: Time to add a Git .mailmap?
On Thu, Oct 31, 2024 at 02:45:02PM -0500, Nathan Bossart wrote: > On Thu, Oct 31, 2024 at 11:37:13AM +0100, Daniel Gustafsson wrote: >> When looking at our Git tree for a recent conference presentation I happened >> to >> notice that we have recently gained duplicate names in the shortlog. Not >> sure >> if we care enough to fix that with a .mailmap, but if we do the attached diff >> makes sure that all commits are accounted for a single committer entry. > > Seems reasonable to me. +1. -- Michael signature.asc Description: PGP signature
Re: Making error message more user-friendly with spaces in a URI
On Thu, Oct 31, 2024 at 10:35:41PM +0700, Stepan Neretin wrote: > Hi, Yushi! I could not find this line in the master branch and I couldn't > apply this patch. I only see this error in the test (I think the test > should also be changed), but the idea of fixing the error looks good to me. This line exists on HEAD and the patch applies cleanly today as of 2d8bff603c9e. So perhaps just make sure that you are up to date? -- Michael signature.asc Description: PGP signature
Re: libedit history seems to be misbehaving / broken
On Sun, Oct 27, 2024 at 09:07:44PM +0100, Tomas Vondra wrote: > On 10/27/24 20:03, Tom Lane wrote: > > FWIW, I don't observe any particular misbehavior with the very ancient > > libedit that macOS ships. On Fedora 39, I notice something related to > > what you say: it seems like the "\q" ending a session gets repeated > > into .psql_history by the next session. I'm surprised that it's not > > exactly the same as your results though, because it seems to be the > > same source version: > > > > $ rpm -q libedit > > libedit-3.1-53.20240808cvs.fc39.x86_64 > > > > That's probably because I usually terminate psql by Ctrl-D, not by > typing "\q". But yeah, if I use "\q" it gets added twice. > > > Didn't try the too-many-lines behavior, but it looks like that > > is implemented totally by history_truncate_file(), so if that's > > busted it's surely their fault. > > > > Sounds likely. What surprises me a bit, I haven't found any reports of > particularly similar bugs ... I'd have expected other people to hit this > too, but who knows. I wonder if our previous libedit workarounds aren't needed anymore and are causing the bugs. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Re: Inval reliability, especially for inplace updates
On Thu, Oct 31, 2024 at 05:00:02PM +0300, Alexander Lakhin wrote:
> I've accidentally discovered an incorrect behaviour caused by commit
> 4eac5a1fa. Running this script:

Thanks. This looks important.

> parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40`

This didn't reproduce it for me, at -j20, -j40, or -j80. I tested at commit
fb7e27a. At what commit(s) does it reproduce for you? At what commits, if
any, did your test not reproduce this?

> All three autovacuum workers (1143263, 1143320, 1143403) are also waiting
> for the same buffer lock:
> #5 0x561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327
> #6 0x561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
> #7 0x561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182

Can you share the full backtrace for the autovacuum workers? This looks like
four backends all waiting for BUFFER_LOCK_SHARE on the same pg_class page.
One backend is in CREATE TABLE, and three are in autovacuum. There are no
other apparent processes that would hold the BUFFER_LOCK_EXCLUSIVE blocking
these four processes.

> Also as a side note, these processes can't be terminated with SIGTERM, I
> have to kill them.

That suggests they're trying to acquire one LWLock while holding another.
I'll recreate your CREATE TABLE stack trace and study its conditions. It's
not readily clear to me how that would end up holding relevant lwlocks.

Guessing how this happened did lead me to a bad decision in commit a07e03f,
but I expect fixing that bad decision won't fix the hang you captured. That
commit made index_update_stats() needlessly call RelationGetNumberOfBlocks()
and visibilitymap_count() with a pg_class heap buffer lock held. Both do
I/O, and the latter can exclusive-lock a visibility map buffer. The attached
patch corrects that. Since the hang you captured involved a pg_class heap
buffer lock, I don't think this patch will fix that hang. The other inplace
updaters are free from similar badness.

Author:     Noah Misch
Commit:     Noah Misch

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 74d0f30..bef1af6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2806,6 +2806,9 @@ index_update_stats(Relation rel,
 					bool hasindex,
 					double reltuples)
 {
+	bool		update_stats;
+	BlockNumber relpages;
+	BlockNumber relallvisible;
 	Oid			relid = RelationGetRelid(rel);
 	Relation	pg_class;
 	ScanKeyData key[1];
@@ -2815,6 +2818,42 @@ index_update_stats(Relation rel,
 	bool		dirty;
 
 	/*
+	 * As a special hack, if we are dealing with an empty table and the
+	 * existing reltuples is -1, we leave that alone.  This ensures that
+	 * creating an index as part of CREATE TABLE doesn't cause the table to
+	 * prematurely look like it's been vacuumed.  The final rd_rel may be
+	 * different from rel->rd_rel due to e.g. commit of concurrent GRANT, but
+	 * the commands that change reltuples take locks conflicting with ours.
+	 * (Even if a command changed reltuples under a weaker lock, this affects
+	 * only statistics for an empty table.)
+	 */
+	if (reltuples == 0 && rel->rd_rel->reltuples < 0)
+		reltuples = -1;
+
+	/*
+	 * Don't update statistics during binary upgrade, because the indexes are
+	 * created before the data is moved into place.
+	 */
+	update_stats = reltuples >= 0 && !IsBinaryUpgrade;
+
+	/*
+	 * Finish I/O and visibility map buffer locks before
+	 * systable_inplace_update_begin() locks the pg_class buffer.  The final
+	 * rd_rel may be different from rel->rd_rel due to e.g. commit of
+	 * concurrent GRANT, but no command changes a relkind from non-index to
+	 * index.  (Even if one did, relallvisible doesn't break functionality.)
+	 */
+	if (update_stats)
+	{
+		relpages = RelationGetNumberOfBlocks(rel);
+
+		if (rel->rd_rel->relkind != RELKIND_INDEX)
+			visibilitymap_count(rel, &relallvisible, NULL);
+		else	/* don't bother for indexes */
+			relallvisible = 0;
+	}
+
+	/*
 	 * We always update the pg_class row using a non-transactional,
 	 * overwrite-in-place update.  There are several reasons for this:
 	 *
@@ -2858,15 +2897,6 @@ index_update_stats(Relation rel,
 	/* Should this be a more comprehensive test? */
 	Assert(rd_rel->relkind != RELKIND_PARTITIONED_INDEX);
 
-	/*
-	 * As a special hack, if we are dealing with an empty table and the
-	 * existing reltuples is -1, we leave that alone.  This ensures that
-	 * creating an ind
Re: make all ereport in gram.y print out relative location
I wrote:
> I can take a look at this, since it's basically a followup
> to 774171c4f.

Pushed with some cleanup and further hacking.

			regards, tom lane
Re: IPC::Run::time[r|out] vs our TAP tests
On 31/10/2024 14:27, Daniel Gustafsson wrote:
> On 28 Oct 2024, at 11:56, Heikki Linnakangas wrote:
>> On 09/04/2024 20:10, Daniel Gustafsson wrote:
>>> =item $session->quit
>>>
>>> Close the session and clean up resources. Each test run must be closed
>>> with C<quit>. Returns TRUE when the session was cleanly terminated,
>>> otherwise FALSE. A test failure will be issued in case the session
>>> failed to finish.
>>
>> What does "session failed to finish" mean? Does it mean the same as "not
>> cleanly terminated", i.e. a test failure is issued whenever this returns
>> FALSE?
>
> It was very literally referring to the finish() method. I've reworded the
> comment to indicate that it throws a failure in case the process returns a
> non-zero exit status to finish().

I see.

@@ -148,7 +148,9 @@ sub _wait_connect

 =item $session->quit

 Close the session and clean up resources. Each test run must be closed with
-C<quit>.
+C<quit>. Returns TRUE when the session was cleanly terminated, otherwise
+FALSE. A test failure will be issued in case the session exited with a non-
+zero exit status (the finish() method returns TRUE for 0 exit status).

I still find that confusing. What finish() method? Yes, there's a finish()
method in IPC::Run, but that's BackgroundPsql's internal affair, not exposed
to the callers in any other way. And why do I care what that finish() returns
for 0 exit status? That's not visible to the quit method's caller.

Perhaps something like this:

"Close the psql session and clean up resources. Each psql session must be
closed with C<quit> before the end of the test. Returns TRUE if psql exited
successfully (i.e. with zero exit code), otherwise returns FALSE and reports
a test failure."

Would that be accurate?

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Re: vacuumdb --analyze-only (e.g., after a major upgrade) vs. partitioned tables: pg_statistic missing stats for the partitioned table itself
On Thu, Oct 24, 2024 at 02:48:42PM -0700, Nikolay Samokhvalov wrote:
> I just learned that vacuumdb --analyze-only doesn't update stats for the
> partitioned table itself, taking care only about individual partitions:

Yes, this is covered in the ANALYZE manual page:

    https://www.postgresql.org/docs/current/sql-analyze.html

    For partitioned tables, ANALYZE gathers statistics by sampling rows
    from all partitions; in addition, it will recurse into each partition
    and update its statistics. Each leaf partition is analyzed only once,
    even with multi-level partitioning. No statistics are collected for
    only the parent table (without data from its partitions), because with
    partitioning it's guaranteed to be empty.

It is discussed here:

    https://www.postgresql.org/message-id/flat/CAB%2B%3D1TULKjDSBHxqSQVQstxcHshGzQUnHfp45GSESAu2qm0VKg%40mail.gmail.com#586bc5deef05c35ac16100dee99f6e9e

> I propose considering 3 fixes:
>
> 1) vacuumdb --analyze / --analyze-only to update stats for the partitioned
>    table, so people using pg_upgrade are not in trouble
> 2) present the ANALYZE metadata for partitioned tables in pg_stat_all_tables
> 3) for old versions, either backpatch with fix (1) OR just add to the docs
>    (and maybe to the final words pg_upgrade prints), suggesting something
>    like this in addition to vacuumdb --analyze-only:
>
> -- psql snippet
> select format(
>     'analyze verbose %I.%I;',
>     relnamespace::oid::regnamespace,
>     oid::regclass
> ) as vacuum_command
> from pg_class
> where relkind = 'p' \gexec
>
> Additionally, I do like the idea of ANALYZE ONLY from the -general
> discussion above (though, there might be confusion with the logic of
> --analyze and --analyze-only in vacuumdb).
>
> Does it make sense?

I certainly would like to see this improved.

--
  Bruce Momjian                            https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means
  "Am I going to die soon?"
Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.
Amit Langote wrote on Thursday, October 31, 2024 at 21:09:
> On Wed, Oct 30, 2024 at 9:36 PM Junwang Zhao wrote:
> > On Wed, Oct 30, 2024 at 11:58 AM jian he wrote:
> > >
> > > I missed a case when column collation and partition key collation are
> > > the same and indeterministic.
> > > that should be fine for partition-wise join.
> > > so v2 attached.
> > >
> > > have_partkey_equi_join, match_expr_to_partition_keys didn't do any
> > > collation related check.
> > > propose v2 change: disallow partitionwise join for the case when the
> > > column collation is indeterministic *and* differs from the partition
> > > key's collation.
> > >
> > > the attached partition_wise_join_collation.sql is the test script.
> > > you may use it to compare with the master behavior.
> >
> > What if the partkey collation and column collation are both deterministic,
> > but with different sort order?
> >
> > I'm not familiar with this part of the code base, but it seems to me the
> > partition-wise join should use the partkey collation instead of the column
> > collation, because it's the partkey collation that decides which partition
> > a row is dispatched to.
> >
> > What Jian proposed is also reasonable but seems another aspect of $subject?
>
> I think we should insist that the join key collation and the partition
> collation are exactly the same and refuse to match them if they are
> not.
>
> + {
> +     Oid     colloid = exprCollation((Node *) expr);
> +
> +     if ((partcoll != colloid) &&
> +         OidIsValid(colloid) &&
> +         !get_collation_isdeterministic(colloid))
> +         *coll_incompatiable = true;
>
> I am not quite sure what is the point of checking whether or not the
> expression collation is deterministic after confirming that it's not
> the same as partcoll.

Me, too.

> Attached 0002 is what I came up with. One thing that's different from
> what Jian proposed is that match_expr_to_partition_keys() returns -1
> (expr not matched to any key) when the collation is also not matched
> instead of using a separate output parameter for that.

Agree.

In have_partkey_equi_join():

    ...
    if (exprs_known_equal(root, expr1, expr2, btree_opfamily))
    {
        Oid partcoll2 = rel1->part_scheme->partcollation[ipk];

I think we should use rel2 here, not rel1.

--
Thanks,
Tender Wang
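For readers following along, here is a rough backend-style sketch of the
exact-match rule Amit describes: an expression only matches a partition key
if its collation equals the partitioning collation, with collation-less keys
(e.g. int) matching trivially. The helper name is invented for illustration
and is not taken from the attached 0002 patch.

    /*
     * Hypothetical helper, for illustration only -- not from the 0002 patch.
     * Refuse to treat a join clause expression as matching a partition key
     * unless its collation is exactly the partitioning collation.
     */
    static bool
    partkey_collation_matches(Oid partcoll, Node *expr)
    {
        Oid     exprcoll = exprCollation(expr);

        /* Neither side has a collation: nothing can disagree. */
        if (!OidIsValid(partcoll) && !OidIsValid(exprcoll))
            return true;

        /* Otherwise insist on an exact match, deterministic or not. */
        return partcoll == exprcoll;
    }

With a check along these lines, match_expr_to_partition_keys() can simply
return -1 when the collation differs, which is how the 0002 patch is
described above.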
Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.
Amit Langote wrote on Thursday, October 31, 2024 at 21:09:
> 0001 is the patch for the partitionwise grouping issue (bug #18568).
> I concluded that even partial aggregate should be disallowed when the
> grouping collation doesn't match the partitioning collation. The result
> with partial partitionwise grouping is not the same as when
> enable_partitionwise_aggregate is off.

LGTM

--
Thanks,
Tender Wang
Re: UUID v7
On Thu, Oct 31, 2024 at 11:46 AM Andrey M. Borodin wrote:
>
> > On 31 Oct 2024, at 22:15, Masahiko Sawada wrote:
> >
> > On Sat, Oct 26, 2024 at 10:05 AM Andrey M. Borodin wrote:
> >
> > I think we typically avoid this kind of check failure by assigning
> > uuidv7() and uuidv7(interval) different C functions that call the
> > common function. That is, we have pg_proc entries like:
>
> Done.
>
> >>> It's odd to me that only uuid_extract_timestamp() supports UUID v6 in
> >>> spite of not supporting UUID v6 generation. I think it makes more
> >>> sense to support UUID v6 generation as well, if the need for it is
> >>> high.
> >>
> >> RFC urges to use UUIDv7 instead of UUIDv6 when possible. I'm fine with
> >> providing implementation, it's trivial. PFA patch with implementation.
> >
> > My point is that we should either support full functionality for
> > UUIDv6 (such as generation and extraction) or none of them. I'm not
> > really sure we want UUIDv6 as well, but if we want it, it should be
> > implemented in a separate patch.
>
> Make sense. I've removed all traces of v6.

Thank you for updating the patch.

I've been studying UUID v7 and have a question about the current (v29)
implementation. IIUC the current implementation uses nanosecond-precision
timestamps for both the first 48-bit space and the 12 bits of pseudorandom
data space (referred to as the 'rand_a' space in RFC 9562). IOW, all data
except for the random, version, and variant parts consists of a timestamp.

The nanosecond-precision timestamp is generated using clock_gettime() with
CLOCK_REALTIME on Linux, which however could be affected by time adjustment
by NTP. Therefore, if the system clock moves backward due to NTP, we cannot
guarantee monotonicity and sortability. Is that right?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
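To make the concern concrete, here is a standalone sketch (not code from the
patch; the variable names are invented) of deriving the two timestamp-based
fields from clock_gettime(CLOCK_REALTIME): the 48-bit Unix-time-in-
milliseconds prefix and a 12-bit sub-millisecond fraction of the kind RFC
9562 permits in the rand_a space (replacing leftmost random bits with
increased clock precision).

#include <stdint.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec ts;
	uint64_t	ns;
	uint64_t	unix_ts_ms;
	uint32_t	sub_ms;

	if (clock_gettime(CLOCK_REALTIME, &ts) != 0)
		return 1;

	ns = (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;

	/* 48-bit unix_ts_ms field of UUIDv7 */
	unix_ts_ms = (ns / 1000000) & 0xFFFFFFFFFFFFULL;

	/* 12-bit sub-millisecond fraction, one way to fill the rand_a space */
	sub_ms = (uint32_t) (((ns % 1000000) << 12) / 1000000);

	printf("unix_ts_ms=%llu sub_ms=%u\n",
		   (unsigned long long) unix_ts_ms, sub_ms);
	return 0;
}

If NTP steps CLOCK_REALTIME backwards between two generations, both fields
can decrease, so any monotonicity guarantee would have to come from extra
logic (e.g. remembering the previous timestamp) rather than from the clock
itself -- which is the sortability question raised above.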
Re: Add ExprState hashing for GROUP BY and hashed SubPlans
On Thu, 31 Oct 2024 at 15:30, Andrei Lepikhov wrote:
> Comparing the master with and without your patch, the first thing I see is
> more extensive usage of memory (see complete explains in the attachment):
>
> Current master:
>   Batches: 1  Memory Usage: 74513kB
> Patched:
>   Batches: 261  Memory Usage: 527905kB  Disk Usage: 4832656kB

Thanks for testing that. I had forgotten to adjust the storage location for
the hash initial value when I rebased the patch against the changes made in
9ca67658d. I've fixed that in the attached patch.

The patched version comes out faster for me now:

master:   Execution Time: 15515.401 ms
v4 patch: Execution Time: 15024.395 ms

David

Attachment: v4-0001-Use-ExprStates-for-hashing-in-GROUP-BY-and-SubPla.patch
Re: In-placre persistance change of a relation
On 31/10/2024 10:01, Kyotaro Horiguchi wrote:
> After some delays, here's the new version. In this update, UNDO logs are
> WAL-logged and processed in memory under most conditions. During
> checkpoints, they're flushed to files, which are then read when a specific
> XID's UNDO log is accessed for the first time during recovery.
>
> The biggest changes are in patches 0001 through 0004 (equivalent to the
> previous 0001-0002). After that, there aren't any major changes. Since
> this update involves removing some existing features, I've split these
> parts into multiple smaller identity transformations to make them clearer.
>
> As for changes beyond that, the main one is lifting the previous
> restriction on PREPARE for transactions after a persistence change. This
> was made possible because, with the shift to in-memory processing of UNDO
> logs, commit-time crash recovery detection is now simpler. Additional
> changes include completely removing the abort-handling portion from the
> pendingDeletes mechanism (0008-0010).

In this patch version, the undo log is kept in dynamic shared memory. It can
grow indefinitely. On a checkpoint, it's flushed to disk. If I'm reading it
correctly, the undo records are kept in the DSA area even after they have
been flushed to disk. That's not necessary: the system never needs to read
the undo log unless there's a crash, so there's no need to keep it in memory
after it has been flushed to disk. That's true today; we could start relying
on the undo log to clean up on abort even when there's no crash, but I think
it's a good design not to do that and to rely on backend-private state for
non-crash transaction abort.

I'd suggest doing this the other way 'round. Let's treat the on-disk
representation as the primary representation, not the in-memory one. Let's
use a small fixed-size shared memory area just as a write buffer to hold the
dirty undo log entries that haven't been written to disk yet. Most
transactions are short, so most undo log entries never need to be flushed to
disk, but I think it'll be simpler to think of it that way. On checkpoint,
flush all the buffered dirty entries from memory to disk and clear the
buffer. Also do that if the buffer fills up.

A high-level overview comment of the on-disk format would be nice. If I
understand correctly, there's a magic constant at the beginning of each undo
file, followed by UndoLogRecords. There are no other file headers and no
page structure within the file. That format seems reasonable. For
cross-checking, maybe add the XID to the file header too.

There is a separate CRC value on each record, which is nice, but not
strictly necessary since the writes to the UNDO log are WAL-logged. The WAL
needs CRCs on each record to detect the end of the log, but the UNDO log
doesn't need that. Anyway, it's fine.

I somehow dislike the file-per-subxid design. I'm sure it works; it just
doesn't feel right to me. I'm somewhat worried about ending up with lots of
files if you e.g. use temporary tables with subtransactions heavily. Could
we have just one file per top-level XID? I guess that can become a problem
too, if you have a lot of aborted subtransactions: the UNDO records for the
aborted subtransactions would bloat the undo file. But maybe that's
nevertheless better?

--
Heikki Linnakangas
Neon (https://neon.tech)
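To make the suggestion concrete, here is a rough sketch of what a per-XID
undo file header with the proposed cross-check XID, plus a small fixed-size
write buffer that is flushed at checkpoint time or when it fills up, could
look like. All type names, sizes, and the flush routine are invented for
illustration; none of this is taken from the patch.

    /* Sketch only -- hypothetical types and names, not from the patch. */
    typedef struct UndoFileHeader
    {
        uint32          magic;      /* identifies an undo file */
        TransactionId   xid;        /* owning transaction, for cross-checking */
    } UndoFileHeader;

    #define UNDO_WRITE_BUFFER_SIZE  (64 * 1024)     /* illustrative size */

    typedef struct UndoWriteBuffer
    {
        TransactionId   xid;        /* whose records are buffered */
        uint32          used;       /* bytes of valid data in buf */
        char            buf[UNDO_WRITE_BUFFER_SIZE];
    } UndoWriteBuffer;

    /*
     * Append one record to the buffer; if it would not fit, first flush the
     * buffered records to the transaction's undo file and empty the buffer.
     * A checkpoint would call the same (hypothetical) flush routine for all
     * buffers and then clear them.
     */
    static void
    undo_buffer_append(UndoWriteBuffer *ub, const void *rec, uint32 len)
    {
        if (ub->used + len > UNDO_WRITE_BUFFER_SIZE)
        {
            undo_buffer_flush(ub);      /* hypothetical: write buf to the file */
            ub->used = 0;
        }
        memcpy(ub->buf + ub->used, rec, len);
        ub->used += len;
    }

Under a scheme like this, the on-disk file is the authoritative copy and the
shared-memory footprint stays bounded no matter how much undo a long
transaction generates.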