Re: Exposing the lock manager's WaitForLockers() to SQL
I got some very helpful off-list feedback from Robert Haas that this needed more self-contained explanation/motivation. So here goes. :-)

This patch set adds a new SQL function pg_wait_for_lockers(), which waits for transactions holding specified table locks to commit or roll back. This can be useful with knowledge of the queries in those transactions, particularly for asynchronous and incremental processing of inserted/updated rows.

Specifically, consider a scenario where INSERTs and UPDATEs always set a serial column to its default value. A client can call pg_sequence_last_value() + pg_wait_for_lockers() and then take a new DB snapshot and know that rows committed after this snapshot will have values of the serial column greater than the value from pg_sequence_last_value(). As shown in the example at the end, this allows the client to asynchronously and incrementally read inserted/updated rows with minimal per-client state, without buffering changes, and without affecting writer transactions.

There are lots of other ways to support incrementally reading new rows, but they don't have all of those qualities. For example:

* Forcing writers to commit in a specific order (e.g. by serial column value) would reduce throughput
* Explicitly tracking or coordinating with writers would likely be more complex, impact performance, and/or require much more state
* Methods that are synchronous or buffer/queue changes are problematic if readers fall behind

Existing ways to wait for table locks also have downsides:

* Taking a conflicting lock with LOCK blocks new transactions from taking the lock of interest while LOCK waits. And in order to wait for writers holding RowExclusiveLock, we must take ShareLock, which also conflicts with ShareUpdateExclusiveLock and therefore unnecessarily interferes with (auto)vacuum. Finally, with multiple tables LOCK locks them one at a time, so it waits (and holds locks) longer than necessary.
* Using pg_locks / pg_lock_status() to identify the transactions holding the locks is more expensive since it also returns all other locks in the DB cluster, plus there's no efficient built-in way to wait for the transactions to commit or roll back.

By contrast, pg_wait_for_lockers() doesn't block other transactions, waits on multiple tables in parallel, and doesn't spend time looking at irrelevant locks.

This change is split into three patches for ease of review. The first two patches modify the existing WaitForLockers() C function and other locking internals to support waiting for lockers in a single lock mode, which allows waiting for INSERT/UPDATE without waiting for vacuuming. These changes could be omitted at the cost of unnecessary waiting, potentially for a long time with slow vacuums. The third patch adds the pg_wait_for_lockers() SQL function, which just calls WaitForLockers().

FWIW, another solution might be to directly expose the functions that WaitForLockers() calls, namely GetLockConflicts() (generalized to GetLockers() in the first patch) to identify the transactions holding the locks, and VirtualXactLock() to wait for each transaction to commit or roll back. That would be more complicated for the client but could be more broadly useful. I could investigate that further if it seems preferable.

=== Example ===

Assume we have the following table:

CREATE TABLE page_views (
    id bigserial,
    view_time timestamptz
);

which is only ever modified by (potentially concurrent) INSERT commands that assign the default value to the id column.
We can run the following commands:

SELECT pg_sequence_last_value('page_views_id_seq');
 pg_sequence_last_value
------------------------
                      4

SELECT pg_wait_for_lockers(array['page_views']::regclass[], 'RowExclusiveLock', FALSE);

Now we know that all rows where id <= 4 have been committed or rolled back, and we can observe/process them:

SELECT * FROM page_views WHERE id <= 4;
 id |         view_time
----+----------------------------
  2 | 2024-01-01 12:34:01.00-00
  3 | 2024-01-01 12:34:00.00-00

Later we can iterate:

SELECT pg_sequence_last_value('page_views_id_seq');
 pg_sequence_last_value
------------------------
                      9

SELECT pg_wait_for_lockers(array['page_views']::regclass[], 'RowExclusiveLock', FALSE);

We already observed all the rows where id <= 4, so this time we can filter them out:

SELECT * FROM page_views WHERE id > 4 AND id <= 9;
 id |         view_time
----+----------------------------
  5 | 2024-01-01 12:34:05.00-00
  8 | 2024-01-01 12:34:04.00-00
  9 | 2024-01-01 12:34:07.00-00

We can continue iterating like this to incrementally observe more newly inserted rows. Note that the only state we persist across iterations is the value returned by pg_sequence_last_value(). In this example, we processed inserted rows exactly once. Variations are possible for handling updates, as discussed in the original email, and I could explain that again if it would be helpful.
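To spell out the client-side loop the example implies, here is a minimal sketch using psql variables for the persisted state (the pg_wait_for_lockers() argument list is my reading of the patch as posted, not verbatim from it):

\set last_id 0   -- once, before the first iteration

-- each iteration:
SELECT pg_sequence_last_value('page_views_id_seq') AS v \gset new_
SELECT pg_wait_for_lockers(array['page_views']::regclass[], 'RowExclusiveLock', FALSE);
-- any statement after the wait runs under a new snapshot, so the delta is stable:
SELECT * FROM page_views WHERE id > :last_id AND id <= :new_v;
\set last_id :new_v   -- the only state carried to the next iteration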
Re: Exposing the lock manager's WaitForLockers() to SQL
I should add that the latest patches remove permissions checks because pg_locks doesn't have any, and improve the commit messages. Hope I didn't garble anything doing this late after the dev conference. :-) Robert asked me about other existing functions that could be leveraged, such as GetConflictingVirtualXIDs(), but I didn't see any with close-enough semantics that handle fast-path locks as needed for tables/relations.
Re: meson "experimental"?
Hi. > "Alternatively, PostgreSQL can be built using Meson. This is the only > option for building PostgreSQL in Windows using Visual Something[*]. > For other platforms, using Meson is currently experimental." +1 good catch > [*] What is the correct name for this? I believe in this section it should be "Visual Studio" as we specify elsewhere [1][2]. In [2] we name specific required versions. Maybe we should reference this section. While on it, in [2] section 17.7.5 is named "Visual". I don't think such a product exists and find it confusing. Maybe we should rename the section to "Visual Studio". Also I don't see any mention of the minimum required version of Ninja. I think we should add it too, or maybe reference the corresponding section I couldn't find in "17.1 Requirements" [1]: https://www.postgresql.org/docs/devel/install-meson.html [2]: https://www.postgresql.org/docs/devel/installation-platform-notes.html -- Best regards, Aleksander Alekseev
Re: meson "experimental"?
Hi,

> > [*] What is the correct name for this?
>
> I believe in this section it should be "Visual Studio" as we specify
> elsewhere [1][2]. In [2] we name specific required versions. Maybe we
> should reference this section.
>
> While on it, in [2] section 17.7.5 is named "Visual". I don't think
> such a product exists and find it confusing. Maybe we should rename
> the section to "Visual Studio".

Here are the corresponding patches.

> Also I don't see any mention of the minimum required version of Ninja.
> I think we should add it too, or maybe reference the corresponding
> section I couldn't find in "17.1 Requirements"
>
> [1]: https://www.postgresql.org/docs/devel/install-meson.html
> [2]: https://www.postgresql.org/docs/devel/installation-platform-notes.html

From a quick look at the buildfarm we seem to use Ninja >= 1.11.1. However, since Meson can use both Ninja and VS as a backend, I'm not certain which section would be most appropriate for naming the minimum required version of Ninja.

--
Best regards,
Aleksander Alekseev

v1-0001-Meson-is-not-experimental-on-Windows.patch Description: Binary data
v1-0002-Rename-section-17.7.5-to-Visual-Studio.patch Description: Binary data
Re: Implementing CustomScan over an index
Hi,

> So I started implementing a CustomScan. It's not trivial.
>
> I've learned that the system executes ExecInitCustomScan automatically, but I
> probably need it to do most of the stuff in ExecInitIndexScan, and then
> execute the scan mostly the way it's done in IndexNext.
>
> Basically, I want just a normal index scan, but with the ability to do custom
> things with the quals and the projection.
>
> So... what's the best approach?
>
> Is there any sample code that does this? A search of github doesn't turn up
> much.
>
> Is there any way to do this without duplicating everything in nodeIndexscan.c
> myself?

Yes, unfortunately it is not quite trivial. There is a "Writing a Custom Scan Provider" chapter in the documentation that may help [1]. TimescaleDB uses CustomScans; maybe using its code as an example will help [2]. Hint: exploring `git log` is often helpful too.

If something in the documentation is not clear, maybe it can be improved. Let us know here or (even better) provide a patch. If you have a particular piece of code that doesn't do what you want, try uploading a minimal example on GitHub and asking here.

From a quick look I couldn't find an example of implementing a CustomScan in ./contrib/ or ./src/test/. If you can think of a usage example of CustomScans, consider contributing a test case.

[1]: https://www.postgresql.org/docs/current/custom-scan.html
[2]: https://github.com/timescale/timescaledb/

--
Best regards,
Aleksander Alekseev
The xversion-upgrade test fails to stop server
Hello Andrew,

While reviewing recent buildfarm failures, I came across this one:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-05-23%2004%3A11%3A03

upgrade.crake/REL_16_STABLE/REL9_5_STABLE-ctl4.log
waiting for server to shut down... failed
pg_ctl: server does not shut down

Looking at:
https://github.com/PGBuildFarm/client-code/blob/05014d50e/PGBuild/Modules/TestUpgradeXversion.pm#L641

I see that ctl4.log is created after updating extensions, and REL9_5_STABLE-update_extensions.log contains:

You are now connected to database "contrib_regression_redis_fdw" as user "buildfarm".
ALTER EXTENSION "hstore" UPDATE;
ALTER EXTENSION
You are now connected to database "contrib_regression_btree_gin" as user "buildfarm".
ALTER EXTENSION "btree_gin" UPDATE;
ALTER EXTENSION
...

but I see no corresponding server log file containing these commands in the failure log.

When running the same test locally, I find these in inst/upgrade_log.

Maybe uploading this log file too would help in understanding the cause of the failure...

Best regards,
Alexander
Re: meson "experimental"?
On Thu, May 30, 2024 at 6:32 AM Aleksander Alekseev <aleksan...@timescale.com> wrote:
>
> By a quick look on the buildfarm we seem to use Ninja >= 1.11.1.
> However since Meson can use both Ninja and VS as a backend I'm not
> certain which section would be most appropriate for naming the minimal
> required version of Ninja.
>

When I tried building with the VS backend it blew up; I don't recall the details. I think we should just use ninja everywhere. That keeps things simple. On Windows I just install python and then do "pip install meson ninja".

cheers

andrew
Re: Add LSN <-> time conversion functionality
Hi everyone!

Me, Bharath, and Ilya are at a patch review session at PGConf.dev :) Maybe we got everything wrong, so please consider that we are just doing training on reviewing patches.

=== Purpose of the patch ===

Currently, we have checkpoint_timeout and max_wal_size to know when we need a checkpoint. This patch brings a capability to freeze pages not only based on the internal state of the system, but also by wall clock time. To do so we need infrastructure that tells us when a page was modified. The patch in this thread is doing exactly this: in-memory information to map LSNs to wall clock time. The mapping is maintained by the background writer.

=== Questions ===

1. The patch does not handle server restart. Will all pages need freezing after any crash?
2. Some benchmarks to prove the patch does not add a CPU footprint.

=== Nits ===

The "timeline" term is already taken.
The patch needs a rebase due to some header changes.
Tests fail on Windows.
The patch lacks tests.
Some docs would be nice, but the feature is for developers.
The mapping is protected for concurrent access by the walstats LWLock and might have tuplestore_putvalues() called under that lock. That might be a little dangerous if the tuplestore ends up on disk for some reason (which should not happen).

Overall, the patch is a base for a good feature which would help to freeze pages at the right time. Thanks!

Best regards,
Bharath, Andrey, Ilya.
Re: The xversion-upgrade test fails to stop server
Sent from my iPhone

> On May 30, 2024, at 8:00 AM, Alexander Lakhin wrote:
>
> Hello Andrew,
>
> While reviewing recent buildfarm failures, I came across this one:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-05-23%2004%3A11%3A03
>
> upgrade.crake/REL_16_STABLE/REL9_5_STABLE-ctl4.log
> waiting for server to shut down... failed
> pg_ctl: server does not shut down
>
> Looking at:
> https://github.com/PGBuildFarm/client-code/blob/05014d50e/PGBuild/Modules/TestUpgradeXversion.pm#L641
>
> I see that ctl4.log is created after updating extensions and
> REL9_5_STABLE-update_extensions.log contains:
> You are now connected to database "contrib_regression_redis_fdw" as user "buildfarm".
> ALTER EXTENSION "hstore" UPDATE;
> ALTER EXTENSION
> You are now connected to database "contrib_regression_btree_gin" as user "buildfarm".
> ALTER EXTENSION "btree_gin" UPDATE;
> ALTER EXTENSION
> ...
> but I see no corresponding server log file containing these commands in the failure log.
>
> When running the same test locally, I find these in inst/upgrade_log.
>
> Maybe uploading this log file too would help to understand what is the cause of the failure...

Will investigate after I return from pgconf.

Cheers

Andrew
Vacuum statistics
Hello, everyone!

I think we don't have enough information to analyze vacuum functionality. Needless to say, vacuum is the most important process for a database system: it prevents problems like table and index bloat and does emergency freezing if we have a wraparound problem. Furthermore, it keeps the visibility map up to date. On the other hand, incorrectly adjusted, overly aggressive autovacuum settings can make it consume a lot of computing resources, which causes all queries in the system to run longer.

Nowadays vacuum gathers statistical information about tables, but that information is important not only for the optimizer. Because vacuum is an automated process, there are a lot of settings that determine how aggressively it operates on the objects of the database. Besides, sometimes it is important to set a parameter correctly for a specific table because of how dynamically it changes. A database administrator needs to tune autovacuum to strike a balance between the vacuum's useful work in the database system on the one hand and the overhead of its workload on the other. However, it is not enough to judge vacuum behavior from statistical information about the number of vacuum passes over tables and the operational data from progress_vacuum, because the latter is available only while a vacuum is running and does not provide a strategic overview over the considered period.

To sum up, autovacuum has strategic behavior, because the frequency of its runs and its resource consumption depend on the workload of the database. Its load on the database is minimal for an append-only table and maximal for a table with high-frequency updates. Furthermore, vacuum load depends strongly on the number and size of indexes. Because there is no visibility map for indexes, vacuum scans each index completely, and the worst situation is when it has to do that for a bloated index on a small table.

I suggest gathering information about vacuum resource consumption when processing indexes and tables and storing it per table and per index (for example, in the PgStat_StatTabEntry structure, as is done for the usual statistics). It will allow us to determine how well vacuum is configured and to evaluate its overhead on the system at the strategic level. Vacuum already gathers this information, but this valuable information is not stored anywhere.

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [PATCH] psql: \dn+ to show size of each schema (and \dA+ for AMs)
Hi Justin,

Thanks for the patch and the work on it. In reviewing the basic feature, I think this is something that has utility and is worthwhile at the high level. A few more specific notes:

The pg_namespace_size() function can stand on its own, and probably has some utility for the released Postgres versions. I do think the psql implementation for the \dn+ or \dA+ commands shouldn't need to use this same function; it's a straightforward expansion of the SQL query that can be run in a way that will be backwards-compatible with any connected postgres version, so no reason to exclude this information for these cases. (This may have been in an earlier revision of the patchset; I didn't check everything.)

I think the \dX++ command versions add code complexity without a real need for it. We have precedent with \l(+) to show permissions on the basic display and size on the extended display, and I think this is sufficient in this case here. While moving the permissions to \dn is a behavior change, it's adding information, not taking it away, and as an interactive command it is unlikely to introduce significant breakage in any scripting scenario. (In reviewing the patch we've also seen a bit of odd behavior / a possible bug with the existing extended + commands, which introducing significant ++ overloading might make more confusing, but that's not the fault/concern of this patch.)

Quickie summary:

0001-Add-pg_am_size-pg_namespace_size.patch - fine, but needs rebase to work
0002-psql-add-convenience-commands-dA-and-dn.patch - split into just + variant; remove \l++; make \dn show permissions and \dn+ show size
0003-f-convert-the-other-verbose-to-int-too.patch - unneeded
0004-Move-the-double-plus-Size-columns-to-the-right.patch - unneeded

Docs on the first patch seemed fine; I do think we'll need docs changes for the psql changes.

Best,

David
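For reference, the kind of backwards-compatible query I have in mind for the psql side (my own sketch, not taken from the patch) would be along these lines; it only uses functions that exist in all supported server versions:

SELECT n.nspname AS schema,
       pg_size_pretty(sum(pg_total_relation_size(c.oid))) AS size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')   -- plain tables and matviews; indexes and TOAST
                                -- are already counted by pg_total_relation_size()
GROUP BY n.nspname
ORDER BY sum(pg_total_relation_size(c.oid)) DESC;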
Re: Vacuum statistics
On 30.05.2024 10:33, Alena Rybakina wrote:
> I suggest gathering information about vacuum resource consumption for
> processing indexes and tables and storing it in the table and index
> relationships (for example, PgStat_StatTabEntry structure like it has
> realized for usual statistics). It will allow us to determine how well
> the vacuum is configured and evaluate the effect of overhead on the
> system at the strategic level, the vacuum has gathered this information
> already, but this valuable information doesn't store it.

My colleagues and I have prepared a patch that can help to solve this problem. We are open to feedback.

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From dfda656b35be2a73c076cb723fdeb917630e61e3 Mon Sep 17 00:00:00 2001
From: Alena Rybakina
Date: Thu, 30 May 2024 11:02:10 -0700
Subject: [PATCH] Machinery for grabbing extended vacuum statistics on relations.

Remember, statistics on heap and index relations are a bit different.

The values of total_blks_hit, total_blks_read and total_blks_dirtied are the numbers of pages hit, missed and dirtied in shared buffers during a vacuum operation, respectively. total_blks_dirtied means 'dirtied only by this action': if a page was already dirty before the vacuum operation, it is not counted as 'dirtied'.

The tuples_deleted parameter is the number of tuples cleaned up by the vacuum operation.

The delay_time value is the total vacuum sleep time at the vacuum delay point.

The pages_removed value is the number of pages by which the physical data storage of the relation was reduced. The pages_deleted parameter is the number of freed pages in the table (the file size may not have changed).

The interruptions counter is the number of interruptions of the (auto)vacuum process during vacuuming of a relation. We report from the vacuum_error_callback routine, so we can log all ERROR reports; in the case of autovacuum we can report SIGINT signals too. It may be dangerous to do such a complex task (sending) in an error callback - we could hit an ERROR-inside-ERROR problem - but the chance of getting stuck in this situation looks small, so let's try it. This parameter relates to a problem covered by b19e4250.

Tracking of I/O during an (auto)vacuum operation: the introduced variables blk_read_time and blk_write_time track only access to buffer pages and flushing them to disk. The reading operation is trivial, but the writing measurement technique is not obvious: during a vacuum, the write time can remain zero because no flushing operations were performed.

System time and user time are parameters that describe how much time a vacuum operation has spent executing code in user space and kernel space, respectively. We also accumulate the total time of a vacuum, which is the difference between timestamps at the start and finish points in the vacuum code. Remember the idle time, when vacuum waited for I/O and locks; the total time is therefore not equal to the sum of user and system time, but is no less than it.

pages_frozen - the number of pages that are marked as frozen in the VM during vacuum. This parameter is incremented if a page is marked as all-frozen.
pages_all_visible - the number of pages that are marked as all-visible in the VM during vacuum.
Authors: Alena Rybakina, Andrei Lepikhov, Andrei Zubkov
---
 src/backend/access/heap/vacuumlazy.c                       | 245 +-
 src/backend/access/heap/visibilitymap.c                    |  13 +
 src/backend/catalog/system_views.sql                       | 123 +
 src/backend/commands/vacuum.c                              |   4 +
 src/backend/commands/vacuumparallel.c                      |   1 +
 src/backend/utils/activity/pgstat.c                        | 117 +++--
 src/backend/utils/activity/pgstat_database.c               |   1 +
 src/backend/utils/activity/pgstat_relation.c               |  52 ++-
 src/backend/utils/adt/pgstatfuncs.c                        | 290
 src/backend/utils/error/elog.c                             |  13 +
 src/include/catalog/pg_proc.dat                            |  28 +-
 src/include/commands/vacuum.h                              |   1 +
 src/include/pgstat.h                                       | 118 -
 src/include/utils/elog.h                                   |   2 +-
 src/include/utils/pgstat_internal.h                        |  36 +-
 .../expected/vacuum-extended-statistic.out                 | 419 ++
 .../isolation/expected/vacuum-extending.out                |  68 +++
 src/test/isolation/isolation_schedule                      |   2 +
 .../specs/vacuum-extended-statistic.spec                   | 179
 .../isolation/specs/vacuum-extending.spec                  |  58 +++
 src/test/regress/expected/opr_sanity.out                   |   9 +-
 src/test/regress/expected/rules.out                        |  79
 22 files changed, 1812 insertions(+), 46 deletions(-)
 create mode 100644 src/test/isolation/expected/vacuum-extended-statistic.out
 create mode 100644 src/test/isolation/expected/vacuum-extending.out
 create mode 100644 src/test/isolation/specs/vacuum-extended-statistic.spec
 create mode 100644 src/test/isolation/specs/vacuum-extending.spec

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
Re: Vacuum statistics
Hi,

On Thu, 30/05/2024 at 10:33 -0700, Alena Rybakina wrote:
> I suggest gathering information about vacuum resource consumption for
> processing indexes and tables and storing it in the table and index
> relationships (for example, PgStat_StatTabEntry structure like it has
> realized for usual statistics). It will allow us to determine how well
> the vacuum is configured and evaluate the effect of overhead on the
> system at the strategic level, the vacuum has gathered this information
> already, but this valuable information doesn't store it.

It seems this was a little bit unclear, so let me explain the point of the proposition a little.

As the vacuum process is a backend, it has workload instrumentation. We have all the basic counters available, such as the number of blocks read, hit and written, time spent on I/O, WAL stats and so on. Also, we can easily get some statistics specific to vacuum activity, i.e. the number of tuples removed, the number of blocks removed, the number of VM marks set and, of course, the most important metric - the time spent on the vacuum operation.

All those statistics must be stored by the Cumulative Statistics System on a per-relation basis, I mean individual cumulative counters for every table and every index in the database. Such counters will provide us a clear view of the vacuum workload on individual objects of the database, providing the means to measure the efficiency of performed vacuum fine tuning.

--
Andrei Zubkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
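To illustrate how such per-relation counters could be consumed, here is a sketch of the kind of monitoring query an administrator might run; the view name and exact column set below are hypothetical, chosen only for the example (the patch may expose them differently):

-- Hypothetical view and column names, for illustration only.
SELECT relname,
       total_blks_read,
       total_blks_hit,
       total_blks_dirtied,
       tuples_deleted,
       delay_time
FROM pg_stat_vacuum_tables
ORDER BY total_blks_read DESC
LIMIT 10;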
Re: Add memory context type to pg_backend_memory_contexts view
Hi David,

Giving this a once-through, this seems straightforward and useful. I have a slight preference for keeping "name" the first field in the view and moving "type" to the second, but that's minor.

Just confirming that the allocator types are not extensible without a recompile, since it's using a specific node tag to switch on, so there are no concerns with not properly displaying the output of something else.

The "???" text placeholder might be more appropriate as something more descriptive, or perhaps stronger, include a WARNING in the logs, since an unknown tag at this point would be an indication of some sort of memory corruption.

Since there are only four possible values, I think there would be utility in including them in the docs for this field.

I also think it would be useful to have some sort of comments at least in mmgr/README to indicate that if a new type of allocator is introduced, you will also need to add the node to the function for this type, since it's not an automatic conversion. (For that matter, instead of switching on node type and outputting a given string, is there a generic function that could just give us the string value for node type so we don't need to teach anything else about it anyway?)

Thanks,

David
Re: Should consider materializing the cheapest inner path in consider_parallel_nestloop()
On Tue, 2024-04-23 at 16:59 +0800, Tender Wang wrote:
> [ cut ]
>
> I have rebased master and fixed a plan diff case.

We (me, Paul Jungwirth, and Yuki Fujii) reviewed this patch at the PGConf.dev Patch Review Workshop. Here are our findings.

The patch tries to allow using materialization together with parallel subqueries. It applies cleanly on 8fea1bd5411b793697a4c9087c403887e050c4ac (current HEAD). Tests pass locally on macOS and Linux in a VM under Windows. Tests are also green in cfbot (for the last 2 weeks; they were red previously, probably because of the need to rebase).

Please add more tests. Especially please add some negative tests; the current patch checks that it is safe to apply materialization. It would be helpful to add tests checking that materialization is not applied in both checked cases (see the sketch after this message):
1. when the inner join path is not parallel safe
2. when matpath is not parallel safe

This patch tries to apply materialization only when the join type is not JOIN_UNIQUE_INNER. The comment mentions this, but does not explain why. So please either add a comment describing the reason for that or try enabling materialization in such a case.

Best regards.

--
Tomasz Rybak, Debian Developer
GPG: A565 CE64 F866 A258 4DDC F9C7 ECB7 3E37 E887 AA8C
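To make the testing suggestion more concrete, here is the kind of setup one might adapt into a regression test. The table names and planner settings are ours, for illustration only; whether Materialize is actually chosen depends on costing, so treat this as a sketch rather than a ready-made test case:

SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;
SET enable_hashjoin = off;
SET enable_mergejoin = off;

CREATE TABLE outer_tbl (a int, b int);
CREATE TABLE inner_tbl (a int, b int);
INSERT INTO outer_tbl SELECT i, i FROM generate_series(1, 10000) i;
INSERT INTO inner_tbl SELECT i, i FROM generate_series(1, 10000) i;
ANALYZE outer_tbl, inner_tbl;

-- With the patch, look for a Materialize node on the inner side of the
-- Nested Loop under Gather; a negative test would make the inner path
-- parallel unsafe and check that Materialize is absent.
EXPLAIN (COSTS OFF)
SELECT * FROM outer_tbl o JOIN inner_tbl i ON o.a = i.a;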
Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)
Hello,

I am looking through some outstanding CommitFest entries; I wonder if this particular patch is already effectively fixed by 5278d0a2, which is both attributed to the original author and takes an extremely similar approach.

Can this entry (https://commitfest.postgresql.org/48/4553/) be closed?

Best,

David
Re: POC: GROUP BY optimization
Hi!

On Thu, May 30, 2024 at 7:22 AM Andrei Lepikhov wrote:
> On 5/29/24 19:53, Alexander Korotkov wrote:
> > Thank you for your feedback.
> >
> > On Wed, May 29, 2024 at 11:08 AM Andrei Lepikhov wrote:
> >> On 5/27/24 19:41, Alexander Korotkov wrote:
> >>> Any thoughts?
> >> About 0001:
> >> Having overviewed it, I don't see any issues (but I'm the author),
> >> except grammatical ones - but I'm not a native to judge it.
> >> Also, the sentence 'turning GROUP BY clauses into pathkeys' is unclear
> >> to me. It may be better to write something like: 'building pathkeys by
> >> the list of grouping clauses'.
> >
> > OK, thank you. I'll run once again for the grammar issues.

I've revised some grammar including the sentence you've proposed.

> >> 0002:
> >> The part under USE_ASSERT_CHECKING looks good to me. But the code in
> >> group_keys_reorder_by_pathkeys looks suspicious: of course, we do some
> >> doubtful work without any possible way to reproduce, but if we envision
> >> some duplicated elements in the group_clauses, we should avoid usage of
> >> the list_concat_unique_ptr.
> >
> > As I understand Tom, there is a risk that clauses list may contain
> > multiple instances of equivalent SortGroupClause, not duplicate
> > pointers.
> Maybe. I just implemented the worst-case scenario with the intention of
> maximum safety. So, it's up to you.
>
> >> What's more, why do you not exit from
> >> foreach_ptr immediately after SortGroupClause has been found? I think
> >> the new_group_clauses should be consistent with the new_group_pathkeys.
> >
> > I wanted this to be consistent with preprocess_groupclause(), where
> > duplicate SortGroupClause'es are grouped together. Otherwise, we
> > could just delete redundant SortGroupClause'es.
> Hm, preprocess_groupclause is called before the standard_qp_callback
> forms the 'canonical form' of group_pathkeys and such behaviour needed.
> But the code that chooses the reordering strategy uses already processed
> grouping clauses, where we don't have duplicates in the first
> num_groupby_pathkeys elements, do we?

Yep, it seems that we work with group clauses which were processed by standard_qp_callback(). In turn, standard_qp_callback() called make_pathkeys_for_sortclauses_extended() with remove_redundant == true. So, there shouldn't be duplicates, and it's unclear what we should be protected from. I left 0002 with just asserts, and extracted the questionable changes into 0005.

--
Regards,
Alexander Korotkov
Supabase

v3-0001-Fix-asymmetry-in-setting-EquivalenceClass.ec_sort.patch Description: Binary data
v3-0002-Add-invariants-check-to-get_useful_group_keys_ord.patch Description: Binary data
v3-0003-Rename-PathKeyInfo-to-GroupByOrdering.patch Description: Binary data
v3-0004-Restore-preprocess_groupclause.patch Description: Binary data
v3-0005-Teach-group_keys_reorder_by_pathkeys-about-redund.patch Description: Binary data
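As a purely SQL-level illustration of where redundant grouping clauses come from (not taken from the patches, just an example of what preprocess_groupclause() and make_pathkeys_for_sortclauses_extended() already collapse before the reordering code runs):

CREATE TABLE grp_demo (a int, b int, c numeric);
INSERT INTO grp_demo SELECT i % 10, i % 100, i FROM generate_series(1, 1000) i;
ANALYZE grp_demo;

-- The duplicate "a" is removed during planning; the resulting plan
-- shows only "Group Key: a, b".
EXPLAIN (COSTS OFF)
SELECT a, b, sum(c) FROM grp_demo GROUP BY a, b, a;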
Re: Add memory context type to pg_backend_memory_contexts view
On Fri, 31 May 2024 at 07:21, David Christensen wrote:
> Giving this a once-through, this seems straightforward and useful. I
> have a slight preference for keeping "name" the first field in the
> view and moving "type" to the second, but that's minor.

Not having it first makes sense, but I don't think putting it between name and ident is a good idea. I think name and ident belong next to each other. parent likely should come after those since that also relates to the name. How about putting it after "parent"?

> Just confirming that the allocator types are not extensible without a
> recompile, since it's using a specific node tag to switch on, so there
> are no concerns with not properly displaying the output of something
> else.

They're not extensible.

> The "???" text placeholder might be more appropriate as something more
> descriptive, or perhaps stronger, include a WARNING in the logs, since
> an unknown tag at this point would be an indication of some sort of
> memory corruption.

This follows what we do in other places. If you look at explain.c, you'll see lots of "???"s. I think if you're worried about corrupted memory, then garbled output in pg_get_backend_memory_contexts wouldn't be that high on the list of concerns.

> Since there are only four possible values, I think there would be
> utility in including them in the docs for this field.

I'm not sure about this. We do try not to expose too much internal detail in the docs. I don't know all the reasons for that, but at least one reason is that it's easy for things to get outdated as code evolves. I'm also unsure how much value there is in listing 4 possible values unless we were to document the meaning of each of those values, and doing that puts us even further down the path of detailing Postgres internals in the documents. I don't think that's a maintenance burden that's often worth the trouble.

> I also think it would be useful to have some sort of comments at least
> in mmgr/README to indicate that if a new type of allocator is introduced
> that you will also need to add the node to the function for this type,
> since it's not an automatic conversion.

I don't think it's sustainable to do this. If we have to maintain documentation that lists everything you must do in order to add some new node types then I feel it's just going to get outdated as soon as someone adds something new that needs to be done. I'm only one developer, but for me, I'd not even bother looking there if I was planning to add a new memory context type. What I would be doing is searching the entire code base for where special handling is done for the existing types and ensuring I consider if I need to include a case for the new node type. In this case, I'd probably choose to search for "T_AllocSetContext", and I'd quickly land on PutMemoryContextsStatsTupleStore() and update it. This method doesn't get outdated, provided you do "git pull" occasionally.

> (For that matter, instead of switching on node type and outputting a
> given string, is there a generic function that could just give us the
> string value for node type so we don't need to teach anything else
> about it anyway?)

There isn't. nodeToString() does take some node types as inputs and serialises those to something JSON-like, but that includes serialising each field of the node type too. The memory context types are not handled by those functions.

I think it's fine to copy what's done in explain.c. "git grep \"???\" -- *.c | wc -l" gives me 31 occurrences, so I'm not doing anything new.
I've attached an updated patch which changes the position of the new column in the view. Thank you for the review. David v3-0001-Add-context-type-field-to-pg_backend_memory_conte.patch Description: Binary data
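For anyone wanting to eyeball the result, a quick query against the patched view (assuming the new column is named "type" and placed after "parent", per the attached version) would look something like:

-- Assumes the patched pg_backend_memory_contexts view with a "type" column.
SELECT name, parent, type, total_bytes
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 5;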
Explicit specification of index ensuring uniqueness of foreign columns
I'd like to resurrect a subset of my proposal in [1], specifically that:

The FOREIGN KEY constraint syntax gains a [ USING INDEX index_name ] clause optionally following the referenced column list. The index specified by this clause is used to support the foreign key constraint, and it must be a non-deferrable unique or primary key index on the referenced table compatible with the referenced columns.

I believe that it may be independently valuable to have some syntax available to influence which index is used to ensure uniqueness of the foreign columns in a foreign key constraint. Currently, this index is identified implicitly from the REFERENCEd columns when the constraint is created. This causes the following to imperfectly round trip through a pg_dump and restore:

CREATE TABLE foo (
    id INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY
);
CREATE UNIQUE INDEX foo_key2 ON foo(id);
CREATE UNIQUE INDEX foo_key1 ON foo(id);
CREATE TABLE bar (
    foo_id INTEGER NOT NULL CONSTRAINT bar_fkey REFERENCES foo(id)
);

Using this query to identify the unique index backing the bar_fkey constraint:

SELECT objid, refobjid::regclass FROM pg_depend
WHERE objid = (
    SELECT oid FROM pg_constraint WHERE conname = 'bar_fkey'
) AND refobjsubid = 0;

Then after the DDL is applied, the foreign key constraint depends on foo_key2:

 objid | refobjid
-------+----------
 17152 | foo_key2

But following a pg_dump and restore, the foreign key's unique index dependency has changed to foo_key1:

 objid | refobjid
-------+----------
 17167 | foo_key1

This discrepancy appears to be caused by this confluence of circumstances:

1. The unique index backing the foreign side of a foreign key constraint is searched for in OID order:

static Oid
transformFkeyCheckAttrs(Relation pkrel,
                        int numattrs, int16 *attnums,
                        Oid *opclasses) /* output parameter */
{
    ...
    indexoidlist = RelationGetIndexList(pkrel);

    foreach(indexoidscan, indexoidlist)
    {
        ...
    }

2. The indexes appear in the pg_dump output before the FOREIGN KEY constraint, and they appear in lexicographic, rather than OID, order.

While, in this minimal reproduction, the two indexes are interchangeable, there are situations that may reasonably occur in the course of ordinary use in which they aren't. For example, a redundant unique index with different storage parameters may exist during the migration of an application schema. If the incorrect index is then selected to be a dependency of a foreign key constraint following a pg_dump and restore, it will likely cause subsequent steps in the migration to fail.

Note that this proposal deals with indexes rather than constraints because this is, internally, what PostgreSQL uses. Specifically, PostgreSQL doesn't actually require there to be a unique constraint on the foreign columns of a foreign key constraint; a unique index alone is sufficient. However, I believe that this proposal would be essentially the same if it were changed to a USING CONSTRAINT clause, since it is already possible to explicitly specify the underlying index for a unique or primary key constraint.

If I submitted a patch implementing this functionality, would there be any interest in it?

[1]: https://www.postgresql.org/message-id/flat/CA%2BCLzG8HZUk8Gb9BKN88fgdSEqHx%3D2iq5aDdvbz7JqSFjA2WxA%40mail.gmail.com
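For concreteness, with the proposed (not yet implemented) clause, the example above could pin the constraint to a specific index like this:

CREATE TABLE bar (
    foo_id INTEGER NOT NULL
        CONSTRAINT bar_fkey REFERENCES foo (id) USING INDEX foo_key2
);

pg_dump could then emit the same clause, so the dependency would survive a dump and restore regardless of index creation or lexicographic order.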
Cluster forcefully removal without user input
Hello,

I have installed PostgreSQL 15 and PostgreSQL 14 side by side and want to upgrade from 14 to 15. For upgrading purposes, I am using {postgresql-15-setup check_upgrade}. However, in my case, the installed 14 version is not compatible with the latest 15.7.

After the installation and cluster initialization of PostgreSQL 14 and 15, when I run the following command {postgresql-15-setup check_upgrade}, it returns the following message:

"Performing upgrade check: Upgrade failed. Removing the new cluster. Please re-initdb the new cluster.
failed"

After the failure, the PostgreSQL 15 cluster is removed forcefully due to the following code written in the postgresql-15-setup script file:

if [ $script_result -eq 0 ]; then
    echo $"OK"
else
    # Clean up after failure
    echo "Upgrade failed. Removing the new cluster. Please re-initdb the new cluster."
    rm -rf "$PGDATA"
    echo $"failed"
fi

My concern here is whether forcefully deleting the user's cluster without obtaining permission from the user is the right approach.

Regards,
Zaid Shabbir
AGEDB
Re: Use generation memory context for tuplestore.c
On Sat, 4 May 2024 at 03:51, Matthias van de Meent wrote:
>
> On Fri, 3 May 2024 at 15:55, David Rowley wrote:
> > master @ 8f0a97dff
> > Storage: Memory Maximum Storage: 16577kB
> >
> > patched:
> > Storage: Memory Maximum Storage: 8577kB
>
> Those are some impressive numbers.

This patch needed to be rebased, so updated patches are attached.

I was also reflecting on what Bruce wrote in [1] about having to parse performance numbers from the commit messages, so I decided to adjust the placeholder commit message I'd written to make the performance numbers more clear to Bruce, or whoever does the next major version release notes. That caused me to experiment with finding the best case for this patch. I could scale the improvement much further than I have, but here's something I came up with that's easy to reproduce.

create table winagg (a int, b text);
insert into winagg select a, repeat('a', 1024) from generate_series(1, 10000000) a;
set work_mem = '1MB';
set jit = 0;
explain (analyze, timing off)
select sum(l1), sum(l2)
from (select length(b) l1, length(lag(b, 800) over ()) as l2
      from winagg
      limit 1600);

master:  Execution Time: 6585.685 ms
patched: Execution Time: 4.159 ms

1583x faster.

I've effectively just exploited the spool_tuples() behaviour of what it does when the tuplestore goes to disk to have it spool the entire remainder of the partition, which is 10 million rows. I'm just taking a tiny portion of those with the LIMIT 1600. I just set work_mem to something where the patched version won't have the tuplestore spill to disk, so that spool_tuples() only spools what's needed in the patched version. So, artificial is a word you could use, but certainly, someone could find this performance cliff in the wild and be prevented from falling off it by this patch.

David

[1] https://www.postgresql.org/message-id/Zk5r2XyI0BhXLF8h%40momjian.us

v2-0001-Add-memory-disk-usage-for-Material-in-EXPLAIN-ANA.patch Description: Binary data
v2-0002-Optimize-memory-management-and-increase-performan.patch Description: Binary data