Re: ANALYZE ONLY

2024-09-23 Thread David Rowley
On Mon, 23 Sept 2024 at 19:39, jian he  wrote:
> sql-analyze.html
> For partitioned tables, ANALYZE gathers statistics by sampling rows
> from all partitions. Each leaf partition is also recursively analyzed
> and the statistics updated. This recursive part may be disabled by
> using the ONLY keyword, otherwise, leaf partitions are analyzed only
> once, even with multi-level partitioning. No statistics are collected
> for only the parent table (without data from its partitions), because
> with partitioning it's guaranteed to be empty.
>
> allow me to ask an English language question.
> here, does "otherwise" refer to specifying ONLY or to not specifying it?
> As far as I understand:
> if you specify ONLY, postgres will only do "For partitioned tables,
> ANALYZE gathers statistics by sampling rows from all partitions"
> if you do not specify ONLY, postgres will do "For partitioned tables,
> ANALYZE gathers statistics by sampling rows from all partitions *AND*
> also recursively analyze each leaf partition"
>
> Is my understanding correct?

The "Otherwise" case applies when "ONLY" isn't used.

If this is confusing, I think there's a bunch of detail that I tried
to keep that's just not that useful.  The part about analyzing
partitions just once and the part about not collecting non-inheritance
stats for the partitioned table seem like extra detail that's either
obvious or just not that important.

Can you have a look at the attached and let me know if it's easier to
understand now?

David


v8-0001-Add-ONLY-support-for-VACUUM-and-ANALYZE.patch
Description: Binary data


v8-0002-fixup-Implementation-of-the-ONLY-feature-for-ANAL.patch
Description: Binary data


Re: scalability bottlenecks with (many) partitions (and more)

2024-09-23 Thread Tomas Vondra
On 9/23/24 01:06, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 9/22/24 17:45, Tom Lane wrote:
>>> #define FAST_PATH_GROUP(index)  \
>>> -   (AssertMacro(((index) >= 0) && ((index) < FP_LOCK_SLOTS_PER_BACKEND)), \
>>> +   (AssertMacro((uint32) (index) < FP_LOCK_SLOTS_PER_BACKEND), \
>>> ((index) / FP_LOCK_SLOTS_PER_GROUP))
> 
>> For the (x >= 0) asserts, doing it this way relies on negative values
>> wrapping to large positive ones, correct? AFAIK it's guaranteed to be a
>> very large value, so it can't accidentally be less than the slot count.
> 
> Right, any negative value would wrap to something more than
> INT32_MAX.
> 

Thanks. Pushed a fix for these issues; hopefully Coverity will be happy.
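
To make the wrap-around reasoning above concrete, here is a small standalone
sketch (the slot count is just an assumed illustration value, not the real
constant):

#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

#define FP_LOCK_SLOTS_PER_BACKEND 64    /* assumed value, for illustration only */

int
main(void)
{
    int index = -1;

    /* One unsigned comparison covers both "index >= 0" and
     * "index < FP_LOCK_SLOTS_PER_BACKEND": -1 wraps to UINT32_MAX. */
    printf("%" PRIu32 "\n", (uint32_t) index);      /* prints 4294967295 */
    assert(!((uint32_t) index < FP_LOCK_SLOTS_PER_BACKEND));
    return 0;
}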

BTW, is the Coverity report accessible somewhere? I know someone
mentioned that in the past, but I don't recall the details. Maybe we
should have a list of all these resources, useful for committers,
somewhere on the wiki?


regards

-- 
Tomas Vondra




Re: Clock-skew management in logical replication

2024-09-23 Thread Amit Kapila
On Sun, Sep 22, 2024 at 7:24 PM Joe Conway  wrote:
>
> On 9/21/24 01:31, shihao zhong wrote:
> > Nisha Moond  writes:
> >> Thoughts? Looking forward to hearing others' opinions!
> >
> > Had a productive conversation with Amit Kapila today about time skew
> > in distributed systems, and wanted to share some thoughts.
> > Essentially, we're grappling with the classic distributed snapshot
> > problem. In a multi-active environment, where multiple nodes can
> > independently process transactions,  it becomes crucial to determine
> > the visibility of these transactions across the system.  Time skew,
> > where different machines have different timestamps, makes it a hard
> > problem. How can we ensure consistent transaction ordering and
> > visibility when time itself is unreliable?
> >
> > As you mentioned, there are several ways to tackle the time skew
> > problem in distributed systems. These approaches generally fall into
> > three main categories:
> >
> > 1. Centralized Timestamps (Timestamp Oracle)
> > 2. Atomic Clocks (True Time)
> > 3. Hybrid Logical Clocks
> > 4. Local Clocks
>
> > I recommend .. implement a pluggable time access method. This
> > allows users to integrate with different time services as needed.
>
> Huge +1
>

One idea to provide user control over the timestamps used for the
'latest_write_wins' strategy could be to let users specify the values
in a special column in the table that will be used to resolve
conflicts.

CREATE TABLE foo(c1 int, c2 timestamp default conflict_fn, CHECK CONFLICTS(c2));

Now, for column c2 the user can provide their own function that supplies
a value for each row to be used for conflict resolution. If the
table-level conflict column is provided then it will be used to
resolve conflicts; otherwise, the default commit timestamp provided by
the commit_ts module will be used.

On the apply-side, we will use a condition like:
if ((source_new_column_value > replica_current_column_value) ||
operation.type == "delete")
  apply_update();

In the above example, source_new_column_value and
replica_current_column_value will be column c2 on the publisher and
subscriber respectively. Note that in this case we allow deletes to always
win, as the delete operation doesn't update the column values. We could
choose a different strategy for applying deletes, such as comparing the
existing column values as well.
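
To spell out the apply-side rule, here is a tiny self-contained sketch of the
decision (the function name and the integer timestamps are mine for
illustration; this is not the actual apply-worker code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { OP_INSERT, OP_UPDATE, OP_DELETE } OpType;

/* last_update_wins as described above: deletes always win because they
 * don't carry the conflict column; otherwise the newer timestamp wins. */
static bool
remote_change_wins(OpType op, int64_t source_new_column_value,
                   int64_t replica_current_column_value)
{
    if (op == OP_DELETE)
        return true;
    return source_new_column_value > replica_current_column_value;
}

int
main(void)
{
    printf("%d\n", remote_change_wins(OP_UPDATE, 200, 100));    /* 1: apply */
    printf("%d\n", remote_change_wins(OP_UPDATE, 100, 200));    /* 0: skip  */
    printf("%d\n", remote_change_wins(OP_DELETE, 0, 200));      /* 1: apply */
    return 0;
}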

Note that MySQL [1] and Oracle's TimesTen [2] provide a similar
strategy at the table level for conflict resolution to avoid reliance
on system clocks.

Though this provides a way for users to control values required for
conflict resolution, I prefer a simple approach at least for the first
version which is to document that users should ensure time
synchronization via NTP. Even Oracle mentions the same in their docs
[3] (See from: "It is critical to ensure that the clocks on all
databases are identical to one another and it’s recommended that all
database servers are configured to maintain accurate time through a
time server using the network time protocol (NTP). Even in
environments where databases span different time zones, all database
clocks must be set to the same time zone or Coordinated Universal Time
(UTC) must be used to maintain accurate time. Failure to maintain
accurate and synchronized time across the databases in an
active-active replication environment will result in data integrity
issues.")

[1] - 
https://dev.mysql.com/doc/refman/9.0/en/mysql-cluster-replication-schema.html#ndb-replication-ndb-replication
[2] - 
https://docs.oracle.com/en/database/other-databases/timesten/22.1/replication/configuring-timestamp-comparison.html#GUID-C8B0580B-B577-435F-8726-4AF341A09806
[3] - 
https://www.oracle.com/cn/a/tech/docs/technical-resources/wp-oracle-goldengate-activeactive-final2-1.pdf

-- 
With Regards,
Amit Kapila.




Re: Clock-skew management in logical replication

2024-09-23 Thread Nisha Moond
On Fri, Sep 20, 2024 at 7:51 PM Tom Lane  wrote:
>
> Nisha Moond  writes:
> > While considering the implementation of timestamp-based conflict
> > resolution (last_update_wins) in logical replication (see [1]), there
> > was feedback at [2] and a discussion on whether or not to manage
> > clock-skew at the database level.
>
> FWIW, I cannot see why we would do anything beyond suggesting that
> people run NTP.  That's standard anyway on the vast majority of
> machines these days.  Why would we add complexity that we have
> to maintain (and document) in order to cater to somebody not doing
> that?
>
> regards, tom lane

Thank you for your response.

I agree with suggesting that users run NTP, and we can recommend it in
the docs rather than introducing additional complexity.

In my research on setting up NTP servers on Linux, I found that
Chrony [1] is a lightweight and efficient solution for time
synchronization across nodes. Another reliable option is the classic
NTP daemon (ntpd) [2], which is also easy to configure and maintain.
Both Chrony and ntpd can be used to configure a local machine as an
NTP server for localized time synchronization, or as clients syncing
from public NTP servers such as 'ntp.ubuntu.com' (the default NTP server
pool for Ubuntu systems) or 'time.google.com' (Google Public NTP).
For example, on Ubuntu, Chrony is straightforward to install and
configure [3]. Comprehensive NTP (ntpd) configuration guides are
available for various Linux distributions, such as Ubuntu [4] and
Red Hat Linux [5].

Further, I’m exploring options for implementing NTP on Windows systems.

[1] https://chrony-project.org/index.html
[2] https://www.ntp.org/documentation/4.2.8-series/
[3] 
https://documentation.ubuntu.com/server/how-to/networking/serve-ntp-with-chrony/
[4] https://askubuntu.com/questions/14558/how-do-i-setup-a-local-ntp-server
[5] 
https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/system_administrators_guide/ch-configuring_ntp_using_ntpd#s1-Understanding_the_ntpd_Configuration_File

Thanks,
Nisha




Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-23 Thread Tatsuo Ishii
> On Mon, 23 Sept 2024 at 18:28, Tatsuo Ishii  wrote:
>> I agree and made necessary changes. See attached v4 patches.
> 
> Looks good to me.

Thank you for the review! I have pushed the patch.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: meson and check-tests

2024-09-23 Thread Nazir Bilal Yavuz
Hi,

On Sat, 21 Sept 2024 at 09:01, jian he  wrote:
>
> hi. Thanks for your work!

Thank you for looking into this!

> I do find some issues.
>
>
> TESTS="copy jsonb jsonb" meson test --suite regress
> one will fail. not sure this is expected?

Yes, that is expected.

> in [1] you mentioned "setup", but that "setup" is more or less like
> "meson test  --suite setup --suite regress"
> but originally, I thought was about "src/test/regress/sql/test_setup.sql".
> for example, now you cannot run src/test/regress/sql/stats_ext.sql
> without first running test_setup.sql, because some functions (like fipshash)
> live in test_setup.sql.

Postgres binaries are created at the build step in the make builds, so
these binaries can be used for the tests. But in the meson builds,
they are created by 'meson test --suite setup' for testing purposes
(the 'meson install' command creates binaries as well, but those are for
installing binaries to the OS, not for testing). So, 'meson test
--suite setup' should be run before running regression tests.

> so
> TESTS="copy jsonb stats_ext" meson test --suite regress
> will fail.
>
> to make it work we need change it to
> TESTS="test_setup copy jsonb stats_ext" meson test --suite regress
>
> Many tests depend on test_setup.sql, maybe we can implicitly prepend it.
> Another dependency issue. alter_table depending on create_index.
>
> TESTS="test_setup alter_table" meson test --suite regress
> will fail.
> TESTS="test_setup create_index alter_table" meson test --suite regress
> will work.

Yes, I realized that but since that is how it is done in the make
builds, I didn't want to change the behaviour. Also, I think it makes
sense to leave it to the tester. It is more flexible in that way.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Increase of maintenance_work_mem limit in 64-bit Windows

2024-09-23 Thread Vladlen Popolitov

David Rowley wrote on 2024-09-23 04:28:

On Fri, 20 Sept 2024 at 01:55, Пополитов Владлен
 wrote:

Currently PostgreSQL built on 64-bit Windows has a 2GB limit for
GUC variables due to sizeof(long)==4 used by Windows compilers.
Technically 64-bit addressing for maintenance_work_mem is possible,
but the code base historically uses variables and constants of type
"long" when processing the maintenance_work_mem value.


I agree. Ideally, we shouldn't use longs for anything ever. We should
likely try to remove the usages of them when possible.

I'd like to suggest you go about this patch slightly differently with
the end goal of removing the limitation from maintenance_work_mem,
work_mem, autovacuum_work_mem and logical_decoding_work_mem.

Patch 0001: Add a macro named something like WORK_MEM_KB_TO_BYTES()
and adjust all places where we do the <GUC value> * 1024L multiplication
to use this new macro. Make the macro do the * 1024L as is done today so
that this patch is a simple refactor.
Patch 0002: Convert all places that use long to use Size instead.
Adjust WORK_MEM_KB_TO_BYTES to use a Size type rather than 1024L.

It might be wise to break 0002 down into individual GUCs as the patch
might become large.

I suspect we might have quite a large number of subtle bugs in our
code today due to using longs. 7340d9362 is an example of one that was
fixed recently.

David


Hi David,
Thank you for the proposal. I looked at the patch and source code from this
point of view. In this approach we need to change all the affected GUC
variables. I counted the appearances of these vars in the code:
maintenance_work_mem appears 63 times in 20 files
work_mem appears 113 times in 48 files
logical_decoding_work_mem appears 10 times in 2 files
max_stack_depth appears 11 times in 3 files
wal_keep_size_mb appears 5 times in 3 files
min_wal_size_mb appears 5 times in 2 files
max_wal_size_mb appears 10 times in 2 files
wal_skip_threshold appears 5 times in 2 files
max_slot_wal_keep_size_mb appears 6 times in 3 files
wal_sender_timeout appears 23 times in 3 files
autovacuum_work_mem appears 11 times in 4 files
gin_pending_list_limit appears 8 times in 5 files
pendingListCleanupSize appears 2 times in 2 files
GinGetPendingListCleanupSize appears 2 times in 2 files

maintenance_work_mem appears 63 times and has only 4 cases where "long"
is used (I fixed them in the patch). I also found that this patch also
fixes autovacuum_work_mem, which has only 1 case - the same place in the
code as maintenance_work_mem.

Now these variables are processed in the code based on the context: they
are assigned to Size, uint64, int64, double, long, and int variables (the
last 2 cases need fixing) or multiplied by (uint64)1024, (Size)1024, or
1024L (the last case needs fixing). Also, a signed value is used for
max_stack_depth (-1 is used as an error value). I am not sure that we can
solve all these cases with one macro WORK_MEM_KB_TO_BYTES(). The code
needs a case-by-case check.
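
For reference, a minimal compilable sketch of the two macro stages discussed
above (the _V1/_V2 names and the local Size typedef are mine, only to show
both variants side by side; this is not the actual patch):

#include <stddef.h>
#include <stdio.h>

typedef size_t Size;            /* stand-in for PostgreSQL's Size typedef */

/* Patch 0001 idea: pure refactor, same "long" arithmetic as today */
#define WORK_MEM_KB_TO_BYTES_V1(kb) ((kb) * 1024L)

/* Patch 0002 idea: widen to Size, so values above 2GB also work where long is 32-bit */
#define WORK_MEM_KB_TO_BYTES_V2(kb) ((Size) (kb) * 1024)

int
main(void)
{
    int work_mem_kb = 4 * 1024 * 1024;      /* 4GB expressed in kB */

    /* The 0001 form inherits long's range (below 2GB on LLP64, i.e. 64-bit
     * Windows); the 0002 form removes that ceiling. */
    printf("%zu bytes\n", WORK_MEM_KB_TO_BYTES_V2(work_mem_kb));
    return 0;
}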

If I check the rest of the variables, the patch will not need the
MAX_SIZE_T_KILOBYTES constant (I introduced it for variables that are
already checked and fixed); it will contain only fixes to the types of
the variables and the constants.
It requires a lot of time to check all appearances and the neighbouring
code, but the final patch will not be large; I do not expect a lot of
"long" in the rest of the code (only 4 cases out of 63 needed fixing
for maintenance_work_mem).
What do you think about this approach?

--
Best regards,

Vladlen Popolitov.




Re: scalability bottlenecks with (many) partitions (and more)

2024-09-23 Thread Jakub Wartak
On Mon, Sep 16, 2024 at 4:19 PM Tomas Vondra  wrote:
> On 9/16/24 15:11, Jakub Wartak wrote:
> > On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra  wrote:
> >
> >> [..]
> >
> >> Anyway, at this point I'm quite happy with this improvement. I didn't
> >> have any clear plan when to commit this, but I'm considering doing so
> >> sometime next week, unless someone objects or asks for some additional
> >> benchmarks etc.
> >
> > Thank you very much for working on this :)
> >
> > The only fact that comes to my mind is that we could blow up L2
> > caches. Fun fact, so if we are growing PGPROC by 6.3x, that's going to
> > be like one or two 2MB huge pages more @ common max_connections=1000
> > x86_64 (830kB -> ~5.1MB), and indeed:
[..]
> > then maybe(?) one could observe further degradation of dTLB misses in
> > the perf-stat counter under some microbenchmark, but measuring that
> > requires isolated and physical hardware. Maybe that would be actually
> > noise due to overhead of context-switches itself. Just trying to think
> > out loud, what big PGPROC could cause here. But this is already an
> > unhealthy and non-steady state of the system, so IMHO we are good,
> > unless someone comes up with a better (more evil) idea.
> >
>
> I've been thinking about such cases too, but I don't think it can really
> happen in practice, because:
>
> - How likely is it that the sessions will need a lot of OIDs, but not
> the same ones? Also, why would it matter that the OIDs are not the same,
> I don't think it matters unless one of the sessions needs an exclusive
> lock, at which point the optimization doesn't really matter.
>
> - If having more fast-path slots means it doesn't fit into L2 cache,
> would we fit into L2 without it? I don't think so - if there really are
> that many locks, we'd have to add those into the shared lock table, and
> there's a lot of extra stuff to keep in memory (relcaches, ...).
>
> This is pretty much one of the cases I focused on in my benchmarking,
> and I'm yet to see any regression.

Sorry for answering this so late. Just for context here: I was
imagining a scenario with high max_connections and e.g. schema-based
multi-tenancy with no partitioning (so all would be fine without this
$thread/commit; fewer than 16 (fast-path) locks would be taken). The OIDs
need to be different to avoid contention, so that futex() does not end
up really in a syscall (just the user-space part). My theory was that a much
smaller PGPROC should be doing many fewer (data) cache-line fetches
than with the patch. That hash() % prime hits various parts of a
larger array - so without the patch it should be quicker, as it wouldn't be
randomly hitting some larger array[], but it might be noise as you
state.  It was a theoretical attempt at crafting the worst possible
conditions for the patch, so feel free to disregard it, as it already
assumes an anti-pattern (big & all-active max_connections).

> > Well the only thing I could think of was to add to the
> > doc/src/sgml/config.sgml / "max_locks_per_transaction" GUC, that "it
> > is also used as advisory for the number of groups used in
> > lockmanager's fast-path implementation" (that is, without going into
> > further discussion, as even pg_locks discussion
> > doc/src/sgml/system-views.sgml simply uses that term).
> >
>
> Thanks, I'll consider mentioning this in max_locks_per_transaction.
> Also, I think there's a place calculating the amount of per-connection
> memory, so maybe that needs to be updated too.
>

I couldn't find it in current versions, but maybe that's helpful/reaffirming:
- up to 9.2 there were exact formulas used, see "(1800 + 270 *
max_locks_per_transaction) * max_connections" [1]; that's long gone
now (a small worked example of that formula follows below).
- if anything, then Andres might want to improve his blog
entry [2] a little (my take is that it seems to be the most accurate and
authoritative technical information that we have online)
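
Just to put a number on the old formula quoted above (the settings are
chosen by me purely for illustration):

#include <stdio.h>

int
main(void)
{
    long max_locks_per_transaction = 64;    /* illustrative value */
    long max_connections = 100;             /* illustrative value */
    long bytes = (1800 + 270 * max_locks_per_transaction) * max_connections;

    /* prints roughly 1908000 bytes, i.e. ~1.8 MB of shared memory */
    printf("%ld bytes (%.2f MB)\n", bytes, bytes / (1024.0 * 1024.0));
    return 0;
}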

-J.

[1] - https://www.postgresql.org/docs/9.2/kernel-resources.html
[2] - 
https://blog.anarazel.de/2020/10/07/measuring-the-memory-overhead-of-a-postgres-connection/




Re: Converting README documentation to Markdown

2024-09-23 Thread Peter Eisentraut

On 10.09.24 14:50, Daniel Gustafsson wrote:

On 1 Jul 2024, at 12:22, Daniel Gustafsson  wrote:



Attached is a v2 which fixes a conflict, if there is no interest in Markdown
I'll drop 0001 and the markdown-specifics from 0002 to instead target increased
consistency.


Since there doesn't seem to be much interest in going all the way to Markdown,
the attached 0001 is just the formatting changes for achieving (to some degree)
consistency among the README's.  This mostly boils down to using a consistent
amount of whitespace around code, using the same indentation on bullet lists
and starting sections the same way.  Inspecting the patch with git diff -w
reveals that it's not much left once whitespace is ignored.  There might be a
few markdown hunks left which I'll hunt down in case anyone is interested in
this.


I went through this file by file and checked the results of a
markdown-to-HTML conversion using cmark, looking at the raw output
source files.


A lot of the changes are obvious and make sense.  But there are a number 
of cases of code within lists or nested lists or both that need further 
careful investigation.  I'm attaching a fixup patch where I tried to 
improve some of this (and a few other things I found along the way). 
Some of the more complicated ones, such as 
src/backend/storage/lmgr/README-SSI, will need to be checked again and 
even more carefully to make sure that the meaning is not altered by 
these patches.


One underlying problem that I see is that markdown assumes four-space
tabs, but a standard editor configuration (and apparently your editor)
uses 8-space tabs.  But then, if you have a common situation like


```
1. Run this code

<tab>$ sudo kill
```

then that's incorrect (the code line will not be inside the list), 
because it should be


```
1. Run this code

<tab><tab>$ sudo kill
```

or

```
1. Run this code

<8 spaces>$ sudo kill
```

So we need to think about a way to make this more robust for future
people editing.  Maybe something in .gitattributes or some editor
settings.  Otherwise, it will be all over the place after a while.
(There are also a couple of places where you apparently changed
whitespace that didn't need to be changed.)


Apart from this, I don't like changing the placeholders like <foo> to
< foo >.  In some cases, this really decreases readability.  Maybe we
should look for different approaches there.


Maybe there are some easy changes that could be extracted from this 
patch, but the whitespace and list issue needs more consideration.
From 990329d7a8e6110ecf099162680f6c0eaf38ea14 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 23 Sep 2024 13:13:18 +0200
Subject: [PATCH] fixup! Standardize syntax in internal documentation

---
 contrib/start-scripts/macos/README|  6 ++--
 src/backend/access/gin/README |  6 ++--
 src/backend/access/gist/README|  8 ++---
 src/backend/access/hash/README|  2 +-
 src/backend/lib/README| 22 ++--
 src/backend/optimizer/plan/README |  6 ++--
 src/backend/parser/README | 42 +++
 src/backend/storage/lmgr/README-SSI   |  6 ++--
 src/backend/utils/fmgr/README |  1 -
 src/backend/utils/misc/README |  8 ++---
 src/interfaces/ecpg/preproc/README.parser | 20 +--
 src/port/README   |  8 ++---
 src/test/recovery/README  |  4 +--
 src/test/ssl/README   | 14 
 src/tools/pgindent/README | 24 ++---
 15 files changed, 88 insertions(+), 89 deletions(-)

diff --git a/contrib/start-scripts/macos/README 
b/contrib/start-scripts/macos/README
index 8fe6efb657d..c71f98880ca 100644
--- a/contrib/start-scripts/macos/README
+++ b/contrib/start-scripts/macos/README
@@ -15,11 +15,11 @@ if you plan to run the Postgres server under some user name 
other
 than "postgres", adjust the UserName parameter value for that.
 
 4. Copy the modified org.postgresql.postgres.plist file into
-  /Library/LaunchDaemons/.  You must do this as root:
+/Library/LaunchDaemons/.  You must do this as root:
 
-   sudo cp org.postgresql.postgres.plist /Library/LaunchDaemons
+sudo cp org.postgresql.postgres.plist /Library/LaunchDaemons
 
-  because the file will be ignored if it is not root-owned.
+because the file will be ignored if it is not root-owned.
 
 At this point a reboot should launch the server.  But if you want
 to test it without rebooting, you can do
diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README
index 08faa944105..ccff6896513 100644
--- a/src/backend/access/gin/README
+++ b/src/backend/access/gin/README
@@ -124,9 +124,9 @@ know are:
   or IndexInfoFindDataOffset + sizeof(int2)) there is a byte indicating
   the "category" of the null entry.  These are the possible categories:
 
-   1 = ordinary null key value extracted from an indexable item
-   2 = placeholder for zero-ke

Re: Pgoutput not capturing the generated columns

2024-09-23 Thread vignesh C
On Fri, 20 Sept 2024 at 04:16, Peter Smith  wrote:
>
> On Fri, Sep 20, 2024 at 3:26 AM Masahiko Sawada  wrote:
> >
> > On Thu, Sep 19, 2024 at 2:32 AM Amit Kapila  wrote:
> > >
> ...
> > > I think that the column list should take priority and we should
> > > publish the generated column if it is mentioned in  irrespective of
> > > the option.
> >
> > Agreed.
> >
> > >
> ...
> > >
> > > Users can use a publication like "create publication pub1 for table
> > > t1(c1, c2), t2;" where they want t1's generated column to be published
> > > but not for t2. They can specify the generated column name in the
> > > column list of t1 in that case even though the rest of the tables
> > > won't publish generated columns.
> >
> > Agreed.
> >
> > I think that users can use the publish_generated_column option when
> > they want to publish all generated columns, instead of specifying all
> > the columns in the column list. It's another advantage of this option
> > that it will also include the future generated columns.
> >
>
> OK. Let me give some examples below to help understand this idea.
>
> Please correct me if these are incorrect.
>
> ==
>
> Assuming these tables:
>
> t1(a,b,gen1,gen2)
> t2(c,d,gen1,gen2)
>
> Examples, when publish_generated_columns=false:
>
> CREATE PUBLICATION pub1 FOR t1(a,b,gen2), t2 WITH
> (publish_generated_columns=false)
> t1 -> publishes a, b, gen2 (e.g. what column list says)
> t2 -> publishes c, d
>
> CREATE PUBLICATION pub1 FOR t1, t2(gen1) WITH 
> (publish_generated_columns=false)
> t1 -> publishes a, b
> t2 -> publishes gen1 (e.g. what column list says)
>
> CREATE PUBLICATION pub1 FOR t1, t2 WITH (publish_generated_columns=false)
> t1 -> publishes a, b
> t2 -> publishes c, d
>
> CREATE PUBLICATION pub1 FOR ALL TABLES WITH (publish_generated_columns=false)
> t1 -> publishes a, b
> t2 -> publishes c, d
>
> ~~
>
> Examples, when publish_generated_columns=true:
>
> CREATE PUBLICATION pub1 FOR t1(a,b,gen2), t2 WITH
> (publish_generated_columns=true)
> t1 -> publishes a, b, gen2 (e.g. what column list says)
> t2 -> publishes c, d + ALSO gen1, gen2
>
> CREATE PUBLICATION pub1 FOR t1, t2(gen1) WITH (publish_generated_columns=true)
> t1 -> publishes a, b + ALSO gen1, gen2
> t2 -> publishes gen1 (e.g. what column list says)
>
> CREATE PUBLICATION pub1 FOR t1, t2 WITH (publish_generated_columns=true)
> t1 -> publishes a, b + ALSO gen1, gen2
> t2 -> publishes c, d + ALSO gen1, gen2
>
> CREATE PUBLICATION pub1 FOR ALL TABLES WITH (publish_generated_columns=true)
> t1 -> publishes a, b + ALSO gen1, gen2
> t2 -> publishes c, d + ALSO gen1, gen2
>
> ==
>
> The idea LGTM, although now the parameter name
> ('publish_generated_columns') seems a bit misleading since sometimes
> generated columns get published "irrespective of the option".
>
> So, I think the original parameter name 'include_generated_columns'
> might be better here because IMO "include" seems more like "add them
> if they are not already specified", which is exactly what this idea is
> doing.
>
> Thoughts?

I have verified the various scenarios discussed here and the patch
works as expected with v32 version patch shared at [1]:

Test presetup:
-- publisher
CREATE TABLE t1 (a int PRIMARY KEY, b int, gen1 int GENERATED ALWAYS
AS (a * 2) STORED, gen2 int GENERATED ALWAYS AS (a * 2) STORED);
CREATE TABLE t2 (c int PRIMARY KEY, d int, gen1 int GENERATED ALWAYS
AS (c * 2) STORED, gen2 int GENERATED ALWAYS AS (d * 2) STORED);

-- subscriber
CREATE TABLE t1 (a int PRIMARY KEY, b int, gen1 int, gen2 int);
CREATE TABLE t2 (c int PRIMARY KEY, d int, gen1 int, gen2 int);

Test1: Publisher replicates the column list data including generated
columns even though publish_generated_columns option is false:
Publisher:
CREATE PUBLICATION pub1 FOR table t1, t2(gen1) WITH
(publish_generated_columns=false)
insert into t1 values(1,1);
insert into t2 values(1,1);

Subscriber:
--t1 -> publishes a, b
subscriber=# select * from t1;
 a | b | gen1 | gen2
---+---+--+--
 1 | 1 |  |
(1 row)

--t2 -> publishes gen1 (e.g. what column list says)
subscriber=# select * from t2;
 c | d | gen1 | gen2
---+---+--+--
   |   |2 |
(1 row)

Test2: Publisher does not replicate generated columns if the
publish_generated_columns option is false
Publisher:
CREATE PUBLICATION pub1 FOR table t1, t2 WITH (publish_generated_columns=false)
insert into t1 values(1,1);
insert into t2 values(1,1);

Subscriber:
--t1 -> publishes a, b
subscriber=# select * from t1;
 a | b | gen1 | gen2
---+---+--+--
 1 | 1 |  |
(1 row)

-- t2 -> publishes c, d
subscriber=# select * from t2;
 c | d | gen1 | gen2
---+---+--+--
 1 | 1 |  |
(1 row)

Test3: Publisher does not replicate generated columns for a FOR ALL
TABLES publication if the publish_generated_columns option is false
Publisher:
CREATE PUBLICATION pub1 FOR ALL TABLES WITH (publish_generated_columns=false)
insert into t1 values(1,1);
insert into t2 values(1,1);

Subscriber:
--t1 -> publishes a, b
subscriber=# select * from t1;
 a | b | gen1 | gen2
---

Re: Pgoutput not capturing the generated columns

2024-09-23 Thread vignesh C
On Fri, 20 Sept 2024 at 17:15, Shubham Khanna
 wrote:
>
> On Wed, Sep 11, 2024 at 8:55 AM Peter Smith  wrote:
>
> I have fixed all the comments. The attached patches contain the desired 
> changes.
> Also the merging of 0001 and 0002 can be done once there are no
> comments on the patch to help in reviewing.

Few comments:
1) This commit message seems wrong; currently, irrespective of
publish_generated_columns, the columns specified in the column list take
precedence:
When 'publish_generated_columns' is false, generated columns are not
replicated, even when present in a PUBLICATION col-list.

2) Since we have added pubgencols to pg_publication.h we can specify
"Bump catversion" in the commit message.

3) In the CREATE PUBLICATION column list/publish_generated_columns
documentation we should mention that if a generated column is mentioned
in the column list, it will be replicated irrespective of the
publish_generated_columns option.

4) This warning should be issued only if publish_generated_columns is false:
if (TupleDescAttr(tupdesc, attnum - 1)->attgenerated)
-   ereport(ERROR,
+   ereport(WARNING,

errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
-   errmsg("cannot use generated
column \"%s\" in publication column list",
+   errmsg("specified generated
column \"%s\" in publication column list for publication with
publish_generated_columns as false",
   colname));

5) These tests are not required for this feature:
+   'ALTER PUBLICATION pub5 ADD TABLE test_table WHERE (col1 > 0);' => {
+   create_order => 51,
+   create_sql =>
+ 'ALTER PUBLICATION pub5 ADD TABLE
dump_test.test_table WHERE (col1 > 0);',
+   regexp => qr/^
+   \QALTER PUBLICATION pub5 ADD TABLE ONLY
dump_test.test_table WHERE ((col1 > 0));\E
+   /xm,
+   like => { %full_runs, section_post_data => 1, },
+   unlike => {
+   exclude_dump_test_schema => 1,
+   exclude_test_table => 1,
+   },
+   },
+
+   'ALTER PUBLICATION pub5 ADD TABLE test_second_table WHERE
(col2 = \'test\');'
+ => {
+   create_order => 52,
+   create_sql =>
+ 'ALTER PUBLICATION pub5 ADD TABLE
dump_test.test_second_table WHERE (col2 = \'test\');',
+   regexp => qr/^
+   \QALTER PUBLICATION pub5 ADD TABLE ONLY
dump_test.test_second_table WHERE ((col2 = 'test'::text));\E
+   /xm,
+   like => { %full_runs, section_post_data => 1, },
+   unlike => { exclude_dump_test_schema => 1, },
+ },

Regards,
Vignesh




Re: Large expressions in indexes can't be stored (non-TOASTable)

2024-09-23 Thread Alexander Lakhin

Hello Nathan,

20.09.2024 19:51, Nathan Bossart wrote:

Here's a (probably naive) attempt at fixing these, too.  I'll give each
path a closer look once it feels like we've identified all the bugs.


Thank you for the updated patch!

I tested it with two code modifications (the 1st is to make each created
expression index TOASTed (by prepending 1M of spaces to the indexprs
value) and the 2nd is to make each created index an expression index (by
modifying index_elem_options in gram.y) — both modifications are kludgy so
I don't dare to publish them) and found no other snapshot-related issues
during `make check-world`.

Best regards,
Alexander




Re: Incremental Sort Cost Estimation Instability

2024-09-23 Thread Andrei Lepikhov

On 12/9/2024 12:12, David Rowley wrote:

On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov  wrote:

Initial problem causes wrong cost_sort estimation. Right now I think
about providing cost_sort() the sort clauses instead of (or in addition
to) the pathkeys.


I'm not quite sure why the sort clauses matter any more than the
EquivalenceClass.  If the EquivalanceClass defines that all members
will have the same value for any given row, then, if we had to choose
any single member to drive the n_distinct estimate from, isn't the
most accurate distinct estimate from the member with the smallest
n_distinct estimate?  (That assumes the less distinct member has every
value the more distinct member has, which might not be true)

Finally, I implemented this approach in code (see attachment).
The effectiveness may be debatable, but the general approach looks even
better than the previous one.

I have changed the status to 'Needs review'.

--
regards, Andrei Lepikhov
From 3d31893eeb3ed96c7c9292c6961a0455c2a3af4c Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" 
Date: Mon, 23 Sep 2024 14:30:28 +0200
Subject: [PATCH v2] Stabilize incremental sort estimation.

Carefully identify an expression that can represent the path key on sort
estimation. Columns may have different distinct estimations that can trigger
wrong cost estimations and choice of sort strategy.
Sorting has only pathkeys as input for the cost estimation. This patch, instead
of a blind choice of the first equivalence class member, attempts to find
columns that are used under the estimating sort operator and chooses the most
negligible ndistinct value.
---
 src/backend/optimizer/path/costsize.c | 52 ---
 src/backend/optimizer/util/pathnode.c |  1 +
 src/include/optimizer/cost.h  |  1 +
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/src/backend/optimizer/path/costsize.c 
b/src/backend/optimizer/path/costsize.c
index e1523d15df..9c08f6b80c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1984,6 +1984,50 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
*run_cost = cpu_operator_cost * tuples;
 }
 
+static Expr *
+identify_sort_expression(PlannerInfo *root, PathKey *key, Relids relids)
+{
+   EquivalenceMember  *candidate = linitial(key->pk_eclass->ec_members);
+   double  ndistinct = -1.0; /* Not defined */
+
+   if (root == NULL || list_length(key->pk_eclass->ec_members) == 1)
+   /* Fast path */
+   return candidate->em_expr;
+
+   foreach_node(EquivalenceMember, em, key->pk_eclass->ec_members)
+   {
+   VariableStatDatavardata;
+   boolisdefault = true;
+   double  ndist;
+
+   /* Sorting over single partition? */
+   if (em->em_is_child && bms_num_members(relids) > 1)
+   continue;
+
+   if (!bms_is_subset(em->em_relids, relids))
+   /*
+* Avoid expressions not based on a table column. At 
least, the
+* candidate value already initialised as the first EC 
member.
+*/
+   continue;
+
+   /* Let's check candidate's ndistinct value */
+   examine_variable(root, (Node *) em->em_expr, 0, &vardata);
+   if (HeapTupleIsValid(vardata.statsTuple))
+   ndist = get_variable_numdistinct(&vardata, &isdefault);
+   ReleaseVariableStats(vardata);
+
+   if (ndistinct < 0.0 || (!isdefault && ndist < ndistinct))
+   {
+   candidate = em;
+   ndistinct = ndist;
+   }
+   }
+
+   Assert(candidate != NULL);
+   return candidate->em_expr;
+}
+
 /*
  * cost_incremental_sort
  * Determines and returns the cost of sorting a relation incrementally, 
when
@@ -1999,6 +2043,7 @@ cost_tuplesort(Cost *startup_cost, Cost *run_cost,
 void
 cost_incremental_sort(Path *path,
  PlannerInfo *root, List *pathkeys, 
int presorted_keys,
+ Relids relids,
  int input_disabled_nodes,
  Cost input_startup_cost, Cost 
input_total_cost,
  double input_tuples, int width, Cost 
comparison_cost, int sort_mem,
@@ -2053,21 +2098,20 @@ cost_incremental_sort(Path *path,
foreach(l, pathkeys)
{
PathKey*key = (PathKey *) lfirst(l);
-   EquivalenceMember *member = (EquivalenceMember *)
-   linitial(key->pk_eclass->ec_members);
+   Expr   *expr = identify_sort_expression(root, key, relids);
 
/*
 * Check if the expression contains V

Re: miscellaneous pg_upgrade cleanup

2024-09-23 Thread Daniel Gustafsson
> On 17 Sep 2024, at 21:22, Nathan Bossart  wrote:
> 
> Here are a few miscellaneous cleanup patches for pg_upgrade.  I don't think
> there's anything controversial here.

No objections to any of these changes, LGTM.

--
Daniel Gustafsson





Re: Allow logical failover slots to wait on synchronous replication

2024-09-23 Thread Amit Kapila
On Sat, Sep 21, 2024 at 6:34 AM John H  wrote:
>
> On Fri, Sep 20, 2024 at 2:44 AM shveta malik  wrote:
> > > >
> > >
> > > The difference is that the purpose of 'synchronized_standby_slots' is
> > > to mention slot names for which the user expects logical walsenders to
> > > wait before sending the logical changes to subscribers. OTOH,
> > > 'synchronous_standby_names' has a different purpose as well. It is not
> > > clear to me if the users would be interested in syncing failover slots
> > > to all the standbys mentioned in 'synchronous_standby_names'.
> > >
> >
> > Okay, I see your point. But not sure what could be the solution here
> > except documenting. But let me think more.
> >
>
> That's a great find. I didn't consider mixed physical and logical
> replicas in synchronous_standby_names.
> I wonder if there are users running synchronous_standby_names with a
> mix of logical and
> physical replicas and what the use case would be.
>

I am also not aware of actual use cases for mixing physical and
logical synchronous standbys, but as we provide that functionality, we
can't ignore it. BTW, I am also not sure if users would like the slots
to be synced on all the standbys mentioned in
synchronous_standby_names. And even if they are, it is better to have
an explicit way of letting users specify it.

One possible approach is to extend the syntax of
"synchronized_standby_slots" similar to "synchronous_standby_names" so
that users can directly specify slots similarly and avoid waiting for
more than required standbys.

-- 
With Regards,
Amit Kapila.




Re: type cache cleanup improvements

2024-09-23 Thread Alexander Korotkov
On Wed, Sep 18, 2024 at 5:10 PM Andrei Lepikhov  wrote:
> On 13/9/2024 01:38, Alexander Korotkov wrote:
> > I've tried to implement handling of concurrent invalidation similar to
> > commit fdd965d074.  However that appears to be more difficult that I
> > thought, because for some datatypes like arrays, ranges etc we might
> > need fill the element type and reference it.  So, I decided to
> > continue with the current approach but borrowing some ideas from
> > fdd965d074.  The revised patchset attached.
> Let me rephrase the issue in more straightforward terms to ensure we are
> all clear on the problem:
> The critical problem of the typcache lookup on not-yet-locked data is
> that it can lead to an inconsistent state of the TypEntry, potentially
> causing disruptions in the DBMS's operations, correct?
> Let's exemplify this statement. By filling typentry's lt_opr, eq_opr,
> and gt_opr fields, we access the AMOPSTRATEGY cache. One operation can
> successfully fetch data from the cache, but another can miss data and
> touch the catalogue table, causing invalidations. In this case, we can
> get an inconsistent set of operators. Do I understand the problem
> statement correctly?

Actually, I didn't research much whether there is a material problem.  So,
I didn't try to delete some operator class members concurrently
with lookup_type_cache().  There are probably some bugs,
but they likely have low impact in practice, given that type/opclass
changes are very rare.

Instead, I concentrated on how lookup_type_cache() can return a
TypeCacheEntry filled with whatever the caller asked for, given that
there could be concurrent invalidations.

So, my approach was to
1) Document how we currently handle concurrent invalidations.
2) Maintain RelIdToTypeIdCacheHash correctly with concurrent invalidations.

--
Regards,
Alexander Korotkov
Supabase




Re: [PATCH] Proposal to Enable/Disable Index using ALTER INDEX

2024-09-23 Thread Shayon Mukherjee
Hi David,

Thank you so much for the review and pointers. I totally missed expression 
indexes. I am going to do another proper pass along with your feedback and 
follow up with an updated patch and any questions. 

Excited to be learning so much about the internals. 
Shayon

> On Sep 22, 2024, at 6:44 PM, David Rowley  wrote:
> 
> On Mon, 23 Sept 2024 at 05:43, Shayon Mukherjee  wrote:
>> - Modified get_index_paths() and build_index_paths() to exclude disabled
>>  indexes from consideration during query planning.
> 
> There are quite a large number of other places you also need to modify.
> 
> Here are 2 places where the index should be ignored but isn't:
> 
> 1. expression indexes seem to still be used for statistical estimations:
> 
> create table b as select generate_series(1,1000)b;
> create index on b((b%10));
> analyze b;
> explain select distinct b%10 from b;
> -- HashAggregate  (cost=23.00..23.12 rows=10 width=4)
> 
> alter index b_expr_idx disable;
> explain select distinct b%10 from b;
> -- HashAggregate  (cost=23.00..23.12 rows=10 width=4) <-- should be 1000 rows
> 
> drop index b_expr_idx;
> explain select distinct b%10 from b;
> -- HashAggregate  (cost=23.00..35.50 rows=1000 width=4)
> 
> 2. Indexes seem to still be used for join removals.
> 
> create table c (c int primary key);
> explain select c1.* from c c1 left join c c2 on c1.c=c2.c; --
> correctly removes join.
> alter index c_pkey disable;
> explain select c1.* from c c1 left join c c2 on c1.c=c2.c; -- should
> not remove join.
> 
> Please carefully look over all places that RelOptInfo.indexlist is
> looked at and consider skipping disabled indexes. Please also take
> time to find SQL that exercises each of those places so you can verify
> that the behaviour is correct after your change. This is also a good
> way to learn exactly all cases where indexes are used. Using this
> method would have led you to find places like
> rel_supports_distinctness(), where you should be skipping disabled
> indexes.
> 
> The planner should not be making use of disabled indexes for any
> optimisations at all.
> 
>> - catversion.h is updated with a new CATALOG_VERSION_NO to reflect change in 
>> pg_index
>>  schema.
> 
> Please leave that up to the committer. Patch authors doing this just
> results in the patch no longer applying as soon as someone commits a
> version bump.
> 
> Also, please get rid of these notices. The command tag serves that
> purpose. It's not interesting that the index is already disabled.
> 
> # alter index a_pkey disable;
> NOTICE:  index "a_pkey" is now disabled
> ALTER INDEX
> # alter index a_pkey disable;
> NOTICE:  index "a_pkey" is already disabled
> ALTER INDEX
> 
> I've only given the code a very quick glance. I don't quite understand
> why you're checking the index is enabled in create_index_paths() and
> get_index_paths(). I think the check should be done only in
> create_index_paths().  Primarily, you'll find code such as "if
> (index->indpred != NIL && !index->predOK)" in the locations you need
> to consider skipping the disabled index. I think your new code should
> be located very close to those places or perhaps within the same if
> condition unless it makes it overly complex for the human reader.
> 
> I think the documents should also mention that disabling an index is a
> useful way to verify an index is not being used before dropping it as
> the index can be enabled again at the first sign that performance has
> been affected. (It might also be good to mention that checking
> pg_stat_user_indexes.idx_scan should be the first port of call when
> checking for unused indexes)
> 
> David
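
To keep the above points in one place, here is a self-contained sketch of the
kind of filtering every indexlist consumer would need (the struct and the
"enabled" flag are simplified stand-ins of mine, not PostgreSQL's real
IndexOptInfo):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct IndexSketch
{
    const char *name;
    bool        enabled;    /* assumed flag backed by the proposed pg_index column */
} IndexSketch;

static void
consider_indexes(const IndexSketch *indexes, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        /* Disabled indexes must not feed any optimization at all:
         * index paths, statistics estimates, join removal, ... */
        if (!indexes[i].enabled)
            continue;
        printf("planning with index %s\n", indexes[i].name);
    }
}

int
main(void)
{
    IndexSketch idx[] = {{"b_expr_idx", false}, {"c_pkey", true}};

    consider_indexes(idx, sizeof idx / sizeof idx[0]);
    return 0;
}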





Re: Large expressions in indexes can't be stored (non-TOASTable)

2024-09-23 Thread Nathan Bossart
On Mon, Sep 23, 2024 at 04:00:00PM +0300, Alexander Lakhin wrote:
> I tested it with two code modifications (1st is to make each created
> expression index TOASTed (by prepending 1M of spaces to the indexeprs
> value) and 2nd to make each created index an expression index (by
> modifying index_elem_options in gram.y) - both modifications are kludgy so
> I don't dare to publish them) and found no other snapshot-related issues
> during `make check-world`.

Thanks.  Here is an updated patch with tests and comments.  I've also moved
the calls to PushActiveSnapshot()/PopActiveSnapshot() to surround only the
section of code where the snapshot is needed.  Besides being more similar
in style to other fixes I found, I think this is safer because much of this
code is cautious to avoid deadlocks.  For example, DefineIndex() has the
following comment:

/*
 * The snapshot subsystem could still contain registered snapshots that
 * are holding back our process's advertised xmin; in particular, if
 * default_transaction_isolation = serializable, there is a transaction
 * snapshot that is still active.  The CatalogSnapshot is likewise a
 * hazard.  To ensure no deadlocks, we must commit and start yet another
 * transaction, and do our wait before any snapshot has been taken in 
it.
 */

I carefully inspected all the code paths this patch touches, and I think
I've got all the details right, but I would be grateful if someone else
could take a look.

-- 
nathan
From 3be62fac910e930f5635193ac60d1536b17e0d6d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 23 Sep 2024 10:33:58 -0500
Subject: [PATCH v3 1/1] Ensure we have a snapshot when updating pg_index
 entries.

Creating, reindexing, and dropping an index concurrently could
entail accessing pg_index's TOAST table, which was recently added
in commit b52c4fc3c0.  These code paths start and commit their own
transactions internally, but they do not always set an active
snapshot.  This rightfully leads to assertion failures and ERRORs
when trying to access pg_index's TOAST table, such as the
following:

ERROR:  cannot fetch toast data without an active snapshot

To fix, push an active snapshot just before each section of code
that might require accessing pg_index's TOAST table, and pop it
shortly afterwards.

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/a97d7401-e7c9-f771-6a00-037379f0a8bb%40gmail.com
---
 src/backend/catalog/index.c| 21 +
 src/backend/commands/indexcmds.c   | 24 
 src/test/regress/expected/indexing.out | 15 +++
 src/test/regress/sql/indexing.sql  | 16 
 4 files changed, 76 insertions(+)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b2b3ecb524..09d79b26b8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2276,9 +2276,17 @@ index_drop(Oid indexId, bool concurrent, bool 
concurrent_lock_mode)
 */
WaitForLockers(heaplocktag, AccessExclusiveLock, true);
 
+   /*
+* Updating pg_index might involve TOAST table access, so 
ensure we
+* have a valid snapshot.
+*/
+   PushActiveSnapshot(GetTransactionSnapshot());
+
/* Finish invalidation of index and mark it as dead */
index_concurrently_set_dead(heapId, indexId);
 
+   PopActiveSnapshot();
+
/*
 * Again, commit the transaction to make the pg_index update 
visible
 * to other sessions.
@@ -2326,6 +2334,16 @@ index_drop(Oid indexId, bool concurrent, bool 
concurrent_lock_mode)
 
RelationForgetRelation(indexId);
 
+   /*
+* Updating pg_index might involve TOAST table access, so ensure we 
have a
+* valid snapshot.  We only expect to get here without an active 
snapshot
+* in the concurrent path.
+*/
+   if (concurrent)
+   PushActiveSnapshot(GetTransactionSnapshot());
+   else
+   Assert(ActiveSnapshotSet());
+
/*
 * fix INDEX relation, and check for expressional index
 */
@@ -2343,6 +2361,9 @@ index_drop(Oid indexId, bool concurrent, bool 
concurrent_lock_mode)
ReleaseSysCache(tuple);
table_close(indexRelation, RowExclusiveLock);
 
+   if (concurrent)
+   PopActiveSnapshot();
+
/*
 * if it has any expression columns, we might have stored statistics 
about
 * them.
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f99c2d2dee..7d41cb73ab 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1798,11 +1798,19 @@ DefineIndex(Oid tableId,
 
PROGRESS_CREATEIDX_PHASE_WAIT_3);
WaitForOlderSnapshot

Re: restrict_nonsystem_relation_kind led to regression (kinda)

2024-09-23 Thread Alvaro Herrera
Hi Magnus,

On 2024-Sep-23, Magnus Holmgren wrote:

> We've set allow_system_table_mods = on so that we could rename 
> pg_database and in its place put a custom view that only lists the
> databases the current user has CONNECT privileges to. This is because
> 1) we allow customers direct (read only) access to their databases, but
> 2) we don't want them to see the other customers, and 3) restricting
> access to pg_database altogether leads to the GUIs the customers use
> spamming error messages because they expect pg_database to be readable,
> and that makes the customers (or their consultants) annoyed.

Your use case and problem seem to match bug report #18604 almost
exactly:
https://postgr.es/m/18604-04d64b68e981c...@postgresql.org

I suggest to read that discussion, as it contains useful information.
As I understand, you're only really safe (not just theatrically safe) by
giving each customer a separate Postgres instance.

Regards

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




restrict_nonsystem_relation_kind led to regression (kinda)

2024-09-23 Thread Magnus Holmgren
Hello,

We've set allow_system_table_mods = on so that we could rename 
pg_database and in its place put a custom view that only lists the
databases the current user has CONNECT privileges to. This is because
1) we allow customers direct (read only) access to their databases, but
2) we don't want them to see the other customers, and 3) restricting
access to pg_database altogether leads to the GUIs the customers use
spamming error messages because they expect pg_database to be readable,
and that makes the customers (or their consultants) annoyed.

A problem arose after the fix for CVE-2024-7348, because certain
queries that pg_dump runs use pg_database, and those are now blocked,
so pg_dump fails. Well, actually, it's just subscriptions that are the
problem when it comes to pg_dump: pg_dump --no-subscriptions works in
our case. However, pg_dumpall runs a different query that also uses
pg_database and that I don't think is possible to avoid.

I realise that if you use allow_system_table_mods, you're kinda on your
own, but it exists after all, and this security fix seems to make it
less usable, if not unusable.

Could views owned by postgres and/or in the pg_catalog namespace be
considered system relations, even if customized? There's no way to
suppress the use of restrict_nonsystem_relation_kind if you know that
there are no untrusted users with object creation privileges, is there?

Alternatively, do you have any other suggestions as to how to solve the
original problem (we'd like to avoid renaming the databases so they
don't reveal the customer names)?

-- 
Greetings,
Magnus Holmgren

Milient Software | www.milientsoftware.com




Re: Fixing backslash dot for COPY FROM...CSV

2024-09-23 Thread Artur Zakirov
On Wed, 31 Jul 2024 at 15:42, Daniel Verite  wrote:
>
> Sutou Kouhei wrote:
>
> > BTW, here is a diff after pgindent:
>
> PFA a v5 with the cosmetic changes applied.

Thank you Daniel for working on it. I've tested the patch and it seems
it works as expected.

I have a couple of minor comments.

It seems it isn't necessary to handle "\." within
"CopyAttributeOutCSV()" (file "src/backend/commands/copyto.c")
anymore.

/*
 * Because '\.' can be a data value, quote it if it appears alone on a
 * line so it is not interpreted as the end-of-data marker.
 */
if (single_attr && strcmp(ptr, "\\.") == 0)
use_quote = true;

You might see the difference in the test "copy2.sql". Without changes
the output is:

=# COPY testeoc TO stdout CSV;
a\.
\.b
c\.d
"\."

Another thing is that the comparison "copystream ==
pset.cur_cmd_source" happens twice within "handleCopyIn()". TBH it is
a bit confusing to me what is the real purpose of that check, but one
of the comparisons looks unnecessary.

-- 
Kind regards,
Artur
Supabase




Re: Compress ReorderBuffer spill files using LZ4

2024-09-23 Thread Tomas Vondra
Hi,

I've spent a bit more time on this, mostly running tests to get a better
idea of the practical benefits.

Firstly, I think there's a bug in ReorderBufferCompress() - it's legal
for pglz_compress() to return -1. This can happen if the data is not
compressible, and would not fit into the output buffer. The code can't
just do elog(ERROR) in this case, it needs to handle that by storing the
raw data. The attached fixup patch makes this work for me - I'm not
claiming this is the best way to handle this, but it works.

FWIW I find it strange the tests included in the patch did not trigger
this. That probably means the tests are not quite sufficient.


Now, to the testing. Attached are two scripts, testing different cases:

test-columns.sh - Table with a variable number of 'float8' columns.

test-toast.sh - Table with a single text column.

The script always sets up a publication/subscription on two instances,
generates certain amount of data (~1GB for columns, ~3.2GB for TOAST),
waits for it to be replicated to the replica, and measures how much data
was spilled to disk with the different compression methods (off, pglz
and lz4). There's a couple more metrics, but that's irrelevant here.

For the "column" test, it looks like this (this is in MB):

  rows  columns  distribution    off   pglz    lz4
  ==================================================
    10     1000  compressible    778     20      9
                 random          778    778     16
  --------------------------------------------------
   100      100  compressible    916    116     62
                 random          916    916     67

It's very clear that for the "compressible" data (which just copies the
same value into all columns), both pglz and lz4 can significantly reduce
the amount of data. For 1000 columns it's 780MB -> 20MB/9MB, for 100
columns it's a bit less efficient, but still good.

For the "random" data (where every column gets a random value, but rows
are copied), it's a very different story - pglz does not help at all,
while lz4 still massively reduces the amount of spilled data.

I think the explanation is very simple - for pglz, we compress each row
on its own, there's no concept of streaming/context. If a row is
compressible, it works fine, but when the row gets random, pglz can't
compress it at all. For lz4, this does not matter, because with the
streaming mode it still sees that rows are just repeated, and so can
compress them efficiently.
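
To illustrate the streaming point, here is a tiny standalone sketch using the
liblz4 streaming API (compile with -llz4; the two "rows" are identical buffers
standing in for the copied tuples, and the sizes are made up for illustration):

#include <lz4.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    char    rows[2][256];
    char    dst[LZ4_COMPRESSBOUND(256)];
    LZ4_stream_t *stream = LZ4_createStream();

    /* two identical "rows", like the copied rows in the test above */
    memset(rows[0], 'x', sizeof rows[0]);
    memset(rows[1], 'x', sizeof rows[1]);

    for (int i = 0; i < 2; i++)
    {
        /* in streaming mode the previous row acts as the dictionary, so
         * the second, identical row compresses to just a few bytes */
        int n = LZ4_compress_fast_continue(stream, rows[i], dst,
                                           (int) sizeof rows[i],
                                           (int) sizeof dst, 1);

        printf("row %d compressed to %d bytes\n", i, n);
    }
    LZ4_freeStream(stream);
    return 0;
}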

For TOAST test, the results look like this:

  distribution  repeats  toast     off    pglz     lz4
  =====================================================
  compressible        1  lz4        14       2       1
                          pglz       40       4       3
                    1000  lz4        32      16       9
                          pglz       54      17      10
  -----------------------------------------------------
  random              1  lz4      3305    3305    3157
                          pglz     3305    3305    3157
                    1000  lz4      3166    3162    1580
                          pglz     3334    3326    1745
  -----------------------------------------------------
  random2             1  lz4      3305    3305    3157
                          pglz     3305    3305    3158
                    1000  lz4      3160    3156    3010
                          pglz     3334    3326    3172

The "repeats" value means how long the string is - it's the number of
"md5" hashes added to the string. The number of rows is calculated to
keep the total amount of data the same. The "toast" column tracks what
compression was used for TOAST, I was wondering if it matters.

This time there are three data distributions - compressible means that
each TOAST value is nicely compressible, "random" means each value is
random (not compressible), but the rows are just copy of the same value
(so on the whole there's a lot of redundancy). And "random2" means each
row is random and unique (so not compressible at all).

The table shows that with compressible TOAST values, compressing the
spill file is rather useless. The reason is that ReorderBufferCompress
is handling raw TOAST data, which is already compressed. Yes, it may
further reduce the amount of data, but it's negligible when compared to
the original amount of data.

For the random cases, the spill compression is rather pointless. Yes,
lz4 can reduce it to 1/2 for the shorter strings, but other than that
it's not very useful.

For a while I was thinking this approach is flawed, because it only sees
and compresses changes one by one, and that seeing a batch of changes
would improve this (e.g. we'd see the copied rows). But I realized lz4
already does that (in the streaming mode at least), and yet it does n

Re: Cleaning up ERRCODE usage in our XML code

2024-09-23 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 20 Sep 2024, at 21:00, Tom Lane  wrote:
>> Per cfbot, rebased over d5622acb3.  No functional changes.

> Looking over these I don't see anything mis-characterized so +1 on going ahead
> with these.  It would be neat if we end up translating xml2 errors into XQuery
> Error SQLSTATEs but this is a clear improvement over what we have until then.

Thanks for looking at it!

> There is an ERRCODE_INTERNAL_ERROR in xml_out_internal() which seems a tad odd
> given that any error would be known to be parsing related and b) are caused by
> libxml and not internally.  Not sure if it's worth bothering with but with the
> other ones improved it stood out.

Yeah, I looked at that but wasn't sure what to do with it.  We should
have validated the decl header when the XML value was created, so if
we get here then either the value got corrupted on-disk or in-transit,
or something forgot to do that validation, or libxml has changed its
mind since then about what's a valid decl.  At least some of those
cases are probably legitimately called INTERNAL_ERROR.  I thought for
awhile about ERRCODE_DATA_CORRUPTED, but I'm not convinced that's a
better fit.

regards, tom lane




Re: Add llvm version into the version string

2024-09-23 Thread Dmitry Dolgov
> On Sun, Sep 22, 2024 at 01:15:54PM GMT, Dmitry Dolgov wrote:
> > On Sat, Sep 21, 2024 at 05:25:30PM GMT, Tom Lane wrote:
> >
> > Is there a way to get the llvm library's version at run time?  If so
> > we could consider building a new function to return that.
>
> Yes, there is a C api LLVMGetVersion to get the major, minor and patch
> numbers. The jit provider could be extended to return this information.
>
> About a new function, I think that the llvm runtime version has to be
> shown in some form by pgsql_version. The idea is to skip an email
> exchange like "here is the bug report" -> "can you attach the llvm
> version?". If it's going to be a new separate function, I guess it won't
> make much difference to ask either to call the new function or the
> llvm-config, right?

Here is what I had in mind. It turns out the LLVMGetVersion API is
available only starting from LLVM 16, so there is a need to indicate whether
the version string is available (otherwise llvmjit would have to fully
format the version string).
>From dca9347c39cfca45de969ee49536962836ccbba4 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Mon, 23 Sep 2024 19:08:52 +0200
Subject: [PATCH v2] Add jit provider's version into the pgsql_version

If jit provider is available and initialized, add its version string
into the pgsql_version output. This should help to get more relevant
information in bug reports about JIT.
---
 src/backend/jit/jit.c   | 13 +
 src/backend/jit/llvm/llvmjit.c  | 17 +
 src/backend/utils/adt/version.c | 15 ++-
 src/include/jit/jit.h   |  8 
 src/include/jit/llvmjit.h   |  2 ++
 5 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/src/backend/jit/jit.c b/src/backend/jit/jit.c
index 815b58f..b62422c 100644
--- a/src/backend/jit/jit.c
+++ b/src/backend/jit/jit.c
@@ -188,3 +188,16 @@ InstrJitAgg(JitInstrumentation *dst, JitInstrumentation 
*add)
INSTR_TIME_ADD(dst->optimization_counter, add->optimization_counter);
INSTR_TIME_ADD(dst->emission_counter, add->emission_counter);
 }
+
+/*
+ * Return JIT provider's version string for troubleshooting purposes.
+ */
+const char *
+jit_get_version(bool *available)
+{
+   if (provider_init())
+   return provider.get_version(available);
+
+   *available = false;
+   return "";
+}
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 0f6cec5..3ab1080 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -165,6 +165,7 @@ _PG_jit_provider_init(JitProviderCallbacks *cb)
cb->reset_after_error = llvm_reset_after_error;
cb->release_context = llvm_release_context;
cb->compile_expr = llvm_compile_expr;
+   cb->get_version = llvm_version;
 }
 
 
@@ -1382,3 +1383,19 @@ ResOwnerReleaseJitContext(Datum res)
context->resowner = NULL;
jit_release_context(&context->base);
 }
+
+const char *
+llvm_version(bool *available)
+{
+#if LLVM_VERSION_MAJOR > 15
+   unsigned int major, minor, patch;
+
+   LLVMGetVersion(&major, &minor, &patch);
+
+   *available = true;
+   return (const char*) psprintf("llvm: %d.%d.%d", major, minor, patch);
+#else
+   *available = false;
+   return "";
+#endif
+}
diff --git a/src/backend/utils/adt/version.c b/src/backend/utils/adt/version.c
index 28db1f6..fabaf73 100644
--- a/src/backend/utils/adt/version.c
+++ b/src/backend/utils/adt/version.c
@@ -15,10 +15,23 @@
 #include "postgres.h"
 
 #include "utils/builtins.h"
+#include "jit/jit.h"
 
 
 Datum
 pgsql_version(PG_FUNCTION_ARGS)
 {
-   PG_RETURN_TEXT_P(cstring_to_text(PG_VERSION_STR));
+   bool jit_available = false;
+   const char *jit_version = jit_get_version(&jit_available);
+
+   /* Add jit provider's version string if available. */
+   if (jit_available)
+   {
+   PG_RETURN_TEXT_P(cstring_to_text(psprintf("%s, %s", 
PG_VERSION_STR,
+   
  jit_version)));
+   }
+   else
+   {
+   PG_RETURN_TEXT_P(cstring_to_text(PG_VERSION_STR));
+   }
 }
diff --git a/src/include/jit/jit.h b/src/include/jit/jit.h
index d9a080c..35d5a62 100644
--- a/src/include/jit/jit.h
+++ b/src/include/jit/jit.h
@@ -70,12 +70,14 @@ typedef void (*JitProviderResetAfterErrorCB) (void);
 typedef void (*JitProviderReleaseContextCB) (JitContext *context);
 struct ExprState;
 typedef bool (*JitProviderCompileExprCB) (struct ExprState *state);
+typedef const char* (*JitProviderVersion) (bool *available);
 
 struct JitProviderCallbacks
 {
JitProviderResetAfterErrorCB reset_after_error;
JitProviderReleaseContextCB release_context;
JitProviderCompileExprCB compile_expr;
+   JitProviderVersion get_version;
 };
 
 
@@ -102,5 +104,11 @@ extern void jit_release_context(JitContext *context);
 extern bool jit_compile_exp

Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-23 Thread David Rowley
On Mon, 23 Sept 2024 at 18:28, Tatsuo Ishii  wrote:
> I agree and made necessary changes. See attached v4 patches.

Looks good to me.

David




Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2024-09-23 Thread Jakub Wartak
Good morning!

FYI: I know many people are/were tracking this email thread rather
than the newer one, "scalability bottlenecks with (many) partitions
(and more)", but please see [1][2], where Tomas committed enhanced
fast-path locking to master (18).

Thanks Tomas for persistence on this!

-J.

[1] - 
https://www.postgresql.org/message-id/E1ss4gX-000IvX-63%40gemulon.postgresql.org
[2] - 
https://www.postgresql.org/message-id/7c1eeafb-2375-4ff6-8469-0640d52d44ed%40vondra.me




Re: PostgreSQL 17 RC1 & GA dates

2024-09-23 Thread Sandeep Thakkar
Hi,

I would like to know if the PostgreSQL 17 GA is happening this week or is
there a change in dates?

On Fri, Aug 23, 2024 at 7:44 PM Jonathan S. Katz 
wrote:

> Hi,
>
> The date for PostgreSQL 17 Release Candidate 1 (RC1) is September 5,
> 2024. Please ensure all open items[1] are completed and committed before
> August 31, 2024 12:00 UTC to allow enough time for them to clear the
> buildfarm.
>
> The current target date for the PostgreSQL 17 GA release is September
> 26, 2024. While this date could change if the release team decides the
> candidate release is not ready, please plan for this date to be the GA
> release.
>
> Thanks,
>
> Jonathan
>
> [1] https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
>


-- 
Sandeep Thakkar


Re: ANALYZE ONLY

2024-09-23 Thread jian he
On Mon, Sep 23, 2024 at 12:46 PM David Rowley  wrote:
>
> On Mon, 23 Sept 2024 at 15:29, jian he  wrote:
> > Given the above context, I am still confused with this sentence in
> > sql-analyze.html.
> > "If ONLY is specified before the table name, only that table is analyzed."
> >
> > like in the above sql example, only_inh_parent's child is also being 
> > analyzed.
>
> I guess it depends what you define "analyzed" to mean. In this
> context, it means gathering statistics specifically for a certain
> table.
>
> Would it be more clear if "only that table is analyzed." was changed
> to "then statistics are only gathered specifically for that table."?
>

Looking at expand_vacuum_rel and analyze_rel: if we change

    if (onerel->rd_rel->relhassubclass)
        do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages,
                       true, in_outer_xact, elevel);

to

    if (onerel->rd_rel->relhassubclass &&
        ((!relation || relation->inh) ||
         onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE))
        do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages,
                       true, in_outer_xact, elevel);


then the inheritance table will behave the way the doc says.

for example:
drop table if exists only_inh_parent,only_inh_child;
CREATE TABLE only_inh_parent (a int , b INT) with (autovacuum_enabled = false);
CREATE TABLE only_inh_child () INHERITS (only_inh_parent) with
(autovacuum_enabled = false);
INSERT INTO only_inh_child(a,b) select g % 80, (g + 1) % 200 from
generate_series(1,1000) g;
ANALYZE ONLY only_inh_parent;
select stadistinct,starelid::regclass,staattnum, stainherit
from  pg_statistic
where starelid = ANY ('{only_inh_parent, only_inh_child}'::regclass[]);

will return zero rows, since the physical table only_inh_parent has no storage.




sql-analyze.html
For partitioned tables, ANALYZE gathers statistics by sampling rows
from all partitions. Each leaf partition is also recursively analyzed
and the statistics updated. This recursive part may be disabled by
using the ONLY keyword, otherwise, leaf partitions are analyzed only
once, even with multi-level partitioning. No statistics are collected
for only the parent table (without data from its partitions), because
with partitioning it's guaranteed to be empty.

allow me to ask anenglish language question.
here "otherwise" means  specify ONLY or not?
As far as i understand.
if you specify ONLY, postgres will only do "For partitioned tables,
ANALYZE gathers statistics by sampling rows from all partitions"
if you not specify ONLY, postgres will do  "For partitioned tables,
ANALYZE gathers statistics by sampling rows from all partitions *AND*
also recursively analyze each leaf partition"

Is my understanding correct?


I think there is a whitespace error in "ANALYZE ONLY vacparted(a,b); "
in vacuum.out.




Re: Why don't we consider explicit Incremental Sort?

2024-09-23 Thread Tomas Vondra
On 9/23/24 05:03, Richard Guo wrote:
> On Sun, Sep 22, 2024 at 1:38 PM David Rowley  wrote:
>> Just looking at the commit message:
>>
>>> The rationale is based on the assumption that incremental sort is
>>> always faster than full sort when there are presorted keys, a premise
>>> that has been applied in various parts of the code.  This assumption
>>> does not always hold, particularly in cases with a large skew in the
>>> number of rows within the presorted groups.
>>
>> My understanding is that the worst case for incremental sort is the
>> same as sort, i.e. only 1 presorted group, which is the same effort to
>> sort. Is there something I'm missing?
> 
> I was thinking that when there’s a large skew in the number of rows
> per pre-sorted group, incremental sort might underperform full sort.
> As mentioned in the comments for cost_incremental_sort, it has to
> detect the sort groups, plus it needs to perform tuplesort_reset after
> each group.  However, I doubt these factors would have a substantial
> impact on the performance of incremental sort.  So maybe in the worst
> skewed groups case, incremental sort may still perform similarly to
> full sort.
> 
> Another less obvious reason is that in cases of skewed groups, we may
> significantly underestimate the cost of incremental sort.  This could
> result in choosing a more expensive subpath under the sort.  Such as
> the example in [1], we end up with IndexScan->IncrementalSort rather
> than SeqScan->FullSort, while the former plan is more expensive to
> execute.  However, this point does not affect this patch, because for
> a mergejoin, we only consider outerrel's cheapest-total-cost when we
> assume that an explicit sort atop is needed, i.e., we do not have a
> chance to select which subpath to use in this case.
> 
> [1] 
> https://postgr.es/m/CAMbWs49+CXsrgbq0LD1at-5jR=ahhn0ytdy9yvgxasmfndz...@mail.gmail.com
> 

You don't need any skew. Consider this perfectly uniform dataset:

create table t (a int, b int);
insert into t select 10 * random(), 100 * random()
  from generate_series(1,100) s(i);
create index on t (a);
vacuum analyze;
checkpoint;

explain analyze select * from t order by a, b;

QUERY PLAN
-
 Sort  (cost=127757.34..130257.34 rows=100 width=8)
   (actual time=186.288..275.813 rows=100 loops=1)
   Sort Key: a, b
   Sort Method: external merge  Disk: 17704kB
   ->  Seq Scan on t  (cost=0.00..14425.00 rows=100 width=8)
(actual time=0.005..35.989 rows=100 loops=1)
 Planning Time: 0.075 ms
 Execution Time: 301.143 ms
(6 rows)


set enable_incremental_sort = on;
explain analyze select * from t order by a, b;

QUERY PLAN
-
 Incremental Sort  (cost=1.03..68908.13 rows=100 width=8)
(actual time=0.102..497.143 rows=100 loops=1)
   Sort Key: a, b
   Presorted Key: a
   Full-sort Groups: 27039  Sort Method: quicksort
 Average Memory: 25kB  Peak Memory: 25kB
   ->  Index Scan using t_a_idx on t
   (cost=0.42..37244.25 rows=100 width=8)
(actual time=0.050..376.403 rows=100 loops=1)
 Planning Time: 0.057 ms
 Execution Time: 519.301 ms
(7 rows)

Sure, the table is very small, but the example is not crazy. In fact,
this is the "nicest" example for estimation - it's perfectly random, no
correlations, no skew.


regards

-- 
Tomas Vondra




Re: miscellaneous pg_upgrade cleanup

2024-09-23 Thread Nathan Bossart
On Mon, Sep 23, 2024 at 03:04:22PM +0200, Daniel Gustafsson wrote:
> No objections to any of these changes, LGTM.

Thanks for reviewing.  I'll commit these once the v17 release freeze is
over (since 0001 needs to be back-patched there).

-- 
nathan




Re: scalability bottlenecks with (many) partitions (and more)

2024-09-23 Thread Tom Lane
Tomas Vondra  writes:
> Thanks. Pushed a fix for these issues, hopefully coverity will be happy.

Thanks.

> BTW is the coverity report accessible somewhere? I know someone
> mentioned that in the past, but I don't recall the details. Maybe we
> should have a list of all these resources, useful for committers,
> somewhere on the wiki?

Currently those reports only go to the security team.  Perhaps
we should rethink that?

regards, tom lane




Re: Incremental Sort Cost Estimation Instability

2024-09-23 Thread Andrei Lepikhov

On 12/9/2024 16:57, Tomas Vondra wrote:

On 9/12/24 12:12, David Rowley wrote:

On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov  wrote:

 but doesn't estimate_num_groups()
supposed to already do that? The comment says:

  * 3.  If the list contains Vars of different relations that are known equal
  * due to equivalence classes, then drop all but one of the Vars from each
  * known-equal set, keeping the one with smallest estimated # of values
  * (since the extra values of the others can't appear in joined rows).
  * Note the reason we only consider Vars of different relations is that
  * if we considered ones of the same rel, we'd be double-counting the
  * restriction selectivity of the equality in the next step.

I haven't debugged this, but how come this doesn't do the trick?

I've got your point now.
Unfortunately, that comment only says that estimate_num_groups removes
duplicates from the list of grouping expressions (see
exprs_known_equal). It doesn't look through em_members to find the
best-fitting clause for each grouping position.


--
regards, Andrei Lepikhov





Re: Increase of maintenance_work_mem limit in 64-bit Windows

2024-09-23 Thread Vladlen Popolitov

David Rowley wrote 2024-09-23 15:35:
> On Mon, 23 Sept 2024 at 21:01, Vladlen Popolitov
>  wrote:
>> Thank you for the proposal, I looked at the patch and source code from this
>> point of view. In this approach we need to change all of these variables.
>> I counted the appearances of these vars in the code:
>> maintenance_work_mem appears 63 times in 20 files
>> work_mem appears 113 times in 48 files
>> logical_decoding_work_mem appears 10 times in 2 files
>> max_stack_depth appears 11 times in 3 files
>> wal_keep_size_mb appears 5 times in 3 files
>> min_wal_size_mb appears 5 times in 2 files
>> max_wal_size_mb appears 10 times in 2 files
>> wal_skip_threshold appears 5 times in 2 files
>> max_slot_wal_keep_size_mb appears 6 times in 3 files
>> wal_sender_timeout appears 23 times in 3 files
>> autovacuum_work_mem appears 11 times in 4 files
>> gin_pending_list_limit appears 8 times in 5 files
>> pendingListCleanupSize appears 2 times in 2 files
>> GinGetPendingListCleanupSize appears 2 times in 2 files
>
> Why do you think all of these appearances matter? I imagined all you
> care about are when the values are multiplied by 1024.

A common pattern in the code is to assign the GUC value to a local
variable, send that local variable as a parameter to a function, then to
a nested function, and somewhere deep multiply the function parameter by
1024. That is why I needed to check all appearances; most of them are
correct.

>> If I check the rest of the variables, the patch does not need
>> MAX_SIZE_T_KILOBYTES constant (I introduced it for variables, that are
>> already checked and fixed), it will contain only fixes in the types of
>> the variables and the constants.
>> It requires a lot of time to check all appearances and neighbour
>> code, but final patch will not be large, I do not expect a lot of
>> "long" in the rest of the code (only 4 case out of 63 needed to fix
>> for maintenance_work_mem).
>> What do you think about this approach?
>
> I don't think you can do maintenance_work_mem without fixing work_mem
> too. I don't think the hacks you've put into RI_Initial_Check() to
> ensure you don't try to set work_mem beyond its allowed range are very
> good. It effectively means that maintenance_work_mem does not do what
> it's meant to for the initial validation of referential integrity
> checks. If you're not planning on fixing work_mem too, would you just
> propose to leave those hacks in there forever?

I agree, it is better to fix them all together. I also do not like this
hack; it will be removed from the patch if I check and change all of the
variables at once.
I think it will take about 1 week to fix and test all the changes. I will
estimate the total volume of the changes and think about how to group
them in the patch (I hope it will be only one patch).
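
Coming back to the "multiply by 1024" pattern: here is a minimal
standalone sketch (my own illustration, not code from the patch) of why
the intermediate type matters on 64-bit Windows, where long is only 32
bits wide:

#include <stdio.h>
#include <stddef.h>

int
main(void)
{
    int     guc_kb = 3 * 1024 * 1024;   /* a 3 GB setting, expressed in kB */

    /* can overflow where long is 32 bits (LLP64, i.e. 64-bit Windows) */
    long    bytes_long = (long) guc_kb * 1024L;

    /* size_t (Size in PostgreSQL) is 64 bits there, so this is fine */
    size_t  bytes_size = (size_t) guc_kb * 1024;

    printf("long:   %ld\n", bytes_long);
    printf("size_t: %zu\n", bytes_size);
    return 0;
}

On an LP64 platform both lines print 3221225472; with a 32-bit long the
first computation overflows, which is exactly the class of bug the type
changes are meant to remove.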

--
Best regards,

Vladlen Popolitov.




Re: ANALYZE ONLY

2024-09-23 Thread David Rowley
On Mon, 23 Sept 2024 at 23:23, jian he  wrote:
> looks fine. but maybe we can add the following information
> "if The ONLY is specified, the second set of
> statistics won't include each children individual statistics"
> I think that's the main difference between specifying ONLY or not?

Ok, I think you're not understanding this yet and I'm not sure what I
can make more clear in the documents.

Let me explain... For inheritance parent tables, ANALYZE ONLY will
gather inheritance and non-inheritance statistics for ONLY the parent.

Here's an example of that:

drop table if exists parent,child;
create table parent(a int);
create table child () inherits (parent);
insert into parent values(1);
insert into child values(1);

analyze ONLY parent;
select starelid::regclass,stainherit,stadistinct from pg_statistic
where starelid in ('parent'::regclass,'child'::regclass);
 starelid | stainherit | stadistinct
--++-
 parent   | f  |  -1  <- this is the distinct estimate
for SELECT * FROM ONLY parent;
 parent   | t  |-0.5 <- this is the distinct estimate
for SELECT * FROM parent;
(2 rows)

For the stainherit==false stats, only 1 row is sampled here as that's
the only row directly located in the "parent" table.
For the stainherit==true stats, 2 rows are sampled, both of them have
"a" == 1. The stadistinct reflects that fact.

Note there have been no statistics recorded for "child". However,
analyze did sample rows in that table as part of gathering sample rows
for "parent" for the stainherit==true row.

Now let's try again without ONLY.

analyze parent;

select starelid::regclass,stainherit,stadistinct from pg_statistic
where starelid in ('parent'::regclass,'child'::regclass);
 starelid | stainherit | stadistinct
--++-
 parent   | f  |  -1
 parent   | t  |-0.5
 child| f  |  -1
(3 rows)

All of the above rows were re-calculated with the "analyze parent"
command, the first two rows have the same values as nothing changed in
the table, however, there are now statistics stored for the "child"
table.

> catalog-pg-statistic.html second paragraph seems very clear to me.
> Maybe we can link it somehow

I don't know what "link it" means in this context.

David




Re: Pgoutput not capturing the generated columns

2024-09-23 Thread vignesh C
On Thu, 12 Sept 2024 at 11:01, Peter Smith  wrote:
>
> Because this feature is now being implemented as a PUBLICATION option,
> there is another scenario that might need consideration; I am thinking
> about where the same table is published by multiple PUBLICATIONS (with
> different option settings) that are subscribed by a single
> SUBSCRIPTION.
>
> e.g.1
> -
> CREATE PUBLICATION pub1 FOR TABLE t1 WITH (publish_generated_columns = true);
> CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = false);
> CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> -
>
> e.g.2
> -
> CREATE PUBLICATION pub1 FOR ALL TABLES WITH (publish_generated_columns = 
> true);
> CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = false);
> CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> -
>
> Do you know if this case is supported? If yes, then which publication
> option value wins?

I have verified the various scenarios discussed here and the patch
works as expected:
Test presetup:
-- publisher
CREATE TABLE t1 (a int PRIMARY KEY, b int, c int, gen1 int GENERATED
ALWAYS AS (a * 2) STORED, gen2 int GENERATED ALWAYS AS (a * 2)
STORED);
-- Subscriber
CREATE TABLE t1 (a int PRIMARY KEY, b int, c int, d int, e int);

Test1: Subscriber will have only non-generated columns a,b,c
replicated from publisher:
create publication pub1 for all tables with (
publish_generated_columns = false);
INSERT INTO t1 (a,b,c) VALUES (1,1,1);

--Subscriber will have only non-generated columns a,b,c replicated
from publisher:
subscriber=# select * from t1;
 a | b | c | d | e
---+---+---+---+---
 1 | 1 | 1 |   |
(1 row)

Test2: Subscriber will have the generated column values replicated from
the publisher, in addition to a,b,c:
create publication pub1 for all tables with ( publish_generated_columns = true);
INSERT INTO t1 (a,b,c) VALUES (1,1,1);

-- Subscriber has the generated column values replicated from the publisher, in addition to a,b,c:
subscriber=# select * from t1;
 a | b | c | d | e
---+---+---+---+---
 1 | 1 | 1 | 2 | 2
(1 row)

Test3: Cannot have subscription subscribing to publication with
publish_generated_columns as true and false
-- publisher
create publication pub1 for all tables with (publish_generated_columns = false);
create publication pub2 for all tables with (publish_generated_columns = true);

-- subscriber
subscriber=# create subscription sub1 connection 'dbname=postgres
host=localhost port=5432' publication pub1,pub2;
ERROR:  cannot use different column lists for table "public.t1" in
different publications

Test4a: Warning thrown when a generated column is specified in column
list along with publish_generated_columns as false
-- publisher
postgres=# create publication pub1 for table t1(a,b,gen1) with (
publish_generated_columns = false);
WARNING:  specified generated column "gen1" in publication column list
for publication with publish_generated_columns as false
CREATE PUBLICATION

Regards,
Vignesh




Re: Pgoutput not capturing the generated columns

2024-09-23 Thread vignesh C
On Fri, 20 Sept 2024 at 17:15, Shubham Khanna
 wrote:
>
> On Wed, Sep 11, 2024 at 8:55 AM Peter Smith  wrote:
> >
> > Here are a some more review comments for patch v30-0001.
> >
> > ==
> > src/sgml/ref/create_publication.sgml
> >
> > 1.
> > + 
> > +  If the publisher-side column is also a generated column
> > then this option
> > +  has no effect; the publisher column will be filled as normal 
> > with the
> > +  publisher-side computed or default data.
> > + 
> >
> > It should say "subscriber-side"; not "publisher-side". The same was
> > already reported by Sawada-San [1].
> >
> > ~~~
> >
> > 2.
> > + 
> > + This parameter can only be set true if
> > copy_data is
> > + set to false.
> > + 
> >
> > IMO this limitation should be addressed by patch 0001 like it was
> > already done in the previous patches (e.g. v22-0002). I think
> > Sawada-san suggested the same [1].
> >
> > Anyway, 'copy_data' is not a PUBLICATION option, so the fact it is
> > mentioned like this without any reference to the SUBSCRIPTION seems
> > like a cut/paste error from the previous implementation.
> >
> > ==
> > src/backend/catalog/pg_publication.c
> >
> > 3. pub_collist_validate
> > - if (TupleDescAttr(tupdesc, attnum - 1)->attgenerated)
> > - ereport(ERROR,
> > - errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
> > - errmsg("cannot use generated column \"%s\" in publication column list",
> > -colname));
> > -
> >
> > Instead of just removing this ERROR entirely here, I thought it would
> > be more user-friendly to give a WARNING if the PUBLICATION's explicit
> > column list includes generated cols when the option
> > "publish_generated_columns" is false. This combination doesn't seem
> > like something a user would do intentionally, so just silently
> > ignoring it (like the current patch does) is likely going to give
> > someone unexpected results/grief.
> >
> > ==
> > src/backend/replication/logical/proto.c
> >
> > 4. logicalrep_write_tuple, and logicalrep_write_attrs:
> >
> > - if (att->attisdropped || att->attgenerated)
> > + if (att->attisdropped)
> >   continue;
> >
> > Why aren't you also checking the new PUBLICATION option here and
> > skipping all gencols if the "publish_generated_columns" option is
> > false? Or is the BMS of pgoutput_column_list_init handling this case?
> > Maybe there should be an Assert for this?
> >
> > ==
> > src/backend/replication/pgoutput/pgoutput.c
> >
> > 5. send_relation_and_attrs
> >
> > - if (att->attisdropped || att->attgenerated)
> > + if (att->attisdropped)
> >   continue;
> >
> > Same question as #4.
> >
> > ~~~
> >
> > 6. prepare_all_columns_bms and pgoutput_column_list_init
> >
> > + if (att->attgenerated && !pub->pubgencolumns)
> > + cols = bms_del_member(cols, i + 1);
> >
> > IIUC, the algorithm seems overly tricky filling the BMS with all
> > columns, before straight away conditionally removing the generated
> > columns. Can't it be refactored to assign all the correct columns
> > up-front, to avoid calling bms_del_member()?
> >
> > ==
> > src/bin/pg_dump/pg_dump.c
> >
> > 7. getPublications
> >
> > IIUC, there is lots of missing SQL code here (for all older versions)
> > that should be saying "false AS pubgencolumns".
> > e.g. compare the SQL with how "false AS pubviaroot" is used.
> >
> > ==
> > src/bin/pg_dump/t/002_pg_dump.pl
> >
> > 8. Missing tests?
> >
> > I expected to see a pg_dump test for this new PUBLICATION option.
> >
> > ==
> > src/test/regress/sql/publication.sql
> >
> > 9. Missing tests?
> >
> > How about adding another test case that checks this new option must be
> > "Boolean"?
> >
> > ~~~
> >
> > 10. Missing tests?
> >
> > --- error: generated column "d" can't be in list
> > +-- ok: generated columns can be in the list too
> >  ALTER PUBLICATION testpub_fortable ADD TABLE testpub_tbl5 (a, d);
> > +ALTER PUBLICATION testpub_fortable DROP TABLE testpub_tbl5;
> >
> > (see my earlier comment #3)
> >
> > IMO there should be another test case for a WARNING here if the user
> > attempts to include generated column 'd' in an explicit PUBLICATION
> > column list while the "publish_generated-columns" is false.
> >
> > ==
> > [1]  
> > https://www.postgresql.org/message-id/CAD21AoA-tdTz0G-vri8KM2TXeFU8RCDsOpBXUBCgwkfokF7%3DjA%40mail.gmail.com
> >
>
> I have fixed all the comments. The attached patches contain the desired 
> changes.
> Also the merging of 0001 and 0002 can be done once there are no
> comments on the patch to help in reviewing.

The warning message appears to be incorrect. Even though
publish_generated_columns is set to true, the warning indicates that
it is false.
CREATE TABLE t1 (a int, gen1 int GENERATED ALWAYS AS (a * 2) STORED);
postgres=# CREATE PUBLICATION pub1 FOR table t1(gen1) WITH
(publish_generated_columns=true);
WARNING:  specified generated column "gen1" in publication column list
for publication with publish_generated_columns as false

Regards,

Re: Increase of maintenance_work_mem limit in 64-bit Windows

2024-09-23 Thread David Rowley
On Mon, 23 Sept 2024 at 21:01, Vladlen Popolitov
 wrote:
> Thank you for the proposal, I looked at the patch and source code from this
> point of view. In this approach we need to change all of these variables.
> I counted the appearances of these vars in the code:
> maintenance_work_mem appears 63 times in 20 files
> work_mem appears 113 times in 48 files
> logical_decoding_work_mem appears 10 times in 2 files
> max_stack_depth appears 11 times in 3 files
> wal_keep_size_mb appears 5 times in 3 files
> min_wal_size_mb appears 5 times in 2 files
> max_wal_size_mb appears 10 times in 2 files
> wal_skip_threshold appears 5 times in 2 files
> max_slot_wal_keep_size_mb appears 6 times in 3 files
> wal_sender_timeout appears 23 times in 3 files
> autovacuum_work_mem appears 11 times in 4 files
> gin_pending_list_limit appears 8 times in 5 files
> pendingListCleanupSize appears 2 times in 2 files
> GinGetPendingListCleanupSize appears 2 times in 2 files

Why do you think all of these appearances matter? I imagined all you
care about are when the values are multiplied by 1024.

> If I check the rest of the variables, the patch does not need
> MAX_SIZE_T_KILOBYTES constant (I introduced it for variables, that are
> already checked and fixed), it will contain only fixes in the types of
> the variables and the constants.
> It requires a lot of time to check all appearances and neighbour
> code, but final patch will not be large, I do not expect a lot of
> "long" in the rest of the code (only 4 case out of 63 needed to fix
> for maintenance_work_mem).
> What do you think about this approach?

I don't think you can do maintenance_work_mem without fixing work_mem
too. I don't think the hacks you've put into RI_Initial_Check() to
ensure you don't try to set work_mem beyond its allowed range are very
good. It effectively means that maintenance_work_mem does not do what
it's meant to for the initial validation of referential integrity
checks. If you're not planning on fixing work_mem too, would you just
propose to leave those hacks in there forever?

David




Re: Incremental Sort Cost Estimation Instability

2024-09-23 Thread Andrei Lepikhov

On 12/9/2024 16:57, Tomas Vondra wrote:

On 9/12/24 12:12, David Rowley wrote:

On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov  wrote:

I'm not sure how to fix this, but it seems estimate_num_groups() needs
to do things differently. And I agree looking for the minimum ndistinct
seems like the right approach. but doesn't estimate_num_groups()
supposed to already do that? The comment says:
I've rewritten the code in the previous email. It looks like we can try
to rewrite estimate_num_groups to do it more effectively, but I haven't
done that yet.
Regarding the tiny change in the cost, my initial reason was to teach
cost_sort to differentiate between sort orderings: begin by considering
the number of columns in the cost estimation, and then consider the
distinct estimation of the first column.
BTW, this was triggered by user reports where a slight change in the
balance between MergeAppend/GatherMerge/Sort/IncrementalSort (or the
column order) could give a significant benefit, especially when grouping
millions of rows on 2-4 bytea columns.


--
regards, Andrei Lepikhov





pg_stat_statements: use spaces to indent in upgrade scripts

2024-09-23 Thread Karina Litskevich
Hi hackers,

I just noticed that all indents in the upgrade scripts of pg_stat_statements are
made using spaces, except for "CREATE FUNCTION pg_stat_statements_reset"
statement in pg_stat_statements--1.6--1.7.sql and
pg_stat_statements--1.10--1.11.sql.

I made a patch to fix it in pg_stat_statements--1.10--1.11.sql if it's not
too late for PostgreSQL 17 yet.

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/
From 2273e72e35fe2d4ff500a2127413121b1709be17 Mon Sep 17 00:00:00 2001
From: Karina Litskevich 
Date: Mon, 23 Sep 2024 16:06:25 +0300
Subject: [PATCH v1] pg_stat_statements: use spaces to indent in upgrade
 scripts

---
 .../pg_stat_statements/pg_stat_statements--1.10--1.11.sql   | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.10--1.11.sql b/contrib/pg_stat_statements/pg_stat_statements--1.10--1.11.sql
index 0bb2c39771..86769184a3 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--1.10--1.11.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.10--1.11.sql
@@ -70,9 +70,9 @@ CREATE VIEW pg_stat_statements AS
 GRANT SELECT ON pg_stat_statements TO PUBLIC;
 
 CREATE FUNCTION pg_stat_statements_reset(IN userid Oid DEFAULT 0,
-	IN dbid Oid DEFAULT 0,
-	IN queryid bigint DEFAULT 0,
-	IN minmax_only boolean DEFAULT false
+IN dbid Oid DEFAULT 0,
+IN queryid bigint DEFAULT 0,
+IN minmax_only boolean DEFAULT false
 )
 RETURNS timestamp with time zone
 AS 'MODULE_PATHNAME', 'pg_stat_statements_reset_1_11'
-- 
2.34.1



Re: meson and check-tests

2024-09-23 Thread Nazir Bilal Yavuz
Hi,

On Mon, 23 Sept 2024 at 11:46, Nazir Bilal Yavuz  wrote:
> On Sat, 21 Sept 2024 at 09:01, jian he  wrote:
> > in [1] you mentioned "setup", but that "setup" is more or less like
> > "meson test  --suite setup --suite regress"
> > but originally, I thought was about "src/test/regress/sql/test_setup.sql".
> > for example, now you cannot run src/test/regress/sql/stats_ext.sql
> > without first running test_setup.sql, because some functions (like fipshash)
> > live in test_setup.sql.
>
> Postgres binaries are created at the build step in the make builds so
> these binaries can be used for the tests. But in the meson builds,
> they are created at the 'meson test  --suite setup' for the testing
> ('meson install' command creates binaries as well but they are not for
> testing, they are for installing binaries to the OS). So, 'meson test
> --suite setup' should be run before running regression tests.

The above sentence lacks some information. It appears that if binaries
are not created beforehand (only running configure, not make), they
are generated during tests in the make builds. This also applies to
meson builds when the meson test command is run, as meson executes
setup suite tests first, which creates the binaries. However, if we
specify a different test suite, like regress in this case, the setup
suite tests are not executed, and the binaries are not created,
preventing the tests from running. I am not sure how to configure
meson builds to run setup suite tests if they are not executed
beforehand.
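
For reference, a sketch of the workaround being described (assuming the
build directory is named "build"):

# run the setup suite first so the test binaries get created ...
meson test -C build --suite setup
# ... then run only the suite you care about
meson test -C build --suite regress

# or both in one invocation
meson test -C build --suite setup --suite regress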

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: PostgreSQL 17 RC1 & GA dates

2024-09-23 Thread David Rowley
On Mon, 23 Sept 2024 at 19:28, Sandeep Thakkar
 wrote:
> I would like to know if the PostgreSQL 17 GA is happening this week or is
> there a change in dates?

Nothing so far has come up to cause the dates to change.

David




Re: POC, WIP: OR-clause support for indexes

2024-09-23 Thread Alexander Korotkov
On Mon, Sep 16, 2024 at 3:44 PM Andrei Lepikhov  wrote:
> On 9/9/2024 12:36, Alexander Korotkov wrote:
> > Also, I agree it's wrong to directly copy the RestrictInfo struct
> > in group_similar_or_args().  Instead, I've renamed
> > make_restrictinfo_internal() to make_plain_restrictinfo(), which is
> > intended to handle non-recursive cases when you've children already
> > wrapped with RestrictInfos.  make_plain_restrictinfo() now used in
> > group_similar_or_args().
> Great work. Thanks for doing this!
>
> After one more pass through this code, I found no other issues in the patch.
> Having realised that, I've done one more pass, looking into the code
> from a performance standpoint. It looks mostly ok, but In my opinion, in
> the cycle:
>
> foreach(lc, orclause->args)
> {
> }
>
> we should free the consts list before returning NULL on unsuccessful
> attempt. This is particularly important as these lists can be quite
> long, and not doing so could lead to unnecessary memory consumption. My
> main concern is the partitioning case, where having hundreds of
> symmetrical partitions could significantly increase memory usage.

Makes sense.  Please, check the attached patch freeing the consts list
while returning NULL from match_orclause_to_indexcol().

--
Regards,
Alexander Korotkov
Supabase


v40-0002-Teach-bitmap-path-generation-about-transforming-.patch
Description: Binary data


v40-0001-Transform-OR-clauses-to-SAOP-s-during-index-matc.patch
Description: Binary data


Re: Cleaning up ERRCODE usage in our XML code

2024-09-23 Thread Daniel Gustafsson
> On 20 Sep 2024, at 21:00, Tom Lane  wrote:
> 
> I wrote:
>> [ v1-clean-up-errcodes-for-xml.patch ]
> 
> Per cfbot, rebased over d5622acb3.  No functional changes.

Looking over these I don't see anything mis-characterized so +1 on going ahead
with these.  It would be neat if we end up translating xml2 errors into XQuery
Error SQLSTATEs but this is a clear improvement over what we have until then.

There is an ERRCODE_INTERNAL_ERROR in xml_out_internal() which seems a tad odd
given that any error would be known to be parsing related and b) are caused by
libxml and not internally.  Not sure if it's worth bothering with but with the
other ones improved it stood out.

--
Daniel Gustafsson





Re: ANALYZE ONLY

2024-09-23 Thread jian he
On Mon, Sep 23, 2024 at 6:04 PM David Rowley  wrote:
>
>
> If this is confusing, I think there's a bunch of detail that I tried
> to keep that's just not that useful.  The part about analyzing
> partitions just once and the part about not collecting non-inheritance
> stats for the partitioned table seems like extra detail that's either
> obvious or just not that important.
>
> Can you have a look at the attached and let me know if it's easier to
> understand now?
>

Now the regress test passed.

  
For partitioned tables, ANALYZE gathers statistics by
sampling rows from all partitions.  By default,
ANALYZE will also recursively collect and update the
statistics for each partition.  The ONLY keyword may be
used to disable this.
  

is very clear to me!


  
If the table being analyzed has inheritance children,
ANALYZE gathers two sets of statistics: one on the rows
of the parent table only, and a second including rows of both the parent
table and all of its children.  This second set of statistics is needed when
planning queries that process the inheritance tree as a whole.  The
autovacuum daemon, however, will only consider inserts or updates on the
parent table itself when deciding whether to trigger an automatic analyze
for that table.  If that table is rarely inserted into or updated, the
inheritance statistics will not be up to date unless you run
ANALYZE manually.  By default,
ANALYZE will also recursively collect and update the
statistics for each inheritance child table.  The ONLY
keyword may be used to disable this.
  


looks fine. but maybe we can add the following information
"if The ONLY is specified, the second set of
statistics won't include each children individual statistics"
I think that's the main difference between specifying ONLY or not?

catalog-pg-statistic.html second paragraph seems very clear to me.
Maybe we can link it somehow

Other than that, it looks good to me.




Re: Remove useless GROUP BY columns considering unique index

2024-09-23 Thread Zhang Mingli
Hi, all


I haven't paid attention to this topic in a long time. Thanks, all, for the
advice; I will study it and then post an update.

Thanks again.


Zhang Mingli
www.hashdata.xyz
On Sep 18, 2024 at 15:50 +0800, David Rowley , wrote:
> On Wed, 18 Sept 2024 at 19:28, Peter Eisentraut  wrote:
> >
> > On 12.09.24 03:43, David Rowley wrote:
> > > (Likely it could just look at pg_attribute.attnotnull instead)
> >
> > That won't work because you can't record dependencies on that. (This is
> > one of the reasons for cataloging not-null constraints as real constraints.)
>
> I'm not seeing any need to record constraint dependencies for this
> optimisation. It would be different for detecting functional
> dependencies in a view using a unique constraint+not null constraints
> for ungrouped columns, but that's not what this is. This is just a
> planner optimisation. The plan can be invalidated by a relcache
> invalidation, which will happen if someone does ALTER TABLE DROP NOT
> NULL.
>
> For reference, see 5b736e9cf.
>
> David


Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-23 Thread Peter Eisentraut

On 09.09.24 23:38, Shayon Mukherjee wrote:

> *Problem*:
> Adding and removing indexes is a common operation in PostgreSQL. On
> larger databases, however, these operations can be resource-intensive.
> When evaluating the performance impact of one or more indexes, dropping
> them might not be ideal since as a user you may want a quicker way to
> test their effects without fully committing to removing & adding them
> back again. Which can be a time taking operation on larger tables.
>
> *Proposal*:
> I propose adding an ALTER INDEX command that allows for enabling or
> disabling an index globally. This could look something like:
>
> ALTER INDEX index_name ENABLE;
> ALTER INDEX index_name DISABLE;
>
> A disabled index would still receive updates and enforce constraints as
> usual but would not be used for queries. This allows users to assess
> whether an index impacts query performance before deciding to drop it
> entirely.


I think a better approach would be to make the list of disabled indexes 
a GUC setting, which would then internally have an effect similar to 
enable_indexscan, meaning it would make the listed indexes unattractive 
to the planner.


This seems better than the proposed DDL command, because you'd be able 
to use this per-session, instead of forcing a global state, and even 
unprivileged users could use it.


(I think we have had proposals like this before, but I can't find the 
discussion I'm thinking of right now.)






Re: ANALYZE ONLY

2024-09-23 Thread jian he
On Mon, Sep 23, 2024 at 7:53 PM David Rowley  wrote:
>
> On Mon, 23 Sept 2024 at 23:23, jian he  wrote:
> > looks fine. but maybe we can add the following information
> > "if The ONLY is specified, the second set of
> > statistics won't include each children individual statistics"
> > I think that's the main difference between specifying ONLY or not?
>
> Ok, I think you're not understanding this yet and I'm not sure what I
> can make more clear in the documents.
>
> Let me explain... For inheritance parent tables, ANALYZE ONLY will
> gather inheritance and non-inheritance statistics for ONLY the parent.
>
> Here's an example of that:
>
> drop table if exists parent,child;
> create table parent(a int);
> create table child () inherits (parent);
> insert into parent values(1);
> insert into child values(1);
>
> analyze ONLY parent;
> select starelid::regclass,stainherit,stadistinct from pg_statistic
> where starelid in ('parent'::regclass,'child'::regclass);
>  starelid | stainherit | stadistinct
> --++-
>  parent   | f  |  -1  <- this is the distinct estimate
> for SELECT * FROM ONLY parent;
>  parent   | t  |-0.5 <- this is the distinct estimate
> for SELECT * FROM parent;
> (2 rows)
>
> For the stainherit==false stats, only 1 row is sampled here as that's
> the only row directly located in the "parent" table.
> For the stainherit==true stats, 2 rows are sampled, both of them have
> "a" == 1. The stadistinct reflects that fact.
>
> Note there have been no statistics recorded for "child". However,
> analyze did sample rows in that table as part of gathering sample rows
> for "parent" for the stainherit==true row.
>
> Now let's try again without ONLY.
>
> analyze parent;
>
> select starelid::regclass,stainherit,stadistinct from pg_statistic
> where starelid in ('parent'::regclass,'child'::regclass);
>  starelid | stainherit | stadistinct
> --++-
>  parent   | f  |  -1
>  parent   | t  |-0.5
>  child| f  |  -1
> (3 rows)
>
> All of the above rows were re-calculated with the "analyze parent"
> command, the first two rows have the same values as nothing changed in
> the table, however, there are now statistics stored for the "child"
> table.

thanks for your explanation!
now I don't have any questions about this patch.



> > catalog-pg-statistic.html second paragraph seems very clear to me.
> > Maybe we can link it somehow
>
> I don't know what "link it" means in this context.
>

I mean, change it to:

By default,
ANALYZE will also recursively collect and update the
statistics for each inheritance child table.  The ONLY
keyword may be used to disable this.
You may also refer to the catalog pg_statistic
description of stainherit.

But pg_statistic is already mentioned once, so maybe it's not a good idea.




Re: Pgoutput not capturing the generated columns

2024-09-23 Thread Peter Smith
Hi. Here are my review comments for v32-0001

You wrote: "I have addressed all the comments in the v32-0001 Patch.",
however, I found multiple old review comments not addressed. Please
give a reason if a comment is deliberately left out, otherwise, I will
assume they are omitted by accident and so keep repeating them.

There were also still some unanswered questions from previous reviews,
so I have reminded you about those again here.

==
Commit message

1.
This commit enables support for the 'publish_generated_columns' option
in logical replication, allowing the transmission of generated column
information and data alongside regular table changes. The option
'publish_generated_columns' is a PUBLICATION parameter.

~

That PUBLICATION info in the 2nd sentence would be easier to say in
the 1st sentence.
SUGGESTION:
This commit supports the transmission of generated column information
and data alongside regular table changes. This behaviour is controlled
by a new PUBLICATION parameter ('publish_generated_columns').

~~~

2.
When 'publish_generated_columns' is false, generated columns are not
replicated, even when present in a PUBLICATION col-list.

Hm. This contradicts the behaviour that Amit wanted (e.g.
"column-list takes precedence"). So I am not sure if this patch is
already catering for the behaviour suggested by Amit or if that is yet
to come in v33. For now, I am assuming that 32* has not caught up with
the latest behaviour requirements, but that might be a wrong
assumption; perhaps it is only this commit message that is bogus.

~~~

3. General.

On the same subject, there is lots of code, like:

if (att->attgenerated && !pub->pubgencols)
continue;

I suspect that might not be quite what you want for the "column-list
takes precedence" behaviour, but I am not going to identify all those
during this review. It needs lots of combinations of column list tests
to verify it.

==
doc/src/sgml/ddl.sgml

4ab.
nit - Huh?? Not changed the linkend as told in a previous review [1-#3a]
nit - Huh?? Not changed to call this a "parameter" instead of an
"option" as told in a previous review [1-#3b]

==
doc/src/sgml/protocol.sgml

5.
- 
-  Next, the following message part appears for each column included in
-  the publication (except generated columns):
- 
-

nit -- Huh?? I don't think you can just remove this whole paragraph.
But, probably you can just remove the "except generated columns" part.
I posted this same comment [4 #11] 20 patch versions back.

==
doc/src/sgml/ref/create_publication.sgml

6abc.
nit - Huh?? Not changed the parameter ID as told in a previous review [1-#6]
nit - Huh?? Not removed paragraph "This option is only available..."
as told in a previous review. See [1-#7]
nit - Huh?? Not removed paragraph "This parameter can only be set" as
told in a previous review. See [1-#8]

==
src/backend/catalog/pg_publication.c

7.
  if (TupleDescAttr(tupdesc, attnum - 1)->attgenerated)
- ereport(ERROR,
+ ereport(WARNING,
  errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
- errmsg("cannot use generated column \"%s\" in publication column list",
+ errmsg("specified generated column \"%s\" in publication column list
for publication with publish_generated_columns as false",
 colname));

I did not understand how this WARNING can know
"publish_generated_columns as false"? Should the code be checking the
function parameter 'pubgencols'?

The errmsg also seemed a bit verbose. How about:
"specified generated column \"%s\" in publication column list when
publish_generated_columns = false"

==
src/backend/replication/logical/proto.c

8.
logicalrep_write_tuple:
logicalrep_write_attrs:

Reminder. I think I have multiple questions about this code from
previous reviews that may be still unanswered. See [2 #4]. Maybe when
you implement Amit's "column list takes precedence" behaviour then
this code is fine as-is (because the replication message might include
gencols or not-gecols regardless of the 'publish_generated_columns'
value). But I don't think that is the current implementation, so
something did not quite seem right. I am not sure. If you say it is
fine then I will believe it, but the question [2 #4] remains
unanswered.

==
src/backend/replication/pgoutput/pgoutput.c

9.
send_relation_and_attrs:

Reminder: Here is another question that was answered from [2 #5]. I
did not really trust it for the current implementation, but for the
"column list takes precedence" behaviour probably it will be ok.

~~~

10.
+/*
+ * Prepare new column list bitmap. This includes all the columns of the table.
+ */
+static Bitmapset *
+prepare_all_columns_bms(PGOutputData *data, RelationSyncEntry *entry,
+ TupleDesc desc)
+{

This function needs a better comment with more explanation about what
this is REALLY doing. e.g. it says "includes all columns of the
table", but the implementation is skipping generated cols, so clearly
it is not "all columns of the table".

~~~

11. pgoutput_column_list_init

TBH

Re: Partitioned tables and [un]loggedness

2024-09-23 Thread Michael Paquier
On Fri, Sep 20, 2024 at 09:37:54AM +0900, Michael Paquier wrote:
> What do you think about adding a test with DETACH FINALIZE when
> attempting it on a normal table, its path being under a different
> subcommand than DETACH [CONCURRENTLY]?
> 
> There are no tests for normal tables with DETACH CONCURRENTLY, but as
> it is the same as DETACH under the AT_DetachPartition subcommand, that
> does not seem worth the extra cycles.

Added an extra test for the CONCURRENTLY case at the end, and applied
the simplification.

Hmm.  Should we replace the error messages in transformPartitionCmd()
with some assertions at this stage?  I don't see a high cost in
keeping these even now, and keeping errors is perhaps more useful for
some extension code that builds AT_AttachPartition or
AT_DetachPartition commands?
--
Michael


signature.asc
Description: PGP signature


Re: Pgoutput not capturing the generated columns

2024-09-23 Thread Peter Smith
Hi. Here are my v32-0002 review comments:

==
src/backend/replication/logical/tablesync.c

1. fetch_remote_table_info

  /*
- * Get column lists for each relation.
+ * Get column lists for each relation, and check if any of the
+ * publications have the 'publish_generated_columns' parameter enabled.

I am not 100% sure about this logic anymore. Maybe it is OK, but it
requires careful testing because with Amit's "column lists take
precedence" it is now possible for the publication to say
'publish_generated_columns=false', but the publication can still
publish gencols *anyway* if they were specified in a column list.

~~~

2.
  /*
  * Fetch info about column lists for the relation (from all the
  * publications).
  */
+ StringInfo pub_names = makeStringInfo();
+
+ get_publications_str(MySubscription->publications, pub_names, true);
  resetStringInfo(&cmd);
  appendStringInfo(&cmd,
~

nit - The comment here seems misplaced.

~~~

3.
+ if (server_version >= 12)
+ {
+ has_pub_with_pubgencols = server_version >= 18 && has_pub_with_pubgencols;
+
+ if (!has_pub_with_pubgencols)
+ appendStringInfo(&cmd, " AND a.attgenerated = ''");
+ }

My previous review comment about this [1 #10] was:
Can the 'gencols_allowed' var be removed, and the condition just be
replaced with if (!has_pub_with_pubgencols)? It seems equivalent
unless I am mistaken.

nit - So the current v32 code is not what I was expecting. What I
meant was 'has_pub_with_pubgencols' can only be true if server_version
>= 18, so I thought there was no reason to check it again. For
reference, I've changed it to like I meant in the nitpicks attachment.
Please see if that works the same.

==
[1] my review of v31-0002.
https://www.postgresql.org/message-id/CAHut%2BPusbhvPrL1uN1TKY%3DFd4zu3h63eDebZvsF%3Duy%2BLBKTwgA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
diff --git a/src/backend/replication/logical/tablesync.c 
b/src/backend/replication/logical/tablesync.c
index 0e34d7c..9fed6b3 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -912,13 +912,12 @@ fetch_remote_table_info(char *nspname, char *relname, 
bool **remotegenlist_res,
WalRcvExecResult *pubres;
TupleTableSlot *tslot;
Oid attrsRow[] = {INT2VECTOROID};
+   StringInfo  pub_names = makeStringInfo();
 
/*
 * Fetch info about column lists for the relation (from all the
 * publications).
 */
-   StringInfo  pub_names = makeStringInfo();
-
get_publications_str(MySubscription->publications, pub_names, 
true);
resetStringInfo(&cmd);
appendStringInfo(&cmd,
@@ -1047,13 +1046,8 @@ fetch_remote_table_info(char *nspname, char *relname, 
bool **remotegenlist_res,
 " WHERE a.attnum > 0::pg_catalog.int2"
 "   AND NOT a.attisdropped", 
lrel->remoteid);
 
-   if (server_version >= 12)
-   {
-   has_pub_with_pubgencols = server_version >= 18 && 
has_pub_with_pubgencols;
-
-   if (!has_pub_with_pubgencols)
-   appendStringInfo(&cmd, " AND a.attgenerated = ''");
-   }
+   if (!has_pub_with_pubgencols)
+   appendStringInfo(&cmd, " AND a.attgenerated = ''");
 
appendStringInfo(&cmd,
 "   AND a.attrelid = %u"


Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-23 Thread Shayon Mukherjee
On Mon, Sep 23, 2024 at 8:31 PM David Rowley  wrote:

> On Tue, 24 Sept 2024 at 03:14, Peter Eisentraut 
> wrote:
> >
> > On 09.09.24 23:38, Shayon Mukherjee wrote:
> > > ALTER INDEX index_name ENABLE;
> > > ALTER INDEX index_name DISABLE;
> >
> > I think a better approach would be to make the list of disabled indexes
> > a GUC setting, which would then internally have an effect similar to
> > enable_indexscan, meaning it would make the listed indexes unattractive
> > to the planner.
>
> I understand the last discussion went down that route too. For me, it
> seems strange that adding some global variable is seen as cleaner than
> storing the property in the same location as all the other index
> properties.
>
>
That was my first instinct as well, although being able to control this
setting at a per-session level and as an unprivileged user is somewhat
attractive.


> How would you ensure no cached plans are still using the index after
> changing the GUC?
>

Could we call ResetPlanCache() to invalidate all plan caches from the
GUC's assign_ hook when it's set (and doesn't match the old value)?
Something like this (assuming the GUC is called `disabled_indexes`):

void
assign_disabled_indexes(const char *newval, void *extra)
{
    /* invalidate cached plans only if the value actually changed */
    if (disabled_indexes == NULL || newval == NULL ||
        strcmp(disabled_indexes, newval) != 0)
        ResetPlanCache();
}

A bit heavy-handed, but perhaps that's OK, since it's not meant to be used
frequently?


> > This seems better than the proposed DDL command, because you'd be able
> > to use this per-session, instead of forcing a global state, and even
> > unprivileged users could use it.
>
> That's true.
>
> > (I think we have had proposals like this before, but I can't find the
> > discussion I'm thinking of right now.)
>
> I think it's the one that was already linked by Nathan.  [1]? The GUC
> seems to have been first suggested on the same thread in [2].
>
> David
>
> [1]
> https://www.postgresql.org/message-id/ed8c9ed7-bb5d-aaec-065b-ad4893645deb%402ndQuadrant.com
> [2] https://www.postgresql.org/message-id/29800.1529359024%40sss.pgh.pa.us
>


Shayon


Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-23 Thread Shayon Mukherjee
That's a good point.

+1 for the idea of the GUC setting, especially since, as you mentioned, it
allows unprivileged users to access it and can be set per-session.

I am happy to draft a patch for this as well. I think I have a working idea so 
far of where the necessary checks might go. However if you don’t mind, can you 
elaborate further on how the effect would be similar to enable_indexscan? 

I was thinking we could introduce a new GUC option called `disabled_indexes` 
and perform a check against in all places for each index being considered with 
its OID via get_relname_relid through a helper function in the various places 
we need to prompt the planner to not use the index (like in indxpath.c as an 
example).
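
For illustration, a rough sketch of what such a helper could look like (the
names and placement are hypothetical, not from any patch; it matches by index
name via get_rel_name() rather than get_relname_relid(), since the planner
already has the index OID in hand):

/* Hypothetical: is this index listed in the disabled_indexes GUC? */
static bool
index_is_disabled(Oid indexoid)
{
	char	   *idxname = get_rel_name(indexoid);
	char	   *rawstring;
	List	   *namelist;
	ListCell   *lc;
	bool		result = false;

	if (idxname == NULL ||
		disabled_indexes == NULL || disabled_indexes[0] == '\0')
		return false;

	/* disabled_indexes is assumed to be a comma-separated list of names */
	rawstring = pstrdup(disabled_indexes);
	if (!SplitIdentifierString(rawstring, ',', &namelist))
	{
		/* list syntax error: treat it as "nothing disabled" */
		pfree(rawstring);
		return false;
	}

	foreach(lc, namelist)
	{
		if (strcmp((char *) lfirst(lc), idxname) == 0)
		{
			result = true;
			break;
		}
	}

	list_free(namelist);
	pfree(rawstring);
	return result;
}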

Curious to learn if you have a different approach in mind perhaps?

Thank you,
Shayon


> On Sep 23, 2024, at 11:14 AM, Peter Eisentraut  wrote:
> 
> On 09.09.24 23:38, Shayon Mukherjee wrote:
>> *Problem*:
>> Adding and removing indexes is a common operation in PostgreSQL. On larger 
>> databases, however, these operations can be resource-intensive. When 
>> evaluating the performance impact of one or more indexes, dropping them 
>> might not be ideal since as a user you may want a quicker way to test their 
>> effects without fully committing to removing & adding them back again. Which 
>> can be a time taking operation on larger tables.
>> *Proposal*:
>> I propose adding an ALTER INDEX command that allows for enabling or 
>> disabling an index globally. This could look something like:
>> ALTER INDEX index_name ENABLE;
>> ALTER INDEX index_name DISABLE;
>> A disabled index would still receive updates and enforce constraints as 
>> usual but would not be used for queries. This allows users to assess whether 
>> an index impacts query performance before deciding to drop it entirely.
> 
> I think a better approach would be to make the list of disabled indexes a GUC 
> setting, which would then internally have an effect similar to 
> enable_indexscan, meaning it would make the listed indexes unattractive to 
> the planner.
> 
> This seems better than the proposed DDL command, because you'd be able to use 
> this per-session, instead of forcing a global state, and even unprivileged 
> users could use it.
> 
> (I think we have had proposals like this before, but I can't find the 
> discussion I'm thinking of right now.)
> 





Re: Add llvm version into the version string

2024-09-23 Thread Tom Lane
Dmitry Dolgov <9erthali...@gmail.com> writes:
> On Sun, Sep 22, 2024 at 01:15:54PM GMT, Dmitry Dolgov wrote:
>> About a new function, I think that the llvm runtime version has to be
>> shown in some form by pgsql_version. The idea is to skip an emails
>> exchange like "here is the bug report" -> "can you attach the llvm
>> version?".

I'm not in favor of that, for a couple of reasons:

* People already have expectations about what version() returns.
Some distro and forks even modify it (see eg --extra-version).
I think we risk breaking obscure use-cases if we add stuff onto that.
Having version() return something different from the PG_VERSION_STR
constant could cause other problems too, perhaps.

* Believing that it'll save us questioning a bug submitter seems
fairly naive to me.  Nobody includes the result of version() unless
specifically asked for it.

* I'm not in favor of overloading version() with subsystem-related
version numbers, because it doesn't scale.  Who's to say we're not
going to need the version of ICU, or libxml2, to take a couple of
obvious examples?  When you take that larger view, multiple
subsystem-specific version functions seem to me to make more sense.

Maybe another idea could be a new system view?

=> select * from pg_system_version;
   property    |        value
---------------+---------------------
 core version  | 18.1
 architecture  | x86_64-pc-linux-gnu
 word size     | 64 bit
 compiler      | gcc (GCC) 12.5.0
 ICU version   | 60.3
 LLVM version  | 18.1.0
 ...

Adding rows to a view over time seems way less likely to cause
problems than messing with a string people probably have crufty
parsing code for.

>> If it's going to be a new separate function, I guess it won't
>> make much difference to ask either to call the new function or the
>> llvm-config, right?

I do think that if we can get a version number out of the loaded
library, that is worth doing.  I don't trust the llvm-config that
happens to be in somebody's PATH to be the same version that their
PG is actually built with.

regards, tom lane




Re: SQL:2023 JSON simplified accessor support

2024-09-23 Thread Alexandra Wang
Hi Peter,

Thank you so much for helping!

On Mon, Sep 16, 2024 at 12:44 PM Peter Eisentraut  wrote:
>
> On 29.08.24 18:33, Alexandra Wang wrote:
> > I’ve implemented the member and array accessors and attached two
> > alternative patches:
> >
> > 1. v1-0001-Add-JSON-JSONB-simplified-accessor.patch: This patch
> > enables dot access to JSON object fields and subscript access to
> > indexed JSON array elements by converting "." and "[]" indirection
> > into a JSON_QUERY JsonFuncExpr node.
> >
> > 2. v2-0001-Transform-JSON-dot-access-to-arrow-operator.txt: This
> > alternative patch implements dot access to JSON object fields by
> > transforming the "." indirection into a "->" operator.
> >
> > The upside of the v1 patch is that it strictly aligns with the SQL
> > standard, which specifies that the simplified access is equivalent to:
> >
> > JSON_QUERY (VEP, 'lax $.JC' WITH CONDITIONAL ARRAY WRAPPER NULL ON
> > EMPTY NULL ON ERROR)
> >
> > However, the performance of JSON_QUERY might be suboptimal due to
> > function call overhead. Therefore, I implemented the v2 alternative
> > using the "->" operator.
> Using the operator approach would also allow taking advantage of
> optimizations such as
> .

OK, that makes sense.
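
To make the difference concrete, an illustrative example of the two rewrites
being compared (the simplified-accessor syntax shown in the comment is only
illustrative; the final grammar may differ):

-- v2/v3: a simplified accessor such as  SELECT t.jcol.a.b FROM t
-- is transformed into chained "->" operators:
SELECT ('{"a": {"b": 42}}'::jsonb) -> 'a' -> 'b';           -- 42

-- v1: the strict SQL-standard equivalence via JSON_QUERY:
SELECT JSON_QUERY('{"a": {"b": 42}}'::jsonb, 'lax $.a.b'
                  WITH CONDITIONAL ARRAY WRAPPER NULL ON EMPTY NULL ON ERROR);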

> > There is some uncertainty about the semantics of conditional array
> > wrappers. Currently, there is at least one subtle difference between
> > the "->" operator and JSON_QUERY, as shown:
>
> That JSON_QUERY bug has been fixed.
>
> I suggest you rebase both of your patches over this, just to double
> check everything.  But then I think you can drop the v1 patch and just
> submit a new version of v2.

Done. I rebased both patches and confirmed they have the same test
outputs. I attached v3, which also adds JSON subscript support on top
of v2.

> The patch should eventually contain some documentation.  It might be
> good starting to look for a good spot where to put that documentation.
> It might be either near the json types documentation or near the general
> qualified identifier syntax, not sure.

Right, I’m not sure either. A third option, I think, would be to
include it in the JSON Functions and Operators section [1].

[1] https://www.postgresql.org/docs/devel/functions-json.html

Best,
Alex


v3-0001-Add-JSON-JSONB-simplified-accessor.patch
Description: Binary data


Re: Using per-transaction memory contexts for storing decoded tuples

2024-09-23 Thread Masahiko Sawada
On Thu, Sep 19, 2024 at 10:44 PM Amit Kapila  wrote:
>
> On Thu, Sep 19, 2024 at 10:33 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, Sep 18, 2024 at 8:55 PM Amit Kapila  wrote:
> > >
> > > On Thu, Sep 19, 2024 at 6:46 AM David Rowley  wrote:
> > > >
> > > > On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada  
> > > > wrote:
> > > > > I've done some benchmark tests for three different code bases with
> > > > > different test cases. In short, reducing the generation memory context
> > > > > block size to 8kB seems to be promising; it mitigates the problem
> > > > > while keeping a similar performance.
> > > >
> > > > Did you try any sizes between 8KB and 8MB?  1000x reduction seems
> > > > quite large a jump. There is additional overhead from having more
> > > > blocks. It means more malloc() work and more free() work when deleting
> > > > a context. It would be nice to see some numbers with all powers of 2
> > > > between 8KB and 8MB.  I imagine the returns are diminishing as the
> > > > block size is reduced further.
> > > >
> > >
> > > Good idea.
> >
> > Agreed.
> >
> > I've done other benchmarking tests while changing the memory block
> > sizes from 8kB to 8MB. I measured the execution time of logical
> > decoding of one transaction that inserted 10M rows. I set
> > logical_decoding_work_mem large enough to avoid spilling behavior. In
> > this scenario, we allocate many memory chunks while decoding the
> > transaction and resulting in calling more malloc() in smaller memory
> > block sizes. Here are results (an average of 3 executions):
> >
> > 8kB: 19747.870 ms
> > 16kB: 19780.025 ms
> > 32kB: 19760.575 ms
> > 64kB: 19772.387 ms
> > 128kB: 19825.385 ms
> > 256kB: 19781.118 ms
> > 512kB: 19808.138 ms
> > 1MB: 19757.640 ms
> > 2MB: 19801.429 ms
> > 4MB: 19673.996 ms
> > 8MB: 19643.547 ms
> >
> > Interestingly, there were no noticeable differences in the execution
> > time. I've checked the number of allocated memory blocks in each case
> > and more blocks are allocated in smaller block size cases. For
> > example, when the logical decoding used the maximum memory (about
> > 1.5GB), we allocated about 80k blocks in 8kb memory block size case
> > and 80 blocks in 8MB memory block cases.
> >
>
> What exactly do these test results mean? Do you want to prove that
> there is no regression by using smaller block sizes?

Yes, there was no noticeable performance regression at least in this
test scenario.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-23 Thread David Rowley
On Tue, 24 Sept 2024 at 03:14, Peter Eisentraut  wrote:
>
> On 09.09.24 23:38, Shayon Mukherjee wrote:
> > ALTER INDEX index_name ENABLE;
> > ALTER INDEX index_name DISABLE;
>
> I think a better approach would be to make the list of disabled indexes
> a GUC setting, which would then internally have an effect similar to
> enable_indexscan, meaning it would make the listed indexes unattractive
> to the planner.

I understand the last discussion went down that route too. For me, it
seems strange that adding some global variable is seen as cleaner than
storing the property in the same location as all the other index
properties.

How would you ensure no cached plans are still using the index after
changing the GUC?

> This seems better than the proposed DDL command, because you'd be able
> to use this per-session, instead of forcing a global state, and even
> unprivileged users could use it.

That's true.

> (I think we have had proposals like this before, but I can't find the
> discussion I'm thinking of right now.)

I think it's the one that was already linked by Nathan.  [1]? The GUC
seems to have been first suggested on the same thread in [2].

David

[1] 
https://www.postgresql.org/message-id/ed8c9ed7-bb5d-aaec-065b-ad4893645deb%402ndQuadrant.com
[2] https://www.postgresql.org/message-id/29800.1529359024%40sss.pgh.pa.us




Re: Compress ReorderBuffer spill files using LZ4

2024-09-23 Thread Julien Tachoires
Hi Tomas,

Le lun. 23 sept. 2024 à 18:13, Tomas Vondra  a écrit :
>
> Hi,
>
> I've spent a bit more time on this, mostly running tests to get a better
> idea of the practical benefits.

Thank you for your code review and testing!

> Firstly, I think there's a bug in ReorderBufferCompress() - it's legal
> for pglz_compress() to return -1. This can happen if the data is not
> compressible, and would not fit into the output buffer. The code can't
> just do elog(ERROR) in this case, it needs to handle that by storing the
> raw data. The attached fixup patch makes this work for me - I'm not
> claiming this is the best way to handle this, but it works.
>
> FWIW I find it strange the tests included in the patch did not trigger
> this. That probably means the tests are not quite sufficient.
>
>
> Now, to the testing. Attached are two scripts, testing different cases:
>
> test-columns.sh - Table with a variable number of 'float8' columns.
>
> test-toast.sh - Table with a single text column.
>
> The script always sets up a publication/subscription on two instances,
> generates certain amount of data (~1GB for columns, ~3.2GB for TOAST),
> waits for it to be replicated to the replica, and measures how much data
> was spilled to disk with the different compression methods (off, pglz
> and lz4). There's a couple more metrics, but that's irrelevant here.

It would be interesting to run the same tests with zstd: in my early
testing I found that zstd was able to provide a better compression
ratio than lz4, but it seemed to use more CPU resources and be slower.

> For the "column" test, it looks like this (this is in MB):
>
>    rows   columns   distribution    off    pglz   lz4
>    ---------------------------------------------------
>    10     1000      compressible    778    20     9
>                     random          778    778    16
>    ---------------------------------------------------
>    100    100       compressible    916    116    62
>                     random          916    916    67
>
> It's very clear that for the "compressible" data (which just copies the
> same value into all columns), both pglz and lz4 can significantly reduce
> the amount of data. For 1000 columns it's 780MB -> 20MB/9MB, for 100
> columns it's a bit less efficient, but still good.
>
> For the "random" data (where every column gets a random value, but rows
> are copied), it's a very different story - pglz does not help at all,
> while lz4 still massively reduces the amount of spilled data.
>
> I think the explanation is very simple - for pglz, we compress each row
> on its own, there's no concept of streaming/context. If a row is
> compressible, it works fine, but when the row gets random, pglz can't
> compress it at all. For lz4, this does not matter, because with the
> streaming mode it still sees that rows are just repeated, and so can
> compress them efficiently.

That's correct.

> For TOAST test, the results look like this:
>
>   distribution   repeats   toast    off    pglz   lz4
>   =====================================================
>   compressible   1         lz4        14      2      1
>                            pglz       40      4      3
>                  1000      lz4        32     16      9
>                            pglz       54     17     10
>   -----------------------------------------------------
>   random         1         lz4      3305   3305   3157
>                            pglz     3305   3305   3157
>                  1000      lz4      3166   3162   1580
>                            pglz     3334   3326   1745
>   -----------------------------------------------------
>   random2        1         lz4      3305   3305   3157
>                            pglz     3305   3305   3158
>                  1000      lz4      3160   3156   3010
>                            pglz     3334   3326   3172
>
> The "repeats" value means how long the string is - it's the number of
> "md5" hashes added to the string. The number of rows is calculated to
> keep the total amount of data the same. The "toast" column tracks what
> compression was used for TOAST, I was wondering if it matters.
>
> This time there are three data distributions - compressible means that
> each TOAST value is nicely compressible, "random" means each value is
> random (not compressible), but the rows are just copy of the same value
> (so on the whole there's a lot of redundancy). And "random2" means each
> row is random and unique (so not compressible at all).
>
> The table shows that with compressible TOAST values, compressing the
> spill file is rather useless. The reason is that ReorderBufferCompress
> is handling raw TOAST data, which is already compressed. Yes, it may
> further reduce the amount of data, but it's negligible when co

Re: Increase of maintenance_work_mem limit in 64-bit Windows

2024-09-23 Thread David Rowley
On Tue, 24 Sept 2024 at 02:47, Vladlen Popolitov
 wrote:
> I agree, it is better to fix all them together. I also do not like this
> hack, it will be removed from the patch, if I check and change
> all  at once.
> I think, it will take about 1 week to fix and test all changes. I will
> estimate the total volume of the changes and think, how to group them
> in the patch ( I hope, it will be only one patch)

There's a few places that do this:

Size maxBlockSize = ALLOCSET_DEFAULT_MAXSIZE;

/* choose the maxBlockSize to be no larger than 1/16 of work_mem */
while (16 * maxBlockSize > work_mem * 1024L)

I think since maxBlockSize is a Size variable, that the above should
probably be:

while (16 * maxBlockSize > (Size) work_mem * 1024)

Maybe there can be a precursor patch to fix all those to get rid of
the 'L' and cast to the type we're comparing to or assigning to rather
than trying to keep the result of the multiplication as a long.
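
As a sketch of what that precursor change would look like at one of those
call sites (illustrative only, not the actual patch; the loop body shown is
the usual halving step):

/* Before: "long" is 32 bits on 64-bit Windows, so work_mem * 1024L can
 * overflow once work_mem goes beyond ~2GB worth of kilobytes. */
while (16 * maxBlockSize > work_mem * 1024L)
	maxBlockSize >>= 1;

/* After: do the arithmetic in the type we compare against. */
while (16 * maxBlockSize > (Size) work_mem * 1024)
	maxBlockSize >>= 1;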

David




Re: Why don't we consider explicit Incremental Sort?

2024-09-23 Thread David Rowley
On Tue, 24 Sept 2024 at 02:01, Tomas Vondra  wrote:
>
> On 9/23/24 05:03, Richard Guo wrote:
> > On Sun, Sep 22, 2024 at 1:38 PM David Rowley  wrote:
> >> Just looking at the commit message:
> >>
> >>> The rationale is based on the assumption that incremental sort is
> >>> always faster than full sort when there are presorted keys, a premise
> >>> that has been applied in various parts of the code.  This assumption
> >>> does not always hold, particularly in cases with a large skew in the
> >>> number of rows within the presorted groups.
> >>
> >> My understanding is that the worst case for incremental sort is the
> >> same as sort, i.e. only 1 presorted group, which is the same effort to
> >> sort. Is there something I'm missing?
> >
> > I was thinking that when there’s a large skew in the number of rows
> > per pre-sorted group, incremental sort might underperform full sort.
> > As mentioned in the comments for cost_incremental_sort, it has to
> > detect the sort groups, plus it needs to perform tuplesort_reset after
> > each group.  However, I doubt these factors would have a substantial
> > impact on the performance of incremental sort.  So maybe in the worst
> > skewed groups case, incremental sort may still perform similarly to
> > full sort.
> >
> > Another less obvious reason is that in cases of skewed groups, we may
> > significantly underestimate the cost of incremental sort.  This could
> > result in choosing a more expensive subpath under the sort.  Such as
> > the example in [1], we end up with IndexScan->IncrementalSort rather
> > than SeqScan->FullSort, while the former plan is more expensive to
> > execute.  However, this point does not affect this patch, because for
> > a mergejoin, we only consider outerrel's cheapest-total-cost when we
> > assume that an explicit sort atop is needed, i.e., we do not have a
> > chance to select which subpath to use in this case.
> >
> > [1] 
> > https://postgr.es/m/CAMbWs49+CXsrgbq0LD1at-5jR=ahhn0ytdy9yvgxasmfndz...@mail.gmail.com
> >
>
> You don't need any skew. Consider this perfectly uniform dataset:
>
>  Sort  (cost=127757.34..130257.34 rows=100 width=8)
>(actual time=186.288..275.813 rows=100 loops=1)
>Sort Key: a, b
>Sort Method: external merge  Disk: 17704kB
>->  Seq Scan on t  (cost=0.00..14425.00 rows=100 width=8)
> (actual time=0.005..35.989 rows=100 loops=1)
>  Planning Time: 0.075 ms
>  Execution Time: 301.143 ms
>
> set enable_incremental_sort = on;

>  Incremental Sort  (cost=1.03..68908.13 rows=100 width=8)
> (actual time=0.102..497.143 rows=100 loops=1)
>Sort Key: a, b
>Presorted Key: a
>Full-sort Groups: 27039  Sort Method: quicksort
>  Average Memory: 25kB  Peak Memory: 25kB
>->  Index Scan using t_a_idx on t
>(cost=0.42..37244.25 rows=100 width=8)
> (actual time=0.050..376.403 rows=100 loops=1)
>  Planning Time: 0.057 ms
>  Execution Time: 519.301 ms

Ok, let's first look at the total Seq Scan cost of the first EXPLAIN.
14425.00 units and 35.989 milliseconds to execute. That's about 400.81
units per millisecond. The Index Scan is only being charged 98.94
units per millisecond of execution.  If the Index Scan was costed the
same per unit as the Seq Scan, the total Index Scan cost would be
150868 which is already more than the Seq Scan plan without even
adding the Incremental Sort costs on. To me, that seems to be an
inaccuracy either with the Seq Scan costings coming out too expensive
or Index Scan coming out too cheap.
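
For reference, those per-unit rates can be recomputed directly from the two
EXPLAIN ANALYZE outputs quoted above:

SELECT 14425.00 / 35.989           AS seqscan_units_per_ms,        -- ~400.8
       37244.25 / 376.403          AS indexscan_units_per_ms,      -- ~98.9
       376.403 * 14425.00 / 35.989 AS indexscan_cost_at_seqscan_rate;  -- ~150868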

If you think that the Incremental Sort plan shouldn't be chosen
because the Index Scan costing came out too cheap (or the Seq Scan
costing was too expensive) then I disagree. Applying some penalty to
one node type because some other node type isn't costed accurately is
just not a practice we should be doing. Instead, we should be trying
our best to cost each node type as accurately as possible.  If you
think there's some inaccuracy with Incremental Sort, then let's look
into that. If you want to re-add the penalty because Index Scan
costing isn't good, let's see if we can fix Index Scan costings.

David




Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-23 Thread Shayon Mukherjee
I found an old thread here [0].

Also, a question: If we go with the GUC approach, how do we expect 
`pg_get_indexdef` to behave?

I suppose it would behave no differently than it otherwise would, because 
there's no new SQL grammar to support and, given its GUC status, it seems 
reasonable that `pg_get_indexdef` doesn’t reflect whether an index is enabled 
or not. 

If so, then I wonder if using a dedicated `ALTER` command and keeping the state 
in `pg_index` would be better for consistency's sake?

[0] https://postgrespro.com/list/id/20151212.112536.1628974191058745674.t-is...@sraoss.co.jp

Thank you
Shayon

> On Sep 23, 2024, at 4:51 PM, Shayon Mukherjee  wrote:
> 
> That's a good point.
> 
> +1 for the idea of the GUC setting, especially since, as you mentioned, it 
> allows unprivileged users to access it and being per-session..
> 
> I am happy to draft a patch for this as well. I think I have a working idea 
> so far of where the necessary checks might go. However if you don’t mind, can 
> you elaborate further on how the effect would be similar to enable_indexscan? 
> 
> I was thinking we could introduce a new GUC option called `disabled_indexes` 
> and perform a check against in all places for each index being considered 
> with its OID via get_relname_relid through a helper function in the various 
> places we need to prompt the planner to not use the index (like in indxpath.c 
> as an example).
> 
> Curious to learn if you have a different approach in mind perhaps?
> 
> Thank you,
> Shayon
> 
> 
>> On Sep 23, 2024, at 11:14 AM, Peter Eisentraut  wrote:
>> 
>> On 09.09.24 23:38, Shayon Mukherjee wrote:
>>> *Problem*:
>>> Adding and removing indexes is a common operation in PostgreSQL. On larger 
>>> databases, however, these operations can be resource-intensive. When 
>>> evaluating the performance impact of one or more indexes, dropping them 
>>> might not be ideal since as a user you may want a quicker way to test their 
>>> effects without fully committing to removing & adding them back again. 
>>> Which can be a time taking operation on larger tables.
>>> *Proposal*:
>>> I propose adding an ALTER INDEX command that allows for enabling or 
>>> disabling an index globally. This could look something like:
>>> ALTER INDEX index_name ENABLE;
>>> ALTER INDEX index_name DISABLE;
>>> A disabled index would still receive updates and enforce constraints as 
>>> usual but would not be used for queries. This allows users to assess 
>>> whether an index impacts query performance before deciding to drop it 
>>> entirely.
>> 
>> I think a better approach would be to make the list of disabled indexes a 
>> GUC setting, which would then internally have an effect similar to 
>> enable_indexscan, meaning it would make the listed indexes unattractive to 
>> the planner.
>> 
>> This seems better than the proposed DDL command, because you'd be able to 
>> use this per-session, instead of forcing a global state, and even 
>> unprivileged users could use it.
>> 
>> (I think we have had proposals like this before, but I can't find the 
>> discussion I'm thinking of right now.)
>> 
> 



Re: Conflict detection for update_deleted in logical replication

2024-09-23 Thread Masahiko Sawada
Hi,

Thank you for considering another idea.

On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila  wrote:
>
> On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > Apart from the vacuum_defer_cleanup_age idea.
> >
>
> I think you meant to say vacuum_committs_age idea.
>
> > we’ve given more thought to our
> > approach for retaining dead tuples and have come up with another idea that 
> > can
> > reliably detect conflicts without requiring users to choose a wise value for
> > the vacuum_committs_age. This new idea could also reduce the performance
> > impact. Thanks a lot to Amit for off-list discussion.
> >
> > The concept of the new idea is that, the dead tuples are only useful to 
> > detect
> > conflicts when applying *concurrent* transactions from remotes. Any 
> > subsequent
> > UPDATE from a remote node after removing the dead tuples should have a later
> > timestamp, meaning it's reasonable to detect an update_missing scenario and
> > convert the UPDATE to an INSERT when applying it.
> >
> > To achieve above, we can create an additional replication slot on the
> > subscriber side, maintained by the apply worker. This slot is used to retain
> > the dead tuples. The apply worker will advance the slot.xmin after 
> > confirming
> > that all the concurrent transaction on publisher has been applied locally.

The replication slot used for this purpose will be a physical one or
logical one? And IIUC such a slot doesn't need to retain WAL but if we
do that, how do we advance the LSN of the slot?

> > 2) the apply worker send a new message to walsender to request the latest 
> > wal
> > flush position(GetFlushRecPtr) on publisher, and save it to
> > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message 
> > or
> > extend the existing keepalive message(e,g extends the requestReply bit in
> > keepalive message to add a 'request_wal_position' value)

The apply worker sends a keepalive message when it didn't receive
anything more than wal_receiver_timeout / 2. So in a very active
system, we cannot rely on piggybacking new information to the
keepalive messages to get the latest remote flush LSN.

> > 3) The apply worker can continue to apply changes. After applying all the 
> > WALs
> > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > slot.xmin to 'candidate_xmin'.
> >
> > This approach ensures that dead tuples are not removed until all concurrent
> > transactions have been applied. It can be effective for both bidirectional 
> > and
> > non-bidirectional replication cases.
> >
> > We could introduce a boolean subscription option (retain_dead_tuples) to
> > control whether this feature is enabled. Each subscription intending to 
> > detect
> > update-delete conflicts should set retain_dead_tuples to true.
> >

I'm still studying this idea but let me confirm the following scenario.

Suppose both Node-A and Node-B have the same row (1,1) in table t, and
XIDs and commit LSNs of T2 and T3 are the following:

Node A
  T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000

Node B
  T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500, commit-LSN:5000

Further suppose that it's now 10:05 AM, and the latest XID and the
latest flush WAL position of Node-A and Node-B are following:

Node A
  current XID: 300
  latest flush LSN: 3000

Node B
  current XID: 700
  latest flush LSN: 7000

Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
(i.e., the logical replication is delaying for 5 min).

Consider the following scenario:

1. The apply worker on Node-A calls GetRunningTransactionData() and
gets 301 (set as candidate_xmin).
2. The apply worker on Node-A requests the latest WAL flush position
from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
4. The apply worker on Node-A continues applying changes, and applies
the transactions up to remote (commit) LSN 7100.
5. Now that the apply worker on Node-A applied all changes smaller
than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
301 (candidate_xmin).
6. On Node-A, vacuum runs and physically removes the tuple that was
deleted by T2.

Here, on Node-B, there might be a transition between LSN 7100 and 8000
that might require the tuple that is deleted by T2.

For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
On Node-A, whether we detect "update_deleted" or "update_missing"
still depends on when vacuum removes the tuple deleted by T2.

If applying T4 raises an "update_missing" (i.e. the changes are
applied in the order of T2->T3->(vacuum)->T4), it converts into an
insert, resulting in the table having a row with value = 3.

If applying T4 raises an "update_deleted" (i.e. the changes are
applied in the order of T2->T3->T4->(vacuum)), it's skipped, resulting
in the table having no row.

On the other hand, in th

Re: attndims, typndims still not enforced, but make the value within a sane threshold

2024-09-23 Thread Andrew Dunstan



On 2024-09-20 Fr 12:38 AM, Tom Lane wrote:

Michael Paquier  writes:

On Fri, Sep 20, 2024 at 11:51:49AM +0800, Junwang Zhao wrote:

Should you also bump the catalog version?

No need to worry about that when sending a patch because committers
take care of that when merging a patch into the tree.  Doing that in
each patch submitted just creates more conflicts and work for patch
authors because they'd need to resolve conflicts each time a
catversion bump happens.  And that can happen on a daily basis
sometimes depending on what is committed.

Right.  Sometimes the committer forgets to do that :-(, which is
not great but it's not normally a big problem either.  We've concluded
it's better to err in that direction than impose additional work
on patch submitters.



FWIW, I have a git pre-commit hook that helps avoid that. Essentially it 
checks to see if there are changes in src/include/catalog but not in 
catversion.h. That's not a 100% check, but it probably catches the vast 
majority of changes that would require a catversion bump.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Add contrib/pg_logicalsnapinspect

2024-09-23 Thread shveta malik
On Mon, Sep 23, 2024 at 7:57 AM Peter Smith  wrote:
>
> My review comments for v8-0001
>
> ==
> contrib/pg_logicalinspect/pg_logicalinspect.c
>
> 1.
> +/*
> + * Lookup table for SnapBuildState.
> + */
> +
> +#define SNAPBUILD_STATE_INCR 1
> +
> +static const char *const SnapBuildStateDesc[] = {
> + [SNAPBUILD_START + SNAPBUILD_STATE_INCR] = "start",
> + [SNAPBUILD_BUILDING_SNAPSHOT + SNAPBUILD_STATE_INCR] = "building",
> + [SNAPBUILD_FULL_SNAPSHOT + SNAPBUILD_STATE_INCR] = "full",
> + [SNAPBUILD_CONSISTENT + SNAPBUILD_STATE_INCR] = "consistent",
> +};
> +
> +/*
>
> nit - the SNAPBUILD_STATE_INCR made this code appear more complicated
> than it is.

I agree.

> Please take a look at the attachment for an alternative
> implementation which includes an explanatory comment. YMMV. Feel free
> to ignore it.
>

+1. I find Peter's version with comments easier to understand.
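
(For readers without the attachment, the shape of the simplification is
roughly the following sketch - not necessarily identical to Peter's version -
relying on SNAPBUILD_START being -1, so a fixed "+ 1" keeps the array
zero-based:)

/*
 * Lookup table for SnapBuildState.  SNAPBUILD_START is -1, so offset each
 * enum value by one to index the array from zero.
 */
static const char *const SnapBuildStateDesc[] = {
	[SNAPBUILD_START + 1] = "start",
	[SNAPBUILD_BUILDING_SNAPSHOT + 1] = "building",
	[SNAPBUILD_FULL_SNAPSHOT + 1] = "full",
	[SNAPBUILD_CONSISTENT + 1] = "consistent",
};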

thanks
Shveta




Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

2024-09-23 Thread Andrei Lepikhov

On 19/9/2024 09:55, Andrei Lepikhov wrote:
This wrong prediction makes things much worse if the query has more 
upper query blocks.
His question was: Why not consider the grouping column unique in the 
upper query block? It could improve estimations.
After a thorough investigation, I discovered that in commit  4767bc8ff2 
most of the work was already done for DISTINCT clauses. So, why not do 
the same for grouping? A sketch of the patch is attached.
As I see it, grouping in this sense works quite similarly to DISTINCT, 
and we have no reason to ignore it. After applying the patch, you can 
see that prediction has been improved:


Hash Right Join  (cost=5.62..162.56 rows=50 width=36)


A regression test is added in the new version.
The code is tiny, simple and non-invasive - it should be easy to commit 
or reject. So I have added it to the next commitfest.


--
regards, Andrei Lepikhov
From 22b572903b8c1a2459b051f91e0902a2fc243648 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" 
Date: Wed, 18 Sep 2024 19:25:02 +0200
Subject: [PATCH v2] Improve statistics estimation considering GROUP-BY as a
 'uniqueiser'.

This commit follows the idea of the 4767bc8ff2. GROUP-BY makes the grouped
column 'highly likely unique'. We can employ this fact in the statistics to
make more precise estimations in the upper query block.
---
 src/backend/utils/adt/selfuncs.c| 37 +++--
 src/include/utils/selfuncs.h|  2 +-
 src/test/regress/expected/stats_ext.out | 28 +++
 src/test/regress/sql/stats_ext.sql  | 19 +
 4 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 03d7fb5f48..b6ce8978c1 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -321,8 +321,8 @@ var_eq_const(VariableStatData *vardata, Oid oproid, Oid collation,
 	}
 
 	/*
-	 * If we matched the var to a unique index or DISTINCT clause, assume
-	 * there is exactly one match regardless of anything else.  (This is
+	 * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+	 * assume there is exactly one match regardless of anything else.  (This is
 	 * slightly bogus, since the index or clause's equality operator might be
 	 * different from ours, but it's much more likely to be right than
 	 * ignoring the information.)
@@ -483,8 +483,8 @@ var_eq_non_const(VariableStatData *vardata, Oid oproid, Oid collation,
 	}
 
 	/*
-	 * If we matched the var to a unique index or DISTINCT clause, assume
-	 * there is exactly one match regardless of anything else.  (This is
+	 * If we matched the var to a unique index, DISTINCT or GROUP-BY clause,
+	 * assume there is exactly one match regardless of anything else.  (This is
 	 * slightly bogus, since the index or clause's equality operator might be
 	 * different from ours, but it's much more likely to be right than
 	 * ignoring the information.)
@@ -5005,9 +5005,9 @@ ReleaseDummy(HeapTuple tuple)
  * atttype, atttypmod: actual type/typmod of the "var" expression.  This is
  *	commonly the same as the exposed type of the variable argument,
  *	but can be different in binary-compatible-type cases.
- * isunique: true if we were able to match the var to a unique index or a
- *	single-column DISTINCT clause, implying its values are unique for
- *	this query.  (Caution: this should be trusted for statistical
+ * isunique: true if we were able to match the var to a unique index, a
+ *	single-column DISTINCT or GROUP-BY clause, implying its values are
+ *	unique for this query.  (Caution: this should be trusted for statistical
 *	purposes only, since we do not check indimmediate nor verify that
 *	the exact same definition of equality applies.)
  * acl_ok: true if current user has permission to read the column(s)
@@ -5654,7 +5654,7 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 		Assert(IsA(subquery, Query));
 
 		/*
-		 * Punt if subquery uses set operations or GROUP BY, as these will
+		 * Punt if subquery uses set operations or grouping sets, as these will
 		 * mash underlying columns' stats beyond recognition.  (Set ops are
 		 * particularly nasty; if we forged ahead, we would return stats
 		 * relevant to only the leftmost subselect...)  DISTINCT is also
@@ -5662,7 +5662,6 @@ examine_simple_variable(PlannerInfo *root, Var *var,
 		 * of learning something even with it.
 		 */
 		if (subquery->setOperations ||
-			subquery->groupClause ||
 			subquery->groupingSets)
 			return;
@@ -5692,6 +5691,16 @@ examine_s
@@ -5692,6 +5691,16 @@ examine_s

Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-23 Thread Maciek Sakrejda
If one of the use cases is soft-dropping indexes, would a GUC approach
still support that? ALTER TABLE?


Re: ANALYZE ONLY

2024-09-23 Thread David Rowley
On Mon, 23 Sept 2024 at 22:04, David Rowley  wrote:
> Can you have a look at the attached and let me know if it's easier to
> understand now?

Pushed.

David




Re: pgbench: Improve result outputs related to failed transactinos

2024-09-23 Thread Yugo NAGATA
On Sun, 22 Sep 2024 17:59:34 +0900 (JST)
Tatsuo Ishii  wrote:

> > I would like to improve the following two points on the result outputs
> > of pgbench related to faild transaction. The patch is attached.
> > 
> > (1) Output per-script statistics even when there are no successful
> > transaction if there is any failed transactions due to serialization
> > or deadlock errors.
> > 
> > Previously, per-script statistics were never output when any transactions
> > are failed. However, it is reasonable to report per-script failed 
> > transactions
> > if they are due to serialization or deadlock errors, since these kinds of
> > failures are now objects to be reported.
> >  
> > This is fixed by modifying the following condition to use "total_cnt <= 0".
> > 
> >/* Remaining stats are nonsensical if we failed to execute any xacts 
> > */
> >if (total->cnt + total->skipped <= 0)
> >return;
> > 
> > (2) Avoid to print "NaN%" in lines on failed transaction reports.
> 
> This itself sounds good. However, in case (1) still "NaN%" are
> printed. This looks inconsistent.
> 
> t-ishii$ src/bin/pgbench/pgbench -p 11002 -c1 -t 1 -f c.sql -f d.sql 
> --failures-detailed test
> pgbench (18devel)
> starting vacuum...end.
> transaction type: multiple scripts
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> maximum number of tries: 1
> number of transactions per client: 1
> number of transactions actually processed: 0/1
> number of failed transactions: 1 (100.000%)
> number of serialization failures: 1 (100.000%)
> number of deadlock failures: 0 (0.000%)
> latency average = 7023.604 ms (including failures)
> initial connection time = 4.964 ms
> tps = 0.00 (without initial connection time)
> SQL script 1: c.sql
>  - weight: 1 (targets 50.0% of total)
>  - 0 transactions (NaN% of total, tps = 0.00)
> SQL script 2: d.sql
>  - weight: 1 (targets 50.0% of total)
>  - 0 transactions (NaN% of total, tps = 0.00)
>  - number of failed transactions: 1 (100.000%)
>  - number of serialization failures: 1 (100.000%)
>  - number of deadlock failures: 0 (0.000%)

I overlooked the "NaN% of total" in the per-script results.
I think this NaN should also be avoided.

I changed the number of transactions in the per-script results to include
skipped and failed transactions. This prevents printing "NaN% of total"
when no transactions were successfully processed.

Although it breaks backward compatibility, this seems a reasonable
modification because not only successful transactions but also skipped and
failed ones are now handled and reported for each script.  Also, the
number of transactions actually processed per script, and the TPS based
on it, are now output explicitly on a separate line.

Here is an example of the results.

$ pgbench  -f sleep.sql -f deadlock.sql --failures-detailed -t 2 -r -c 4 -j 4
pgbench (18devel)
starting vacuum...end.
transaction type: multiple scripts
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
number of transactions per client: 2
number of transactions actually processed: 5/8
number of failed transactions: 3 (37.500%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 3 (37.500%)
latency average = 7532.531 ms (including failures)
initial connection time = 7.447 ms
tps = 0.331894 (without initial connection time)
SQL script 1: sleep.sql
 - weight: 1 (targets 50.0% of total)
 - 2 transactions (25.0% of total)
 - number of transactions actually pocessed: 2 (tps = 0.132758)
 - number of failed transactions: 0 (0.000%)
 - number of serialization failures: 0 (0.000%)
 - number of deadlock failures: 0 (0.000%)
 - latency average = 1002.506 ms
 - latency stddev = 0.320 ms
 - statement latencies in milliseconds and failures:
  1002.506   0  select pg_sleep(1)
SQL script 2: deadlock.sql
 - weight: 1 (targets 50.0% of total)
 - 6 transactions (75.0% of total)
 - number of transactions actually pocessed: 3 (tps = 0.199136)
 - number of failed transactions: 3 (50.000%)
 - number of serialization failures: 0 (0.000%)
 - number of deadlock failures: 3 (50.000%)
 - latency average = 9711.271 ms
 - latency stddev = 466.328 ms
 - statement latencies in milliseconds and failures:
 0.426   0  begin;
  5352.229   0  lock b;
  2003.416   0  select pg_sleep(2);
 0.829   3  lock a;
 8.774   0  end;

I've attached the updated patch.


Regards,
Yugo Nagata

-- 
Yugo NAGATA 
>From 9dad5acfafc8fbc6ba2108202e3baa3bb9266b1b Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Thu, 19 Sep 2024 01:37:54 +0900
Subject: [PATCH v2] pgbench: Improve result outputs related to failed
 transactions

Previously, per-script statistics were never output when all
transactions are failed due to serialization or deadlock errors.
However, it is reasonable to report such information if there
are ones even when there are no successfu

RE: Conflict detection for update_deleted in logical replication

2024-09-23 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada  
wrote:
> 
> Thank you for considering another idea.

Thanks for reviewing the idea!

> 
> On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila  wrote:
> >
> > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > Apart from the vacuum_defer_cleanup_age idea.
> > >
> >
> > I think you meant to say vacuum_committs_age idea.
> >
> > > we’ve given more thought to our
> > > approach for retaining dead tuples and have come up with another idea
> that can
> > > reliably detect conflicts without requiring users to choose a wise value 
> > > for
> > > the vacuum_committs_age. This new idea could also reduce the
> performance
> > > impact. Thanks a lot to Amit for off-list discussion.
> > >
> > > The concept of the new idea is that, the dead tuples are only useful to
> detect
> > > conflicts when applying *concurrent* transactions from remotes. Any
> subsequent
> > > UPDATE from a remote node after removing the dead tuples should have a
> later
> > > timestamp, meaning it's reasonable to detect an update_missing scenario
> and
> > > convert the UPDATE to an INSERT when applying it.
> > >
> > > To achieve above, we can create an additional replication slot on the
> > > subscriber side, maintained by the apply worker. This slot is used to 
> > > retain
> > > the dead tuples. The apply worker will advance the slot.xmin after
> confirming
> > > that all the concurrent transaction on publisher has been applied locally.
> 
> The replication slot used for this purpose will be a physical one or
> logical one? And IIUC such a slot doesn't need to retain WAL but if we
> do that, how do we advance the LSN of the slot?

I think it would be a logical slot. We can keep the
restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
WALs for decoding purpose.

> 
> > > 2) the apply worker send a new message to walsender to request the latest
> wal
> > > flush position(GetFlushRecPtr) on publisher, and save it to
> > > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> message or
> > > extend the existing keepalive message(e,g extends the requestReply bit in
> > > keepalive message to add a 'request_wal_position' value)
> 
> The apply worker sends a keepalive message when it didn't receive
> anything more than wal_receiver_timeout / 2. So in a very active
> system, we cannot rely on piggybacking new information to the
> keepalive messages to get the latest remote flush LSN.

Right. I think we need to send this new message at some interval independent of
wal_receiver_timeout.

> 
> > > 3) The apply worker can continue to apply changes. After applying all the
> WALs
> > > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > > slot.xmin to 'candidate_xmin'.
> > >
> > > This approach ensures that dead tuples are not removed until all
> concurrent
> > > transactions have been applied. It can be effective for both bidirectional
> and
> > > non-bidirectional replication cases.
> > >
> > > We could introduce a boolean subscription option (retain_dead_tuples) to
> > > control whether this feature is enabled. Each subscription intending to
> detect
> > > update-delete conflicts should set retain_dead_tuples to true.
> > >
> 
> I'm still studying this idea but let me confirm the following scenario.
> 
> Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> XIDs and commit LSNs of T2 and T3 are the following:
> 
> Node A
>   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> 
> Node B
>   T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500,
> commit-LSN:5000
> 
> Further suppose that it's now 10:05 AM, and the latest XID and the
> latest flush WAL position of Node-A and Node-B are following:
> 
> Node A
>   current XID: 300
>   latest flush LSN; 3000
> 
> Node B
>   current XID: 700
>   latest flush LSN: 7000
> 
> Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> (i.e., the logical replication is delaying for 5 min).
> 
> Consider the following scenario:
> 
> 1. The apply worker on Node-A calls GetRunningTransactionData() and
> gets 301 (set as candidate_xmin).
> 2. The apply worker on Node-A requests the latest WAL flush position
> from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 
> 8000.
> 4. The apply worker on Node-A continues applying changes, and applies
> the transactions up to remote (commit) LSN 7100.
> 5. Now that the apply worker on Node-A applied all changes smaller
> than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> 301 (candidate_xmin).
> 6. On Node-A, vacuum runs and physically removes the tuple that was
> deleted by T2.
> 
> Here, on Node-B, there might be a transition between LSN 7100 and 8000
> that might require the tuple that is deleted by T2.
> 
> For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> executed on Node

Re: Add contrib/pg_logicalsnapinspect

2024-09-23 Thread shveta malik
On Fri, Sep 20, 2024 at 12:22 PM Bertrand Drouvot
 wrote:
>
>
> Please find attached v8, that:
>

Thank you for the patch. In one of my tests, I noticed that I got a
negative checksum:

postgres=# SELECT * FROM pg_get_logical_snapshot_meta('0/3481F20');
    magic    |  checksum  | version
-------------+------------+---------
  1369563137 | -266346460 |       6

But pg_crc32c is uint32. Is it because we are getting it as
Int32GetDatum(ondisk.checksum) in pg_get_logical_snapshot_meta()?
Instead should it be UInt32GetDatum?

Same goes for below:
values[i++] = Int32GetDatum(ondisk.magic);
values[i++] = Int32GetDatum(ondisk.magic);

We need to recheck the rest of the fields in the info() function as well.

thanks
Shveta




Re: Conflict detection for update_deleted in logical replication

2024-09-23 Thread Amit Kapila
On Tue, Sep 24, 2024 at 2:35 AM Masahiko Sawada  wrote:
>
> >
> > As each apply worker needs a separate slot to retain deleted rows, the
> > requirement for slots will increase. The other possibility is to
> > maintain one slot by launcher or some other central process that
> > traverses all subscriptions, remember the ones marked with
> > retain_dead_rows (let's call this list as retain_sub_list). Then using
> > running_transactions get the oldest running_xact, and then get the
> > remote flush location from the other node (publisher node) and store
> > those as candidate values (candidate_xmin and
> > candidate_remote_wal_lsn) in slot. We can probably reuse existing
> > candidate variables of the slot. Next, we can check the remote_flush
> > locations from all the origins corresponding in retain_sub_list and if
> > all are ahead of candidate_remote_wal_lsn, we can update the slot's
> > xmin to candidate_xmin.
>
> Does it mean that we use one candiate_remote_wal_lsn in a slot for all
> subscriptions (in retain_sub_list)? IIUC candiate_remote_wal_lsn is a
> LSN of one of publishers, so other publishers could have completely
> different LSNs. How do we compare the candidate_remote_wal_lsn to
> remote_flush locations from all the origins?
>

This should be an array/list with one element per publisher. We can
copy candidate_xmin to actual xmin only when the
candiate_remote_wal_lsn's corresponding to all publishers have been
applied aka their remote_flush locations (present in origins) are
ahead. The advantages I see with this are (a) reduces the number of
slots required to achieve the retention of deleted rows for conflict
detection, (b) in some cases we can avoid sending messages to the
publisher because with this we only need to send message to a
particular publisher once rather than by all the apply workers
corresponding to same publisher node.

-- 
With Regards,
Amit Kapila.




Re: Large expressions in indexes can't be stored (non-TOASTable)

2024-09-23 Thread Michael Paquier
On Mon, Sep 23, 2024 at 10:50:21AM -0500, Nathan Bossart wrote:
> I carefully inspected all the code paths this patch touches, and I think
> I've got all the details right, but I would be grateful if someone else
> could take a look.

No objections from here with putting the snapshots pops and pushes
outside the inner routines of reindex/drop concurrently, meaning that
ReindexRelationConcurrently(), DefineIndex() and index_drop() are fine
to do these operations.

Looking at the patch, we could just add an assertion based on
ActiveSnapshotSet() in index_set_state_flags().

Actually, thinking more...  Could it be better to have some more
sanity checks in the stack outside the toast code for catalogs with
toast tables?  For example, I could imagine adding a check in
CatalogTupleUpdate() so that all accesses to catalogs that have a TOAST
relation require an active snapshot.  That would make the checks more
aggressive, because we would not need any toast data in a catalog to
make sure that a snapshot is set.  This strikes me as a way to improve
the detection of failures like the one reported by Alexander, since the
check would be exercised every time we dirty a catalog tuple via
CatalogTupleUpdate().  The idea is then to have the check sit in front of
the HaveRegisteredOrActiveSnapshot() call in the toast internals, for
catalogs, so that we would not require toast data to detect problems.
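
Something along these lines, perhaps (a sketch of the idea only; the exact
placement and shape would need to be worked out):

/*
 * In CatalogTupleUpdate(): any catalog that has a TOAST table must have a
 * snapshot available, whether or not this particular update ends up
 * toasting anything.
 */
Assert(!OidIsValid(heapRel->rd_rel->reltoastrelid) ||
	   HaveRegisteredOrActiveSnapshot());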
--
Michael


signature.asc
Description: PGP signature


Re: Conflict detection for update_deleted in logical replication

2024-09-23 Thread Masahiko Sawada
On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada 
>  wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> >
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila  wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > >  wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > > >
> > >
> > > I think you meant to say vacuum_committs_age idea.
> > >
> > > > we’ve given more thought to our
> > > > approach for retaining dead tuples and have come up with another idea
> > that can
> > > > reliably detect conflicts without requiring users to choose a wise 
> > > > value for
> > > > the vacuum_committs_age. This new idea could also reduce the
> > performance
> > > > impact. Thanks a lot to Amit for off-list discussion.
> > > >
> > > > The concept of the new idea is that, the dead tuples are only useful to
> > detect
> > > > conflicts when applying *concurrent* transactions from remotes. Any
> > subsequent
> > > > UPDATE from a remote node after removing the dead tuples should have a
> > later
> > > > timestamp, meaning it's reasonable to detect an update_missing scenario
> > and
> > > > convert the UPDATE to an INSERT when applying it.
> > > >
> > > > To achieve above, we can create an additional replication slot on the
> > > > subscriber side, maintained by the apply worker. This slot is used to 
> > > > retain
> > > > the dead tuples. The apply worker will advance the slot.xmin after
> > confirming
> > > > that all the concurrent transaction on publisher has been applied 
> > > > locally.
> >
> > The replication slot used for this purpose will be a physical one or
> > logical one? And IIUC such a slot doesn't need to retain WAL but if we
> > do that, how do we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
> WALs for decoding purpose.
>
> >
> > > > 2) the apply worker send a new message to walsender to request the 
> > > > latest
> > wal
> > > > flush position(GetFlushRecPtr) on publisher, and save it to
> > > > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> > message or
> > > > extend the existing keepalive message(e,g extends the requestReply bit 
> > > > in
> > > > keepalive message to add a 'request_wal_position' value)
> >
> > The apply worker sends a keepalive message when it didn't receive
> > anything more than wal_receiver_timeout / 2. So in a very active
> > system, we cannot rely on piggybacking new information to the
> > keepalive messages to get the latest remote flush LSN.
>
> Right. I think we need to send this new message at some interval independent 
> of
> wal_receiver_timeout.
>
> >
> > > > 3) The apply worker can continue to apply changes. After applying all 
> > > > the
> > WALs
> > > > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > > > slot.xmin to 'candidate_xmin'.
> > > >
> > > > This approach ensures that dead tuples are not removed until all
> > concurrent
> > > > transactions have been applied. It can be effective for both 
> > > > bidirectional
> > and
> > > > non-bidirectional replication cases.
> > > >
> > > > We could introduce a boolean subscription option (retain_dead_tuples) to
> > > > control whether this feature is enabled. Each subscription intending to
> > detect
> > > > update-delete conflicts should set retain_dead_tuples to true.
> > > >
> >
> > I'm still studying this idea but let me confirm the following scenario.
> >
> > Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> > XIDs and commit LSNs of T2 and T3 are the following:
> >
> > Node A
> >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> >
> > Node B
> >   T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500,
> > commit-LSN:5000
> >
> > Further suppose that it's now 10:05 AM, and the latest XID and the
> > latest flush WAL position of Node-A and Node-B are following:
> >
> > Node A
> >   current XID: 300
> >   latest flush LSN; 3000
> >
> > Node B
> >   current XID: 700
> >   latest flush LSN: 7000
> >
> > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > (i.e., the logical replication is delaying for 5 min).
> >
> > Consider the following scenario:
> >
> > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > gets 301 (set as candidate_xmin).
> > 2. The apply worker on Node-A requests the latest WAL flush position
> > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 
> > 8000.
> > 4. The apply worker on Node-A continues applying changes, and applies
> > the transactions up to remote (commit) LSN 7100.
> > 5. Now that the apply worker on Node-A applied all changes smaller
> > than candidate_remote_wal_lsn (7000), it increases the s

Re: Why don't we consider explicit Incremental Sort?

2024-09-23 Thread Richard Guo
On Mon, Sep 23, 2024 at 10:01 PM Tomas Vondra  wrote:
> You don't need any skew. Consider this perfectly uniform dataset:
>
> create table t (a int, b int);
> insert into t select 10 * random(), 100 * random()
>   from generate_series(1,100) s(i);
> create index on t (a);
> vacuum analyze;
> checkpoint;
>
> explain analyze select * from t order by a, b;

I think if we want to compare the performance of incremental sort vs.
full sort on this dataset, we need to ensure that other variables are
kept constant, ie we need to ensure that both methods use the same
subpath, whether it's Index Scan or Seq Scan.

This is especially true in the scenario addressed by this patch, as
for a mergejoin, we only consider outerrel's cheapest_total_path when
we assume that an explicit sort atop is needed.  That is to say, the
subpath has already been chosen when it comes to determine whether to
use incremental sort or full sort.

According to this theory, here is what I got on this same dataset:

-- incremental sort
explain analyze select * from t order by a, b;
 QUERY PLAN

 Incremental Sort  (cost=1.02..68838.65 rows=100 width=8) (actual
time=1.292..1564.265 rows=100 loops=1)
   Sort Key: a, b
   Presorted Key: a
   Full-sort Groups: 27022  Sort Method: quicksort  Average Memory:
25kB  Peak Memory: 25kB
   ->  Index Scan using t_a_idx on t  (cost=0.42..37244.20
rows=100 width=8) (actual time=0.408..1018.785 rows=100
loops=1)
 Planning Time: 0.998 ms
 Execution Time: 1602.861 ms
(7 rows)


-- full sort
set enable_incremental_sort to off;
set enable_seqscan to off;
explain analyze select * from t order by a, b;
 QUERY PLAN

 Sort  (cost=150576.54..153076.54 rows=100 width=8) (actual
time=1760.090..1927.598 rows=100 loops=1)
   Sort Key: a, b
   Sort Method: external merge  Disk: 17704kB
   ->  Index Scan using t_a_idx on t  (cost=0.42..37244.20
rows=100 width=8) (actual time=0.531..1010.931 rows=100
loops=1)
 Planning Time: 0.980 ms
 Execution Time: 1970.287 ms
(6 rows)

So incremental sort outperforms full sort on this dataset.

Thanks
Richard




Re: installcheck-world concurrency issues

2024-09-23 Thread Michael Paquier
On Thu, Oct 06, 2022 at 01:52:46PM +0900, Michael Paquier wrote:
> For the archives' sake: this has been applied as of 6a20b04 and
> c3315a7.

Corey (added in CC) has noticed that the issue fixed by c3315a7 in
16~ for advisory locks is not complicated to reach, leading to
failures in some of our automated internal stuff.  A cherry-pick of
c3315a7 works cleanly across 12~15.  Would there be any objections if
I were to backpatch this part down to 12?

The problems fixed by 6a20b04 have not really been an issue here,
hence I'd rather let things be as they are for the conflicting role
names.
--
Michael


signature.asc
Description: PGP signature


Re: Conflict detection for update_deleted in logical replication

2024-09-23 Thread Amit Kapila
On Tue, Sep 24, 2024 at 9:02 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada 
>  wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> >
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila  wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > >  wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > > >
> > >
> > > I think you meant to say vacuum_committs_age idea.
> > >
> > > > we’ve given more thought to our
> > > > approach for retaining dead tuples and have come up with another idea
> > that can
> > > > reliably detect conflicts without requiring users to choose a wise 
> > > > value for
> > > > the vacuum_committs_age. This new idea could also reduce the
> > performance
> > > > impact. Thanks a lot to Amit for off-list discussion.
> > > >
> > > > The concept of the new idea is that, the dead tuples are only useful to
> > detect
> > > > conflicts when applying *concurrent* transactions from remotes. Any
> > subsequent
> > > > UPDATE from a remote node after removing the dead tuples should have a
> > later
> > > > timestamp, meaning it's reasonable to detect an update_missing scenario
> > and
> > > > convert the UPDATE to an INSERT when applying it.
> > > >
> > > > To achieve above, we can create an additional replication slot on the
> > > > subscriber side, maintained by the apply worker. This slot is used to 
> > > > retain
> > > > the dead tuples. The apply worker will advance the slot.xmin after
> > confirming
> > > > that all the concurrent transaction on publisher has been applied 
> > > > locally.
> >
> > The replication slot used for this purpose will be a physical one or
> > logical one? And IIUC such a slot doesn't need to retain WAL but if we
> > do that, how do we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
> WALs for decoding purpose.
>

As per my understanding, one of the main reasons to keep it logical is
to allow syncing it to standbys (slotsync functionality). That is
required because, after promotion, the subscriptions replicated to the
standby could be enabled to make it a subscriber. If that is not
possible for some reason, then we can consider making it a physical
slot as well.
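
Just to illustrate the slotsync angle (a rough sketch, not how the patch
would do it -- the apply worker would create and advance the slot
internally; the slot name and plugin below are made up):

-- on the subscriber, a failover-enabled logical slot is eligible for
-- syncing to its own standbys
SELECT pg_create_logical_replication_slot('retain_dead_tuples_slot',
                                          'pgoutput',
                                          false,  -- temporary
                                          false,  -- twophase
                                          true);  -- failover

-- on a standby of the subscriber, with sync_replication_slots = on,
-- or manually:
SELECT pg_sync_replication_slots();

That way, after promotion the promoted node would still carry the xmin
protection needed for conflict detection.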

-- 
With Regards,
Amit Kapila.




Re: Add new COPY option REJECT_LIMIT

2024-09-23 Thread torikoshia

Updated the patch.

On 2024-07-23 23:06, torikoshia wrote:

On Tue, Jul 23, 2024 at 1:35 PM Fujii Masao 
 wrote:


Thanks for your review.


On 2024/07/22 21:37, torikoshia wrote:
On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao  
wrote:

Thanks for the comment.

In patch 0002, the ratio is calculated by the already 
skipped/processed
rows, but what if a user wants to copy 1000 rows, and he/she can 
tolerate
10 error rows, so he/she might set *reject_limit 0.01*, but one bad 
row in the
first 100 rows will fail the entire command, this might surprise the 
user.


Since the ratio is calculated after all data is processed, the case 
"one bad row in the first 100 rows will fail the entire command" 
doesn't happen:


Yes, but is this the desired behavior when using the ratio threshold?
I was thinking that COPY should fail once the error ratio (errors vs.
total rows in the input file) reaches the threshold.


Users might expect the behavior you describe.
However, implementing it would require a feature to count the total number
of input rows, as you pointed out, and that seems like overkill.

Describing the difference between ratio and number in the manual might 
help, but

it might be better to make REJECT_LIMIT support only the number of
errors and leave it to the user to calculate the number from the
ratio.

I'd like to hear if anyone has an opinion on the need for supporting 
ratio.


Since there was no opinion about it, attached a patch only for 
REJECT_LIMIT specifying number.
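
For reference, usage with the number-based limit would look something like
this (a sketch of the syntax discussed in this thread; the table and file
names are made up):

=# COPY test FROM '/tmp/test.data' WITH (ON_ERROR 'ignore', REJECT_LIMIT 7);
-- succeeds as long as at most 7 input rows are skipped due to data type
-- errors; fails as soon as an 8th bad row is encountered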


+  This option must be used with ON_ERROR to be set to
+  other than stop.

Regarding the ON_ERROR option, it currently has only two values.
Instead of saying "other than stop," shouldn't we simply use "ignore"
for clarity and intuition?


I'll modify it.
The reason for the roundabout wording was the expectation that
on_error would support values other than these in the future, but as
you point out, there are currently only two.


+  When specified INFINITY, COPY ignores all
+  the errors. This is a synonym for ON_ERROR ignore.


For the INFINITY value, the description is a bit unclear.
As I understand it, INFINITY is the default for REJECT_LIMIT.
So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT,
COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is 
used.



In line with my previous suggestion, if we support only REJECT_LIMIT
without ON_ERROR, having INFINITY makes sense. However,
with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant.
Users can just set ON_ERROR=ignore to ignore all errors,
making INFINITY unnecessary for REJECT_LIMIT. This is open for
discussion, but I believe it's better to remove INFINITY from
the REJECT_LIMIT options.


Agreed.
Unless there are opposing opinions, I'm going to remove 'INFINITY' in
the next patch.


As there are no opposing opinions, removed 'INFINITY' as well.


+   else if (strcmp(defel->defname, "reject_limit") == 0)
+   {
+       if (reject_limit_specified)
+           errorConflictingDefElem(defel, pstate);
+
+       if (!opts_out->on_error)
+           ereport(ERROR,
+                   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                    errmsg("REJECT_LIMIT requires ON_ERROR to be set to other than stop")));


Using "ignore" instead of "other than stop" in the error message
is clearer and more intuitive.


Agreed.



Checking if ON_ERROR and REJECT_LIMIT are specified should be
done after the foreach() processing of the options. Otherwise,
if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can 
occur.


---
=# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore');
ERROR:  REJECT_LIMIT requires ON_ERROR to be set to other than stop
---


Ugh, I'll modify it.
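
Something along these lines, i.e. moving the cross-option check out of the
per-option loop (just a sketch against the field names visible in the hunks
above, not the actual patch):

    /* after the foreach() over the option list */
    if (opts_out->reject_limits.num_err > 0 &&
        opts_out->on_error != COPY_ON_ERROR_IGNORE)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("REJECT_LIMIT requires ON_ERROR to be set to \"ignore\"")));

This way the result doesn't depend on the order in which the two options
are written in the COPY command.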


+   ereport(ERROR,
+           (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+            errmsg("exceeded the number specified by REJECT_LIMIT \"%lld\"",
+                   (long long) cstate->opts.reject_limits.num_err)));


The error message isn't clear about what exceeded REJECT_LIMIT.
How about saying "skipped more than REJECT_LIMIT rows" or something 
instead?


Agreed.


+/*
+ * A struct to hold reject_limit options, in a parsed form.
+ * More values to be added in another patch.
+ */

The latter comment doesn't seem necessary or helpful.


Agreed.

[1] 
https://www.postgresql.org/message-id/CAKFQuwYP91_G6tktYFTZq_CmkZ_%3DzuWjkz1%2B25Nd8bpsrDkx5Q%40mail.gmail.com




On Tue, Jul 23, 2024 at 11:10 PM torikoshia  
wrote:

On 2024-07-23 02:06, Kirill Reshke wrote:



If nothing is specified, then the maximum tolerable number of errors
is one, right? Should we state this explicitly in the documentation?



REJECT_LIMIT can now be used only when on_error=ignore, I think the
defaul

Re: Cleaning up ERRCODE usage in our XML code

2024-09-23 Thread Daniel Gustafsson
> On 23 Sep 2024, at 19:17, Tom Lane  wrote:
> Daniel Gustafsson  writes:
> 
>> There is an ERRCODE_INTERNAL_ERROR in xml_out_internal() which seems a tad
>> odd given that any error would a) be known to be parsing related and b) be
>> caused by libxml and not internally.  Not sure if it's worth bothering with
>> but with the other ones improved it stood out.
> 
> Yeah, I looked at that but wasn't sure what to do with it.  We should
> have validated the decl header when the XML value was created, so if
> we get here then either the value got corrupted on-disk or in-transit,
> or something forgot to do that validation, or libxml has changed its
> mind since then about what's a valid decl.  At least some of those
> cases are probably legitimately called INTERNAL_ERROR.  I thought for
> awhile about ERRCODE_DATA_CORRUPTED, but I'm not convinced that's a
> better fit.

I agree that it might not be an obvious better fit, but also not an obvious
worse fit.  It will make it easier to filter on during fleet analysis so I
would be inclined to change it, but the main value of the patch is in the
other hunks, so feel free to ignore.
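
For the archives, the hunk under discussion essentially just swaps the
errcode in that report, e.g. (sketch only; the report level and message
text here are illustrative, not quoted from the patch):

    ereport(WARNING,
            (errcode(ERRCODE_DATA_CORRUPTED),
             errmsg("invalid XML declaration in stored value")));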

--
Daniel Gustafsson





Re: not null constraints, again

2024-09-23 Thread jian he
On Sat, Sep 21, 2024 at 5:15 AM Alvaro Herrera  wrote:
>
> Okay, so here is v4 with these problems fixed, including correct
> propagation of constraint names to children tables, which I had
> inadvertently broken earlier.  This one does pass the pg_upgrade tests
> and as far as I can see pg_dump does all the correct things also.  I
> cleaned up the tests to remove everything that's unneeded, redundant, or
> testing behavior that no longer exists.
>

in findNotNullConstraintAttnum
        if (con->contype != CONSTRAINT_NOTNULL)
            continue;
        if (!con->convalidated)
            continue;

If con->convalidated is false, then we have a bigger problem?
Maybe we can change this to an ERROR to expose/capture potential problems,
like:
        if (con->contype != CONSTRAINT_NOTNULL)
            continue;
        if (!con->convalidated)
            elog(ERROR, "not-null constraint is not validated");

----
HeapTuple
findNotNullConstraint(Oid relid, const char *colname)
{
    AttrNumber  attnum = get_attnum(relid, colname);

    return findNotNullConstraintAttnum(relid, attnum);
}

we can change to

HeapTuple
findNotNullConstraint(Oid relid, const char *colname)
{
    AttrNumber  attnum = get_attnum(relid, colname);

    if (attnum <= InvalidAttrNumber)
        return NULL;

    return findNotNullConstraintAttnum(relid, attnum);
}
----

sql-createtable.html
SECTION: LIKE source_table [ like_option ... ]
INCLUDING CONSTRAINTS
CHECK constraints will be copied. No distinction is made between
column constraints and table constraints. Not-null constraints are
always copied to the new table.

drop table if exists t, t_1,ssa;
create table t(a int, b int, not null a no inherit);
create table ssa (like t INCLUDING all);

Here CREATE TABLE ... LIKE won't copy the NO INHERIT not-null constraint,
which seems to conflict with the doc?
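
One way to see what actually got copied (a sketch; assuming not-null
constraints show up in pg_constraint with contype = 'n' as in this patch
series):

select conrelid::regclass, conname, contype, connoinherit
from pg_constraint
where conrelid in ('t'::regclass, 'ssa'::regclass) and contype = 'n';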

----
drop table if exists t, t_1;
create table t(a int primary key, b int,  not null a no inherit);
create table t_1 () inherits (t);

t_1 will inherit the not-null constraint from t,
so the NO INHERIT part of the "not null a no inherit" syntax is ignored.

other cases:
create table t(a int not null, b int,  not null a no inherit);
create table t(a int not null no inherit, b int,  not null a);

It seems that currently, if the column constraint carries a not-null
constraint, that one is used and the table-level not-null constraint is
ignored; but if the column constraint has no not-null, the table-level
constraint determines the behavior.