Re: Support to define custom wait events for extensions

2023-06-16 Thread Michael Paquier
On Fri, Jun 16, 2023 at 11:14:05AM +0900, Masahiro Ikeda wrote:
> I tried to query on pg_stat_activity to check the background worker
> invoked by pg_prewarm. But, I found that pg_stat_activity doesn't show
> it although I may be missing something...
> 
> So, I tried to implement TAP tests. But I have a problem with it.
> I couldn't find the way to check the status of another backend
> while the another backend wait with custom wait events.

Hmm.  Right.  It seems to me that BGWORKER_BACKEND_DATABASE_CONNECTION
is required in this case, with BackgroundWorkerInitializeConnection()
to connect to a database (or not, like the logical replication
launcher if only access to shared catalogs is wanted).

I have missed that the leader process of pg_prewarm does not use that,
because it has no need to connect to a database, but its workers do.
So it is not going to show up in pg_stat_activity.
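
For reference, a minimal sketch (hypothetical names, not pg_prewarm's code) of
a worker that does show up in pg_stat_activity would look roughly like this:

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/latch.h"
#include "utils/wait_event.h"

PG_MODULE_MAGIC;

PGDLLEXPORT void my_worker_main(Datum main_arg);

void
_PG_init(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	/* this flag is what makes the worker visible in pg_stat_activity */
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_worker_main");
	snprintf(worker.bgw_name, BGW_MAXLEN, "my waiting worker");
	RegisterBackgroundWorker(&worker);
}

void
my_worker_main(Datum main_arg)
{
	BackgroundWorkerUnblockSignals();

	/*
	 * Passing NULL for the database works like the logical replication
	 * launcher: only shared catalogs are reachable, but the process still
	 * gets a pg_stat_activity entry.
	 */
	BackgroundWorkerInitializeConnection(NULL, NULL, 0);

	/* wait forever; a real worker would do its work here */
	for (;;)
	{
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
						 -1L, PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}
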
--
Michael


signature.asc
Description: PGP signature


Re: add non-option reordering to in-tree getopt_long

2023-06-16 Thread Kyotaro Horiguchi
At Thu, 15 Jun 2023 21:58:28 -0700, Nathan Bossart  
wrote in 
> On Fri, Jun 16, 2023 at 10:30:09AM +0900, Michael Paquier wrote:
> > On Thu, Jun 15, 2023 at 05:09:59PM -0700, Nathan Bossart wrote:
> >> I've attached a new version of the patch that omits the
> >> POSIXLY_CORRECT stuff.
> > 
> > This looks OK at quick glance, though you may want to document at the
> > top of getopt_long() the reordering business with the non-options?
> 
> I added a comment to this effect in v3.  I also noticed that '-' wasn't
> properly handled as a non-option, so I've tried to fix that as well.

(Honestly, the rearrangement code looks somewhat tricky to grasp..)

It doesn't work properly if '--' is involved.

For example, consider the following options (even though they don't
work for the command).

psql -t -T hoho -a hoge -- -1 hage hige huge

After the function returns -1, the argv array looks like this. The
"[..]" indicates the position of optind.

psql -t -T hoho -a -- [-1] hage hige huge hoge

This is clearly incorrect since "hoge" gets moved to the end.  The
rearrangement logic needs to take into account the '--'.  For your
information, the glibc version yields the following result for the
same options.

psql -t -T hoho -a -- [hoge] -1 hage hige huge

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Use the enum value CRS_EXPORT_SNAPSHOT instead of "0"

2023-06-16 Thread Masahiko Sawada
On Fri, Jun 16, 2023 at 3:10 PM Wei Wang (Fujitsu)
 wrote:
>
> Hi,
>
> In the function WalReceiverMain, when the function walrcv_create_slot is 
> called,
> the fourth parameter is assigned the value "0" instead of the enum value
> "CRS_EXPORT_SNAPSHOT". I think it would be better to use the corresponding 
> enum
> value.

The walreceiver process doesn't use CRS_EXPORT_SNAPSHOT actually,
right? I think replacing it with CRS_EXPORT_SNAPSHOT would rather
confuse me

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Incorrect estimation of HashJoin rows resulting from inaccurate small table statistics

2023-06-16 Thread Quan Zongliang


We have a small table with only 23 rows and 21 values.

The resulting MCV and histogram are as follows:
stanumbers1 | {0.08695652,0.08695652}
stavalues1  | {v1,v2}
stavalues2  | 
{v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}


An incorrect number of rows was estimated when a HashJoin was done with
another large table (about 2 million rows).


Hash Join  (cost=1.52..92414.61 rows=2035023 width=0) (actual 
time=1.943..1528.983 rows=3902 loops=1)


The reason is that the MCV of the small table excludes values that appear
in only one row. Putting them into the MCV in the statistics gives the
correct result.


Using the conservative condition samplerows <= attstattarget doesn't
completely solve this problem, but it does solve this case.


After the modification we get statistics without a histogram:
stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
stavalues1  | {v,v2, ... }

And we have the right estimates:
Hash Join  (cost=1.52..72100.69 rows=3631 width=0) (actual 
time=1.447..1268.385 rows=3902 loops=1)



Regards,

--
Quan Zongliang
Beijing Vastdata Co., LTD

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 52ef462dba..08ea4243f5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2518,7 +2518,7 @@ compute_scalar_stats(VacAttrStatsP stats,
{
/* Reached end of duplicates of this value */
ndistinct++;
-   if (dups_cnt > 1)
+   if (dups_cnt > 1 || samplerows <= num_mcv)
{
nmultiple++;
if (track_cnt < num_mcv ||


Re: Use the enum value CRS_EXPORT_SNAPSHOT instead of "0"

2023-06-16 Thread Peter Smith
On Fri, Jun 16, 2023 at 6:17 PM Masahiko Sawada  wrote:
>
> On Fri, Jun 16, 2023 at 3:10 PM Wei Wang (Fujitsu)
>  wrote:
> >
> > Hi,
> >
> > In the function WalReceiverMain, when the function walrcv_create_slot is 
> > called,
> > the fourth parameter is assigned the value "0" instead of the enum value
> > "CRS_EXPORT_SNAPSHOT". I think it would be better to use the corresponding 
> > enum
> > value.
>
> The walreceiver process doesn't use CRS_EXPORT_SNAPSHOT actually,
> right? I think replacing it with CRS_EXPORT_SNAPSHOT would rather
> confuse me
>

Passing some number (0) that has the same value as an enum, while at
the same time not intending it to have the same meaning as that enum,
smells strange to me.

If none of the existing enum values is meaningful here, then perhaps
another value (CRS_UNUSED?) ought to be added and passed instead.

~

Alternatively, maybe continue to pass 0, but ensure the existing enums
do not include any value of 0.

e.g.
typedef enum
{
CRS_EXPORT_SNAPSHOT = 1,
CRS_NOEXPORT_SNAPSHOT,
CRS_USE_SNAPSHOT
} CRSSnapshotAction;

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Use the enum value CRS_EXPORT_SNAPSHOT instead of "0"

2023-06-16 Thread Alvaro Herrera
On 2023-Jun-16, Masahiko Sawada wrote:

> The walreceiver process doesn't use CRS_EXPORT_SNAPSHOT actually,
> right? I think replacing it with CRS_EXPORT_SNAPSHOT would rather
> confuse me

libpqwalreceiver.c does use it.  But I agree -- I think it would be
better to not use the enum in walreceiver at all.  IIRC if we stopped
using that enum in {libpq}walreceiver, then we wouldn't need to include
walsender.h in the walreceiver files.

However, changing it means a change of the walrcv_create_slot API, so
it's not all that trivial.  But we could have a walreceiver-side enum
instead (with the same values).  I think this would be worth doing,
because it'll all end up cleaner.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Synchronizing slots from primary to standby

2023-06-16 Thread Amit Kapila
On Mon, Apr 17, 2023 at 7:37 PM Drouvot, Bertrand
 wrote:
>
> Please find attached V5 (a rebase of V4 posted up-thread).
>
> In addition to the "rebasing" work, the TAP test adds a test about conflict 
> handling (logical slot invalidation)
> relying on the work done in the "Minimal logical decoding on standby" patch 
> series.
>
> I did not look more at the patch (than what was needed for the rebase) but 
> plan to do so.
>

Are you still planning to continue working on this? Some miscellaneous
comments from going through this patch are as follows:

1. Can you please try to explain the functionality of the overall
patch somewhere in the form of comments and or commit message?

2. It seems that the initially synchronized list of slots is only used
to launch a per-database worker to synchronize all the slots
corresponding to that database. If so, then why do we need to fetch
all the slot-related information via LIST_SLOTS command?

3. As mentioned in the initial email, I think it would be better to
replace LIST_SLOTS command with a SELECT query.

4. How is the limit of sync_slot workers decided? Can we document such
a piece of information? Do we need a new GUC to decide the number of
workers? Ideally, it would be better to avoid a new GUC; can we use any
existing logical-replication-worker-related GUC?

5. Can we separate out the functionality related to standby_slot_names
into a separate patch, probably the first one? I think that will make the
patch easier to review.

6. In libpqrcv_list_slots(), two-phase related slot information is not
retrieved. Is there a reason for the same?

7.
+static void
+wait_for_standby_confirmation(XLogRecPtr commit_lsn)

Some comments atop this function would make it easier to review.

8.
+/*-
+ * slotsync.c
+ *PostgreSQL worker for synchronizing slots to a standby from primary
+ *
+ * Copyright (c) 2016-2018, PostgreSQL Global Development Group
+ *

The copyright notice is out-of-date.

9. Why does synchronize_one_slot() compare
MyReplicationSlot->data.restart_lsn with the value of
confirmed_flush_lsn passed to it? Also, why does it do so only for new
slots and not for existing slots?

10. Can we somehow test if the restart_lsn is advanced properly after
sync? I think it is important to ensure that because otherwise after
standby's promotion, the subscriber can start syncing from the wrong
position.

-- 
With Regards,
Amit Kapila.




Re: Pluggable toaster

2023-06-16 Thread Nikita Malakhov
Hi,

It seems that I missed the thread mentioned above. I strongly disagree
with such a statement; Pluggable TOAST could not be part of the Compression
Dictionaries thread, because the TOAST improvement is a more general subject:
it involves much deeper and trickier changes in the core, and it is also much
more promising in terms of performance and storage improvements.

We already have a lot of changes in Pluggable TOAST that were not committed
to the main Git branch of this thread, so it seems that I have to merge them
and reopen it.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Do we want a hashset type?

2023-06-16 Thread Joel Jacobson
New patch attached:

Add customizable params to int4hashset() and collision count function

This commit enhances int4hashset() by introducing adjustable capacity,
load, and growth factors, providing flexibility for performance optimization.

Also added is a new function, hashset_collisions(), to report collision
counts, aiding in performance tuning.

Aggregate functions are renamed to hashset_agg() for consistency with
array_agg() and range_agg().

A new test file, test/sql/benchmark.sql, is added for evaluating the
performance of hash functions. It's not run automatically by
make installcheck.

The adjustable parameters and the naive hash function are useful for testing
and performance comparison. However, to keep things simple and streamlined
for users, these features are likely to be removed in the final release,
emphasizing the use of well-optimized default settings.

SQL-function indentation is also adjusted to align with the PostgreSQL
source repo, improving readability.

In the benchmark results below, it was a bit surprising the naive hash
function had no collisions, but that only held true when the input
elements were sequential integers. When tested with random integers,
all three hash functions caused collisions.

The timing results are not statistically significant; the purpose is just to
give an idea of the execution times.

*** Elements in sequence 1..10
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:23: NOTICE:  hashset_count: 10
psql:test/sql/benchmark.sql:23: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:23: NOTICE:  hashset_collisions: 31195
DO
Time: 1342.564 ms (00:01.343)
- Testing Murmurhash32
psql:test/sql/benchmark.sql:40: NOTICE:  hashset_count: 10
psql:test/sql/benchmark.sql:40: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:40: NOTICE:  hashset_collisions: 30879
DO
Time: 1297.823 ms (00:01.298)
- Testing naive hash function
psql:test/sql/benchmark.sql:57: NOTICE:  hashset_count: 10
psql:test/sql/benchmark.sql:57: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:57: NOTICE:  hashset_collisions: 0
DO
Time: 1400.936 ms (00:01.401)
*** Testing 10 random ints
setseed
-

(1 row)

Time: 3.591 ms
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:77: NOTICE:  hashset_count: 10
psql:test/sql/benchmark.sql:77: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:77: NOTICE:  hashset_collisions: 30919
DO
Time: 1415.497 ms (00:01.415)
setseed
-

(1 row)

Time: 1.282 ms
- Testing Murmurhash32
psql:test/sql/benchmark.sql:95: NOTICE:  hashset_count: 10
psql:test/sql/benchmark.sql:95: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:95: NOTICE:  hashset_collisions: 30812
DO
Time: 2079.202 ms (00:02.079)
setseed
-

(1 row)

Time: 0.122 ms
- Testing naive hash function
psql:test/sql/benchmark.sql:113: NOTICE:  hashset_count: 10
psql:test/sql/benchmark.sql:113: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:113: NOTICE:  hashset_collisions: 30822
DO
Time: 1613.965 ms (00:01.614)

/Joel

hashset-0.0.1-184a18a.patch
Description: Binary data


Re: Support to define custom wait events for extensions

2023-06-16 Thread Masahiro Ikeda

On 2023-06-16 16:46, Michael Paquier wrote:

On Fri, Jun 16, 2023 at 11:14:05AM +0900, Masahiro Ikeda wrote:

I tried to query on pg_stat_activity to check the background worker
invoked by pg_prewarm. But, I found that pg_stat_activity doesn't show
it although I may be missing something...

So, I tried to implement TAP tests. But I have a problem with it.
I couldn't find the way to check the status of another backend
while the another backend wait with custom wait events.


Hmm.  Right.  It seems to me that BGWORKER_BACKEND_DATABASE_CONNECTION
is required in this case, with BackgroundWorkerInitializeConnection()
to connect to a database (or not, like the logical replication
launcher if only access to shared catalogs is wanted).

I have missed that the leader process of pg_prewarm does not use that,
because it has no need to connect to a database, but its workers do.
So it is not going to show up in pg_stat_activity.


Yes. Thanks to your advice, I understood that
BGWORKER_BACKEND_DATABASE_CONNECTION is the reason.

I could make a TAP test that invokes a background worker waiting forever
and checks its custom wait event in pg_stat_activity. So, I'll post patches
including test code next week.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: Do we want a hashset type?

2023-06-16 Thread jian he
Similar to (int[] || int4) and (int4 || int[]):
should we expect ('{1,2}'::int4hashset || 3) == (3 ||
'{1,2}'::int4hashset) == (select hashset_add('{1,2}'::int4hashset,3))?

The following is the general idea of how to make it work, based on
similar code:
CREATE OPERATOR || (
leftarg = int4hashset,
rightarg = int4,
function = int4hashset_add,
commutator = ||
);

CREATE OR REPLACE FUNCTION int4_add_int4hashset(int4, int4hashset)
RETURNS int4hashset
LANGUAGE sql
IMMUTABLE PARALLEL SAFE STRICT COST 1
RETURN $2 || $1;

CREATE OPERATOR || (
leftarg = int4,
rightarg = int4hashset,
function = int4_add_int4hashset,
commutator = ||
);
While creating an operator, I am not sure how to specify the
NEGATOR, RESTRICT, and JOIN clauses.
-
Also, I think the following query should return only one row, but currently it
doesn't:
select hashset_cmp('{1,2}','{2,1}')
union
select hashset_cmp('{1,2}','{1,2,1}')
union
select hashset_cmp('{1,2}','{1,2}');
--
Similar to elem_contained_by_range and range_contains_elem, should we also
consider the operators <@ and @>?

CREATE OR REPLACE FUNCTION elem_contained_by_hashset(int4, int4hashset)
RETURNS bool
LANGUAGE sql
IMMUTABLE PARALLEL SAFE STRICT COST 1
RETURN hashset_contains ($2,$1);

Is the integer contained in the int4hashset?
integer  <@ int4hashset → boolean
1 <@ int4hashset'{1,7}' → t

CREATE OPERATOR <@ (
leftarg = integer,
rightarg = int4hashset,
function = elem_contained_by_hashset
);

int4hashset @> integer → boolean
Does the int4hashset contain the element?
int4hashset'{1,7}' @> 1 → t

CREATE OPERATOR @> (
leftarg = int4hashset,
rightarg = integer,
function = hashset_contains
);
---


Add some more corruption error codes to relcache

2023-06-16 Thread Andrey M. Borodin
Hi hackers,

Relcache errors detect catalog corruption from time to time. For example, 
recently I observed the following:
1. The filesystem or NVMe disk zeroed out the leading 160kB of a catalog index. 
This type of corruption passes through data_checksums.
2. RelationBuildTupleDesc() was failing with "catalog is missing 1 attribute(s) 
for relid 2662".
3. We monitor corruption error codes and alert on-call DBAs when we see one, but 
the message is not marked as XX001 or XX002. It's XX000, which happens from time 
to time for less critical reasons than data corruption.
4. High-availability automation switched the primary to another host, and the 
other monitoring checks did not fire either.

This particular case is not very illustrative; in fact we had index corruption 
that looked like catalog corruption.
But it still looks to me that catalog inconsistencies (like relnatts != the number 
of pg_attribute rows) could be marked with ERRCODE_DATA_CORRUPTED.
In my experience this particular error code has proved to be a good indicator for 
early corruption detection.
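
As a rough illustration of the kind of change meant here (the attached patch
has the real hunks; function and variable names below are illustrative), the
bare elog() could become an ereport() carrying the corruption errcode:

#include "postgres.h"

#include "utils/rel.h"

static void
report_missing_attributes(Relation relation, int missing)
{
	ereport(ERROR,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg("catalog is missing %d attribute(s) for relid %u",
					missing, RelationGetRelid(relation))));
}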

What do you think?
What other subsystems can be improved in the same manner?

Best regards, Andrey Borodin.




v1-0001-Add-corruption-error-codes-to-relcache-entries.patch
Description: Binary data


Re: Pluggable toaster

2023-06-16 Thread Aleksander Alekseev
Hi Nikita,

> We already have a lot of changes in Pluggable TOAST that were not committed
> to the main GIT branch of this thread, so it seems that I have to merge them 
> and
> reopen it.

Pretty sure that reopening an already rejected patch that is competing
with compression dictionaries (which the rest of us are currently
focusing on) will achieve anything. Consider joining the compression
dictionaries effort instead [1]. During the discussion with the
community it ended up being a TOAST improvement after all. So we could
use your expertise in this area.

[1]: https://commitfest.postgresql.org/43/3626/

-- 
Best regards,
Aleksander Alekseev




Re: RFC: logical publication via inheritance root?

2023-06-16 Thread Amit Kapila
On Sat, Apr 1, 2023 at 5:06 AM Jacob Champion  wrote:
>
> On Fri, Mar 31, 2023 at 3:17 PM Peter Smith  wrote:
>
> > Outside the scope of special TimescaleDB tables and the speculated
> > pg_partman old-style table migration, will this proposed new feature
> > have any other application? In other words, do you know if this
> > proposal will be of any benefit to the *normal* users who just have
> > native PostgreSQL inherited tables they want to replicate?
>
> I think it comes down to why an inheritance scheme was used. If it's
> because you want to group rows into named, queryable subsets (e.g. the
> "cities/capitals" example in the docs [1]), I don't think this has any
> utility, because I assume you'd want to replicate your subsets as-is.
>

I also think so, and your idea to have a function like
pg_set_logical_root() seems to make the inheritance hierarchy behave
as a declarative partitioning scheme for the purpose of logical
replication.

> But if it's because you've implemented a partitioning scheme of your
> own (the docs still list reasons you might want to [2], even today),
> and all you ever really do is interact with the root table, I think
> this feature will give you some of the same benefits that
> publish_via_partition_root gives native partition users. We're very
> much in that boat, but I don't know how many others are.
>

I agree that there may still be cases as pointed out by you where
people want to use inheritance as a mechanism for partitioning but I
feel those would still be in the minority. Personally, I am not very
excited about this idea.

-- 
With Regards,
Amit Kapila.




Re: [DOC] Update ALTER SUBSCRIPTION documentation v3

2023-06-16 Thread Peter Eisentraut

On 15.06.23 04:49, Amit Kapila wrote:

I wanted to backpatch the following change which provides somewhat
accurate information about what a user needs to do when it faces an
error.
To proceed in
-   this situation, disassociate the subscription from the replication slot by
-   executing ALTER SUBSCRIPTION ... SET (slot_name = NONE).
+   this situation, first DISABLE the subscription, and then
+   disassociate it from the replication slot by executing
+   ALTER SUBSCRIPTION ... SET (slot_name = NONE).

Now, along with this change, there is a change in errhint as well
which I am not sure about whether to backpatch or not. I think we have
the following options (a) commit both doc and code change in HEAD (b)
commit both doc and code change in v17 when the next version branch
opens (c) backpatch the doc change and commit the code change in HEAD
only (d) backpatch the doc change and commit the code change in v17
(e) backpatch both the doc and code change.


Reading the thread again now, I think this is essentially a bug fix, so 
I don't mind backpatching it.


I wish the errhint would show the actual command to disable the 
subscription.  It already shows the command to detach the replication 
slot, so it would only be consistent to also show the other command.





Re: [17] collation provider "builtin"

2023-06-16 Thread Peter Eisentraut

On 15.06.23 00:55, Jeff Davis wrote:

The locale "C" (and equivalently, "POSIX") is not really a libc locale;
it's implemented internally with memcmp for collation and
pg_ascii_tolower, etc., for ctype.

The attached patch implements a new collation provider, "builtin",
which only supports "C" and "POSIX". It does not change the initdb
default provider, so it must be requested explicitly. The user will be
guaranteed that collations with provider "builtin" will never change
semantics; therefore they need no version and indexes are not at risk
of corruption. See previous discussion[1].


What happens if after this patch you continue to specify provider=libc 
and locale=C?  Do you then get the "slow" path?


Should there be some logic in pg_dump to change the provider if locale=C?

What is the transition plan?





Parent/child context relation in pg_get_backend_memory_contexts()

2023-06-16 Thread Melih Mutlu
Hi hackers,

pg_get_backend_memory_contexts() (and pg_backend_memory_contexts view)
does not display parent/child relation between contexts reliably.
Current version of this function only shows the name of parent context
for each context. The issue here is that it's not guaranteed that
context names are unique. So, this makes it difficult to find the
correct parent of a context.

How can knowing the correct parent context be useful? One important
use case is that it would allow us to sum up all the space used by
a particular context and all the subcontexts which stem from that
context.
Calculating this sum is helpful since the (total/used/free)_bytes values
currently returned by this function do not include child contexts. For
this reason, only looking at the related row in pg_backend_memory_contexts
does not tell us how many bytes that context is actually taking.

The simplest approach to solve this could be to just add two new fields,
id and parent_id, to pg_get_backend_memory_contexts() and ensure that
each context has a unique id. This allows us to build a correct
memory context "tree".

Please see the attached patch, which introduces those two fields.
I couldn't find an existing unique identifier to use. The patch simply
assigns an id during the execution of
pg_get_backend_memory_contexts() and does not store those ids
anywhere. This means that the ids may be different in each call.

With this change, here's a query to find how much space is used by each
context, including its children:

> WITH RECURSIVE cte AS (
> SELECT id, total_bytes, id as root, name as root_name
> FROM memory_contexts
> UNION ALL
> SELECT r.id, r.total_bytes, cte.root, cte.root_name
> FROM memory_contexts r
> INNER JOIN cte ON r.parent_id = cte.id
> ),
> memory_contexts AS (
> SELECT * FROM pg_backend_memory_contexts
> )
> SELECT root as id, root_name as name, sum(total_bytes)
> FROM cte
> GROUP BY root, root_name
> ORDER BY sum DESC;


You should see that TopMemoryContext is the one with the highest allocated
space, since all other contexts are simply created under
TopMemoryContext.


Also, even though a correct link between parent and child contexts makes
it possible to find out many other things as well by only writing
SQL queries, that might require complex recursive queries similar to the
one above for total_bytes including children. Maybe we can also
consider adding such frequently used and/or useful information as new
fields in pg_get_backend_memory_contexts() too.


I appreciate any comment/feedback on this.

Thanks,
-- 
Melih Mutlu
Microsoft


0001-Adding-id-parent_id-into-pg_backend_memory_contexts.patch
Description: Binary data


[BUG] recovery of prepared transactions during promotion can fail

2023-06-16 Thread Julian Markwort
Hey everyone,

I've discovered a serious bug that leads to a server crash upon promoting an 
instance that crashed previously and did
recovery in standby mode.

The bug is present in PostgreSQL versions 13 and 14 (and in earlier versions, 
though it doesn't manifest itself so
catastrophically).
The circumstances to trigger the bug are as follows:
- postgresql is configured for hot_standby, archiving, and prepared transactions
- prepare a transaction
- crash postgresql
- create standby.signal file
- start postgresql, wait for recovery to finish
- promote

The promotion will fail with a FATAL error, stating that "requested WAL segment 
.* has already been removed".
The FATAL error causes the startup process to exit, so postmaster shuts down 
again.

Here's an exemplary log output, maybe this helps people to find this issue when 
they search for it online:

LOG:  consistent recovery state reached at 0/15D8AB0
LOG:  database system is ready to accept read only connections
LOG:  received promote request
LOG:  redo done at 0/15D89B8
LOG:  last completed transaction was at log time 2023-06-16 13:09:53.71118+02
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
FATAL:  requested WAL segment pg_wal/00010001 has already been 
removed
LOG:  startup process (PID 1650358) exited with exit code 1
LOG:  terminating any other active server processes
LOG:  database system is shut down


The cause of this failure is an oversight (rather obvious in hindsight):
The renaming of the WAL file (that was last written to before the crash 
happened) to .partial is done *before* PostgreSQL
might have to read this very file to recover prepared transactions from it.
The relevant function calls here are durable_rename() and 
RecoverPreparedTransactions() in xlog.c .

Note that it is important that the PREPARE entry is in the WAL file that 
PostgreSQL was writing to prior to the initial
crash.
This has happened repeatedly in production already with a customer that uses 
prepared transactions quite frequently.
I assume that this has happened for others too, but the circumstances of the 
crash and the cause are very dubious, and
troubleshooting it is pretty difficult.


This behaviour has - apparently unintentionally - been fixed in PG 15 and 
upwards (see commit 811051c ), as part of a
general restructure and reorganization of this portion of xlog.c (see commit 
6df1543 ).

Furthermore, it seems this behaviour does not appear in PG 12 and older, due to 
another possible bug:
In PG 13 and newer, the XLogReaderState is reset in XLogBeginRead() before 
reading WAL in XlogReadTwoPhaseData() in
twophase.c .
In the older releases (PG <= 12), this reset is not done, so the requested LSN 
containing the prepared transaction can
(by happy coincidence) be read from in-memory buffers, and PostgreSQL 
consequently manages to come up just fine (as the
WAL has already been read into buffers prior to the .partial rename).
If the older releases were also to properly reset the XLogReaderState, they 
would also fail to find the LSN on disk, and
hence PostgreSQL would crash again.

I've attached patches for PG 14 and PG 13 that mimic the change in PG15 (commit 
811051c ) and reorder the crucial events,
placing the recovery of prepared transactions *before* renaming the file.
I've also attached recovery test scripts for PG >= 12 and PG <= 11 that can be 
used to verify that promote after recovery
with prepared transactions works.

A note for myself in the future and whomever may find it useful:
The test can be copied to src/test/recovery/t/ and selectively run (after 
you've ./configure'd for TAP testing and
compiled everything) from within the src/test/recovery directory using 
something like:
make check PROVE_TESTS='t/PG_geq_12_promote_prepare_xact.pl'


My humble opinion is that this fix should be backpatched to PG 14 and PG 13.
It's debatable whether the fix needs to be brought back to 12 and older also, 
as those do not exhibit this issue, but the
order of renaming is still wrong.
I'm not sure if there could be cases where the in-memory buffers of the 
walreader are too small to cover a whole WAL
file.
There could also be other issues from operations that require reading WAL that 
happen after the .partial rename; I
haven't checked in depth what else happens in the affected code path.
Please let me know if you think this should also be fixed in PG 12 and earlier, 
so I can produce the patches for those
versions as well.


Kind regards
Julian

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a05a82119e..e0d9b89d78 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7858,6 +7858,60 @@ StartupXLOG(void)
 			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
 	}
 
+	/*
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	InRecovery = false;
+
+	/* st

Re: Pluggable toaster

2023-06-16 Thread Aleksander Alekseev
Hi,

> Pretty sure that reopening an already rejected patch that is competing
> with compression dictionaries (which the rest of us are currently
> focusing on) will achieve anything.

Ooops, I didn't mean to be sarcastic:

s/will achieve/will not achieve/

My apologies.

> Consider joining the compression
> dictionaries effort instead [1]. During the discussion with the
> community it ended up being a TOAST improvement after all. So we could
> use your expertise in this area.
>
> [1]: https://commitfest.postgresql.org/43/3626/

-- 
Best regards,
Aleksander Alekseev




Re: Order changes in PG16 since ICU introduction

2023-06-16 Thread Peter Eisentraut

On 14.06.23 23:24, Jeff Davis wrote:

On Mon, 2023-06-12 at 23:04 +0200, Peter Eisentraut wrote:

Patch 0003:

Makes LOCALE apply to all providers. The overall feel after this
patch
is that "locale" now means the collation locale, and
LC_COLLATE/LC_CTYPE are for the server environment. When using
libc,
LC_COLLATE and LC_CTYPE still work as they did before, but their
relationship to database collation feels more like a special case
of
the libc provider. I believe most people favor this patch and I
haven't
seen recent objections.


This seems reasonable.


Attached a clean patch for this.

It seems to have widespread agreement so I plan to commit to v16 soon.

To clarify, this affects both initdb and CREATE DATABASE.


This looks good to me.

Attached is a small fixup patch with some documentation tweaks and 
some simplified test code (it also includes pgperltidy).
From 0cd2154f364999091aba52136a139df75f58d1b7 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 16 Jun 2023 16:46:36 +0200
Subject: [PATCH] fixup! CREATE DATABASE: make LOCALE apply to all collation
 providers.

---
 doc/src/sgml/ref/create_database.sgml | 12 ++--
 src/test/icu/t/010_database.pl| 23 +--
 2 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/ref/create_database.sgml 
b/doc/src/sgml/ref/create_database.sgml
index dab05950ed..b2c8aef1ad 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -147,9 +147,9 @@ Parameters

 Sets the default collation order and character classification in the
 new database.  Collation affects the sort order applied to strings,
-e.g., in queries with ORDER BY, as well as the order used in indexes
+e.g., in queries with ORDER BY, as well as the 
order used in indexes
 on text columns.  Character classification affects the categorization
-of characters, e.g., lower, - upper and digit.  Also sets the
+of characters, e.g., lower, upper, and digit.  Also sets the
 associated aspects of the operating system environment,
 LC_COLLATE and LC_CTYPE.  The
 default is the same setting as the template database.  See Parameters

 Sets LC_COLLATE in the database server's operating
 system environment.  The default is the setting of  if specified; otherwise the same
+linkend="create-database-locale"/> if specified, otherwise the same
 setting as the template database.  See below for additional
 restrictions.

@@ -198,7 +198,7 @@ Parameters

 Sets LC_CTYPE in the database server's operating
 system environment.  The default is the setting of  if specified; otherwise the same
+linkend="create-database-locale"/> if specified, otherwise the same
 setting as the template database.  See below for additional
 restrictions.

@@ -218,8 +218,8 @@ Parameters
 Specifies the ICU locale (see ) for the database default
 collation order and character classification, overriding the setting
-.  The  must be ICU.  The default
+.  The locale provider must 
be ICU.  The default
 is the setting of  if
 specified; otherwise the same setting as the template database.

diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index b417b1a409..cbe5467f3c 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -52,27 +52,30 @@
 
 
 # Test that LOCALE='C' works for ICU
-my $ret1 = $node1->psql('postgres',
-   q{CREATE DATABASE dbicu1 LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE 
template0 ENCODING UTF8}
-);
-is($ret1, 0,
+is( $node1->psql(
+   'postgres',
+   q{CREATE DATABASE dbicu1 LOCALE_PROVIDER icu LOCALE 'C' 
TEMPLATE template0 ENCODING UTF8}
+   ),
+   0,
"C locale works for ICU");
 
 # Test that LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE
 # are specified
-my $ret2 = $node1->psql('postgres',
-   q{CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 
'@colStrength=primary'
+is( $node1->psql(
+   'postgres',
+   q{CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 
'@colStrength=primary'
   LC_COLLATE='C' LC_CTYPE='C' TEMPLATE template0 ENCODING UTF8}
-);
-is($ret2, 0,
+   ),
+   0,
"LOCALE works for ICU locales if LC_COLLATE and LC_CTYPE are 
specified");
 
 # Test that ICU-specific LOCALE without LC_COLLATE and LC_CTYPE must
 # be specified with ICU_LOCALE
-my ($ret3, $stdout, $stderr) = $node1->psql('postgres',
+my ($ret, $stdout, $stderr) = $node1->psql(
+   'postgres',
q{CREATE DATABASE dbicu3 LOCALE_PROVIDER icu LOCALE 
'@colStrength=primary'
   TEMPLATE template0 ENCODING UTF8});
-isnt($ret3, 0,
+isnt($ret, 0,
"ICU-specific locale must be specified with ICU_LOCALE: exit code not 
0");
 like(
$stderr,
-- 

Re: Deleting prepared statements from libpq.

2023-06-16 Thread Jelte Fennema
On Fri, 16 Jun 2023 at 16:26, Craig Ringer  wrote:
> Nobody's implemented it.
>
> A patch to add PQclosePrepared and PQsendClosePrepared would be welcome. At 
> least, I think so...

This might be a pretty old thread, but I just took it upon myself
to implement these functions (well, I mostly copied the
PQsendDescribe-related code and did s/describe/close). I haven't
tested this code yet, but I'm pretty sure it should just work (it
compiles at least).

The main reason I'm interested in this is because we're actively
working on implementing named prepared statement support for PgBouncer
in transaction pooling mode. It works with lots of client libraries
already. But sadly it doesn't work with psycopg at the moment, or at
least the closing part does not. And the reason is that psycopg closes
statements using a DEALLOCATE query instead of the Close protocol
message, because libpq does not support sending the Close protocol
message.

This is not just a problem for PgBouncer's implementation. As far as I
can tell, the same is true for the Odyssey connection pooler
(which implemented named prepared statement support first).
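
For completeness, a hedged sketch of how the new function might be used,
assuming it mirrors the existing PQdescribePrepared() API (the names come
from the patch, not from a released libpq):

#include <stdio.h>
#include <libpq-fe.h>

static void
close_prepared(PGconn *conn, const char *stmt_name)
{
	/*
	 * Sends a protocol-level Close message for the named statement instead
	 * of issuing "DEALLOCATE <name>", which is what a pooler in transaction
	 * pooling mode can track reliably.
	 */
	PGresult   *res = PQclosePrepared(conn, stmt_name);

	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		fprintf(stderr, "closing \"%s\" failed: %s",
				stmt_name, PQerrorMessage(conn));
	PQclear(res);
}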


v1-0001-Support-sending-Close-messages-from-libpq.patch
Description: Binary data


Re: Incorrect estimation of HashJoin rows resulting from inaccurate small table statistics

2023-06-16 Thread Tomas Vondra



On 6/16/23 11:25, Quan Zongliang wrote:
> 
> We have a small table with only 23 rows and 21 values.
> 
> The resulting MCV and histogram is as follows
> stanumbers1 | {0.08695652,0.08695652}
> stavalues1  | {v1,v2}
> stavalues2  |
> {v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}
> 
> An incorrect number of rows was estimated when HashJoin was done with
> another large table (about 2 million rows).
> 
> Hash Join  (cost=1.52..92414.61 rows=2035023 width=0) (actual
> time=1.943..1528.983 rows=3902 loops=1)
> 

That's interesting. I wonder how the estimate gets this bad simply
by skipping entries for values with a single row in the sample, which means
we know the per-value selectivity pretty well.

I guess the explanation has to be something strange happening when
estimating the join condition selectivity, where we combine MCVs from
both sides of the join (which has to be happening here, otherwise it
would not matter what gets to the MCV).

It'd be interesting to know what's in the other MCV, and what are the
other statistics for the attributes (ndistinct etc.).

Or even better, a reproducer SQL script that builds two tables and then
joins them.

> The reason is that the MCV of the small table excludes values with rows
> of 1. Put them in the MCV in the statistics to get the correct result.
> 
> Using the conservative samplerows <= attstattarget doesn't completely
> solve this problem. It can solve this case.
> 
> After modification we get statistics without histogram:
> stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
> stavalues1  | {v,v2, ... }
> 
> And we have the right estimates:
> Hash Join  (cost=1.52..72100.69 rows=3631 width=0) (actual
> time=1.447..1268.385 rows=3902 loops=1)
> 

I'm not against building a "complete" MCV, but I guess the case where
(samplerows <= num_mcv) is pretty rare. Why shouldn't we make the MCV
complete whenever we decide (ndistinct <= num_mcv)?

That would need to happen later, because we don't have the ndistinct
estimate yet at this point - we'd have to do the loop a bit later (or
likely twice).

FWIW the patch breaks the calculation of nmultiple (and thus likely the
ndistinct estimate).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Do we want a hashset type?

2023-06-16 Thread Joel Jacobson
On Fri, Jun 16, 2023, at 13:57, jian he wrote:
> similar to (int[] || int4) and (int4 || int[])
> should we expect ('{1,2}'::int4hashset || 3) == (3 ||  
> '{1,2}'::int4hashset) == (select hashset_add('{1,2}'::int4hashset,3)); 
> *?*

Good idea, makes sense to support it.
Implemented in attached patch.

> CREATE OPERATOR || (
> leftarg = int4,
> rightarg = int4hashset,
> function = int4_add_int4hashset,
> commutator = ||
> );
> while creating an operator. I am not sure how to specify 
> NEGATOR,RESTRICT,JOIN clause.

I don't think we need those for this operator, but I might be wrong.

> -
> also. I think the following query should return one row only? but now 
> it doesn't.
> select hashset_cmp('{1,2}','{2,1}')
> union 
> select hashset_cmp('{1,2}','{1,2,1}')
> union 
> select hashset_cmp('{1,2}','{1,2}');

Good point.

I realise int4hashset_hash() is broken,
since two int4hashsets that are considered equal
can by coincidence get different hashes:

SELECT '{1,2}'::int4hashset = '{2,1}'::int4hashset;
 ?column?
--
 t
(1 row)

SELECT hashset_hash('{1,2}'::int4hashset);
 hashset_hash
--
990882385
(1 row)

SELECT hashset_hash('{2,1}'::int4hashset);
 hashset_hash
--
996377797
(1 row)

Do we have any ideas on how to fix this without sacrificing performance?

We of course want to avoid having to sort the hashsets,
which is the naive solution.

To understand why this is happening, consider this example:

SELECT '{1,2}'::int4hashset;
 int4hashset
-
 {1,2}
(1 row)

SELECT '{2,1}'::int4hashset;
 int4hashset
-
 {2,1}
(1 row)

If the hash of `1` and the hash of `2` modulo the capacity result in the same value,
both will be attempted at the same position,
and since the input text is parsed left-to-right, in the first case `1` will win
the first position, and `2` will get a collision and try the next position.

In the second case, the opposite happens.

Since we take the hash modulo the capacity, the position depends on the capacity,
which is why the output string can be different for the same input.

SELECT int4hashset() || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=1) || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=2) || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=3) || 1 || 2 || 3;
 {3,2,1}

SELECT int4hashset(capacity:=4) || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=5) || 1 || 2 || 3;
 {1,2,3}

SELECT int4hashset(capacity:=6) || 1 || 2 || 3;
 {1,3,2}
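
One possible direction, without sorting, would be to combine the per-element
hashes with a commutative operation such as unsigned addition, so the result
no longer depends on slot order. A rough sketch (hypothetical names, not part
of the attached patch):

#include <stdint.h>
#include <stdbool.h>

typedef struct toy_int4set
{
	int			capacity;
	bool	   *used;			/* slot occupied? */
	int32_t    *values;			/* slot contents */
} toy_int4set;

static uint32_t
toy_hash_int4(int32_t v)
{
	/* stand-in for the per-element hash function */
	uint32_t	h = (uint32_t) v;

	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	return h;
}

static uint32_t
toy_set_hash(const toy_int4set *set)
{
	uint64_t	acc = 0;

	/*
	 * Unsigned addition is commutative and associative, so the result does
	 * not depend on which slot an element happened to land in.
	 */
	for (int i = 0; i < set->capacity; i++)
		if (set->used[i])
			acc += toy_hash_int4(set->values[i]);

	/* fold the 64-bit accumulator down to 32 bits */
	return (uint32_t) (acc ^ (acc >> 32));
}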


> --
> similar to elem_contained_by_range, range_contains_elem. we should 
> already consider the operator *<@* and @*>? *

That could perhaps be nice.
Apart from possible syntax convenience,
are there any other benefits over just using the function
hashset_contains(int4hashset, integer) directly?

/Joel

hashset-0.0.1-d83bdef.patch
Description: Binary data


Re: Support to define custom wait events for extensions

2023-06-16 Thread Tristan Partin
I will take a look at your V2 when it is ready! I will also pass along
that this is wanted by Neon customers :).

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Order changes in PG16 since ICU introduction

2023-06-16 Thread Jeff Davis
On Fri, 2023-06-16 at 16:50 +0200, Peter Eisentraut wrote:
> This looks good to me.
> 
> Attached is small fixup patch with some documentation tweaks and 
> simplifying some test code (also includes pgperltidy).

Thank you. Committed with your fixups.

Regards,
Jeff Davis





Re: Support to define custom wait events for extensions

2023-06-16 Thread Masahiro Ikeda
On 2023/06/17 1:16, Tristan Partin wrote:
> I will take a look at your V2 when it is ready! I will also pass along
> that this is wanted by Neon customers :).
Thanks!

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION




Re: add non-option reordering to in-tree getopt_long

2023-06-16 Thread Nathan Bossart
On Fri, Jun 16, 2023 at 04:51:38PM +0900, Kyotaro Horiguchi wrote:
> (Honestly, the rearrangement code looks somewhat tricky to grasp..)

Yeah, I think there's still some room for improvement here.

> It doesn't work properly if '--' is involved.
> 
> For example, consider the following options (even though they don't
> work for the command).
> 
> psql -t -T hoho -a hoge -- -1 hage hige huge
> 
> After the function returns -1, the argv array looks like this. The
> "[..]" indicates the position of optind.
> 
> psql -t -T hoho -a -- [-1] hage hige huge hoge
> 
> This is clearly incorrect since "hoge" gets moved to the end.  The
> rearrangement logic needs to take into account the '--'.  For your
> information, the glibc version yields the following result for the
> same options.
> 
> psql -t -T hoho -a -- [hoge] -1 hage hige huge

Ah, so it effectively retains the non-option ordering, even if there is a
'--'.  I think this behavior is worth keeping.  I attempted to fix this in
the attached patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 51e1033374c464ed90484f641d31b0ab705672f2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 9 Jun 2023 15:51:58 -0700
Subject: [PATCH v4 1/1] Teach in-tree getopt_long() to move non-options to the
 end of argv.

---
 src/bin/scripts/t/040_createuser.pl | 10 +++---
 src/port/getopt_long.c  | 50 +
 2 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/src/bin/scripts/t/040_createuser.pl b/src/bin/scripts/t/040_createuser.pl
index 9ca282181d..ba40ab11a3 100644
--- a/src/bin/scripts/t/040_createuser.pl
+++ b/src/bin/scripts/t/040_createuser.pl
@@ -42,9 +42,9 @@ $node->issues_sql_like(
 	'add a role as a member with admin option of the newly created role');
 $node->issues_sql_like(
 	[
-		'createuser', '-m',
-		'regress_user3', '-m',
-		'regress user #4', 'REGRESS_USER5'
+		'createuser', 'REGRESS_USER5',
+		'-m', 'regress_user3',
+		'-m', 'regress user #4'
 	],
 	qr/statement: CREATE ROLE "REGRESS_USER5" NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS ROLE regress_user3,"regress user #4";/,
 	'add a role as a member of the newly created role');
@@ -73,11 +73,13 @@ $node->issues_sql_like(
 	qr/statement: CREATE ROLE regress_user11 NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS IN ROLE regress_user1;/,
 	'--role');
 $node->issues_sql_like(
-	[ 'createuser', '--member-of', 'regress_user1', 'regress_user12' ],
+	[ 'createuser', 'regress_user12', '--member-of', 'regress_user1' ],
 	qr/statement: CREATE ROLE regress_user12 NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT LOGIN NOREPLICATION NOBYPASSRLS IN ROLE regress_user1;/,
 	'--member-of');
 
 $node->command_fails([ 'createuser', 'regress_user1' ],
 	'fails if role already exists');
+$node->command_fails([ 'createuser', 'regress_user1', '-m', 'regress_user2', 'regress_user3' ],
+	'fails for too many non-options');
 
 done_testing();
diff --git a/src/port/getopt_long.c b/src/port/getopt_long.c
index c989276988..a4bdc6c8f0 100644
--- a/src/port/getopt_long.c
+++ b/src/port/getopt_long.c
@@ -50,8 +50,9 @@
  * This implementation does not use optreset.  Instead, we guarantee that
  * it can be restarted on a new argv array after a previous call returned -1,
  * if the caller resets optind to 1 before the first call of the new series.
- * (Internally, this means we must be sure to reset "place" to EMSG before
- * returning -1.)
+ *
+ * Note that this routine reorders the pointers in argv (despite the const
+ * qualifier) so that all non-options will be at the end when -1 is returned.
  */
 int
 getopt_long(int argc, char *const argv[],
@@ -60,38 +61,59 @@ getopt_long(int argc, char *const argv[],
 {
 	static char *place = EMSG;	/* option letter processing */
 	char	   *oli;			/* option letter list index */
+	static int	nonopt_start = -1;
+	static bool force_nonopt = false;
 
 	if (!*place)
 	{			/* update scanning pointer */
+retry:
 		if (optind >= argc)
 		{
 			place = EMSG;
+			nonopt_start = -1;
+			force_nonopt = false;
 			return -1;
 		}
 
 		place = argv[optind];
 
-		if (place[0] != '-')
+		/* non-options, including '-' and anything after '--' */
+		if (place[0] != '-' || place[1] == '\0' || force_nonopt)
 		{
-			place = EMSG;
-			return -1;
-		}
+			char	  **args = (char **) argv;
 
-		place++;
+			/*
+			 * If only non-options remain, return -1.  Else, move the
+			 * non-option to the end and try again.
+			 */
+			if (optind == nonopt_start)
+			{
+place = EMSG;
+nonopt_start = -1;
+force_nonopt = false;
+return -1;
+			}
 
-		if (!*place)
-		{
-			/* treat "-" as not being an option */
-			place = EMSG;
-			return -1;
+			for (int i = optind; i < argc - 1; i++)
+args[i] = args[i + 1];
+			args[argc - 1] = place;
+
+			if (nonopt_start == -1)
+nonopt_start = argc - 1;
+			else
+nonopt_start--;
+
+			goto retry;
 		}
 
+		place++;
+
 		if (place[0] == '-' 

Re: [PATCH] Missing dep on Catalog.pm in meson rules

2023-06-16 Thread Andrew Dunstan


On 2023-06-14 We 15:32, Andres Freund wrote:

Hi,

On 2023-06-09 11:43:54 -0700, Andres Freund wrote:

On 2023-06-02 10:13:44 -0500, Tristan Partin wrote:

On Fri Jun 2, 2023 at 8:47 AM CDT, Andres Freund wrote:

Hi,

On 2023-06-02 08:10:43 -0500, Tristan Partin wrote:

I wonder if we instead could just make perl output the files it loads and
handle dependencies automatically that way? But that's more work, so it's
probably the right thing to go for the manual path for now.

I am not familiar enough with Perl (at all haha) to know if that is
possible. I don't know exactly what these Perl files do, but perhaps it
might make sense to have some global lookup table that is setup near the
beginning of the script.

It'd be nice to have something more general - there are other perl modules we
load, e.g.
./src/backend/catalog/Catalog.pm
./src/backend/utils/mb/Unicode/convutils.pm
./src/tools/PerfectHash.pm



perl_files = {
   'Catalog.pm': files('path/to/Catalog.pm'),
   ...
}

I think you got it, but just to make sure: I was thinking of generating a
depfile from within perl. Something like what you propose doesn't quite seem
like a sufficient improvement.

Whatever I am proposing is definitely subpar to generating a depfile. So
if that can be done, that is the best option!

I looked for a bit, but couldn't find an easy way to do so. I would still like
to pursue going towards dep files for the perl scripts, even if that requires
explicit support in the perl scripts, but that's a change for later.

Took a second look - sure looks like just using values %INC should suffice?

Ilmari, you're the perl expert, is there an issue with that?

Tristan, any chance you're interested hacking that up for a bunch of the
scripts? Might be worth adding a common helper for, I guess?

Something like

for (values %INC)
{
print STDERR "$kw_def_file: $_\n";
}

seems to roughly do the right thing for gen_keywordlist.pl. Of course, for
something real it'd need an option for where to put that data, instead of
printing to stderr.



Unless I'm misunderstanding, this doesn't look terribly feasible to me. 
You can only get at %INC by loading the module, which in many cases will 
have side effects. And then you would also need to filter out things 
loaded that are not our artefacts (e.g. Catalog.pm loads File::Compare).



cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: RFC: logical publication via inheritance root?

2023-06-16 Thread Jacob Champion
On Fri, Jun 16, 2023 at 6:26 AM Amit Kapila  wrote:
> On Sat, Apr 1, 2023 at 5:06 AM Jacob Champion  wrote:
> > I think it comes down to why an inheritance scheme was used. If it's
> > because you want to group rows into named, queryable subsets (e.g. the
> > "cities/capitals" example in the docs [1]), I don't think this has any
> > utility, because I assume you'd want to replicate your subsets as-is.
>
> I also think so, and your idea to have a function like
> pg_set_logical_root() seems to make the inheritance hierarchy behave
> as a declarative partitioning scheme for the purpose of logical
> replication.

Right.

> > But if it's because you've implemented a partitioning scheme of your
> > own (the docs still list reasons you might want to [2], even today),
> > and all you ever really do is interact with the root table, I think
> > this feature will give you some of the same benefits that
> > publish_via_partition_root gives native partition users. We're very
> > much in that boat, but I don't know how many others are.
> >
>
> I agree that there may still be cases as pointed out by you where
> people want to use inheritance as a mechanism for partitioning but I
> feel those would still be in the minority.

(Just to clarify -- timescaledb is one of those cases. They definitely
still exist.)

> Personally, I am not very
> excited about this idea.

Yeah, "exciting" isn't how I'd describe this feature either :D But I
think we're probably locked out of logical replication without the
ability to override publish_as_relid for our internal tables, somehow.
And I don't think DDL replication will help, just like it wouldn't
necessarily help existing publish_via_partition_root use cases,
because we don't want to force the source table's hierarchy on the
target table. (A later version of timescaledb may not even use the
same internal layout.)

Is there an alternative implementation I'm missing, maybe, or a way to
make this feature more generally applicable? "We have table Y and want
it to be migrated as part of table X" seems to fall squarely under the
logical replication umbrella.

Thanks,
--Jacob




test_extensions: fix inconsistency between meson.build and Makefile

2023-06-16 Thread Jeff Davis
Patch attached. Currently, the Makefile specifies NO_LOCALE=1, and the
meson.build does not.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From 1775c98badb94a2ee185d7a6bd11482a4e5db58a Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 16 Jun 2023 11:51:00 -0700
Subject: [PATCH v1] test_extensions: make meson.build consistent with
 Makefile.

Specify --no-locale and --encoding=UTF8 to be consistent with the
Makefile, which specifies NO_LOCALE=1.
---
 src/test/modules/test_extensions/meson.build | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/test/modules/test_extensions/meson.build b/src/test/modules/test_extensions/meson.build
index 29e5bb2fb5..698775b28d 100644
--- a/src/test/modules/test_extensions/meson.build
+++ b/src/test/modules/test_extensions/meson.build
@@ -47,5 +47,6 @@ tests += {
   'test_extensions',
   'test_extdepend',
 ],
+'regress_args': ['--no-locale', '--encoding=UTF8'],
   },
 }
-- 
2.34.1



Default client_connection_check_interval to 10s on supported systems

2023-06-16 Thread Tristan Partin
Hi,

I am proposing that the default value of
client_connection_check_interval be moved to a non-zero value on
systems where it is supported. I think it would be a nice quality of
life improvement. I have run into a problem before where this would have
been useful, with regard to pgbench not currently handling SIGINT
correctly[0]. I basically picked 10s out of thin air and am happy to
change it to what others feel would be more appropriate. This doesn't
need to be a check that happens often because it should just be a
backstop for unusual scenarios or poorly programmed clients.

The original thread where Thomas committed these changes seemed to
indicate no performance impact[1]. The only reason I can think of for
this being turned off by default is that it isn't available on all
systems that Postgres supports. When this was committed, however, the
only system that seemed to support EPOLLRDHUP was Linux. Seems like in
recent years this story has changed given the description of the
parameter.

> This option relies on kernel events exposed by Linux, macOS, illumos and
> the BSD family of operating systems, and is not currently available on
> other systems.

[0]: https://www.postgresql.org/message-id/CSSWBAX56CVY.291H6ZNNHK7EO@c3po
[1]: 
https://www.postgresql.org/message-id/ca+hukg++kitznuoxw2-kob1pkwd2cyuqa9vlj5bf0g_i7l1...@mail.gmail.com

-- 
Tristan Partin
Neon (https://neon.tech)
From d38cc95dceda5fcf98c8aae6379085e9c38a3155 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Fri, 16 Jun 2023 12:47:26 -0500
Subject: [PATCH v1] Default client_connection_check_interval to 10s on
 supported systems

client_connection_check_interval is an inherently valuable parameter
when supported. It can help cancel long running queries server-side when
a connection drops client-side.
---
 configure.ac|  3 +++
 doc/src/sgml/config.sgml| 10 ++
 meson.build |  2 ++
 src/backend/utils/misc/guc_tables.c |  7 ++-
 src/include/pg_config.h.in  |  6 ++
 src/tools/msvc/Solution.pm  |  2 ++
 6 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/configure.ac b/configure.ac
index 09558ada0f..3df7bbeb4f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1846,6 +1846,9 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 AC_CHECK_DECLS([preadv], [], [AC_LIBOBJ(preadv)], [#include ])
 AC_CHECK_DECLS([pwritev], [], [AC_LIBOBJ(pwritev)], [#include ])
 
+AC_CHECK_DECLS(EPOLLHUP, [], [], [#include ])
+AC_CHECK_DECLS(EPOLLRDHUP, [], [], [#include ])
+
 # This is probably only present on macOS, but may as well check always
 AC_CHECK_DECLS(F_FULLFSYNC, [], [], [#include ])
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6262cb7bb2..24f69ebd15 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1044,10 +1044,12 @@ include_dir 'conf.d'


 If the value is specified without units, it is taken as milliseconds.
-The default value is 0, which disables connection
-checks.  Without connection checks, the server will detect the loss of
-the connection only at the next interaction with the socket, when it
-waits for, receives or sends data.
+The default value is 0, which disables connection
+checks, on systems not previously mentioned, such as Windows; otherwise
+the default value is 10000.  Without
+connection checks, the server will detect the loss of the connection
+only at the next interaction with the socket, when it waits for,
+receives or sends data.


 For the kernel itself to detect lost TCP connections reliably and within
diff --git a/meson.build b/meson.build
index 82f2782673..8de6a7a7f1 100644
--- a/meson.build
+++ b/meson.build
@@ -2163,6 +2163,8 @@ endforeach
 
 
 decl_checks = [
+  ['EPOLLHUP', 'sys/epoll.h'],
+  ['EPOLLRDHUP', 'sys/epoll.h'],
   ['F_FULLFSYNC', 'fcntl.h'],
   ['fdatasync', 'unistd.h'],
   ['posix_fadvise', 'fcntl.h'],
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 71e27f8eb0..8cfd6b61bb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3478,7 +3478,12 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_MS
 		},
 		&client_connection_check_interval,
-		0, 0, INT_MAX,
+#if HAVE_DECL_EPOLLHUP || HAVE_DECL_EPOLLRDHUP
+		10000,
+#else
+		0,
+#endif
+		0, INT_MAX,
 		check_client_connection_check_interval, NULL, NULL
 	},
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 6d572c3820..7ce5004550 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -91,6 +91,12 @@
 /* Define to 1 if you have the `CRYPTO_lock' function. */
 #undef HAVE_CRYPTO_LOCK
 
+/* Define to 1 if you have the EPOLLHUP flag */
+#undef HAVE_DECL_EPOLLHUP
+
+/* Define to 1 if you have the EPOLLRDHUP flag */
+#undef HAVE_DECL_EPOLLRDHUP
+

Re: test_extensions: fix inconsistency between meson.build and Makefile

2023-06-16 Thread Tristan Partin
On Fri Jun 16, 2023 at 3:29 PM CDT, Jeff Davis wrote:
> Patch attached. Currently, the Makefile specifies NO_LOCALE=1, and the
> meson.build does not.

Looks alright to me, but it might be nicer to change the order of
arguments to match contrib/unaccent/meson.build:40. Might help with
grepping in the future.

-- 
Tristan Partin
Neon (https://neon.tech)




Adding further hardening to nbtree page deletion

2023-06-16 Thread Peter Geoghegan
Attached patch adds additional hardening to nbtree page deletion. It
makes nbtree VACUUM tolerate a certain sort of cross-page
inconsistency in the structure of an index (corruption). VACUUM can
press on, avoiding an eventual wraparound/xidStopLimit failure in
environments where nobody notices the problem for an extended period.

This is very similar to my recent commit 5abff197 (though it's even
closer to commit 5b861baa). Once again we're demoting an ERROR to a
LOG message, and pressing on with vacuuming. I propose that this patch
be backpatched all the way, too. The hardening added by the patch
seems equally well targeted and low risk. It's a parent/child
inconsistency, as opposed to a sibling inconsistency. Very familiar
stuff, overall.

I have seen an internal report of the ERROR causing issues for a
production instance, so this definitely can fail in the field on
modern Postgres versions. Though this particular inconsistency ("right
sibling is not next child...") has a long history. It has definitely
been spotted in the field several times over many years. This 2006
thread about problems with a Wisconsin courts database is one example
of that:

https://www.postgresql.org/message-id/flat/3355.1144873721%40sss.pgh.pa.us#b0a89b2d9e7f6a3c818fdf723b8fa29b

At the time the ERROR was a PANIC. A few years later (in 2010), it was
demoted to an ERROR (see commit 8fa30f90). And now I want to demote it
to a LOG -- which is much easier now that we have a robust approach to
page deletion (after 2014 commit efada2b8e9).

-- 
Peter Geoghegan


v1-0001-nbtree-VACUUM-cope-with-topparent-inconsistencies.patch
Description: Binary data


Re: [PATCH] Missing dep on Catalog.pm in meson rules

2023-06-16 Thread Andres Freund
Hi,

On 2023-06-16 16:20:14 -0400, Andrew Dunstan wrote:
> Unless I'm misunderstanding, this doesn't look terribly feasible to me. You
> can only get at %INC by loading the module, which in many cases will have
> side effects.

I was envisioning using %INC after the use/require block - I don't think our
scripts load additional modules after that point?


> And then you would also need to filter out things loaded that
> are not our artefacts (e.g. Catalog.pm loads File::Compare).

I don't think we would need to filter the output. This would just be for a
build dependency file. I don't see a problem with rerunning genbki.pl et al 
after
somebody updates File::Compare?

Greetings,

Andres Freund




Re: [PATCH] pg_regress.c: Fix "make check" on Mac OS X: Pass DYLD_LIBRARY_PATH

2023-06-16 Thread David Zhang
I have applied the patch to the latest master branch and successfully executed 
'./configure && make && make check' on macOS Ventura. However, during the 
process, a warning was encountered: "mixing declarations and code is 
incompatible with standards before C99 [-Wdeclaration-after-statement]". Moving 
the declaration of 'result' to the beginning like below can resolve the 
warning, and it would be better to use a unique variable instead of 'result'. 

#ifdef __darwin__
static char extra_envvars[4096];
+int result = -1;
... ...
-int result = snprintf(extra_envvars, sizeof(extra_envvars), 
"DYLD_LIBRARY_PATH=%s",
+result = snprintf(extra_envvars, sizeof(extra_envvars), "DYLD_LIBRARY_PATH=%s",

Re: [17] collation provider "builtin"

2023-06-16 Thread Jeff Davis
On Fri, 2023-06-16 at 16:01 +0200, Peter Eisentraut wrote:
> What happens if after this patch you continue to specify
> provider=libc 
> and locale=C?  Do you then get the "slow" path?

Users can continue to use the libc provider as they did before and the
fast path will still work.

> Should there be some logic in pg_dump to change the provider if
> locale=C?

That's not a part of this proposal.

> What is the transition plan?

The built-in provider is for users who want to choose a provider that
is guaranteed not to have the problems of an external provider
(versioning, tracking affected objects, corrupt indexes, and slow
performance). If they initialize with --locale-provider=builtin and --
locale=C, and they want to choose a different locale for another
database, they'll need to specifically choose libc or ICU. Of course
they can still use specific collations attached to columns or queries
as required.
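
(To illustrate that last point, per-column and per-query collations keep
working as usual regardless of the database-level provider; the collation
names below are only examples and assume an ICU-enabled build:

    CREATE TABLE t (s text COLLATE "en-x-icu");
    SELECT s FROM t ORDER BY s COLLATE "C";
)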

It also acts as a nice complement to ICU (which doesn't support the C
locale) or a potential replacement for many uses of the libc provider
with the C locale. We can discuss later exactly how that will look, but
even if the builtin provider needs to be explicitly requested (as in
the current patch), it's still useful, so I don't think we need to
decide now.

We should also keep in mind that whatever provider is selected at
initdb time also becomes the default for future databases.

Regards,
Jeff Davis





Fixing tab-complete for dollar-names

2023-06-16 Thread Mikhail Gribkov
Hi hackers,

In modern versions of Postgres the dollar sign is a totally legal character
for identifiers (except for the first character), but tab-complete does not
treat such identifiers well.
For example, if one tries to create an Oracle-style view like this:

create view v$activity as select * from pg_stat_activity;

, he will get a normally functioning view, but psql tab-complete will not
help him. Type "v", "v$" or "v$act" and press <tab> - nothing will be
suggested.

Attached is a small patch fixing this problem.
Honestly I'm a little surprised that this was not done before. Maybe there
are some special considerations I am not aware of, and the patch will break
something?
What would you say?
--
 best regards,
Mikhail A. Gribkov


v001_fix_dollar_names_tab_complete.patch
Description: Binary data


Re: pg_collation.collversion for C.UTF-8

2023-06-16 Thread Jeff Davis
On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote:
> Hmm, OK let's explore that.  What could we do that would be helpful
> here, without affecting users of the "true" C.UTF-8 for the rest of
> time?

Where is the "true" C.UTF-8 defined?

I assume you mean that the collation order can't (shouldn't, anyway)
change. But what about the ctype (upper/lower/initcap) behavior? Is
that also locked down for all time, or could it change if some new
unicode characters are added?

Would it be correct to interpret LC_COLLATE=C.UTF-8 as LC_COLLATE=C,
but leave LC_CTYPE=C.UTF-8 as-is?

Regards,
Jeff Davis





Re: Default client_connection_check_interval to 10s on supported systems

2023-06-16 Thread Tom Lane
"Tristan Partin"  writes:
> I am proposing that the default value of
> client_connection_check_interval be moved to a non-zero value on
> systems where it is supported. I think it would be a nice quality of
> life improvement.

I doubt that we need this, and I *really* doubt that it's appropriate
to use a timeout as short as 10s.

One reason not to try to enable it by default is exactly that the
default behavior would then be platform-dependent.  That's a
documentation problem we could do without.

regards, tom lane




Re: [PATCH] Missing dep on Catalog.pm in meson rules

2023-06-16 Thread Dagfinn Ilmari Mannsåker
Andres Freund  writes:

> Hi,
>
> On 2023-06-16 16:20:14 -0400, Andrew Dunstan wrote:
>> Unless I'm misunderstanding, this doesn't look terribly feasible to me. You
>> can only get at %INC by loading the module, which in many cases will have
>> side effects.
>
> I was envisioning using %INC after the use/require block - I don't think our
> scripts load additional modules after that point?

I was thinking of a module for writing depfile entries that would append
`values %INC` to the list of source files for each target specified by
the script.

>> And then you would also need to filter out things loaded that
>> are not our artefacts (e.g. Catalog.pm loads File::Compare).
>
> I don't think we would need to filter the output. This would just be for a
> build dependency file. I don't see a problem with rerunning genbki.pl et al 
> after
> somebody updates File::Compare?

As long as meson doesn't object to dep files outside the source tree.
Otherwise, an option would be to pass in @SOURCE_ROOT@ and only include
`grep /^\Q$source_root\E\b/, values %INC` in the depfile.

> Greetings,
>
> Andres Freund

- ilmari




Re: [PATCH] Missing dep on Catalog.pm in meson rules

2023-06-16 Thread Tristan Partin
On Fri Jun 16, 2023 at 5:10 PM CDT, Dagfinn Ilmari Mannsåker wrote:
> Andres Freund  writes:
>
> > Hi,
> >
> > On 2023-06-16 16:20:14 -0400, Andrew Dunstan wrote:
> >> Unless I'm misunderstanding, this doesn't look terribly feasible to me. You
> >> can only get at %INC by loading the module, which in many cases will have
> >> side effects.
> >
> > I was envisioning using %INC after the use/require block - I don't think our
> > scripts load additional modules after that point?
>
> I was thinking of a module for writing depfile entries that would append
> `values %INC` to the list of source files for each target specified by
> the script.
>
> >> And then you would also need to filter out things loaded that
> >> are not our artefacts (e.g. Catalog.pm loads File::Compare).
> >
> > I don't think we would need to filter the output. This would just be for a
> > build dependency file. I don't see a problem with rerunning genbki.pl et al 
> > after
> > somebody updates File::Compare?
>
> As long as meson doesn't object to dep files outside the source tree.
> Otherwise, an option would be to pass in @SOURCE_ROOT@ and only include
> `grep /^\Q$source_root\E\b/, values %INC` in the depfile.

Meson has no such restrictions.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Default client_connection_check_interval to 10s on supported systems

2023-06-16 Thread Tristan Partin
On Fri Jun 16, 2023 at 5:10 PM CDT, Tom Lane wrote:
> "Tristan Partin"  writes:
> > I am proposing that the default value of
> > client_connection_check_interval be moved to a non-zero value on
> > systems where it is supported. I think it would be a nice quality of
> > life improvement.
>
> I doubt that we need this, and I *really* doubt that it's appropriate
> to use a timeout as short as 10s.

Sure. Like I said, 10s is just a number pulled from thin air. The
original patches on the mailing list had this as low as 1s.

> One reason not to try to enable it by default is exactly that the
> default behavior would then be platform-dependent.  That's a
> documentation problem we could do without.

Totally fine with me if this is the reason for rejection. Just wanted to
put it out there for discussion since I don't think the original thread
covered the default value very much.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

2023-06-16 Thread Quan Zongliang




On 2023/6/16 23:39, Tomas Vondra wrote:



On 6/16/23 11:25, Quan Zongliang wrote:


We have a small table with only 23 rows and 21 values.

The resulting MCV and histogram is as follows
stanumbers1 | {0.08695652,0.08695652}
stavalues1  | {v1,v2}
stavalues2  |
{v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}

An incorrect number of rows was estimated when HashJoin was done with
another large table (about 2 million rows).

Hash Join  (cost=1.52..92414.61 rows=2035023 width=0) (actual
time=1.943..1528.983 rows=3902 loops=1)



That's interesting. I wonder how come the estimate gets this bad simply
by skipping values entries with a single row in the sample, which means
we know the per-value selectivity pretty well.

I guess the explanation has to be something strange happening when
estimating the join condition selectivity, where we combine MCVs from
both sides of the join (which has to be happening here, otherwise it
would not matter what gets to the MCV).

It'd be interesting to know what's in the other MCV, and what are the
other statistics for the attributes (ndistinct etc.).

Or even better, a reproducer SQL script that builds two tables and then
joins them.

The other table is severely skewed. Most of its rows do not match any 
row in the small table. This special case is what causes the inaccurate estimate.



The reason is that the MCV of the small table excludes values with rows
of 1. Put them in the MCV in the statistics to get the correct result.

Using the conservative samplerows <= attstattarget doesn't completely
solve this problem. It can solve this case.

After modification we get statistics without histogram:
stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
stavalues1  | {v,v2, ... }

And we have the right estimates:
Hash Join  (cost=1.52..72100.69 rows=3631 width=0) (actual
time=1.447..1268.385 rows=3902 loops=1)



I'm not against building a "complete" MCV, but I guess the case where
(samplerows <= num_mcv) is pretty rare. Why shouldn't we make the MCV
complete whenever we decide (ndistinct <= num_mcv)?

That would need to happen later, because we don't have the ndistinct
estimate yet at this point - we'd have to do the loop a bit later (or
likely twice).

FWIW the patch breaks the calculation of nmultiple (and thus likely the
ndistinct estimate).

It's not just small tables. If a column's values are nearly unique, the 
same problem occurs, because we exclude values that occur only once. 
samplerows <= num_mcv only solves one scenario.

Perhaps we should discard this (dups cnt > 1) restriction?
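
(Since a self-contained reproducer was requested earlier in the thread,
here is a sketch of the general shape; the table names, values and the
exact skew are made up to roughly match the statistics quoted above, not
taken from the real schema:

    CREATE TABLE small_t (v text);
    INSERT INTO small_t SELECT 'v' || i FROM generate_series(1, 21) i;
    INSERT INTO small_t VALUES ('v1'), ('v2');        -- 23 rows, 21 distinct
    CREATE TABLE big_t (v text);
    INSERT INTO big_t
        SELECT CASE WHEN g % 1000 = 0 THEN 'v5'       -- rare, matching value
                    WHEN g % 2 = 0    THEN 'v22'      -- dominant, non-matching
                    ELSE 'v23' END
        FROM generate_series(1, 2000000) g;
    ANALYZE small_t;
    ANALYZE big_t;
    EXPLAIN ANALYZE SELECT * FROM big_t JOIN small_t USING (v);
)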



regards







Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

2023-06-16 Thread Tom Lane
Quan Zongliang  writes:
> Perhaps we should discard this (dups cnt > 1) restriction?

That's not going to happen on the basis of one test case that you
haven't even shown us.  The implications of doing it are very unclear.
In particular, I seem to recall that there are bits of logic that
depend on the assumption that MCV entries always represent more than
one row.  The nmultiple calculation Tomas referred to may be failing
because of that, but I'm worried about there being other places.

Basically, you're proposing a rather fundamental change in the rules
by which Postgres has gathered statistics for decades.  You need to
bring some pretty substantial evidence to support that.  The burden
of proof is on you, not on the status quo.

regards, tom lane




Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

2023-06-16 Thread Quan Zongliang




On 2023/6/17 06:46, Tom Lane wrote:

Quan Zongliang  writes:

Perhaps we should discard this (dups cnt > 1) restriction?


That's not going to happen on the basis of one test case that you
haven't even shown us.  The implications of doing it are very unclear.
In particular, I seem to recall that there are bits of logic that
depend on the assumption that MCV entries always represent more than
one row.  The nmultiple calculation Tomas referred to may be failing
because of that, but I'm worried about there being other places.



The statistics for the other table look like this:
stadistinct | 6
stanumbers1 | {0.50096667,0.49736667,0.0012}
stavalues1  | {v22,v23,v5}

The values that appear twice in the small table (v1 and v2) do not 
appear here. The true value of stadistinct is 18 instead of 6 (three of 
the small table's values do not appear in this table at all).


When calculating the selectivity:
if (nd2 > sslot2->nvalues)
  totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - sslot2->nvalues);

totalsel1 = 0
nd2 = 21
sslot2->nvalues = 2
unmatchfreq1 = 0.0002016420476
otherfreq2 = 0.82608695328235626

result: totalsel1 = 0.043473913749706022
rows = 0.043473913749706022 * 23 * 2,000,000 = 1999800



Basically, you're proposing a rather fundamental change in the rules
by which Postgres has gathered statistics for decades.  You need to
bring some pretty substantial evidence to support that.  The burden
of proof is on you, not on the status quo.

regards, tom lane






Re: Do we want a hashset type?

2023-06-16 Thread Joel Jacobson
On Fri, Jun 16, 2023, at 17:42, Joel Jacobson wrote:
> I realise int4hashset_hash() is broken,
> since two int4hashset's that are considered equal,
> can by coincidence get different hashes:
...
> Do we have any ideas on how to fix this without sacrificing performance?

The problem was that the hashset_hash() function accumulated the hashes
of individual elements in a non-commutative manner. As a consequence, the
final hash value was sensitive to the order in which elements were inserted
into the hashset. This behavior led to inconsistencies, as logically
equivalent sets (i.e., sets with the same elements but in different orders)
produced different hash values.

Solved by modifying the hashset_hash() function to use a commutative operation
when combining the hashes of individual elements. This change ensures that the
final hash value is independent of the element insertion order, and logically
equivalent sets produce the same hash.
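
As a quick sanity check of that property, something along these lines
should now hold (the set literal syntax is assumed from the patch's text
I/O format):

    SELECT hashset_hash('{1,2,3}'::int4hashset)
         = hashset_hash('{3,2,1}'::int4hashset);   -- true, regardless of insertion order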

A somewhat unfortunate side-effect of this fix is that the hashset output
format can no longer be visually sorted, since it's not lexicographically
ordered. I think this is an acceptable trade-off for a hashset type,
since the only alternative I see would be to sort the elements,
but then it wouldn't be a hashset but a treeset, which has different
Big-O complexity.

New patch is attached, which will henceforth always be a complete patch,
to avoid the hassle of having to assemble incremental patches.

/Joel

hashset-0.0.1-8da4aa8.patch
Description: Binary data


Re: RFC: logical publication via inheritance root?

2023-06-16 Thread Amit Kapila
On Sat, Jun 17, 2023 at 1:51 AM Jacob Champion  wrote:
>
> On Fri, Jun 16, 2023 at 6:26 AM Amit Kapila  wrote:
>
> > > But if it's because you've implemented a partitioning scheme of your
> > > own (the docs still list reasons you might want to [2], even today),
> > > and all you ever really do is interact with the root table, I think
> > > this feature will give you some of the same benefits that
> > > publish_via_partition_root gives native partition users. We're very
> > > much in that boat, but I don't know how many others are.
> > >
> >
> > I agree that there may still be cases as pointed out by you where
> > people want to use inheritance as a mechanism for partitioning but I
> > feel those would still be in the minority.
>
> (Just to clarify -- timescaledb is one of those cases. They definitely
> still exist.)
>

Noted, but I think that can't be the reason to accept this feature in core.

> > Personally, I am not very
> > excited about this idea.
>
> Yeah, "exciting" isn't how I'd describe this feature either :D But I
> think we're probably locked out of logical replication without the
> ability to override publish_as_relid for our internal tables, somehow.
> And I don't think DDL replication will help, just like it wouldn't
> necessarily help existing publish_via_partition_root use cases,
> because we don't want to force the source table's hierarchy on the
> target table. (A later version of timescaledb may not even use the
> same internal layout.)
>
> Is there an alternative implementation I'm missing, maybe, or a way to
> make this feature more generally applicable? "We have table Y and want
> it to be migrated as part of table X" seems to fall squarely under the
> logical replication umbrella.
>

Are you talking about this w.r.t. the inheritance/partition hierarchy? I
don't see any other way except "publish_via_partition_root", because we
expect the same schema and relation name on the subscriber in order to
replicate. You haven't explained why exactly you have such a
requirement of replicating via the inheritance root, i.e., why you want
the inheritance hierarchy to be different on the target db.

The other idea that came to my mind was to provide some kind of
schema-mapping feature on subscribers, where we could route the tuples
from table X to table Y provided they have the same or a compatible
schema. I don't know if this is feasible, how generally useful it would
be, or whether any other DB (replication solution) provides such a
feature, but I guess something like that would have helped your use
case.

-- 
With Regards,
Amit Kapila.




Re: pg_collation.collversion for C.UTF-8

2023-06-16 Thread Thomas Munro
On Sat, Jun 17, 2023 at 10:03 AM Jeff Davis  wrote:
> On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote:
> > Hmm, OK let's explore that.  What could we do that would be helpful
> > here, without affecting users of the "true" C.UTF-8 for the rest of
> > time?
>
> Where is the "true" C.UTF-8 defined?

By "true" I just meant glibc's official one, in contrast to the
imposter from Debian oldstable's patches.  It's not defined by any
standard, but we only know how to record versions for glibc, FreeBSD
and Windows, and we know what the first two of those do for that
locale because they tell us (see below).  For Windows, the manual's
BNF-style description of acceptable strings doesn't appear to accept
C.UTF-8 (but I haven't tried it).

> I assume you mean that the collation order can't (shouldn't, anyway)
> change. But what about the ctype (upper/lower/initcap) behavior? Is
> that also locked down for all time, or could it change if some new
> unicode characters are added?

Fair point.  Considering that our collversion effectively functions as
a proxy for ctype version too, Daniel's patch makes a certain amount
of sense.

Our versioning is nominally based only on the collation category, not
locales more generally or any other category they contain (nominally,
as in: we named it collversion, and our code and comments and
discussions so far only contemplated collations in this context).
But, clearly, changes to underlying ctype data could also cause a
constraint CHECK (x ~ '[[:digit:]]') or a partial index with WHERE
(upper(x) <> 'ẞ') to be corrupted, which I'd considered to be a
separate topic, but Daniel's patch would cover with the same
mechanism.  (Actually I just learned that [[:digit:]] is a bad example
on a glibc system, because they appear to have hardcoded a test for
[0-9] into their iswdigit_l() implementation, but FreeBSD gives the
Unicode answer, which is subject to change, and other classes may work
better on glibc.)

> Would it be correct to interpret LC_COLLATE=C.UTF-8 as LC_COLLATE=C,
> but leave LC_CTYPE=C.UTF-8 as-is?

Yes.  The basic idea, at least for these two OSes, is that every
category behaves as if set to C, except LC_CTYPE.  For implementation
reasons the glibc people don't quite describe it that way[1]: for
LC_COLLATE, they decode to codepoints first and then compare those
using a new codepath they had to write for release 2.35, while FreeBSD
skips that useless step and compares raw UTF-8 bytes like
LC_COLLATE=C[2].  Which is the same, as RFC 3629 tells us:

   o  The byte-value lexicographic sorting order of UTF-8 strings is the
  same as if ordered by character numbers.  Of course this is of
  limited interest since a sort order based on character numbers is
  almost never culturally valid.

It is interesting to note that LC_COLLATE=C, LC_CTYPE=C.UTF-8 is
equivalent, but would not get version warnings with Daniel's patch,
revealing that it's only a proxy.  But recording ctype version
separately would be excessive.
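
(That combination is something one can already spell out explicitly with
a libc collation, e.g. as below; the collation name is made up and it of
course requires C.UTF-8 to exist on the system:

    CREATE COLLATION c_collate_utf8_ctype
        (provider = libc, lc_collate = 'C', lc_ctype = 'C.UTF-8');
)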

For completeness, Solaris also has C.UTF-8.  I can't read about what
it does, the release notes are behind a signup thing.  *shrug*  I
can't find any other systems that have it.

[1] https://sourceware.org/glibc/wiki/Proposals/C.UTF-8
[2] https://reviews.freebsd.org/D17833




Re: [DOC] Update ALTER SUBSCRIPTION documentation v3

2023-06-16 Thread Amit Kapila
On Fri, Jun 16, 2023 at 7:15 PM Peter Eisentraut  wrote:
>
> On 15.06.23 04:49, Amit Kapila wrote:
> >
> > Now, along with this change, there is a change in errhint as well
> > which I am not sure about whether to backpatch or not. I think we have
> > the following options (a) commit both doc and code change in HEAD (b)
> > commit both doc and code change in v17 when the next version branch
> > opens (c) backpatch the doc change and commit the code change in HEAD
> > only (d) backpatch the doc change and commit the code change in v17
> > (e) backpatch both the doc and code change.
>
> Reading the thread again now, I think this is essentially a bug fix, so
> I don't mind backpatching it.
>
> I wish the errhint would show the actual command to disable the
> subscription.  It already shows the command to detach the replication
> slot, so it would only be consistent to also show the other command.
>

Fair enough. I updated the errhint and slightly adjusted the docs as
well in the attached.
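
(For reference, the two commands such a hint points the user at are of
this shape, with "mysub" standing in for the actual subscription name:

    ALTER SUBSCRIPTION mysub DISABLE;
    ALTER SUBSCRIPTION mysub SET (slot_name = NONE);
)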

-- 
With Regards,
Amit Kapila.


logical_replication_slot_disable_doc_update_v5.patch
Description: Binary data