Re: altering a column's collation leaves an invalid foreign key

2024-10-24 Thread jian he
On Thu, Oct 17, 2024 at 7:14 PM Peter Eisentraut  wrote:
>
>
> So I took the v5 patch you had posted and started working from there.
> The rule that you had picked isn't quite what we want, I think.  It's
> okay to have nondeterministic collations on foreign keys, as long as the
> collation is the same on both sides.  That's what I have implemented.
> See attached.
>
> This approach also allows cleaning up a bunch of hackiness in
> ri_triggers.c, which feels satisfying.
>
>
Yeah, I missed the case where FK and PK are both nondeterministic but share the same collation OID.
Your work is neater.
However, FK and PK sharing the same nondeterministic collation OID has implications
for ri_KeysEqual.

ri_KeysEqual definitely deserves some comments:
when rel_is_pk is true, the equality check is collation-agnostic;
when rel_is_pk is false, the equality check is collation-aware.

for example:
DROP TABLE IF EXISTS fktable, pktable;
CREATE TABLE pktable (x text COLLATE case_insensitive PRIMARY KEY);
CREATE TABLE fktable (x text collate case_insensitive REFERENCES
pktable on update restrict on delete restrict);
INSERT INTO pktable VALUES ('A'), ('Å');
INSERT INTO fktable VALUES ('a');
update pktable set x = 'a' where x = 'A';
ERROR:  update or delete on table "pktable" violates foreign key
constraint "fktable_x_fkey" on table "fktable"
DETAIL:  Key (x)=(A) is still referenced from table "fktable".
This should not happen, should it?
If so, the change below can solve the problem.

@@ -2930,6 +2915,16 @@ ri_KeysEqual(Relation rel, TupleTableSlot *oldslot, TupleTableSlot *newslot,
              */
             Form_pg_attribute att = TupleDescAttr(oldslot->tts_tupleDescriptor, attnums[i] - 1);
+            Oid         collation = RIAttCollation(rel, attnums[i]);
+            if (OidIsValid(collation) && !get_collation_isdeterministic(collation))
+            {
+                Oid         eq_opr;
+                bool        result;
+                eq_opr = riinfo->pp_eq_oprs[i];
+                result = ri_CompareWithCast(eq_opr, RIAttType(rel, attnums[i]),
+                                            collation, newvalue, oldvalue);
+                return result;
+            }
             if (!datum_image_eq(oldvalue, newvalue, att->attbyval, att->attlen))
                 return false;

The above change will make the ri_KeysEqual equality check collation-aware
regardless of rel_is_pk's value.
To see the effect, we can test it BEFORE and AFTER applying the above
ri_KeysEqual change with the attached SQL script.


pk_fk_inderministic_collation.sql
Description: application/sql


Re: altering a column's collation leaves an invalid foreign key

2024-10-24 Thread jian he
> So for the moment this is a master-only patch.  I think once we have
> tied down the behavior we want for the future

 /*
  * ri_GenerateQualCollation --- add a COLLATE spec to a WHERE clause
  *
- * At present, we intentionally do not use this function for RI queries that
- * compare a variable to a $n parameter.  Since parameter symbols always have
- * default collation, the effect will be to use the variable's collation.
- * Now that is only strictly correct when testing the referenced column, since
- * the SQL standard specifies that RI comparisons should use the referenced
- * column's collation.  However, so long as all collations have the same
- * notion of equality (which they do, because texteq reduces to bitwise
- * equality), there's no visible semantic impact from using the referencing
- * column's collation when testing it, and this is a good thing to do because
- * it lets us use a normal index on the referencing column.  However, we do
- * have to use this function when directly comparing the referencing and
- * referenced columns, if they are of different collations; else the parser
- * will fail to resolve the collation to use.
+ * We only have to use this function when directly comparing the referencing
+ * and referenced columns, if they are of different collations; else the
+ * parser will fail to resolve the collation to use.  We don't need to use
+ * this function for RI queries that compare a variable to a $n parameter.
+ * Since parameter symbols always have default collation, the effect will be
+ * to use the variable's collation.
+ *
+ * Note that we require that the collations of the referencing and the
+ * referenced column have the some notion of equality: Either they have to
+ * both be deterministic or else they both have to be the same.
  */
" some notion of equality" should it be "same notion of equality"?
maybe we can add "see ATAddForeignKeyConstraint also"
at the end of the whole comment.


/*
 * transformColumnNameList - transform list of column names
 *
 * Lookup each name and return its attnum and, optionally, type OID
 *
 * Note: the name of this function suggests that it's general-purpose,
 * but actually it's only used to look up names appearing in foreign-key
 * clauses.  The error messages would need work to use it in other cases,
 * and perhaps the validity checks as well.
 */
static int
transformColumnNameList(Oid relId, List *colList,
int16 *attnums, Oid *atttypids, Oid *attcollids)
The comments need a slight adjustment?
The comments in transformFkeyGetPrimaryKey also need a slight change?
For ri_CompareWithCast, we can also add a comment saying the output is
collation-aware.


+ /*
+ * This shouldn't be possible, but better check to make sure we have a
+ * consistent state for the check below.
+ */
+ if ((pkcoll && !fkcoll) || (!pkcoll && fkcoll))
+ elog(ERROR, "key columns are not both collatable");
+
+ if (pkcoll && fkcoll)

Cosmetic: pkcoll can be changed to OidIsValid(pkcoll),
and fkcoll to OidIsValid(fkcoll).



ereport(ERROR,
        (errcode(ERRCODE_COLLATION_MISMATCH),
         errmsg("foreign key constraint \"%s\" cannot be implemented",
                fkconstraint->conname),
         errdetail("Key columns \"%s\" and \"%s\" have incompatible collations: \"%s\" and \"%s\".  "
                   "If either collation is nondeterministic, then both collations have to be the same.",
                   strVal(list_nth(fkconstraint->fk_attrs, i)),
                   strVal(list_nth(fkconstraint->pk_attrs, i)),
                   get_collation_name(fkcoll),
                   get_collation_name(pkcoll))));

ERROR:  foreign key constraint "a_fk_col1_fkey" cannot be implemented
DETAIL:  Key columns "col1" and "col1" have incompatible collations:
"default" and "case_insensitive".  If either collation is
nondeterministic, then both collations have to be the same.

"DETAIL:  Key columns "col1" and "col1"" may be confusing.
we can be more verbose. like
DETAIL:  Key columns "col1" in relation "X" and "col1" in relation "Y"
have incompatible collations..
(just a idea, don't have much preference)


old_check_ok = (new_pathtype == old_pathtype &&
                new_castfunc == old_castfunc &&
                (!IsPolymorphicType(pfeqop_right) ||
                 new_fktype == old_fktype) &&
                (new_fkcoll == old_fkcoll ||
                 (get_collation_isdeterministic(old_fkcoll) &&
                  get_collation_isdeterministic(new_fkcoll))));

I was wondering whether one of new_fkcoll and old_fkcoll could be zero while
the other is not; it turns out that wouldn't happen.

Before "if (old_check_ok)" we have already done extensive checks that
new_fkcoll is OK for the job.
In ATAddForeignKeyConstraint, we could leave the old_check_ok part as is.
what do you think of the following:

old_check_ok = (new_pathtype == old_pathtype &&
new_castfunc == old_castfunc &&
(!IsPolymorphicType(pfeqop_right) ||
 new_fktype == old_fktype));
if (OidIsValid(new_fkcoll) && OidIsValid(old_fkcoll) &&
new_fkcoll != old_fkcoll)
{
if(!get_collat

Re: Pgoutput not capturing the generated columns

2024-10-24 Thread Amit Kapila
On Thu, Oct 24, 2024 at 8:50 PM vignesh C  wrote:
>
> The v42 version patch attached at [1] has the changes for the same.
>

Some more comments:
1.
@@ -1017,7 +1089,31 @@ pgoutput_column_list_init(PGOutputData *data,
List *publications,
 {
  ListCell   *lc;
  bool first = true;
+ Bitmapset  *relcols = NULL;
  Relation relation = RelationIdGetRelation(entry->publish_as_relid);
+ TupleDesc desc = RelationGetDescr(relation);
+ MemoryContext oldcxt = NULL;
+ bool collistpubexist = false;
+
+ pgoutput_ensure_entry_cxt(data, entry);
+
+ oldcxt = MemoryContextSwitchTo(entry->entry_cxt);
+
+ /*
+ * Prepare the columns that will be published for FOR ALL TABLES and
+ * FOR TABLES IN SCHEMA publication.
+ */
+ for (int i = 0; i < desc->natts; i++)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, i);
+
+ if (att->attisdropped || (att->attgenerated && !entry->pubgencols))
+ continue;
+
+ relcols = bms_add_member(relcols, att->attnum);
+ }
+
+ MemoryContextSwitchTo(oldcxt);

This code is unnecessary for cases when the table's publication has a
column list. So, I suggest forming this list only when required. Also,
add an assertion that the pubgencols values for the entry and the publication
match.

2.
@@ -1115,10 +1186,17 @@ pgoutput_column_list_init(PGOutputData *data,
List *publications,
  ereport(ERROR,
  errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("cannot use different column lists for table \"%s.%s\" in
different publications",
-get_namespace_name(RelationGetNamespace(relation)),
-RelationGetRelationName(relation)));
+ get_namespace_name(RelationGetNamespace(relation)),
+ RelationGetRelationName(relation)));

Is there a reason to make the above change? It appears to be a spurious change.

3.
+ /* Check if there is any generated column present */
+ for (int i = 0; i < desc->natts; i++)
+ {
+ Form_pg_attribute att = TupleDescAttr(desc, i);
+ if (att->attgenerated)

Add one empty line between the above two lines.

4.
+ else if (entry->pubgencols != pub->pubgencols)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot use different values of publish_generated_columns for
table \"%s.%s\" in different publications",
+ get_namespace_name(RelationGetNamespace(relation)),
+ RelationGetRelationName(relation)));

The last two lines are not aligned.

-- 
With Regards,
Amit Kapila.




Re: [PoC] Partition path cache

2024-10-24 Thread Andy Fan
Bykov Ivan  writes:

> Our customers use databases (1-10 TB) with big tables. Often these tables are 
> big and split on sections. For example, we have tables with almost 
> thousand sections. In most cases, those sections have a similar set of 
> indexes 
> and contain similar data. Often this partitioned table has multilevel 
> structure 
> (for example, a per-year section has a per-quarter section, and a per-quarter 
> section in turn has a per-month simple relation sections). 
>
> During the analysis of the planning procedure, we found that the planner
> in PostgreSQL 15.7 spends a lot of time building
> access paths.
..
> We backported new access path build algorithms from PostgreSQL 17 (which 
> optimizes match_pathkeys_to_index()) and it takes effect: 
> planner spent 1090 ms for planning query at first time and 320 ms for 
> second time.
>
> But we still think that planners make unnecessary jobs when building all 
> types of paths for every section. So we implemented a feature named 
> “partition path cache” (see next section), and now planner spent 970 ms for 
> planning query at the first time and 240 ms for the second time.

>
> Partition path cache
> 
> The partition path cache aims to speed up planning for partition scan paths.
>
> Path cache doesn't copy and transform Path nodes directly due to the absence 
> of
> Path nodes copy functions. The main aim of this patch is to prove the 
> assumption
> that partitions of the same relation that have similar sets of indexes may use
> similar access path types.

This sounds like an interesting idea. I like it because it omits the need
for a "global statistics" effort for partitioned tables, since it just uses
the first partition it knows about. Of course, it has the drawback that the
"first" partition can't represent the other partitions.

One of the arguments against this patch might be: "What if the other partitions
have pretty different statistics from the first partition?". If I were
you, I might check all the statistics used at this stage and try to find
a similar algorithm to prove that the best path would be similar
too. This can happen once, when the statistics are gathered. However, this
might not be easy.

-- 
Best Regards
Andy Fan





Re: Changing the default random_page_cost value

2024-10-24 Thread Greg Sabino Mullane
On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian  wrote:

> I am not a fan of this patch.  I don't see why _removing_ the magnetic
> part helps because you then have no logic for why 1.2 was chosen.


Okay, but we have no documented logic on why 4.0 was chosen either. :)

I would put the magnetic in a separate paragraph perhaps, and recommend
> 4.0 for it.


Sounds doable. Even in the pre-SSD age I recall lowering this as a fairly
standard practice, but I'm fine with a recommendation of 4. Partly because
I doubt anyone will use it much.

 Also, per-tablespace makes sense because of physical media
> differences, but what purpose would per-database and per-role serve?
> Also, per-tablespace is not a connection-activated item like the other
> two.
>

Good point, I withdraw that part.
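
For reference, the per-tablespace override is already available; a minimal
sketch of what the documentation could recommend (tablespace names here are
made up for illustration):

ALTER TABLESPACE spinning_disks SET (random_page_cost = 4.0);
ALTER TABLESPACE nvme_storage SET (random_page_cost = 1.2);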

Cheers,
Greg


Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-10-24 Thread Jacob Champion
On Fri, Oct 18, 2024 at 4:38 AM Daniel Gustafsson  wrote:
> In validate() it seems to me we should clear out ret->authn_id on failure to
> pair belts with suspenders. Fixed by calling explicit_bzero on it in the error
> path.

The new hunk says:

> cleanup:
> /*
>  * Clear and free the validation result from the validator module once
>  * we're done with it to avoid accidental re-use.
>  */
> if (ret->authn_id != NULL)
> {
> explicit_bzero(ret->authn_id, strlen(ret->authn_id));
> pfree(ret->authn_id);
> }
> pfree(ret);

But I'm not clear on what's being protected against. Which code would
reuse this result?

Thanks,
--Jacob




Re: Refactor to use common function 'get_publications_str'.

2024-10-24 Thread Michael Paquier
On Fri, Oct 25, 2024 at 09:28:47AM +1100, Peter Smith wrote:
> I've attached the patch v4.

Looks OK to me.  Thanks.  I'll see to get that done through the day.
--
Michael


signature.asc
Description: PGP signature


Re: Add isolation test template in injection_points for wait/wakeup/detach

2024-10-24 Thread Michael Paquier
On Mon, Oct 21, 2024 at 11:13:59AM +0900, Michael Paquier wrote:
> Used "basic" as test name at the end, tweaked the comment close to
> what you have suggested, and applied the result.

I have spotted a race condition with this test in some CF bot runs.
Here are some of them:
https://github.com/michaelpq/postgres/runs/31984161747
https://cirrus-ci.com/task/6741385749987328
https://cirrus-ci.com/task/6056918689513472
https://cirrus-ci.com/task/5863397261049856
https://cirrus-ci.com/task/5565235765968896
https://cirrus-ci.com/task/6238594598174720

All of them show up as follows in regression.diffs:
diff -U3
/tmp/cirrus-ci-build/src/test/modules/injection_points/expected/basic.out
/tmp/cirrus-ci-build/build/testrun/injection_points/isolation/results/basic.out
---
/tmp/cirrus-ci-build/src/test/modules/injection_points/expected/basic.out
2024-10-24 02:36:24.06859 +
+++
/tmp/cirrus-ci-build/build/testrun/injection_points/isolation/results/basic.out
2024-10-24 02:40:09.179695000 +
@@ -13,16 +13,16 @@

 (1 row)
 
-step wait1: <... completed>
-injection_points_run
-
-
-(1 row)
-
 step detach2: SELECT injection_points_detach('injection-points-wait');
 injection_points_detach
 ---

+(1 row)
+
+step wait1: <... completed>
+injection_points_run
+
+
 (1 row)

This is in the first permutation of the test done with "wait1 wakeup2
detach2", and the diff means that the backend running the "wait"
callback is reported as finished after the detach is done,
injection_points_run being only used for the waits.  Hence the wait is
so slow to finish that the detach has time to complete and finish,
breaking the output.

And here comes the puzzling part: all of failures involve FreeBSD 13
in the CI.  Reproducing this failure would not be difficult, I thought
first; we can add a hardcoded pg_usleep() to delay the end of
injection_wait() so as we make sure that the backend doing the wait
reports later than the detach.  Or just do the same at the end of
injection_points_run() once the callback exits.  I've sure done that,
placing some strategic pg_usleep() calls on Linux to make the paths
that matter in the wait slower, but the test remains stable.  The CI
on Linux is stable as well: 3rd and 4th columns of the cfbot are
green, I did not spot any failures related to this isolation test in
injection_points.  Only the second column about FreeBSD is going rogue
on a periodic basis.

One fix would be to remove this first permutation test, but that's
hiding the problem rather than solving it, and something looks wrong
with condition variables, specific to FreeBSD?  Hence, I'm puzzled
by that.

I am going to remove the first permutation anyway to stabilize the CI
reports.  But it feels like we have a different problem here, and I am
not sure what.

Any thoughts or comments?
--
Michael


signature.asc
Description: PGP signature


Re: pgbench: Improve result outputs related to failed transactinos

2024-10-24 Thread Tatsuo Ishii
Hello Alexander,

> Hello Tatsuo-san,
> 
> 11.10.2024 07:54, Tatsuo Ishii wrote:
> ...
>   - number of transactions actually pocessed: 1 (tps = 410.846343)
> ...
>> Patch pushed.
> 
> Please consider fixing a typo sneaked in that commit: pocessed ->
> processed?

Thank you for the report! I have pushed a fix.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




RE: Conflict detection for update_deleted in logical replication

2024-10-24 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

> Here is the V5 patch set which addressed above comments.

Thanks for updating the patch! While reviewing it, I found a corner case in which
a recently deleted tuple cannot be detected when an index scan is chosen.
This can happen when indexes are rebuilt during replication.
Unfortunately, I don't have a solution for it.

Found issue


When indexes are built with the CONCURRENTLY option, a standard MVCC snapshot
is used to collect the tuples of the table, which means the new index ignores
recently deleted tuples.

This can cause an update_deleted conflict to be wrongly detected as update_missing.
Assume that we have a bidirectional cluster like case 1 [1] (id is a primary key),
and REINDEX CONCURRENTLY happens after executing the DELETE but before receiving
the UPDATE.
The primary key index of t will be rebuilt by the REINDEX command, but the dead
tuples left by the DELETE will be missed because they are invisible to the
transaction.
Then, the apply worker receives the UPDATE and recognizes that the target tuple
has been removed.
It scans with SnapshotAny to find the deleted tuple via the index, but it fails.
This event is reported as update_missing.
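
To make the snapshot aspect concrete, here is a single-node SQL sketch of the
visibility issue (not the full bidirectional setup; the table name matches the
steps below):

CREATE TABLE tab (a int PRIMARY KEY);
INSERT INTO tab VALUES (1);
DELETE FROM tab;                  -- leaves a dead tuple behind until vacuum
REINDEX TABLE CONCURRENTLY tab;   -- the rebuilt index is based on an MVCC snapshot,
                                  -- so the dead tuple is not present in the new index
-- a later index scan with SnapshotAny can no longer reach that dead tuple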

Reproduce steps
===

This can be reproduced with v5 patch set:

1. constructed a 2-way replication by running attached.
2. stopped an apply worker on node2
3. executed `UPDATE tab SET a = 2;` on node1.
4. executed `DELETE FROM tab;` on node 2
5. executed `REINDEX TABLE CONCURRENTLY tab;` on node2
6. resumed the stopped apply worker
7. the worker should detect update_deleted, but it detected update_missing

```
LOG:  conflict detected on relation "public.tab": conflict=update_missing
DETAIL:  Could not find the row to be updated.
Remote tuple (2); replica identity (a)=(1).
```

[1]: 
https://www.postgresql.org/message-id/OS0PR01MB5716A78EE3D6F2F4754E1209946C2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED



test_1025.sh
Description: test_1025.sh


Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2024-10-24 Thread Tender Wang
Alexander Lakhin  于2024年10月24日周四 22:00写道:

> Hello Alvaro,
>
> 22.10.2024 17:32, Alvaro Herrera wrote:
> > Yeah.  I pushed these patches finally, thanks!
>
> Please look at a new anomaly introduced with 53af9491a. When running the
> following script:
> CREATE TABLE t (a int, b int, PRIMARY KEY (a, b));
> CREATE TABLE pt (a int, b int, FOREIGN KEY (a, b) REFERENCES t(a, b))
>   PARTITION BY LIST (a);
>
> CREATE TABLE tp1 (x int, a int, b int);
> ALTER TABLE tp1 DROP COLUMN x;
>
> ALTER TABLE pt ATTACH PARTITION tp1 FOR VALUES IN (1);
>
> ALTER TABLE pt DETACH PARTITION tp1;
>
> I get a memory access error detected by Valgrind:
> 2024-10-24 12:05:04.645 UTC [1079077] LOG:  statement: ALTER TABLE pt
> DETACH PARTITION tp1;
> ==00:00:00:07.887 1079077== Invalid read of size 2
> ==00:00:00:07.887 1079077==at 0x4A61DD: DetachPartitionFinalize
> (tablecmds.c:19545)
> ==00:00:00:07.887 1079077==by 0x4A5C11: ATExecDetachPartition
> (tablecmds.c:19386)
> ==00:00:00:07.887 1079077==by 0x48561E: ATExecCmd (tablecmds.c:5540)
> ==00:00:00:07.887 1079077==by 0x4845DE: ATRewriteCatalogs
> (tablecmds.c:5203)
> ==00:00:00:07.887 1079077==by 0x4838EC: ATController (tablecmds.c:4758)
> ==00:00:00:07.887 1079077==by 0x4834F1: AlterTable (tablecmds.c:4404)
> ==00:00:00:07.887 1079077==by 0x7D6D52: ProcessUtilitySlow
> (utility.c:1318)
> ==00:00:00:07.887 1079077==by 0x7D65F7: standard_ProcessUtility
> (utility.c:1067)
> ==00:00:00:07.887 1079077==by 0x7D54F7: ProcessUtility (utility.c:523)
> ==00:00:00:07.887 1079077==by 0x7D3D70: PortalRunUtility
> (pquery.c:1158)
> ==00:00:00:07.887 1079077==by 0x7D3FE7: PortalRunMulti (pquery.c:1316)
> ==00:00:00:07.887 1079077==by 0x7D3431: PortalRun (pquery.c:791)
>
> Reproduced on REL_15_STABLE .. master.
>

Sorry, I can't reproduce this leak on master.

-- 
Thanks,
Tender Wang


cache lookup failed when \d t concurrent with DML change column data type

2024-10-24 Thread jian he
hi. I think I found a bug.
PostgreSQL 18devel_debug_build_45188c2ea2 on x86_64-linux, compiled by
gcc-14.1.0, 64-bit
commit at 45188c2ea2.
Ubuntu 22.04.4 LTS


setup:
drop table t cascade;
create table t(a int PRIMARY key);

IN session1:
step "change data type" {begin; alter table t alter column a set data
type int4;}
step "s1" {commit;}

IN session2:
step "psql_another_session" {\d t}

permutation "change data type" "psql_another_session" "s1"


 \set ECHO_HIDDEN on
/******** QUERY *********/
SELECT c2.relname, i.indisprimary, i.indisunique, i.indisclustered,
i.indisvalid, pg_catalog.pg_get_indexdef(i.indexrelid, 0, true),
  pg_catalog.pg_get_constraintdef(con.oid, true), contype,
condeferrable, condeferred, i.indisreplident, c2.reltablespace,
con.conperiod
FROM pg_catalog.pg_class c, pg_catalog.pg_class c2, pg_catalog.pg_index i
  LEFT JOIN pg_catalog.pg_constraint con ON (conrelid = i.indrelid AND
conindid = i.indexrelid AND contype IN ('p','u','x'))
WHERE c.oid = '34405' AND c.oid = i.indrelid AND i.indexrelid = c2.oid
ORDER BY i.indisprimary DESC, c2.relname;
/*************************/

ERROR:  cache lookup failed for attribute 1 of relation 34418




Re: general purpose array_sort

2024-10-24 Thread Aleksander Alekseev
Hi,

> I can accept this outcome though an optional three-valued boolean sort order 
> (ascending and descending only) I'd argue is worth keeping.  null value 
> placement too I guess, three-valued boolean (nulls_first).

Perhaps these optional arguments deserve separate discussions. I
suggest merging something everyone agrees on first. This will simplify
the review process and allow us to deliver value to the users quickly.
Arguments like `reverse => true` and `nulls_first => true` can always
be implemented and added as separate patches.
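
For illustration only, the call shapes being discussed might look like this
(the function name and optional parameters are hypothetical and not settled):

SELECT array_sort(ARRAY[3,1,2]);                          -- default ascending order
SELECT array_sort(ARRAY[3,1,2], reverse => true);         -- possible optional argument
SELECT array_sort(ARRAY[3,NULL,2], nulls_first => true);  -- possible optional argument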

-- 
Best regards,
Aleksander Alekseev




Re: Enhancing Memory Context Statistics Reporting

2024-10-24 Thread torikoshia

On 2024-10-24 14:59, Rahila Syed wrote:

Hi Torikoshia,

Thank you for reviewing the patch!

On Wed, Oct 23, 2024 at 9:28 AM torikoshia
 wrote:


On 2024-10-22 03:24, Rahila Syed wrote:

Hi,

PostgreSQL provides the following capabilities for reporting memory
context statistics:
1. pg_get_backend_memory_contexts(); [1]
2. pg_log_backend_memory_contexts(pid); [2]

[1] provides a view of memory context statistics for a local backend,
while [2] prints the memory context statistics of any backend or auxiliary
process to the PostgreSQL logs. Although [1] offers detailed statistics,
it is limited to the local backend, restricting its use to PostgreSQL
client backends only.
On the other hand, [2] provides the statistics for all backends but
logs them in a file, which may not be convenient for quick access.

I propose enhancing memory context statistics reporting by combining these
capabilities and offering a view of memory statistics for all PostgreSQL
backends and auxiliary processes.


Thanks for working on this!

I originally tried to develop something like your proposal in [2], but
there were some difficulties, and I settled on implementing
pg_log_backend_memory_contexts().


Yes. I am revisiting this problem :)


Attached is a patch that implements this functionality. It introduces
a SQL function that takes the PID of a backend as an argument, returning
a set of records, each containing statistics for a single memory context.
The underlying C function sends a signal to the backend and waits for it
to publish its memory context statistics before returning them to the user.
The publishing backend copies these statistics during the next
CHECK_FOR_INTERRUPTS call.


I remember that waiting for the memory context stats dump could cause
trouble in some erroneous cases.

For example, if, just after the target process finishes dumping stats, the
pg_get_remote_backend_memory_contexts() caller is terminated before reading
the stats, calling pg_get_remote_backend_memory_contexts() gets no response
any more:

[session1]$ psql
(40699)=#

$ kill -s SIGSTOP 40699

[session2] psql
(40866)=# select * FROM
pg_get_remote_backend_memory_contexts('40699', false); -- waiting

$ kill -s SIGSTOP 40866

$ kill -s SIGCONT 40699

[session3] psql
(47656) $ select pg_terminate_backend(40866);

$ kill -s SIGCONT 40866 -- session2 terminated

[session3] (47656)=# select * FROM
pg_get_remote_backend_memory_contexts('47656', false); -- no response

It seems the reason is that memCtxState->in_use is still true and
memCtxState->proc_id is 40699.
We can continue to use pg_get_remote_backend_memory_contexts() after
specifying 40699, but that would be hard for users to understand.


Thanks for testing and reporting. While I am not able to reproduce this
problem, I think this may be happening because the requesting backend/caller
is terminated before it gets a chance to mark memCtxState->in_use as false.


Yeah, when I attached a debugger to 47656 when it was waiting on 
pg_get_remote_backend_memory_contexts('47656', false), 
memCtxState->in_use was true as you suspected:


  (lldb) p memCtxState->in_use
  (bool) $1 = true
  (lldb) p memCtxState->proc_id
  (int) $2 = 40699
  (lldb) p pid
  (int) $3 = 47656


In this case memCtxState->in_use should be marked as
'false' possibly during the processing of ProcDiePending in
ProcessInterrupts().


This approach facilitates on-demand publication of memory statistics
for a specific backend, rather than collecting them at regular intervals.
Since past memory context statistics may no longer be relevant,
there is little value in retaining historical data. Any collected statistics
can be discarded once read by the client backend.

A fixed-size shared memory block, currently accommodating 30 records,
is used to store the statistics. This number was chosen arbitrarily,
as it covers all parent contexts at level 1 (i.e., direct children of
the top memory context) based on my tests.
Further experiments are needed to determine the optimal number
for summarizing memory statistics.

Any additional statistics that exceed the shared memory capacity
are written to a file per backend in the PG_TEMP_FILES_DIR. The client
backend first reads from the shared memory, and if necessary, retrieves
the remaining data from the file, combining everything into a unified view.
The files are cleaned up automatically if a backend crashes or during
server restarts.

The statistics are reported in a breadth-first search order of the
memory context tree, with parent contexts reported before their children.
This provides a cumulative summary before diving into the details of each
child context's consumption.


The rationale behind the shared memory chunk is to ensure that the
majority of contexts, which are the direct children of TopMemoryContext,
fit into memory.
This allows a client to request a summary of memory statistics,
which can be served from memory without the overhead of file access,
unless necessary.

A publishing back

Re: cache lookup failed when \d t concurrent with DML change column data type

2024-10-24 Thread Andrei Lepikhov

On 10/24/24 22:30, jian he wrote:

hi. I think I found a bug.
PostgreSQL 18devel_debug_build_45188c2ea2 on x86_64-linux, compiled by
gcc-14.1.0, 64-bit
commit at 45188c2ea2.
Ubuntu 22.04.4 LTS


setup:
drop table t cascade;
create table t(a int PRIMARY key);

IN session1:
step "change data type" {begin; alter table t alter column a set data
type int4;}
step "s1" {commit;}

IN session2:
step "psql_another_session" {\d t}

permutation "change data type" "psql_another_session" "s1"



ERROR:  cache lookup failed for attribute 1 of relation 34418
Yes, it looks like a bug existing for a long time, at least since PG11 
(I didn't trace further down).
It seems that the backend didn't apply invalidation messages before 
touching system caches. Backtrace:


in get_attoptions (relid=16388, attnum=1) at lsyscache.c:982
in pg_get_indexdef_worker (indexrelid=16388, colno=0, excludeOps=0x0, 
attrsOnly=false, keysOnly=false, showTblSpc=false, inherits=false, 
prettyFlags=7, missing_ok=true) at ruleutils.c:1458

in pg_get_indexdef_ext (fcinfo=0x55a15acc1c18) at ruleutils.c:1202
in ExecInterpExpr (state=0x55a15acc2a10, econtext=0x55a15ac62930, 
isnull=0x7fffd66a5bcf) at execExprInterp.c:770
in ExecInterpExprStillValid (state=0x55a15acc2a10, 
econtext=0x55a15ac62930, isNull=0x7fffd66a5bcf) at execExprInterp.c:2035
in ExecEvalExprSwitchContext (state=0x55a15acc2a10, 
econtext=0x55a15ac62930, isNull=0x7fffd66a5bcf) at 
../../../src/include/executor/executor.h:367
in ExecProject (projInfo=0x55a15acc2a08) at 
../../../src/include/executor/executor.h:401


--
regards, Andrei Lepikhov





Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2024-10-24 Thread Tender Wang
Alexander Lakhin  于2024年10月24日周四 22:00写道:

> Hello Alvaro,
>
> 22.10.2024 17:32, Alvaro Herrera wrote:
> > Yeah.  I pushed these patches finally, thanks!
>
> Please look at a new anomaly introduced with 53af9491a. When running the
> following script:
> CREATE TABLE t (a int, b int, PRIMARY KEY (a, b));
> CREATE TABLE pt (a int, b int, FOREIGN KEY (a, b) REFERENCES t(a, b))
>   PARTITION BY LIST (a);
>
> CREATE TABLE tp1 (x int, a int, b int);
> ALTER TABLE tp1 DROP COLUMN x;
>
> ALTER TABLE pt ATTACH PARTITION tp1 FOR VALUES IN (1);
>
> ALTER TABLE pt DETACH PARTITION tp1;
>
> I get a memory access error detected by Valgrind:
> 2024-10-24 12:05:04.645 UTC [1079077] LOG:  statement: ALTER TABLE pt
> DETACH PARTITION tp1;
> ==00:00:00:07.887 1079077== Invalid read of size 2
> ==00:00:00:07.887 1079077==at 0x4A61DD: DetachPartitionFinalize
> (tablecmds.c:19545)
> ==00:00:00:07.887 1079077==by 0x4A5C11: ATExecDetachPartition
> (tablecmds.c:19386)
> ==00:00:00:07.887 1079077==by 0x48561E: ATExecCmd (tablecmds.c:5540)
> ==00:00:00:07.887 1079077==by 0x4845DE: ATRewriteCatalogs
> (tablecmds.c:5203)
> ==00:00:00:07.887 1079077==by 0x4838EC: ATController (tablecmds.c:4758)
> ==00:00:00:07.887 1079077==by 0x4834F1: AlterTable (tablecmds.c:4404)
> ==00:00:00:07.887 1079077==by 0x7D6D52: ProcessUtilitySlow
> (utility.c:1318)
> ==00:00:00:07.887 1079077==by 0x7D65F7: standard_ProcessUtility
> (utility.c:1067)
> ==00:00:00:07.887 1079077==by 0x7D54F7: ProcessUtility (utility.c:523)
> ==00:00:00:07.887 1079077==by 0x7D3D70: PortalRunUtility
> (pquery.c:1158)
> ==00:00:00:07.887 1079077==by 0x7D3FE7: PortalRunMulti (pquery.c:1316)
> ==00:00:00:07.887 1079077==by 0x7D3431: PortalRun (pquery.c:791)
>
> Reproduced on REL_15_STABLE .. master.
>

Thanks for reporting. I can reproduce this invalid memory access on HEAD.
After debugging the code, I found the root cause.
In DetachPartitionFinalize(), in the code below:
att = TupleDescAttr(RelationGetDescr(partRel),
   attmap->attnums[conkey[i] - 1] - 1);

We should use confkey[i] - 1, not conkey[i] - 1.
In this case, the length of attmap is 2, conkey = {2,3}, confkey = {1,2}, so
attmap->attnums[conkey[1] - 1] triggers
the invalid memory access.

In this block, we process the referenced side, so we should use confkey.
I attach a patch to fix this.

--
Thanks,
Tender Wang


0001-Fix-memory-invalid-access-due-to-using-wrong-conkey.patch
Description: Binary data


Useless field ispartitioned in CreateStmtContext

2024-10-24 Thread hugo
Hi!

   When looking at the partition-related code, I found that the ispartitioned
field in CreateStmtContext is not used. It looks like we can safely remove it
and avoid invalid assignment logic.

Here's a very simple fix, any suggestion?

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 1e15ce10b48..6dea85cc2dc 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -89,7 +89,6 @@ typedef struct
     List       *alist;          /* "after list" of things to do after creating
                                  * the table */
     IndexStmt  *pkey;           /* PRIMARY KEY index, if any */
-    bool        ispartitioned;  /* true if table is partitioned */
     PartitionBoundSpec *partbound;  /* transformed FOR VALUES */
     bool        ofType;         /* true if statement contains OF typename */
 } CreateStmtContext;
@@ -246,7 +245,6 @@ transformCreateStmt(CreateStmt *stmt, const char *queryString)
     cxt.blist = NIL;
     cxt.alist = NIL;
     cxt.pkey = NULL;
-    cxt.ispartitioned = stmt->partspec != NULL;
     cxt.partbound = stmt->partbound;
     cxt.ofType = (stmt->ofTypename != NULL);

@@ -3401,7 +3399,6 @@ transformAlterTableStmt(Oid relid, AlterTableStmt *stmt,
     cxt.blist = NIL;
     cxt.alist = NIL;
     cxt.pkey = NULL;
-    cxt.ispartitioned = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
     cxt.partbound = NULL;
     cxt.ofType = false;

--
Best regards,
hugozhang



Re: execute prepared statement passing parameter expression with COLLATE clause

2024-10-24 Thread Tender Wang
jian he  于2024年10月24日周四 16:56写道:

> hi.
>
> $Subject setup
>
> CREATE COLLATION case_insensitive (provider = icu, locale =
> '@colStrength=secondary', deterministic = false);
> CREATE COLLATION ignore_accents (provider = icu, locale =
> '@colStrength=primary;colCaseLevel=yes', deterministic = false);
> DROP TABLE IF EXISTS pktable cascade;
> CREATE TABLE pktable (x text COLLATE case_insensitive);
> INSERT INTO pktable VALUES ('A');
> DEALLOCATE q6;
> PREPARE q6 AS SELECT * FROM pktable WHERE x = $1;
>
>
> select * from pktable where x = 'Å' collate ignore_accents;
> --return one row
>
> execute q6('Å' collate ignore_accents);
> --return zero rows
>
> not sure return zero rows is desired.
>
>
postgres=# explain execute q6('Å' collate ignore_accents);
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on pktable  (cost=0.00..27.00 rows=7 width=32)
   Filter: (x = 'Å'::text)
(2 rows)

postgres=# explain select * from pktable where x = 'Å' collate
ignore_accents;
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on pktable  (cost=0.00..27.00 rows=7 width=32)
   Filter: (x = 'Å'::text COLLATE ignore_accents)
(2 rows)

The filter expressions in the two queries are different. I debugged texteq;
the collid is also different.
So the results of the two queries differ. I haven't looked into EXECUTE in
more detail.


-- 
Thanks,
Tender Wang


Re: Fix for consume_xids advancing XIDs incorrectly

2024-10-24 Thread Fujii Masao




On 2024/10/24 5:23, Masahiko Sawada wrote:

 if (xids_left > 2000 &&
 consumed - last_reported_at < REPORT_INTERVAL &&
 MyProc->subxidStatus.overflowed)
 {
 int64   consumed_by_shortcut = consume_xids_shortcut();

 if (consumed_by_shortcut > 0)
 {
 consumed += consumed_by_shortcut;
 continue;
 }
 }

By the way, this isn't directly related to the proposed patch, but while reading
the xid_wraparound code, I noticed that consume_xids_common() could potentially
return an unexpected XID if consume_xids_shortcut() returns a value greater
than 2000. Based on the current logic, consume_xids_common() should always 
return
a value less than 2000, so the issue I'm concerned about shouldn't occur.


Good point. Yes, the function doesn't return a value greater than 2000
as long as we do "distance = Min(distance, COMMIT_TS_XACTS_PER_PAGE -
rem);". But it's true with <= 16KB block sizes.


Still,
would it be worth adding an assertion to ensure that consume_xids_common() never
returns a value greater than 2000?


While adding an assertion makes sense to me, another idea is to set
last_xid even in the shortcut path. That way, it works even with 32KB
block size.


Agreed on making xid_wraparound compatible with a 32k block size. As you 
pointed out,
with 32k blocks, XidSkip() can return values greater than 2000. So, the first 
step is
to adjust the following if-condition by increasing "2000" to a higher value.
Since XidSkip() can return up to 3276 (based on COMMIT_TS_XACTS_PER_PAGE 
(BLCKSZ / 10) with 32k blocks),
we could, for instance, update "2000" to "4000."

if (xids_left > 2000 &&
consumed - last_reported_at < REPORT_INTERVAL &&
MyProc->subxidStatus.overflowed)


To ensure lastxid is set even in the shortcut case, what about modifying 
XidSkip()
so it returns "distance - 1" instead of "distance"? This change would make
consume_xids_common() run at least one more loop cycle,
triggering GetNewTransactionId() and setting lastxid correctly.


consumed = XidSkip(nextXid);
if (consumed > 0)
TransamVariables->nextXid.value += (uint64) consumed;

BTW, the code snippet above in consume_xids_shortcut() could potentially set
the next XID to a value below FirstNormalTransactionId? If yes, we should 
account for
FirstNormalFullTransactionId when increasing the next XID, similar to
how FullTransactionIdAdvance() handles it.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Better error reporting from extension scripts (Was: Extend ALTER OPERATOR)

2024-10-24 Thread Tom Lane
In the no-good-deed-goes-unpunished department: buildfarm member
hamerkop doesn't like this patch [1].  The diffs look like

@@ -77,7 +77,7 @@
 ERROR:  syntax error at or near "FUNCTIN"
 LINE 1: CREATE FUNCTIN my_erroneous_func(int) RETURNS int LANGUAGE S...
^
-QUERY:  CREATE FUNCTIN my_erroneous_func(int) RETURNS int LANGUAGE SQL
+QUERY:  CREATE FUNCTIN my_erroneous_func(int) RETURNS int LANGUAGE SQL
 AS $$ SELECT $1 + 1 $$;
 CONTEXT:  extension script file "test_ext7--2.0--2.1bad.sql", near line 10
 alter extension test_ext7 update to '2.2bad';

It's hard to be totally sure from the web page, but I suppose what
is happening is that the newline within the quoted query fragment
is represented as "\r\n" not just "\n".  (I wonder why the cfbot
failed to detect this; there must be more moving parts involved
than just "it's Windows".)

The reason why this is happening seems fairly clear: extension.c's
read_whole_file() opens the script file with PG_BINARY_R, preventing
Windows' libc from hiding DOS-style newlines from us, even though the
git checkout on that machine is evidently using DOS newlines.

That seems like a rather odd choice.  Elsewhere in the same module,
parse_extension_control_file() opens control files with plain "r",
so that those act differently.  I checked the git history and did
not learn much.  The original extensions commit d9572c4e3 implemented
reading with a call to read_binary_file(), and we seem to have just
carried that behavioral decision forward through various refactorings.
I don't recall if there was an intentional choice to use binary read
or that was just a random decision to use an available function.

So what I'd like to do to fix this is to change

-   if ((file = AllocateFile(filename, PG_BINARY_R)) == NULL)
+   if ((file = AllocateFile(filename, "r")) == NULL)

The argument against that is it creates a nonzero chance of breaking
somebody's extension script file on Windows.  But there's a
counter-argument that it might *prevent* bugs on Windows, by making
script behavior more nearly identical to what it is on not-Windows.
So I think that's kind of a wash.

Other approaches we could take with perhaps-smaller blast radii
include making script_error_callback() trim \r's out of the quoted
text (ugly) or creating a variant expected-file (hard to maintain,
and I wonder if it'd survive git-related newline munging).

Thoughts?

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-10-23%2011%3A00%3A37




Re: Useless field ispartitioned in CreateStmtContext

2024-10-24 Thread Alena Rybakina

Hi!

On 24.10.2024 16:07, hugo wrote:


Hi!

   When looking at the partition-related code, I found that the ispartitioned
field in CreateStmtContext is not used. It looks like we can safely
remove it and avoid invalid assignment logic.

Here's a very simple fix, any suggestion?

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 1e15ce10b48..6dea85cc2dc 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -89,7 +89,6 @@ typedef struct
     List       *alist;          /* "after list" of things to do after creating
                                  * the table */
     IndexStmt  *pkey;           /* PRIMARY KEY index, if any */
-    bool        ispartitioned;  /* true if table is partitioned */
     PartitionBoundSpec *partbound;  /* transformed FOR VALUES */
     bool        ofType;         /* true if statement contains OF typename */
 } CreateStmtContext;
@@ -246,7 +245,6 @@ transformCreateStmt(CreateStmt *stmt, const char *queryString)
     cxt.blist = NIL;
     cxt.alist = NIL;
     cxt.pkey = NULL;
-    cxt.ispartitioned = stmt->partspec != NULL;
     cxt.partbound = stmt->partbound;
     cxt.ofType = (stmt->ofTypename != NULL);
@@ -3401,7 +3399,6 @@ transformAlterTableStmt(Oid relid, AlterTableStmt *stmt,
     cxt.blist = NIL;
     cxt.alist = NIL;
     cxt.pkey = NULL;
-    cxt.ispartitioned = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
     cxt.partbound = NULL;
     cxt.ofType = false;

I absolutely agree with you. I found that the ispartitioned parameter has
been defined but is never used during optimization.

I also noticed that a local copy of it is being redefined in the
ATExecDropIdentity, ATExecSetIdentity and ATExecAddIdentity functions.

So, I'm willing to agree with you.

BTW, the regression tests were successful.

--
Regards,
Alena Rybakina
Postgres Professional


Re: execute prepared statement passing parameter expression with COLLATE clause

2024-10-24 Thread Tom Lane
jian he  writes:
> select * from pktable where x = 'Å' collate ignore_accents;
> --return one row

> execute q6('Å' collate ignore_accents);
> --return zero rows

> not sure return zero rows is desired.


The parameter symbol just represents a value, which does not
carry any collation information.  The collation to use was
determined when the prepared statement was parsed, and is not
going to change on the basis of what you write in EXECUTE.
We could have a discussion about whether this is desirable,
but it's pretty much moot, because this is how the SQL committee
designed SQL's collation feature.  It's not going to change.
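
For illustration, a minimal sketch reusing the tables from upthread: the
COLLATE clause has to be part of the prepared statement's text to affect
the comparison.

PREPARE q7 AS SELECT * FROM pktable WHERE x = $1 COLLATE ignore_accents;
EXECUTE q7('Å');   -- returns the row, unlike q6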

regards, tom lane




Re: general purpose array_sort

2024-10-24 Thread Tom Lane
"David G. Johnston"  writes:
> Composing function calls here seems quite undesirable from a performance
> standpoint, but maybe I over-estimate the cost of
> exploding-manipulating-freezing an array datum.  Also, while I'm not in a
> good position to judge the challenge of implementing the sort parameters I
> would rather have them than not since the order by API has them (plus
> performance).  I also, maybe unreasonably, believe that our extensible type
> system has already limited the maintenance burden.

Oh!  I had not actually looked at the details of what was being
proposed here.  I imagined "array_sort(anyarray)" and the sort would
use the default sort order for the array's element type.  This
business with a textual representation of a sort clause seems like
over-engineering ... except that it's also under-engineered, because
the parsing is lame and incomplete.  (No USING option, and the fact
that collation comes from somewhere else seems impossibly confusing.)
Let's drop that.  As already noted, if you need a non-default sort
order you can build what you want with a sub-select.  The ambition
here should be to make easy cases easy.
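
For example, a minimal sketch of the sub-select idiom for a non-default order:

SELECT ARRAY(SELECT x FROM unnest(ARRAY[2,1,3]) AS x ORDER BY x DESC);
-- {3,2,1}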

regards, tom lane




vacuumdb --analyze-only (e.g., after a major upgrade) vs. partitioned tables: pg_statistic missing stats for the partitioned table itself

2024-10-24 Thread Nikolay Samokhvalov
I just learned that vacuumdb --analyze-only doesn't update stats for the
partitioned table itself, taking care only about individual partitions:

(DDL doesn't matter here)

# vacuumdb --analyze-only -U postgres test --verbose
...
INFO:  analyzing "public.measurement_2023_01"
INFO:  "measurement_2023_01": scanned 6370 of 6370 pages, containing
100 live rows and 0 dead rows; 3 rows in sample, 100 estimated
total rows
INFO:  "measurement_2023_02": scanned 6257 of 6257 pages, containing 982279
live rows and 0 dead rows; 3 rows in sample, 982279 estimated total rows
INFO:  analyzing "public.measurement_2023_03"
INFO:  "measurement_2023_03": scanned 6483 of 6483 pages, containing
1017721 live rows and 0 dead rows; 3 rows in sample, 1017721 estimated
total rows
...

test=# select starelid::regclass, count(*) from pg_statistic where
starelid::regclass::text ~ 'measurement' group by 1 order by 1;
      starelid       | count
---------------------+-------
 measurement_2023_01 |     4
 measurement_2023_02 |     4
 measurement_2023_03 |     4
(3 rows)

While for the single-threaded SQL-level ANALYZE:

test=# analyze verbose measurement;
...
test=# select starelid::regclass, count(*) from pg_statistic where
starelid::regclass::text ~ 'measurement' group by 1 order by 1;
      starelid       | count
---------------------+-------
 measurement         |     4
 measurement_2023_01 |     4
 measurement_2023_02 |     4
 measurement_2023_03 |     4
(4 rows)

This means that if, after running pg_upgrade, we use vacuumdb to update
stats faster, some stats may be missing, potentially leading to suboptimal
performance.

Additionally, it doesn't help that pg_stat_all_tables doesn't show
counters/timestamps for partitioned table, even after SQL-level ANALYZE:

test=# select relname, analyze_count, autoanalyze_count, last_analyze,
last_autoanalyze from pg_stat_user_tables where relname ~ 'measurement';
       relname       | analyze_count | autoanalyze_count |         last_analyze          | last_autoanalyze
---------------------+---------------+-------------------+-------------------------------+------------------
 measurement_2023_01 |             2 |                 0 | 2024-10-24 21:25:47.979958+00 |
 measurement_2023_02 |             2 |                 0 | 2024-10-24 21:25:48.070355+00 |
 measurement_2023_03 |             2 |                 0 | 2024-10-24 21:25:48.154613+00 |
(3 rows)

This is also discussed in
https://www.postgresql.org/message-id/flat/CADofcAXVbD0yGp_EaC9chmzsOoSai3jcfBCnyva3j0RRdRvMVA%40mail.gmail.com

I propose considering 3 fixes:

1) vacuumdb --analyze / --analyze-only to update stats for the partitioned
table, so people using pg_upgrade are not in trouble
2) present the ANALYZE metadata for partitioned tables in pg_stat_all_tables
3) for old versions, either backpatch with fix (1) OR just add to the docs
(and maybe to the final words pg_upgrade prints), suggesting something like
this in addition to vacuumdb analyze-only:

-- psql snippet
select format(
  'analyze verbose %I.%I;',
  relnamespace::oid::regnamespace,
  oid::regclass
) as vacuum_command
from pg_class
where relkind = 'p' \gexec

Additionally, I do like the idea of ANALYZE ONLY from the -general
discussion above (though, there might be confusion with logic of --analyze
and --analyze-only in vacuumdb).

Does it make sense?

Nik


Re: Using read_stream in index vacuum

2024-10-24 Thread Rahila Syed
Hi Andrey,

I ran the following test with
v7-0001-Prototype-B-tree-vacuum-streamlineing.patch
to measure the performance improvement.

--Table size of approx 2GB (Fits in RAM)
postgres=# create unlogged table x_big as select i from
generate_series(1,6e7) i;
SELECT 6000
postgres=# create index on x_big(i);
CREATE INDEX
-- Perform updates to create dead tuples.
postgres=# do $$
declare
var int := 0;
begin
  for counter in 1 .. 1e7 loop
var := (SELECT floor(random() * (1e7 - 1 + 1) * 1));
UPDATE x_big SET i = i + 5 WHERE i = var;
  end loop;
end;
$$;
postgres=# CREATE EXTENSION pg_buffercache;
CREATE EXTENSION
-- Evict Postgres buffer cache for this relation.
postgres=# SELECT DISTINCT pg_buffercache_evict(bufferid)
  FROM pg_buffercache
 WHERE relfilenode = pg_relation_filenode('x_big');
 pg_buffercache_evict
--
 t
(1 row)

postgres=# \timing on
Timing is on.
postgres=# vacuum x_big;
VACUUM

The timing does not seem to have improved with the patch.
Timing with the patch:  Time: 9525.696 ms (00:09.526)
Timing without the patch:  Time: 9468.739 ms (00:09.469)

While writing this email, I realized I evicted buffers for the table
and not the index. I will perform the test again. However,
I would like to know your opinion on whether this looks like
a valid test.
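
In case it helps, a minimal sketch of evicting the index's buffers instead,
assuming the default index name generated by the setup above:

SELECT DISTINCT pg_buffercache_evict(bufferid)
  FROM pg_buffercache
 WHERE relfilenode = pg_relation_filenode('x_big_i_idx');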

Thank you,
Rahila Syed


On Thu, Oct 24, 2024 at 4:45 PM Andrey M. Borodin 
wrote:

>
>
> > On 24 Oct 2024, at 10:15, Andrey M. Borodin 
> wrote:
> >
> > I've also added GiST vacuum to the patchset.
>
> I decided to add up a SP-GiST while having launched on pgconf.eu.
>
>
> Best regards, Andrey Borodin.
>


Re: Conflict detection for update_deleted in logical replication

2024-10-24 Thread Peter Smith
Hi Hou-san, here are my review comments for patch v5-0001.

==
General

1.
Sometimes in the commit message and code comments the patch refers to
"transaction id" and other times to "transaction ID". The patch should
use the same wording everywhere.

==
Commit message.

2.
"While for concurrent remote transactions with earlier timestamps,..."

I think this means:
"But, for concurrent remote transactions with earlier timestamps than
the DELETE,..."

Maybe expressed this way is clearer.

~~~

3.
... the resolution would be to convert the update to an insert.

Change this to uppercase like done elsewhere:
"... the resolution would be to convert the UPDATE to an INSERT.

==
doc/src/sgml/protocol.sgml

4.
+   
+Primary WAL status update (B)
+
+ 
+  
+   Byte1('s')
+   
+
+ Identifies the message as a primary WAL status update.
+
+   
+  

I felt it would be better if this is described as just a "Primary
status update" instead of a "Primary WAL status update". Doing this
makes it more flexible in case there is a future requirement to put
more status values in here which may not be strictly WAL related.

~~~

5.
+   
+Standby WAL status request (F)
+
+ 
+  
+   Byte1('W')
+   
+
+ Identifies the message as a request for the WAL status
on the primary.
+
+   
+  
+ 
+
+   

5a.
Ditto the previous comment #4. Perhaps you should just call this a
"Primary status request".

~

5b.
Also, The letter 'W' also seems chosen because of WAL. But it might be
more flexible if those identifiers are more generic.

e.g.
's' = the request for primary status update
'S' = the primary status update

==
src/backend/replication/logical/worker.c

6.
+ else if (c == 's')
+ {
+ TimestampTz timestamp;
+
+ remote_lsn = pq_getmsgint64(&s);
+ timestamp = pq_getmsgint64(&s);
+
+ maybe_advance_nonremovable_xid(&remote_lsn, &phase);
+ UpdateWorkerStats(last_received, timestamp, false);
+ }

Since there's no equivalent #define or enum value, IMO it is too hard
to know the intent of this code without already knowing the meaning of
the magic letter 's'. At least there could be a comment here to
explain that this is for handling an incoming "Primary status update"
message.

~~~

maybe_advance_nonremovable_xid:

7.
+ * The oldest_nonremovable_xid is maintained in shared memory to prevent dead
+ * rows from being removed prematurely when the apply worker still needs them
+ * to detect update-delete conflicts.

/update-delete/update_deleted/

~

8.
+ * applied and flushed locally. The process involves:
+ *
+ * DTR_REQUEST_WALSENDER_WAL_POS - Call GetOldestActiveTransactionId() to get
+ * the candidate xmin and send a message to request the remote WAL position
+ * from the walsender.
+ *
+ * DTR_WAIT_FOR_WALSENDER_WAL_POS - Wait for receiving the WAL position from
+ * the walsender.
+ *
+ * DTR_WAIT_FOR_LOCAL_FLUSH - Advance the non-removable transaction ID if the
+ * current flush location has reached or surpassed the received WAL position.

8a.
This part would be easier to read if those 3 phases were indented from
the rest of this function comment.

~

8b.
/Wait for receiving/Wait to receive/

~

9.
+ * Retaining the dead tuples for this period is sufficient for ensuring
+ * eventual consistency using last-update-wins strategy, which involves
+ * converting an UPDATE to an INSERT and applying it if remote transactions

The commit message referred to a "latest_timestamp_wins". I suppose
that is the same as what this function comment called
"last-update-wins". The patch should use consistent terminology.

It would be better if the commit message and (parts of) this function
comment were just cut/pasted to be identical. Currently, they seem to
be saying the same thing, but using slightly different wording.

~

10.
+ static TimestampTz xid_advance_attemp_time = 0;
+ static FullTransactionId candidate_xid;

typo in var name - "attemp"

~

11.
+ *phase = DTR_WAIT_FOR_LOCAL_FLUSH;
+
+ /*
+ * Do not return here because the apply worker might have already
+ * applied all changes up to remote_wal_pos. Proceeding to the next
+ * phase to check if we can immediately advance the transaction ID.
+ */

11a.
IMO this comment should be above the *phase assignment.

11b.
/Proceeding to the next phase to check.../Instead, proceed to the next
phase to check.../

~

12.
+ /*
+ * Advance the non-removable transaction id if the remote wal position
+ * has been received, and all transactions up to that position on the
+ * publisher have been applied and flushed locally.
+ */

Some minor re-wording would help clarify this comment.

SUGGESTION
Reaching here means the remote wal position has been received, and all
transactions up to that position on the
publisher have been applied and flushed locally. So, now we can
advance the 

Re: Inconsistent use of relpages = -1

2024-10-24 Thread Laurenz Albe
On Thu, 2024-10-24 at 08:03 -0700, Jeff Davis wrote:
> On Thu, 2024-10-24 at 05:01 +0300, Laurenz Albe wrote:
> > What you write above indicates that "relpages" = 0 and "reltuples" >
> > 0
> > would also be acceptable.
> 
> As Tom pointed out, that creates a risk that it's interpreted as
> infinite tuple density.
> 
> The only functional change in my patch is to create the partitioned
> table with relpages=-1 to be more consistent with the value after it's
> analyzed.

Great, then it cannot be wrong.  Sorry for the noise!

Yours,
Laurenz Albe




Re: Commutation of array SOME/ANY and ALL operators

2024-10-24 Thread Matthew Morrissette Vance
> Inventing commutator operators for LIKE etc could be a path of
> much less resistance (unless the operator names get bikeshedded
> to death).  Are there really that many that people need?
> A quick query of pg_operator suggests that the LIKE/regex family
> is the bulk of the problem for real-world cases.

Sadly, I was afraid that might be the case. It would be so nice if it
weren't, but if we have to, we'll go with the path of least resistance.

We'd need a convention for the reversed version of a command that doesn't
intersect with any other command or SQL keyword.
We would also need a convention for the reversed versions of operators that
don't have a commutator but for which we want to support the opposite order.

The only commands/operators that I think are probably really wanted in that
category would be:
* LIKE
* ILIKE
* ~ (regex match)
* ~* (regex case insensitive match)
* !~ (not regex match)
* !~* (not regex case insensitive match)

Proposed: (prefix the command/operator with ~ and put it before the ! if
it's a "not" expression)
* EKIL (LIKE backwards) or RLIKE or CLIKE or ~LIKE
* EKILI (ILIKE backwards) or RILIKE or CILIKE or ~ILIKE
* ~~
* ~~* or ~*
* !~~
* !~~* or !~* or *~!
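
For what it's worth, a user-level version of such a commutator can already be
sketched with CREATE OPERATOR; everything below (the function and operator
names in particular) is made up for illustration:

CREATE FUNCTION text_revlike(pattern text, val text) RETURNS boolean
    LANGUAGE sql IMMUTABLE STRICT
    AS $$ SELECT val LIKE pattern $$;

CREATE OPERATOR ~~~ (FUNCTION = text_revlike, LEFTARG = text, RIGHTARG = text);

SELECT 'f%' ~~~ ANY (ARRAY['foo', 'bar']);   -- true, because 'foo' LIKE 'f%'

That doesn't give the planner a real commutator link or any index support, of
course, which is why built-in commutators would still be nicer.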

On Wed, Oct 23, 2024 at 6:57 PM Tom Lane  wrote:

> Matthew Morrissette Vance  writes:
> > If instead, PostgreSQL could support the commutation of the `SOME/ANY`
> and
> > `ALL` operators so that the `ANY(array)` could be on both sides of the
> > provided operator, it would allow for this kind of searching natively.
>
> > Firstly, would a PR that enhanced PostgreSQL in this manner be accepted?
>
> My gut feeling is you'll run into insurmountable grammar-ambiguity
> problems.  I might be wrong, but I have an idea that this has
> already been tried and failed on that point.
>
> Inventing commutator operators for LIKE etc could be a path of
> much less resistance (unless the operator names get bikeshedded
> to death).  Are there really that many that people need?
> A quick query of pg_operator suggests that the LIKE/regex family
> is the bulk of the problem for real-world cases.
>
> regards, tom lane
>


Re: Can rs_cindex be < 0 for bitmap heap scans?

2024-10-24 Thread Melanie Plageman
Thanks for the reply, Dilip!

On Thu, Oct 24, 2024 at 12:50 AM Dilip Kumar  wrote:
>
> On Thu, Oct 24, 2024 at 3:45 AM Melanie Plageman  
> wrote:
>>
>> HeapScanDescData->rs_cindex (the current index into the array of
>> visible tuples stored in the heap scan descriptor) is used for
>> multiple scan types, but for bitmap heap scans, AFAICT, it should
>> never be < 0.
>
> You are right it can never be < 0 in this case at least.

Cool, thanks for confirming. I will commit this change then.

> In fact you don't need to explicitly set it to 0 in initscan[1], because 
> before calling heapam_scan_bitmap_next_tuple() we must call 
> heapam_scan_bitmap_next_block() and this function is initializing this to 0 
> (hscan->rs_cindex = 0;).  Anyway no objection even if you prefer to initialize 
> in initscan(), just the point is that we are already doing it for each block 
> before fetching tuples from the block.

I am inclined to still initialize it to 0 in initscan(). I was
refactoring the bitmap heap scan code to use the read stream API and
after moving some things around, this field was no longer getting
initialized in heapam_scan_bitmap_next_block(). While that may not be
a good reason to change it in this patch, most of the other fields in
the heap scan descriptor (like rs_cbuf and rs_cblock) are
re-initialized in initscan(), so it seems okay to do that there even
though it isn't strictly necessary on master.

- Melanie




Re: Pgoutput not capturing the generated columns

2024-10-24 Thread Amit Kapila
On Thu, Oct 24, 2024 at 12:15 PM vignesh C  wrote:
>
> The attached v41 version patch has the changes for the same.
>

Please find comments for the new version as follows:
1.
+  Generated columns may be skipped during logical replication
according to the
+  CREATE PUBLICATION option
+  
+  include_generated_columns.

The above statement doesn't sound clear. Can we change it to:
"Generated columns are allowed to be replicated during logical
replication according to the CREATE PUBLICATION
option .."?

2.
 static void publication_invalidation_cb(Datum arg, int cacheid,
  uint32 hashvalue);
-static void send_relation_and_attrs(Relation relation, TransactionId xid,
- LogicalDecodingContext *ctx,
- Bitmapset *columns);
 static void send_repl_origin(LogicalDecodingContext *ctx,
...
...
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data,
  Relation relation);
+static void send_relation_and_attrs(Relation relation, TransactionId xid,
+ LogicalDecodingContext *ctx,
+ RelationSyncEntry *relentry);

Why is the declaration of this function changed?

3.
+ /*
+ * Skip publishing generated columns if the option is not specified or
+ * if they are not included in the column list.
+ */
+ if (att->attgenerated && !relentry->pubgencols && !columns)

In the comment above, shouldn't "specified or" be "specified and"?

4.
+pgoutput_pubgencol_init(PGOutputData *data, List *publications,
+ RelationSyncEntry *entry)

{
...
+ foreach(lc, publications)
+ {
+ Publication *pub = lfirst(lc);
+
+ /* No need to check column list publications */
+ if (is_column_list_publication(pub, entry->publish_as_relid))

Are we ignoring column-list publications because for such publications
the value of the column list prevails and we ignore the
'publish_generated_columns' value? If so, that is not clear from the
comments.

5.
  /* Initialize the column list */
  pgoutput_column_list_init(data, rel_publications, entry);
+
+ /* Initialize publish generated columns value */
+ pgoutput_pubgencol_init(data, rel_publications, entry);
+
+ /*
+ * Check if there is conflict with the columns selected for the
+ * publication.
+ */
+ check_conflicting_columns(rel_publications, entry);
  }

It looks odd to check for conflicting column lists among publications
twice, once in pgoutput_column_list_init() and again in
check_conflicting_columns(). Can we merge those?

-- 
With Regards,
Amit Kapila.




Re: Add support to TLS 1.3 cipher suites and curves lists

2024-10-24 Thread Daniel Gustafsson
> On 16 Oct 2024, at 17:30, Jacob Champion  
> wrote:

> Other than that, LGTM!

Thanks for all the review work. I went ahead and pushed this patch series today
after a little bit more polishing of comments and docs.  So far plover has
failed, which was expected due to the raised OpenSSL/LibreSSL requirement; I
will reach out to BF animal owners for upgrades.

--
Daniel Gustafsson





Re: Using Expanded Objects other than Arrays from plpgsql

2024-10-24 Thread Tom Lane
I wrote:
> ... I'm still writing up
> details, but right now I'm envisioning completely separate sets of
> rules for the prosupport case versus the no-prosupport case.

So here is the design I've come up with for optimizing R/W expanded
object updates in plpgsql without any special knowledge from a
prosupport function.  AFAICS this requires no assumptions at all
about the behavior of called functions, other than the bare minimum
"you can't corrupt the object to the point where it wouldn't be
cleanly free-able".  In particular that means it can work for
user-written called functions in plpgsql, SQL, or whatever, not
only for C-coded functions.

There are two requirements to apply the optimization:

* If the assignment statement is within a BEGIN ... EXCEPTION block,
its target variable must be declared inside the most-closely-nested
such block.  This ensures that if an error is thrown from within the
assignment statement's expression, we do not care about the value
of the target variable, except to the extent of being able to clean
it up.

* The target variable must be referenced exactly once within the
RHS expression.  This avoids aliasing hazards such as we discussed
upthread.  But unlike the existing rule, that reference can be
anywhere in the RHS --- it doesn't have to be an argument of the
topmost function.  So for example we can optimize
foo := fee(fi(fo(fum(foo), other_variable), ...));

While I've not tried to write any code yet, I think both of these
conditions should be reasonably easy to verify.
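
For concreteness, here is how I'd illustrate those two conditions with
plain arrays (the expanded-object case plpgsql already handles); this is
just an untested sketch:

CREATE FUNCTION demo(n int) RETURNS numeric[]
LANGUAGE plpgsql AS $$
DECLARE
    a numeric[] := '{}';
BEGIN
    FOR i IN 1 .. n LOOP
        -- Optimizable under the proposed rules: "a" appears exactly once
        -- in the RHS, and this assignment is not inside any
        -- BEGIN ... EXCEPTION block.
        a := array_append(a, i::numeric);
    END LOOP;

    BEGIN
        -- Not optimizable without prosupport help: "a" is declared outside
        -- the most-closely-nested BEGIN ... EXCEPTION block around this
        -- assignment, so an error thrown while evaluating the RHS must not
        -- leave "a" clobbered.
        a := array_append(a, 0);
    EXCEPTION WHEN others THEN
        NULL;
    END;

    RETURN a;
END;
$$;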

Given that those conditions are met and the current value of the
assignment target variable is a R/W expanded pointer, we can
execute the assignment as follows:

* Provide the R/W expanded pointer as the value of the Param
representing the sole reference.  At the same time, apply
TransferExpandedObject to reparent the object under the transient
eval_mcontext memory context that's being used to evaluate the
RHS expression, and then set the target variable's value to NULL.
(At this point, the R/W object has exactly the same logical
status as any intermediate calculation result that's palloc'd
in the eval_mcontext.)

* At successful completion of the RHS, assign the result to the
target variable normally.  This includes, if it's an R/W expanded
object, reparenting it under the calling function's main context.

If the RHS expression winds up returning the same expanded object
(probably mutated), then the outer function regains ownership of it
after no more than a couple of TransferExpandedObject calls, which
are cheap.  If the RHS returns something different, then either the
original expanded object got discarded during RHS evaluation, or it
will be cleaned up when we reset the eval_mcontext, so that it won't
be leaked.

I didn't originally foresee the need to transfer the object
into the transient memory context.  But this design avoids any
possibility of double-free attempts, which would be likely to
happen if we allow the outer function's variable to hold onto
a reference to the object while the RHS is evaluated.  A function
receiving an R/W reference is entitled to assume that that value
is not otherwise referenced and can be freed when it's no longer
needed, so it might well get freed during RHS evaluation.  By
converting the original R/W object into (effectively) a temporary
value within the RHS evaluation, we make that assumption valid.

So, while this design greatly expands the set of cases we can
optimize, it does lose some cases that the old approach could
support.  I envision addressing that by allowing a prosupport
function attached to the RHS' topmost function to "bless"
other cases as safe, using reasoning similar to the old rules.
(Or different rules, even, but it's on the prosupport function
to be sure it's safe.)  I don't have a detailed design in mind,
but I'm thinking along the lines of just passing the whole RHS
expression to the prosupport function and letting it decide
what's safe.  In any case, we don't need to even call the
prosupport function unless there's an exception block or
multiple RHS references to the target variable.

regards, tom lane




Re: Inconsistent use of relpages = -1

2024-10-24 Thread Laurenz Albe
On Wed, 2024-10-23 at 10:05 -0700, Jeff Davis wrote:
> On Wed, 2024-10-23 at 04:47 +0200, Laurenz Albe wrote:
> > On Tue, 2024-10-22 at 10:41 -0700, Jeff Davis wrote:
> > > I attached a patch that creates partitioned tables with relpages=-
> > > 1,
> > > and updates the docs.
> > 
> > Does this need any changes for pg_upgrade?
> 
> pg_upgrade would go through the same DDL path, so I expect it would be
> set to -1 in the upgraded cluster anyway. Are you seeing something
> different?

No; I was too lazy to check, so I asked.

> In any case, the change is just meant to improve consistency and
> documentation. I didn't intend to create a hard guarantee that it would
> always be -1, so perhaps I should make the documentation more vague
> with "may be" instead of "is"?

Reading the thread and Tom's comments again, I get the impression that there
are two states that we consider OK:

- "relpages" and "reltuples" are both 0
- "relpages" is -1 and "reltuples" is greater than 0

What you write above indicates that "relpages" = 0 and "reltuples" > 0
would also be acceptable.

As long as the code does not mistakenly assume that a partitioned table
is empty, and as long as the documentation is not confusing, anything
is fine with me.  Currently, I am still a bit confused.
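
To see which state a given installation is actually in, something like
this works (assuming a partitioned table named "parent"):

SELECT c.relname, c.relkind, c.relpages, c.reltuples
FROM pg_class AS c
WHERE c.oid = 'parent'::regclass
   OR c.oid IN (SELECT inhrelid FROM pg_inherits
                WHERE inhparent = 'parent'::regclass);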

Yours,
Laurenz Albe




Re: general purpose array_sort

2024-10-24 Thread Aleksander Alekseev
Hi,

> It's hardly "general purpose" if it randomly refuses to
> sort certain types.  I would say it should be able to sort
> anything that ORDER BY will handle --- and that certainly
> includes the cases shown here.

I wonder how useful / convenient the new function will be considering
that we already have CTEs and can do:

SELECT array_agg(x ORDER BY x) FROM unnest(ARRAY[5,1,3,2,4]) AS x;

Perhaps there are use cases I didn't consider?

-- 
Best regards,
Aleksander Alekseev




Re: general purpose array_sort

2024-10-24 Thread Aleksander Alekseev
Hi David,

>> > It's hardly "general purpose" if it randomly refuses to
>> > sort certain types.  I would say it should be able to sort
>> > anything that ORDER BY will handle --- and that certainly
>> > includes the cases shown here.
>>
>> I wonder how useful / convenient the new function will be considering
>> that we already have CTEs and can do:
>>
>> SELECT array_agg(x ORDER BY x) FROM unnest(ARRAY[5,1,3,2,4]) AS x;
>>
>> Perhaps there are use cases I didn't consider?
>>
>
> Succinctness of expression.  Plus I'm under the impression that a function 
> doing this is going to be somewhat faster than composing two functions 
> together within a multi-node subtree.
>
> I feel like the same observation could have been made for array_shuffle but 
> we added that.  This function being added feels to me like just completing 
> the set.

Right. To clarify, I'm not opposed to array_sort(). In fact personally
I would prefer using it instead of array_agg() + unnest().

Just making an observation / thinking out loud that the requirement to
support everything ORDER BY handles / supports (and will handle /
support?) might make this function impractical to maintain.
array_shuffle() or a recently proposed array_reverse() [1] are rather
simple since they just move the array elements without looking at
them. array_sort() does look at the elements and thus is very
different.

Particularly, does the function really need to support dir => asc |
desc | asc nulls first | etc... ? Maybe array_sort() + array_reverse(
array_sort() ) will handle most of the practical cases? I don't know.
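
To spell out what I mean (purely hypothetical, since neither function
exists in core yet and I am assuming the obvious signatures):

SELECT array_reverse(array_sort(ARRAY[5,1,3,2,4]));   -- {5,4,3,2,1}

Reversing an ascending sort also moves NULLs to the front, which happens
to match the NULLS FIRST default of ORDER BY ... DESC.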

[1]: https://commitfest.postgresql.org/50/5314/
--
Best regards,
Aleksander Alekseev




Re: Changing the default random_page_cost value

2024-10-24 Thread David Rowley
On Fri, 25 Oct 2024 at 13:14, Greg Sabino Mullane  wrote:
>
> On Mon, Oct 14, 2024 at 10:20 PM David Rowley  wrote:
>>
>> Yeah, I think any effort to change the default value for this setting would 
>> require some analysis to prove that the newly proposed default
>> is a more suitable setting than the current default. I mean, why 1.2 and not 
>> 1.1 or 1.3? Where's the evidence that 1.2 is the best value
>> for this?
>
> As I said, I was just throwing that 1.2 number out there. It felt right, 
> although perhaps a tad high (which seems right as we keep things very 
> conservative). I agree we should make a best effort to have an accurate, 
> defendable default. We all know (I hope) that 4.0 is wrong for SSDs.

I don't think we're going to find the correct new value for this
setting by throwing randomly chosen numbers at each other on an email
thread. Unfortunately, someone is going to have to do some work to
figure out what the number should be, and then hopefully someone else
can verify that work to check that person is correct.

I'm not trying to be smart or funny here, but I am just failing to
comprehend why you think that offering a number, without any information
about how you selected it, would be acceptable as the new default
random_page_cost.  Are you expecting someone else to go and do the work
to prove that your selected number is the correct one? It's been 4 weeks
since your first email and nobody has done that yet, so maybe you need
to consider other ways to achieve your goal.

>> I don't think just providing evidence that random read times are closer to 
>> sequential read times on SSDs are closer than they are with
>> HDDs is going to be enough.
>
> ...
>>
>> It would be nice to have this as a script so that other people could easily 
>> run it on their hardware to ensure that random_page_cost we
>> choose as the new default is representative of the average hardware.
>
>
> Heh, this is starting to feel like belling the cat (see 
> https://fablesofaesop.com/belling-the-cat.html)

I don't see the similarity. Changing the default random_page_cost
requires analysis to find what the new default should be. The
execution of the actual change in default is dead simple.  With
belling the cat, it seems like the execution is the hard part and
nobody is debating the idea itself.

> Remember this is still just a default, and we should encourage people to 
> tweak it themselves based on their own workloads. I just want people to start 
> in the right neighborhood. I'll see about working on some more research / 
> generating a script, but help from others is more than welcome.

You might be mistakenly thinking that the best random_page_cost is an
exact ratio of how much slower a random seek and read is from a
sequential read. There are unfortunately many other factors to
consider. The correct setting is going to be the one where the chosen
plan uses the scan method that's the fastest and knowing the answer to
that is going to take some benchmarks on PostgreSQL. Our cost model
simply just isn't perfect enough for you to assume that I/O is the
only factor that changes between an Index Scan and a Seq Scan.

I'd say it's not overly difficult to come up with test cases that go
to prove the value you select is "correct". I've done this before for
CPU-related costs. I think with I/O the main difference will be that
your tests should be much larger, and doing that will mean getting the
results takes much more time. Here's a link to some analysis I did to
help solve a problem relating to partition-wise aggregates [1]. Maybe
you can use a similar method to determine random_page_cost.
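
Not a full methodology, but the shape of test I have in mind is something
like the sketch below; the table size and selectivity are arbitrary and
would need to be scaled so the table is much larger than RAM for the I/O
costs to matter:

-- build a table whose indexed column is uncorrelated with heap order,
-- so index scans cause genuinely random heap I/O
CREATE TABLE rpc_test AS
SELECT g AS id, random() AS r, repeat('x', 200) AS filler
FROM generate_series(1, 20000000) g;
CREATE INDEX ON rpc_test (r);
VACUUM ANALYZE rpc_test;

-- force each scan type in turn and compare actual execution times
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT sum(id) FROM rpc_test WHERE r < 0.02;
RESET enable_seqscan;
SET enable_indexscan = off;
SET enable_bitmapscan = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT sum(id) FROM rpc_test WHERE r < 0.02;
RESET ALL;

-- then find the random_page_cost at which the unforced plan switches
-- scan types, and check that the switch happens near the point where
-- the actual runtimes cross over
SET random_page_cost = 1.1;   -- repeat for 1.5, 2.0, 4.0, ...
EXPLAIN SELECT sum(id) FROM rpc_test WHERE r < 0.02;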

David

[1] 
https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com




Re: POC: make mxidoff 64 bits

2024-10-24 Thread wenhui qiu
Hi Maxim Orlov,

> After a bit of thought, I've realized that to be conservative here is the
> way to go.
> We can reuse a maximum of existing logic. I mean, we can remove offset
> wraparound "error logic" and reuse "warning logic". But set the threshold
> for "warning logic" to a much higher value. For now, I choose 2^32-1. In
> other world, legit logic, in my view, here would be to trigger autovacuum
> if the number of offsets (i.e. difference nextOffset - oldestOffset)
> exceeds 2^32-1. PFA patch set.

Good point; I couldn't agree with you more. A 64-bit xid is the solution to
the wraparound problem, so the previous error log is no longer meaningful.
But we might want to refine the output warning log a little (for example,
by checking the underlying reasons why the age has been increasing), even
though we no longer have to worry about xid wraparound.

+/*
+ * Multixact members warning threshold.
+ *
+ * If difference bettween nextOffset and oldestOffset exceed this value, we
+ * trigger autovacuum in order to release the disk space if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0x)
Can we refine this annotation a bit? For example:
+/*
+ * Multixact members warning threshold.
+ *
+ * If difference bettween nextOffset and oldestOffset exceed this value, we
+ * trigger autovacuum in order to release the disk space ,reduce table
bloat if possible.
+ */
+#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0x)

Thanks

Maxim Orlov wrote on Wed, Oct 23, 2024 at 23:55:

> After a bit of thought, I've realized that to be conservative here is the
> way to go.
>
> We can reuse a maximum of existing logic. I mean, we can remove offset
> wraparound "error logic" and reuse "warning logic". But set the threshold
> for "warning logic" to a much higher value. For now, I choose 2^32-1. In
> other world, legit logic, in my view, here would be to trigger autovacuum
> if the number of offsets (i.e. difference nextOffset - oldestOffset)
> exceeds 2^32-1. PFA patch set.
>
> --
> Best regards,
> Maxim Orlov.
>


Re: Changing the default random_page_cost value

2024-10-24 Thread wenhui qiu
Hi Greg Sabino Mullane,

Another thing: simply changing the configuration template is not
effective; you also need to modify the DEFAULT_RANDOM_PAGE_COST value:
{
{"random_page_cost", PGC_USERSET, QUERY_TUNING_COST,
gettext_noop("Sets the planner's estimate of the cost of a "
"nonsequentially fetched disk page."),
NULL,
GUC_EXPLAIN
},
&random_page_cost,
DEFAULT_RANDOM_PAGE_COST, 0, DBL_MAX,
NULL, NULL, NULL
},

src/include/optimizer/cost.h
/* defaults for costsize.c's Cost parameters */
/* NB: cost-estimation code should use the variables, not these constants!
*/
/* If you change these, update backend/utils/misc/postgresql.conf.sample */
#define DEFAULT_SEQ_PAGE_COST 1.0
#define DEFAULT_RANDOM_PAGE_COST 4.0
#define DEFAULT_CPU_TUPLE_COST 0.01
#define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
#define DEFAULT_CPU_OPERATOR_COST 0.0025
#define DEFAULT_PARALLEL_TUPLE_COST 0.1
#define DEFAULT_PARALLEL_SETUP_COST 1000.0

Thanks



Re: Changing the default random_page_cost value

2024-10-24 Thread Greg Sabino Mullane
On Mon, Oct 14, 2024 at 10:20 PM David Rowley  wrote:

> Yeah, I think any effort to change the default value for this setting
> would require some analysis to prove that the newly proposed default
> is a more suitable setting than the current default. I mean, why 1.2 and
> not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
> for this?
>

As I said, I was just throwing that 1.2 number out there. It felt right,
although perhaps a tad high (which seems right as we keep things very
conservative). I agree we should make a best effort to have an accurate,
defendable default. We all know (I hope) that 4.0 is wrong for SSDs.


> I don't think just providing evidence that random read times are closer to
> sequential read times on SSDs are closer than they are with
> HDDs is going to be enough.

...

> It would be nice to have this as a script so that other people could
> easily run it on their hardware to ensure that random_page_cost we
> choose as the new default is representative of the average hardware.


Heh, this is starting to feel like belling the cat (see
https://fablesofaesop.com/belling-the-cat.html)

Remember this is still just a default, and we should encourage people to
tweak it themselves based on their own workloads. I just want people to
start in the right neighborhood. I'll see about working on some more
research / generating a script, but help from others is more than welcome.
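
For anyone reading along, the setting can already be overridden at several
levels, so nobody has to wait for a new default (the database and
tablespace names below are just placeholders):

SET random_page_cost = 1.1;                              -- current session
ALTER DATABASE mydb SET random_page_cost = 1.1;          -- one database
ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);  -- one tablespace
ALTER SYSTEM SET random_page_cost = 1.1;                 -- whole cluster
SELECT pg_reload_conf();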

Cheers,
Greg


Re: Missing installation of Kerberos.pm and AdjustUpgrade.pm

2024-10-24 Thread Michael Paquier
On Thu, Oct 24, 2024 at 02:51:05PM +0900, Michael Paquier wrote:
> While catching up with some other thing, I've been reminded that this
> has not been addressed.  Any objections if I were to apply the
> attached to add install and uninstall rules for Kerberos.pm and
> AdjustUpgrade.pm?
> 
> If people would prefer a backpatch, please let me know.

Hearing nothing, applied this one on HEAD for now.
--
Michael




Re: pgbench: Improve result outputs related to failed transactinos

2024-10-24 Thread Alexander Lakhin

Hello Tatsuo-san,

11.10.2024 07:54, Tatsuo Ishii wrote:

...
  - number of transactions actually pocessed: 1 (tps = 410.846343)
...

Patch pushed.


Please consider fixing a typo that sneaked into that commit: pocessed -> processed?

Best regards,
Alexander




Re: Refactor to use common function 'get_publications_str'.

2024-10-24 Thread Peter Smith
On Fri, Oct 25, 2024 at 10:00 AM Michael Paquier  wrote:
>
> On Fri, Oct 25, 2024 at 09:28:47AM +1100, Peter Smith wrote:
> > I've attached the patch v4.
>
> Looks OK to me.  Thanks.  I'll see to get that done through the day.
> --

Thanks for pushing!

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: proposal: schema variables

2024-10-24 Thread Pavel Stehule
On Wed, Oct 23, 2024 at 6:11, Laurenz Albe wrote:

> I have gone over patch 3 from the set and worked on the comments.
>
> Apart from that, I have modified your patch as follows:
>
> > +/*
> > + * pg_session_variables - designed for testing
> > + *
> > + * This is a function designed for testing and debugging.  It returns the
> > + * content of sessionvars as-is, and can therefore display entries about
> > + * session variables that were dropped but for which this backend didn't
> > + * process the shared invalidations yet.
> > + */
> > +Datum
> > +pg_session_variables(PG_FUNCTION_ARGS)
> > +{
> > +#define NUM_PG_SESSION_VARIABLES_ATTS 8
> > +
> > + elog(DEBUG1, "pg_session_variables start");
>
> I don't think that message is necessary, particularly with DEBUG1.
> I have removed this message and the "end" message as well.
>

removed


>
> > + while ((svar = (SVariable) hash_seq_search(&status)) != NULL)
> > + {
> > + Datum       values[NUM_PG_SESSION_VARIABLES_ATTS];
> > + bool        nulls[NUM_PG_SESSION_VARIABLES_ATTS];
> > + HeapTuple   tp;
> > + boolvar_is_valid = false;
> > +
> > + memset(values, 0, sizeof(values));
> > + memset(nulls, 0, sizeof(nulls));
>
> Instead of explicitly zeroing out the arrays, I have used an empty
> initializer
> in the definition, like
>
>   bool  nulls[NUM_PG_SESSION_VARIABLES_ATTS] = {};
>
> That should have the same effect.
> If you don't like that, I have no real problem with your original code.
>

I prefer the original way - at least it is a common pattern. I didn't find
any usage of `= {}` in the code.


> > + values[0] = ObjectIdGetDatum(svar->varid);
> > + values[3] = ObjectIdGetDatum(svar->typid);
>
> You are using the type ID without checking if it exists in the catalog.
> I think that is a bug.
>

The original idea was to use typid as a hint for identifying deleted
variables. The possibility that this id will not be consistent with the
current catalog was expected, and that is the reason why the result type
is just Oid and not regtype. Without it, pg_session_variables shows just
empty rows (except the oid) for dropped but not yet purged variables.


>
> There is a dependency between the variable and the type, but if a
> concurrent
> session drops both the variable and the data type, the non-existing type ID
> would still show up in the function output.
> Even worse, the OID could have been reused for a different type since.
>

I agree that usage like `select typid::regtype, ... from pg_session_variables(.. `
can be dangerous, but that is the reason why we have the typname column there.

Showing typid has some information value, but I don't think it is
absolutely necessary. I see some possible changes:

1. no change
2. remove typid column
3. show typid only when variable is valid, and using regtype as output
type, remove typname

What do you prefer?

I'll send the attachment with the merged changes for patch 4.

Regards

Pavel


>
> I am attaching just patch number 3 and leave you to adapt the patch set,
> but I don't think any of the other patches should be affected.
>

The comments from your patch are merged



>
> Yours,
> Laurenz Albe
>


Re: Docs Build in CI failing with "failed to load external entity"

2024-10-24 Thread Thomas Munro
On Fri, Oct 25, 2024 at 4:44 AM Tom Lane  wrote:
> Melanie Plageman  writes:
> > I know in the past docs builds failing with "failed to load external
> > entity" have happened on macos. But, recently I've noticed this
> > failure for docs build on CI (not on macos) -- docs build is one of
> > the jobs run under the "Compiler Warnings" task.
>
> It looks to me like a broken docbook installation on (one of?)
> the CI machines.  Note that the *first* complaint is
>
> [19:23:20.590] file:///etc/xml/catalog:1: parser error : Document is empty

Yeah.  That CI job runs on a canned Debian image that is rebuilt and
republished every couple of days to make sure it's using up to date
packages and kernel etc.  Perhaps the package installation silently
corrupted /etc/xml/catalog, given that multiple packages probably mess
with it, though I don't have a specific theory for how that could
happen, given that package installation seems to be serial...  The
installation log doesn't seem to show anything suspicious.

https://github.com/anarazel/pg-vm-images/
https://cirrus-ci.com/github/anarazel/pg-vm-images
https://cirrus-ci.com/build/5427240429682688
https://api.cirrus-ci.com/v1/task/6621385303261184/logs/build_image.log

I tried simply reinstalling docbook-xml in my own github account
(which is showing the problem), and it cleared the error:

   setup_additional_packages_script: |
-#apt-get update
-#DEBIAN_FRONTEND=noninteractive apt-get -y install ...
+apt-get update
+DEBIAN_FRONTEND=noninteractive apt-get -y install --reinstall docbook-xml

https://cirrus-ci.com/task/6458406242877440

I wonder if this will magically fix itself when the next CI image
build cron job kicks off.  I have no idea what time zone this page is
showing but it should happen in another day or so, unless Andres is
around to kick it sooner:

https://cirrus-ci.com/github/anarazel/pg-vm-images




Re: Retire support for OpenSSL 1.1.1 due to raised API requirements

2024-10-24 Thread Daniel Gustafsson
This has now been committed via the TLS 1.3 ciphersuite patchset.

--
Daniel Gustafsson





Re: Docs Build in CI failing with "failed to load external entity"

2024-10-24 Thread Tom Lane
Melanie Plageman  writes:
> I know in the past docs builds failing with "failed to load external
> entity" have happened on macos. But, recently I've noticed this
> failure for docs build on CI (not on macos) -- docs build is one of
> the jobs run under the "Compiler Warnings" task.

It looks to me like a broken docbook installation on (one of?)
the CI machines.  Note that the *first* complaint is

[19:23:20.590] file:///etc/xml/catalog:1: parser error : Document is empty

I suspect that the subsequent "failed to load external entity"
complaints happen because the XML processor doesn't find any DTD
objects in the local catalog, so it tries to go out to the net for
them, and is foiled by the --no-net switch we use.

regards, tom lane




Re: Wrong results with grouping sets

2024-10-24 Thread Richard Guo
On Thu, Oct 10, 2024 at 6:51 PM Richard Guo  wrote:
> On Thu, Oct 10, 2024 at 4:06 PM Richard Guo  wrote:
> > While we can fix this issue by propagating the hasGroupRTE mark from
> > the EXISTS subquery to the parent, a better fix might be to remove the
> > subquery's RTE_GROUP entry, since we have dropped the subquery's
> > groupClause before the pull-up (see simplify_EXISTS_query).
>
> Here is the patch.

I've pushed this patch with minor tweaks.  Thanks again for the
report!

Thanks
Richard




Re: Can rs_cindex be < 0 for bitmap heap scans?

2024-10-24 Thread Dilip Kumar
On Thu, Oct 24, 2024 at 7:11 PM Melanie Plageman 
wrote:

> Thanks for the reply, Dilip!
>
> On Thu, Oct 24, 2024 at 12:50 AM Dilip Kumar 
> wrote:
> >
> > On Thu, Oct 24, 2024 at 3:45 AM Melanie Plageman <
> melanieplage...@gmail.com> wrote:
> >>
> >> HeapScanDescData->rs_cindex (the current index into the array of
> >> visible tuples stored in the heap scan descriptor) is used for
> >> multiple scan types, but for bitmap heap scans, AFAICT, it should
> >> never be < 0.
> >
> > You are right it can never be < 0 in this case at least.
>
> Cool, thanks for confirming. I will commit this change then.
>

+1


>
> > In fact you don't need to explicitly set it to 0 in initscan[1], because
> before calling heapam_scan_bitmap_next_tuple() we must call
> heapam_scan_bitmap_next_block() and this function is initializing this to 0
> (hscan->rs_cindex = 0;).  Anyway no objection even if you prefer to initialize
> in initscan(), just the point is that we are already doing it for each
> block before fetching tuples from the block.
>
> I am inclined to still initialize it to 0 in initscan(). I was
> refactoring the bitmap heap scan code to use the read stream API and
> after moving some things around, this field was no longer getting
> initialized in heapam_scan_bitmap_next_block(). While that may not be
> a good reason to change it in this patch, most of the other fields in
> the heap scan descriptor (like rs_cbuf and rs_cblock) are
> re-initialized in initscan(), so it seems okay to do that there even
> though it isn't strictly necessary on master.
>

 Makes sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: cache lookup failed when \d t concurrent with DML change column data type

2024-10-24 Thread Andrei Lepikhov

On 10/25/24 10:05, Andrei Lepikhov wrote:

On 10/24/24 22:30, jian he wrote:

hi. I think I found a bug.
PostgreSQL 18devel_debug_build_45188c2ea2 on x86_64-linux, compiled by
gcc-14.1.0, 64-bit
commit at 45188c2ea2.
Ubuntu 22.04.4 LTS


setup:
drop table t cascade;
create table t(a int PRIMARY key);

IN session1:
step "change data type" {begin; alter table t alter column a set data
type int4;}
step "s1" {commit;}

IN session2:
step "psql_another_session" {\d t}

permutation "change data type" "psql_another_session" "s1"



ERROR:  cache lookup failed for attribute 1 of relation 34418
Yes, it looks like a bug existing for a long time, at least since PG11 
(I didn't trace further down).
It seems that the backend didn't apply invalidation messages before 
touching system caches. Backtrace:

After a short investigation, I found the origin:
pg_get_indexdef takes an incoming index oid and gets all the stuff it 
needs just by looking up syscaches. But it also wants to build a list of 
relation column names at a specific moment, and for that it opens the heap 
relation. After that operation, the syscaches have already been updated 
and the old index oid has been replaced with the new one.
It might have made sense to lock the row of the replaced index in pg_class 
and pg_index until the transaction that altered it is committed. But, 
because ALTER TABLE is not fully MVCC-safe, this may be expected (or 
acceptable) behaviour.


--
regards, Andrei Lepikhov





Re: Retire support for OpenSSL 1.1.1 due to raised API requirements

2024-10-24 Thread Michael Paquier
On Thu, Oct 24, 2024 at 07:00:52PM +0200, Daniel Gustafsson wrote:
> This has now been committed via the TLS 1.3 ciphersuite patchset.

Nice, thanks for 6c66b7443ceb!
--
Michael




execute prepared statement passing parameter expression with COLLATE clause

2024-10-24 Thread jian he
hi.

$Subject setup

CREATE COLLATION case_insensitive (provider = icu, locale =
'@colStrength=secondary', deterministic = false);
CREATE COLLATION ignore_accents (provider = icu, locale =
'@colStrength=primary;colCaseLevel=yes', deterministic = false);
DROP TABLE IF EXISTS pktable cascade;
CREATE TABLE pktable (x text COLLATE case_insensitive);
INSERT INTO pktable VALUES ('A');
DEALLOCATE q6;
PREPARE q6 AS SELECT * FROM pktable WHERE x = $1;


select * from pktable where x = 'Å' collate ignore_accents;
--return one row

execute q6('Å' collate ignore_accents);
--return zero rows

I am not sure that returning zero rows is the desired behavior.
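
For comparison, here is a sketch (untested; just my assumption about how
collation derivation should behave) that puts the COLLATE clause inside
the prepared statement itself, so the explicit collation is part of the
parsed query rather than of the parameter expression supplied at EXECUTE
time:

PREPARE q7 AS SELECT * FROM pktable WHERE x = $1 COLLATE ignore_accents;
EXECUTE q7('Å');
-- expected to return one row, matching the non-prepared query above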




Re: proposal: schema variables

2024-10-24 Thread Laurenz Albe
... and here is a review for patch 4

I didn't change any code, just added the odd article to a comment.

While running the regression tests with "make installcheck", I noticed two 
problems:

  --- /home/laurenz/postgresql/src/test/regress/expected/session_variables.out  2024-10-24 11:14:06.717663613 +0300
  +++ /home/laurenz/postgresql/src/test/regress/results/session_variables.out   2024-10-24 11:15:37.999286228 +0300
  @@ -30,6 +30,7 @@
   GRANT ALL ON SCHEMA svartest TO regress_variable_owner;
   CREATE VARIABLE svartest.var1 AS int;
   CREATE ROLE regress_variable_reader;
  +ERROR:  role "regress_variable_reader" already exists

I suggest that patch 0001 should drop role "regress_variable_reader" again.

  @@ -107,7 +108,7 @@
   CONTEXT:  SQL function "sqlfx" statement 1
   SELECT plpgsqlfx(20);
   ERROR:  permission denied for session variable var1
  -CONTEXT:  SQL expression "$1 + var1"
  +CONTEXT:  PL/pgSQL expression "$1 + var1"

That looks like bit rot from your commit 4af123ad45.

Yours,
Laurenz Albe
From 027fb062ecc0840bc5c2d135ebbc8ddc6b046f96 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Thu, 24 Oct 2024 11:17:12 +0300
Subject: [PATCH v20241024] DISCARD VARIABLES

Implementation of DISCARD VARIABLES commands by removing hash table with session variables
and resetting related memory context.
---
 doc/src/sgml/ref/discard.sgml | 13 +++-
 src/backend/commands/discard.c|  6 ++
 src/backend/commands/session_variable.c   | 28 -
 src/backend/parser/gram.y |  6 ++
 src/backend/tcop/utility.c|  3 +
 src/bin/psql/tab-complete.in.c|  2 +-
 src/include/commands/session_variable.h   |  2 +
 src/include/nodes/parsenodes.h|  1 +
 src/include/tcop/cmdtaglist.h |  1 +
 .../regress/expected/session_variables.out| 63 +++
 src/test/regress/sql/session_variables.sql| 36 +++
 11 files changed, 158 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/ref/discard.sgml b/doc/src/sgml/ref/discard.sgml
index bf44c523cac..61b967f9c9b 100644
--- a/doc/src/sgml/ref/discard.sgml
+++ b/doc/src/sgml/ref/discard.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-DISCARD { ALL | PLANS | SEQUENCES | TEMPORARY | TEMP }
+DISCARD { ALL | PLANS | SEQUENCES | TEMPORARY | TEMP | VARIABLES }
 
  
 
@@ -66,6 +66,16 @@ DISCARD { ALL | PLANS | SEQUENCES | TEMPORARY | TEMP }
 

 
+   
+VARIABLES
+
+ 
+  Resets the value of all session variables. If a variable
+  is later reused, it is re-initialized to NULL.
+ 
+
+   
+

 TEMPORARY or TEMP
 
@@ -93,6 +103,7 @@ SELECT pg_advisory_unlock_all();
 DISCARD PLANS;
 DISCARD TEMP;
 DISCARD SEQUENCES;
+DISCARD VARIABLES;
 
 

diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 92d983ac748..d06ebaba6f4 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -18,6 +18,7 @@
 #include "commands/async.h"
 #include "commands/discard.h"
 #include "commands/prepare.h"
+#include "commands/session_variable.h"
 #include "commands/sequence.h"
 #include "utils/guc.h"
 #include "utils/portal.h"
@@ -48,6 +49,10 @@ DiscardCommand(DiscardStmt *stmt, bool isTopLevel)
 			ResetTempTableNamespace();
 			break;
 
+		case DISCARD_VARIABLES:
+			ResetSessionVariables();
+			break;
+
 		default:
 			elog(ERROR, "unrecognized DISCARD target: %d", stmt->target);
 	}
@@ -75,4 +80,5 @@ DiscardAll(bool isTopLevel)
 	ResetPlanCache();
 	ResetTempTableNamespace();
 	ResetSequenceCaches();
+	ResetSessionVariables();
 }
diff --git a/src/backend/commands/session_variable.c b/src/backend/commands/session_variable.c
index 19f772a9fb6..eedce691bc0 100644
--- a/src/backend/commands/session_variable.c
+++ b/src/backend/commands/session_variable.c
@@ -94,7 +94,13 @@ pg_variable_cache_callback(Datum arg, int cacheid, uint32 hashvalue)
 
 	elog(DEBUG1, "pg_variable_cache_callback %u %u", cacheid, hashvalue);
 
-	Assert(sessionvars);
+	/*
+	 * There is no guarantee of session variables being initialized, even when
+	 * receiving an invalidation callback, as DISCARD [ ALL | VARIABLES ]
+	 * destroys the hash table entirely.
+	 */
+	if (!sessionvars)
+		return;
 
 	/*
 	 * If the hashvalue is not specified, we have to recheck all currently
@@ -658,3 +664,23 @@ pg_session_variables(PG_FUNCTION_ARGS)
 
 	return (Datum) 0;
 }
+
+/*
+ * Fast drop of the complete content of the session variables hash table, and
+ * cleanup of any list that wouldn't be relevant anymore.
+ * This is used by the DISCARD VARIABLES (and DISCARD ALL) command.
+ */
+void
+ResetSessionVariables(void)
+{
+	/* destroy hash table and reset related memory context */
+	if (sessionvars)
+	{
+		hash_destroy(sessionvars);
+		sessionvars = NULL;
+	}
+
+	/* release memory allocated by session variables */
+	if (SVariableMemoryContext != NULL)
+		MemoryContextReset(SVariableMemoryContext

Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.

2024-10-24 Thread Tender Wang
Amit Langote wrote on Thu, Oct 24, 2024 at 14:33:

> Hi,
>
> On Thu, Oct 24, 2024 at 1:46 PM Tender Wang  wrote:
> > Tender Wang wrote on Wed, Oct 23, 2024 at 21:48:
> >>
> >> Hi all,
> >>
> >> I find another issue as $SUBJECT when I work on [1].
> >
> > When I continue to work on this, I find below issue. But I'm not sure
> whether it is a bug.
> >
> > postgres=# create table part_index(a text primary key) partition by list
> ( a collate "POSIX");
> > ERROR:  unique constraint on partitioned table must include all
> partitioning columns
> > DETAIL:  PRIMARY KEY constraint on table "part_index" lacks column "a"
> which is part of the partition key.
> > postgres=# create table part_index(a text) partition by list ( a collate
> "POSIX");
> > CREATE TABLE
> > postgres=# alter table part_index add primary key (a);
> > ERROR:  unique constraint on partitioned table must include all
> partitioning columns
> > DETAIL:  PRIMARY KEY constraint on table "part_index" lacks column "a"
> which is part of the partition key.
> >
> > It seems we can't create a primary key if the collation is different
> between columnDef and PartitionKey.
>
> Yeah, you don't want to have the PK index and the partitioning logic
> to not be in sync about the collation rules applied to the individual
> rows.
>
> > By the way, I think the error message is misleading to users.
> > ostgres=# alter table part_index add primary key (a);
> > ERROR:  unique constraint on partitioned table must include all
> partitioning columns
> > DETAIL:  PRIMARY KEY constraint on table "part_index" lacks column "a"
> which is part of the partition key.
>
> I think it's kind of similar to the message you get when a GROUP BY
> column's collation doesn't match the column appearing in the SELECT
> list:
>
> explain SELECT c collate case_insensitive, count(c) FROM
> pagg_tab_case_s GROUP BY c collate "C";
> ERROR:  column "pagg_tab_case_s.c" must appear in the GROUP BY clause
> or be used in an aggregate function
> LINE 1: explain SELECT c collate case_insensitive, count(c) FROM pag...
>
> Perhaps it would be more helpful for the error message or hint or
> detail to mention the actual discrepancy (collation mismatch) that's
> causing the error.
>
> There might be other instances of such an error and I am not sure it
> would be worthwhile to find and fix them all.
>
>
Thanks for the explanation.  We had better focus on the wrong-result issue.

I feel that it's hard to use only one struct (for example, X) such that a
simple equal(X, expr) call checks both the expression match and the
collation match.

Maybe we should add a separate collation match check in
match_clause_to_partition_key(), like the partition pruning logic does.

Any thoughts?

-- 
Thanks,
Tender Wang


Re: [BUG FIX] Fix validation of COPY options FORCE_NOT_NULL/FORCE_NULL

2024-10-24 Thread Michael Paquier
On Thu, Oct 24, 2024 at 09:41:03AM +0300, Joel Jacobson wrote:
> Therefore closing this patch and marking it as Committed.

Thanks.  I have managed to miss the CF entry.
--
Michael




Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.

2024-10-24 Thread Amit Langote
Hi,

On Thu, Oct 24, 2024 at 1:46 PM Tender Wang  wrote:
> Tender Wang  于2024年10月23日周三 21:48写道:
>>
>> Hi all,
>>
>> I find another issue as $SUBJECT when I work on [1].
>
> When I continue to work on this, I find below issue. But I'm not sure whether 
> it is a bug.
>
> postgres=# create table part_index(a text primary key) partition by list ( a 
> collate "POSIX");
> ERROR:  unique constraint on partitioned table must include all partitioning 
> columns
> DETAIL:  PRIMARY KEY constraint on table "part_index" lacks column "a" which 
> is part of the partition key.
> postgres=# create table part_index(a text) partition by list ( a collate 
> "POSIX");
> CREATE TABLE
> postgres=# alter table part_index add primary key (a);
> ERROR:  unique constraint on partitioned table must include all partitioning 
> columns
> DETAIL:  PRIMARY KEY constraint on table "part_index" lacks column "a" which 
> is part of the partition key.
>
> It seems we can't create a primary key if the collation is different between 
> columnDef and PartitionKey.

Yeah, you don't want to have the PK index and the partitioning logic
to not be in sync about the collation rules applied to the individual
rows.

> By the way, I think the error message is misleading to users.
> ostgres=# alter table part_index add primary key (a);
> ERROR:  unique constraint on partitioned table must include all partitioning 
> columns
> DETAIL:  PRIMARY KEY constraint on table "part_index" lacks column "a" which 
> is part of the partition key.

I think it's kind of similar to the message you get when a GROUP BY
column's collation doesn't match the column appearing in the SELECT
list:

explain SELECT c collate case_insensitive, count(c) FROM
pagg_tab_case_s GROUP BY c collate "C";
ERROR:  column "pagg_tab_case_s.c" must appear in the GROUP BY clause
or be used in an aggregate function
LINE 1: explain SELECT c collate case_insensitive, count(c) FROM pag...

Perhaps it would be more helpful for the error message or hint or
detail to mention the actual discrepancy (collation mismatch) that's
causing the error.

There might be other instances of such an error and I am not sure it
would be worthwhile to find and fix them all.

-- 
Thanks, Amit Langote




Re: Pgoutput not capturing the generated columns

2024-10-24 Thread Amit Kapila
On Wed, Oct 23, 2024 at 11:51 AM Amit Kapila  wrote:
>
> Thanks. One more thing that I didn't like about the patch is that it
> used column_list to address the "publish_generated_columns = false"
> case such that we build column_list without generated columns for the
> same. The first problem is that it will add overhead to always probe
> column_list during proto.c calls (for example during
> logicalrep_write_attrs()), then it makes the column_list code complex
> especially the handling in pgoutput_column_list_init(), and finally
> this appears to be a misuse of column_list.
>
> So, I suggest remembering this information in RelationSyncEntry and
> then using it at the required places. We discussed above that
> contradictory values of "publish_generated_columns" across
> publications for the same relations are not accepted, so we can detect
> that during get_rel_sync_entry() and give an ERROR for the same.
>

The changes in tablesync look complicated, and I am not sure whether they
handle the conflicting publish_generated_columns option correctly. I
have a few thoughts on this.
* The handling of contradictory options in multiple publications needs
to be the same as for column lists. I think it is handled (a) during
subscription creation, (b) during copy in fetch_remote_table_info(),
and (c) during replication. See Peter's email
(https://www.postgresql.org/message-id/CAHut%2BPs985rc95cB2x5yMF56p6Lf192AmCJOpAtK_%2BC5YGUF2A%40mail.gmail.com)
to understand why this new option has to be handled in the same way as
the column list.

* While fetching column list via pg_get_publication_tables(), we
should detect contradictory publish_generated_columns options similar
to column lists, and then after we get publish_generated_columns as
return value, we can even use that while fetching attribute
information.

A few additional comments:
1.
- /* Regular table with no row filter */
- if (lrel.relkind == RELKIND_RELATION && qual == NIL)
+ /*
+ * Check if the remote table has any generated columns that should be
+ * copied.
+ */
+ for (int i = 0; i < relmapentry->remoterel.natts; i++)
+ {
+ if (lrel.attremotegen[i])
+ {
+ gencol_copy_needed = true;
+ break;
+ }
+ }

Can't we get this information from fetch_remote_table_info() instead
of traversing the entire column list again?

2.
@@ -1015,7 +1110,6 @@ fetch_remote_table_info(char *nspname, char *relname,
  ExecDropSingleTupleTableSlot(slot);

  lrel->natts = natt;
-
  walrcv_clear_result(res);

Spurious line removal.

3. Why do we have to specifically exclude generated columns of a
subscriber-side table in make_copy_attnamelist()? Can't we rely on
fetch_remote_table_info() and logicalrep_rel_open() that the final
remote attrlist will contain the generated column only if the
subscriber doesn't have a generated column otherwise it would have
given an error in logicalrep_rel_open()?

-- 
With Regards,
Amit Kapila.




Typo spotted in GetFileBackupMethod comment(?)

2024-10-24 Thread Haoran Zhang
Hey Robert,

I'm investigating how PostgreSQL 17 does incremental backup and found this
comment of GetFileBackupMethod a bit off. relative_block_numbers being an
array of at *most* RELSEG_SIZE entries makes more sense to me. So I made
this patch to address it.

Note that this is the first time I have made a patch for the community. So
kindly bear with me if I did it wrong somehow.

Best,
Haoran


v01-fix_GetFileBackupMethod_comment.patch
Description: Binary data


Re: Using read_stream in index vacuum

2024-10-24 Thread Andrey M. Borodin
I've also added GiST vacuum to the patchset.

> On 24 Oct 2024, at 01:04, Melanie Plageman  wrote:
> 
> On Wed, Oct 23, 2024 at 4:29 PM Andrey M. Borodin  
> wrote:
>> 
>>> On 23 Oct 2024, at 20:57, Andrey M. Borodin  wrote:
>>> 
>>> I'll think how to restructure flow there...
>> 
>> OK, I've understood how it should be structured. PFA v5. Sorry for the noise.
> 
> I think this would be a bit nicer:
> 
> while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
> {
>   block = btvacuumpage(&vstate, buf);
>   if (info->report_progress)
>  pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE, block);
> }

OK.

> Maybe change btvacuumpage() to return the block number to avoid the
> extra BufferGetBlockNumber() calls (those add up).
Makes sense... now we have a non-obvious function prototype, but I think it's
worth it.

> While looking at this, I started to wonder if it isn't incorrect that
> we are not calling pgstat_progress_update_param() for the blocks that
> we backtrack and read in btvacuumpage() too (on master as well).
> btvacuumpage() may actually have scanned more than one block, so...
It's correct; the user wants to know the progress, and backtracks do not count
towards progress.
Though it must be a relatively infrequent case.

> Unrelated to code review, but btree index vacuum has the same issues
> that kept us from committing streaming heap vacuum last release --
> interactions between the buffer access strategy ring buffer size and
> the larger reads -- one of which is an increase in the number of WAL
> syncs and writes required. Thomas mentions it here [1] and here [2] is
> the thread where he proposes adding vectored writeback to address some
> of these issues.

Do I need to do anything about it?

Thank you!


Best regards, Andrey Borodin.


v6-0001-Prototype-B-tree-vacuum-streamlineing.patch
Description: Binary data


v6-0002-Use-read_stream-in-GiST-vacuum.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2024-10-24 Thread Rahila Syed
Hi Torikoshia,

Thank you for reviewing the patch!

On Wed, Oct 23, 2024 at 9:28 AM torikoshia 
wrote:

> On 2024-10-22 03:24, Rahila Syed wrote:
> > Hi,
> >
> > PostgreSQL provides following capabilities for reporting memory
> > contexts statistics.
> > 1. pg_get_backend_memory_contexts(); [1]
> > 2. pg_log_backend_memory_contexts(pid); [2]
> >
> > [1]  provides a view of memory context statistics for a local backend,
> > while [2] prints the memory context statistics of any backend or
> > auxiliary
> > process to the PostgreSQL logs. Although [1] offers detailed
> > statistics,
> > it is limited to the local backend, restricting its use to PostgreSQL
> > client backends only.
> > On the other hand, [2] provides the statistics for all backends but
> > logs them in a file,
> > which may not be convenient for quick access.
> >
> > I propose enhancing memory context statistics reporting by combining
> > these
> > capabilities and offering a view of memory statistics for all
> > PostgreSQL backends
> > and auxiliary processes.
>
> Thanks for working on this!
>
> I originally tried to develop something like your proposal in [2], but
> there were some difficulties and settled down to implement
> pg_log_backend_memory_contexts().
>
Yes. I am revisiting this problem :)


> > Attached is a patch that implements this functionality. It introduces
> > a SQL function
> > that takes the PID of a backend as an argument, returning a set of
> > records,
> > each containing statistics for a single memory context. The underlying
> > C function
> > sends a signal to the backend and waits for it to publish its memory
> > context statistics
> >  before returning them to the user. The publishing backend copies
> > these statistics
> > during the next CHECK_FOR_INTERRUPTS call.
>
> I remember waiting for dumping memory contexts stats could cause trouble
> considering some erroneous cases.
>
> For example, just after the target process finished dumping stats,
> pg_get_remote_backend_memory_contexts() caller is terminated before
> reading the stats, calling pg_get_remote_backend_memory_contexts() has
> no response any more:
>
> [session1]$ psql
> (40699)=#
>
> $ kill -s SIGSTOP 40699
>
> [session2] psql
>(40866)=# select * FROM
> pg_get_remote_backend_memory_contexts('40699', false); -- waiting
>
> $ kill -s SIGSTOP 40866
>
> $ kill -s SIGCONT 40699
>
> [session3] psql
> (47656) $ select pg_terminate_backend(40866);
>
> $ kill -s SIGCONT 40866 -- session2 terminated
>
> [session3] (47656)=# select * FROM
> pg_get_remote_backend_memory_contexts('47656', false); -- no response
>
> It seems the reason is memCtxState->in_use is now and
> memCtxState->proc_id is 40699.
> We can continue to use pg_get_remote_backend_memory_contexts() after
> specifying 40699, but it'd be hard to understand for users.
>
Thanks for testing and reporting. While I am not able to reproduce this
problem, I think this may be happening because the requesting
backend/caller is terminated before it gets a chance to mark
memCtxState->in_use as false.

In this case memCtxState->in_use should be marked as
'false' possibly during the processing of ProcDiePending in
ProcessInterrupts().

> This approach facilitates on-demand publication of memory statistics
> > for a specific backend, rather than collecting them at regular
> > intervals.
> > Since past memory context statistics may no longer be relevant,
> > there is little value in retaining historical data. Any collected
> > statistics
> > can be discarded once read by the client backend.
> >
> > A fixed-size shared memory block, currently accommodating 30 records,
> >  is used to store the statistics. This number was chosen arbitrarily,
> >  as it covers all parent contexts at level 1 (i.e., direct children of
> > the top memory context)
> > based on my tests.
> > Further experiments are needed to determine the optimal number
> > for summarizing memory statistics.
> >
> > Any additional statistics that exceed the shared memory capacity
> > are written to a file per backend in the PG_TEMP_FILES_DIR. The client
> > backend
> >  first reads from the shared memory, and if necessary, retrieves the
> > remaining data from the file,
> > combining everything into a unified view. The files are cleaned up
> > automatically
> > if a backend crashes or during server restarts.
> >
> > The statistics are reported in a breadth-first search order of the
> > memory context tree,
> >  with parent contexts reported before their children. This provides a
> > cumulative summary
> > before diving into the details of each child context's consumption.
> >
> > The rationale behind the shared memory chunk is to ensure that the
> > majority of contexts which are the direct children of
> > TopMemoryContext,
> > fit into memory
> > This allows a client to request a summary of memory statistics,
> > which can be served from memory without the overhead of file access,
> > unless necessary.
> >
> > A publi

Re: Can rs_cindex be < 0 for bitmap heap scans?

2024-10-24 Thread Dilip Kumar
On Thu, Oct 24, 2024 at 3:45 AM Melanie Plageman 
wrote:

> Hi,
>
> HeapScanDescData->rs_cindex (the current index into the array of
> visible tuples stored in the heap scan descriptor) is used for
> multiple scan types, but for bitmap heap scans, AFAICT, it should
> never be < 0.
>
> As such, I find this test in heapam_scan_bitmap_next_tuple() pretty
> confusing.
>
>  if (hscan->rs_cindex < 0 || hscan->rs_cindex >= hscan->rs_ntuples)
>
> I know it seems innocuous, but I find it pretty distracting. Am I
> missing something? Could it somehow be < 0?
>
> If not, I propose removing that part of the if statement like in the
> attached patch.
>
>
You are right it can never be < 0 in this case at least.  In fact you don't
need to explicitly set it to 0 in initscan[1], because before
calling heapam_scan_bitmap_next_tuple() we must
call heapam_scan_bitmap_next_block() and this function is initializing this
to 0 (hscan->rs_cindex = 0;).  Anyway no objection even if you prefer to
initialize in initscan(), just the point is that we are already doing it
for each block before fetching tuples from the block.

[1]
@@ -378,6 +378,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool
keep_startblock)
  ItemPointerSetInvalid(&scan->rs_ctup.t_self);
  scan->rs_cbuf = InvalidBuffer;
  scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_cindex = 0;

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [BUG FIX] Fix validation of COPY options FORCE_NOT_NULL/FORCE_NULL

2024-10-24 Thread Joel Jacobson
On Sat, Oct 19, 2024, at 09:52, Joel Jacobson wrote:
> However, since my last email, I've found some other problems in this area,
> and think we should do a more ambitious improvement, by rearranging the
> incorrect options tests into three categories:
>
> 1. incorrect COPY {FROM|TO} options
> 2. incorrect COPY FROM options
> 3. incorrect COPY TO options
>
> Also, I've found two new problems:
>
> 1. Incorrect options tests are missing for the QUOTE and ESCPAE options.
>This was discovered by Jian He in a different thread.
>
> 2. One of the ON_ERROR incorrect options tests also depend on the order
>of checks in ProcessCopyOptions.
>
> To explain my current focus on the COPY tests, it's because we're currently
> working on the new raw format for COPY, and it would be nice to cleanup these
> tests as a preparatory step first.

I realize these suggested changes are out of scope for $subject, which was
about a bug fix.
Therefore I am closing this patch and marking it as Committed.

/Joel




Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.

2024-10-24 Thread Tender Wang
Tender Wang wrote on Wed, Oct 23, 2024 at 21:48:

> Hi all,
>
> I find another issue as $SUBJECT when I work on [1].
>

When I continue to work on this, I find below issue. But I'm not sure
whether it is a bug.

postgres=# create table part_index(a text primary key) partition by list (
a collate "POSIX");
ERROR:  unique constraint on partitioned table must include all
partitioning columns
DETAIL:  PRIMARY KEY constraint on table "part_index" lacks column "a"
which is part of the partition key.
postgres=# create table part_index(a text) partition by list ( a collate
"POSIX");
CREATE TABLE
postgres=# alter table part_index add primary key (a);
ERROR:  unique constraint on partitioned table must include all
partitioning columns
DETAIL:  PRIMARY KEY constraint on table "part_index" lacks column "a"
which is part of the partition key.

It seems we can't create a primary key if the collation is different
between columnDef and PartitionKey.

By the way, I think the error message is misleading to users.
ostgres=# alter table part_index add primary key (a);
ERROR:  unique constraint on partitioned table must include all
partitioning columns
DETAIL:  PRIMARY KEY constraint on table "part_index" lacks column "a"
which is part of the partition key.



-- 
Thanks,
Tender Wang


Re: Using Expanded Objects other than Arrays from plpgsql

2024-10-24 Thread Tom Lane
Michel Pelletier  writes:
> On Wed, Oct 23, 2024 at 8:21 AM Tom Lane  wrote:
>> Another thing that confuses me is why there's a second flatten_matrix
>> operation happening here.  Shouldn't set_element return its result
>> as a R/W expanded object?

> That confuses me too, and my default assumption is always that I'm doing it
> wrong.  set_element does return a R/W object afaict, here is the return:
> https://github.com/OneSparse/OneSparse/blob/main/src/matrix.c#L1726

Hmph.  That seems right.  Can you add errbacktrace() to your logging
ereports, in hopes of seeing how we're getting to flatten_matrix?
Or break there with gdb for a more complete/reliable stack trace.

regards, tom lane